Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

Jingkuan Song; Zhao Guo; Lianli Gao; Wu Liu; Dongxiang Zhang; Heng Tao Shen

arXiv:1706.01231 · cs.CV · June 6, 2017 · 36 cites

TL;DR

This paper introduces a hierarchical LSTM with adjusted temporal attention (hLSTMat) for video captioning. The model applies attention selectively, relying on visual features for visual words and on language context for non-visual words, to improve caption quality.

Contribution

The proposed framework combines hierarchical LSTMs with adjusted temporal attention so that the decoder can weigh visual cues against language cues at each step of caption generation.

Findings

01  Outperforms prior state-of-the-art methods on the MSVD and MSR-VTT datasets

02  Distinguishes visual words from non-visual words during captioning

03  Improves the accuracy of generated video descriptions

Abstract

Recent progress has been made in using attention-based encoder-decoder frameworks for video captioning. However, most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). These non-visual words can be easily predicted using a natural language model without considering visual signals or attention. Imposing the attention mechanism on non-visual words could mislead the decoder and decrease the overall performance of video captioning. To address this issue, we propose a hierarchical LSTM with adjusted temporal attention (hLSTMat) approach for video captioning. Specifically, the proposed framework utilizes temporal attention to select specific frames for predicting the related words, while the adjusted temporal attention decides whether to depend on the visual information or…
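The two mechanisms named in the abstract can be illustrated with a small NumPy sketch. This is not the paper's exact formulation: the function names, weight matrices (`W_f`, `W_h`, `w`, `W_c`, `w_g`), and dimensions are all hypothetical, chosen only to show the idea. The first function is ordinary soft attention over frame features; the second uses a sigmoid gate to blend the attended visual context with the language hidden state, so a gate value near 0 lets a non-visual word be predicted mostly from language context.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_attention(frames, h, W_f, W_h, w):
    """Soft attention over frames (illustrative, not the paper's equations).

    frames: (n, d_v) per-frame visual features
    h:      (d_h,)   current decoder hidden state
    Returns the attended context (d_v,) and the frame weights (n,).
    """
    scores = np.tanh(frames @ W_f + h @ W_h) @ w  # one score per frame
    alpha = softmax(scores)                       # weights sum to 1
    context = alpha @ frames                      # weighted sum of frames
    return context, alpha

def adjusted_context(context, h, W_c, w_g):
    """Gate between visual context and language state (assumed form).

    beta near 1 favors the visual context; beta near 0 favors the
    language hidden state h.
    """
    beta = sigmoid(h @ w_g)   # scalar gate in (0, 1)
    visual = context @ W_c    # project visual context to d_h
    return beta * visual + (1.0 - beta) * h, beta
```

In a full decoder, the blended vector would feed the word-prediction layer at each time step, and the gate and attention parameters would be learned jointly with the LSTMs.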

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory