Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
Jingkuan Song, Zhao Guo, Lianli Gao, Wu Liu, Dongxiang Zhang, Heng Tao Shen

TL;DR
This paper introduces a hierarchical LSTM with adjusted temporal attention for video captioning, selectively applying attention to visual and non-visual words to improve caption quality.
Contribution
The proposed framework combines hierarchical LSTMs with adjusted temporal attention to better separate visual cues from language cues during caption generation.
Findings
Outperforms state-of-the-art methods on the MSVD and MSR-VTT datasets
Effectively distinguishes visual from non-visual words during captioning
Improves accuracy of generated video descriptions
Abstract
Recent progress has been made in using attention-based encoder-decoder frameworks for video captioning. However, most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the" and "a"). These non-visual words can be easily predicted using a natural language model without considering visual signals or attention. Imposing the attention mechanism on non-visual words could mislead the model and decrease the overall performance of video captioning. To address this issue, we propose a hierarchical LSTM with adjusted temporal attention (hLSTMat) approach for video captioning. Specifically, the proposed framework utilizes the temporal attention for selecting specific frames to predict the related words, while the adjusted temporal attention is for deciding whether to depend on the visual information or…
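The idea in the abstract, attending over frames and then deciding how much to rely on the attended visual signal, can be illustrated with a rough NumPy sketch. This is not the paper's exact formulation (the abstract above is truncated and gives no equations). The additive attention scoring, the scalar sigmoid gate `beta`, the projection `P`, and every parameter name and shape below are assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(rng, d_v, d_h, d_a):
    # Hypothetical parameter set; names and shapes are illustrative assumptions.
    return {
        "U": rng.normal(scale=0.1, size=(d_v, d_a)),   # frame features -> attention space
        "W": rng.normal(scale=0.1, size=(d_h, d_a)),   # decoder state -> attention space
        "w": rng.normal(scale=0.1, size=d_a),          # attention scoring vector
        "P": rng.normal(scale=0.1, size=(d_v, d_h)),   # visual context -> hidden size
        "w_gate": rng.normal(scale=0.1, size=d_h),     # gate weights (assumed scalar gate)
    }

def adjusted_temporal_attention(frames, h, params):
    """frames: (n_frames, d_v) frame features; h: (d_h,) decoder hidden state."""
    # Temporal attention: score each frame against the current decoder state.
    scores = np.tanh(frames @ params["U"] + h @ params["W"]) @ params["w"]
    alpha = softmax(scores)                  # (n_frames,), sums to 1
    context = alpha @ frames                 # weighted sum of frame features, (d_v,)
    visual = context @ params["P"]           # projected to the hidden size, (d_h,)

    # Adjusted gate (assumed form): beta near 1 leans on visual context,
    # beta near 0 leans on the language-side hidden state.
    beta = sigmoid(h @ params["w_gate"])
    mixed = beta * visual + (1.0 - beta) * h
    return mixed, alpha, beta
```

In this sketch, a well-trained gate would be expected to produce a small `beta` when the next word is a function word such as "the", so the prediction depends mostly on the language state, and a larger `beta` for visually grounded words such as "gun".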
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
