Visual-aware Attention Dual-stream Decoder for Video Captioning
Zhixin Sun, Xian Zhong, Shuqin Chen, Lin Li, and Luo Zhong

TL;DR
This paper introduces a novel Visual-aware Attention Dual-stream Decoder for video captioning, combining temporal sequence features and semantic information to generate more coherent and accurate video descriptions, while addressing training-inference discrepancies.
Contribution
It proposes a new dual-stream decoder architecture with visual-aware attention and self-forcing mechanisms, improving semantic coherence and reducing exposure bias in video captioning.
Findings
Enhanced captioning accuracy on MSVD and MSR-VTT datasets.
Improved semantic coherence in generated sentences.
Reduced exposure bias during training.
Abstract
Video captioning is a challenging task: it must capture distinct visual content and describe it in sentences, which requires both visual and linguistic coherence. The attention mechanism in current video captioning methods learns to assign a weight to each frame, dynamically guiding the decoder. However, this may not explicitly model the correlation and temporal coherence of the visual features extracted from sequential frames. To generate semantically coherent sentences, we propose a new Visual-aware Attention (VA) model, which concatenates the dynamic changes of temporal sequence frames with the word from the previous time step as the input to the attention mechanism to extract sequence features. In addition, prevalent approaches widely use teacher-forcing (TF) learning during training, where the next token is generated conditioned on the previous ground-truth tokens. The semantic information in…
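The attention input described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that "dynamic changes" means simple frame-to-frame feature differences and that the score is a single linear projection of the concatenated vector. The names `frame_deltas`, `visual_aware_attention`, and the weight vector `w` are hypothetical, and the toy features are invented for illustration.

```python
import math

def frame_deltas(frames):
    """Frame-to-frame feature differences (assumed stand-in for 'dynamic changes').
    The first frame has no predecessor, so it gets a zero delta."""
    deltas = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def visual_aware_attention(frames, prev_word, w):
    """Score each frame from the concatenation [delta_t ; prev_word].
    Hypothetical scoring: a plain dot product with the weight vector w."""
    scores = []
    for delta in frame_deltas(frames):
        x = delta + prev_word  # list concatenation, not element-wise addition
        scores.append(sum(wi * xi for wi, xi in zip(w, x)))
    alphas = softmax(scores)
    dim = len(frames[0])
    # Context vector: attention-weighted sum of the original frame features.
    context = [sum(a * f[i] for a, f in zip(alphas, frames)) for i in range(dim)]
    return alphas, context

# Toy inputs: three frames with 2-d features and a 2-d previous-word embedding.
frames = [[1.0, 0.0], [1.0, 1.0], [3.0, 1.0]]
prev_word = [0.5, -0.5]
w = [1.0, 1.0, 0.2, 0.2]
alphas, context = visual_aware_attention(frames, prev_word, w)
```

With these toy weights, frames whose features change more from the previous frame receive larger attention weights. A learned model would replace the fixed `w` with trained parameters and would typically add a nonlinearity.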
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
