Visual-aware Attention Dual-stream Decoder for Video Captioning

Zhixin Sun, Xian Zhong, Shuqin Chen, Lin Li, and Luo Zhong

arXiv:2110.08578 · cs.CV · October 19, 2021

Open Access

TL;DR

This paper introduces a novel Visual-aware Attention Dual-stream Decoder for video captioning, combining temporal sequence features and semantic information to generate more coherent and accurate video descriptions, while addressing training-inference discrepancies.

Contribution

It proposes a new dual-stream decoder architecture with visual-aware attention and self-forcing mechanisms, improving semantic coherence and reducing exposure bias in video captioning.

Findings

01. Enhanced captioning accuracy on MSVD and MSR-VTT datasets.

02. Improved semantic coherence in generated sentences.

03. Reduced exposure bias during training.

Abstract

Video captioning is a challenging task: it must capture different visual parts of a video and describe them in sentences, which requires both visual and linguistic coherence. The attention mechanism in current video captioning methods learns to assign a weight to each frame, promoting the decoder dynamically. This may not explicitly model the correlation and temporal coherence of the visual features extracted from the sequence of frames. To generate semantically coherent sentences, we propose a new Visual-aware Attention (VA) model, which concatenates the dynamic changes of temporal sequence frames with the words at the previous moment as the input of the attention mechanism to extract sequence features. In addition, prevalent approaches widely use teacher-forcing (TF) learning during training, where the next token is generated conditioned on the previous ground-truth tokens. The semantic information in…
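The teacher-forcing setup described in the abstract, and the exposure bias it can cause, can be illustrated with a toy sketch. This is not the paper's model: `toy_step`, `TOY_TABLE`, and both decoding helpers are hypothetical names introduced here, and the "model" is a lookup table with one deliberate mistake. The sketch only contrasts conditioning on ground-truth tokens with conditioning on the model's own previous predictions.

```python
from typing import Callable, List

Token = str
# Hypothetical step function: maps the previous token to a predicted next token.
StepFn = Callable[[Token], Token]


def decode_teacher_forced(step: StepFn, ground_truth: List[Token],
                          bos: Token = "<bos>") -> List[Token]:
    """Predict each token conditioned on the previous ground-truth token."""
    prev_tokens = [bos] + ground_truth[:-1]
    return [step(prev) for prev in prev_tokens]


def decode_free_running(step: StepFn, length: int,
                        bos: Token = "<bos>") -> List[Token]:
    """Predict each token conditioned on the model's own previous prediction."""
    outputs: List[Token] = []
    prev = bos
    for _ in range(length):
        prev = step(prev)
        outputs.append(prev)
    return outputs


# Toy "model" (an assumption for illustration): a lookup table that
# wrongly predicts "singing" after "is" instead of "cooking".
TOY_TABLE = {
    "<bos>": "a",
    "a": "man",
    "man": "is",
    "is": "singing",
    "singing": "loudly",
    "cooking": "<eos>",
}


def toy_step(prev: Token) -> Token:
    return TOY_TABLE.get(prev, "<unk>")


ground_truth = ["a", "man", "is", "cooking", "<eos>"]

teacher_forced = decode_teacher_forced(toy_step, ground_truth)
free_running = decode_free_running(toy_step, len(ground_truth))

print(teacher_forced)  # ['a', 'man', 'is', 'singing', '<eos>']
print(free_running)    # ['a', 'man', 'is', 'singing', 'loudly']
```

In this toy case the mistake at the fourth step stays isolated under teacher forcing, because the fifth step is conditioned on the ground-truth token "cooking". In free-running decoding the same mistake is fed back and the final token is also wrong. That mismatch between training-time and inference-time conditioning is the kind of discrepancy the summary above refers to as exposure bias.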

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques