Global2Local: A Joint-Hierarchical Attention for Video Captioning
Chengpeng Dai, Fuhai Chen, Xiaoshuai Sun, Rongrong Ji, Qixiang Ye, Yongjian Wu

TL;DR
This paper introduces a joint-hierarchical attention model for video captioning that captures key semantic items at multiple levels, improving the accuracy of descriptions by hierarchically selecting key frames and regions.
Contribution
It proposes a novel hierarchical attention mechanism that jointly models key clips, frames, and regions for more effective video captioning.
Findings
Outperforms state-of-the-art methods on MSVD and MSR-VTT datasets.
Achieves more accurate global-to-local feature representation.
Demonstrates significant improvements in captioning quality.
Abstract
Recently, automatic video captioning has attracted increasing attention. The core challenge lies in capturing the key semantic items, such as objects and actions, as well as their spatial-temporal correlations, from redundant frames and semantic content. To this end, existing works select either the key video clips at a global level (across multiple frames) or the key regions within each frame, which, however, neglects the hierarchical order, i.e., key frames first and key regions later. In this paper, we propose a novel joint-hierarchical attention model for video captioning, which embeds the key clips, the key frames, and the key regions jointly into the captioning model in a hierarchical manner. Such a joint-hierarchical attention model first conducts a global selection to identify key frames, followed by a Gumbel sampling operation to identify further key regions based on the key…
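The global-then-local order described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract is truncated and gives no equations, so the dot-product scoring, the top-k frame selection, the Gumbel-softmax relaxation with temperature `tau`, and all function and parameter names (`global_to_local`, `query`, `k`) are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gumbel_softmax(logits, tau, rng):
    """Relaxed categorical sample: add Gumbel(0, 1) noise, then apply a
    temperature-scaled softmax over the last axis."""
    noise = rng.gumbel(size=logits.shape)
    return softmax((logits + noise) / tau, axis=-1)


def global_to_local(frame_feats, region_feats, query, k=2, tau=0.5, seed=0):
    """Hypothetical two-stage selection (assumed, not taken from the paper).

    frame_feats:  (T, D)    one feature vector per frame
    region_feats: (T, R, D) R region features per frame
    query:        (D,)      e.g. a decoder hidden state
    """
    rng = np.random.default_rng(seed)

    # Global step: score every frame against the query and keep the top k,
    # returned in temporal order.
    frame_weights = softmax(frame_feats @ query)            # (T,)
    key_frames = np.sort(np.argsort(frame_weights)[-k:])    # (k,)

    # Local step: within each key frame, draw soft region weights
    # with the Gumbel-softmax relaxation.
    key_regions = region_feats[key_frames]                  # (k, R, D)
    region_logits = key_regions @ query                     # (k, R)
    region_weights = gumbel_softmax(region_logits, tau, rng)

    # Pool regions per key frame, then mix the key frames using their
    # renormalized global weights.
    local_feats = np.einsum("kr,krd->kd", region_weights, key_regions)
    mix = frame_weights[key_frames] / frame_weights[key_frames].sum()
    context = mix @ local_feats                             # (D,)
    return key_frames, region_weights, context
```

Lower values of `tau` push each row of `region_weights` toward a one-hot choice of a single region, while higher values spread the weight more evenly. In a trained model the scores would come from learned projections rather than raw dot products.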
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
