Global2Local: A Joint-Hierarchical Attention for Video Captioning

Chengpeng Dai; Fuhai Chen; Xiaoshuai Sun; Rongrong Ji; Qixiang Ye; Yongjian Wu

arXiv:2203.06663 · cs.CV · April 15, 2025

Open Access

TL;DR

This paper introduces a joint-hierarchical attention model for video captioning that captures key semantic items at multiple levels, improving the accuracy of descriptions by hierarchically selecting key frames and regions.

Contribution

It proposes a novel hierarchical attention mechanism that jointly models key clips, frames, and regions for more effective video captioning.

Findings

01 Outperforms state-of-the-art methods on MSVD and MSR-VTT datasets.

02 Achieves more accurate global-to-local feature representation.

03 Demonstrates significant improvements in captioning quality.

Abstract

Recently, automatic video captioning has attracted increasing attention, where the core challenge lies in capturing the key semantic items, such as objects and actions, as well as their spatial-temporal correlations, from redundant frames and semantic content. To this end, existing works select either the key video clips at a global level (across multiple frames) or the key regions within each frame, which, however, neglects the hierarchical order, i.e., key frames first and key regions later. In this paper, we propose a novel joint-hierarchical attention model for video captioning, which embeds the key clips, the key frames and the key regions jointly into the captioning model in a hierarchical manner. Such a joint-hierarchical attention model first conducts a global selection to identify key frames, followed by a Gumbel sampling operation to identify further key regions based on the key…
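The global-to-local order described in the abstract (key frames first, then key regions inside them) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the top-k frame selection rule, the temperature `tau`, and the toy scores are all assumptions made for illustration, and the sampling step uses the standard Gumbel-softmax relaxation, which may differ from the operation used in the paper.

```python
import math
import random


def softmax(logits, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    scaled = [x / tau for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for x in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(x - math.log(-math.log(u)))
    return softmax(noisy, tau)


def global_to_local(frame_scores, region_logits, num_key_frames, tau=0.5, seed=0):
    """Hypothetical helper: pick key frames globally, then sample one region in each."""
    rng = random.Random(seed)
    frame_weights = softmax(frame_scores)
    # Global step (assumed rule): keep the frames with the highest attention weight.
    ranked = sorted(range(len(frame_weights)),
                    key=lambda i: frame_weights[i], reverse=True)
    key_frames = sorted(ranked[:num_key_frames])
    # Local step: Gumbel-softmax over the regions of each key frame only.
    selection = {}
    for f in key_frames:
        probs = gumbel_softmax(region_logits[f], tau, rng)
        selection[f] = max(range(len(probs)), key=lambda r: probs[r])
    return frame_weights, selection


# Toy input: 4 frames with 3 candidate regions each (made-up scores).
frame_scores = [0.1, 2.0, 0.3, 1.5]
region_logits = [[0.0, 1.0, 0.5] for _ in frame_scores]
weights, picks = global_to_local(frame_scores, region_logits, num_key_frames=2)
print(picks)
```

In this sketch, regions are only ever sampled inside frames that survived the global step, which is the hierarchical ordering the abstract contrasts with selecting clips or regions independently. In a trainable model the relaxed probabilities would typically be kept, rather than reduced to an index, so that gradients can flow through the sampling step.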

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques