Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
Yunbin Tu, Liang Li, Li Su, Qingming Huang

TL;DR
This paper introduces QUAG, a query-centric audio-visual network that enhances multi-modal video understanding for retrieval, segmentation, and captioning by modeling hierarchical and associative relations across modalities.
Contribution
The work proposes a novel hierarchical audio-visual cognition framework that improves multi-task video content understanding by integrating global and local modality interactions with query-guided filtering.
Findings
Achieves state-of-the-art results on the HIREST benchmark.
Demonstrates strong generalization to query-based video summarization.
Effectively models hierarchical and associative audio-visual relations.
Abstract
Video has emerged as a favored multimedia format on the internet. To better understand video content, a new topic, HIREST, has been presented, comprising video retrieval, moment retrieval, moment segmentation, and step-captioning. The pioneering work uses a pre-trained CLIP-based model for video retrieval and leverages it as a feature extractor for the other three challenging tasks, which are solved in a multi-task learning paradigm. Nevertheless, this work struggles to learn a comprehensive cognition of user-preferred content because it disregards the hierarchies and association relations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation and step-captioning. Specifically, we first design the modality-synergistic perception to obtain…
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization · Speech and Audio Processing
