Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning

Yunbin Tu; Liang Li; Li Su; Qingming Huang

arXiv:2412.13543·cs.CV·December 19, 2024

TL;DR

This paper introduces QUAG, a query-centric audio-visual network that enhances multi-modal video understanding for retrieval, segmentation, and captioning by modeling hierarchical and associative relations across modalities.

Contribution

The work proposes a novel hierarchical audio-visual cognition framework that improves multi-task video content understanding by integrating global and local modality interactions with query-guided filtering.

Findings

01

Achieves state-of-the-art results on the HIREST benchmark.

02

Demonstrates strong generalization to query-based video summarization.

03

Effectively models hierarchical and associative audio-visual relations.

Abstract

Video has emerged as a favored multimedia format on the internet. To help users better obtain video content, a new topic, HIREST, has been presented, comprising video retrieval, moment retrieval, moment segmentation, and step-captioning. The pioneering work uses a pre-trained CLIP-based model for video retrieval and leverages it as a feature extractor for the other three challenging tasks, which are solved in a multi-task learning paradigm. Nevertheless, this work struggles to learn a comprehensive cognition of user-preferred content because it disregards the hierarchies and association relations across modalities. In this paper, guided by the shallow-to-deep principle, we propose a query-centric audio-visual cognition (QUAG) network to construct a reliable multi-modal representation for moment retrieval, segmentation and step-captioning. Specifically, we first design the modality-synergistic perception to obtain…

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization · Speech and Audio Processing