Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval

Junan Lin; Daizong Liu; Xianke Chen; Xiaoye Qu; Xun Yang; Jixiang Zhu; Sanyuan Zhang; Jianfeng Dong

arXiv:2508.04273 · cs.IR · October 28, 2025

TL;DR

This paper introduces a novel importance-aware multi-granularity fusion model for video moment retrieval that selectively integrates audio, visual, and textual information, effectively handling noisy audio and improving retrieval accuracy.

Contribution

The paper proposes a dynamic, importance-aware fusion approach with a pseudo-label-supervised audio importance predictor and multi-granularity fusion, advancing multimodal reasoning in VMR.

Findings

01. Achieves state-of-the-art results on VMR with audio-visual fusion.

02. Effectively mitigates noisy audio interference through importance weighting.

03. Demonstrates the benefit of multi-granularity fusion in capturing complementary contexts.

Abstract

Video Moment Retrieval (VMR) aims to retrieve the specific moment in a video that is semantically related to a given query. Most existing VMR methods focus solely on the visual and textual modalities and neglect the complementary audio modality. Although a few recent works attempt joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical because not all audio is helpful for video moment retrieval: the audio of some videos may be pure noise or background sound that is meaningless for determining the moment. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual…
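
The idea of weighting audio by its estimated usefulness can be illustrated with a minimal sketch. This is not the IMG architecture from the paper: the linear-plus-sigmoid scorer, the single per-video score, the additive fusion rule, and every name below (`audio_importance`, `fuse_with_importance`, `weights`, `bias`) are assumptions chosen only to show how a score near 0 suppresses noisy audio while a score near 1 lets it contribute.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def audio_importance(audio, query, weights, bias=0.0):
    """Hypothetical importance scorer (an assumption, not the paper's predictor).

    Mean-pools the per-clip audio features, multiplies them elementwise with
    the query vector, and applies a linear layer followed by a sigmoid.
    Returns one scalar in (0, 1) for the whole video.
    """
    dim = len(query)
    pooled = [sum(clip[i] for clip in audio) / len(audio) for i in range(dim)]
    score = sum(w * a * q for w, a, q in zip(weights, pooled, query)) + bias
    return sigmoid(score)


def fuse_with_importance(visual, audio, query, weights, bias=0.0):
    """Add audio features to visual features, scaled by the importance score.

    visual, audio: lists of per-clip feature vectors with the same shape.
    Returns (fused_features, importance).
    """
    p = audio_importance(audio, query, weights, bias)
    fused = [
        [v + p * a for v, a in zip(v_clip, a_clip)]
        for v_clip, a_clip in zip(visual, audio)
    ]
    return fused, p


# Toy example: two clips with two-dimensional features.
visual = [[1.0, 2.0], [3.0, 4.0]]
audio = [[0.5, 0.5], [1.0, 1.0]]
query = [1.0, 1.0]
weights = [0.0, 0.0]  # untrained scorer, so the score depends only on the bias

fused, p = fuse_with_importance(visual, audio, query, weights, bias=0.0)
print(p)      # 0.5
print(fused)  # [[1.25, 2.25], [3.5, 4.5]]
```

With a strongly negative `bias` the score approaches 0 and the fused features reduce to the visual features alone, which is the behavior one would want for videos whose audio is pure noise. In the paper the predictor is trained with pseudo-label supervision; this sketch leaves the scorer untrained.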
