AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering

Jiayu Zhang; Shuo Ye; Qilang Ye; Xun Lin; Zihan Song; Zitong Yu

arXiv:2510.18346·cs.CV·April 28, 2026


TL;DR

AV-Master introduces a dual-path framework for audio-visual question answering, dynamically focusing on relevant segments and modalities to improve reasoning in complex scenes.

Contribution

The paper proposes a dynamic adaptive focus sampling mechanism for the temporal dimension and a modality preference strategy for the modality dimension, together with a dual-path contrastive loss intended to strengthen cross-modal reasoning.

Findings

01. Significantly outperforms existing methods on four benchmarks.

02. Improves focus on relevant audio-visual segments for complex questions.

03. Enhances cross-modal collaboration through contrastive learning.
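
To make the contrastive-learning finding concrete, here is a minimal sketch of a generic symmetric InfoNCE objective between paired audio and visual embeddings. This page does not specify the paper's dual-path contrastive loss, so this is a stand-in technique rather than the authors' formulation; the function names (`info_nce`, `symmetric_contrastive`) and the temperature value are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE: anchors[i] should match positives[i] and no other row.

    Hypothetical helper, not taken from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target.
        total += lse - logits[i]
    return total / len(anchors)


def symmetric_contrastive(audio, visual, temperature=0.1):
    """Average the audio-to-visual and visual-to-audio InfoNCE terms."""
    return 0.5 * (info_nce(audio, visual, temperature)
                  + info_nce(visual, audio, temperature))


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    visual_aligned = [[1.0, 0.0], [0.0, 1.0]]
    visual_shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(symmetric_contrastive(audio, visual_aligned))
    print(symmetric_contrastive(audio, visual_shuffled))
```

In this toy setup the aligned pairing yields a much smaller loss than the shuffled pairing, which is the behaviour a contrastive objective relies on to pull matching audio and visual features together.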

Abstract

Audio-Visual Question Answering (AVQA) requires models to effectively utilize both visual and auditory modalities to answer complex and diverse questions about audio-visual scenes. However, existing methods lack sufficient flexibility and dynamic adaptability in temporal sampling and modality preference awareness, making it difficult to focus on key information based on the question. This limits their reasoning capability in complex scenarios. To address these challenges, we propose a novel framework named AV-Master. It enhances the model's ability to extract key information from complex audio-visual scenes with substantial redundant content by dynamically modeling both temporal and modality dimensions. In the temporal dimension, we introduce a dynamic adaptive focus sampling mechanism that progressively focuses on audio-visual segments most relevant to the question, effectively…
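
As a rough illustration of question-guided temporal focusing, the sketch below first scores coarse windows of segments against a question embedding and then picks the best individual segments inside the winning windows. It is a sketch under stated assumptions, not the paper's mechanism: the abstract is truncated here and gives no algorithmic detail, so the coarse-to-fine scheme, the cosine scoring, and every name and parameter (`coarse_to_fine`, `window`, `top_windows`, `top_segments`) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def coarse_to_fine(segment_feats, question_feat,
                   window=2, top_windows=2, top_segments=2):
    """Return indices of the selected segments in temporal order.

    Hypothetical helper: stage 1 ranks fixed-size windows by the similarity
    of their mean feature to the question; stage 2 ranks the segments inside
    the kept windows individually.
    """
    n = len(segment_feats)
    windows = [list(range(s, min(s + window, n))) for s in range(0, n, window)]

    # Stage 1: coarse relevance per window.
    def window_score(idxs):
        return cosine(mean_vec([segment_feats[i] for i in idxs]), question_feat)

    kept_windows = sorted(windows, key=window_score, reverse=True)[:top_windows]

    # Stage 2: fine relevance per segment within the kept windows.
    candidates = [i for idxs in kept_windows for i in idxs]
    ranked = sorted(candidates,
                    key=lambda i: cosine(segment_feats[i], question_feat),
                    reverse=True)
    return sorted(ranked[:top_segments])


if __name__ == "__main__":
    question = [1.0, 0.0]
    segments = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0],
                [1.0, 0.1], [0.0, 1.0], [1.0, 1.0]]
    print(coarse_to_fine(segments, question))
```

A trained model would presumably compute relevance from learned features rather than fixed cosine scores; the fixed scoring here only keeps the example self-contained.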
