Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Xuanchen Li; Yuheng Lu; Chenrui Cui; Tianrui Wang; Zikang Huang; Yu Jiang; Long Zhou; Longbiao Wang; Jianwu Dang

arXiv:2605.09906·cs.AI·May 12, 2026

TL;DR

The paper introduces SFFL, a framework for audio-visual reasoning that reduces cross-modal interference by enforcing modality-specific reasoning and selective evidence fusion, improving accuracy and robustness.

Contribution

SFFL is a novel approach that enforces separate reasoning for audio and visual modalities and uses reinforcement learning to optimize modality preference.

Findings

01. Achieves 5.16% average accuracy gain on AVQA benchmarks.

02. Yields 11.17% improvement on a cross-modal hallucination benchmark.

03. Demonstrates enhanced robustness and reduced hallucinations.

Abstract

Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and integrating evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality…
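As a rough illustration of how modality-preference labels might be derived from different input settings and then used as an auxiliary reward, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the label set (`audio`, `visual`, `both`), the labeling rule based on audio-only and visual-only correctness, the function names `preference_label` and `total_reward`, and the weight `aux_weight`.

```python
from typing import Optional


def preference_label(correct_audio_only: bool,
                     correct_visual_only: bool) -> Optional[str]:
    """Hypothetical labeling rule: which single-modality input setting
    was enough to answer the question correctly?"""
    if correct_audio_only and not correct_visual_only:
        return "audio"
    if correct_visual_only and not correct_audio_only:
        return "visual"
    if correct_audio_only and correct_visual_only:
        return "both"
    return None  # neither setting sufficed; leave the instance unlabeled


def total_reward(answer_correct: bool,
                 predicted_pref: str,
                 label: Optional[str],
                 aux_weight: float = 0.5) -> float:
    """Answer-correctness reward plus an assumed auxiliary bonus when the
    model's stated modality preference matches the label."""
    reward = 1.0 if answer_correct else 0.0
    if label is not None and predicted_pref == label:
        reward += aux_weight  # auxiliary term; the weight is illustrative
    return reward


# Example: the audio-only run was correct, the visual-only run was not.
label = preference_label(correct_audio_only=True, correct_visual_only=False)
reward = total_reward(answer_correct=True, predicted_pref="audio", label=label)
```

In this sketch the auxiliary term only adds to the reward when a label exists, so unlabeled instances fall back to plain answer correctness. The reward shaping actually used in SFFL may differ.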
