Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning

Daiqing Wu; Xuan Zhang; Dongbao Yang; Jiashu Yao; Longfei Chen; Qingsong Liu; Sicheng Zhao; Can Ma; Yangyang Kang; Yu Zhou

arXiv:2602.11909 · cs.SD · March 3, 2026

Open Access · 3 Reviews

TL;DR

Echo introduces a novel audio-interleaved reasoning approach for large audio language models, enabling dynamic re-listening and improved comprehension of complex audio, surpassing existing methods on benchmark tasks.

Contribution

It proposes a two-stage training framework and structured data pipeline to enable LALMs to actively re-listen to audio during reasoning, advancing audio comprehension capabilities.

Findings

01 Echo outperforms existing models on audio comprehension benchmarks.

02 The approach demonstrates improved reasoning and generalization in complex audio tasks.

03 Audio-interleaved reasoning enhances sustained audio engagement and understanding.

Abstract

The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we…
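The re-listening step described in the abstract can be illustrated with a minimal sketch. It assumes the model marks a salient span in its reasoning as `<seg>start, end</seg>` with times in seconds; only the tag names appear on this page, so the payload layout and the helper names (`extract_segments`, `slice_audio`, `relisten`) are hypothetical, not the authors' implementation.

```python
import re

# Assumed tag layout: <seg>start_sec, end_sec</seg>.
# Only the <seg> ... </seg> tag names are confirmed by the source.
SEG_PATTERN = re.compile(
    r"<seg>\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*</seg>"
)


def extract_segments(reasoning_text):
    """Return (start, end) pairs in seconds, skipping empty or reversed spans."""
    segments = []
    for match in SEG_PATTERN.finditer(reasoning_text):
        start, end = float(match.group(1)), float(match.group(2))
        if end > start:
            segments.append((start, end))
    return segments


def slice_audio(samples, sample_rate, start_sec, end_sec):
    """Cut the requested span out of a sequence of samples, clamped to its bounds."""
    lo = max(0, int(round(start_sec * sample_rate)))
    hi = min(len(samples), int(round(end_sec * sample_rate)))
    return samples[lo:hi]


def relisten(reasoning_text, samples, sample_rate):
    """Collect the audio clips a reasoning trace asks to hear again."""
    return [
        slice_audio(samples, sample_rate, start, end)
        for start, end in extract_segments(reasoning_text)
    ]
```

In an actual system each clip would be re-encoded and fed back into the model's context before reasoning continues; the sketch stops at extracting the clips.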

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 5

Strengths

- Turning audio from a static pre-embedding into an active element during reasoning is simple, intuitive, and well thought out with the <seg> scheme and inference adaptation.
- The SFT-to-RL pipeline is coherent; rewards are explicitly specified (format, "consistency" after </seg>, accuracy, plus a segment-use bonus), making the approach reproducible in principle.
- The paper details how temporal metadata + Qwen2.5-Omni descriptors feed DeepSeek-R1 for QA-CoT synthesis and filtering into SFT and R

Weaknesses

- No human evaluation of the thinking chain.
- No qualitative analysis of the LLM-generated training data; this might induce bias, and the training data might also contain hallucinated content.
- EAQA-SFT CoTs are constrained by source temporal metadata (the paper later notes SFT annotations limited to the first 10s), yet the claimed generalization beyond 10s is argued only by aggregate coverage stats; no targeted stress tests show failure modes when informative cues cluster late. The

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Strong results on the latest benchmarks. Novel two-stage training/inference, bringing RL and audio revisiting into the mainstream of LALMs. Details of dataset creation and frameworks used.

Weaknesses

I would like a stronger commitment to releasing the code than the vague one of releasing it "in the future". Examples of which segments are revisited would have been good. Some audio examples would be good. The biological basis argument could be strengthened.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The writing of the paper is very clear.
- The segment revisiting approach is novel in the LALM domain, and it is reasonable to expect it to improve reasoning quality, so the intuition is good.
- The data generation and annotation pipeline is clear and easy to reproduce, which helps the community build upon this work.
- The main results indicate that applying ECHO yields consistently better results on MMAU and MMAR. In addition, ablation studies clearly show the benefits of audi

Weaknesses

I think the main weakness of this paper is that the current experiments do not establish the quality (such as accuracy) of the segment selection, which is a fundamental basis of this paper. In data generation, the paper uses AudioSet-SL and MusicBench. It is known that AudioSet-SL annotations are very noisy, and this could introduce noise into the training data during data generation. In RL training, the "format" and "consist" rewards mostly concern formatting; the "acc" reward is for the final prediction; the "seg" r
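The four reward terms the reviewers name ("format", "consist", "acc", "seg") can be pictured with a rough sketch. The definitions below are assumptions for illustration only: this page does not give the paper's formulas, the consistency term is left as a caller-supplied score because the reviews do not define it, and the equal weights are hypothetical.

```python
import re


def format_reward(response):
    """Assumed check: every <seg> is closed by </seg> and tags never nest."""
    depth = 0
    for tag in re.findall(r"</?seg>", response):
        depth += 1 if tag == "<seg>" else -1
        if depth not in (0, 1):
            return 0.0
    return 1.0 if depth == 0 else 0.0


def accuracy_reward(predicted, reference):
    """Assumed check: case-insensitive exact match on the final answer."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def segment_bonus(response):
    """Assumed bonus: at least one well-formed segment was requested."""
    used = "<seg>" in response and format_reward(response) == 1.0
    return 1.0 if used else 0.0


def total_reward(response, predicted, reference, consistency, weights=None):
    """Weighted sum of the four terms; the default weights are hypothetical."""
    w = weights or {"format": 1.0, "consist": 1.0, "acc": 1.0, "seg": 1.0}
    return (
        w["format"] * format_reward(response)
        + w["consist"] * consistency
        + w["acc"] * accuracy_reward(predicted, reference)
        + w["seg"] * segment_bonus(response)
    )
```

Under these assumed definitions, only the accuracy term depends on the final answer, and no term checks whether the chosen segment is the right one, which is the gap Reviewer 03 points to.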

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis