Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning
Daiqing Wu, Xuan Zhang, Dongbao Yang, Jiashu Yao, Longfei Chen, Qingsong Liu, Sicheng Zhao, Can Ma, Yangyang Kang, Yu Zhou

TL;DR
Echo introduces a novel audio-interleaved reasoning approach for large audio language models, enabling dynamic re-listening and improved comprehension of complex audio, surpassing existing methods on benchmark tasks.
Contribution
It proposes a two-stage training framework and structured data pipeline to enable LALMs to actively re-listen to audio during reasoning, advancing audio comprehension capabilities.
Findings
Echo outperforms existing models on audio comprehension benchmarks.
The approach demonstrates improved reasoning and generalization in complex audio tasks.
Audio-interleaved reasoning enhances sustained audio engagement and understanding.
Abstract
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a critical information bottleneck. Drawing inspiration from human cognition, we propose audio-interleaved reasoning to break through this bottleneck. It treats audio as an active reasoning component, enabling sustained audio engagement and perception-grounded analysis. To instantiate it, we introduce a two-stage training framework, first teaching LALMs to localize salient audio segments through supervised fine-tuning, and then incentivizing proficient re-listening via reinforcement learning. In parallel, a structured data generation pipeline is developed to produce high-quality training data. Consequently, we…
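To make the re-listening idea concrete, here is a minimal sketch of how segment references in a reasoning trace might be parsed and mapped back to raw audio. The review excerpts mention a `<seg>` scheme, but the tag contents are not specified here; the `<seg>start, end</seg>` layout in seconds, along with the names `extract_segments` and `slice_audio`, are assumptions made purely for illustration and are not the authors' implementation.

```python
import re

# ASSUMPTION: segments are written as <seg>start, end</seg> in seconds.
# The actual tag format used by Echo is not given in this summary.
SEG_PATTERN = re.compile(
    r"<seg>\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*</seg>"
)


def extract_segments(reasoning_text):
    """Return (start, end) pairs in seconds for each well-formed tag."""
    segments = []
    for match in SEG_PATTERN.finditer(reasoning_text):
        start, end = float(match.group(1)), float(match.group(2))
        # Skip empty or reversed spans.
        if end > start:
            segments.append((start, end))
    return segments


def slice_audio(samples, sample_rate, start, end):
    """Cut the samples covering [start, end) seconds, clamped to the clip."""
    first = max(0, int(start * sample_rate))
    last = min(len(samples), int(end * sample_rate))
    return samples[first:last]
```

In an interleaved setup, each slice would presumably be re-encoded and fed back into the model's context before reasoning continues; that step depends on the model and is left out of the sketch.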
Peer Reviews
Decision: ICLR 2026 Poster
- Turning audio from a static pre-embedding into an active element during reasoning is simple, intuitive, and well thought out with the <seg> scheme and inference adaptation.
- The SFT to RL pipeline is coherent; rewards are explicitly specified (format, "consistency" after </seg>, accuracy, + a segment-use bonus), making the approach reproducible in principle.
- The paper details how temporal metadata + Qwen2.5-Omni descriptors feed DeepSeek-R1 for QA-CoT synthesis and filtering into SFT and R
- No human evaluation of the thinking chain.
- No qualitative analysis of the LLM-generated training data, which might introduce bias or contain hallucinated content.
- EAQA-SFT CoTs are constrained by source temporal metadata (the paper later notes SFT annotations limited to first 10s), yet claimed generalization beyond 10s is argued only by aggregate coverage stats; no targeted stress-tests show failure modes when informative cues cluster late. The
Strong results on the latest benchmarks. Novel two-stage training/inference, bringing RL and audio revisiting into the mainstream of LALMs. Details of dataset creation and frameworks used.
I would like a stronger commitment to releasing the code than the vague one of releasing it "in the future". Examples of which segments are revisited would have been good. Some audio examples would be good. The biological basis argument could be strengthened.
- The writing of the paper is very clear.
- The segment revisiting approach is novel in the LALM domain, and it makes a lot of sense to expect it to improve reasoning quality, so it is a good intuition.
- The data generation and annotation pipeline is clear and easy to reproduce, which helps the community build upon this work.
- The main results indicate that applying ECHO gives consistently better results on MMAU and MMAR. In addition, ablation studies clearly show the benefits of audi
I think the main weakness of this paper is that the current experiments do not establish the quality (such as accuracy) of the segment selection, which is a fundamental basis of this paper. In data generation, the paper uses AudioSet-SL and MusicBench. It is known that AudioSet-SL annotations are very noisy, and this could introduce noise into the training data during data generation. In RL training, the "format" and "consist" rewards are mostly for formats; the "acc" reward is for the final prediction; the "seg" r
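The four reward terms the reviewers list (format, consistency, accuracy, and a segment-use bonus) could be combined along the following lines. This is only a sketch: the review excerpts do not say how the terms are weighted or how each check is computed, so the weighted sum, the placeholder weights, and the names `RewardSignals` and `total_reward` are all assumptions rather than the paper's formulation.

```python
from dataclasses import dataclass


@dataclass
class RewardSignals:
    """Outcome of each check for one sampled response (all hypothetical)."""
    format_ok: bool        # response follows the expected tag layout
    consistent: bool       # text after </seg> passes the consistency check
    answer_correct: bool   # final answer matches the reference
    used_segment: bool     # at least one segment was re-listened to


# ASSUMPTION: placeholder weights; the actual values are not given here.
DEFAULT_WEIGHTS = {"format": 1.0, "consist": 1.0, "acc": 1.0, "seg": 0.5}


def total_reward(signals, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the four binary reward terms."""
    return (
        weights["format"] * float(signals.format_ok)
        + weights["consist"] * float(signals.consistent)
        + weights["acc"] * float(signals.answer_correct)
        + weights["seg"] * float(signals.used_segment)
    )
```

As the review points out, none of these terms scores whether the selected segment is the right one, so a sum like this rewards using a segment rather than choosing it well.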
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
