Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

Longhao Li; Hongjie Chen; Zehan Li; Qihan Hu; Jian Kang; Jie Li; Lei Xie; Yongxiang Li

arXiv:2604.12527 · eess.AS · April 21, 2026

TL;DR

Audio-Cogito is a fully open-source framework for deep audio reasoning. It pairs a curated dataset of 545k reasoning samples with self-distillation fine-tuning and achieves the best performance among open-source models on the MMAR benchmark.

Contribution

The paper presents an open-source audio reasoning model, the Cogito-pipe data curation pipeline that produces 545k reasoning samples, and a self-distillation strategy for fine-tuning the model on that dataset.

Findings

01 Achieved the best performance among open-source models on the MMAR benchmark.

02 Produced 545k audio reasoning samples with the Cogito-pipe curation pipeline.

03 Ranked among the top systems in the Interspeech 2026 Audio Reasoning Challenge.

Abstract

Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio Language Models (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are often inconsistent and insufficient for complex tasks. To bridge this gap, we introduce Audio-Cogito, a fully open-source solution for deep audio reasoning. We develop Cogito-pipe for high-quality audio reasoning data curation, producing 545k reasoning samples that will be released after review. Based on this dataset, we adopt a self-distillation strategy for model fine-tuning. Experiments on the MMAR benchmark, the only audio benchmark evaluating the CoT process, show that our model achieves the best performance among open-source models and matches or surpasses certain closed-source models in specific metrics.…
