Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding

Zhifeng Kong; Arushi Goel; Joao Felipe Santos; Sreyan Ghosh; Rafael Valle; Wei Ping; Bryan Catanzaro

arXiv:2508.11818·cs.SD·August 19, 2025

TL;DR

This paper introduces a sound reasoning benchmark (AF-Reasoning-Eval) and an automatically constructed chain-of-thought training corpus (AF-CoT-Train) for audio language models, and reports considerable improvements on several reasoning benchmarks after finetuning.

Contribution

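The paper proposes AF-Reasoning-Eval, a benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices, and AF-CoT-Train, a 1.24M-sample training corpus of explicit reasoning chains built automatically from existing audio question answering and classification data.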

Findings

01

Finetuning the Audio Flamingo series on AF-CoT-Train yields considerable improvements on several reasoning benchmarks.

02

AF-Reasoning-Eval effectively measures sound reasoning abilities.

03

Chain-of-thought finetuning enhances sound understanding in models.

Abstract

Chain-of-thought reasoning has demonstrated significant improvements in large language models and vision language models, yet its potential for audio language models remains largely unexplored. In this technical report, we take a preliminary step towards closing this gap. For better assessment of sound reasoning, we propose AF-Reasoning-Eval, a benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices. To prepare a training corpus for sound reasoning abilities, we propose automatic pipelines that transform existing audio question answering and classification data into explicit reasoning chains, yielding AF-CoT-Train with 1.24M samples. We study the effect of finetuning the Audio Flamingo series on AF-CoT-Train and observe considerable improvements on several reasoning benchmarks, validating the effectiveness of chain-of-thought finetuning on…
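The abstract does not spell out how the pipelines work, so the following is only a rough sketch of the general idea of turning a classification label into a multiple-choice sample with an explicit reasoning chain. Every name here (CoTSample, classification_to_cot, the description input, the question wording, and the reasoning template) is hypothetical and not taken from the paper; in particular, the templated reasoning string stands in for whatever the authors actually use to produce reasoning chains.

```python
import random
from dataclasses import dataclass


@dataclass
class CoTSample:
    """Hypothetical record layout; not the AF-CoT-Train schema."""
    audio_path: str
    question: str
    choices: list
    reasoning: str
    answer: str


def classification_to_cot(audio_path, label, label_set, description,
                          num_choices=4, seed=0):
    """Build a multiple-choice question with a templated reasoning chain.

    Assumption: `description` is some textual account of the clip that is
    already available. Distractors are sampled at random here; a real
    pipeline would presumably prefer closely related labels.
    """
    rng = random.Random(seed)
    # Candidate distractors are all labels other than the true one.
    distractors = sorted(set(label_set) - {label})
    picked = rng.sample(distractors, k=min(num_choices - 1, len(distractors)))
    choices = picked + [label]
    rng.shuffle(choices)
    # Templated chain: state the evidence, compare options, conclude.
    reasoning = (
        f"The audio is described as: {description}. "
        f"Comparing the options {', '.join(choices)}, "
        f"the description is most consistent with '{label}'."
    )
    return CoTSample(
        audio_path=audio_path,
        question="Which sound is present in the audio?",
        choices=choices,
        reasoning=reasoning,
        answer=label,
    )


sample = classification_to_cot(
    "clip_0001.wav",  # hypothetical file name
    "dog bark",
    ["dog bark", "wolf howl", "cat meow", "door slam", "siren"],
    "short, sharp, repeated barks",
)
print(sample.choices)
print(sample.reasoning)
```

Fixing the seed keeps the sampled distractors reproducible, which makes it easier to audit a generated corpus.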

Code & Models

Models

nvidia/audio-flamingo-2-SoundCoT (Hugging Face model)

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies