Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Arushi Goel; Sreyan Ghosh; Jaehyeon Kim; Sonal Kumar; Zhifeng Kong; Sang-gil Lee; Chao-Han Huck Yang; Ramani Duraiswami; Dinesh Manocha; Rafael Valle; Bryan Catanzaro

arXiv:2507.08128 · cs.SD · July 30, 2025

TL;DR

Audio Flamingo 3 (AF3) is a fully open large audio-language model that advances reasoning and understanding across speech, sound, and music, using a unified audio encoder and a five-stage curriculum-based training strategy.

Contribution

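The paper introduces AF-Whisper, a unified audio encoder for speech, sound, and music, together with a five-stage training curriculum and new training datasets (AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat), enabling state-of-the-art performance on diverse audio understanding benchmarks.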

Findings

1. Achieves new SOTA on over 20 audio benchmarks.

2. Trained solely on open-source data, surpassing larger closed models.

3. Supports long audio reasoning up to 10 minutes and voice-to-voice interaction.

Abstract

We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data,…

Videos

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models · SlidesLive