Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music

Sreyan Ghosh; Arushi Goel; Kaousheik Jayakumar; Lasha Koroshinadze; Nishit Anand; Zhifeng Kong; Siddharth Gururani; Sang-gil Lee; Jaehyeon Kim; Aya Aljafari; Chao-Han Huck Yang; Sungwon Kim; Ramani Duraiswami; Dinesh Manocha; Mohammad Shoeybi; Bryan Catanzaro; Ming-Yu Liu; Wei Ping

arXiv:2604.10905·cs.SD·April 14, 2026

TL;DR

Audio Flamingo Next is a large audio-language model for understanding and reasoning over speech, environmental sounds, and music. It accepts audio inputs up to 30 minutes long and grounds its intermediate reasoning steps to timestamps, which improves interpretability.

Contribution

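Compared to Audio Flamingo 3, Audio Flamingo Next introduces a stronger foundational audio-language model, scalable strategies for constructing large-scale audio understanding and reasoning data, support for audio inputs up to 30 minutes, and Temporal Audio Chain-of-Thought, a reasoning paradigm that grounds intermediate reasoning steps to timestamps in long audio.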

Findings

01. Outperforms similar-sized open models on 20 benchmarks

02. Supports audio inputs up to 30 minutes long

03. Demonstrates strong transferability and robustness in real-world tasks

Abstract

We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of…
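
To make the idea of timestamp-grounded reasoning more concrete, here is a minimal Python sketch of how intermediate reasoning steps might be represented and checked against an audio clip. This excerpt does not specify the format AF-Next actually uses, so everything here is an assumption for illustration: the `ReasoningStep` schema, the `MM:SS` rendering, and the helper names `validate_steps` and `render_chain` are all hypothetical. Only the 30-minute input limit comes from the abstract.

```python
from dataclasses import dataclass

# 30-minute input limit, as stated in the abstract.
MAX_AUDIO_SECONDS = 30 * 60


@dataclass(frozen=True)
class ReasoningStep:
    """One intermediate reasoning step tied to a span of audio.

    Hypothetical schema for illustration; not the paper's format.
    """
    start_s: float  # span start, in seconds from the beginning of the audio
    end_s: float    # span end, in seconds
    thought: str    # the reasoning text grounded in that span


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, truncating any fractional part."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def validate_steps(steps, audio_duration_s: float) -> None:
    """Raise ValueError if the audio is too long or a step falls outside it."""
    if not 0 < audio_duration_s <= MAX_AUDIO_SECONDS:
        raise ValueError("audio duration must be in (0, 1800] seconds")
    for step in steps:
        if not 0 <= step.start_s < step.end_s <= audio_duration_s:
            raise ValueError(f"step span out of range: {step}")


def render_chain(steps) -> str:
    """Render steps in chronological order, one '[MM:SS-MM:SS] thought' per line."""
    ordered = sorted(steps, key=lambda s: s.start_s)
    return "\n".join(
        f"[{format_timestamp(s.start_s)}-{format_timestamp(s.end_s)}] {s.thought}"
        for s in ordered
    )


if __name__ == "__main__":
    chain = [
        ReasoningStep(65.0, 80.0, "A second speaker joins and asks a question."),
        ReasoningStep(0.0, 12.5, "Background music plays before any speech."),
    ]
    validate_steps(chain, audio_duration_s=120.0)
    print(render_chain(chain))
```

Keeping each step as an explicit (start, end, text) record means a reader or an automated check can verify every claim against the corresponding audio span, which is the kind of interpretability the abstract attributes to temporal grounding.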
