Music Flamingo: Scaling Music Understanding in Audio Language Models
Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze, Sang-gil Lee, Zhifeng Kong, Joao Felipe Santos, Ramani Duraiswami, Dinesh Manocha, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro

TL;DR
Music Flamingo is a large, multimodal audio-language model that significantly advances music understanding by leveraging a new dataset, specialized training, and reasoning techniques, achieving state-of-the-art results across multiple benchmarks.
Contribution
The paper introduces Music Flamingo, a novel large-scale audio-language model for music understanding, with a new dataset, training methods, and reasoning capabilities, surpassing previous models in performance.
Findings
Achieves state-of-the-art results on 10+ benchmarks.
Demonstrates layered, human-like music perception.
Sets new standards for music understanding models.
Abstract
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an…
Peer Reviews
Decision: ICLR 2026 Poster
- The datasets and model proposed in this work are built on full-length songs rather than clips. This is important because some information can only be determined reliably by observing the full track (e.g. musical structure, mood, cultural context).
- Speech-centric datasets, including multilingual ASR and multi-talker ASR, are usefully incorporated to improve the model's ability to bridge text (e.g. lyrics) and phonemes (speech, vocals).
- The annotations included in MF-Skills i
- There is no ablation study on how each post-training step improves the model, which is important for showing the significance of introducing reinforcement learning. It is also a pity that there is no further discussion of the stability issue of GRPO.
- Although it is a positive sign that the proposed work is aware of cultural bias in the dataset, according to Figure 4 (the same as Figure 27?), the composition of MF-Skills is still Western-centric. It can be understood that collecting dataset that
- I think MF-Skills and MF-Think are very good resources for music understanding, where annotations and datasets are rare. The two datasets could benefit follow-up work on developing better models in this field.
- SOTA performance on several music understanding benchmarks.
- Adding a post-training stage to improve the reasoning of the music understanding model is well motivated. Current music understanding models tend to produce surface-level outputs about a piece of music. It is great to see
- MF places heavy emphasis on music with vocals (Sec. 3.1), which motivates adding several ASR datasets. While this motivation is valid, non-vocal music seems less covered in the discussion.
- In the human evaluation by music experts (Tab. 4), MF often attempts to provide deep details, but these are often inaccurate. This raises concerns about real-world use cases, beyond the good benchmark numbers.
- Comprehensive system design: The adaptation of the Flamingo multimodal framework to the music domain is well-motivated and technically executed. Training across multiple modalities (audio, symbolic, text) is a significant engineering contribution.
- Unified modeling framework: Unlike prior task-specific MIR models, the same model handles captioning, retrieval, tagging, and QA, demonstrating promising few-shot and zero-shot generalization.
- Dataset and reproducibility: The paper provides a subs
1. Necessity over specialized MIR pipelines: While the unified approach is appealing, it remains unclear why such a large, general model is necessary for tasks that existing MIR pipelines already solve effectively (e.g., onset detection, chord recognition, instrument tagging). The paper could better justify what emergent capabilities arise from this unified model that cannot be achieved by composing established MIR tools.
2. Missing comparisons to domain-specialized music–language models: The
Code & Models
- 🤗 nvidia/music-flamingo-2601-hf · model · 9.5k dl · ♡ 89
- 🤗 nvidia/music-flamingo-think-2601-hf · model · 912 dl · ♡ 33
- 🤗 nvidia/music-flamingo-hf · model · 5.2k dl · ♡ 86
- 🤗 henry1477/music-flamingo-gguf · model · 456 dl · ♡ 3
- 🤗 henry1477/music-flamingo-2601-hf-fp8 · model · 173 dl · ♡ 1
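A minimal sketch of how the checkpoints listed above might be located and downloaded from the Hugging Face Hub. The repository IDs are read from the listing with the trailing type label dropped. The helper names `hub_url` and `fetch_checkpoint` are hypothetical, and the download step assumes the `huggingface_hub` package is installed. This only fetches files; it does not show how to load or run the model, which the listing does not describe.

```python
# Repository IDs as read from the listing above (an assumption: the
# trailing "model" in each entry is a type label, not part of the ID).
MODEL_REPOS = [
    "nvidia/music-flamingo-2601-hf",
    "nvidia/music-flamingo-think-2601-hf",
    "nvidia/music-flamingo-hf",
    "henry1477/music-flamingo-gguf",
    "henry1477/music-flamingo-2601-hf-fp8",
]


def hub_url(repo_id: str) -> str:
    """Return the Hugging Face Hub page for a model repository."""
    return f"https://huggingface.co/{repo_id}"


def fetch_checkpoint(repo_id: str, local_dir: str) -> str:
    """Download every file in the repository and return the local path.

    Assumes `pip install huggingface_hub`. The import is lazy so that the
    helpers above still work when the package is missing.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    # Print the Hub page for each listed checkpoint (no network access).
    for repo in MODEL_REPOS:
        print(hub_url(repo))
```

Calling `fetch_checkpoint("nvidia/music-flamingo-hf", "./music-flamingo")` would pull the full repository into a local folder; checkpoints of this kind can be large, so check available disk space first.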
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech Recognition and Synthesis
