Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation
Or Tal, Felix Kreuk, Yossi Adi

TL;DR
This study systematically compares auto-regressive and flow-matching paradigms in text-to-music generation, analyzing their performance, robustness, scalability, and editing capabilities to guide future system design.
Contribution
It provides a controlled empirical comparison of two main modeling paradigms in text-to-music generation, highlighting their respective strengths and limitations.
Findings
Auto-regressive models excel in generation quality.
Flow-matching models show better robustness and scalability.
Trade-offs are identified among quality, robustness, and editing capabilities.
Abstract
Recent progress in text-to-music generation has enabled models to synthesize high-quality musical segments and full compositions, and even to respond to fine-grained control signals, e.g., chord progressions. State-of-the-art (SOTA) systems differ significantly in many dimensions, such as training datasets, modeling paradigms, and architectural choices. This diversity complicates efforts to evaluate models fairly and to identify which design choices influence performance the most. While factors like data and architecture are important, in this study we focus exclusively on the modeling paradigm. We conduct a systematic empirical analysis to isolate its effects, offering insights into associated trade-offs and emergent behaviors that can guide future text-to-music generation systems. Specifically, we compare arguably the two most common modeling paradigms: auto-regressive decoding and conditional…
