TL;DR
DiffRhythm 2 is an end-to-end framework for high-fidelity song generation that aligns lyrics with vocals, maintains long-term coherence, and accommodates diverse human preferences through cross-pair preference optimization.
Contribution
The paper introduces DiffRhythm 2, a semi-autoregressive model with block flow matching and a music VAE, enabling faithful lyric alignment, efficient long-sequence generation, and robust multi-preference optimization.
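One common way to read "semi-autoregressive" generation over blocks is that frames attend freely to each other inside a block but only to earlier blocks across the sequence. The sketch below builds such a block-causal attention mask with NumPy. The function name, block size, and the mask layout itself are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np


def block_causal_mask(num_frames: int, block_size: int) -> np.ndarray:
    """Hypothetical helper: mask[i, j] is True if frame i may attend to frame j.

    Frames in the same block see each other (non-autoregressive within a block);
    frames only see earlier blocks, never later ones (autoregressive across blocks).
    """
    if num_frames <= 0 or block_size <= 0:
        raise ValueError("num_frames and block_size must be positive")
    # Index of the block each frame belongs to, e.g. [0, 0, 1, 1, 2, 2].
    block_ids = np.arange(num_frames) // block_size
    # Frame i may attend to frame j when j's block is not later than i's block.
    return block_ids[:, None] >= block_ids[None, :]


# Assumed toy sizes: 6 latent frames split into blocks of 2.
mask = block_causal_mask(num_frames=6, block_size=2)
```

A mask like this is what would let previously generated blocks be cached during inference, since later blocks never change what earlier blocks attend to.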
Findings
Achieves high-quality, long-duration song generation with accurate lyric alignment.
Reduces performance degradation in multi-preference optimization via a cross-pair preference optimization method.
Maintains high audio fidelity with a low-frame-rate music VAE.
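To give a sense of why a low frame rate matters for long songs, the sketch below counts latent frames for a given duration. The 5 Hz figure is taken from a reviewer comment quoted further down this page; the function names and the block size of 50 frames are hypothetical values chosen only for illustration.

```python
import math


def latent_frames(duration_s: float, frame_rate_hz: float = 5.0) -> int:
    """Number of latent frames needed to cover duration_s seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


def num_blocks(frames: int, block_size: int) -> int:
    """Number of generation blocks needed to cover the given frame count."""
    return math.ceil(frames / block_size)


# A 4-minute song at 5 Hz needs 240 * 5 = 1200 latent frames.
frames = latent_frames(240.0)
# With an assumed (hypothetical) block size of 50 frames, that is 24 blocks.
blocks = num_blocks(frames, block_size=50)
```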
Abstract
Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocals. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables…
Peer Reviews
Decision·Submitted to ICLR 2026
1. The method is simple, elegant, and easy to follow.
2. The paper is well written and easy to follow.
3. Results and provided samples are impressive.
1. The experimental setup could be strengthened, particularly in relation to the paper’s primary contribution.
2. Several methodological details and results are missing: for example, information regarding the VAE configuration, hyperparameters, and related implementation choices.
3. The overall contribution of the proposed approach could be better articulated and justified. At the moment, some components appear to be loosely integrated ideas aimed primarily at improving generation performance, ra…
1. Clear technical framing & originality. The block flow matching formulation that is non-AR within a block yet AR across blocks is a neat, minimalistic way to get alignment while keeping efficiency. The timestep trick (S/L set to −1, clean=1, noisy∈[0,1]) to disambiguate streams is simple and effective.
2. Practical long-sequence engineering. Using a 5 Hz VAE plus block-level KV cache gives a realistic path to multi-minute songs with stable inference time; the paper also discusses the EOP design…
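The timestep trick mentioned in this review can be pictured as a per-position timestep vector. The sketch below is a minimal illustration, assuming that "S/L" denotes the style and lyric conditioning tokens, that already generated blocks count as "clean", and that the block being denoised shares one flow-matching time t. The function name and the ordering of the three streams are assumptions, not the paper's implementation.

```python
import numpy as np


def build_timesteps(num_cond_tokens: int,
                    num_clean_frames: int,
                    num_noisy_frames: int,
                    t: float) -> np.ndarray:
    """Hypothetical helper: one timestep value per sequence position.

    Conditioning tokens (assumed: style/lyrics) get -1, clean frames get 1,
    and frames of the block being denoised get the current time t in [0, 1].
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    cond = np.full(num_cond_tokens, -1.0)   # marks conditioning stream
    clean = np.ones(num_clean_frames)       # marks fully denoised context
    noisy = np.full(num_noisy_frames, t)    # marks the block in progress
    return np.concatenate([cond, clean, noisy])


# Assumed toy sizes: 2 conditioning tokens, 3 clean frames, 2 noisy frames.
timesteps = build_timesteps(2, 3, 2, t=0.25)
```

Because the three values never collide except at t = 1, where a noisy frame has effectively become clean, the model can tell the streams apart from the timestep embedding alone.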
1. My main concern is that this paper reads more like a technical report or system description rather than a research paper with focused insights. Although the engineering contributions like block flow matching, low-frame-rate VAE, stochastic REPA loss, and cross-pair preference optimization are each well-motivated and empirically validated, the work overall feels like a composition of effective engineering tricks rather than a cohesive theoretical or methodological advancement.
2. Missing bas…
1. The block flow matching addresses lyric–vocal alignment without timestamp labels, reducing data preprocessing requirements and improving usability.
2. The cross-pair preference optimization takes into account the interdependence among different optimization dimensions in song generation.
3. The work involves substantial engineering effort and system implementation.
1. The definition of “lyrics alignment” is insufficiently precise. It is unclear whether this term refers purely to lyric accuracy (i.e., no omissions or mispronunciations of words) or also encompasses prosodic naturalness, such as rhythm, phrasing, and pauses. In general, compared with AR models, non-autoregressive (NAR) models tend to excel in lyric accuracy but struggle with natural rhythmic expressiveness.
2. Block flow matching is not a new idea; it has been applied in video generation [1] a…
