Diversity-Rewarded CFG Distillation
Geoffrey Cideron, Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie, Olivier Bachem, Sarah Perrin, Alexandre Ramé

TL;DR
This paper introduces a finetuning method called diversity-rewarded CFG distillation that retains CFG's quality while enhancing diversity and reducing inference costs in generative models, demonstrated on MusicLM.
Contribution
The novel approach combines distillation and reinforcement learning to improve diversity and quality in generative models without increasing inference time.
Findings
Outperforms CFG in quality-diversity trade-off
Enables weight interpolation for quality-diversity control
Achieves higher human-evaluated diversity and quality
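The weight-interpolation finding can be illustrated with a minimal sketch. It assumes two finetuned checkpoints with identical parameter names and shapes, one leaning toward quality and one toward diversity, merged by plain linear interpolation. The function name `interpolate_weights`, the dict-of-lists weight format, and the coefficient `lam` are hypothetical and are not taken from the paper.

```python
def interpolate_weights(quality_weights, diversity_weights, lam):
    """Per-parameter linear interpolation: (1 - lam) * quality + lam * diversity.

    Assumption: both checkpoints are dicts mapping a parameter name to a flat
    list of floats, and they come from the same architecture.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if quality_weights.keys() != diversity_weights.keys():
        raise ValueError("checkpoints must share the same parameter names")

    merged = {}
    for name, q_values in quality_weights.items():
        d_values = diversity_weights[name]
        if len(q_values) != len(d_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * q + lam * d for q, d in zip(q_values, d_values)]
    return merged


# lam = 0 recovers the quality-leaning checkpoint, lam = 1 the diversity-leaning one.
quality = {"w": [0.0, 2.0]}
diversity = {"w": [1.0, 4.0]}
halfway = interpolate_weights(quality, diversity, 0.5)
```

Because the merge happens once in weight space, the merged model is served like any single model; choosing `lam` at deployment time is what would give the quality-diversity control.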
Abstract
Generative models are transforming creative domains such as music generation, with inference-time strategies like Classifier-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated contents. In this paper, we introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By finetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model…
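To make the distillation objective concrete, here is a minimal sketch of the standard CFG logit combination and a KL-based imitation loss for a single next-token prediction. The choice of KL(teacher || student), the guidance scale `gamma`, and all function names are assumptions for illustration; the abstract only states that the model alone should imitate the CFG-augmented predictions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def cfg_logits(cond_logits, uncond_logits, gamma):
    """Standard CFG combination: uncond + gamma * (cond - uncond).

    Needs two forward passes (conditional and unconditional), which is why
    CFG doubles inference cost.
    """
    return [u + gamma * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def distillation_kl(student_logits, cond_logits, uncond_logits, gamma):
    """KL(teacher || student), with the CFG-augmented distribution as teacher.

    Assumption: KL is one plausible imitation loss, not necessarily the
    paper's exact objective.
    """
    teacher_logp = log_softmax(cfg_logits(cond_logits, uncond_logits, gamma))
    student_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(teacher_logp, student_logp))


# A student that already matches the CFG teacher incurs (near) zero loss.
teacher = cfg_logits([1.0, 2.0, 0.5], [0.2, 0.1, 0.3], gamma=3.0)
loss_matched = distillation_kl(teacher, [1.0, 2.0, 0.5], [0.2, 0.1, 0.3], gamma=3.0)
loss_unmatched = distillation_kl([0.0, 0.0, 0.0], [1.0, 2.0, 0.5], [0.2, 0.1, 0.3], gamma=3.0)
```

Once the student is trained this way, a single conditional forward pass stands in for the two passes CFG requires at inference time.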
Peer Reviews
Decision·ICLR 2025 Poster
Originality: The paper presents an innovative approach to overcoming CFG’s limitations, particularly by combining distillation with a diversity-promoting reinforcement learning reward. It proposes a metric of diversity based on pair-level dissimilarity. Quality: The methodology is robust, with well-designed experiments that rigorously evaluate quality-diversity trade-offs. The use of human evaluations adds significant credibility, and the proposed model-merging strategy for adjustable quality-d
About the diversity training: According to Equation 3, D(\theta) depends on the expected dissimilarity of y given x. How many samples y are drawn for a given x? The paper also does not investigate how the number of samples per x affects diversity training. The definition of diversity is based on pair-level dissimilarity. It may be better to use metrics such as Gini or kurtosis to measure the global diversity of the generated songs.
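The question about sample count can be made concrete with a minimal sketch of a Monte Carlo estimate of expected pairwise dissimilarity for one prompt. It assumes each generation is mapped to an embedding vector and that dissimilarity is one minus cosine similarity; the embedding model, the exact form of Equation 3, and the function names are assumptions here, not details confirmed by the text above.

```python
import math
from itertools import combinations


def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_diversity(embeddings):
    """Average dissimilarity over all unordered pairs of the n samples for one prompt."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two samples per prompt")
    return sum(cosine_dissimilarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical embeddings of three generations for the same prompt.
samples = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
estimate = pairwise_diversity(samples)
```

With n samples per prompt the estimate averages n(n-1)/2 pairs, so n = 2 is the minimum and larger n lowers the variance of the estimate at a higher generation cost.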
1. This work provides a new strategy for managing the quality-diversity balance in generative models and has potential applications in other areas of creative content generation. 2. To achieve both diversity and quality, instead of doing direct multi-objective training, which can degrade performance, the authors first distill the model in terms of audio quality with classifier-free guidance, which increases the quality while maintaining the model's text-to-music understanding. 3. The analy
1. For music, the terms “quality” and “diversity” are not clearly defined in the paper. “Quality” can refer not only to audio quality but also to other aspects of music, such as style consistency, key consistency, rhythmic patterns, etc. I assume “quality” in this paper means audio quality only. “Diversity” is specified in Appendix C.1; I think both “quality” and “diversity” should be defined clearly in the introduction or methodology section. 2. In terms of evaluation of music, us
- The proposed CFG distillation method seems strong and is reasonably novel in the context of autoregressive LM models - As a whole, the proposed method is exceedingly clear, and while simple, the paper articulately shows the strengths of each facet of the 3-pronged method. - The significant time dedicated to discussing results and ablations is particularly welcome, and the authors do a great job of exploring the design space of their method. - The insight from Fig. 4 is in particular quite fascinating, do you ha
Overall, while I think this is a reasonably strong work, there are a number of concerns I have addressed below that shift my overall recommendation to reject. In fixing some or all of these issues (in particular improving the discussion of CFG distillation novelty, using publicly available eval metrics such as FAD/CLAP/Recall/Coverage, including human eval error bars / statistical tests, and focusing more on the diversity model and its calibration in the main text) I would raise my score accordi
Taxonomy
Topics: Process Optimization and Integration
Methods: Balanced Selection
