Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization
Matteo Gallici, Haitz Sáez de Ocáriz Borde

TL;DR
Fine-tuning next-scale visual autoregressive (VAR) models with Group Relative Policy Optimization (GRPO) aligns their outputs with reward signals derived from aesthetic predictors and CLIP embeddings. This improves image quality and style control, and lets the models generate images in styles absent from their ImageNet pre-training data.
Contribution
The paper applies GRPO-based RL fine-tuning to next-scale visual autoregressive models, improving image quality and style control and extending generation beyond the pre-training distribution.
Findings
Enhanced image quality and style control through RL fine-tuning.
Models can generate images aligned with prompts referencing styles that were absent from the ImageNet pre-training data.
RL fine-tuning is efficient and benefits from fast inference speeds.
Abstract
Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both…
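The abstract does not spell out the GRPO update itself, so the following is only a minimal sketch of the group-relative advantage step as GRPO is commonly formulated: several samples are drawn for the same condition, each is scored by a reward function, and each reward is normalized by the mean and standard deviation of its group. The function name `group_relative_advantages`, the `eps` stabilizer, and the example reward values are illustrative assumptions, not taken from the paper; in the setting described above, the rewards would come from an aesthetic predictor or a CLIP similarity score.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one group of samples drawn for the same condition.

    Hypothetical helper: computes A_i = (r_i - mean(r)) / (std(r) + eps),
    the group-relative advantage commonly used in GRPO.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance over the group (an assumption; some implementations
    # use the sample variance instead).
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up reward values for four images sampled under the same class or prompt.
rewards = [0.21, 0.35, 0.28, 0.30]
advantages = group_relative_advantages(rewards)
```

Samples with above-average reward in their group receive positive advantages and the rest receive negative ones, so no separate learned value function is needed to provide a baseline.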
Taxonomy
Methods: Contrastive Language-Image Pre-training
