Fine-Tuning Next-Scale Visual Autoregressive Models with Group Relative Policy Optimization

Matteo Gallici; Haitz Sáez de Ocáriz Borde

arXiv:2505.23331·cs.CV·July 1, 2025

TL;DR

This paper demonstrates that applying Group Relative Policy Optimization to fine-tune visual autoregressive models improves image quality, style control, and generalization beyond initial training data by aligning outputs with complex reward signals.

Contribution

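The paper applies Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models, using reward signals derived from aesthetic predictors and CLIP embeddings. The fine-tuned models show better image quality, more precise style control, and generalization beyond their initial ImageNet pre-training distribution.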

Findings

1. Enhanced image quality and style control through RL fine-tuning.

2. Models can generate images aligned with unseen styles beyond initial training.

3. RL fine-tuning is efficient and benefits from fast inference speeds.

Abstract

Fine-tuning pre-trained generative models with Reinforcement Learning (RL) has emerged as an effective approach for aligning outputs more closely with nuanced human preferences. In this paper, we investigate the application of Group Relative Policy Optimization (GRPO) to fine-tune next-scale visual autoregressive (VAR) models. Our empirical results demonstrate that this approach enables alignment to intricate reward signals derived from aesthetic predictors and CLIP embeddings, significantly enhancing image quality and enabling precise control over the generation style. Interestingly, by leveraging CLIP, our method can help VAR models generalize beyond their initial ImageNet distribution: through RL-driven exploration, these models can generate images aligned with prompts referencing image styles that were absent during pre-training. In summary, we show that RL-based fine-tuning is both…
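The abstract does not spell out the update rule, so the following is only a minimal sketch of the group-relative step that gives GRPO its name, not the authors' implementation. It assumes that several images are sampled for the same prompt, each is scored by a reward function, and each reward is normalized against the mean and standard deviation of its own group. The function names and the toy embedding vectors are hypothetical; a real setup would obtain embeddings from a CLIP model and would feed the advantages into a policy-gradient loss.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # eps keeps the division defined when every sample receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical stand-ins for CLIP embeddings (assumption: real ones come from a CLIP encoder).
prompt_embedding = [1.0, 0.0, 0.0]
image_embeddings = [
    [0.9, 0.1, 0.0],  # close to the prompt
    [0.5, 0.5, 0.0],  # partially aligned
    [0.0, 1.0, 0.0],  # unrelated to the prompt
]

# One reward per sampled image: similarity between the image and the prompt.
rewards = [cosine_similarity(img, prompt_embedding) for img in image_embeddings]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.3f} advantage={advantage:+.3f}")
```

Because the advantages are centered within the group, samples that score above the group average are reinforced and those below it are discouraged, without requiring a separately learned value function.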

Taxonomy

Methods: Contrastive Language-Image Pre-training (CLIP)