FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL

Kaihang Pan; Wendong Bu; Yuruo Wu; Yang Wu; Kai Shen; Yunfei Li; Hang Zhao; Juncheng Li; Siliang Tang; Yueting Zhuang

arXiv:2506.05501 · cs.CV · June 9, 2025

TL;DR

FocusDiff introduces a reinforcement learning-based method to improve fine-grained text-image alignment, enabling more precise control over visual details in autoregressive image generation, especially on challenging paired prompts with subtle semantic differences.

Contribution

The paper presents FocusDiff, a novel approach that enhances fine-grained semantic alignment in text-to-image models using reinforcement learning and a new dataset, outperforming existing methods.

Findings

01. Achieves state-of-the-art results on standard benchmarks.
02. Significantly outperforms prior methods on the PairComp benchmark.
03. Improves control over subtle semantic differences in generated images.

Abstract

Recent studies extend the autoregression paradigm to text-to-image generation, achieving performance comparable to diffusion models. However, our new PairComp benchmark -- featuring test cases of paired prompts with similar syntax but different fine-grained semantics -- reveals that existing models struggle with fine-grained text-image alignment, thus failing to realize precise control over visual tokens. To address this, we propose FocusDiff, which enhances fine-grained text-image semantic alignment by focusing on subtle differences between similar text-image pairs. We construct a new dataset of paired texts and images with similar overall expressions but distinct local semantics, and further introduce a novel reinforcement learning algorithm that emphasizes such fine-grained semantic differences for desired image generation. Our approach achieves state-of-the-art performance on existing…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The PairComp benchmark is a strong contribution. The design is a more direct and rigorous probe than existing single-prompt compositional benchmarks. The justification for the geometric mean ($s_g$) (l. 137) is sound and identifies an important failure mode (unstable generation).
2. The pipeline for creating FocusDiff-Data (Sec 3.1, Appendix C) is innovative. Using image editing datasets as a source for contrastive pairs is a clever solution to the data-sourcing problem, and the resulting dat
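As a rough illustration of why a geometric mean exposes unstable generation where an arithmetic mean does not, the sketch below compares the two on a pair of per-prompt scores. The function name `pair_scores` and the example scores are hypothetical, and the paper's exact definition of $s_g$ is not reproduced on this page; the sketch assumes scores in [0, 1] and a plain two-value geometric mean.

```python
import math


def pair_scores(score_a: float, score_b: float) -> tuple[float, float]:
    """Return (arithmetic mean, geometric mean) of two per-prompt scores.

    Assumption: each score lies in [0, 1] and measures how well the image
    generated for one prompt of a pair matches that prompt.
    """
    arithmetic = (score_a + score_b) / 2
    geometric = math.sqrt(score_a * score_b)
    return arithmetic, geometric


# A model that handles both prompts of a pair equally well.
balanced = pair_scores(0.8, 0.8)

# A model that handles one prompt perfectly and the other poorly.
unstable = pair_scores(1.0, 0.6)

print(f"balanced: arithmetic={balanced[0]:.3f}, geometric={balanced[1]:.3f}")
print(f"unstable: arithmetic={unstable[0]:.3f}, geometric={unstable[1]:.3f}")
```

Both cases have the same arithmetic mean, but the geometric mean is lower for the unbalanced pair, which is the kind of instability the reviewer says $s_g$ is meant to surface.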

Weaknesses

1. Conflating reward model + evaluator. The QA-based reward model for RL is InternVL2.5-26B (l. 820). The primary evaluation model for the PairComp benchmark is also InternVL2.5-26B (l. 124, 216). This is a confound. The RL policy is being optimized to maximize the score of the same model that is used to judge its performance. The gains reported in the main results (Table 1) could represent reward hacking or overfitting to the specific biases of the InternVL2.5-26B evaluator, rather than a true

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The paper tackles a highly important problem, improving compositionality in AR text-to-image generative models, which is crucial for achieving more reliable and interpretable generation.
- A key contribution is the contrastive pair dataset construction: rather than synthetically generating contrastive pairs (as done in prior work, which often introduces inconsistencies), the authors leverage existing high-quality image editing datasets and apply additional filtering strategies to ensure semant

Weaknesses

- The writing quality requires significant improvement. The paper often lacks clarity and coherence in some parts. For instance, in the introduction, the arithmetic and geometric evaluation metrics are introduced abruptly without proper explanation, which can confuse readers. In addition, several notation errors and inconsistencies are present (e.g., around lines 136 and 243), which further detract from readability.
- The evaluation methodology is overly simplistic. The authors rely on a single

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Clear Narrative: The paper presents a clear and logical narrative, effectively identifying a key limitation in AR models through a benchmark and then proposing a targeted solution comprising a new dataset and an RL training algorithm.
2. Comprehensive Experiments: The evaluation is thorough, demonstrating the method's effectiveness not only on the proposed PairComp benchmark but also on established general-purpose benchmarks, which validates the model's overall capabilities post-training.
3. I

Weaknesses

1. [Major] Insufficient Motivation and Positioning: The motivation for the PairComp benchmark feels insufficiently urgent. The challenge of fine-grained control is a known problem. The paper should acknowledge and differentiate its contributions from prior work like Winoground-T2I [1] and EvoGen [2] rather than only explaining its unique value beyond existing single-prompt benchmarks.
2. [Major] Incremental Algorithmic Contribution: The novelty of Pair-GRPO over the standard GRPO appears limited. It see
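For context on the baseline the reviewer refers to, standard GRPO scores each sampled output relative to the other outputs drawn for the same prompt: the advantage is the reward minus the group mean, divided by the group standard deviation. The sketch below shows only that normalization step. The function name, the `eps` term, and the example rewards are hypothetical, the use of population rather than sample standard deviation is an assumption (implementations differ), and Pair-GRPO's modifications are not described on this page and are not reproduced here.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampling group, as in standard GRPO.

    Assumption: `rewards` holds one scalar reward per image sampled for the
    same prompt; `eps` guards against division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some code uses sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four images sampled from a single prompt.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
print([round(a, 3) for a in advantages])
```

Outputs scoring above the group mean receive positive advantages and the rest receive negative ones, so no separate value network is needed to form the baseline.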

Code & Models

Models

🤗 wendell0218/Janus-FocusDiff-7B (model · 5 downloads)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Historical Architecture and Urbanism