Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models
Hossein Shahabadi, Niki Sepasian, Arash Marioriyad, Ali Sharifi-Zarchi, Mahdieh Soleymani Baghshah

TL;DR
This paper systematically compares the compositional alignment capabilities of VAR and diffusion-based text-to-image models, revealing strengths and weaknesses across multiple benchmarks and establishing baselines for future research.
Contribution
It provides the first comprehensive benchmark comparison of VAR and diffusion T2I models on compositional tasks, highlighting the strengths of Infinity-8B and Infinity-2B.
Findings
Infinity-8B achieves the best overall compositional alignment.
Infinity-2B matches or exceeds larger diffusion models in several categories.
SDXL and PixArt-α show weaknesses in attribute and spatial tasks.
Abstract
Achieving compositional alignment between textual descriptions and generated images - covering objects, attributes, and spatial relationships - remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems - SDXL, PixArt-α, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B - across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs.…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques
