Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

Hossein Shahabadi; Niki Sepasian; Arash Marioriyad; Ali Sharifi-Zarchi; Mahdieh Soleymani Baghshah

arXiv:2512.11542 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper systematically compares the compositional alignment capabilities of VAR and diffusion-based text-to-image models, revealing strengths and weaknesses across multiple benchmarks and establishing baselines for future research.

Contribution

The paper provides the first comprehensive benchmark comparison of VAR and diffusion T2I models on compositional tasks, highlighting the strengths of Infinity-8B and Infinity-2B.

Findings

01

Infinity-8B achieves the best overall compositional alignment.

02

Infinity-2B matches or exceeds larger diffusion models in several categories.

03

SDXL and PixArt-α show weaknesses in attribute and spatial tasks.

Abstract

Achieving compositional alignment between textual descriptions and generated images (covering objects, attributes, and spatial relationships) remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined. We benchmark six diverse T2I systems (SDXL, PixArt-α, Flux-Dev, Flux-Schnell, Infinity-2B, and Infinity-8B) across the full T2I-CompBench++ and GenEval suites, evaluating alignment in color and attribute binding, spatial relations, numeracy, and complex multi-object prompts. Across both benchmarks, Infinity-8B achieves the strongest overall compositional alignment, while Infinity-2B also matches or exceeds larger diffusion models in several categories, highlighting favorable efficiency-performance trade-offs.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques