CART: Compositional Auto-Regressive Transformer for Image Generation
Siddharth Roheda, Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal

TL;DR
CART is a hierarchical auto-regressive transformer for image generation that models images as compositions of interpretable visual layers, aiming to improve control, interpretability, and scalability over next-token and next-scale approaches.
Contribution
The paper presents a hierarchical AR approach that generates images detail by detail, and demonstrates it with three decomposition strategies (base-detail, intrinsic, and specularity) to improve generation quality and controllability.
Findings
Outperforms traditional next-token and next-scale approaches in image quality
Enables structured image manipulation and control
Demonstrates flexibility across different decomposition methods
Abstract
We propose a novel Auto-Regressive (AR) image generation approach that models images as hierarchical compositions of interpretable visual layers. While AR models have achieved transformative success in language modeling, replicating this success in vision tasks remains challenging due to inherent spatial dependencies in images. Addressing the unique challenges of vision tasks, our method (CART) adds image details iteratively via semantically meaningful decompositions. We demonstrate the flexibility and generality of CART by applying it across three distinct decomposition strategies: (i) Base-Detail Decomposition (Mumford-Shah smoothness), (ii) Intrinsic Decomposition (albedo/shading), and (iii) Specularity Decomposition (diffuse/specular). This next-detail strategy outperforms traditional next-token and next-scale approaches, improving controllability, semantic interpretability, and…
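The base-detail idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simple box blur as a stand-in for the Mumford-Shah smoothing named in the abstract, and the function names (`box_blur`, `decompose`, `compose`) and the radii are hypothetical choices made for illustration. It only shows how an image can be split into a smooth base plus a sequence of detail layers, and how adding the layers back one at a time recovers the image, which is the ordering a next-detail model would predict.

```python
import numpy as np


def box_blur(img, radius):
    """Separable box blur with edge padding.

    Stand-in smoother for illustration only (NOT Mumford-Shah).
    """
    if radius == 0:
        return img.copy()
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(img, radius, mode="edge")
    # Blur along rows, then along columns; "valid" trims the padding.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded
    )
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows
    )


def decompose(img, radii=(8, 4, 2)):
    """Split a 2-D image into a smooth base and coarse-to-fine detail layers.

    radii are assumed to be listed from strongest to weakest smoothing.
    """
    levels = [box_blur(img, r) for r in radii] + [img]
    base = levels[0]
    # Each detail layer is the difference between consecutive smoothing levels.
    details = [fine - coarse for coarse, fine in zip(levels[:-1], levels[1:])]
    return base, details


def compose(base, details, upto=None):
    """Add the first `upto` detail layers (all of them if None) to the base."""
    out = base.copy()
    for d in details[:upto]:
        out = out + d
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((32, 32))
    base, details = decompose(image)
    for step in range(len(details) + 1):
        err = np.abs(compose(base, details, upto=step) - image).max()
        print(f"layers added: {step}, max abs error: {err:.4f}")
```

Because the detail layers telescope, adding all of them to the base reproduces the input up to floating-point error. The intrinsic (albedo/shading) and specularity (diffuse/specular) decompositions in the abstract would use different layer definitions and are not covered by this sketch.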
Taxonomy
Topics: Neural Networks and Applications · Brain Tumor Detection and Classification · Medical Image Segmentation Techniques
Methods: Balanced Selection
