Sample- and Parameter-Efficient Auto-Regressive Image Models
Elad Amrani, Leonid Karlinsky, Alex Bronstein

TL;DR
XTRA introduces a block-based auto-regressive vision model that significantly improves sample and parameter efficiency, achieving state-of-the-art performance on multiple image recognition benchmarks with fewer samples and parameters.
Contribution
The paper proposes a novel block causal masking approach in auto-regressive vision models, enhancing their efficiency and scalability over previous methods.
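A block causal mask can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the image is tokenized into a grid_h × grid_w grid of patch tokens, that the k × k blocks are visited in raster order, and that tokens attend bidirectionally within their own block. The function name block_causal_mask is hypothetical.

```python
import numpy as np

def block_causal_mask(grid_h: int, grid_w: int, k: int) -> np.ndarray:
    """Boolean attention mask for a grid_h x grid_w patch-token grid grouped into k x k blocks.

    mask[i, j] is True when token i may attend to token j, i.e. when token j's
    block comes no later than token i's block. Assumption (not from the paper):
    blocks are ordered in raster order and attention is bidirectional within a block.
    """
    if grid_h % k or grid_w % k:
        raise ValueError("grid dimensions must be divisible by k")
    blocks_per_row = grid_w // k
    # Row and column of every token, with tokens enumerated in raster order over the patch grid.
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)
    # Raster-order index of the k x k block that contains each token.
    block_id = (rows // k) * blocks_per_row + (cols // k)
    # Query i (rows of the mask) may see key j (columns) if block(j) <= block(i).
    return block_id[None, :] <= block_id[:, None]
```

With k = 1 every block is a single token and the mask reduces to the standard lower-triangular causal mask; larger k lets each token condition only on whole preceding blocks.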
Findings
XTRA surpasses previous models on 15 image benchmarks.
XTRA uses 152× fewer samples than prior models.
XTRA achieves comparable or better performance with 7-16× fewer parameters.
Abstract
We introduce XTRA, a vision model pre-trained with a novel auto-regressive objective that significantly enhances both sample and parameter efficiency compared to previous auto-regressive image models. Unlike contrastive or masked image modeling methods, which have not been shown to scale consistently on unbalanced internet data, auto-regressive vision models exhibit scalable and promising performance as model and dataset size increase. In contrast to standard auto-regressive models, XTRA employs a Block Causal Mask, where each Block represents k × k tokens, instead of a standard causal mask. By reconstructing pixel values block by block, XTRA captures higher-level structural patterns over larger image regions. Predicting on blocks allows the model to learn relationships across broader areas of pixels, enabling more abstract and semantically…
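The block-by-block pixel reconstruction described in the abstract can be sketched as a shifted next-block target. This is a hypothetical illustration under stated assumptions: square patches of patch × patch pixels, k × k patches per block in raster order, and a plain mean-squared-error loss in which the model output at block b is scored against the pixels of block b + 1. The abstract does not specify the loss or any pixel normalization, and the helper names image_to_blocks and next_block_mse are illustrative only.

```python
import numpy as np

def image_to_blocks(img: np.ndarray, patch: int, k: int) -> np.ndarray:
    """Split an (H, W, C) image into blocks of k x k patches, blocks in raster order.

    Returns shape (num_blocks, k*k, patch*patch*C): one row of flattened patch
    pixels per token, grouped by block. Assumption: square, non-overlapping patches.
    """
    H, W, C = img.shape
    side = patch * k  # pixel side length of one block
    if H % side or W % side:
        raise ValueError("image size must be divisible by patch * k")
    nb_h, nb_w = H // side, W // side
    # Axes: (block_row, patch_row, pixel_row, block_col, patch_col, pixel_col, C)
    x = img.reshape(nb_h, k, patch, nb_w, k, patch, C)
    # Reorder to (block_row, block_col, patch_row, patch_col, pixel_row, pixel_col, C)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(nb_h * nb_w, k * k, patch * patch * C)

def next_block_mse(pred: np.ndarray, blocks: np.ndarray) -> float:
    """Hypothetical next-block loss: pred[b] (output at block b) is scored against blocks[b + 1].

    pred and blocks share the shape returned by image_to_blocks. The output at
    the last block has no target and is ignored. MSE is an assumed choice of loss.
    """
    return float(np.mean((pred[:-1] - blocks[1:]) ** 2))
```

Under a block causal mask, tokens in block b can see only blocks 0 through b, so shifting the targets by one block is what keeps the prediction of block b + 1 from seeing its own pixels.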
Taxonomy
Topics: Medical Image Segmentation Techniques · Image Retrieval and Classification Techniques · Image and Signal Denoising Methods
