Autoregressive Models in Vision: A Survey

Jing Xiong; Gongye Liu; Lun Huang; Chengyue Wu; Taiqiang Wu; Yao Mu; Yuan Yao; Hui Shen; Zhongwei Wan; Jinfa Huang; Chaofan Tao; Shen Yan; Huaxiu Yao; Lingpeng Kong; Hongxia Yang; Mi Zhang; Guillermo Sapiro; Jiebo Luo; Ping Luo; Ngai Wong

arXiv:2411.05902·cs.CV·June 3, 2025·2 cites

TL;DR

This survey reviews the development and application of autoregressive models in computer vision. It categorizes them by representation strategy, covers their use in domains such as image, video, and 3D generation, and highlights open challenges.

Contribution

The survey provides a comprehensive categorization and analysis of autoregressive models in vision, including their connections to other generative models and their emerging application areas.

Findings

01. Categorization into pixel-, token-, and scale-based models.

02. Application in diverse domains including medical and embodied AI.

03. Identification of current challenges and future research directions.

Abstract

Autoregressive modeling has been a huge success in the field of natural language processing (NLP). Recently, autoregressive models have emerged as a significant area of focus in computer vision, where they excel at producing high-quality visual content. Autoregressive models in NLP typically operate on subword tokens. In computer vision, however, the representation strategy can vary across levels, i.e., pixel-level, token-level, or scale-level, reflecting the diverse and hierarchical nature of visual data compared to the sequential structure of language. This survey comprehensively examines the literature on autoregressive models applied to vision. To improve readability for researchers from diverse backgrounds, we start with preliminary sequence representation and modeling in vision. Next, we divide the fundamental frameworks of visual autoregressive models into three…
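
To make the three representation levels named in the abstract concrete, the sketch below counts how many autoregressive prediction steps each level would need for a single image. It is a minimal illustration, not code from the survey or its repository: the function names, the 256x256 image size, the 16x tokenizer downsampling factor, and the (1, 2, 4, 8, 16) scale schedule are all assumptions chosen for the example.

```python
def pixel_level_steps(height, width, channels=3):
    """Pixel-level: one prediction step per sub-pixel, in raster order."""
    return height * width * channels


def token_level_steps(height, width, downsample=16):
    """Token-level: one step per discrete token on the tokenizer's latent grid.

    `downsample` is a hypothetical spatial reduction factor of the tokenizer.
    """
    return (height // downsample) * (width // downsample)


def scale_level_steps(scales):
    """Scale-level: one step per scale; each step emits a whole token map."""
    return len(scales)


def scale_level_tokens(scales):
    """Total tokens produced across all scales (side length s gives s*s tokens)."""
    return sum(s * s for s in scales)


if __name__ == "__main__":
    h, w = 256, 256                # assumed image size
    scales = (1, 2, 4, 8, 16)      # assumed coarse-to-fine schedule
    print("pixel-level steps:", pixel_level_steps(h, w))
    print("token-level steps:", token_level_steps(h, w))
    print("scale-level steps:", scale_level_steps(scales))
    print("scale-level tokens:", scale_level_tokens(scales))
```

Under these assumed settings, the pixel-level sequence is hundreds of times longer than the token-level one, and the scale-level model takes only a handful of sequential steps because each step predicts an entire token map in parallel.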

Code & Models

Repositories

chaofantao/autoregressive-models-in-vision-survey (PyTorch, official)

Taxonomy

Topics: Infrared Target Detection Methodologies · Advanced Vision and Imaging · Image Processing Techniques and Applications

Methods: Sparse Evolutionary Training · Focus