A Survey on Vision Autoregressive Model
Kai Jiang, Jiaxing Huang

TL;DR
This survey comprehensively reviews the development, applications, and challenges of vision autoregressive models, highlighting their recent advancements in various vision tasks and multimodal generation.
Contribution
It provides a systematic taxonomy, analyzes recent progress, benchmarks methods, and discusses future directions for vision autoregressive models.
Findings
Autoregressive models excel in diverse vision tasks including image and video generation.
Recent advancements enable unified multimodal visual understanding and generation.
Benchmarking reveals strengths and limitations of current methods.
Abstract
Autoregressive models have demonstrated strong performance in natural language processing (NLP), with impressive scalability, adaptability, and generalizability. Inspired by this success in NLP, autoregressive models have recently been intensively investigated for computer vision. They perform next-token prediction by representing visual data as visual tokens, which enables autoregressive modelling for a wide range of vision tasks, from visual generation and visual understanding to the very recent multimodal generation that unifies visual generation and understanding in a single autoregressive model. This paper provides a systematic review of vision autoregressive models, including a taxonomy of existing methods that highlights their major contributions, strengths, and limitations, covering various vision tasks such as image generation, video…
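As a rough illustration of the next-token formulation mentioned in the abstract, the sketch below turns a tiny grayscale image into discrete visual tokens in raster-scan order and builds the (context, target) pairs an autoregressive model would be trained on. Everything here is an assumption made for illustration and is not taken from the survey: the uniform quantizer stands in for a learned visual tokenizer, the helper names (`tokenize`, `next_token_pairs`, `fit_bigram`) are hypothetical, and the bigram count model conditions only on the previous token rather than on the full prefix.

```python
import math
from collections import Counter, defaultdict

def tokenize(image, levels=4, max_value=255):
    """Map each pixel to one of `levels` discrete tokens, in raster-scan order.
    (Assumption: a uniform quantizer stands in for a learned visual tokenizer.)"""
    step = (max_value + 1) / levels
    return [int(pixel // step) for row in image for pixel in row]

def next_token_pairs(tokens):
    """Return (context, target) pairs: each token is predicted from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def fit_bigram(tokens, vocab_size, smoothing=1.0):
    """Toy stand-in for a learned model: p(x_i | x_{i-1}) from smoothed counts.
    (Simplification: conditions on the previous token only, not the whole prefix.)"""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1

    def prob(prev, cur):
        total = sum(counts[prev].values()) + smoothing * vocab_size
        return (counts[prev][cur] + smoothing) / total

    return prob

def log_likelihood(tokens, prob):
    """Sum of log p(x_i | x_{i-1}) over the sequence, i.e. the chain-rule objective
    restricted to a one-token context."""
    return sum(math.log(prob(p, c)) for p, c in zip(tokens, tokens[1:]))

# A hypothetical 2x4 grayscale image with intensities in [0, 255].
image = [[0, 64, 128, 255],
         [0, 64, 128, 255]]
tokens = tokenize(image)
pairs = next_token_pairs(tokens)
prob = fit_bigram(tokens, vocab_size=4)
print(tokens)
print(log_likelihood(tokens, prob))
```

In a real system the quantizer would typically be replaced by a learned tokenizer and the count model by a neural sequence model, but the training signal has the same shape: predict each visual token from the tokens that precede it.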
Taxonomy
Topics: Remote-Sensing Image Classification · Remote Sensing and Land Use · Infrared Target Detection Methodologies
