Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Shenghao Xie; Wenqiang Zu; Mingyang Zhao; Duo Su; Shilong Liu; Ruohua Shi; Guoqi Li; Shanghang Zhang; Lei Ma

arXiv:2410.22217 · cs.CV · October 31, 2024

Open Access

TL;DR

This survey reviews recent advances in autoregressive vision foundation models, emphasizing their potential to unify understanding and generation in vision tasks, and discusses future research directions.

Contribution

First comprehensive survey on autoregressive vision foundation models focusing on unifying understanding and generation capabilities.

Findings

01. Analyzes limitations of current vision foundation models.

02. Categorizes models based on tokenizers and backbones.

03. Identifies promising future research challenges.

Abstract

Autoregression in large language models (LLMs) has shown impressive scalability by unifying all language tasks into the next-token prediction paradigm. Recently, there has been growing interest in extending this success to vision foundation models. In this survey, we review recent advances and discuss future directions for autoregressive vision foundation models. First, we present the trend for the next generation of vision foundation models, i.e., unifying both understanding and generation in vision tasks. We then analyze the limitations of existing vision foundation models and present a formal definition of autoregression along with its advantages. Next, we categorize autoregressive vision foundation models by their vision tokenizers and autoregression backbones. Finally, we discuss several promising research challenges and directions. To the best of our knowledge, this is the first survey…
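
The next-token prediction paradigm mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not reproduce the survey's formal definition, so the code below assumes the standard chain-rule factorization, p(x) = prod_t p(x_t | x_<t), applied to a sequence of discrete tokens. The names `sequence_log_prob` and `uniform_cond_prob`, the vocabulary size of 4, and the example token indices are all hypothetical and chosen only for illustration; the uniform model is a stand-in for a learned autoregressive backbone, not a method from the survey.

```python
import math


def sequence_log_prob(tokens, cond_prob):
    """Chain-rule log-likelihood: log p(x) = sum over t of log p(x_t | x_<t)."""
    total = 0.0
    for t, tok in enumerate(tokens):
        # Each token is scored given only the tokens that precede it.
        total += math.log(cond_prob(tok, tokens[:t]))
    return total


def uniform_cond_prob(token, prefix, vocab_size=4):
    # Hypothetical stand-in for a learned model: it ignores the prefix
    # and assigns equal probability to every token in the vocabulary.
    return 1.0 / vocab_size


# Hypothetical token indices, e.g. as produced by some vision tokenizer.
tokens = [2, 0, 3, 1]
log_p = sequence_log_prob(tokens, uniform_cond_prob)
```

In a real system, `cond_prob` would be replaced by the backbone's predicted distribution over the tokenizer's vocabulary, and training would maximize this log-likelihood over a dataset.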

Taxonomy

Topics: Education and Islamic Studies