Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models
Lingxi Xie, Longhui Wei, Xiaopeng Zhang, Kaifeng Bi, Xiaotao Gu, Jianlong Chang, Qi Tian

TL;DR
This paper explores the challenges and lessons from GPT and large language models to guide the development of artificial general intelligence in computer vision, emphasizing unification and environment-based learning.
Contribution
It proposes a conceptual framework for achieving AGI in CV by integrating environment interaction, pre-training, and instruction fine-tuning, inspired by NLP successes.
Findings
Unification is key to advancing CV towards AGI.
Environment-based learning is essential for CV to reach GPT-like capabilities.
A pipeline involving environment interaction, pre-training, and fine-tuning is proposed.
Abstract
The AI community has been pursuing algorithms known as artificial general intelligence (AGI) that apply to any kind of real-world problem. Recently, chat systems powered by large language models (LLMs) have emerged and rapidly become a promising direction to achieve AGI in natural language processing (NLP), but the path towards AGI in computer vision (CV) remains unclear. One may attribute the dilemma to the fact that visual signals are more complex than language signals, yet we are interested in finding concrete reasons, as well as absorbing experiences from GPT and LLMs to solve the problem. In this paper, we start with a conceptual definition of AGI and briefly review how NLP solves a wide range of tasks via a chat system. The analysis suggests that unification is the next important goal of CV. However, despite various efforts in this direction, CV is still far from a system like GPT that…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Discriminative Fine-Tuning · Cosine Annealing · Layer Normalization · Weight Decay · Residual Connection · Softmax
