VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai

TL;DR
VisionLLM introduces a unified framework that leverages large language models as open-ended decoders for vision-centric tasks, enabling flexible, instruction-based task customization and achieving competitive results in object detection.
Contribution
It treats images as a foreign language, aligns vision-centric tasks with language tasks that are defined through language instructions, and demonstrates that an LLM can serve as an open-ended decoder for vision applications.
Findings
Achieves over 60% mAP on the COCO dataset.
Enables flexible task customization via language instructions.
Matches detection-specific models in performance.
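
As a rough illustration of what "task customization via language instructions" could look like in practice, the sketch below builds a text instruction for a detection task and parses a text-form prediction back into boxes. Everything here is an assumption made for illustration: the instruction wording, the `<image>` placeholder, the `(label, x1, y1, x2, y2)` output format, and the names `build_instruction`, `parse_boxes`, and `Box` are hypothetical and are not taken from the VisionLLM code or paper.

```python
import re
from dataclasses import dataclass


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def build_instruction(categories, task="object detection"):
    """Compose a hypothetical language instruction that defines the task."""
    names = ", ".join(categories)
    return (
        f"Task: {task}. For each object in <image> that belongs to "
        f"one of [{names}], output (label, x1, y1, x2, y2)."
    )


# Matches tuples such as "(dog, 5.5, 6, 70, 80)" in decoded text.
_NUM = r"(-?\d+(?:\.\d+)?)"
_TUPLE = re.compile(
    r"\(\s*([A-Za-z_ ]+?)\s*,\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\)"
)


def parse_boxes(decoded_text, categories):
    """Parse text-form predictions, keeping only the requested categories."""
    allowed = set(categories)
    boxes = []
    for match in _TUPLE.finditer(decoded_text):
        label = match.group(1)
        if label in allowed:
            coords = (float(g) for g in match.groups()[1:])
            boxes.append(Box(label, *coords))
    return boxes


categories = ["person", "dog"]
instruction = build_instruction(categories)
# Stand-in for text that a decoder might emit; not real model output.
decoded = "(person, 10, 20, 110, 220) (cat, 1, 2, 3, 4) (dog, 5.5, 6, 70, 80)"
boxes = parse_boxes(decoded, categories)
```

Changing the category list or the task string changes the instruction without touching the parsing code, which is the kind of flexibility the finding above describes.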
Abstract
Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
