VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wenhai Wang; Zhe Chen; Xiaokang Chen; Jiannan Wu; Xizhou Zhu; Gang Zeng; Ping Luo; Tong Lu; Jie Zhou; Yu Qiao; Jifeng Dai

arXiv:2305.11175 · cs.CV · May 26, 2023 · 131 cites

TL;DR

VisionLLM introduces a unified framework that leverages large language models as open-ended decoders for vision-centric tasks, enabling flexible, instruction-based task customization and achieving competitive results in object detection.

Contribution

The paper treats images as a foreign language, aligning vision-centric tasks with language tasks that are defined through language instructions, and demonstrates that LLMs can serve as decoders for open-ended vision applications.

Findings

01. Achieves over 60% mAP on the COCO dataset.

02. Enables flexible task customization via language instructions.

03. Performs on par with detection-specific models.

Abstract

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these…
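
To make the idea of defining a vision task through a language instruction concrete, here is a minimal sketch. It is not the paper's prompt format: the `TaskSpec` class, the `build_instruction` function, the `<image>` placeholder, and the wording of the template are all hypothetical names chosen for illustration. It only shows how one text template could describe different tasks by changing a few fields.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical container for a vision task described in language."""
    task: str                 # e.g. "object detection"
    categories: list[str]     # class names the model should consider
    output_format: str        # textual description of the expected answer


def build_instruction(spec: TaskSpec, image_token: str = "<image>") -> str:
    """Render a task description as a plain-text instruction.

    The template wording is an assumption, not the format used by VisionLLM.
    """
    category_text = ", ".join(spec.categories)
    return (
        f"Given the image {image_token}, perform {spec.task} "
        f"for the categories: {category_text}. "
        f"Answer in the format: {spec.output_format}."
    )


# Two different tasks expressed through the same template.
detection = TaskSpec(
    task="object detection",
    categories=["person", "dog"],
    output_format="(class, x1, y1, x2, y2)",
)
captioning = TaskSpec(
    task="image captioning",
    categories=["any"],
    output_format="one sentence",
)

print(build_instruction(detection))
print(build_instruction(captioning))
```

Under these assumptions, switching from detection to captioning changes only the fields of the task description, not the code that consumes it, which is the kind of instruction-based customization the abstract describes.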

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques