VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Chris Kelly; Luhui Hu; Bang Yang; Yu Tian; Deshun Yang; Cindy Yang; Zaoshan Huang; Zihao Li; Jiayin Hu; Yuexian Zou

arXiv:2403.09027·cs.CV·March 15, 2024·3 cites

Open Access

TL;DR

VisionGPT is a versatile framework that combines large language models with vision foundation models to enable open-world visual perception and comprehensive vision-language understanding across various applications.

Contribution

It introduces a generalized multimodal framework that automates the integration of multiple foundation models using LLMs as a central coordinator, enhancing flexibility and performance.

Findings

1. Effective integration of multiple foundation models for vision tasks.
2. Versatile applications including image understanding, editing, and visual question answering.
3. Demonstrated potential to advance open-world visual perception.

Abstract

With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-source or API-available models to achieve open-world visual perception remains an open question. In this paper, we introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models, thereby facilitating vision-language understanding and the development of vision-oriented AI. VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features: (1) utilizing LLMs (e.g., LLaMA-2) as the pivot to break down users' requests into detailed action proposals to call suitable foundation models; (2) integrating multi-source outputs from foundation models automatically and generating comprehensive responses for users; (3) adapting to a wide range of applications such as…
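The first two features in the abstract (an LLM turning a request into action proposals, then merging the outputs of the called models) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Action` type, the `MODEL_REGISTRY` mapping, the stub model functions, and the keyword-based `propose_actions` planner, which stands in for a real LLM call. None of it is taken from the VisionGPT implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Action:
    """One action proposal: which model to call and on which input."""
    model: str
    argument: str


# Hypothetical stand-ins for vision foundation models; each returns canned text.
def detect_objects(image: str) -> str:
    return f"objects found in {image}: [stub]"


def caption_image(image: str) -> str:
    return f"caption for {image}: [stub]"


# Hypothetical registry mapping model names to callables.
MODEL_REGISTRY: Dict[str, Callable[[str], str]] = {
    "detector": detect_objects,
    "captioner": caption_image,
}


def propose_actions(request: str, image: str) -> List[Action]:
    """Stand-in for the LLM pivot: keyword rules instead of a real LLM call."""
    text = request.lower()
    actions: List[Action] = []
    if "find" in text or "detect" in text:
        actions.append(Action("detector", image))
    if "describe" in text or "caption" in text:
        actions.append(Action("captioner", image))
    return actions


def run(request: str, image: str) -> str:
    """Plan, call each proposed model, and merge the outputs into one response."""
    actions = propose_actions(request, image)
    outputs = [MODEL_REGISTRY[a.model](a.argument) for a in actions]
    if not outputs:
        return "No suitable model for this request."
    return "\n".join(outputs)


if __name__ == "__main__":
    print(run("Detect and describe the scene", "street.jpg"))
```

In a real system the planner would be an LLM prompted to emit structured action proposals, and the merge step would likely be another LLM call rather than a plain string join; the sketch only shows the plan, dispatch, and integrate control flow.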

Taxonomy

Topics: Speech and dialogue systems · Religious Tourism and Spaces · AI in Service Interactions