PerceptionGPT: Effectively Fusing Visual Perception into LLM

Renjie Pi; Lewei Yao; Jiahui Gao; Jipeng Zhang; Tong Zhang

arXiv:2311.06612 · cs.CV · November 14, 2023 · 1 citation

TL;DR

PerceptionGPT introduces an end-to-end framework that enhances large language models with visual perception capabilities by leveraging token embeddings, lightweight encoders, and decoders, achieving superior performance with less training data and computational resources.

Contribution

The paper presents a novel method that integrates visual perception into LLMs using token embeddings and lightweight modules, reducing training complexity and resource requirements.

Findings

01. Significant performance improvements over previous methods.

02. Fewer trainable parameters and GPU hours needed.

03. Effective handling of multiple visual outputs.

Abstract

The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs). However, effectively harnessing VLLMs for intricate visual perception tasks remains a challenge. In this paper, we present a novel end-to-end framework named PerceptionGPT, which efficiently and effectively equips VLLMs with visual perception abilities by leveraging the representation power of the LLM's token embeddings. Our proposed method treats the token embedding of the LLM as the carrier of spatial information, then leverages lightweight visual task encoders and decoders to perform visual perception tasks (e.g., detection, segmentation). Our approach significantly alleviates the training difficulty suffered by previous approaches that formulate the visual outputs as discrete tokens, and enables…
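As a rough illustration of the idea in the abstract, where a single LLM token embedding carries spatial information that lightweight modules encode and decode, the following is a minimal sketch and not the authors' implementation. The class names `BoxEncoder` and `BoxDecoder`, the embedding width `HIDDEN`, the two-layer MLP shape, the L1 regression loss, and the random vector standing in for the LLM's hidden state are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random

HIDDEN = 16  # assumed embedding width; real LLMs use far larger dimensions


def make_linear(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply a dense layer to a vector given as a list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


class BoxEncoder:
    """Hypothetical lightweight encoder: box (4 floats) -> token embedding."""

    def __init__(self, rng):
        self.l1 = make_linear(4, HIDDEN, rng)
        self.l2 = make_linear(HIDDEN, HIDDEN, rng)

    def __call__(self, box):
        return linear(self.l2, relu(linear(self.l1, box)))


class BoxDecoder:
    """Hypothetical lightweight decoder: token embedding -> box in [0, 1]^4."""

    def __init__(self, rng):
        self.l1 = make_linear(HIDDEN, HIDDEN, rng)
        self.l2 = make_linear(HIDDEN, 4, rng)

    def __call__(self, embedding):
        return sigmoid(linear(self.l2, relu(linear(self.l1, embedding))))


def l1_loss(pred, target):
    """Mean absolute error; an assumed choice of continuous regression loss."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


rng = random.Random(0)
encoder, decoder = BoxEncoder(rng), BoxDecoder(rng)

# A normalized box (x1, y1, x2, y2) becomes one continuous input embedding
# instead of a sequence of discrete coordinate tokens.
gt_box = [0.10, 0.20, 0.55, 0.80]
input_embedding = encoder(gt_box)

# Stand-in for the LLM hidden state at the position of a perception token.
llm_hidden_state = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]
pred_box = decoder(llm_hidden_state)

# The box is supervised with a regression loss on continuous values.
loss = l1_loss(pred_box, gt_box)
```

The point of the sketch is the contrast the abstract draws: the box is read from and written to one continuous embedding, so it can be supervised by regression rather than by predicting a string of discrete coordinate tokens. A segmentation variant would swap in a mask decoder, which is not shown here.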

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques