Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts

Honglin Li; Yuting Gao; Chenglu Zhu; Jingdong Chen; Ming Yang; Lin Yang

arXiv:2411.13909·cs.CV·November 25, 2024

Open Access

TL;DR

Panther is a multimodal large language model that injects the user's instruction early in the vision encoder. This sharpens visual perception, reduces redundant visual information, and helps locate small objects more accurately, especially on vision-centric benchmarks.

Contribution

Introduces Panther, a novel MLLM built around instruction-guided visual prompts, with modules that improve visual focus and reduce training cost without restricting the choice of decoder architecture.

Findings

01 Effective on vision-centric benchmarks

02 Improves accuracy in locating small objects

03 Reduces training costs significantly

Abstract

Multimodal large language models (MLLMs) are rapidly closing the gap to human visual perception, yet they still lag behind in attending to subtle image details, locating small objects precisely, and similar tasks. Common schemes to tackle these issues include deploying multiple vision encoders or operating on the original high-resolution images. Few studies have concentrated on using the textual instruction to improve the visual representation, which results in a loss of focus in some vision-centric tasks, a phenomenon we term Amblyopia. In this work, we introduce Panther, an MLLM that closely adheres to user instructions and locates targets of interest precisely, with the finesse of a black panther. Specifically, Panther comprises three integral components: Panther-VE, Panther-Bridge, and Panther-Decoder. Panther-VE integrates user instruction information at the early stages of the…

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling

Methods: Focus