GiVE: Guiding Visual Encoder to Perceive Overlooked Information

Junjie Li; Jianghong Ma; Xiaofeng Zhang; Yuhang Li; Jianyang Shi

arXiv:2410.20109·cs.CV·March 24, 2025

TL;DR

GiVE augments a visual encoder with an Attention-Guided Adapter (AG-Adapter) and an Object-focused Visual Semantic Learning module trained with three object-focused loss terms (OITC, OIIC, OID). The authors report improved object perception and retrieval in multimodal models, with state-of-the-art results.

Contribution

The paper presents GiVE, a new visual encoder framework with attention-guided modules and loss functions, plus a new dataset, to better perceive overlooked objects in multimodal tasks.

Findings

01 Achieves state-of-the-art performance on relevant benchmarks.
02 Enhances object retrieval accuracy and comprehensiveness.
03 Improves visual focus adjustment in multimodal models.

Abstract

Multimodal Large Language Models have advanced AI in applications like text-to-video generation and visual question answering. These models rely on visual encoders to convert non-text data into vectors, but current encoders either lack semantic alignment or overlook non-salient objects. We propose the Guiding Visual Encoder to Perceive Overlooked Information (GiVE) approach. GiVE enhances visual representation with an Attention-Guided Adapter (AG-Adapter) module and an Object-focused Visual Semantic Learning module. These incorporate three novel loss terms: Object-focused Image-Text Contrast (OITC) loss, Object-focused Image-Image Contrast (OIIC) loss, and Object-focused Image Discrimination (OID) loss, improving object consideration, retrieval accuracy, and comprehensiveness. Our contributions include dynamic visual focus adjustment, novel loss functions to enhance object retrieval,…
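As background for the "Image-Text Contrast" naming above, the following is a minimal sketch of a generic CLIP-style symmetric contrastive loss over paired image and text embeddings. It is not the paper's OITC definition: the object-focused conditioning is not shown, and the function names, the temperature value of 0.07, and the toy vectors are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length (assumes vec is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Generic image-text contrastive loss (illustrative, not the paper's OITC).

    Pair i of image_vecs and text_vecs is treated as the positive pair;
    every other combination in the batch is treated as a negative.
    """
    images = [l2_normalize(v) for v in image_vecs]
    texts = [l2_normalize(v) for v in text_vecs]
    # Cosine similarities scaled by the temperature: rows are images, columns are texts.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Transpose to get the text-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy usage: matched pairs should score a lower loss than mismatched pairs.
matched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < mismatched)
```

A loss of this shape pulls each image embedding toward its paired text and pushes it away from the other texts in the batch. How GiVE restricts or reweights these pairs around specific objects is described in the paper itself, not in this sketch.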

Code & Models

Datasets

DF1024/MOInst
dataset · 10 dl

Taxonomy

Topics: Human Pose and Action Recognition

Methods: Focus · Adapter