Visual Grounding with Multi-modal Conditional Adaptation

Ruilin Yao; Shengwu Xiong; Yichen Zhao; Yi Rong

arXiv:2409.04999·cs.CV·September 10, 2024

TL;DR

This paper introduces Multi-modal Conditional Adaptation (MMCA), a novel method that adaptively updates the visual encoder in visual grounding tasks by integrating multi-modal information, leading to improved accuracy and efficiency.

Contribution

The paper proposes MMCA, a lightweight approach that dynamically adapts the visual encoder using multi-modal embeddings, addressing limitations of previous methods that rely solely on textual guidance.
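To make the idea of adapting a visual encoder from a multi-modal embedding concrete, here is a minimal, dependency-free sketch. It is not the authors' implementation. The low-rank form of the weight update, fusion by concatenation, and every name in it (`ConditionalLinear`, `fuse`, the toy dimensions) are assumptions chosen for illustration. It only shows the general pattern: one layer's effective weights depend on a condition vector built from both the image and the text, so the same image can yield different features for different expressions.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def fuse(visual_emb, text_emb):
    """Build a condition vector by concatenation (an assumed, simplistic fusion)."""
    return list(visual_emb) + list(text_emb)


class ConditionalLinear:
    """Linear layer whose weights receive a condition-dependent low-rank update.

    Hypothetical form: y = W0 x + B (s(c) * (A x)), where s(c) = G c gives
    one scale per rank component and c is the multi-modal condition vector.
    """

    def __init__(self, d_in, d_out, d_cond, rank, seed=0):
        rng = random.Random(seed)
        self.W0 = rand_matrix(d_out, d_in, rng)  # base encoder weights
        self.A = rand_matrix(rank, d_in, rng)    # down-projection
        self.B = rand_matrix(d_out, rank, rng)   # up-projection
        self.G = rand_matrix(rank, d_cond, rng)  # condition -> per-rank scales

    def forward(self, x, cond):
        base = matvec(self.W0, x)
        scales = matvec(self.G, cond)
        low = [s * a for s, a in zip(scales, matvec(self.A, x))]
        delta = matvec(self.B, low)
        return [b + d for b, d in zip(base, delta)]


# Toy inputs: one image token, one pooled image embedding, two expressions.
x = [0.3, -0.1, 0.8, 0.5]
image_summary = [0.5, -0.2]
text_left = [1.0, 0.0]    # stands in for one referring expression
text_right = [0.0, 1.0]   # stands in for a different expression

layer = ConditionalLinear(d_in=4, d_out=3, d_cond=4, rank=2, seed=0)
out_left = layer.forward(x, fuse(image_summary, text_left))
out_right = layer.forward(x, fuse(image_summary, text_right))
out_base = layer.forward(x, [0.0] * 4)  # zero condition: no adaptation
plain = matvec(layer.W0, x)             # what an independent encoder computes
```

With a zero condition the layer reduces to the plain `W0 x` projection, which corresponds to an independent visual encoder. With the two different text embeddings, `out_left` and `out_right` differ even though `x` and `image_summary` are identical.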

Findings

01. Achieves state-of-the-art results on four datasets.
02. Demonstrates significant performance improvements over existing methods.
03. Shows that MMCA is efficient and lightweight through ablation studies.

Abstract

Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional…

Code & Models

Repositories

mr-bigworth/mmca (PyTorch, official)

Models

🤗 linhuixiao/Awesome-Visual-Grounding (model, ♡ 1)

Taxonomy

Methods: Sparse Evolutionary Training · Focus