TL;DR
This paper introduces Multi-modal Conditional Adaptation (MMCA), a novel method that adaptively updates the visual encoder in visual grounding tasks by integrating multi-modal information, leading to improved accuracy and efficiency.
Contribution
The paper proposes MMCA, a lightweight approach that dynamically adapts the visual encoder using multi-modal embeddings, addressing limitations of previous methods that rely solely on textual guidance.
Findings
Achieves state-of-the-art results on four datasets.
Demonstrates significant performance improvements over existing methods.
Shows that MMCA is efficient and lightweight through ablation studies.
Abstract
Visual grounding is the task of locating objects specified by natural language expressions. Existing methods extend generic object detection frameworks to tackle this task. They typically extract visual and textual features separately using independent visual and textual encoders, then fuse these features in a multi-modal decoder for final prediction. However, visual grounding presents unique challenges. It often involves locating objects with different text descriptions within the same image. Existing methods struggle with this task because the independent visual encoder produces identical visual features for the same image, limiting detection performance. Some recent approaches propose various language-guided visual encoders to address this issue, but they mostly rely solely on textual information and require sophisticated designs. In this paper, we introduce Multi-modal Conditional…
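The limitation described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract is truncated before the method is described, so the low-rank, gated weight update below is an assumption made only to show the idea of conditioning a visual encoder on a multi-modal embedding. All names (`independent_encoder`, `conditional_encoder`, `gate_proj`, the toy dimensions) are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

D_IMG, D_TXT, D_OUT, RANK = 4, 3, 4, 2  # toy sizes, chosen only for illustration


def rand_matrix(rows, cols):
    """Return a rows x cols matrix of Gaussian weights (stand-in for trained weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def independent_encoder(W, image):
    """Text never enters this function, so the output depends on the image alone."""
    return matvec(W, image)


def conditional_encoder(W, A, B, gate_proj, image, text):
    """Assumed form of conditional adaptation: base weights plus a low-rank
    update whose per-rank coefficients come from a multi-modal embedding."""
    multimodal = image + text                # assumed fusion: simple concatenation
    coeffs = matvec(gate_proj, multimodal)   # one coefficient per rank
    down = matvec(A, image)                  # project the image to RANK dimensions
    scaled = [c * d for c, d in zip(coeffs, down)]
    update = matvec(B, scaled)               # project back up to D_OUT dimensions
    base = matvec(W, image)
    return [b + u for b, u in zip(base, update)]


W = rand_matrix(D_OUT, D_IMG)
A = rand_matrix(RANK, D_IMG)
B = rand_matrix(D_OUT, RANK)
gate_proj = rand_matrix(RANK, D_IMG + D_TXT)

image = [0.5, -1.0, 0.25, 2.0]   # one image, queried with two expressions
text_a = [1.0, 0.0, 0.0]         # toy embedding of a first expression
text_b = [0.0, 1.0, 0.0]         # toy embedding of a second expression

fixed_features = independent_encoder(W, image)
features_a = conditional_encoder(W, A, B, gate_proj, image, text_a)
features_b = conditional_encoder(W, A, B, gate_proj, image, text_b)

print("independent:", fixed_features)
print("conditioned on text_a:", features_a)
print("conditioned on text_b:", features_b)
```

In this sketch the independent encoder returns the same features for the image no matter which expression is asked about, while the conditioned encoder produces different features for `text_a` and `text_b`. Only `gate_proj`, `A`, and `B` are added on top of the base weights, which is one way such an adaptation could stay lightweight.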
Taxonomy
Methods: Sparse Evolutionary Training · Focus
