SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding

Liangtao Shi; Ting Liu; Xiantao Hu; Yue Hu; Quanjun Yin; Richang Hong

arXiv:2502.16786 · cs.CV · March 3, 2025

TL;DR

SwimVG introduces a step-wise multimodal fusion framework with prompts and adapters that enhance visual grounding accuracy and efficiency by replacing traditional transformer stacks with parameter-efficient modules.

Contribution

The paper proposes a novel step-wise fusion and adaptation framework, SwimVG, using prompts and adapters for more efficient and effective visual grounding.

Findings

01 Achieves state-of-the-art results on four benchmarks.

02 Improves alignment between vision and language representations.

03 Reduces computational costs compared to traditional methods.

Abstract

Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. To address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve the alignment between the vision and language representations step by step, in a token-level…
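The abstract names cross-modal interactive adapters but, as excerpted here, does not give their internals. To make the general idea concrete, the following is a minimal sketch of a generic bottleneck adapter that lets visual tokens attend to text tokens and adds the result back as a residual. It is not the authors' CIA module: the class name `CrossModalAdapterSketch`, the single-head attention, the bottleneck width, and the zero-initialised up-projection are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class CrossModalAdapterSketch:
    """Hypothetical bottleneck adapter; NOT the paper's CIA implementation.

    Visual tokens query text tokens inside a low-dimensional bottleneck,
    and the attended text features are added back to the visual tokens.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Separate down-projections for each modality (assumption).
        self.w_down_v = rng.normal(0.0, scale, (dim, bottleneck))
        self.w_down_t = rng.normal(0.0, scale, (dim, bottleneck))
        # Zero-initialised up-projection: the adapter starts as an identity
        # map, a common choice for adapters added to a frozen backbone.
        self.w_up = np.zeros((bottleneck, dim))

    def attention(self, visual, text):
        """Return (num_visual, num_text) attention weights."""
        q = visual @ self.w_down_v  # (num_visual, bottleneck)
        k = text @ self.w_down_t    # (num_text, bottleneck)
        return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)

    def __call__(self, visual, text):
        attn = self.attention(visual, text)
        # Text features in the bottleneck space double as attention values.
        mixed = attn @ (text @ self.w_down_t)  # (num_visual, bottleneck)
        return visual + mixed @ self.w_up      # residual connection


if __name__ == "__main__":
    adapter = CrossModalAdapterSketch(dim=8, bottleneck=2)
    visual_tokens = np.ones((5, 8))  # 5 visual tokens (toy data)
    text_tokens = np.ones((3, 8))    # 3 text tokens (toy data)
    print(adapter(visual_tokens, text_tokens).shape)
```

In a parameter-efficient setup of this kind, only the small projection matrices would be trained while the backbone stays frozen, which is the general motivation the abstract gives for replacing a full fusion transformer stack.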

Code & Models

Repositories

liuting20/swimvg (PyTorch, official)
