TL;DR
SwimVG introduces a step-wise multimodal fusion framework with prompts and adapters that enhance visual grounding accuracy and efficiency by replacing traditional transformer stacks with parameter-efficient modules.
Contribution
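The paper proposes SwimVG, a step-wise multimodal fusion and adaptation framework that uses step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) in place of a stack of visual-language transformers, making visual grounding more efficient and effective.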
Findings
Achieves state-of-the-art results on four benchmarks.
Improves alignment between vision and language representations.
Reduces computational costs compared to methods that fully fine-tune uni-modal pre-trained models and then stack visual-language transformers for fusion.
Abstract
Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained models, followed by a simple stack of visual-language transformers for multimodal fusion. However, these approaches not only limit adequate interaction between visual and linguistic contexts, but also incur significant computational costs. Therefore, to address these issues, we explore a step-wise multimodal fusion and adaption framework, namely SwimVG. Specifically, SwimVG proposes step-wise multimodal prompts (Swip) and cross-modal interactive adapters (CIA) for visual grounding, replacing the cumbersome transformer stacks for multimodal fusion. Swip can improve the alignment between the vision and language representations step by step, in a token-level…
