TL;DR
HiVG is a hierarchical multimodal framework for visual grounding. It adapts a multimodal pre-trained model to the grounding task with a multi-layer adaptive cross-modal bridge and hierarchical low-rank adaptation (HiLoRA), with the aim of resolving cross-modal inconsistencies and reducing perceptual errors while improving accuracy and efficiency.
Contribution
The paper proposes HiVG, a novel hierarchical multimodal fine-grained modulation framework that effectively bridges the gap between pre-training and grounding tasks using a multi-layer adaptive cross-modal bridge and HiLoRA.
Findings
Outperforms existing methods on five datasets
Demonstrates significant grounding accuracy improvements
Offers energy-efficient model design
Abstract
Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge can address the inconsistency…
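For readers unfamiliar with the low-rank adaptation that the abstract builds on, the following is a minimal sketch of plain LoRA only. It is not HiLoRA and not the authors' implementation, since the hierarchical scheme is not described in the text available here. The class name `LoRALinear`, the helper `matmul`, and the default values are hypothetical and chosen for illustration. The idea shown is the standard one: a frozen pre-trained weight W is used together with a trainable low-rank update, W + (alpha / r) * B A, where B starts at zero so the adapted layer initially behaves exactly like the pre-trained one.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical illustration: frozen weight W (d_out x d_in) plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # frozen pre-trained weight, never updated
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A gets a small random init; B starts at zero, so B @ A is zero at the start.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to a vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_params(self):
        """Count the entries of A and B, the only parameters that would be trained."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# Usage: an 8x8 identity matrix stands in for a pre-trained weight.
identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
layer = LoRALinear(identity, rank=2)
output = layer.forward([1, 2, 3, 4, 5, 6, 7, 8])
```

With rank 2 on an 8x8 layer, the adapter trains 32 values instead of the 64 in the full weight. This saving grows with layer size and is the usual motivation for applying LoRA to large pre-trained models.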
