HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding

Linhui Xiao; Xiaoshan Yang; Fang Peng; Yaowei Wang; Changsheng Xu

arXiv:2404.13400·cs.CV·September 6, 2024

TL;DR

HiVG introduces a hierarchical multimodal framework that enhances visual grounding by addressing cross-modal inconsistencies and reducing perceptual errors, leveraging pre-training and low-rank adaptation for improved accuracy and efficiency.

Contribution

The paper proposes HiVG, a novel hierarchical multimodal fine-grained modulation framework that effectively bridges the gap between pre-training and grounding tasks using a multi-layer adaptive cross-modal bridge and HiLoRA.

Findings

01 Outperforms existing methods on five datasets

02 Demonstrates significant grounding accuracy improvements

03 Offers energy-efficient model design

Abstract

Visual grounding, which aims to ground a visual region via natural language, is a task that heavily relies on cross-modal alignment. Existing works utilized uni-modal pre-trained models to transfer visual or linguistic knowledge separately while ignoring the multimodal corresponding information. Motivated by recent advancements in contrastive language-image pre-training and low-rank adaptation (LoRA) methods, we aim to solve the grounding task based on multimodal pre-training. However, there exist significant task gaps between pre-training and grounding. Therefore, to address these gaps, we propose a concise and efficient hierarchical multimodal fine-grained modulation framework, namely HiVG. Specifically, HiVG consists of a multi-layer adaptive cross-modal bridge and a hierarchical multimodal low-rank adaptation (HiLoRA) paradigm. The cross-modal bridge can address the inconsistency…
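
For readers unfamiliar with the low-rank adaptation (LoRA) idea that the abstract cites as motivation, the following is a minimal sketch of a generic LoRA linear layer in pure Python. It is not the paper's HiLoRA and does not reproduce its hierarchical scheme; the names `lora_forward`, `lora_param_count`, `alpha`, and the matrix shapes are illustrative assumptions. The sketch only shows how a frozen weight `W` is combined with a trainable low-rank update `B A`, and why that update needs far fewer parameters than `W` itself.

```python
# Generic LoRA sketch (assumption: NOT the HiLoRA method from the paper).
# Frozen weight W has shape (d_out, d_in); trainable factors are
# A with shape (r, d_in) and B with shape (d_out, r), where r is small.

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply matrices a (n x k) and b (k x m), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_forward(x, W, A, B, alpha=1.0):
    """Compute y = x W^T + (alpha / r) * (x A^T) B^T for a batch x (n x d_in)."""
    r = len(A)                                   # rank of the update
    base = matmul(x, transpose(W))               # frozen path
    update = matmul(matmul(x, transpose(A)), transpose(B))  # low-rank path
    scale = alpha / r
    return [[b + scale * u for b, u in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters in the low-rank factors A and B."""
    return r * (d_in + d_out)

if __name__ == "__main__":
    x = [[1.0, 2.0]]                 # one input row, d_in = 2
    W = [[1.0, 0.0], [0.0, 1.0]]     # frozen identity weight, d_out = 2
    A = [[1.0, 1.0]]                 # rank r = 1
    B = [[0.0], [0.0]]               # zero-initialised, so the update starts at 0
    print(lora_forward(x, W, A, B))
    print(lora_param_count(512, 512, 8), "trainable vs", 512 * 512, "frozen")
```

Initialising `B` to zeros is the usual LoRA convention: the adapted layer starts out identical to the pre-trained one, and only `A` and `B` are updated during fine-tuning.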

Code & Models

Repositories

linhuixiao/hivg (PyTorch, official)