SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Ming Dai; Lingfeng Yang; Yihao Xu; Zhenhua Feng; Wankou Yang

arXiv:2409.17531 · cs.CV · October 29, 2024 · 6 cites

TL;DR

SimVG introduces a simple, transformer-based framework for visual grounding that decouples multimodal fusion from downstream tasks, leveraging pre-trained models and a lightweight reasoning branch to improve performance and efficiency.

Contribution

The paper proposes a novel decoupled multimodal fusion framework, SimVG, utilizing pre-trained models and a lightweight reasoning branch with dynamic distillation for improved visual grounding.

Findings

01 Achieves state-of-the-art results on six VG datasets.

02 Improves reasoning speed and training efficiency.

03 Demonstrates robustness on complex textual expressions.

Abstract

Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance significantly drops when dealing with complex textual expressions. This is because the former paradigm only utilizes limited downstream data to fit the multi-modal feature fusion. Therefore, it is only effective when the textual expressions are relatively simple. In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated. In this paper, we present a simple yet robust transformer-based…
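As a rough illustration of what "grounding a sentence to a region" means in practice, the sketch below scores a predicted box against an annotated box with intersection-over-union (IoU). This page does not state how SimVG is evaluated, so the corner box format, the `box_iou` and `is_grounded` names, and the 0.5 threshold are assumptions for illustration, not details taken from the paper.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates in pixels.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


def is_grounded(pred: Box, target: Box, threshold: float = 0.5) -> bool:
    """Hypothetical helper: treat a prediction as correct when IoU >= threshold."""
    return box_iou(pred, target) >= threshold
```

A call such as `is_grounded(predicted_box, annotated_box)` would then mark one expression as correctly localized; averaging that flag over a dataset gives a simple accuracy figure under the assumed threshold.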

Code & Models

Repositories

dmmm1997/simvg · PyTorch · Official

Models

linhuixiao/Awesome-Visual-Grounding · model · ♡ 1

Videos

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion · SlidesLive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Multimodal Machine Learning Applications

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings