An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding

Wei Chen; Long Chen; Yu Wu

arXiv:2408.01120·cs.CV·August 5, 2024

Open Access · 1 Repo · 1 Model

TL;DR

This paper introduces EEVG, a Transformer Decoder-based framework for multi-task visual grounding that significantly reduces computational costs, enabling more efficient processing of complex scenes and long language expressions.

Contribution

It proposes a novel Transformer Decoder-based approach that scales linearly with language length and employs a parameter-free method to eliminate background tokens, improving efficiency.
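
The linear-scaling claim can be illustrated by counting attention scores per layer. The sketch below is not the authors' code. It assumes a standard Transformer decoder layout in which visual tokens act as queries (self-attention among themselves) and language tokens act as the memory for cross-attention; this summary does not confirm that assignment. The function names and token counts are hypothetical.

```python
def encoder_fusion_pairs(num_visual: int, num_language: int) -> int:
    """Attention scores per layer when self-attention runs over the
    concatenated visual + language sequence (encoder-style fusion)."""
    total = num_visual + num_language
    return total * total


def decoder_fusion_pairs(num_visual: int, num_language: int) -> int:
    """Attention scores per layer for decoder-style fusion under the
    assumed layout: self-attention among visual tokens, plus
    cross-attention from visual queries to language tokens."""
    self_attention = num_visual * num_visual
    cross_attention = num_visual * num_language
    return self_attention + cross_attention


if __name__ == "__main__":
    num_visual = 400  # hypothetical, e.g. a 20 x 20 grid of image patches
    for num_language in (20, 40, 80, 160):
        enc = encoder_fusion_pairs(num_visual, num_language)
        dec = decoder_fusion_pairs(num_visual, num_language)
        print(f"language tokens={num_language:4d}  encoder={enc:7d}  decoder={dec:7d}")
```

With the visual token count fixed, each extra language token adds a constant number of scores in the decoder-style count, while the encoder-style count grows by a larger amount with every token added.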

Findings

01. Achieves state-of-the-art results on visual grounding benchmarks.

02. Demonstrates reduced computational costs compared to traditional Transformer encoders.

03. Effective in handling complex scenes with lengthy language expressions.

Abstract

Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on Transformer Decoder to address this issue, which reduces the cost in both language and visual aspects. In the language aspect, we employ the Transformer Decoder to fuse visual and linguistic features, where…

Code & Models

Repositories

chenwei746/eevg (PyTorch, official)

Models

🤗 linhuixiao/Awesome-Visual-Grounding (model)

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Advanced Vision and Imaging

Methods: Linear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections