An Efficient and Effective Transformer Decoder-Based Framework for Multi-Task Visual Grounding
Wei Chen, Long Chen, Yu Wu

TL;DR
This paper introduces EEVG, a Transformer Decoder-based framework for multi-task visual grounding that significantly reduces computational costs, enabling more efficient processing of complex scenes and long language expressions.
Contribution
It proposes a novel Transformer Decoder-based approach that scales linearly with language length and employs a parameter-free method to eliminate background tokens, improving efficiency.
Findings
Achieves state-of-the-art results on visual grounding benchmarks.
Demonstrates reduced computational costs compared to traditional Transformer encoders.
Effective in handling complex scenes with lengthy language expressions.
Abstract
Most advanced visual grounding methods rely on Transformers for visual-linguistic feature fusion. However, these Transformer-based approaches encounter a significant drawback: the computational costs escalate quadratically due to the self-attention mechanism in the Transformer Encoder, particularly when dealing with high-resolution images or long context sentences. This quadratic increase in computational burden restricts the applicability of visual grounding to more intricate scenes, such as conversation-based reasoning segmentation, which involves lengthy language expressions. In this paper, we propose an efficient and effective multi-task visual grounding (EEVG) framework based on Transformer Decoder to address this issue, which reduces the cost in both language and visual aspects. In the language aspect, we employ the Transformer Decoder to fuse visual and linguistic features, where…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGaze Tracking and Assistive Technology · Visual Attention and Saliency Detection · Advanced Vision and Imaging
MethodsLinear Layer · Residual Connection · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
