ScanFormer: Referring Expression Comprehension by Iteratively Scanning

Wei Su; Peihan Miao; Huanzhang Dou; Xi Li

arXiv:2406.18048·cs.CV·June 27, 2024

TL;DR

ScanFormer introduces an iterative, coarse-to-fine approach to referring expression comprehension that selectively focuses on linguistically relevant image regions, reducing computational overhead while maintaining accuracy.

Contribution

The paper proposes ScanFormer, a coarse-to-fine framework that iteratively discards visual patches irrelevant to the referring expression, reducing the computation spent on referring expression comprehension.

Findings

01. Effective reduction of computational overhead.

02. Maintains competitive accuracy on standard datasets.

03. Balances efficiency and performance in vision-language tasks.

Abstract

Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework, called ScanFormer. It can iteratively exploit the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration, irrelevant patches are…
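
The coarse-to-fine scan described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract is truncated before it explains how irrelevant patches are handled, so the 2x2 refinement rule, the function `scan_coarse_to_fine`, and the stand-in selector `toy_relevance` are all assumptions made for illustration. The sketch only shows how discarding patches at a coarse scale shrinks the number of patches visited at finer scales.

```python
from typing import Callable, List, Tuple

Patch = Tuple[int, int, int]  # (level, row, col); level 0 is the coarsest scale


def scan_coarse_to_fine(
    coarse_grid: int,
    num_levels: int,
    is_relevant: Callable[[Patch], bool],
) -> List[List[Patch]]:
    """Return, per pyramid level, the patches that are actually processed."""
    # Every patch of the coarsest level is processed.
    active = [(0, r, c) for r in range(coarse_grid) for c in range(coarse_grid)]
    processed = [active]
    for level in range(1, num_levels):
        next_active = []
        for patch in active:
            if not is_relevant(patch):
                continue  # discarded: nothing finer is extracted from this region
            _, r, c = patch
            # Assumed rule: a kept patch is refined into 2x2 children one scale down.
            for dr in (0, 1):
                for dc in (0, 1):
                    next_active.append((level, 2 * r + dr, 2 * c + dc))
        active = next_active
        processed.append(active)
    return processed


def toy_relevance(patch: Patch) -> bool:
    """Hypothetical stand-in for a learned, language-conditioned selector."""
    level, r, c = patch
    # Pretend the referred object lies in the top-left quadrant of the image.
    return r < 2 ** level and c < 2 ** level


levels = scan_coarse_to_fine(coarse_grid=2, num_levels=3, is_relevant=toy_relevance)
sparse_count = sum(len(patches) for patches in levels)
# Dense baseline: every patch at every scale of the same pyramid.
dense_count = sum((2 * 2 ** level) ** 2 for level in range(3))
print(sparse_count, dense_count)
```

In this toy setting the scan visits 24 patches across three scales, against 84 for dense processing of the same pyramid. The numbers depend entirely on the made-up selector and say nothing about the savings reported in the paper.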

Taxonomy

Topics: Topic Modeling

Methods: Focus