ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Wei Su, Peihan Miao, Huanzhang Dou, Xi Li

TL;DR
ScanFormer introduces an iterative, coarse-to-fine approach to referring expression comprehension that selectively focuses on linguistically relevant image regions, reducing computational overhead while maintaining accuracy.
Contribution
The paper proposes ScanFormer, a framework that iteratively discards visual patches irrelevant to the referring expression, improving the efficiency of referring expression comprehension.
Findings
Reduces computational overhead by discarding image regions unrelated to the expression.
Maintains competitive accuracy on standard datasets.
Balances efficiency and accuracy in referring expression comprehension.
Abstract
Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework, called ScanFormer. It can iteratively exploit the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration, irrelevant patches are…
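The coarse-to-fine idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `scan`, `children`, and `overlaps_target` are hypothetical, and a toy geometric overlap test stands in for the learned, language-conditioned relevance decision. The sketch only shows how refining relevant patches alone shrinks the number of patches processed.

```python
from typing import Callable, List, Tuple

# (level, row, col); level 0 is the coarsest scale of the pyramid.
Patch = Tuple[int, int, int]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def children(patch: Patch) -> List[Patch]:
    """Return the four patches one level finer that cover the same area."""
    level, r, c = patch
    return [(level + 1, 2 * r + dr, 2 * c + dc) for dr in (0, 1) for dc in (0, 1)]


def scan(grid: int, levels: int,
         is_relevant: Callable[[Patch], bool]) -> Tuple[List[Patch], int]:
    """Scan a pyramid from coarse to fine, refining only relevant patches.

    Returns the patches kept at the finest level and the total number of
    patches that were examined across all levels.
    """
    active = [(0, r, c) for r in range(grid) for c in range(grid)]
    processed = 0
    for level in range(levels):
        processed += len(active)
        kept = [p for p in active if is_relevant(p)]  # discard irrelevant patches
        if level == levels - 1:
            return kept, processed
        active = [child for p in kept for child in children(p)]
    return [], processed


def overlaps_target(target: Box, grid: int, patch: Patch) -> bool:
    """Toy relevance test (an assumption): does the patch overlap a known box?"""
    level, r, c = patch
    n = grid * 2 ** level  # patches per side at this level
    x0, y0, x1, y1 = target
    return c / n < x1 and x0 < (c + 1) / n and r / n < y1 and y0 < (r + 1) / n


target = (0.0, 0.0, 0.25, 0.25)
kept, processed = scan(2, 3, lambda p: overlaps_target(target, 2, p))
dense = (2 * 2 ** 2) ** 2  # patches a dense model would process at the finest level
print(len(kept), processed, dense)
```

In this toy setting the scan examines 12 patches in total, while dense perception at the finest scale alone would cover 64. The actual savings in ScanFormer depend on how its model selects patches, which this sketch does not reproduce.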
Taxonomy
Topics: Topic Modeling
Methods: Focus
