LookWhere? Efficient Visual Recognition by Learning Where to Look and What to See from Self-Supervision

Anthony Fuller; Yousef Yassin; Junfeng Wen; Daniel G. Kyrollos; Tarek Ibrahim; James R. Green; Evan Shelhamer

arXiv:2505.18051·cs.CV·February 6, 2026

TL;DR

LookWhere learns to focus computation on the important regions of an image, cutting processing cost while maintaining or improving recognition accuracy across tasks.

Contribution

The paper introduces a self-supervised approach that jointly trains a low-resolution selector and a high-resolution extractor for adaptive computation, so that only selected high-resolution image regions are processed and no task-specific supervision is needed.

Findings

01 Reduces FLOPs by up to 34x in high-resolution recognition

02 Maintains accuracy while reducing processing time in various tasks

03 Outperforms prior token reduction and selection methods

Abstract

Vision transformers are ever larger, more accurate, and more expensive to compute. The expense is even more extreme at high resolution as the number of tokens grows quadratically with the image size. We turn to adaptive computation to cope with this cost by learning to predict where to compute. Our LookWhere method divides the computation between a low-resolution selector and a high-resolution extractor without ever processing the full high-resolution input. We jointly pretrain the selector and extractor without task supervision by distillation from a self-supervised teacher, in effect, learning where and what to compute simultaneously. Unlike prior token reduction methods, which pay to save by pruning already-computed tokens, and prior token selection methods, which require complex and expensive per-task optimization, LookWhere economically and accurately selects and extracts…
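To make the cost argument concrete, the following is a minimal sketch of how the token count scales with the side length of a square image, and of what keeping only a few high-resolution patches might look like. The patch size of 16, the function names `num_tokens` and `select_top_k`, the score values, and the use of plain top-k selection are all assumptions made for illustration; the abstract does not specify how LookWhere's selector chooses regions.

```python
def num_tokens(image_side: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches (tokens) in a square image."""
    per_side = image_side // patch_size
    return per_side * per_side


def select_top_k(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring patches, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Doubling the side length quadruples the token count.
low_res_tokens = num_tokens(224)    # 14 * 14 = 196
high_res_tokens = num_tokens(448)   # 28 * 28 = 784

# Hypothetical selector output: one importance score per high-resolution patch.
scores = [0.1, 0.9, 0.3, 0.8, 0.05, 0.7]
kept = select_top_k(scores, k=3)    # [1, 3, 5]
```

Under these assumptions, an extractor that sees only the `kept` patches processes 3 tokens instead of 6, and the saving grows with resolution because the full token count grows with the square of the side length.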
