LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute
Ali Salamatian, Anthony Fuller, Pritam Sarkar, James R. Green, Leonid Sigal, Evan Shelhamer

TL;DR
LookWhen is a framework that improves video recognition efficiency by learning when, where, and what to compute, reducing redundant processing while maintaining high accuracy across multiple datasets.
Contribution
It introduces a novel selector-extractor approach with effective supervision strategies for efficient video recognition, outperforming existing models in accuracy and speed.
Findings
LookWhen achieves a better accuracy-computation trade-off than similar models.
It Pareto-dominates on accuracy versus FLOPs in 9 of 12 cases.
It is 6.7x faster than InternVideo2-B at equal accuracy.
Abstract
Transformers dominate video recognition. They split videos into tokens, and processing them has expensive superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector-extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image…
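To make the selection idea concrete, here is a minimal sketch of ranking tokens by uniqueness with a nearest-neighbor distance and keeping the top-K. It is an illustration under stated assumptions, not the authors' implementation: the function names `uniqueness_scores` and `select_top_k`, the use of Euclidean distance, and the toy 2-D "representations" are all hypothetical choices, and the abstract does not specify which representations or distance the paper actually uses.

```python
import numpy as np


def uniqueness_scores(tokens: np.ndarray) -> np.ndarray:
    """Score each token by the distance to its nearest neighbor.

    tokens: (N, D) array of token representations (assumed layout).
    Returns an (N,) array; a larger value means a more unique token.
    Euclidean distance is an assumption made for this sketch.
    """
    sq_norms = np.sum(tokens ** 2, axis=1)
    # Pairwise squared distances via ||a||^2 + ||b||^2 - 2 a.b, shape (N, N).
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (tokens @ tokens.T)
    d2 = np.maximum(d2, 0.0)      # clamp tiny negatives caused by rounding
    np.fill_diagonal(d2, np.inf)  # a token is not its own neighbor
    return np.sqrt(d2.min(axis=1))


def select_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring tokens, in index order."""
    if not 1 <= k <= scores.shape[0]:
        raise ValueError("k must be between 1 and the number of tokens")
    # argpartition places the k largest scores (k smallest negated) first.
    top = np.argpartition(-scores, k - 1)[:k]
    return np.sort(top)  # keep the original space-time ordering


# Toy example: three near-duplicate tokens and two outliers.
tokens = np.array([
    [0.0, 0.0],
    [0.0, 0.1],
    [5.0, 5.0],
    [0.05, 0.0],
    [-4.0, 3.0],
])
scores = uniqueness_scores(tokens)
kept = select_top_k(scores, k=2)
```

In this toy case the two outlier tokens are kept and the three near-duplicates are dropped, which mirrors the intuition that redundant tokens need not be passed to the deep extractor. The dense pairwise distance matrix costs O(N²) memory, so a real pipeline would likely need a cheaper or approximate neighbor search.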
