Test-Time Canonicalization by Foundation Models for Robust Perception
Utkarsh Singhal, Ryan Feng, Stella X. Yu, Atul Prakash

TL;DR
FOCAL is a test-time framework that improves perception robustness by transforming each input into its most typical view, as scored by foundation model priors. It needs no retraining or architectural changes and handles 2D and 3D rotations, contrast and lighting shifts, and day-night changes.
Contribution
Introduces FOCAL, a novel test-time optimization method inspired by mental rotation, improving robustness across diverse viewing conditions without retraining.
Findings
Significantly improves robustness of models like CLIP and SAM.
Effective against 2D and 3D rotations, contrast and lighting shifts, and day-night changes.
No retraining or architectural modifications needed.
Abstract
Perception in the real world requires robustness to diverse viewing conditions. Existing approaches often rely on specialized architectures or training with predefined data augmentations, limiting adaptability. Taking inspiration from mental rotation in human vision, we propose FOCAL, a test-time robustness framework that transforms the input into the most typical view. At inference time, FOCAL explores a set of transformed images and chooses the one with the highest likelihood under foundation model priors. This test-time optimization boosts robustness while requiring no retraining or architectural changes. Applied to models like CLIP and SAM, it significantly boosts robustness across a wide range of transformations, including 2D and 3D rotations, contrast and lighting shifts, and day-night changes. We also explore potential applications in active vision. By reframing invariance as a…
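The selection step described in the abstract (explore a set of transformed images, keep the one with the highest likelihood) can be illustrated with a minimal sketch. This is not the authors' implementation: `canonicalize`, `rotation_candidates`, and `toy_score` are hypothetical names, the candidate set is limited to four quarter-turn rotations, and `toy_score` is a hand-written stand-in for a foundation model prior that simply prefers images brighter at the top than at the bottom.

```python
import numpy as np


def canonicalize(image, transforms, score_fn):
    """Apply every candidate transform and keep the highest-scoring view.

    score_fn is a placeholder for a likelihood under a foundation model
    prior; here it is any callable mapping an image array to a float.
    """
    candidates = [t(image) for t in transforms]
    scores = np.array([score_fn(c) for c in candidates], dtype=float)
    best = int(np.argmax(scores))
    return candidates[best], best, scores


def rotation_candidates():
    # Hypothetical candidate set: rotations by 0, 90, 180, 270 degrees.
    return [lambda x, k=k: np.rot90(x, k) for k in range(4)]


def toy_score(view):
    # Assumed toy prior: "typical" images are brighter in the top half.
    half = view.shape[0] // 2
    return float(view[:half].mean() - view[half:].mean())


# An 8x8 image whose top half is bright, then turned by 90 degrees.
upright = np.vstack([np.ones((4, 8)), np.zeros((4, 8))])
rotated_input = np.rot90(upright, 1)

view, idx, scores = canonicalize(rotated_input, rotation_candidates(), toy_score)
```

In this toy setup the search recovers the upright image from the rotated input, and the canonical `view` could then be passed unchanged to a downstream model. A real system would replace `toy_score` with a score derived from a pretrained model and would search a richer transformation space.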
Taxonomy
Topics: Advanced Vision and Imaging · Industrial Vision Systems and Defect Detection · Image Processing Techniques and Applications
Methods: Segment Anything Model · Contrastive Language-Image Pre-training
