TL;DR
This paper systematically studies design choices for visual Test-Time Training (TTT), introduces the ViT$^3$ model with linear complexity, and demonstrates its effectiveness across various visual tasks.
Contribution
It provides empirical insights and guidelines for designing effective visual TTT models, culminating in the ViT$^3$ architecture, which matches or outperforms existing linear-complexity models.
Findings
ViT$^3$ achieves linear complexity and parallelizable computation.
ViT$^3$ matches or outperforms existing linear models on multiple tasks.
The study offers practical design principles for visual TTT models.
Abstract
Test-Time Training (TTT) has recently emerged as a promising direction for efficient sequence modeling. TTT reformulates the attention operation as an online learning problem, constructing a compact inner model from key-value pairs at test time. This reformulation opens a rich and flexible design space while achieving linear computational complexity. However, crafting a powerful visual TTT design remains challenging: fundamental choices for the inner module and inner training lack comprehensive understanding and practical guidelines. To bridge this critical gap, in this paper, we present a systematic empirical study of TTT designs for visual sequence modeling. From a series of experiments and analyses, we distill six practical insights that establish design principles for effective visual TTT and illuminate paths for future improvement. These findings culminate in the Vision Test-Time…
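The reformulation described in the abstract (fit a compact inner model to key-value pairs, then apply it to queries) can be illustrated with a minimal sketch. This is not the ViT$^3$ design: it assumes the simplest possible inner model (a single linear map `W`), a squared-error inner loss, zero initialization, and full-batch gradient descent. The names `matvec` and `ttt_linear_layer` and the parameters `lr` and `steps` are hypothetical, chosen only for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_linear_layer(keys, values, queries, lr=1.0, steps=1):
    """Toy TTT layer (illustrative assumption, not the paper's module).

    Inner model:  f_W(k) = W k, with W initialized to zero.
    Inner loss:   0.5 * sum_i ||W k_i - v_i||^2 over all key-value pairs.
    Inner training: `steps` full-batch gradient steps with step size `lr`.
    Output:       the trained inner model applied to each query.
    """
    d_in, d_out = len(keys[0]), len(values[0])
    W = [[0.0] * d_in for _ in range(d_out)]

    for _ in range(steps):
        # Gradient of the inner loss: sum_i (W k_i - v_i) k_i^T.
        grad = [[0.0] * d_in for _ in range(d_out)]
        for k, v in zip(keys, values):
            err = [p - t for p, t in zip(matvec(W, k), v)]
            for i in range(d_out):
                for j in range(d_in):
                    grad[i][j] += err[i] * k[j]
        for i in range(d_out):
            for j in range(d_in):
                W[i][j] -= lr * grad[i][j]

    # Each query is processed by the same compact inner model.
    return [matvec(W, q) for q in queries]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 3.0], [4.0, 5.0]]
queries = [[1.0, 0.0], [1.0, 1.0]]
outputs = ttt_linear_layer(keys, values, queries)
```

Under these assumptions, one step from zero initialization gives W = lr * sum_i v_i k_i^T, so the output for a query q is lr * sum_i v_i (k_i · q), which is unnormalized linear attention. The work per inner step grows linearly with the number of key-value pairs, and the inner model has a fixed size regardless of sequence length. Choices such as a deeper inner module, a different inner loss, or a different inner optimizer are exactly the kind of design decisions the abstract says the study examines.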
