Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training
Quan Kong, Yanru Xiao, Yuhao Shen, Cong Wang

TL;DR
Vision-TTT adapts Test-Time Training (TTT), a linear-time sequence modeling method, to vision: it compresses visual token sequences through self-supervised learning and models 2D visual correlations with global receptive fields, avoiding the quadratic cost of self-attention.
Contribution
The paper proposes Vision-TTT, which extends the linear-time Test-Time Training (TTT) sequence model to vision by means of a dual-dataset strategy and Conv2d-based dataset preprocessing.
Findings
Achieves over 77% Top-1 accuracy on ImageNet with ViT models.
Reduces FLOPs by 79.4% and runs 4.72 times faster at high resolution.
Outperforms baseline models on downstream tasks.
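The efficiency finding rests on how cost grows with resolution. The following is a rough sketch of that scaling only, not the paper's FLOP accounting: the patch size of 16, the embedding width of 192, and the leading-order cost formulas are illustrative assumptions, and the numbers it prints are not expected to reproduce the 79.4% figure.

```python
def num_tokens(image_size, patch_size=16):
    # Non-overlapping square patches (patch_size is an assumed value).
    return (image_size // patch_size) ** 2

def attention_cost(n, d):
    # Leading-order multiply-adds: the n x n score matrix plus the
    # attention-weighted sum of values, each about n * n * d.
    return 2 * n * n * d

def linear_scan_cost(n, d):
    # Leading-order multiply-adds for a linear-time scan that keeps a
    # d x d state: one state update and one readout per token.
    return 2 * n * d * d

for size in (224, 448, 896):
    n = num_tokens(size)
    ratio = attention_cost(n, 192) / linear_scan_cost(n, 192)
    print(f"{size}px: {n} tokens, attention/linear cost ratio {ratio:.2f}")
```

Under these assumptions, doubling the image side quadruples the token count, so the attention term grows about 16 times while the linear-scan term grows about 4 times. This is why the gap widens at high resolution.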
Abstract
Learning efficient and expressive visual representations has long been the pursuit of computer vision research. While Vision Transformers (ViTs) gradually replace traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address the challenge, we introduce a new linear-time sequence modeling method, Test-Time Training (TTT), into vision and propose Vision-TTT, which treats visual sequences as datasets and compresses the visual token sequences in a novel self-supervised learning manner. By incorporating the dual-dataset strategy and Conv2d-based dataset preprocessing, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that Vittt-T/S/B achieve Top-1 accuracy on…
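The idea of treating a token sequence as a dataset can be made concrete with a minimal sketch of a vanilla TTT-style scan with a linear inner model. It is an illustration under stated assumptions, not the authors' implementation: the function name, the projection matrices `theta_k`, `theta_v`, `theta_q`, the squared reconstruction loss, and the plain gradient step are choices made for the example, and the dual-dataset strategy and Conv2d-based preprocessing of Vision-TTT are not modeled.

```python
import numpy as np

def ttt_linear_scan(tokens, theta_k, theta_v, theta_q, lr=0.1):
    """Hypothetical sketch: one gradient step per token on an inner linear model.

    The hidden state is the weight matrix W. Each token serves as a training
    example for the self-supervised loss ||W k - v||^2, then W is read out
    with a third view q. Cost is linear in the number of tokens.
    """
    n, _ = tokens.shape
    d_h = theta_k.shape[0]
    W = np.zeros((d_h, d_h))          # inner-model weights (the "compressed" sequence)
    outputs = np.empty((n, d_h))
    for t in range(n):
        x = tokens[t]
        k = theta_k @ x               # training view of the token
        v = theta_v @ x               # reconstruction target
        q = theta_q @ x               # test view used for the output
        err = W @ k - v
        grad = 2.0 * np.outer(err, k) # gradient of ||W k - v||^2 with respect to W
        W = W - lr * grad             # one inner-loop update
        outputs[t] = W @ q            # output from the updated state
    return outputs, W

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 tokens of width 8 (toy sizes)
eye = np.eye(8)                        # identity projections, for illustration only
out, W = ttt_linear_scan(tokens, eye, eye, eye, lr=0.01)
print(out.shape)  # (16, 8)
```

Each token is visited once and the state has a fixed size, so the work per token does not depend on sequence length. In a real layer the projections would be learned in the outer training loop rather than fixed to the identity.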
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
