Vision-TTT: Efficient and Expressive Visual Representation Learning with Test-Time Training

Quan Kong; Yanru Xiao; Yuhao Shen; Cong Wang

arXiv:2603.00518 · cs.CV · March 23, 2026

Open Access

TL;DR

Vision-TTT brings Test-Time Training (TTT), a linear-time sequence modeling method, to vision. It compresses visual token sequences in a self-supervised manner and models 2D visual correlations with global receptive fields, avoiding the quadratic complexity of self-attention in Vision Transformers.

Contribution

It proposes Vision-TTT, which extends vanilla test-time training, a linear-time sequence modeling approach, to 2D visual data through a dual-dataset strategy and Conv2d-based dataset preprocessing.

Findings

01

Vittt-T/S/B achieve 77.7%, 81.8%, and 82.7% Top-1 accuracy on ImageNet.

02

Reduces FLOPs by 79.4% and runs 4.72 times faster at high resolution.

03

Outperforms baseline models on downstream tasks.

Abstract

Learning efficient and expressive visual representations has long been a pursuit of computer vision research. While Vision Transformers (ViTs) are gradually replacing traditional Convolutional Neural Networks (CNNs) as more scalable vision learners, their applications are plagued by the quadratic complexity of the self-attention mechanism. To address this challenge, we introduce a new linear-time sequence modeling method, Test-Time Training (TTT), into vision and propose Vision-TTT, which treats visual sequences as datasets and compresses the visual token sequences in a novel self-supervised learning manner. By incorporating the dual-dataset strategy and Conv2d-based dataset preprocessing, Vision-TTT effectively extends vanilla TTT to model 2D visual correlations with global receptive fields. Extensive experiments show that Vittt-T/S/B achieve 77.7%, 81.8%, 82.7% Top-1 accuracy on…
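To make the phrase "treats visual sequences as datasets" more concrete, here is a minimal sketch of a vanilla TTT-style scan, assuming the simplest possible setup: a linear inner model whose weights act as the hidden state and are updated by one gradient step per token on a self-supervised reconstruction loss. This is an illustration only, not the Vision-TTT implementation; it omits the dual-dataset strategy and the Conv2d-based preprocessing named in the abstract. The function name `ttt_linear_scan`, the projection names `theta_k`, `theta_v`, `theta_q`, and the learning rate are hypothetical.

```python
import numpy as np


def ttt_linear_scan(tokens, theta_k, theta_v, theta_q, lr=0.1):
    """Sketch of a vanilla TTT scan with a linear inner model f(x; W) = W @ x.

    tokens:  (n, d) array, one row per token (e.g. flattened image patches).
    theta_*: (d, d) projection matrices; hypothetical names, assumed to be
             learned in an outer training loop that is not shown here.
    Returns an (n, d) array of outputs.

    Each token costs O(d^2), so the whole scan is O(n * d^2): linear in the
    sequence length n, unlike the O(n^2) pairwise cost of self-attention.
    """
    n, d = tokens.shape
    W = np.zeros((d, d))           # hidden state: weights of the inner model
    outputs = np.empty((n, d))
    for t in range(n):
        x = tokens[t]
        k = theta_k @ x            # "training input" view of the token
        v = theta_v @ x            # "training target" view of the token
        q = theta_q @ x            # "test input" view of the token
        # Self-supervised loss l(W) = ||W k - v||^2, gradient 2 (W k - v) k^T.
        err = W @ k - v
        W = W - lr * 2.0 * np.outer(err, k)   # one gradient step per token
        outputs[t] = W @ q         # read out with the updated weights
    return outputs
```

Calling `ttt_linear_scan(patches, theta_k, theta_v, theta_q)` on an `(n, d)` array of patch embeddings returns an `(n, d)` array. Because the scan is causal and one-dimensional, a token only sees tokens that precede it in the flattening order, which is presumably the limitation the paper's 2D extensions are meant to address.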

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning