Linearizing Vision Transformer with Test-Time Training
Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang, Gao Huang

TL;DR
This paper linearizes pretrained Vision Transformers with Test-Time Training (TTT), a linear-complexity architecture that can directly inherit pretrained attention weights, giving faster inference while maintaining quality.
Contribution
The paper combines architectural alignment and representational alignment to linearize Transformers via TTT, and validates the approach on Stable Diffusion models.
Findings
Achieves image quality comparable to the pretrained Softmax-attention model, with faster inference.
Enables direct inheritance of pretrained weights.
Requires only 1 hour of fine-tuning on GPUs.
Abstract
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight…
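The key shift-invariance mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation. The identity feature map in `linear_attention` and the definition of `key_instance_norm` (standardizing each key channel across the tokens of one sample) are assumptions made for illustration, and all function names are hypothetical. The sketch checks that adding the same offset to every key leaves Softmax attention unchanged, changes plain linear attention, and is cancelled again once the keys are normalized per instance.

```python
import numpy as np

def softmax_attention(q, k, v):
    # Standard scaled dot-product attention; cost is quadratic in the token count.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Unnormalized linear attention with an identity feature map (an assumption).
    # Computing k.T @ v first keeps the cost linear in the token count.
    return q @ (k.T @ v)

def key_instance_norm(k, eps=1e-6):
    # Hypothetical reading of "key instance normalization": standardize each
    # key channel across the tokens of a single sample.
    mean = k.mean(axis=0, keepdims=True)
    std = k.std(axis=0, keepdims=True)
    return (k - mean) / (std + eps)

rng = np.random.default_rng(0)
n_tokens, dim = 16, 8
q = rng.standard_normal((n_tokens, dim))
k = rng.standard_normal((n_tokens, dim))
v = rng.standard_normal((n_tokens, dim))
shift = rng.standard_normal((1, dim))  # the same offset added to every key

# Softmax attention: the shift adds a per-query constant to the scores,
# which the row-wise softmax cancels.
softmax_invariant = np.allclose(
    softmax_attention(q, k, v), softmax_attention(q, k + shift, v)
)

# Plain linear attention: the shift changes k.T @ v, so the output changes.
linear_invariant = np.allclose(
    linear_attention(q, k, v), linear_attention(q, k + shift, v)
)

# Linear attention on instance-normalized keys: the shift is removed with the mean.
normed_invariant = np.allclose(
    linear_attention(q, key_instance_norm(k), v),
    linear_attention(q, key_instance_norm(k + shift), v),
)

print(softmax_invariant, linear_invariant, normed_invariant)
```

The Softmax case is invariant because a shared key offset adds the same constant to every score in a row, and softmax ignores row-wise constants. Linear attention has no such normalization over keys, which is one way to read the representational gap the abstract describes.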
