Test-Time Training Done Right

Tianyuan Zhang; Sai Bi; Yicong Hong; Kai Zhang; Fujun Luan; Songlin Yang; Kalyan Sunkavalli; William T. Freeman; Hao Tan

arXiv:2505.23884 · cs.LG · June 2, 2025

TL;DR

This paper introduces Large Chunk Test-Time Training (LaCT), a scalable approach that significantly enhances long-context modeling across various modalities by using large batch updates, improving hardware utilization and state capacity.

Contribution

We propose LaCT, a novel test-time training method that employs large chunk updates to improve efficiency and scalability for long-context data across multiple modalities.

Findings

01. Enables training with sequences up to 1 million tokens.

02. Scales to 14 billion parameter models for video diffusion.

03. Achieves high hardware utilization with large batch updates.
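
To make the hardware-utilization finding concrete, the sketch below estimates how many floating-point operations a fast-weight update performs per byte of memory traffic as the chunk grows. It is a rough back-of-the-envelope model, not a measurement from the paper: the function name, the choice of a single d x d linear fast weight updated with one matrix product, and the 2-byte element size are all assumptions made for illustration.

```python
def update_arithmetic_intensity(chunk_size, dim, bytes_per_element=2):
    """Rough FLOPs-per-byte estimate for one fast-weight update (hypothetical model).

    Assumes the update is a single (dim x chunk_size) @ (chunk_size x dim)
    matrix product that reads two chunk_size x dim inputs and reads and
    writes one dim x dim weight matrix.
    """
    # One multiply and one add per (token, row, column) triple.
    flops = 2 * chunk_size * dim * dim
    # Two input matrices, plus one read and one write of the weight matrix.
    elements_moved = 2 * chunk_size * dim + 2 * dim * dim
    return flops / (bytes_per_element * elements_moved)


small = update_arithmetic_intensity(chunk_size=16, dim=1024)
large = update_arithmetic_intensity(chunk_size=2048, dim=1024)
print(f"chunk of 16 tokens:   {small:.1f} FLOPs per byte")
print(f"chunk of 2048 tokens: {large:.1f} FLOPs per byte")
```

Under these assumptions a 16-token update performs roughly 7.9 FLOPs per byte moved, while a 2048-token update performs roughly 341, which is one way to see why larger chunks can keep a GPU's compute units busier.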

Abstract

Test-Time Training (TTT) models context dependencies by adapting part of the model's weights (referred to as fast weights) during inference. This fast weight, akin to recurrent states in RNNs, stores temporary memories of past tokens in the current sequence. Existing TTT methods struggled to show effectiveness in handling long-context data, due to their inefficiency on modern GPUs. The TTT layers in many of these approaches operate with extremely low FLOPs utilization (often <5%) because they deliberately apply small online minibatch sizes (e.g., updating fast weights every 16 or 64 tokens). Moreover, a small minibatch implies fine-grained block-wise causal dependencies in the data, unsuitable for data beyond 1D ordered sequences, like sets or N-dimensional grids such as images or videos. In contrast, we pursue the opposite direction by using an extremely large chunk update, ranging…
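
The chunk-wise update described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single linear fast weight `W`, a squared-error loss between `keys @ W` and `values`, and one plain gradient step per chunk. The function name `chunked_ttt`, the learning rate, and the read-before-update ordering are all hypothetical choices made for this example.

```python
import numpy as np


def chunked_ttt(keys, values, queries, chunk_size, lr=0.1):
    """Read from, then update, a linear fast weight once per chunk (illustrative only)."""
    n, d_in = keys.shape
    d_out = values.shape[1]
    W = np.zeros((d_in, d_out))          # fast weight, reset for every sequence
    outputs = np.empty((n, d_out))
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        k, v, q = keys[start:end], values[start:end], queries[start:end]
        # Read: every token in the chunk sees only memory from earlier chunks.
        outputs[start:end] = q @ W
        # Update: one gradient step on 0.5 * mean ||k W - v||^2 over the chunk.
        grad = k.T @ (k @ W - v) / (end - start)
        W = W - lr * grad
    return outputs, W


rng = np.random.default_rng(0)
K = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 3))
Q = rng.normal(size=(8, 4))
out, W = chunked_ttt(K, V, Q, chunk_size=4)
print(out.shape, W.shape)
```

With `chunk_size=16` the loop performs many small updates, each with little work to parallelize; with a chunk of thousands of tokens the same loop performs a few large matrix products. Because all tokens inside a chunk read the same `W`, no ordering is imposed within the chunk, which is what makes large chunks a natural fit for sets or grids of tokens.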

Taxonomy

Topics: Human Resource Development and Performance Evaluation

Methods: Diffusion