TNT: Improving Chunkwise Training for Test-Time Memorization
Zeman Li, Ali Behrouz, Yuan Deng, Peilin Zhong, Praneeth Kacham, Mahdi Karami, Meisam Razaviyayn, Vahab Mirrokni

TL;DR
TNT introduces a two-stage training paradigm for RNNs with deep test-time memorization modules. By decoupling training efficiency from inference performance, it significantly accelerates training and improves accuracy.
Contribution
The paper proposes TNT, a novel training method that separates training efficiency from inference accuracy, enabling faster training and better performance of RNNs with memorization modules.
Findings
Training speed increased up to 17 times
Model accuracy improved with TNT training
Decoupling training efficiency from inference performance
Abstract
Recurrent neural networks (RNNs) with deep test-time memorization modules, such as Titans and TTT, represent a promising, linearly-scaling paradigm distinct from Transformers. While these expressive models do not yet match the peak performance of state-of-the-art Transformers, their potential has been largely untapped due to prohibitively slow training and low hardware utilization. Existing parallelization methods force a fundamental conflict governed by the chunksize hyperparameter: large chunks boost speed but degrade performance, necessitating a fixed, suboptimal compromise. To solve this challenge, we introduce TNT, a novel training paradigm that decouples training efficiency from inference performance through a two-stage process. Stage one is an efficiency-focused pre-training phase utilizing a hierarchical memory. A global module processes large, hardware-friendly chunks for…
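The chunk-size trade-off described in the abstract can be made concrete with a little arithmetic. The sketch below is an illustration only, not the authors' implementation: it assumes a chunkwise scheme in which tokens inside a chunk are processed in parallel while chunks are processed one after another, so the number of sequential memory updates is the sequence length divided by the chunk size, rounded up. The function names and the example sizes (8192, 16, 64, 2048) are hypothetical.

```python
import math


def chunk_bounds(seq_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, seq_len) into consecutive half-open chunks of at most chunk_size tokens."""
    return [(start, min(start + chunk_size, seq_len))
            for start in range(0, seq_len, chunk_size)]


def sequential_steps(seq_len: int, chunk_size: int) -> int:
    """Number of chunk updates that must run one after another (assumed model)."""
    return math.ceil(seq_len / chunk_size)


if __name__ == "__main__":
    seq_len = 8192  # hypothetical sequence length
    for chunk_size in (16, 64, 2048):  # hypothetical chunk sizes
        steps = sequential_steps(seq_len, chunk_size)
        print(f"chunk_size={chunk_size:5d} -> {steps} sequential steps")
    # chunk_size=   16 -> 512 sequential steps
    # chunk_size=   64 -> 128 sequential steps
    # chunk_size= 2048 -> 4 sequential steps
```

Under this assumed model, larger chunks leave fewer sequential steps and more parallel work per step, which is the hardware-friendly regime. The abstract's point is that existing methods pay for that speed with degraded performance.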
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper shows strong originality by proposing the two-stage TNT paradigm, which decouples the training efficiency of deep memory modules from their inference performance and so breaks the balancing bottleneck in existing work. Its hierarchical memory (with locally periodic resets) and Q-K Projection are targeted innovations that address the parallelization of non-linear modules and the mismatch between memory compression and retrieval.
2. It is technically rigorous and has thorough experiments, defining core mechanisms via clear formulas fo
1. The paper only tests up to 4 local memory modules and does not analyze performance saturation points or optimal chunk size selection for multi-local configurations, leaving gaps in guiding practical hierarchical memory setup.
2. TNT lacks custom kernel optimization, and the speed comparison with optimized Transformer baselines (e.g., Gated Transformer with FlashAttention) is unfair due to hardware optimization mismatch, failing to highlight inherent efficiency advantages.
3. The paper does no
- The paper is well structured and easy to follow.
- The proposed two-stage training method effectively balances performance and training efficiency.
- Extensive experiments across different model architectures demonstrate the method’s effectiveness and robustness.
- There is no hyperparameter study on $C_G$. How does the global chunk size influence performance?
- TNT only outperforms the FlashAttention method at a 32K sequence length. Is this due to an under-optimized kernel or the limitation of maintaining additional memory modules?
- What are the sizes of the global and local memory modules? Since TNT introduces additional parameters for the global memory module, comparable parameter sizes should be used for Titans when comparing performance.
The work could bring some advantages:
1) Stage 1 of the method can increase training throughput by introducing a novel hierarchical memory architecture that enables unprecedented parallelism. Through the framework overview figure and the TNT Memory Compression Rule, the authors explain their method clearly.
2) Stage 2 can bridge the gap between the large chunk sizes required for efficient training and the small chunk sizes that yield the best performance at inference.
3) By the experiment
1) The presentation of the paper needs further improvement. For example, in the main experimental results, I cannot understand the meaning of the columns in Table 2.
2) The datasets used in the paper are not clearly described, which I found confusing.
3) The four baselines do not seem sufficient to support the efficacy of the method.
4) I think Table 1 shows the effectiveness of the method, but the other results in the paper do not show clearly better performance.
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices
