Absorber LLM: Harnessing Causal Synchronization for Test-Time Training
Zhixin Zhang, Shabo Zhang, Chengcan Wu, Zeming Wei, Meng Sun

TL;DR
Absorber LLM introduces a self-supervised causal synchronization method that enables long-context retention in transformers, reducing memory use and improving accuracy in long-stream inference.
Contribution
It proposes retaining long context by absorbing historical context into model parameters through causal synchronization, addressing the overfitting of earlier parameter-as-memory methods such as TTT and the memory growth of self-attention over long streams.
Findings
Reduces inference memory in long-context tasks.
Improves accuracy over prior parameter-as-memory methods.
Effective on long-context and streaming benchmarks.
Abstract
The computational cost of self-attention in Transformers grows with sequence length, so inference over long streams becomes prohibitive in memory. Constant-memory alternatives such as RNNs and SSMs compress history into fixed-size states and thus lose long-tail dependencies, while methods that memorize contexts into parameters, such as Test-Time Training (TTT), are prone to overfitting token-level projection and fail to preserve the causal effect of context in pretrained LLMs. We propose Absorber LLM, which formulates long-context retention as a self-supervised causal synchronization: after absorbing historical contexts into parameters, a contextless model should match the original model with full context on future generations. We optimize this objective by synchronizing internal behaviors of the updated model with the original one, ensuring context…
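The matching objective described in the abstract (a contextless model should behave like the original model that sees the full context) can be illustrated with a toy sketch. This is not the paper's method: the scalar "models", the names `teacher`, `student`, and `absorb`, the squared-error loss, and the random "future inputs" are all assumptions made for illustration, and the sketch matches final outputs only, whereas the abstract speaks of synchronizing internal behaviors.

```python
import random

def teacher(x, context, w=0.5, u=2.0):
    """Frozen stand-in for the original model: sees the future input x AND the context."""
    return w * x + u * context

def student(x, params):
    """Contextless stand-in: sees only x, so the context's effect must live in params."""
    return params["w"] * x + params["b"]

def absorb(context, steps=2000, lr=0.05, seed=0):
    """Update the student's parameters until it matches the teacher without the context.

    Hypothetical toy loop: plain SGD on 0.5 * (student - teacher)**2.
    """
    rng = random.Random(seed)
    params = {"w": 0.5, "b": 0.0}  # start from the teacher's weight, with no context absorbed
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)  # stand-in for a sampled future input
        err = student(x, params) - teacher(x, context)
        params["w"] -= lr * err * x  # gradient of the squared error w.r.t. w
        params["b"] -= lr * err      # gradient of the squared error w.r.t. b
    return params

context = 1.5
params = absorb(context)

# Largest disagreement between the contextless student and the full-context teacher
gap = max(abs(student(x, params) - teacher(x, context)) for x in (-1.0, 0.0, 1.0))
print(f"max output gap after absorbing the context: {gap:.2e}")
```

In this toy setting the context's contribution ends up in the student's bias term, so the context can be discarded after the update. In a real LLM the absorbed parameters, the source of future inputs, and the quantities being synchronized would all differ from this sketch.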
