TPTT: Transforming Pretrained Transformers into Titans
Fabien Furfaro

TL;DR
TPTT is a framework that enhances pretrained Transformer models with linearized attention and memory gating, improving efficiency and accuracy for long-context NLP tasks without full retraining.
Contribution
The paper introduces TPTT, a method that combines linearized attention with memory gating so that pretrained Transformers of various sizes can be fine-tuned efficiently.
Findings
Up to 20% improvement in Exact Match scores on the MMLU benchmark.
Feasibility of converting quadratic-attention models to linear-attention models.
Effective fine-tuning with modest computational resources.
Abstract
Transformer-based large language models (LLMs) have achieved strong performance across many natural language processing tasks. Nonetheless, their quadratic computational and memory requirements, particularly in self-attention layers, pose challenges for efficient inference on long contexts and for deployment in resource-limited environments. We present TPTT (Transforming Pretrained Transformers into Titans), a framework designed to augment pretrained Transformers with linearized attention (LiZA) and internal memory gating via Memory as Gate (MaG), applied without full retraining. TPTT supports parameter-efficient fine-tuning (LoRA) and integrates with standard toolkits such as Hugging Face Transformers. We evaluated TPTT on several pretrained models, including Llama-1B, OlMoE-1B-7B, Qwen2.5-1.5B, Gemma3-270m, OpenELM-1.3B, and Mistral-7B, in order to assess applicability across…
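The abstract names linearized attention and a Memory as Gate mechanism without giving formulas, so the following is only a minimal sketch of the general idea, not the paper's implementation. It assumes an `elu(x) + 1` feature map for the linear branch and a single scalar `gate` that blends the linear and softmax outputs. The function names and the scalar gate are hypothetical choices made for illustration.

```python
import numpy as np


def feature_map(x):
    # Assumed positive feature map: elu(x) + 1 (not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))


def causal_softmax_attention(q, k, v):
    # Standard quadratic attention: builds an n x n score matrix.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def causal_linear_attention(q, k, v):
    # Linearized attention as a recurrence: a (d_k x d_v) state replaces
    # the n x n score matrix, so cost grows linearly with sequence length.
    n, d_k = q.shape
    d_v = v.shape[1]
    fq, fk = feature_map(q), feature_map(k)
    state = np.zeros((d_k, d_v))   # running sum of outer(phi(k_t), v_t)
    norm = np.zeros(d_k)           # running sum of phi(k_t)
    out = np.empty((n, d_v))
    for t in range(n):
        state += np.outer(fk[t], v[t])
        norm += fk[t]
        out[t] = (fq[t] @ state) / (fq[t] @ norm)
    return out


def gated_mix(q, k, v, gate):
    # Hypothetical gate: gate=0 gives pure softmax attention,
    # gate=1 gives pure linear attention.
    linear = causal_linear_attention(q, k, v)
    softmax = causal_softmax_attention(q, k, v)
    return gate * linear + (1.0 - gate) * softmax


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
    print(gated_mix(q, k, v, gate=0.5).shape)
```

With a gate of this kind, a pretrained model could start near `gate = 0`, where it behaves like its original softmax attention, and move toward the linear branch during fine-tuning. Whether TPTT schedules or learns its gate this way is not stated in the text above.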
