TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
Anurup Ganguli

TL;DR
TFGN is an architectural overlay for transformer language models that enables continual pre-training across heterogeneous text domains without replay or task labels, with near-zero catastrophic forgetting and positive forward transfer.
Contribution
The paper introduces TFGN, a parameter-efficient overlay for transformers that performs continual learning at LLM scale without replay, task IDs, or regularization penalties.
Findings
Achieves near-zero backward transfer (-0.007 for the LLaMA 3.1 8B Retrofit) and high retention across six heterogeneous domains.
Demonstrates positive forward transfer, reducing perplexity in unseen domains.
Provides extensions for reducing forgetting and reshaping model behavior with high fidelity.
Abstract
Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no…
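For readers unfamiliar with the backward-transfer figure quoted in the abstract, the sketch below computes the metric under its common continual-learning definition: the average change in each earlier domain's score between the end of that domain's own training phase and the end of the final phase. This is an assumption; the abstract does not state the exact formula or the underlying score TFGN uses. The function name, the score-matrix layout, and the toy numbers are hypothetical and are not results from the paper.

```python
from typing import Sequence


def backward_transfer(scores: Sequence[Sequence[float]]) -> float:
    """Average backward transfer over all domains except the last.

    scores[t][i] is a higher-is-better metric on domain i, measured
    after training phase t (phases and domains share the same order).
    Assumed definition: mean over i < T-1 of scores[T-1][i] - scores[i][i].
    """
    num_phases = len(scores)
    if num_phases < 2:
        raise ValueError("backward transfer needs at least two phases")
    final_row = scores[num_phases - 1]
    # Compare the final score on each earlier domain with the score
    # recorded right after that domain was trained.
    deltas = [final_row[i] - scores[i][i] for i in range(num_phases - 1)]
    return sum(deltas) / len(deltas)


# Hypothetical toy scores for three phases (NOT numbers from the paper).
toy_scores = [
    [0.50, 0.10, 0.10],  # after phase 0
    [0.49, 0.60, 0.12],  # after phase 1
    [0.48, 0.59, 0.70],  # after phase 2
]

if __name__ == "__main__":
    print(backward_transfer(toy_scores))
```

With these toy scores the result is approximately -0.015. A value near zero means training on later domains barely degraded the earlier ones, and a positive value would mean later training improved them.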
