Attractor Patch Networks: Reducing Catastrophic Forgetting with Routed Low-Rank Patch Experts
Shashank

TL;DR
This paper introduces Attractor Patch Networks (APN), a modular replacement for Transformer FFNs that improves contextual specialization and continual learning by routing tokens to low-rank patch experts, leading to better domain adaptation and retention.
Contribution
APN provides a novel patch-based architecture for Transformers that enhances expressivity and continual learning compatibility through a routing mechanism and low-rank residual updates.
Findings
APN achieves competitive perplexity in language modeling.
APN significantly improves continual domain adaptation and retention.
APN outperforms global fine-tuning in experiments.
Abstract
Transformers achieve strong language modeling accuracy, yet their position-wise feed-forward networks (FFNs) are dense, globally shared, and typically updated end to end. These properties create two practical tensions. First, dense FFNs spend the same compute on every token regardless of context, and they allocate capacity uniformly even when language exhibits highly clustered context structure. Second, continual learning, in the sense of updating the model while serving a data stream, often produces interference because a small update touches broadly shared weights. We propose Attractor Patch Networks (APN), a plug-compatible replacement for the Transformer FFN. APN is a bank of patch experts. A similarity router selects a small top-k set of patches for each token by matching the token representation to learned prototypes. Each selected patch emits a low-rank residual update…
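The routing step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract is truncated before it gives details, so the cosine-similarity scoring, the softmax gate over the selected patches, the purely linear down/up projections, and every name below (`route_top_k`, `apn_forward`, `make_patch_bank`) are assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


def cosine(u, v):
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom > 0 else 0.0


def route_top_k(x, prototypes, k):
    """Pick the k prototypes most similar to x; return indices and gate weights.

    Assumption: cosine similarity and a softmax over the selected scores.
    """
    scores = [cosine(x, p) for p in prototypes]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def apn_forward(x, prototypes, down, up, k=2):
    """Add gated low-rank updates from the k selected patches to x.

    down[i] has shape (rank, d_model); up[i] has shape (d_model, rank).
    Assumption: each patch is a plain linear low-rank map up[i] @ down[i].
    """
    idx, gates = route_top_k(x, prototypes, k)
    out = list(x)
    for i, g in zip(idx, gates):
        z = matvec(down[i], x)    # project to the low-rank space
        delta = matvec(up[i], z)  # project back to model width
        out = [o + g * d for o, d in zip(out, delta)]
    return out, idx, gates


def make_patch_bank(num_patches, d_model, rank, seed=0):
    """Hypothetical helper: random prototypes and low-rank factors."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    prototypes = rand(num_patches, d_model)
    down = [rand(rank, d_model) for _ in range(num_patches)]
    up = [rand(d_model, rank) for _ in range(num_patches)]
    return prototypes, down, up


prototypes, down, up = make_patch_bank(num_patches=8, d_model=16, rank=2)
token = [0.5 * ((i % 3) - 1) for i in range(16)]
output, selected, gates = apn_forward(token, prototypes, down, up, k=2)
print("selected patches:", selected)
```

In this sketch only the selected patches contribute to the output, so changing the parameters of an unselected patch leaves that token's output unchanged. That locality is the property the abstract contrasts with updates to broadly shared dense FFN weights.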
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
