Small transformer architectures for task switching
Claudius Gros

TL;DR
This paper investigates how well small transformer architectures handle task switching. Standard transformers reach only modest accuracy, whereas variants with alternative attention mechanisms perform considerably better.
Contribution
It presents a comparative analysis of small transformer variants and alternative attention mechanisms on a task switching benchmark, highlighting the potential of extensive attention for improved performance.
Findings
Standard transformers, LSTMs, and MLPs reach similar but only modest accuracy on task switching
Alternative attention mechanisms significantly improve accuracy
Cisformer and extensive attention achieve around 95% accuracy
Abstract
The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework models work on ongoing token sequences with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task switching reference model based on finite domain arithmetics which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies. We…
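To make the task switching setup more concrete, the following is a minimal sketch of a sequence generator in which control tokens are stochastically interspersed and determine how subsequent tokens are produced. It is an illustration only, not the paper's IARC definition: the domain size `N`, the switching probability `p_switch`, the function names, and the exact rules used here for increment and addition are all assumptions, and the reverse copy and context subtasks are left out.

```python
import random

N = 10              # size of the finite domain (hypothetical value)
TASKS = ("I", "A")  # control tokens: increment, addition (illustrative subset)


def next_token(task, history):
    """Return the next data token under the given task, modulo N.

    Assumed rules, for illustration only:
      I (increment): last data token plus one
      A (addition):  sum of the last two data tokens
    Control tokens in the history are skipped.
    """
    data = [t for t in history if isinstance(t, int)]
    if task == "I":
        return (data[-1] + 1) % N
    if task == "A":
        return (data[-1] + data[-2]) % N
    raise ValueError(f"unknown task: {task}")


def generate(length, p_switch=0.1, seed=0):
    """Generate a token sequence with randomly interspersed control tokens."""
    rng = random.Random(seed)
    seq = [rng.randrange(N), rng.randrange(N)]  # two random starting tokens
    task = "I"                                  # assumed initial task
    while len(seq) < length:
        if rng.random() < p_switch:
            task = rng.choice(TASKS)            # a control token sets the task
            seq.append(task)
        else:
            seq.append(next_token(task, seq))
    return seq


if __name__ == "__main__":
    print(generate(30))
```

A model trained on such sequences has to predict the next token, which requires tracking the most recent control token as well as the preceding data tokens.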
