Small transformer architectures for task switching

Claudius Gros

arXiv:2508.04461·cs.LG·August 7, 2025

TL;DR

This paper investigates how well small transformer architectures handle task switching. Standard transformers struggle, but extended models with alternative attention mechanisms can achieve high performance.

Contribution

The paper compares small transformer variants and alternative attention mechanisms on task switching, highlighting the potential of extensive attention for improved performance.

Findings

01 Standard transformers perform modestly on task switching

02 Extended attention mechanisms significantly improve accuracy

03 Cisformer and extensive attention achieve around 95% accuracy

Abstract

The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional approaches, such as multi-layer perceptrons or recurrent networks. We examine this problem in the context of 'task switching'. In this framework models work on ongoing token sequences with the current task being determined by stochastically interspersed control tokens. We show that standard transformers cannot solve a basic task switching reference model based on finite domain arithmetics which contains subtasks dedicated to increment / addition / reverse copy / context (IARC). We show that transformers, long short-term memory recurrent networks (LSTM), and plain multi-layer perceptrons (MLPs) achieve similar, but only modest prediction accuracies. We…
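To make the task-switching setup concrete, here is a minimal Python sketch of a token stream in which randomly inserted control tokens select the active subtask. It is an illustration under stated assumptions, not the paper's reference model: the domain size `N`, the control-token ids, the switching probability `p_switch`, and the exact rules for increment, addition, and reverse copy are all hypothetical, and the context subtask is left out because this excerpt does not define it.

```python
import random

N = 10  # assumed size of the finite domain; data tokens are 0..N-1
CONTROL = {"I": N, "A": N + 1, "R": N + 2}  # hypothetical control-token ids


def generate(length, p_switch=0.1, seed=0):
    """Toy task-switching stream: data tokens plus interspersed control tokens."""
    rng = random.Random(seed)
    data = [rng.randrange(N), rng.randrange(N)]  # two random starting tokens
    stream = list(data)  # full stream, including control tokens
    task, snapshot, k = "I", list(data), 0
    while len(stream) < length:
        if rng.random() < p_switch:
            # Stochastic switch: emit a control token naming the new task.
            task = rng.choice(sorted(CONTROL))
            stream.append(CONTROL[task])
            snapshot, k = list(data), 0  # remember the data seen so far
            continue
        if task == "I":    # increment: previous data token plus one, mod N
            tok = (data[-1] + 1) % N
        elif task == "A":  # addition: sum of the two previous data tokens, mod N
            tok = (data[-1] + data[-2]) % N
        else:              # reverse copy: replay pre-switch data backwards
            tok = snapshot[-1 - (k % len(snapshot))]
        data.append(tok)
        stream.append(tok)
        k += 1
    return stream


if __name__ == "__main__":
    print(generate(30))
```

A model trained on such a stream would predict the next token at each position, so it has to track which control token appeared last in order to apply the right rule.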
