Reducing the Transformer Architecture to a Minimum
Bernhard Bermeitinger, Tomas Hrycej, Massimo Pavone, Julianus Kath, Siegfried Handschuh

TL;DR
This paper demonstrates that a minimal transformer architecture, omitting MLPs and collapsing matrices, can achieve comparable performance to standard models on CV benchmarks while drastically reducing parameters.
Contribution
The authors propose a simplified transformer model that reduces parameters by removing MLPs and collapsing matrices, maintaining performance on CV tasks.
Findings
Achieved similar accuracy to standard transformers on MNIST and CIFAR-10.
Reduced parameter count by up to 90% without performance loss.
Validated the minimal architecture's effectiveness on benchmark datasets.
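To get a feel for where such savings can come from, the following back-of-the-envelope count compares one standard block with an attention-only block. It is a sketch under stated assumptions, not the authors' accounting: a single attention head, square `d_model × d_model` weight matrices, an MLP with the common 4× hidden expansion, and no biases or normalization parameters. It also assumes that "collapsing matrices" means replacing the query/key pair by one matrix and the value/projection pair by one matrix. The function name `block_params` and its flags are hypothetical.

```python
def block_params(d_model: int, mlp_ratio: int = 4, with_mlp: bool = True,
                 collapse_qk: bool = False, collapse_vo: bool = False) -> int:
    """Count weight-matrix parameters of one single-head transformer block.

    Biases and normalization parameters are ignored (assumption).
    """
    square = d_model * d_model
    # Query and key: two d x d matrices, or one if collapsed (assumption).
    qk = square if collapse_qk else 2 * square
    # Value and output projection: two d x d matrices, or one if collapsed (assumption).
    vo = square if collapse_vo else 2 * square
    # MLP: d -> mlp_ratio*d -> d, i.e. two matrices of d * (mlp_ratio*d) weights each.
    mlp = 2 * mlp_ratio * square if with_mlp else 0
    return qk + vo + mlp


d = 64
standard = block_params(d)
minimal = block_params(d, with_mlp=False, collapse_qk=True, collapse_vo=True)
saving = 1 - minimal / standard
print(standard, minimal, f"{saving:.0%}")  # prints: 49152 8192 83%
```

Under these assumptions the MLP alone accounts for two thirds of the block's weights, which is why dropping it dominates the saving. The exact percentage depends on the configuration, so this toy figure should not be read as a reproduction of the reported 90%.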
Abstract
Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and realistic scenes in CV. A classical neural network component, a Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its capability of modeling nonlinear relationships. However, the attention mechanism itself is nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. As the MLPs usually contain the most trainable parameters of the whole model, their omission would substantially reduce the parameter set size. Further…
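The claim that attention is nonlinear on its own can be illustrated with a minimal attention-only layer. The sketch below is an illustration under assumptions, not the authors' implementation: it uses a single head, one collapsed similarity matrix `w_sim` (standing in for the product of the query and key matrices), the raw tokens as values, a residual connection, and no MLP or normalization. All names are hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def attention_only_layer(x, w_sim):
    """Apply one attention-only layer.

    x:     n_tokens x d_model token matrix
    w_sim: d_model x d_model collapsed similarity matrix (assumption)
    Returns the updated tokens and the n_tokens x n_tokens attention weights.
    """
    d = len(x[0])
    # Bilinear similarity scores x W x^T, scaled by sqrt(d) as in standard attention.
    scores = matmul(matmul(x, w_sim), transpose(x))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    # Values are the tokens themselves: no value or projection matrix (assumption).
    mixed = matmul(weights, x)
    # Residual connection (assumption).
    out = [[xi + mi for xi, mi in zip(xr, mr)] for xr, mr in zip(x, mixed)]
    return out, weights


random.seed(0)
n_tokens, d_model = 4, 8
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
w_sim = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_model)]
out, weights = attention_only_layer(x, w_sim)
```

The only trainable parameters here are the `d_model × d_model` entries of `w_sim`. Because the softmax weights themselves depend on the input, doubling `x` does not simply double the output, which is the sense in which the layer is nonlinear without any MLP.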
Taxonomy
Topics: Magnetic Properties and Applications · Power Transformer Diagnostics and Insulation
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
