Reducing the Transformer Architecture to a Minimum

Bernhard Bermeitinger; Tomas Hrycej; Massimo Pavone; Julianus Kath; Siegfried Handschuh

arXiv:2410.13732·cs.LG·November 25, 2024

Open Access

TL;DR

This paper demonstrates that a minimal transformer architecture, omitting MLPs and collapsing matrices, can achieve comparable performance to standard models on CV benchmarks while drastically reducing parameters.

Contribution

The authors propose a simplified transformer model that reduces parameters by removing MLPs and collapsing matrices, maintaining performance on CV tasks.

Findings

01. Achieved similar accuracy to standard transformers on MNIST and CIFAR-10.

02. Reduced parameter count by up to 90% without performance loss.

03. Validated the minimal architecture's effectiveness on benchmark datasets.
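
To make the parameter savings concrete, the following is a rough back-of-the-envelope sketch, not the authors' accounting. It assumes a conventional block with four square attention matrices (query, key, value, output projection) and an MLP with a 4x hidden expansion, ignores biases and normalization layers, and reads "collapsing matrices" as merging the query/key pair and the value/projection pair into one square matrix each. The function name `block_params` and all of its parameters are hypothetical.

```python
def block_params(d_model, mlp_ratio=4, keep_mlp=True, collapse=False):
    """Approximate weight count of one transformer block (biases ignored).

    Assumptions (not taken from the paper): square d_model x d_model
    attention matrices and an MLP with hidden width mlp_ratio * d_model.
    """
    # Four attention matrices (Q, K, V, O), or two if each pair is collapsed.
    attn_matrices = 2 if collapse else 4
    attn = attn_matrices * d_model * d_model
    # The MLP has an up-projection and a down-projection.
    mlp = 2 * mlp_ratio * d_model * d_model if keep_mlp else 0
    return attn + mlp


d = 64  # illustrative width
standard = block_params(d)
no_mlp = block_params(d, keep_mlp=False)
minimal = block_params(d, keep_mlp=False, collapse=True)

print("standard block:", standard)
print("without MLP:   ", no_mlp)
print("also collapsed:", minimal)
print("fraction saved:", round(1 - minimal / standard, 3))
```

Under these assumptions the MLP accounts for two thirds of the block's weights, which is why its removal dominates the savings. The exact figure depends on the expansion ratio, head layout, and embedding layers, so this sketch does not reproduce the reported 90%.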

Abstract

Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem of extracting relevant context information from long sequences in NLP and realistic scenes in CV. A classical neural network component, a Multi-Layer Perceptron (MLP), complements the attention mechanism. Its necessity is frequently justified by its capability of modeling nonlinear relationships. However, the attention mechanism itself is nonlinear through its internal use of similarity measures. A possible hypothesis is that this nonlinearity is sufficient for modeling typical application problems. As the MLPs usually contain the most trainable parameters of the whole model, their omission would substantially reduce the parameter set size. Further…
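
The claim that attention is nonlinear on its own can be checked with a small experiment. The sketch below is an illustrative simplification, not the paper's model: it uses a single head, identity weight matrices, and scaled dot-product similarity with softmax, and it tests whether doubling the input doubles the output, as it would for any linear map. The function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity weights (a simplification).

    Each output token is a softmax-weighted average of all input tokens,
    with weights given by scaled dot-product similarity.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out


x = [[1.0, 0.0], [0.0, 1.0]]
y = self_attention(x)
y_scaled = self_attention([[2.0 * v for v in tok] for tok in x])

# A linear map f would satisfy f(2x) == 2 * f(x); measure the largest deviation.
gap = max(abs(a - 2.0 * b) for ra, rb in zip(y_scaled, y) for a, b in zip(ra, rb))
print("largest deviation from linearity:", gap)
```

The deviation is clearly nonzero because scaling the input also sharpens the softmax weights, so the layer's response depends on the input in a non-additive way even with no MLP present.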

Taxonomy

Topics: Magnetic Properties and Applications · Power Transformer Diagnostics and Insulation

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training