On the Expressive Power of Transformers for Maxout Networks and Continuous Piecewise Linear Functions
Linyan Gu, Lihua Yang, Feng Zhou

TL;DR
This paper studies the expressive power of Transformer networks. It shows that Transformers can approximate maxout and ReLU networks at comparable model complexity, and it analyzes how they approximate continuous piecewise linear functions, measuring expressivity by a number of linear regions that grows exponentially with depth.
Contribution
The paper establishes a theoretical connection between Transformers and maxout/ReLU networks, showing that Transformers inherit the universal approximation capability of ReLU networks and quantifying their expressivity by the number of linear regions.
Findings
Transformers can approximate maxout networks while preserving comparable model complexity.
Transformers inherit the universal approximation property of ReLU networks under similar complexity constraints.
The number of linear regions in Transformers grows exponentially with depth.
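The link between maxout and ReLU networks mentioned in the findings rests on a standard fact: a ReLU unit is a maxout unit with two affine pieces, one of which is identically zero. The following minimal sketch checks this numerically. The helper names `maxout` and `relu` and the random test data are illustrative assumptions, and the code does not reproduce the paper's Transformer construction.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the maximum over k affine pieces.

    W has shape (k, d), b has shape (k,), x has shape (d,).
    """
    return np.max(W @ x + b, axis=0)

def relu(z):
    """Standard ReLU activation."""
    return np.maximum(z, 0.0)

# Illustrative data (assumption): one ReLU unit with weights w and bias c.
rng = np.random.default_rng(0)
w = rng.normal(size=3)
c = 0.5

# Rank-2 maxout whose second affine piece is the zero function.
W = np.stack([w, np.zeros(3)])
b = np.array([c, 0.0])

xs = rng.normal(size=(100, 3))
relu_out = np.array([relu(w @ x + c) for x in xs])
maxout_out = np.array([maxout(x, W, b) for x in xs])

# The two units agree on every sampled input.
print(np.allclose(relu_out, maxout_out))
```

Because every ReLU unit is a special case of a maxout unit in this sense, an approximation result for maxout networks carries over to ReLU networks, which is the direction of inheritance the findings describe.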
Abstract
Transformer networks have achieved remarkable empirical success across a wide range of applications, yet their theoretical expressive power remains insufficiently understood. In this paper, we study the expressive capabilities of Transformer architectures. We first establish an explicit approximation of maxout networks by Transformer networks while preserving comparable model complexity. As a consequence, Transformers inherit the universal approximation capability of ReLU networks under similar complexity constraints. Building on this connection, we develop a framework to analyze the approximation of continuous piecewise linear functions by Transformers and quantitatively characterize their expressivity via the number of linear regions, which grows exponentially with depth. Our analysis establishes a theoretical bridge between approximation theory for standard feedforward neural…
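To make "the number of linear regions grows exponentially with depth" concrete, here is a small sketch based on the classic sawtooth construction for ReLU networks: composing a two-unit "tent" layer with itself `depth` times yields a piecewise linear function on [0, 1] with 2^depth linear pieces. This is a well-known feedforward example used purely for illustration. It is not the Transformer construction from the paper, and the function names `tent`, `deep_tent`, and `count_linear_regions` are hypothetical.

```python
import numpy as np

def tent(x):
    """One ReLU layer with two units: 2x on [0, 0.5], 2 - 2x on [0.5, 1]."""
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

def deep_tent(x, depth):
    """Compose the tent layer `depth` times."""
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_regions(f, grid):
    """Count linear pieces of a 1-D function by counting slope changes on a grid.

    Assumes every breakpoint of f lies on a grid point.
    """
    y = f(grid)
    slopes = np.diff(y) / np.diff(grid)
    changes = ~np.isclose(slopes[1:], slopes[:-1])
    return int(np.sum(changes)) + 1

# Dyadic grid, so breakpoints k / 2^depth fall exactly on grid points.
grid = np.linspace(0.0, 1.0, 2**12 + 1)

counts = [count_linear_regions(lambda x, d=d: deep_tent(x, d), grid)
          for d in range(1, 6)]
print(counts)  # [2, 4, 8, 16, 32]
```

Each extra layer adds only two units, yet the number of pieces doubles, which is the kind of depth-driven growth that region-counting analyses quantify.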
Taxonomy
Topics: Advanced Graph Neural Networks · Ferroelectric and Negative Capacitance Devices · Neural Networks and Applications
