Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

Mingze Wang; Weinan E

arXiv:2402.00522 · cs.LG · October 31, 2024 · 1 citation

TL;DR

This paper provides a theoretical and experimental analysis of Transformer models, revealing how their components influence expressive power and offering insights for designing improved architectures.

Contribution

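The paper systematically studies the approximation capabilities of Transformers for sequence modeling with long, sparse and complicated memory, and clarifies the roles of key components (dot-product self-attention, positional encoding, feed-forward layer) and parameters (number of layers, number of attention heads).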

Findings

01 Dot-product self-attention, positional encoding and the feed-forward layer each affect the Transformer's expressive power

02 Explicit approximation rates are established for their combined effects

03 Experiments validate the theoretical insights

Abstract

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the dot-product self-attention, positional encoding and feed-forward layer, affect its expressive power, and we study their combined effects through establishing explicit approximation rates. Our study reveals the roles of critical parameters in the Transformer, such as the number of layers and the number of attention heads. These theoretical insights are validated experimentally and offer natural suggestions for alternative architectures.
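For readers unfamiliar with the three components the abstract names, the following is a minimal illustrative sketch in plain Python, not the paper's construction. The function names, the sinusoidal form of the positional encoding, the causal masking, and the ReLU activation are all assumptions chosen for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sinusoidal_position(t, dim):
    # Absolute sinusoidal encoding of position t (one common choice, assumed here).
    return [math.sin(t / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(t / 10000 ** ((i - 1) / dim))
            for i in range(dim)]

def self_attention(X, Wq, Wk, Wv, causal=True):
    # Single-head dot-product self-attention over a sequence X of vectors.
    d = len(Wq)  # query/key dimension
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    out = []
    for t, q in enumerate(Q):
        # With causal=True, position t attends only to positions 0..t.
        last = t + 1 if causal else len(X)
        scores = [sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d)
                  for s in range(last)]
        weights = softmax(scores)
        out.append([sum(w * V[s][j] for s, w in enumerate(weights))
                    for j in range(len(V[0]))])
    return out

def feed_forward(x, W1, W2):
    # Position-wise two-layer network with ReLU (biases omitted for brevity).
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)
```

In this sketch, attention is the only step that mixes information across positions, the positional encoding is what lets the scores depend on where a token sits, and the feed-forward layer acts on each position independently.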

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Residual Connection · Absolute Position Encodings · Dropout · Layer Normalization