InfoFlow: A Framework for Multi-Layer Transformer Analysis
Penghao Yu, Haotian Jiang, Zeyu Bao, Qianxiao Li

TL;DR
This paper introduces InfoFlow, a framework that analyzes the approximation capabilities of multi-layer Transformers, revealing their efficiency advantages over single-layer models for certain retrieval tasks.
Contribution
The work provides a theoretical framework for understanding multi-layer Transformer approximation properties and introduces InfoFlow to analyze information propagation and efficiency.
Findings
Multi-layer Transformers require fewer parameters than single-layer ones for certain tasks.
Softmax attention can efficiently retrieve only the maximum-scoring token; retrieving any other token incurs exponential cost.
InfoFlow accurately predicts approximation bounds and aligns with experimental observations.
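The intuition behind the softmax finding can be illustrated with a small sketch. It is not the paper's construction or proof. The scores and the sharpness factor `beta` are hypothetical values chosen for illustration. The sketch only shows that scaling the attention scores concentrates the softmax weights on the maximum-scoring token, while the weight on every other token shrinks.

```python
import math

def softmax(scores, beta=1.0):
    """Numerically stable softmax of beta * scores (beta is an assumed sharpness factor)."""
    m = max(scores)
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for four tokens; index 1 is the maximum,
# index 2 is the runner-up.
scores = [0.2, 1.0, 0.7, -0.3]

for beta in (1, 10, 100):
    weights = softmax(scores, beta)
    # As beta grows, nearly all weight moves to the maximum-scoring token.
    print(beta, [round(w, 4) for w in weights])
```

For any positive `beta`, the maximum-scoring token always receives the largest weight, so sharpening alone never singles out the runner-up. This is consistent with the finding that other retrievals are costly, though the exponential bound itself comes from the paper's analysis, not from this sketch.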
Abstract
While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires a number of parameters bounded below in terms of the target precision and of a quantity that grows linearly with sequence length, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision within a smaller, explicitly bounded parameter budget. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention…
