InfoFlow: A Framework for Multi-Layer Transformer Analysis

Penghao Yu, Haotian Jiang, Zeyu Bao, Qianxiao Li

arXiv:2605.17930 · cs.LG · May 19, 2026

TL;DR

This paper introduces InfoFlow, a framework that analyzes the approximation capabilities of multi-layer Transformers, revealing their efficiency advantages over single-layer models for certain retrieval tasks.

Contribution

The work provides a theoretical framework for understanding multi-layer Transformer approximation properties and introduces InfoFlow to analyze information propagation and efficiency.

Findings

01

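For certain retrieval tasks, a two-layer Transformer with a single head per layer needs at most $O(\varepsilon^{-1})$ parameters to reach precision $\varepsilon$, whereas any single-layer Transformer needs at least $\Omega(\varepsilon^{-k})$, with $k$ growing linearly in the sequence length $T$.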

02

Softmax attention can efficiently retrieve only the maximum-scoring token; retrieving any other token incurs an exponential cost.

03

InfoFlow accurately predicts approximation bounds and aligns with experimental observations.
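
The intuition behind finding 02 can be illustrated with a minimal sketch of softmax weights alone. This is not the paper's construction: the score values, the scale parameter `beta`, and the function names are hypothetical choices made for illustration. The sketch only shows the standard fact that sharpening a softmax concentrates weight on the highest-scoring position, while the second-highest position never becomes dominant.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    z = scores - np.max(scores)
    w = np.exp(z)
    return w / w.sum()

def attention_weights(scores, beta):
    """Softmax attention weights for scores scaled by beta (hypothetical name)."""
    return softmax(beta * np.asarray(scores, dtype=float))

# Hypothetical attention scores for four tokens; index 1 is the maximum,
# index 2 is the runner-up.
scores = [0.2, 1.0, 0.7, -0.5]

for beta in (1.0, 10.0, 100.0):
    w = attention_weights(scores, beta)
    print(f"beta={beta:>5}: weights={np.round(w, 4)}")
```

As `beta` grows, the weight on index 1 approaches 1 and all other weights shrink toward 0, so a single softmax head has no comparably direct way to isolate the runner-up token.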

Abstract

While the approximation properties of single-layer Transformer architectures have been studied in recent works, a rigorous theoretical understanding of the multi-layer setting remains limited. In this work, we establish that multi-layer Transformers possess fundamentally different approximation capabilities from single-layer ones: for certain retrieval tasks, any single-layer Transformer requires at least $\Omega(\varepsilon^{-k})$ parameters to achieve precision $\varepsilon$, where $k$ grows linearly with sequence length $T$, whereas a two-layer Transformer with a single head per layer achieves the same approximation precision with at most $O(\varepsilon^{-1})$ parameters. To understand this separation, we identify two structural mechanisms underlying multi-layer approximation. Specifically, softmax attention can only efficiently retrieve the token attaining the maximum attention…
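
To get a feel for the size of the separation stated in the abstract, the two rates can be compared numerically. This is a rough sketch under loud assumptions: the constants hidden by $\Omega$ and $O$ are set to 1, the values of `eps` and `k` are hypothetical, and the function names are illustrative only. The numbers show scaling, not actual parameter counts from the paper.

```python
def single_layer_rate(eps, k):
    """eps^(-k): the single-layer lower-bound rate, with the hidden constant assumed to be 1."""
    return eps ** (-k)

def two_layer_rate(eps):
    """eps^(-1): the two-layer upper-bound rate, with the hidden constant assumed to be 1."""
    return eps ** (-1)

eps = 0.1  # hypothetical target precision
for k in (1, 2, 4, 8):  # hypothetical exponents; the abstract says k grows linearly with T
    single = single_layer_rate(eps, k)
    double = two_layer_rate(eps)
    print(f"k={k}: single-layer rate ~ {single:.3g}, "
          f"two-layer rate ~ {double:.3g}, ratio ~ {single / double:.3g}")
```

Under these assumptions the ratio between the two rates is $\varepsilon^{-(k-1)}$, so the gap widens exponentially in $k$ and therefore in the sequence length.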
