Knee-Deep in C-RASP: A Transformer Depth Hierarchy
Andy Yang, Michaël Cadilhac, David Chiang

TL;DR
This paper proves that, within a subclass of transformers that round to fixed precision except inside attention, deeper transformers are more expressive than shallower ones, and supports the theory with an empirical study on length generalization.
Contribution
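It gives a formal proof that this transformer subclass is expressively equivalent to the programming language C-RASP, with the equivalence preserving depth, shows that deeper C-RASP programs are more expressive than shallower ones, and tests the resulting depth hierarchy empirically.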
Findings
Deeper transformers are more expressive than shallower ones (within the fixed-precision subclass studied).
The same depth hierarchy holds for transformers with positional encodings such as RoPE and ALiBi.
Empirical results align with the theory on length generalization tasks.
Abstract
It has been observed that transformers with greater depth (that is, more layers) have more capabilities, but can we establish formally which capabilities are gained? We answer this question with a theoretical proof followed by an empirical study. First, we consider transformers that round to fixed precision except inside attention. We show that this subclass of transformers is expressively equivalent to the programming language C-RASP and this equivalence preserves depth. Second, we prove that deeper C-RASP programs are more expressive than shallower C-RASP programs, implying that deeper transformers are more expressive than shallower transformers (within the subclass mentioned above). The same is also proven for transformers with positional encodings (like RoPE and ALiBi). These results are established by studying a temporal logic with counting operators equivalent to C-RASP. Finally,…
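To make the notion of depth in a counting-based program more concrete, here is a minimal Python sketch. It is not the paper's definition of C-RASP and not the authors' code: the helper `prefix_count` and the two example programs are hypothetical names introduced for illustration, under the assumption that a counting operation computes, at each position, how many positions up to and including it satisfy a Boolean condition, and that depth corresponds to how deeply such counts are nested.

```python
from itertools import accumulate
from typing import List


def prefix_count(bits: List[bool]) -> List[int]:
    """Hypothetical counting operation: out[i] = number of j <= i with bits[j] true."""
    return list(accumulate(int(b) for b in bits))


def majority_depth1(w: str) -> bool:
    """One layer of counting: accept iff w contains more 'a' than 'b'."""
    if not w:
        return False
    count_a = prefix_count([c == "a" for c in w])
    count_b = prefix_count([c == "b" for c in w])
    # Position-wise comparison of two counts; the last position decides.
    more_a = [x > y for x, y in zip(count_a, count_b)]
    return more_a[-1]


def lead_majority_depth2(w: str) -> bool:
    """Two nested layers: accept iff 'a' leads 'b' in more than half of all prefixes."""
    if not w:
        return False
    count_a = prefix_count([c == "a" for c in w])
    count_b = prefix_count([c == "b" for c in w])
    lead = [x > y for x, y in zip(count_a, count_b)]  # built from first-layer counts
    n_lead = prefix_count(lead)                       # second layer counts over `lead`
    n_all = prefix_count([True] * len(w))
    return 2 * n_lead[-1] > n_all[-1]


if __name__ == "__main__":
    for word in ["aab", "baa", "ab"]:
        print(word, majority_depth1(word), lead_majority_depth2(word))
```

The second function counts over a Boolean vector that was itself computed from counts, which is the nesting this sketch treats as depth 2. It only illustrates the nesting; it is not claimed to be one of the languages the paper uses to separate depth levels.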
Taxonomy
Topics: Logic, programming, and type systems · Constraint Satisfaction and Optimization · Formal Methods in Verification
