Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks
Mahdi Sabbaghi, George Pappas, Hamed Hassani, Surbhi Goel

TL;DR
This paper demonstrates that explicitly encoding structural symmetries through number formatting and positional encodings enables Transformers to generalize over length in arithmetic tasks, outperforming traditional methods and highlighting the importance of structural awareness.
Contribution
The authors introduce a method to explicitly encode structural symmetries into Transformers, significantly improving length generalization in arithmetic tasks without additional data for longer sequences.
Findings
Transformers with explicit symmetry encoding, trained on numbers with at most 5 digits, generalize to addition and multiplication of numbers with up to 50 digits.
Traditional absolute positional encodings fail to generalize to longer sequences.
Explicit structure encoding is necessary for out-of-distribution generalization.
Abstract
Despite the success of Transformers on language understanding, code generation, and logical reasoning, they still fail to generalize over length on basic arithmetic tasks such as addition and multiplication. A major reason behind this failure is the vast difference in structure between numbers and text; for example, numbers are typically parsed from right to left, and there is a correspondence between digits at the same position across different numbers. In contrast, for text, such symmetries are quite unnatural. In this work, we propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings. Empirically, our method allows a Transformer trained on numbers with at most 5 digits for addition and multiplication to generalize up to 50-digit numbers, without using additional data for longer sequences. We further demonstrate that…
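To make the two ideas in the abstract concrete, here is a minimal sketch of what a "modified number format" and a "custom positional encoding" could look like for addition. It is an illustration under stated assumptions, not the authors' actual scheme: the function names `format_addition` and `digit_position_ids`, the zero-padding rule, and the choice of 0 as the position id for separators are all hypothetical. The sketch writes each number least-significant digit first and gives every digit an index counted within its own number, so digits of equal significance share the same position id across operands and result.

```python
from typing import List


def format_addition(a: int, b: int) -> str:
    """Render a + b with every number reversed (least-significant digit first).

    Assumes non-negative integers. All three numbers are zero-padded to a
    common width (one more than the longer operand, leaving room for a
    final carry). This padding rule is an assumption of the sketch.
    """
    width = max(len(str(a)), len(str(b))) + 1

    def reverse_padded(n: int) -> str:
        return str(n).zfill(width)[::-1]

    return f"{reverse_padded(a)}+{reverse_padded(b)}={reverse_padded(a + b)}"


def digit_position_ids(text: str) -> List[int]:
    """Assign each digit its 1-based index within its own number.

    Separator characters ('+', '=') get id 0 and reset the counter, so the
    k-th digit of every number receives the same id k.
    """
    ids = []
    k = 0
    for ch in text:
        if ch.isdigit():
            k += 1
            ids.append(k)
        else:
            k = 0
            ids.append(0)
    return ids


example = format_addition(57, 68)
print(example)                      # 750+860=521
print(digit_position_ids(example))  # [1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
```

In this toy format, the units digits of both operands and of the sum all carry position id 1, which is the kind of cross-number digit correspondence the abstract describes. An absolute position index would instead give them ids that depend on the operand lengths.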
Taxonomy
Topics: Cognitive and developmental aspects of mathematical skills · Mathematics Education and Teaching Techniques
Methods: Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention
