Arbitrary-Length Generalization for Addition in a Tiny Transformer
Alexandre Galvao Patriota

TL;DR
This paper presents a training method that lets a tiny Transformer generalize addition to numbers of arbitrary length. Generation is autoregressive and proceeds from right to left, mimicking manual addition. The results are reproducible and the code is publicly available.
Contribution
It introduces a novel autoregressive training technique that lets a Transformer add numbers whose lengths were not seen during training.
Findings
Transformer generalizes addition to unseen lengths
Method achieves accurate results on large numbers
Reproducible with publicly available code
Abstract
This paper introduces a novel training methodology that enables a Transformer model to generalize the addition of two-digit numbers to numbers with unseen digit lengths. The proposed approach employs an autoregressive generation technique, processing from right to left, which mimics a common manual method for adding large numbers. To the best of my knowledge, this methodology has not been previously explored in the literature. All results are reproducible, and the corresponding R code is available at github.com/AGPatriota/ALGA-R/.
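The manual method the abstract refers to can be illustrated with a short sketch. This is not the paper's model or training code (which is in R at the repository above); it is only the schoolbook column-addition procedure, written in Python under the assumption that operands are non-negative decimal strings. The function name `add_right_to_left` and the `emitted` list are hypothetical names chosen for this illustration. The point it shows is that each output digit depends only on the current column and a single carry, so the same step applies regardless of how long the numbers are.

```python
def add_right_to_left(a: str, b: str) -> str:
    """Add two non-negative decimal strings column by column, units first.

    Illustrative only: mirrors schoolbook addition, not the paper's Transformer.
    """
    # Pad the shorter operand with leading zeros so the columns line up.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    carry = 0
    emitted = []  # digits in the order they are produced (least significant first)
    for i in range(width - 1, -1, -1):
        total = int(a[i]) + int(b[i]) + carry
        emitted.append(str(total % 10))  # digit written under this column
        carry = total // 10              # carry passed to the next column on the left

    if carry:
        emitted.append(str(carry))

    # Digits were produced right to left; reverse them for the usual reading order.
    return "".join(reversed(emitted))


if __name__ == "__main__":
    print(add_right_to_left("58", "67"))
    print(add_right_to_left("99999999999999999999", "1"))
```

Because the per-column step never looks at the total length of the operands, the procedure works identically on two-digit and twenty-digit inputs, which is the property a right-to-left generation order is meant to exploit.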
Taxonomy
Topics: Advanced Numerical Analysis Techniques · Advanced Optimization Algorithms Research · Matrix Theory and Algorithms
Methods: Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention
