Arbitrary-Length Generalization for Addition in a Tiny Transformer

Alexandre Galvao Patriota

arXiv:2406.00075·cs.LG·June 13, 2024

TL;DR

This paper presents a training method that lets a tiny Transformer generalize addition to numbers of arbitrary length. Generation is autoregressive and proceeds from right to left, mimicking manual addition. The results are reproducible and the code is publicly available.

Contribution

It introduces a novel autoregressive training technique that lets a Transformer add numbers whose digit lengths were not seen during training.

Findings

01 Transformer generalizes addition to unseen lengths

02 Method achieves accurate results on large numbers

03 Reproducible with publicly available code

Abstract

This paper introduces a novel training methodology that enables a Transformer model to generalize the addition of two-digit numbers to numbers with unseen digit lengths. The proposed approach employs an autoregressive generation technique that processes digits from right to left, mimicking a common manual method for adding large numbers. To the best of my knowledge, this methodology has not been previously explored in the literature. All results are reproducible, and the corresponding R code is available at github.com/AGPatriota/ALGA-R/.
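
To make the right-to-left idea concrete, the following is a minimal Python sketch of the manual addition procedure the abstract refers to. It is not the author's model or training code (which is written in R). The names `next_digit` and `add_right_to_left` are hypothetical, and `next_digit` is an exact arithmetic stand-in for what would be a learned prediction step. The sketch only illustrates why emitting the least significant digit first allows each step to depend on a bounded amount of local information (two digits and a carry), regardless of the total length of the operands.

```python
def next_digit(a_digit: int, b_digit: int, carry: int) -> tuple[int, int]:
    """Hypothetical stand-in for one generation step.

    Returns the emitted digit and the carry passed to the next step.
    In the paper a Transformer produces the digits; here the step is
    plain arithmetic, used purely for illustration.
    """
    total = a_digit + b_digit + carry
    return total % 10, total // 10


def add_right_to_left(a: str, b: str) -> str:
    """Add two non-negative decimal strings, producing digits from the
    least significant position to the most significant one."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter operand with zeros

    carry = 0
    generated = []  # digits in generation order (least significant first)
    for i in range(width - 1, -1, -1):
        digit, carry = next_digit(int(a[i]), int(b[i]), carry)
        generated.append(str(digit))
    if carry:
        generated.append(str(carry))  # final carry becomes the leading digit

    # Reverse so the result reads in the usual left-to-right order.
    return "".join(reversed(generated))


if __name__ == "__main__":
    print(add_right_to_left("99", "1"))
    print(add_right_to_left("123456789", "987654321"))
```

Because each step looks only at one column and the incoming carry, the same loop handles operands of any length, which is the property the paper aims to have the Transformer learn.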

Code & Models

Repositories

agpatriota/alga-r
pytorch · Official

Taxonomy

Topics: Advanced Numerical Analysis Techniques · Advanced Optimization Algorithms Research · Matrix Theory and Algorithms

Methods: Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Attention Is All You Need · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention