Positional Description Matters for Transformers Arithmetic
Ruoqi Shen, Sébastien Bubeck, Ronen Eldan, Yin Tat Lee, Yuanzhi Li, Yi Zhang

TL;DR
This paper investigates the impact of positional encoding on transformer models' ability to perform arithmetic tasks and proposes modifications to improve their generalization to larger numbers and longer sequences.
Contribution
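The authors identify transformers' naive reliance on positional information as a key obstacle in arithmetic and propose two kinds of fixes: modifying the positional encoding directly, or changing how the arithmetic task is represented so that standard positional encoding is used differently.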
Findings
Transformers with modified positional encoding excel at multi-digit multiplication.
Enhanced models demonstrate strong extrapolation from shorter to longer sequences.
Proposed methods achieve near-perfect accuracy on larger arithmetic problems with limited training data.
Abstract
Transformers, central to the successes in modern Natural Language Processing, often falter on arithmetic tasks despite their vast capabilities, which paradoxically include remarkable coding abilities. We observe that a crucial challenge is their naive reliance on positional information to solve arithmetic problems with a small number of digits, leading to poor performance on larger numbers. Herein, we delve deeper into the role of positional encoding, and propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently. We investigate the value of these modifications for three tasks: (i) classical multiplication, (ii) length extrapolation in addition, and (iii) addition in natural language context. For (i) we train a small model on a small dataset…
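The second route named in the abstract, changing how the arithmetic task is represented, can be illustrated with the two data-formatting ideas that the reviews on this page mention: padding and insertion of random spaces. The following is a minimal Python sketch under stated assumptions, not the authors' implementation. The function names, the `0` pad character, the fixed widths, and the space probability `p` are all hypothetical choices made for illustration; this page does not specify the paper's exact formats.

```python
import random


def pad_operand(n: int, width: int, pad_char: str = "0") -> str:
    """Left-pad the decimal string of n to a fixed width (assumed scheme)."""
    return str(n).rjust(width, pad_char)


def format_padded_addition(a: int, b: int, width: int) -> str:
    """Render a + b with every number padded to a fixed width.

    With fixed widths, the digit of a given significance sits at the same
    character offset in every example, regardless of operand length.
    The sum gets one extra column to leave room for a carry.
    """
    total = a + b
    return (
        f"{pad_operand(a, width)}+{pad_operand(b, width)}"
        f"={pad_operand(total, width + 1)}"
    )


def insert_random_spaces(text: str, rng: random.Random, p: float = 0.3) -> str:
    """After each character, insert a space with probability p.

    This perturbs the absolute position of every digit, so a model trained
    on such strings cannot rely on fixed character offsets.
    """
    out = []
    for ch in text:
        out.append(ch)
        if rng.random() < p:
            out.append(" ")
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only so the demo is repeatable
    print(format_padded_addition(12, 345, width=5))
    print(insert_random_spaces("12+345=357", rng))
```

Here `format_padded_addition(12, 345, width=5)` yields `00012+00345=000357`. The two helpers pull in opposite directions, as one reviewer notes: padding makes digit positions uniform across examples, while random spaces deliberately break any fixed relationship between a digit and its absolute position.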
Peer Reviews
Decision · Submitted to ICLR 2024
This paper does extensive experimental work to investigate the described settings.
This paper contains flippant sentences such as:
- "Our findings reveal that even such modest-sized models can adeptly execute intricate arithmetic tasks" (when they use GPT models of >100M parameters to solve a very straight-forward task. For comparison, for the case of addition, people have been able to train 1-layer transformers to do it)
- "In Section 2.1, we show a simple 12-layer transformer can output the product of 15 × 15-multiplication directly, demonstrating the immense potential of tra
The paper cleverly uses simple settings that isolate the problem and allow testing the effect of different solutions. The results, such as the ability to extrapolate simply through padding, are also quite interesting and simple enough to be implemented (even if only as a temporary solution) to allow models to perform well on a larger set of numbers. The investigation into the effect of positional encoding also takes the effect of pre-training into account and points out that using no positi
1. The paper is not structured very cohesively. There is also no conclusion or future work section which makes the paper a bit incomplete. The proposed solutions are more experimental and it is not clear which ones can be readily applied in practice.
2. My understanding is that the problem with extrapolation seems to lie with absolute positional encoding. However, there are many other encodings already designed and being used in practice including relative ones such as rotary. It is possible t
- The article is clear and well-written. The research questions are well-motivated.
- The authors carried out a substantial amount of work and present results from a variety of interesting analyses.
- Focusing on “basic” architectures (GPT-2) and simple symbolic domains, such as arithmetic, makes it possible to gain useful insights about the computational capabilities of Transformers, which might then be extended to more complex architectures and reasoning tasks.
- Related works could be expanded (see below). Furthermore, the training/testing setups used in the present work differ from those used in other similar work, making it more challenging to compare the current results with previous contributions.
- Different Sections investigate different problems (e.g., multiplication vs. addition vs. math word problems), often by introducing opposite approaches (e.g., padding vs. insertion of random spaces) or by relying on quite different training regimens (e.
Taxonomy
Topics: Numerical Methods and Algorithms · Topic Modeling · Mathematics, Computing, and Information Processing
Methods: Sparse Evolutionary Training
