Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks
Xingcheng Xu, Zibo Zhao, Haipeng Zhang, Yanqing Yang

TL;DR
This paper presents a theoretical framework to understand how transformer models generalize in arithmetic reasoning tasks, emphasizing the role of task structure and positional encoding in length generalization.
Contribution
It introduces a unified theory linking positional encoding and task structure to transformer generalization, validated through experiments on GPT models.
Findings
Translation invariance in addition aids generalization
Base mismatch in modular operations causes generalization failure
Framework accurately predicts transformer behavior in arithmetic tasks
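The first finding can be made concrete with a small illustration. The sketch below is not the paper's formalism; it assumes little-endian digit lists and uses hypothetical helper names (`to_digits`, `from_digits`, `add_digits`). It shows that schoolbook addition applies the same local rule (digit plus digit plus carry) at every position, so shifting both operands by k digit positions simply shifts the sum by k positions.

```python
def to_digits(n, base=10):
    """Return the little-endian digit list of a non-negative integer."""
    digits = []
    while True:
        n, d = divmod(n, base)
        digits.append(d)
        if n == 0:
            return digits


def from_digits(digits, base=10):
    """Inverse of to_digits: rebuild the integer from little-endian digits."""
    return sum(d * base**i for i, d in enumerate(digits))


def add_digits(a_digits, b_digits, base=10):
    """Schoolbook addition: the same local rule is applied at every position."""
    n = max(len(a_digits), len(b_digits))
    out, carry = [], 0
    for i in range(n):
        a = a_digits[i] if i < len(a_digits) else 0
        b = b_digits[i] if i < len(b_digits) else 0
        # The rule depends only on the two digits and the incoming carry,
        # never on the absolute index i.
        carry, digit = divmod(a + b + carry, base)
        out.append(digit)
    if carry:
        out.append(carry)
    return out


def shifted(digits, k):
    """Shift a little-endian digit list by k positions (multiply by base**k)."""
    return [0] * k + digits


a, b, k = to_digits(478), to_digits(964), 5
# Shifting both inputs by k positions shifts the output by k positions.
print(add_digits(shifted(a, k), shifted(b, k)) == shifted(add_digits(a, b), k))
```

Because the rule never looks at the absolute index, a model that attends by relative position has, in principle, everything it needs to reuse the rule on longer inputs.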
Abstract
Transformer-based models excel in various tasks, but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. Arithmetic tasks provide a controlled framework to explore these capabilities, yet performance anomalies persist, such as inconsistent effectiveness in multiplication and erratic generalization in modular addition (e.g., modulo 100 vs. 101). This paper develops a unified theoretical framework for understanding the generalization behaviors of transformers in arithmetic tasks, focusing on length generalization. Through detailed analysis of addition, multiplication, and modular operations, we reveal that translation invariance in addition aligns with relative positional encoding for robust generalization, while base mismatch in modular operations disrupts this alignment. Experiments across GPT-family models validate our framework,…
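The modulo 100 vs. 101 contrast mentioned above can be checked numerically. The following is a minimal sketch of one reading of "base mismatch", not the paper's own analysis: it assumes decimal inputs and uses a hypothetical function name, `determined_by_low_digits`. It tests whether `x % modulus` is fully determined by the lowest k decimal digits of x, which holds exactly when the modulus divides 10**k.

```python
def determined_by_low_digits(modulus, k, span=10_000):
    """Check whether x % modulus depends only on the lowest k decimal digits.

    Scans x in [10**k, 10**k + span) and compares the residue of x with the
    residue of x truncated to its lowest k digits.
    """
    cutoff = 10**k
    for x in range(cutoff, cutoff + span):
        if x % modulus != (x % cutoff) % modulus:
            return False
    return True


# 100 divides 10**2, so only the last two digits matter.
print(determined_by_low_digits(100, 2))
# 101 divides no power of 10, so higher digits always contribute.
print(any(determined_by_low_digits(101, k) for k in range(1, 7)))
```

Under this reading, modulo 100 lets the answer be computed from a fixed-size window of digits regardless of input length, while modulo 101 makes every additional digit change the residue, which is consistent with the erratic length generalization the abstract describes.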
Taxonomy
Topics: Mathematics Education and Teaching Techniques
Methods: Softmax · Attention Is All You Need
