Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure

Hanseul Cho; Jaeyoung Cha; Pranjal Awasthi; Srinadh Bhojanapalli; Anupam Gupta; Chulhee Yun

arXiv:2405.20671·cs.LG·October 31, 2024

TL;DR

This paper introduces position coupling, a method that embeds task structure into the positional encoding, enabling Transformers trained on arithmetic tasks such as addition to generalize to much longer sequences than those seen during training.

Contribution

The paper proposes position coupling, a novel positional encoding technique that improves length generalization in Transformers for arithmetic and other algorithmic tasks.

Findings

01. Models with position coupling, trained on 1- to 30-digit additions, generalize to additions of up to 200 digits (6.67x the trained length).

02. In theory, a 1-layer Transformer with position coupling can solve addition involving exponentially many digits.

03. Position coupling also applies to other algorithmic tasks, such as multiplication and two-dimensional tasks.

Abstract

Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more "relevant" tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, our models trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67x of the trained length). On the theoretical side, we prove that a 1-layer Transformer with…
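The coupling idea in the abstract can be illustrated with a minimal sketch. The function name `coupled_position_ids`, the plain `a+b=sum` character format, and the choice of ID 0 for the `+` and `=` tokens are assumptions made for illustration; they are not taken from the paper or its repository, whose exact input format and ID scheme may differ. The sketch only shows the core contrast: instead of a unique ID per token, digits of the same significance share one ID.

```python
from __future__ import annotations


def coupled_position_ids(a: int, b: int) -> tuple[list[str], list[int]]:
    """Tokenize 'a+b=sum' per character and couple digits of equal significance.

    Illustrative assumptions (not the paper's exact scheme):
      - a and b are non-negative integers,
      - the units digit gets ID 1, the tens digit ID 2, and so on,
      - the '+' and '=' tokens get ID 0.
    """
    tokens: list[str] = []
    ids: list[int] = []
    parts = (str(a), str(b), str(a + b))
    for i, part in enumerate(parts):
        if i > 0:
            # Separator before the second operand and before the sum.
            tokens.append("+" if i == 1 else "=")
            ids.append(0)
        n = len(part)
        for k, ch in enumerate(part):
            tokens.append(ch)
            ids.append(n - k)  # same significance -> same ID in every number
    return tokens, ids


if __name__ == "__main__":
    tokens, ids = coupled_position_ids(653, 49)
    print(tokens)                    # ['6', '5', '3', '+', '4', '9', '=', '7', '0', '2']
    print(list(range(len(tokens))))  # vanilla absolute IDs: [0, 1, 2, ..., 9]
    print(ids)                       # coupled IDs: [3, 2, 1, 0, 2, 1, 0, 3, 2, 1]
```

In this example the units digits 3, 9, and 2 all receive ID 1, so the digits that must be combined to produce one output digit share a position regardless of how long the operands are.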

Code & Models

Repositories

hanseuljo/position-coupling
PyTorch · Official

Videos

Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure · SlidesLive

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections