Universal Length Generalization with Turing Programs
Kaiying Hou, David Brandfonbrener, Sham Kakade, Samy Jelassi, Eran Malach

TL;DR
This paper introduces Turing Programs, a universal Chain-of-Thought strategy that lets transformer language models generalize from short training sequences to longer test sequences across algorithmic tasks by mimicking the step-by-step computation of a Turing Machine.
Contribution
The paper proposes Turing Programs, a novel, simple, and universal CoT approach that achieves robust length generalization on multiple algorithmic tasks and demonstrates transformers' ability to implement Turing Programs.
Findings
Robust length generalization on addition, multiplication, and in-context SGD.
Transformers can length-generalize on random Turing Programs.
Theoretically, transformers can implement Turing Programs to simulate Turing machines.
Abstract
Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed some architecture or data format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we propose Turing Programs, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing Machine. This framework is both universal, as it can accommodate any algorithmic task, and simple, requiring only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication and in-context SGD. We then demonstrate that transformers…
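The idea of a scratchpad in which each step copies the previous one with a small local change can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's actual data format: the machine (`INCREMENT`, a binary incrementer), the `[state:symbol]` head marker, and the helper names `render` and `turing_program` are all hypothetical.

```python
from typing import Dict, List, Tuple

# Transition table: (state, symbol) -> (new_state, symbol_to_write, head_move)
Transitions = Dict[Tuple[str, str], Tuple[str, str, int]]

BLANK = "_"

# Hypothetical example machine (not from the paper): increments a binary number.
# The head starts on the rightmost (least significant) bit in state "carry".
INCREMENT: Transitions = {
    ("carry", "1"): ("carry", "0", -1),  # 1 + carry = 0, carry moves left
    ("carry", "0"): ("halt", "1", 0),    # 0 + carry = 1, done
    ("carry", BLANK): ("halt", "1", 0),  # walked off the left edge: new leading 1
}


def render(tape: List[str], head: int, state: str) -> str:
    """Write the tape as text, marking the head cell as [state:symbol]."""
    cells = [f"[{state}:{s}]" if i == head else s for i, s in enumerate(tape)]
    return " ".join(cells)


def turing_program(bits: str, rules: Transitions, max_steps: int = 1000) -> List[str]:
    """Return one text line per machine configuration, ending when the machine halts."""
    tape = list(bits)
    head, state = len(tape) - 1, "carry"
    trace = [render(tape, head, state)]
    for _ in range(max_steps):
        if state == "halt":
            break
        new_state, write, move = rules[(state, tape[head])]
        tape[head] = write
        state = new_state
        head += move
        if head < 0:  # grow the tape to the left when the head walks off it
            tape.insert(0, BLANK)
            head = 0
        trace.append(render(tape, head, state))
    return trace


if __name__ == "__main__":
    for line in turing_program("011", INCREMENT):
        print(line)
```

Each emitted line differs from the previous one only around the head position, which is the property the abstract describes as "copying text from the context with small modifications". A longer input produces more lines of the same shape rather than a different kind of step.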
Taxonomy
Topics: Computability, Logic, AI Algorithms
Methods: Sparse Evolutionary Training · Stochastic Gradient Descent
