On the Power of Decision Trees in Auto-Regressive Language Modeling
Yulu Gan, Tomer Galanti, Tomaso Poggio, Eran Malach

TL;DR
This paper examines the theoretical and practical capabilities of Auto-regressive Decision Trees (ARDTs) in language modeling, demonstrating their computational power and their effectiveness at generating coherent text and solving reasoning tasks.
Contribution
The paper introduces ARDTs to language modeling, providing theoretical bounds on their size, depth, and computational efficiency, along with empirical evidence that they can generate coherent, grammatically correct text.
Findings
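ARDTs can simulate automata, Turing machines, and sparse circuits by using chain-of-thought computations.
On simple language generation tasks, they generate coherent, grammatically correct text on par with a smaller Transformer model.
Used on top of transformer representations, ARDTs can solve complex reasoning tasks.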
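To make the automaton-simulation finding concrete, here is a minimal sketch of the general idea, not the paper's construction. It assumes a hand-built depth-2 decision tree that reads two tokens from a fixed context window and is applied auto-regressively, so that each emitted token records the running state of a parity automaton. The names `tree_next_token`, `ardt_parity`, and the separator `SEP` are hypothetical and chosen only for illustration.

```python
SEP = "#"  # hypothetical separator between the input bits and the emitted states


def tree_next_token(context, n):
    """Hand-built depth-2 decision tree over two context features (illustrative)."""
    prev = context[-1]        # feature 1: most recent token (previous state, or SEP)
    bit = context[-(n + 1)]   # feature 2: the input bit aligned with this step
    if prev == "1":           # previous automaton state is 1
        return "0" if bit == "1" else "1"
    # previous token is SEP or "0": automaton state is 0
    return "1" if bit == "1" else "0"


def ardt_parity(bits):
    """Apply the tree auto-regressively; return the emitted running parities."""
    n = len(bits)
    seq = list(bits) + [SEP]
    for _ in range(n):
        # each new token is appended and becomes context for the next step
        seq.append(tree_next_token(seq, n))
    return "".join(seq[n + 1:])


print(ardt_parity("1011"))  # running parities of the prefixes; last token is the parity
```

The tree itself never sees more than two tokens at a time. The intermediate tokens it writes carry the automaton state forward, which is the role chain-of-thought computation plays in this sketch.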
Abstract
Originally proposed for handling time series data, Auto-regressive Decision Trees (ARDTs) have not yet been explored for language modeling. This paper delves into both the theoretical and practical applications of ARDTs in this new context. We theoretically demonstrate that ARDTs can compute complex functions, such as simulating automata, Turing machines, and sparse circuits, by leveraging "chain-of-thought" computations. Our analysis provides bounds on the size, depth, and computational efficiency of ARDTs, highlighting their surprising computational power. Empirically, we train ARDTs on simple language generation tasks, showing that they can learn to generate coherent and grammatically correct text on par with a smaller Transformer model. Additionally, we show that ARDTs can be used on top of transformer representations to solve complex reasoning tasks. This research reveals the…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Attention Is All You Need · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding
