Language Models Need Inductive Biases to Count Inductively

Yingshan Chang; Yonatan Bisk

arXiv:2405.20131·cs.LG·November 19, 2024

TL;DR

This paper investigates how different neural network architectures, including RNNs and Transformers, generalize counting tasks out-of-domain, revealing that inductive biases are crucial for robust generalization.

Contribution

The study provides extensive empirical analysis of counting generalization across various architectures and highlights the importance of inductive biases and design choices.

Findings

01

RNNs trivially achieve inductive counting

02

Transformers rely on positional embeddings for out-of-domain counting

03

Modern RNNs underperform traditional RNNs in inductive counting
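As a rough illustration of why a recurrent update makes counting independent of sequence length (finding 01), the hand-written sketch below applies the same state update at every step. It is not the paper's model, data, or training setup: `recurrent_count`, `modular_count`, and the `target` and `modulus` parameters are hypothetical names chosen for this example, and the modular variant is included only to contrast bounded with unbounded counting.

```python
from typing import List


def recurrent_count(tokens: List[str], target: str = "a") -> List[int]:
    """Running count of `target`, carried in one recurrent state (illustrative only)."""
    state = 0
    outputs = []
    for tok in tokens:
        if tok == target:
            state += 1  # the same update rule is applied at every position
        outputs.append(state)
    return outputs


def modular_count(tokens: List[str], target: str = "a", modulus: int = 10) -> List[int]:
    """Running count of `target` modulo `modulus`, so the state stays bounded."""
    state = 0
    outputs = []
    for tok in tokens:
        if tok == target:
            state = (state + 1) % modulus  # wraps around instead of growing
        outputs.append(state)
    return outputs


if __name__ == "__main__":
    short = ["a"] * 5      # stands in for a "train short" length
    long = ["a"] * 1000    # stands in for a "test long" length
    print(recurrent_count(short)[-1], recurrent_count(long)[-1])
    print(modular_count(["a"] * 12, modulus=10)[-1])
```

Because the update never refers to the absolute position, nothing in this sketch changes when the input grows from 5 to 1000 tokens. A learned model only behaves this way if its architecture or training pushes it toward such a step-wise rule.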

Abstract

Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano's axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer "reasoning" to the simplest case of counting, investigating length generalization does occur throughout the literature. In the "train short, test long" paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automaton. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. For all cases, counting is central to task success. And crucially,…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The generalization splits and counting tasks seem reasonable to me, clearly thought out well. The different performance across all of the kinds of positional encodings is very interesting, and I love the contrast between modular counting and typical unbounded state counting. It's really interesting to focus on the positional embeddings as the design choice that might build in an inductive bias towards being able to count. It seems like core cognitive skills are implicitly built into different a

Weaknesses

The abstract is very wordy and spends too much time explaining why counting is important. I would shorten the first half of the abstract to two or three sentences. The quote at the beginning of the introduction, I’m not sure who the quote is being attributed to. I am surprised you didn’t cite the bootstrapping counting paper by Steve Piantadosi, Josh Tenenbaum, and Noah Goodman. This seems like an important citation, as this paper makes good on Carey’s earlier ideas in a computational framework

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. This work focuses on a specific problem, the counting task, for the language models. The authors conduct many experiments to investigate the ability to count systematically.
2. This paper not only focuses on the standard transformer architectures but also investigates many popular modern architectures.

Weaknesses

1. **Lack of Insights:** Although this work conducts many experiments to support their findings, it offers limited insights into the reasons behind the poor performance of Transformers and modern RNNs. I encourage the authors to provide more intuition or explanations for the observed empirical phenomena.
2. **Lack of Generality:** This paper focuses on the counting task, which I acknowledge is an important task. However, it remains unclear how performance on this task influences real-world appl

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- Comprehensive empirical evaluation of the 5 position embeddings.
- Analysis is reasonably thorough, e.g. I like the finding that RoPE fails to do modular counting, but not if there is a BOS token.

Weaknesses

Counting setting far removed from practical language models. Issues with overclaiming in the interpretation and discussion of results. Specifically:
- "poor results for 1L and 2L models suggest that counting in Transformers may require a non-trivial computation budget". I do not think the results are strong enough to support this claim. Firstly, 4 layers are still only a small fraction of most practical language models (e.g. even llama 8B has 32 layers), so "non-trivial" is somewhat of a stre

Code & Models

Repositories

zdxdsw/inductive_counting_with_lms
PyTorch · Official

Videos

Language Models Need Inductive Biases to Count Inductively · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques