On the Bias of Next-Token Predictors Toward Systematically Inefficient Reasoning: A Shortest-Path Case Study
Riccardo Alberghi, Elizaveta Demyanenko, Luca Biggio, Luca Saglietti

TL;DR
This study investigates how the structure and efficiency of reasoning traces affect the generalization of language models. With the same training-token budget, models trained on longer, redundant reasoning traces generalize better to unseen graphs than models trained on optimal traces.
Contribution
The paper introduces a controlled shortest-path task to analyze the impact of reasoning trace efficiency on model generalization, highlighting the importance of coherent, incremental reasoning.
Findings
With the same training-token budget, models trained on longer, inefficient traces generalize better to unseen graphs than models trained on optimal traces.
Redundancy in reasoning traces can improve model performance.
Coherent, incremental reasoning traces facilitate training and generalization.
Abstract
Recent advances in natural language processing highlight two key factors for improving reasoning in large language models (LLMs): (i) allocating more test-time compute tends to help on harder problems but often introduces redundancy in the reasoning trace, and (ii) compute is most effective when reasoning is systematic and incremental, forming structured chains of thought (CoTs) akin to human problem-solving. To study these factors in isolation, we introduce a controlled setting based on shortest-path tasks in layered graphs. We train decoder-only transformers on question-trace-answer triples using a custom tokenizer, comparing models trained on optimal bottom-up dynamic programming traces with those trained on longer, valid traces involving backtracking. Surprisingly, with the same training-token budget, models trained on inefficient traces generalize better to unseen graphs. This…
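To make the two trace styles concrete, here is a minimal Python sketch of a shortest-path problem on a layered graph, solved once with bottom-up dynamic programming and once with an exhaustive depth-first search that backtracks. It illustrates the contrast only and is not the paper's setup: the graph generator, the fully connected layers, the integer weights, the "any node in the first layer to any node in the last layer" objective, and the names `make_layered_graph`, `dp_trace`, and `dfs_trace` are all assumptions, and the trace tuples stand in for whatever token format the authors' custom tokenizer uses.

```python
import random


def make_layered_graph(num_layers, width, seed=0):
    """Hypothetical generator: consecutive layers are fully connected.

    weights[i][u][v] is the cost of the edge from node u in layer i
    to node v in layer i + 1 (random integers in 1..9, an assumption).
    """
    rng = random.Random(seed)
    return [
        [[rng.randint(1, 9) for _ in range(width)] for _ in range(width)]
        for _ in range(num_layers - 1)
    ]


def dp_trace(weights):
    """Bottom-up DP over cost-to-go, starting from the last layer.

    Returns (best_cost, trace); the trace has one step per node per
    non-final layer, so its length is linear in the number of nodes.
    """
    width = len(weights[0])
    cost = [0] * width  # cost-to-go of nodes in the final layer
    trace = []
    for i in range(len(weights) - 1, -1, -1):
        new_cost = []
        for u in range(width):
            best = min(weights[i][u][v] + cost[v] for v in range(width))
            new_cost.append(best)
            trace.append(("cost_to_go", i, u, best))
        cost = new_cost
    return min(cost), trace


def dfs_trace(weights):
    """Exhaustive depth-first search that records every move and backtrack.

    Returns (best_cost, trace); the trace is valid but far longer than
    the DP trace because shared subproblems are re-explored.
    """
    width = len(weights[0])
    trace = []
    best = [float("inf")]

    def visit(layer, node, so_far):
        if layer == len(weights):  # reached the final layer
            best[0] = min(best[0], so_far)
            return
        for v in range(width):
            trace.append(("go", layer, node, v))
            visit(layer + 1, v, so_far + weights[layer][node][v])
            trace.append(("back", layer + 1, v))

    for start in range(width):
        visit(0, start, 0)
    return best[0], trace


if __name__ == "__main__":
    g = make_layered_graph(num_layers=4, width=3, seed=1)
    dp_cost, dp_steps = dp_trace(g)
    dfs_cost, dfs_steps = dfs_trace(g)
    print("same answer:", dp_cost == dfs_cost)
    print("trace lengths (DP, DFS):", len(dp_steps), len(dfs_steps))
```

Both procedures reach the same shortest-path cost, but the backtracking trace grows much faster with graph size than the dynamic programming trace. This is the sense in which a trace can be valid yet inefficient.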
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Natural Language Processing Techniques
