TL;DR
This paper investigates the generalization capabilities of language models in shortest-path planning, revealing strengths in spatial transfer but persistent failures in length scaling due to recursive instability.
Contribution
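The paper introduces a controlled synthetic environment, based on shortest-path planning, that separates the factors shaping model generalization in problem solving: training data, training paradigms, and inference-time strategies.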
Findings
Models exhibit strong spatial transfer to unseen maps.
Models consistently fail to scale to longer-horizon problems, a failure attributed to recursive instability.
Training data coverage sets capability limits, and inference-time scaling cannot fix length-scaling failures.
Abstract
Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training…
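The two generalization axes described in the abstract can be illustrated with a minimal sketch. It assumes grid maps with random walls and breadth-first search as the shortest-path oracle; the paper's actual environment, map format, and split thresholds are not given here, so every name and parameter below (`make_grid_map`, `build_splits`, `max_train_len`, and so on) is hypothetical.

```python
from collections import deque
import random


def make_grid_map(size, wall_prob, seed):
    """Hypothetical map generator: size x size grid, True marks a free cell."""
    rng = random.Random(seed)
    return [[rng.random() >= wall_prob for _ in range(size)] for _ in range(size)]


def shortest_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid; returns a cell list or None."""
    size = len(grid)
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parents back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < size and 0 <= nc < size
            if inside and grid[nr][nc] and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable


def sample_problems(grid, n, rng):
    """Sample up to n solvable (start, goal, path) problems from one map."""
    free = [(r, c) for r, row in enumerate(grid) for c, ok in enumerate(row) if ok]
    problems = []
    if len(free) < 2:
        return problems
    for _ in range(n):
        start, goal = rng.sample(free, 2)
        path = shortest_path(grid, start, goal)
        if path is not None:
            problems.append((start, goal, path))
    return problems


def build_splits(num_maps=6, held_out_maps=2, max_train_len=5,
                 size=6, per_map=40, seed=0):
    """Assumed split scheme, not the paper's:
    train        = seen maps, paths of at most max_train_len steps
    spatial_test = held-out maps, same length range (spatial transfer)
    length_test  = seen maps, longer paths (length scaling)
    """
    rng = random.Random(seed)
    train, spatial_test, length_test = [], [], []
    for map_id in range(num_maps):
        grid = make_grid_map(size, 0.2, seed + map_id)
        held_out = map_id >= num_maps - held_out_maps
        for start, goal, path in sample_problems(grid, per_map, rng):
            steps = len(path) - 1
            item = (map_id, start, goal, path)
            if held_out:
                if steps <= max_train_len:
                    spatial_test.append(item)
            elif steps <= max_train_len:
                train.append(item)
            else:
                length_test.append(item)
    return train, spatial_test, length_test
```

Keeping the length range fixed in `spatial_test` and the maps fixed in `length_test` is one way to vary each axis independently, so a failure on one split is not confounded with the other.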
