Too long; didn't solve
Lucía M. Cabrera, Isaac Saxton-Knight

TL;DR
This study examines how prompt and solution lengths relate to large language models' performance on math problems, finding that longer prompts and solutions are associated with higher failure rates and with greater problem difficulty.
Contribution
It analyses how structural properties such as prompt and solution length relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems.
Findings
Longer prompt and solution lengths correlate with higher failure rates.
Structural length variables are linked to increased difficulty and model failure.
Under a difficulty-adjusted normalised analysis, both length variables show weak negative associations with model separation, slightly stronger for prompt length.
Abstract
Mathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. We find that both prompt and solution lengths correlate positively with increased model failure across models. We also include a secondary, exploratory analysis of cross-model disagreement. Under a difficulty-adjusted normalised analysis, both variables retain weak negative associations with realised model separation, slightly stronger for prompt length. Overall, our main robust finding is that structural length is linked to…
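As an illustration of the kind of length-versus-failure association the abstract describes, the sketch below computes a Spearman rank correlation between prompt length and per-problem failure rate. It is a minimal sketch under stated assumptions, not the authors' method: the abstract does not say which correlation statistic was used, and the names `prompt_tokens` and `failure_rate`, along with all the numbers, are hypothetical toy values made up for the example.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data (NOT from the paper): one entry per problem.
prompt_tokens = [40, 85, 120, 200, 310, 450]      # assumed prompt lengths
failure_rate = [0.0, 0.25, 0.25, 0.5, 0.75, 1.0]  # assumed share of models failing

rho = spearman(prompt_tokens, failure_rate)
print(f"Spearman rho (toy data): {rho:.3f}")
```

A rank-based statistic is used here only because length distributions tend to be skewed; the same helper could be applied to solution length in place of prompt length.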
