TL;DR
Brevity constraints in prompts can reverse performance rankings among language models: large models' capabilities are masked by overelaborated responses, and prompting for concise answers recovers them.
Contribution
This work demonstrates that prompt brevity can invert performance rankings, highlighting the importance of scale-aware prompt engineering for accurate evaluation and deployment.
Findings
Constraining large models to produce brief responses improves accuracy by 26 percentage points.
Brevity constraints reverse performance hierarchies on mathematical and scientific benchmarks.
Inverse scaling persists across the full parameter spectrum, with dataset-specific optimal scales.
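The arithmetic behind these findings can be sketched as follows. This is a minimal illustration with made-up toy grades, not the paper's evaluation code; the function names (`accuracy`, `gap_pp`, `hierarchy_reversed`) and all data values are assumptions introduced here.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Share of problems answered correctly, expressed as a percentage."""
    if not correct:
        raise ValueError("need at least one graded answer")
    return 100.0 * sum(correct) / len(correct)


def gap_pp(small: Sequence[bool], large: Sequence[bool]) -> float:
    """Small-model accuracy minus large-model accuracy, in percentage points.

    A positive value means the smaller model is ahead (inverse scaling).
    """
    return accuracy(small) - accuracy(large)


def hierarchy_reversed(gap_before: float, gap_after: float) -> bool:
    """True when the sign of the gap flips between two prompt conditions."""
    return gap_before * gap_after < 0


# Toy per-problem grades (hypothetical, for illustration only).
small_default = [True, True, True, False]
large_default = [True, False, False, False]
small_brief = [True, True, False, False]
large_brief = [True, True, True, False]

gap_default = gap_pp(small_default, large_default)
gap_brief = gap_pp(small_brief, large_brief)
large_gain = accuracy(large_brief) - accuracy(large_default)

print(f"gap with default prompt: {gap_default:+.1f} pp")
print(f"gap with brevity prompt: {gap_brief:+.1f} pp")
print(f"large-model gain from brevity: {large_gain:+.1f} pp")
print(f"ranking reversed: {hierarchy_reversed(gap_default, gap_brief)}")
```

Reporting differences in percentage points rather than relative percent keeps the gap and the gain on the same scale, which is how the figures above are stated.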
Abstract
Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate this reflects correctable prompt design rather than fundamental capability limitations. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9…
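One way to picture how a share of inverse-scaling problems might be measured is sketched below. The abstract does not state the exact criterion used to flag a problem, so the `min_gap` threshold, the function name `inverse_scaling_problems`, and the toy accuracies are all assumptions made for illustration.

```python
from typing import Dict, List


def inverse_scaling_problems(
    small_acc: Dict[str, float],
    large_acc: Dict[str, float],
    min_gap: float = 0.0,
) -> List[str]:
    """Return IDs of problems where the small-model group beats the
    large-model group by more than `min_gap` (accuracies in [0, 1]).

    Only problems present in both mappings are compared.
    """
    shared = sorted(set(small_acc) & set(large_acc))
    return [pid for pid in shared if small_acc[pid] - large_acc[pid] > min_gap]


# Hypothetical per-problem mean accuracies for each model group.
small_acc = {"p1": 0.9, "p2": 0.4, "p3": 0.7, "p4": 0.5}
large_acc = {"p1": 0.6, "p2": 0.8, "p3": 0.9, "p4": 0.5}

flagged = inverse_scaling_problems(small_acc, large_acc)
share = 100.0 * len(flagged) / len(small_acc)

print(f"flagged problems: {flagged}")
print(f"share of benchmark: {share:.1f}%")
```

Raising `min_gap` restricts the flagged set to problems with a larger deficit for the large-model group; ties are never flagged because the comparison is strict.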
