TL;DR
This paper investigates how reasoning chain length relates to accuracy in large language models. It finds that newer models gain accuracy through more efficient reasoning rather than longer reasoning chains, with implications for model evaluation and scaling.
Contribution
The study systematically analyzes chain-of-thought length and accuracy across o1-mini and o3-mini variants on the Omni-MATH benchmark, showing that the performance gains come from more effective reasoning rather than longer chains.
Findings
o3-mini (m) achieves higher accuracy than o1-mini without requiring longer reasoning chains
Accuracy generally declines as reasoning chains grow, across all models and compute settings, even when controlling for question difficulty
More proficient models use test-time compute more efficiently
Abstract
Large language models have demonstrated remarkable progress in mathematical reasoning, leveraging chain-of-thought and test-time compute scaling. However, many open questions remain regarding the interplay between reasoning token usage and accuracy gains. In particular, when comparing models across generations, it is unclear whether improved performance results from longer reasoning chains or more efficient reasoning. We systematically analyze chain-of-thought length across o1-mini and o3-mini variants on the Omni-MATH benchmark, finding that o3-mini (m) achieves superior accuracy without requiring longer reasoning chains than o1-mini. Moreover, we show that accuracy generally declines as reasoning chains grow across all models and compute settings, even when controlling for difficulty of the questions. This accuracy drop is significantly smaller in more proficient models, suggesting…
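One simple way to picture "controlling for difficulty" is to compare accuracy on shorter and longer reasoning chains within each difficulty tier, so that hard questions (which tend to draw longer chains) do not dominate the comparison. The sketch below is an illustration only and is not the paper's analysis method. The function name, the record layout `(difficulty_tier, reasoning_tokens, is_correct)`, the median split, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median


def accuracy_by_length_within_difficulty(records):
    """Compare accuracy of shorter vs. longer reasoning chains per difficulty tier.

    records: iterable of (difficulty_tier, reasoning_tokens, is_correct).
    This layout is hypothetical, chosen only for this illustration.
    """
    by_tier = defaultdict(list)
    for tier, tokens, correct in records:
        by_tier[tier].append((tokens, bool(correct)))

    result = {}
    for tier, rows in by_tier.items():
        # Split each tier at its own median token count, so "short" and
        # "long" are defined relative to questions of similar difficulty.
        cutoff = median(tokens for tokens, _ in rows)
        short = [correct for tokens, correct in rows if tokens <= cutoff]
        long_ = [correct for tokens, correct in rows if tokens > cutoff]
        result[tier] = {
            "short": sum(short) / len(short) if short else None,
            "long": sum(long_) / len(long_) if long_ else None,
        }
    return result


# Toy, made-up records for illustration; not data from the paper.
toy = [
    (1, 100, True), (1, 200, True), (1, 900, False), (1, 1000, True),
    (2, 300, True), (2, 400, False), (2, 1500, False), (2, 2000, False),
]
summary = accuracy_by_length_within_difficulty(toy)
for tier in sorted(summary):
    print(tier, summary[tier])
```

If the "long" accuracy is lower than the "short" accuracy inside every tier, the decline cannot be explained by difficulty alone, at least at the granularity of the tiers used. A regression with difficulty as a covariate would be a finer-grained alternative to this median split.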
