Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models
Ákos Prucs, Márton Csutora, Mátyás Antal, Márk Marosi

TL;DR
This paper evaluates open-source large language models on reasoning tasks by both accuracy and computational cost, mapping the resulting Pareto frontier and identifying the Mixture of Experts architecture as a strong candidate for balancing performance and efficiency in the authors' evaluation setting.
Contribution
It introduces a test-time compute-aware evaluation framework for open-source LLMs, mapping their Pareto frontiers and analyzing efficiency trends over time.
Findings
Mixture of Experts models balance performance and efficiency
Accuracy gains plateau beyond certain compute thresholds
Emergent trend shows improved accuracy per compute unit over time
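
As an illustration of what a compute-accuracy Pareto frontier means, here is a minimal Python sketch. It is not the authors' implementation. The `Run` type, the `pareto_frontier` function, the model labels, and all numbers are hypothetical, and the compute unit is left unspecified. A run is on the frontier if no other run reaches equal or higher accuracy at equal or lower compute.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    model: str       # hypothetical model label
    compute: float   # inference cost, lower is better (unit is an assumption)
    accuracy: float  # benchmark accuracy in [0, 1], higher is better


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the non-dominated runs, ordered by increasing compute."""
    frontier: List[Run] = []
    best_accuracy = float("-inf")
    # Sort by ascending compute; at equal compute, try the most accurate run first.
    for run in sorted(runs, key=lambda r: (r.compute, -r.accuracy)):
        # A run is kept only if it beats every cheaper (or equally cheap) run.
        # Exact duplicates of a kept run are therefore dropped.
        if run.accuracy > best_accuracy:
            frontier.append(run)
            best_accuracy = run.accuracy
    return frontier


# Made-up data for illustration only; these are not results from the paper.
example_runs = [
    Run("model-a", 1.0, 0.40),
    Run("model-b", 2.0, 0.62),
    Run("model-c", 3.0, 0.55),  # dominated by model-b: costs more, scores less
    Run("model-d", 8.0, 0.70),
    Run("model-e", 8.0, 0.68),  # dominated by model-d: same cost, scores less
]

if __name__ == "__main__":
    for run in pareto_frontier(example_runs):
        print(f"{run.model}: compute={run.compute}, accuracy={run.accuracy}")
```

Sorting once and tracking the best accuracy seen so far keeps the sketch at O(n log n) rather than comparing every pair of runs.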
Abstract
Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an…
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
