Unified Deployment-Aware Evaluation of Open Reasoning Language Models
Md Motaleb Hossen Manik, Ge Wang

TL;DR
This paper introduces a deployment-aware evaluation framework for open reasoning language models. It evaluates seven model configurations on four benchmarks under three prompting strategies and reports accuracy alongside latency, memory, and robustness metrics.
Contribution
It presents a unified, multi-objective evaluation methodology that considers accuracy, latency, memory, and robustness, enabling better practical deployment decisions for open reasoning models.
Findings
Gemma-4-26B-A4B with zero-shot prompting achieves the highest weighted score of 0.794.
Gemma-4-E4B offers a strong balance of performance with lower latency and lower memory usage.
Prompting strategies influence model rankings and reveal robustness and interface issues.
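The trade-off among accuracy, latency, and memory described in these findings can be illustrated with a Pareto filter. The sketch below is not the paper's implementation: the `OperatingPoint` type, the `dominates` and `pareto_front` helpers, and all numbers are hypothetical placeholders, and the paper's exact dominance criteria and aggregate weights are not stated in this summary.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    """One model/prompting configuration (hypothetical fields)."""
    name: str
    accuracy: float   # higher is better
    latency_s: float  # lower is better
    vram_gb: float    # lower is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.latency_s <= b.latency_s
        and a.vram_gb <= b.vram_gb
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.latency_s < b.latency_s
        or a.vram_gb < b.vram_gb
    )
    return no_worse and strictly_better


def pareto_front(points: list[OperatingPoint]) -> list[OperatingPoint]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Placeholder values for illustration only; these are NOT results from the paper.
points = [
    OperatingPoint("config-a", 0.80, 4.0, 50.0),
    OperatingPoint("config-b", 0.76, 1.5, 16.0),
    OperatingPoint("config-c", 0.70, 2.0, 18.0),  # dominated by config-b
    OperatingPoint("config-d", 0.60, 0.8, 6.0),
]
front_names = [p.name for p in pareto_front(points)]
```

A filter like this makes no assumption about how the objectives are weighted, which is why it can complement a single weighted score: a configuration can sit on the front without having the top aggregate score.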
Abstract
Open reasoning language models are often compared under mixed sample sizes, partially standardized prompts, and accuracy-centered summaries, which makes practical model selection difficult to interpret. We present a unified evaluation of seven open reasoning language model configurations across four benchmarks: ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1. We test zero-shot, chain-of-thought (CoT), and few-shot CoT prompting on the same 238-example subset for every model-dataset-strategy condition, yielding a complete 7 × 4 × 3 design with 84 conditions and 19,992 evaluated examples. Beyond accuracy, we report Wilson confidence intervals, latency, peak video random access memory (VRAM), weighted aggregate performance, Pareto-efficient operating points, prompt-sensitivity metrics, and compatibility diagnostics. Gemma-4-26B-A4B with zero-shot prompting achieves the…
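The design arithmetic and the Wilson confidence intervals mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the `wilson_interval` helper, the 95% level (`z = 1.96`), and the example count of 190 correct answers are assumptions made for the sketch.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Design size as stated in the abstract.
N_MODELS, N_DATASETS, N_STRATEGIES, N_EXAMPLES = 7, 4, 3, 238
n_conditions = N_MODELS * N_DATASETS * N_STRATEGIES  # 84 conditions
n_evaluated = n_conditions * N_EXAMPLES              # 19,992 evaluated examples

# Hypothetical condition: 190 of 238 answers correct (not a figure from the paper).
low, high = wilson_interval(190, N_EXAMPLES)
print(f"accuracy = {190 / N_EXAMPLES:.3f}, 95% Wilson CI = [{low:.3f}, {high:.3f}]")
```

With 238 examples per condition, an interval of this kind is roughly ten percentage points wide at accuracies near 0.8, which is useful context when comparing conditions whose accuracies differ by only a few points.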
