The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models
Claudia Herambourg, Dawid Siuda, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, Joanna Śmietańska-Nowak

TL;DR
The ORCA Benchmark assesses large language models' real-world calculation accuracy across multiple domains, revealing significant error patterns and domain-specific strengths and weaknesses in quantitative reasoning.
Contribution
Introduces ORCA, a comprehensive benchmark for evaluating LLMs on multi-domain, real-life quantitative reasoning with verified outputs, emphasizing step-by-step reasoning and error analysis.
Findings
Models achieve 45-63% accuracy on average.
Errors mainly due to rounding and calculation mistakes.
Strengths in mathematics and engineering, weaknesses in physics.
Abstract
We present the ORCA (Omni Research on Calculation in AI) Benchmark, a novel benchmark that evaluates large language models (LLMs) on multi-domain, real-life quantitative reasoning using verified outputs from Omni's calculator engine. In 500 natural-language tasks across domains such as finance, physics, health, and statistics, five state-of-the-art systems (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2) achieved only 45-63% accuracy, with errors mainly related to rounding and calculation mistakes. Results in specific domains indicate strengths in mathematics and engineering, but weaknesses in physics and natural sciences. Correlation analysis shows that the models often fail together but differ in the types of errors they make, highlighting their partial complementarity rather than…
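To make the distinction between rounding errors and calculation mistakes concrete, here is a minimal Python sketch of how a numeric answer might be scored against a verified reference value. It is an illustration only, not the paper's scoring procedure: the function names `classify_answer` and `accuracy`, the labels, and both tolerance values are assumptions chosen for the example.

```python
import math


def classify_answer(predicted: float, reference: float,
                    exact_tol: float = 1e-9,
                    rounding_tol: float = 0.01) -> str:
    """Label a numeric answer as 'correct', 'rounding', or 'calculation'.

    Both tolerances are hypothetical; the benchmark's real criteria
    may differ.
    """
    # Treat the answer as correct when it matches the reference
    # up to floating-point noise.
    if math.isclose(predicted, reference, rel_tol=exact_tol, abs_tol=exact_tol):
        return "correct"
    # A small relative deviation is assumed to come from rounding
    # (for example, rounding an intermediate value too early).
    if math.isclose(predicted, reference, rel_tol=rounding_tol):
        return "rounding"
    # Anything further off is counted as a calculation mistake.
    return "calculation"


def accuracy(labels: list[str]) -> float:
    """Return the fraction of answers labelled 'correct'."""
    return sum(label == "correct" for label in labels) / len(labels)


# Hypothetical (model answer, verified answer) pairs.
pairs = [(1250.0, 1250.0), (1250.4, 1250.0), (1500.0, 1250.0), (3.5, 3.5)]
labels = [classify_answer(p, r) for p, r in pairs]
print(labels, accuracy(labels))
```

Because the rounding check uses a relative tolerance, a nonzero answer compared against a reference of exactly zero is always labelled a calculation mistake; a real scorer would need an explicit rule for that case.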
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)
