The ORCA Benchmark: Evaluating Real-World Calculation Accuracy in Large Language Models

Claudia Herambourg; Dawid Siuda; Julia Kopczyńska; Joao R. L. Santos; Wojciech Sas; Joanna Śmietańska-Nowak

arXiv:2511.02589·cs.AI·November 6, 2025

Open Access

TL;DR

The ORCA Benchmark assesses large language models' real-world calculation accuracy across multiple domains, revealing significant error patterns and domain-specific strengths and weaknesses in quantitative reasoning.

Contribution

Introduces ORCA, a comprehensive benchmark for evaluating LLMs on multi-domain, real-life quantitative reasoning with verified outputs, emphasizing step-by-step reasoning and error analysis.

Findings

01 The five evaluated models reach only 45-63% accuracy across 500 tasks.

02 Errors are mainly due to rounding (35%) and calculation mistakes (33%).

03 Models are stronger in mathematics and engineering and weaker in physics and natural sciences.

Abstract

We present ORCA (Omni Research on Calculation in AI) Benchmark - a novel benchmark that evaluates large language models (LLMs) on multi-domain, real-life quantitative reasoning using verified outputs from Omni's calculator engine. In 500 natural-language tasks across domains such as finance, physics, health, and statistics, the five state-of-the-art systems (ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2) achieved only 45-63% accuracy, with errors mainly related to rounding (35%) and calculation mistakes (33%). Results in specific domains indicate strengths in mathematics and engineering, but weaknesses in physics and natural sciences. Correlation analysis (r ≈ 0.40-0.65) shows that the models often fail together but differ in the types of errors they make, highlighting their partial complementarity rather than…
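
To make the correlation figure concrete, here is a minimal sketch of how agreement between two models could be measured. It rests on assumptions that the abstract does not confirm: that each task is scored as correct (1) or incorrect (0), and that r is a Pearson correlation over those per-task scores. The function name `pearson` and the two score lists are hypothetical toy data, not results from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-task outcomes (1 = correct, 0 = incorrect) for two models.
# These values are invented for illustration only.
model_a = [1, 1, 0, 0, 1, 0, 1, 1]
model_b = [1, 0, 0, 0, 1, 1, 1, 1]

r = pearson(model_a, model_b)

# Tasks that both models got wrong: the "fail together" cases.
both_wrong = sum(1 for a, b in zip(model_a, model_b) if a == 0 and b == 0)

print(f"r = {r:.2f}, tasks failed by both models: {both_wrong}")
```

Under this reading, a moderate positive r means the two models tend to miss the same tasks, while the tasks missed by only one of them are where combining models could help.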

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Explainable Artificial Intelligence (XAI)