Evaluating LLM Reasoning in the Operations Research Domain with ORQA

Mahdi Mostajabdaveh; Timothy T. Yu; Samarendra Chandan Bindu Dash; Rindranirina Ramamonjison; Jabo Serge Byusa; Giuseppe Carenini; Zirui Zhou; Yong Zhang

arXiv:2412.17874 · cs.CL · February 11, 2025 · 2 cites

TL;DR

This paper introduces ORQA, a benchmark to evaluate LLMs' reasoning in Operations Research, revealing their limited generalization in complex, domain-specific problems and highlighting areas for future improvement.

Contribution

The paper presents ORQA, a new dataset and benchmark for assessing LLMs' reasoning in Operations Research, emphasizing the gap in their domain-specific generalization capabilities.

Findings

01. LLMs show modest performance on the ORQA benchmark.

02. Current LLMs struggle with complex, specialized optimization problems.

03. The dataset and evaluation tools are publicly available for future research.

Abstract

In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark designed to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark evaluates whether LLMs can emulate the knowledge and reasoning skills of OR experts when confronted with diverse and complex optimization problems. The dataset, developed by OR experts, features real-world optimization problems that demand multistep reasoning to construct their mathematical models. Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal their modest performance, highlighting a gap in their ability to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs' generalization capabilities, offering valuable insights for future research…

Taxonomy

Topics: Business Process Modeling and Analysis · Semantic Web and Ontologies

Methods: LLaMA