RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios
Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang

TL;DR
RuleArena is a new benchmark that evaluates large language models' ability to follow complex, real-world rules across various practical domains, revealing current limitations and potential improvements in rule-guided reasoning.
Contribution
The paper introduces RuleArena, a challenging benchmark grounded in real-world scenarios to assess LLMs' rule-following and reasoning abilities beyond traditional logic-based tasks.
Findings
LLMs struggle to correctly identify and apply relevant rules.
Mathematical computation accuracy in LLMs is limited.
External tools significantly improve LLM performance on reasoning tasks.
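The third finding can be illustrated by moving rule application and arithmetic out of the model and into deterministic code. The following is a minimal sketch of that idea, not the paper's implementation: the fee schedule, the `Bag` type, and the `baggage_fee` function are hypothetical and come from neither the benchmark nor any airline's policy. The assumption is that a model only extracts structured facts (here, bag weights), while an ordinary function applies the rules and does the sums.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical fee schedule, invented purely for illustration.
BASE_FEE_BY_POSITION = [0, 40, 150]  # 1st bag free, 2nd bag 40, 3rd and later 150
OVERWEIGHT_LIMIT_KG = 23             # assumed weight threshold
OVERWEIGHT_SURCHARGE = 100           # assumed flat surcharge per overweight bag


@dataclass
class Bag:
    weight_kg: float


def baggage_fee(bags: List[Bag]) -> int:
    """Apply the hypothetical rules deterministically and return the total fee."""
    total = 0
    for position, bag in enumerate(bags):
        # Rule 1: the base fee depends on the bag's position in the list;
        # positions past the end of the table reuse the last entry.
        capped = min(position, len(BASE_FEE_BY_POSITION) - 1)
        total += BASE_FEE_BY_POSITION[capped]
        # Rule 2: bags strictly above the weight limit incur a flat surcharge.
        if bag.weight_kg > OVERWEIGHT_LIMIT_KG:
            total += OVERWEIGHT_SURCHARGE
    return total


if __name__ == "__main__":
    # Facts a model might extract from a passenger's description (assumed input).
    extracted = [Bag(20), Bag(25), Bag(10)]
    print(baggage_fee(extracted))
```

In this arrangement the model's job shrinks to identifying which facts matter, and the computation itself cannot drift, which is one plausible reason tool use helps on tasks of this kind.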
Abstract
This paper introduces RuleArena, a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions that demand long-context understanding, logical reasoning, and accurate mathematical computation. Two key attributes distinguish RuleArena from traditional rule-based reasoning benchmarks: (1) it extends beyond standard first-order logic representations, and (2) it is grounded in authentic, practical scenarios, providing insights into the suitability and reliability of LLMs for real-world applications. Our findings reveal several notable limitations in LLMs: (1) they struggle to identify and apply the appropriate rules,…
Taxonomy
Topics: Semantic Web and Ontologies · Business Process Modeling and Analysis · Multi-Agent Systems and Negotiation
