LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
Ming Zhang, Qiyuan Peng, Yinxi Wei, Yujiong Shen, Kexin Tan, Yuhui Wang, Zhenghao Xiang, Junjie Ye, Zhangyue Yin, Zhiheng Xi, Shihan Dou, Tao Gui, Maxm Pan, Ruizhi Yang, Qi Zhang, Xuanjing Huang

TL;DR
LLMEval-Logic is a Chinese logical-reasoning benchmark for LLMs. Its items are built from realistic situational scenarios, its annotated answers are verified with the Z3 solver, and selected items are hardened through a closed-loop adversarial workflow. Results on it expose the limits of current models.
Contribution
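The paper introduces a Chinese logical-reasoning benchmark with expert-audited reference formalizations, solver-verified answers, and adversarial hardening. It targets the weaknesses of earlier templated datasets: coarse or unaudited formal annotations and rapid saturation by frontier reasoning models.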
Findings
The best model achieves only 37.5% accuracy on hard items.
Formalization scores are generally low across models; the highest is 60.16%.
The benchmark exposes significant gaps in current LLM logical reasoning capabilities.
Abstract
Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400…
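The answer-verification step mentioned in the abstract can be illustrated with a small sketch. The paper uses Z3; the code below swaps in a brute-force truth-table check in plain Python so that it runs without dependencies, and it only handles propositional premises. The function names (`entails`, `verify_label`), the example scenario, and the three-way label set `"True"` / `"False"` / `"Unknown"` are assumptions made for illustration, not the benchmark's actual format or pipeline.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """Return True if every assignment that satisfies all premises
    also satisfies the conclusion (propositional entailment)."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # found a counter-model
    return True

def verify_label(premises, conclusion, variables, label):
    """Check a hypothetical annotated label against the derived status.
    Note: inconsistent premises entail everything and derive "True" here."""
    if entails(premises, conclusion, variables):
        derived = "True"
    elif entails(premises, lambda env: not conclusion(env), variables):
        derived = "False"
    else:
        derived = "Unknown"
    return derived == label

# Hypothetical item: "If it rains, the road is wet. If the road is wet,
# it is slippery. It rains."
variables = ["rain", "wet", "slippery"]
premises = [
    lambda e: (not e["rain"]) or e["wet"],      # rain -> wet
    lambda e: (not e["wet"]) or e["slippery"],  # wet -> slippery
    lambda e: e["rain"],                        # rain
]

ok_true = verify_label(premises, lambda e: e["slippery"], variables, "True")
ok_false = verify_label(premises, lambda e: not e["wet"], variables, "False")
# Without the fact "rain", slipperiness is neither entailed nor refuted.
ok_unknown = verify_label(premises[:2], lambda e: e["slippery"], variables, "Unknown")
```

A solver-based version would replace the enumeration with a satisfiability query: assert the premises together with the negated conclusion and treat an unsatisfiable result as entailment. That approach also extends beyond propositional logic, which exhaustive enumeration does not.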
