LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

Ming Zhang; Qiyuan Peng; Yinxi Wei; Yujiong Shen; Kexin Tan; Yuhui Wang; Zhenghao Xiang; Junjie Ye; Zhangyue Yin; Zhiheng Xi; Shihan Dou; Tao Gui; Maxm Pan; Ruizhi Yang; Qi Zhang; Xuanjing Huang

arXiv:2605.19597 · cs.CL · May 20, 2026

TL;DR

LLMEval-Logic is a Chinese logical reasoning benchmark for LLMs, built from realistic scenarios, verified with formal methods, and hardened through adversarial workflows, revealing current models' limitations.

Contribution

It introduces a novel Chinese logical reasoning benchmark with expert verification and adversarial hardening, addressing limitations of previous templated datasets.

Findings

1. The best model achieves only 37.5% accuracy on hard items.

2. Formalization scores among models are generally low, with a maximum of 60.16%.

3. The benchmark exposes significant gaps in current LLM logical reasoning capabilities.

Abstract

Evaluating large language models (LLMs) on natural-language logical reasoning is essential because rule-governed tasks require conclusions to follow strictly from stated premises. Many existing logical-reasoning benchmarks are generated by templating natural-language items from sampled formulas, provide only coarse or unaudited formal annotations, and are now quickly saturated by frontier reasoning models. We present LLMEval-Logic, a Chinese logical reasoning benchmark built from realistic situational scenarios. Its pipeline forward-authors and expert-audits natural-language items together with their reference formalizations, verifies annotated answers with Z3, constructs expert rubrics for natural-to-formal grading, and hardens selected items through a closed-loop adversarial workflow. The benchmark is released in two paired subsets: a 246-item Base subset shipped with 1,400…
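
The abstract says annotated answers are verified with Z3. The sketch below illustrates the general idea behind that kind of check, using a plain truth-table search in place of a solver: a conclusion follows from the premises if no assignment makes every premise true and the conclusion false. It is not the paper's pipeline and does not call Z3. The function names, the three labels, and the toy item are all assumptions made for illustration.

```python
from itertools import product

def entails(premises, conclusion, variables):
    """Return True if every assignment satisfying all premises also satisfies the conclusion.

    Brute-force truth-table search (a stand-in for a solver call); only practical
    for a handful of Boolean variables.
    """
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample: premises hold, conclusion fails
    return True

def verify_label(premises, conclusion, variables):
    """Classify a conclusion against the premises (hypothetical label names).

    Assumes the premises are jointly satisfiable; inconsistent premises would
    make everything count as "entailed".
    """
    if entails(premises, conclusion, variables):
        return "entailed"
    if entails(premises, lambda env: not conclusion(env), variables):
        return "contradicted"
    return "undetermined"

# Hypothetical toy item: "If it rains, the road is wet. If the road is wet,
# it is slippery. It is raining."
variables = ["rain", "wet", "slippery", "cold"]
premises = [
    lambda e: (not e["rain"]) or e["wet"],      # rain -> wet
    lambda e: (not e["wet"]) or e["slippery"],  # wet -> slippery
    lambda e: e["rain"],                        # rain
]

label_slippery = verify_label(premises, lambda e: e["slippery"], variables)
label_dry = verify_label(premises, lambda e: not e["wet"], variables)
label_cold = verify_label(premises, lambda e: e["cold"], variables)
```

A solver-based check would presumably pose the same question as a satisfiability query (premises together with the negated conclusion) rather than enumerating assignments, which scales to richer logics than this propositional toy.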

Code & Models

Repositories

llmeval/LLMEval-Logic
github

Datasets

llmeval-fdu/LLMEval-Logic
dataset · 461 downloads
