From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs
Hanmeng Liu, Shichao Weng, Xiulai Liu, Zhicai Zhang, Anli Yan, Xiaozhang Liu

TL;DR
This paper introduces LogiHard, a formal framework that transforms simple selection tasks into complex logical judgments, revealing reasoning weaknesses in state-of-the-art language models through a new challenging dataset.
Contribution
The paper presents LogiHard, a novel framework for creating combinatorially hardened reasoning questions, and demonstrates its effectiveness in exposing reasoning failures in large language models.
Findings
Models show a 31% to 56% accuracy decline on hardened questions.
Unlike humans, LLMs exhibit multi-select failure and early-exit bias.
Zero-shot transfer to MMLU results in a 47% accuracy drop.
Abstract
Multiple-choice reasoning benchmarks face dual challenges: rapid saturation from advancing models and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, falling short of challenging advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, which significantly increases the thinking overhead and reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces,…
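As a rough illustration of how IRT can drive adaptive testing, the sketch below uses the standard two-parameter logistic (2PL) model and picks the next question by maximum Fisher information. The abstract does not say which IRT model or selection rule LogiHard uses, so the 2PL choice, the function names, and the example item parameters are all assumptions made for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability of a correct answer given ability theta,
    item discrimination a, and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information a 2PL item provides about ability at theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def pick_next_item(theta, items, asked):
    """Return the index of the unasked item that is most informative
    at the current ability estimate (a common CAT selection rule)."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

# Hypothetical item bank: (discrimination a, difficulty b) pairs.
items = [(1.0, -1.0), (1.5, 0.0), (0.8, 2.0)]

first = pick_next_item(0.0, items, asked=set())
second = pick_next_item(0.0, items, asked={first})
```

Because each question is chosen where it is most informative about the current ability estimate, an adaptive test of this kind can usually reach a given measurement precision with fewer questions than a fixed question set, which is the general property the abstract appeals to.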
