From 0-Order Selection to 2-Order Judgment: Combinatorial Hardening Exposes Compositional Failures in Frontier LLMs

Hanmeng Liu; Shichao Weng; Xiulai Liu; Zhicai Zhang; Anli Yan; Xiaozhang Liu

arXiv:2605.07268 · cs.CL · May 11, 2026

TL;DR

This paper introduces LogiHard, a formal framework that transforms simple selection tasks into complex logical judgments, revealing reasoning weaknesses in state-of-the-art language models through a new challenging dataset.

Contribution

The paper presents LogiHard, a novel framework for creating combinatorially hardened reasoning questions, and demonstrates its effectiveness in exposing reasoning failures in large language models.

Findings

01 Models show a 31% to 56% decline in accuracy on hardened questions.

02 Unlike humans, LLMs exhibit multi-select failures and an early-exit bias.

03 Zero-shot transfer to MMLU results in a 47% drop in accuracy.

Abstract

Multiple-choice reasoning benchmarks face two challenges: rapid saturation as models advance, and data contamination that undermines static evaluations. Ad-hoc hardening methods (paraphrasing, perturbation) attempt to increase difficulty but sacrifice logical validity for surface complexity, and so fail to challenge advanced reasoning models. We present LogiHard, a formal framework that deterministically transforms 0-order selection into 2-order logical judgment, significantly increasing the thinking overhead and the number of reasoning steps. The framework integrates Item Response Theory (IRT) for computerized adaptive testing (CAT), enabling precise difficulty control with fewer questions than static benchmarks. We instantiate LogiHard-2k, a logical reasoning dataset constructed by cognitively ranking high-stakes examination questions via 9-dimensional analysis of model thinking traces,…
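The abstract does not say which IRT model or item-selection rule LogiHard uses. As a rough illustration of how IRT can drive adaptive testing in general, the sketch below assumes the standard two-parameter logistic (2PL) model and the common maximum-information selection rule. The item bank, its parameter values, and the function names are hypothetical and are not taken from the paper.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function.

    theta: ability of the test taker (here, a model under evaluation)
    a:     item discrimination
    b:     item difficulty
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def pick_next_item(theta, items, asked):
    """Return the index of the unasked item that is most informative at theta."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: item_information(theta, *items[i]))


# Hypothetical item bank: (discrimination a, difficulty b) per question.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.5)]

# With a current ability estimate of 0.0, choose the first question to ask.
next_item = pick_next_item(0.0, items, asked=set())
```

Under these assumptions, an adaptive test repeatedly picks the most informative remaining item and updates the ability estimate after each answer, which is why it can reach a given measurement precision with fewer questions than a fixed question set.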
