How Order-Sensitive Are LLMs? OrderProbe for Deterministic Structural Reconstruction

Yingjie He; Zhaolu Kang; Kehan Jiang; Qianyuan Zhang; Jiachen Qian; Chunlei Meng; Yujie Feng; Yuan Wang; Jiabao Dou; Aming Wu; Leqi Zheng; Pengxiang Zhao; Jiaxin Liu; Zeyu Zhang; Lei Wang; Guansu Wang; Qishi Zhan; Xiaomin He; Meisheng Zhang; Jianyuan Ni

arXiv:2601.08626 · cs.CL · January 21, 2026

Open Access

TL;DR

This paper introduces OrderProbe, a benchmark for assessing how well large language models can reconstruct the original structure of scrambled Chinese, Japanese, and Korean expressions, revealing challenges in structural understanding beyond semantic accuracy.

Contribution

The paper presents OrderProbe, a novel deterministic benchmark and diagnostic framework for evaluating LLMs' ability to reconstruct internal structure from scrambled inputs.

Findings

01. Structural reconstruction remains difficult for current LLMs, including frontier systems: zero-shot recovery frequently falls below 35%.

02. Semantic recall and structural planning are dissociated, which suggests they are separate capabilities.

03. Structural robustness is not an automatic consequence of semantic competence.

Abstract

Large language models (LLMs) excel at semantic understanding, yet their ability to reconstruct internal structure from scrambled inputs remains underexplored. Sentence-level restoration is ill-posed for automated evaluation because multiple valid word orders often exist. We introduce OrderProbe, a deterministic benchmark for structural reconstruction using fixed four-character expressions in Chinese, Japanese, and Korean, which have a unique canonical order and thus support exact-match scoring. We further propose a diagnostic framework that evaluates models beyond recovery accuracy, including semantic fidelity, logical validity, consistency, robustness sensitivity, and information density. Experiments on twelve widely used LLMs show that structural reconstruction remains difficult even for frontier systems: zero-shot recovery frequently falls below 35%. We also observe a consistent…
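The exact-match idea in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the scrambling scheme (all distinct non-identity permutations), the function names `scrambles`, `exact_match`, and `recovery_rate`, and the stand-in `model` callables are assumptions made for illustration, and the example idiom is not taken from the benchmark. It shows only why a fixed four-character expression with a unique canonical order permits deterministic scoring.

```python
import itertools
from typing import Callable, List


def scrambles(expr: str) -> List[str]:
    """Return every distinct reordering of expr except the canonical one.

    Assumption: the benchmark's real scrambling protocol is not described
    in this excerpt; enumerating all permutations is an illustrative choice.
    """
    perms = {"".join(p) for p in itertools.permutations(expr)}
    perms.discard(expr)
    return sorted(perms)


def exact_match(prediction: str, canonical: str) -> bool:
    """Deterministic scoring: correct only if the prediction equals the canonical order."""
    return prediction.strip() == canonical


def recovery_rate(model: Callable[[str], str], canonical: str) -> float:
    """Fraction of scrambled inputs that the (hypothetical) model restores exactly."""
    inputs = scrambles(canonical)
    if not inputs:  # e.g. all characters identical, so nothing to scramble
        return 0.0
    hits = sum(exact_match(model(s), canonical) for s in inputs)
    return hits / len(inputs)


if __name__ == "__main__":
    idiom = "一石二鸟"  # example four-character expression with distinct characters
    print(len(scrambles(idiom)))  # 23 scrambled variants (4! - 1)
    # Stand-in "models": one that always answers correctly, one that echoes its input.
    print(recovery_rate(lambda s: idiom, idiom))
    print(recovery_rate(lambda s: s, idiom))
```

Because each expression has exactly one correct order, a prediction is either right or wrong, and no judgment about alternative valid word orders is needed. That is the property sentence-level restoration lacks.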

Taxonomy

Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques