CryptoX: Compositional Reasoning Evaluation of Large Language Models
Jiajun Shi, Chaoren Wei, Liqun Yang, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang, Zhoufutu Wen

TL;DR
This paper introduces CryptoX and CryptoBench, innovative frameworks for systematically evaluating the compositional reasoning abilities of large language models, revealing significant differences between open-source and closed-source models.
Contribution
The paper presents CryptoX and CryptoBench, the first comprehensive benchmarks combining cryptographic principles to quantify and analyze LLMs' compositional reasoning capabilities.
Findings
Large gap between open-source and closed-source LLMs in compositional reasoning
Inner mechanisms involve subproblem decomposition and inference
Highlights importance of improving LLM reasoning abilities
Abstract
Compositional reasoning capacity has long been regarded as critical to the generalization and emergent intelligence of large language models (LLMs). However, despite numerous reasoning-related benchmarks, the compositional reasoning capacity of LLMs is rarely studied or quantified in existing benchmarks. In this paper, we introduce CryptoX, an evaluation framework that, for the first time, combines existing benchmarks with cryptographic principles to quantify the compositional reasoning capacity of LLMs. Building upon CryptoX, we construct CryptoBench, which integrates these principles into several benchmarks for systematic evaluation. We conduct detailed experiments on widely used open-source and closed-source LLMs using CryptoBench, revealing a huge gap between open-source and closed-source LLMs. We further conduct thorough mechanistic interpretability experiments to reveal the inner…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The method and idea are straightforward and easy to understand. The authors conduct extensive experiments. Experiment details, including prompts and codes, are well demonstrated and shared.
1. This paper narrows the topic down to deciphering, which makes the research question less important (not representative and far from the real world). A reasonable benchmark should be representative and evaluate a general ability. 2. Some claims are ambiguous. For example, in "the true CR ability", how is "true" defined? Even with the explanation in Section 4.1, readers may still be confused about "true". 3. The analysis offers little value to the broader ICLR community. The results in Sections 4
- The authors present a scalable method to take existing benchmarks for LLMs and encrypt them with a combination of transformations that happen at the word level, followed by wholesale replacement with Morse / emoji code. This makes it easy to apply to any existing benchmark - 4.1 Analysis (2) attempts to disentangle one-step vs. combined reasoning. This is exactly the sort of analysis I was hoping to see, and is a very good start to addressing the concern I raise in the questions. - A
- The key weakness I find in the paper is a lack of evidence that the proposed method tests for compositional reasoning rather than robustness to noise. I appreciate the authors' attempts in this direction that are already provided, but I would like to see some further evidence to address specific concerns. Showing more evidence that this method is meaningfully different from noise robustness would go a long way to convincing me that this proposed method is novel and useful. “As shown in Tabl
- Originality & Significance: The approach of extending existing benchmarks to compositional reasoning is interesting, as it aims to balance controllable reasoning complexity with practical downstream tasks. When properly executed, this framework could provide an effective framework for analyzing LLM compositional reasoning behavior. - Quality: The experiments include a wide range of evaluated models, which strengthens the credibility of the results. - Clarity: The results in Section 4.1(2) demo
- The definition of compositionality in this study is too limited. Each example follows a fixed two-step process of decryption followed by question answering, with no exchange of information between them. Because the reasoning path is identical across all samples and the reasoning graph connects only a single pair of tasks (decryption and downstream reasoning), the task becomes a static routine rather than a test of adaptive decomposition. The task type itself is also limited since the encryptio
1. The paper introduces a novel and systematic evaluation framework **(CryptoX)** for compositional reasoning, with a carefully designed benchmark (CryptoBench) that allows controlled difficulty scaling and clear attribution of model failures. This provides a reproducible and extensible way to study an important but underexplored capability of LLMs. 2. The authors conduct comprehensive experiments across 40+ open- and closed-source models, combined with mechanistic interpretability analysis (logi
1. **Reasonableness of Task Design**: CryptoX's rules are artificially constructed (encryption/transformation). Can they represent real-world "combinatorial reasoning"? I'm concerned that the task is too symbolic and may lack generalizability. 2. **Writing Issues**: The paper suffers from clarity problems in terminology and exposition. Key definitions (e.g., Crypto-MMLU, -Num, -Alpha) are scattered across the appendix rather than consolidated in the main text, which forces readers to cross-referen
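The reviews above describe the encryption step as word-level transformations followed by replacement with a code such as Morse, and the task as a fixed two-step process of decryption followed by question answering. The following is a minimal sketch of that idea, not the authors' implementation: the word reversal used as the word-level transformation, the `num_words` knob, and all function names are assumptions made for illustration only.

```python
import random

# Standard International Morse code for the 26 Latin letters.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".", "f": "..-.",
    "g": "--.", "h": "....", "i": "..", "j": ".---", "k": "-.-", "l": ".-..",
    "m": "--", "n": "-.", "o": "---", "p": ".--.", "q": "--.-", "r": ".-.",
    "s": "...", "t": "-", "u": "..-", "v": "...-", "w": ".--", "x": "-..-",
    "y": "-.--", "z": "--..",
}
INVERSE_MORSE = {code: letter for letter, code in MORSE.items()}


def encode_word(word):
    """Hypothetical rule: reverse the word, then spell it in Morse."""
    return " ".join(MORSE[ch] for ch in reversed(word.lower()))


def decode_word(encoded):
    """Invert encode_word: read the Morse letters, then undo the reversal."""
    letters = [INVERSE_MORSE[code] for code in encoded.split(" ")]
    return "".join(reversed(letters))


def encrypt_question(question, num_words, seed=0):
    """Encode `num_words` randomly chosen alphabetic words of a question.

    Returns the token list and the set of encoded positions. Tokens are
    kept as a list because an encoded word contains spaces itself.
    """
    words = question.split()
    candidates = [i for i, w in enumerate(words) if w.isascii() and w.isalpha()]
    rng = random.Random(seed)
    chosen = set(rng.sample(candidates, min(num_words, len(candidates))))
    tokens = [encode_word(w) if i in chosen else w for i, w in enumerate(words)]
    return tokens, chosen


if __name__ == "__main__":
    tokens, chosen = encrypt_question("what is the capital of france", 2)
    print(tokens)
    # Step 1 of the two-step task: decrypt. Step 2 (answering) is left to the model.
    print(" ".join(decode_word(t) if i in chosen else t for i, t in enumerate(tokens)))
```

In this sketch, raising `num_words` encodes more of the question, which is one plausible way to obtain the controlled difficulty scaling a reviewer mentions; the paper's actual rules may differ.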
Taxonomy
Topics: Data Quality and Management · Semantic Web and Ontologies
