The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think
Seongyun Lee, Seungone Kim, Minju Seo, Yongrae Jo, Dongyoung Go, Hyeonbin Hwang, Jinho Park, Xiang Yue, Sean Welleck, Graham Neubig, Moontae Lee, Minjoon Seo

TL;DR
The paper introduces the CoT Encyclopedia, a framework for analyzing, predicting, and guiding reasoning strategies in large language models, leading to improved interpretability and performance in chain-of-thought reasoning.
Contribution
It presents a novel bottom-up approach to automatically categorize and interpret diverse reasoning behaviors in models, surpassing prior human-intuition-based methods.
Findings
Model reasoning strategies can be effectively categorized and predicted.
Training data format significantly influences reasoning behavior.
Understanding strategies enables performance improvements.
Abstract
Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate…
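The embed-then-cluster step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the criteria strings are invented, a bag-of-words vector stands in for a learned semantic embedding, and a greedy threshold clustering stands in for whatever clustering algorithm the paper actually uses. The LLM-based steps (extracting criteria from CoTs and deriving contrastive rubrics) are omitted entirely.

```python
import math
from collections import Counter

# Hypothetical criteria strings, invented for illustration only.
CRITERIA = [
    "verifies each step before moving on",
    "verifies the final answer at the end",
    "explores multiple solution paths",
    "commits to one solution path",
]


def embed(text, vocab):
    """Bag-of-words count vector; a stand-in for a learned sentence embedding."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors, members):
    """Mean vector of the cluster members, given as indices into `vectors`."""
    dim = len(vectors[0])
    return [sum(vectors[i][k] for i in members) / len(members) for k in range(dim)]


def greedy_cluster(vectors, threshold):
    """Assign each vector to the most similar existing cluster, or open a new one.

    Assumption: a simple single-pass scheme chosen for determinism and brevity.
    """
    clusters = []
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(vec, centroid(vectors, members))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([idx])
        else:
            best.append(idx)
    return clusters


vocab = sorted({word for text in CRITERIA for word in text.lower().split()})
vectors = [embed(text, vocab) for text in CRITERIA]
clusters = greedy_cluster(vectors, threshold=0.1)
print(clusters)  # [[0, 1], [2, 3]]
```

In this toy run the two "verifies" criteria fall into one group and the two "solution path" criteria into another. Each such group would then be the raw material for a contrastive rubric, with one pole per opposing behavior.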
Peer Reviews
Decision·ICLR 2026 Poster
- Clear narrative.
- Clear empirical insights. Demonstrates the role of format/domain in shaping reasoning patterns, and that merging model weights interpolates strategies. These provide useful guidance for dataset and model design.
- Sensible empirical analysis. Includes ablations on taxonomy construction (embedding, clustering, …), human evals of report quality, and analyses of stability across families/sizes.
- Lack of a rigorous scoping for the problem of reasoning.
- Lack of an argument for the construction of methods which can deliver a comprehensive feature set which describes the CoT reasoning phenomena.
# 1. Originality and Conceptual Contribution
The work shifts the paradigm from top-down, predefined reasoning taxonomies to a bottom-up, data-driven discovery of reasoning strategies. This formulation is original and theoretically meaningful: it operationalizes reasoning diversity without relying on human-crafted categories, enabling emergent taxonomies directly grounded in model behavior. The introduction of contrastive rubrics (e.g., “bottom-up vs. top-down,” “inductive vs. deductive”) repres
# 1. Lack of analysis on classifier choice and sensitivity
The framework relies on a single LLM (GPT-4o) to perform all classification tasks in the taxonomy pipeline, deciding whether each reasoning trace aligns with one side of a contrastive rubric. Although Appendix B.1 examines benchmark-induced differences (showing that GPQA, MMLU, and MATH benchmarks produce similar criteria while Arena-Hard yields a distinct “User Understanding” dimension), this analysis only reflects task-level variabili
- The goal of making LLM reasoning more interpretable and controllable is important and timely.
- The paper provides a large-scale qualitative analysis that could inspire follow-up interpretability research.
- It attempts to link reasoning diversity with data format, a less-explored dimension in CoT studies.
- Presentation is relatively polished and readable, making the high-level idea easy to follow.
- The criterion appears to be very important, yet the paper seems to use only a single one. How robust is this criterion to variations from weaker models or the effects of randomness?
- It seems that a prior for a specific reasoning logic has been incorporated for a certain class of questions, and the performance benefits are evident. I am curious how your framework would be applied when the test set does not include CoT outputs.
- Figure 6 is blurry and difficult to read.
- How does the perform
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
