The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Seongyun Lee; Seungone Kim; Minju Seo; Yongrae Jo; Dongyoung Go; Hyeonbin Hwang; Jinho Park; Xiang Yue; Sean Welleck; Graham Neubig; Moontae Lee; Minjoon Seo

arXiv:2505.10185·cs.CL·May 16, 2025

Open Access·3 Reviews

TL;DR

The paper introduces the CoT Encyclopedia, a framework for analyzing, predicting, and guiding reasoning strategies in large language models, leading to improved interpretability and performance in chain-of-thought reasoning.

Contribution

It presents a novel bottom-up approach to automatically categorize and interpret diverse reasoning behaviors in models, surpassing prior human-intuition-based methods.

Findings

01. Model reasoning strategies can be effectively categorized and predicted.

02. Training data format significantly influences reasoning behavior.

03. Understanding strategies enables performance improvements.

Abstract

Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate…
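The embed-and-cluster stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for a real sentence encoder, the greedy threshold clustering stands in for whatever clustering algorithm the paper uses, and the example criteria strings are invented for illustration. The criterion-extraction and rubric-derivation stages, which rely on an LLM, are omitted.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy L2-normalised bag-of-words vector (assumption: stand-in for a sentence encoder)."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(x * y for x, y in zip(a, b))


def cluster(criteria, threshold=0.3):
    """Greedy one-pass clustering: a criterion joins the first cluster whose
    seed vector is at least `threshold` similar, otherwise it starts a new one.
    (Assumption: a simple substitute for the paper's clustering step.)"""
    vocab = sorted({w for c in criteria for w in c.lower().split()})
    clusters = []  # list of (seed_vector, member_strings)
    for c in criteria:
        v = embed(c, vocab)
        for seed, members in clusters:
            if cosine(seed, v) >= threshold:
                members.append(c)
                break
        else:
            clusters.append((v, [c]))
    return [members for _, members in clusters]


# Hypothetical reasoning criteria, as an LLM might extract them from CoTs.
criteria = [
    "verification of intermediate steps",
    "verification of final answer",
    "decomposition into subproblems",
    "decomposition into smaller subproblems",
]
groups = cluster(criteria)
for g in groups:
    print(g)
```

Each resulting group would then be summarised into a representative category and turned into a contrastive rubric; in the paper those steps are performed automatically, which this sketch does not attempt.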

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 2·Confidence 4

Strengths

- Clear narrative.
- Clear empirical insights. Demonstrates the role of format and domain in shaping reasoning patterns, and that merging model weights interpolates strategies. These provide useful guidance for dataset and model design.
- Sensible empirical analysis. Includes ablations on taxonomy construction (embedding, clustering, … ), human evals of report quality, and analyses of stability across families/sizes.
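As a rough illustration of what "merging model weights" can mean, the sketch below linearly interpolates two parameter sets. This is an assumption for illustration only: the page does not say which merging method the paper uses, and the names `merge_weights`, `weights_a`, `weights_b`, and `alpha` are hypothetical.

```python
def merge_weights(weights_a, weights_b, alpha):
    """Linear interpolation of two parameter dicts: (1 - alpha) * A + alpha * B.

    Assumption: both models share the same parameter names and shapes;
    parameters are represented here as flat lists of floats.
    """
    if weights_a.keys() != weights_b.keys():
        raise ValueError("models must have identical parameter names")
    return {
        name: [(1 - alpha) * a + alpha * b
               for a, b in zip(weights_a[name], weights_b[name])]
        for name in weights_a
    }


# Toy parameters for two hypothetical models.
model_a = {"w": [0.0, 2.0]}
model_b = {"w": [4.0, 6.0]}
merged = merge_weights(model_a, model_b, alpha=0.25)
print(merged)
```

With `alpha=0` the result equals the first model and with `alpha=1` the second; intermediate values give a blend, which is the sense in which the reviewer describes strategies being interpolated.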

Weaknesses

- Lack of a rigorous scoping for the problem of reasoning.
- Lack of an argument for the construction of methods which can deliver a comprehensive feature set which describes the CoT reasoning phenomena.

Reviewer 02·Rating 6·Confidence 4

Strengths

# 1. Originality and Conceptual Contribution

The work shifts the paradigm from top-down, predefined reasoning taxonomies to a bottom-up, data-driven discovery of reasoning strategies. This formulation is original and theoretically meaningful: it operationalizes reasoning diversity without relying on human-crafted categories, enabling emergent taxonomies directly grounded in model behavior. The introduction of contrastive rubrics (e.g., “bottom-up vs. top-down,” “inductive vs. deductive”) repres

Weaknesses

# 1. Lack of analysis on classifier choice and sensitivity

The framework relies on a single LLM (GPT-4o) to perform all classification tasks in the taxonomy pipeline, deciding whether each reasoning trace aligns with one side of a contrastive rubric. Although Appendix B.1 examines benchmark-induced differences (showing that GPQA, MMLU, and MATH benchmarks produce similar criteria while Arena-Hard yields a distinct “User Understanding” dimension), this analysis only reflects task-level variabili

Reviewer 03·Rating 8·Confidence 4

Strengths

- The goal of making LLM reasoning more interpretable and controllable is important and timely.
- The paper provides a large-scale qualitative analysis that could inspire follow-up interpretability research.
- It attempts to link reasoning diversity with data format, a less-explored dimension in CoT studies.
- Presentation is relatively polished and readable, making the high-level idea easy to follow.

Weaknesses

- The criterion appears to be very important, yet the paper seems to use only a single one. How robust is this criterion to variations from weaker models or the effects of randomness?
- It seems that a prior for a specific reasoning logic has been incorporated for a certain class of questions, and the performance benefits are evident. I am curious how your framework would be applied when the test set does not include CoT outputs.
- Figure 6 is blurry and difficult to read.
- How does the perform

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning