Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning

Yoichi Ishibashi; Taro Yano; Masafumi Oyamada

arXiv:2505.10182·cs.CL·May 16, 2025

Open Access 3 Reviews

TL;DR

This paper evaluates a novel continual pretraining method that uses synthetic data to mimic hidden thought processes, significantly enhancing reasoning abilities of large language models across multiple domains.

Contribution

It introduces Reasoning CPT, a new approach that synthesizes training data to improve reasoning in LLMs, demonstrating broad domain transfer and difficulty-adaptive reasoning.

Findings

01

Consistent performance improvements across all domains evaluated.

02

Effective transfer of reasoning skills between different domains.

03

Models adjust reasoning depth based on problem difficulty.

Abstract

Large Language Models (LLMs) have demonstrated significant improvements in reasoning capabilities through supervised fine-tuning and reinforcement learning. However, when training reasoning models, these approaches are primarily applicable to specific domains such as mathematics and programming, which imposes fundamental constraints on the breadth and scalability of training data. In contrast, continual pretraining (CPT) offers the advantage of not requiring task-specific signals. Nevertheless, how to effectively synthesize training data for reasoning and how such data affect a wide range of domains remain largely unexplored. This study provides a detailed evaluation of Reasoning CPT, a form of CPT that uses synthetic data to reconstruct the hidden thought processes underlying texts, based on the premise that texts are the result of the author's thinking process. Specifically, we apply…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

The detailed analysis of hidden thought characteristics (e.g., length correlation with original texts, domain variance) offers valuable insights into synthetic data construction for reasoning pretraining. The paper addresses a critical limitation of existing LLM reasoning training (over-reliance on task-specific signals) by validating non-STEM data’s effectiveness, providing a novel direction for data selection

Weaknesses

- The abstract and introduction lack logical coherence. For instance, there is an abrupt transition from reasoning models to CPT in the abstract, with insufficient contextual connection to justify this shift.
- No comparisons are made with other related state-of-the-art works. This omission prevents a clear demonstration of the proposed method’s advantages over existing approaches.
- The proposed method bears significant similarities to knowledge distillation directly from stronger reasoning

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The hidden thoughts are effective for reasoning tasks.
2. The mechanism behind hidden thoughts is transferable.

Weaknesses

1. The "hidden thoughts" approach is similar to generating a chain-of-thought or slow-thinking process for a piece of pre-training text.
2. Insufficient experiments.
3. The analyses of synthetic data provide neither an in-depth explanation of the proposed method nor useful insights for future directions. For details, see the questions below.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- **The results are quite promising.** The hidden thoughts on the Law domain seem to really aid the downstream performance in somewhat surprising ways, i.e., enhancing the performance on GPQA for Qwen2.5-7B.
- **Good reproducibility.** The process is very straightforward and well documented for other researchers/practitioners to use and experiment on.
- **Very clear, nice figures & tables, and well written.** The tables and figures present clear results and the paper itself is structured nicely (g

Weaknesses

- **Weak analysis**, more time could be spent on figuring out where these gains are coming from. For example, "To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning" is a paper that shows most of the gain on MMLU comes from questions that are math-heavy (they find it by looking for "=" in the generated responses because math questions are sometimes placed into unassuming categories like "business"). I would be interested in seeing if some of the cross-domain transfe
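The "=" heuristic the reviewer describes can be illustrated with a minimal Python sketch. It assumes each evaluation record holds a generated response plus correctness flags for a baseline model and a treated model; the record keys (`response`, `baseline_correct`, `treated_correct`) and the function names are hypothetical and do not come from the paper or from the cited work.

```python
from collections import defaultdict


def is_math_heavy(response: str) -> bool:
    """Heuristic from the review: a response containing '=' counts as math-heavy."""
    return "=" in response


def accuracy_gain_by_bucket(records):
    """Return {bucket: accuracy(treated) - accuracy(baseline)} per bucket.

    Each record is a dict with hypothetical keys:
      'response' (str), 'baseline_correct' (bool), 'treated_correct' (bool).
    """
    # Per bucket: [number of records, baseline hits, treated hits]
    totals = defaultdict(lambda: [0, 0, 0])
    for record in records:
        bucket = "math_heavy" if is_math_heavy(record["response"]) else "other"
        counts = totals[bucket]
        counts[0] += 1
        counts[1] += int(record["baseline_correct"])
        counts[2] += int(record["treated_correct"])
    return {b: (c[2] - c[1]) / c[0] for b, c in totals.items()}
```

Comparing the gain in the `math_heavy` bucket against the gain in `other` would give a rough indication of whether an improvement is concentrated in math-like questions, which is the kind of breakdown the reviewer asks for.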

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques · Semantic Web and Ontologies