CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks

Ping Yu; Jack Lanchantin; Tianlu Wang; Weizhe Yuan; Olga Golovneva; Ilia Kulikov; Sainbayar Sukhbaatar; Jason Weston; Jing Xu

arXiv:2507.23751·cs.AI·September 4, 2025

TL;DR

CoT-Self-Instruct generates high-quality synthetic prompts for reasoning and instruction-following tasks by instructing LLMs to reason and plan before writing each new example, then filtering the results with automatic metrics to build stronger training datasets.

Contribution

The paper presents a novel synthetic data generation approach that leverages Chain-of-Thought reasoning and automatic filtering to improve LLM training datasets.

Findings

01

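On verifiable reasoning benchmarks (MATH500, AMC23, AIME24, GPQA-Diamond), the synthetic data outperforms existing training datasets such as s1k and OpenMathReasoning.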

02

On the AlpacaEval 2.0 and Arena-Hard instruction-following benchmarks, the method surpasses both human and standard Self-Instruct training data.

03

The improvements hold for both verifiable reasoning tasks and non-verifiable instruction-following tasks.

Abstract

We propose CoT-Self-Instruct, a synthetic data generation method that instructs LLMs to first reason and plan via Chain-of-Thought (CoT) based on given seed tasks, and then generate a new synthetic example of similar quality and complexity. A filtering step then selects high-quality data using automatic metrics, and the selected data is used for LLM training. In verifiable reasoning, our synthetic data significantly outperforms existing training datasets, such as s1k and OpenMathReasoning, when evaluated on MATH500, AMC23, AIME24, and GPQA-Diamond. For non-verifiable instruction-following tasks, our method surpasses the performance of both human and standard Self-Instruct training data on the AlpacaEval 2.0 and Arena-Hard benchmarks.
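
The two stages described in the abstract (CoT-guided generation from seed tasks, then automatic filtering) could be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the prompt template, the `NEW TASK:` marker, the `generate` and `answer_fn` callables, and the majority-agreement filter are all assumptions introduced here, since the abstract does not specify the prompt wording or which automatic metrics are used.

```python
import random
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical template: the paper's actual prompt wording is not given here.
TEMPLATE = (
    "Here are some example tasks:\n{seeds}\n\n"
    "First reason step by step about what makes these tasks challenging, "
    "then write one new task of similar quality and complexity.\n"
    "Put the new task after the marker 'NEW TASK:'."
)
MARKER = "NEW TASK:"


def build_prompt(seed_tasks: List[str], k: int, rng: random.Random) -> str:
    """Sample k seed tasks and format them into the generation prompt."""
    chosen = rng.sample(seed_tasks, k)
    return TEMPLATE.format(seeds="\n".join(f"- {t}" for t in chosen))


def extract_task(completion: str) -> Optional[str]:
    """Keep only the text after the marker, dropping the model's reasoning."""
    if MARKER not in completion:
        return None
    task = completion.split(MARKER, 1)[1].strip()
    return task or None


def agreement_score(task: str, answer_fn: Callable[[str], str], n_samples: int) -> float:
    """Fraction of sampled answers that match the most common answer.

    Assumed stand-in for the 'automatic metrics' mentioned in the abstract.
    """
    answers = [answer_fn(task) for _ in range(n_samples)]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / n_samples


def synthesize(
    seed_tasks: List[str],
    generate: Callable[[str], str],   # hypothetical LLM call: prompt -> completion
    answer_fn: Callable[[str], str],  # hypothetical LLM call: task -> final answer
    n_candidates: int,
    k: int = 2,
    n_samples: int = 4,
    threshold: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Generate candidate tasks and keep those that pass the filter."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_candidates):
        task = extract_task(generate(build_prompt(seed_tasks, k, rng)))
        if task is None:
            continue  # the completion did not follow the expected format
        if agreement_score(task, answer_fn, n_samples) >= threshold:
            kept.append(task)
    return kept
```

In use, `generate` and `answer_fn` would wrap calls to whichever model is available. The agreement filter only makes sense for tasks with short, comparable final answers, so a non-verifiable instruction-following setting would need a different scoring function.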

Code & Models

Datasets

uv-scripts/synthetic-data
dataset · 75 dl

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Online Learning and Analytics