Reasoning with OmniThought: A Large CoT Dataset with Verbosity and Cognitive Difficulty Annotations

Wenrui Cai; Chengyu Wang; Junbing Yan; Jun Huang; Xiangzhong Fang

arXiv:2505.10937 · cs.CL · May 19, 2025

TL;DR

OmniThought is a large-scale dataset of 2 million chain-of-thought (CoT) reasoning processes, each annotated with verbosity and difficulty scores, designed to improve the training of large reasoning models (LRMs) and their reasoning abilities.

Contribution

The paper introduces OmniThought, a comprehensive CoT dataset with novel annotations, and demonstrates its effectiveness in enhancing large reasoning models.

Findings

01 Verbosity and difficulty scores improve LRM training effectiveness

02 Enhanced reasoning abilities in trained LRMs

03 Optimal CoT verbosity and difficulty levels identified

Abstract

The emergence of large reasoning models (LRMs) has transformed Natural Language Processing by excelling in complex tasks such as mathematical problem-solving and code generation. These models leverage chain-of-thought (CoT) processes, enabling them to emulate human-like reasoning strategies. However, the advancement of LRMs is hindered by the lack of comprehensive CoT datasets. Current resources often fail to provide extensive reasoning problems with coherent CoT processes distilled from multiple teacher models and do not account for multifaceted properties describing the internal characteristics of CoTs. To address these challenges, we introduce OmniThought, a large-scale dataset featuring 2 million CoT processes generated and validated by two powerful LRMs as teacher models. Each CoT process in OmniThought is annotated with novel Reasoning Verbosity (RV) and Cognitive Difficulty (CD)…
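As a rough illustration of how per-CoT annotations like these might be used when assembling training data, the sketch below filters records by score range. It is not taken from the paper: the `CoTRecord` type, the field names `rv` and `cd`, the integer score values, and the example ranges are all hypothetical, chosen only to show the idea of selecting CoTs whose verbosity and difficulty suit a target model.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class CoTRecord:
    """Hypothetical record layout; the real dataset schema may differ."""
    problem: str
    reasoning: str
    rv: int  # assumed Reasoning Verbosity score (scale is an assumption)
    cd: int  # assumed Cognitive Difficulty score (scale is an assumption)


def select_by_annotations(
    records: Iterable[CoTRecord],
    rv_range: Tuple[int, int],
    cd_range: Tuple[int, int],
) -> List[CoTRecord]:
    """Keep records whose RV and CD scores both fall in the inclusive ranges."""
    rv_lo, rv_hi = rv_range
    cd_lo, cd_hi = cd_range
    return [
        r for r in records
        if rv_lo <= r.rv <= rv_hi and cd_lo <= r.cd <= cd_hi
    ]


# Made-up sample records, for illustration only.
SAMPLE = [
    CoTRecord("add", "2 + 2 = 4", rv=1, cd=1),
    CoTRecord("geometry", "Let the radius be r ...", rv=4, cd=5),
    CoTRecord("proof", "Assume for contradiction ...", rv=9, cd=8),
    CoTRecord("code", "Iterate over the list ...", rv=3, cd=3),
]

if __name__ == "__main__":
    for rec in select_by_annotations(SAMPLE, rv_range=(2, 5), cd_range=(3, 6)):
        print(rec.problem, rec.rv, rec.cd)
```

Under these assumptions, a smaller model could be given a narrower, lower range of scores than a larger one; the specific thresholds here are placeholders, not values reported by the authors.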

Code & Models

Datasets

alibaba-pai/OmniThought
dataset · 725 dl

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications