D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation

Weibo Zhou; Lingbo Li; Shangsong Liang

arXiv:2508.01309 · cs.CL · February 9, 2026

Open Access

TL;DR

D-SCoRE is a training-free framework that automatically generates diverse, high-quality question-answering datasets from text sources using LLMs and prompt engineering, enhancing domain adaptation for QA models.

Contribution

The paper introduces a training-free method that combines document segmentation, CoT reasoning, and structured export to generate tailored QA datasets efficiently.

Findings

01. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains.

02. The method generates over 1,100 QA pairs per GPU-hour.

03. D-SCoRE enables rapid creation of domain-specific QA datasets.

Abstract

The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce D-SCoRE, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse, rich QA datasets with Chain-of-Thought (CoT) from arbitrary textual sources. By integrating Document-centric processing, Segmentation, CoT Reasoning, and structured Export, along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation, D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable…
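The three stages named in the abstract (segmentation, CoT-style prompting, structured export) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, the `QUESTION_TYPES` labels, the round-robin balancing, and the JSONL record layout are all assumptions made for illustration, and the LLM call itself is left out so that the example runs offline.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import List

# Hypothetical question-type labels; the paper's actual taxonomy may differ.
QUESTION_TYPES = ("factual", "inferential", "counterfactual")


@dataclass
class QARecord:
    """One exported row: which segment it came from and the prompt to send."""
    segment_id: int
    question_type: str
    prompt: str


def segment_document(text: str, max_chars: int = 400) -> List[str]:
    """Greedily pack whole sentences into segments of roughly max_chars.

    A single sentence longer than max_chars is kept whole rather than split.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    segments: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        segments.append(current)
    return segments


def build_prompt(segment: str, question_type: str) -> str:
    """Assumed prompt template asking for a question plus step-by-step reasoning."""
    return (
        f"Read the passage and write one {question_type} question about it.\n"
        "Reason step by step, then give the final answer.\n"
        f"Passage: {segment}"
    )


def build_records(text: str, max_chars: int = 400) -> List[QARecord]:
    """Segment the document and assign question types round-robin (assumed policy)."""
    records = []
    for i, segment in enumerate(segment_document(text, max_chars)):
        question_type = QUESTION_TYPES[i % len(QUESTION_TYPES)]
        records.append(QARecord(i, question_type, build_prompt(segment, question_type)))
    return records


def export_jsonl(records: List[QARecord]) -> str:
    """Serialize records as JSON Lines, one object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)
```

In a full pipeline, each `prompt` would be sent to an LLM and the returned question, reasoning chain, and answer would be stored alongside the record before export; that step is omitted here because the source does not specify the model or API used.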

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications