D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation
Weibo Zhou, Lingbo Li, Shangsong Liang

TL;DR
D-SCoRE is a training-free framework that automatically generates diverse, high-quality question-answering datasets from text sources using LLMs and prompt engineering, enhancing domain adaptation for QA models.
Contribution
It introduces a novel, training-free method combining document segmentation, CoT reasoning, and structured export to create tailored QA datasets efficiently.
Findings
LLMs fine-tuned on D-SCoRE-generated data outperform those trained on human-annotated QA data across most evaluated domains.
The method generates over 1,100 QA pairs per GPU-hour.
D-SCoRE enables rapid domain-specific QA dataset creation.
Abstract
The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce D-SCoRE, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse, rich QA datasets with Chain-of-Thought (CoT) from arbitrary textual sources. By integrating Document-centric processing, Segmentation, CoT Reasoning, and structured Export - along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation - D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable…
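The abstract names three stages: segmentation, CoT reasoning, and structured export. The sketch below is one minimal way such a pipeline could be wired together in Python; it is not the authors' implementation. The function names, the prompt wording, the word-count segmentation rule, the JSON record schema, and the `llm` callable (any function that maps a prompt string to a response string) are all assumptions made for illustration. The multi-dimensional controls mentioned in the abstract are not modeled.

```python
import json
from typing import Callable

# Hypothetical record schema; the paper's actual export format is not given here.
REQUIRED_KEYS = ("question", "reasoning", "answer")


def segment(text: str, max_words: int = 120) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into segments.

    Each segment aims for at most max_words words; a single paragraph
    longer than that is kept whole rather than split.
    """
    segments, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            segments.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        segments.append("\n\n".join(current))
    return segments


def build_prompt(seg: str) -> str:
    """Assumed prompt wording asking for a question, reasoning, and answer."""
    return (
        "Read the passage and write one question answerable from it.\n"
        "Reason step by step, then reply with a JSON object with keys "
        '"question", "reasoning", "answer".\n\n'
        f"Passage:\n{seg}"
    )


def generate_records(
    text: str, llm: Callable[[str], str], max_words: int = 120
) -> list[dict]:
    """Run the assumed pipeline: segment, prompt, parse, and validate."""
    records = []
    for i, seg in enumerate(segment(text, max_words)):
        try:
            obj = json.loads(llm(build_prompt(seg)))
        except json.JSONDecodeError:
            continue  # skip responses that are not valid JSON
        if not isinstance(obj, dict) or not all(k in obj for k in REQUIRED_KEYS):
            continue  # skip responses missing a required field
        records.append(
            {"segment_id": i, "context": seg, **{k: obj[k] for k in REQUIRED_KEYS}}
        )
    return records


def to_jsonl(records: list[dict]) -> str:
    """Structured export: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, and it lets the segmentation and export logic be exercised with a stub before a real model is attached.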
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
