CDS: Knowledge Component-Driven Data Synthesis Guided by Cognitive Diagnosis Theory
Haokun Zhao, Jinyi Han, Jiaqing Liang, Yanghua Xiao, Xiaojun Meng, Jiansheng Wei

TL;DR
This paper introduces a knowledge component-driven data synthesis method based on Cognitive Diagnosis Theory, improving the quality of synthetic data for training large language models and enhancing their performance across various tasks.
Contribution
The paper proposes a novel diagnostic approach for fine-grained evaluation of LLMs and introduces targeted data synthesis strategies based on knowledge component diagnostics.
Findings
Up to 6.00% improvement in code generation
13.10% boost in mathematical reasoning
5.43% enhancement in academic exam performance
Abstract
Large Language Models (LLMs) have achieved significant advancements, but the increasing complexity of tasks and higher performance demands highlight the need for continuous improvement. Some approaches utilize synthetic data generated by advanced LLMs based on evaluation results to train models. However, conventional evaluation methods fail to provide detailed, fine-grained profiles of LLMs, limiting their guidance for data synthesis. In this paper, we introduce the Cognitive Diagnostic Synthesis (CDS) method, which incorporates a diagnostic process inspired by Cognitive Diagnosis Theory (CDT) to refine evaluation results and characterize model profiles at the knowledge component level. Based on these diagnostics, we propose two diagnosis-synthesis strategies for weakness-targeted data synthesis. Additionally, we present an enhanced data augmentation and selection pipeline to improve…
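To make the idea of a knowledge-component-level profile concrete, here is a minimal Python sketch. It is not the paper's algorithm: the per-question knowledge component tags, the use of plain accuracy as a mastery proxy, the 0.5 threshold, and the function names `diagnose` and `weak_components` are all assumptions made for illustration.

```python
from collections import defaultdict

def diagnose(results):
    """Build a per-knowledge-component accuracy profile.

    `results` is a list of (knowledge_components, is_correct) pairs,
    one per evaluated question. The tags are assumed to be given.
    """
    attempts = defaultdict(int)
    correct = defaultdict(int)
    for components, is_correct in results:
        for kc in components:
            attempts[kc] += 1
            correct[kc] += int(is_correct)
    # Accuracy per component stands in for a mastery estimate (assumption).
    return {kc: correct[kc] / attempts[kc] for kc in attempts}

def weak_components(profile, threshold=0.5):
    """Return components whose accuracy is strictly below the threshold."""
    return sorted(kc for kc, acc in profile.items() if acc < threshold)

# Hypothetical evaluation results, invented for this example.
results = [
    (["recursion", "string parsing"], False),
    (["recursion"], False),
    (["string parsing"], True),
    (["modular arithmetic"], True),
]

profile = diagnose(results)
weak = weak_components(profile)
```

In a weakness-targeted pipeline, the components in `weak` would then be the ones passed to a stronger model as targets for generating new training questions.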
Taxonomy
Topics: Advanced Data Processing Techniques
