Code2Doc: A Quality-First Curated Dataset for Code Documentation
Recep Kaan Karaman, Meftun Akarsu

TL;DR
This paper introduces Code2Doc, a high-quality, curated dataset for function-level code documentation across five programming languages, aiming to improve automatic documentation models by providing cleaner, more reliable training data.
Contribution
The paper presents a novel, multi-stage curation pipeline that filters and refines code-documentation pairs, resulting in a high-quality dataset for code documentation generation.
Findings
Fine-tuning on Code2Doc improves BLEU by 29.47%
Dataset contains 13,358 high-quality pairs
Only 2.9% of samples are AI-generated
Abstract
The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large-scale scraping of public repositories with limited quality control. As a result, they often contain noisy documentation, extensive duplication, and increasing contamination from AI-generated content. These issues weaken the supervision signal available to learning-based models and complicate evaluation. We introduce Code2Doc, a quality-first curated dataset for function-level code documentation generation. Code2Doc consists of 13,358 high-quality function-documentation pairs extracted from widely used open-source repositories spanning five programming languages: Python, Java, TypeScript, JavaScript, and C++. The dataset is constructed using a four-stage curation…
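The abstract names duplication as one source of noise in scraped datasets. As a minimal illustration of that single idea, and not a reconstruction of the paper's pipeline (whose stages are not detailed in this excerpt), exact-duplicate removal over function-documentation pairs could be sketched as follows. The record fields `code` and `doc` and the helpers `normalize` and `dedupe_pairs` are hypothetical names chosen for this example.

```python
import hashlib


def normalize(code: str) -> str:
    # Collapse every whitespace run to a single space so that pairs differing
    # only in indentation or line breaks produce the same hash.
    return " ".join(code.split())


def dedupe_pairs(pairs):
    """Keep the first occurrence of each normalized function body.

    Hypothetical helper for illustration; not taken from the paper.
    """
    seen = set()
    unique = []
    for pair in pairs:
        digest = hashlib.sha256(normalize(pair["code"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(pair)
    return unique


# Toy records: the first two differ only in indentation width.
pairs = [
    {"code": "def add(a, b):\n    return a + b", "doc": "Add two numbers."},
    {"code": "def add(a, b):\n  return a + b", "doc": "Adds a and b."},
    {"code": "def sub(a, b):\n    return a - b", "doc": "Subtract b from a."},
]
unique = dedupe_pairs(pairs)
print(len(unique))
```

Hashing a whitespace-normalized body only catches exact copies; near-duplicates (renamed variables, reordered statements) would need a fuzzier similarity measure, which this sketch does not attempt.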
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Software Testing and Debugging Techniques
