A parallel corpus of Python functions and documentation strings for automated code documentation and code generation
Antonio Valerio Miceli Barone, Rico Sennrich

TL;DR
This paper introduces a large, diverse parallel corpus of Python functions and docstrings to advance automated code documentation and generation, providing baseline results and data augmentation techniques to foster further research.
Contribution
The authors created and released a large dataset of Python code and documentation, enabling improved neural models for code-related tasks and addressing data scarcity issues.
Findings
The paper reports baseline neural machine translation results for both code documentation and code generation.
Data augmentation techniques are explored as a way to further increase the amount of training data.
The dataset supports future research in automated code understanding and generation.
Abstract
Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains. In this work we introduce a large and diverse parallel corpus of a hundred thousand Python functions with their documentation strings ("docstrings"), generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with data augmentation techniques to further increase the amount of training data. We release our datasets and processing scripts in order to stimulate research in these areas.
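To make the idea of a function/docstring parallel corpus concrete, the following is a minimal sketch of how such pairs could be pulled out of Python source using the standard library `ast` module. It is not the authors' released processing script: the function name `extract_pairs` and the output layout (name, docstring, code without docstring) are assumptions chosen for illustration, and `ast.unparse` requires Python 3.9 or later.

```python
import ast
import copy


def extract_pairs(source: str):
    """Return (name, docstring, code_without_docstring) for each documented function.

    Hypothetical helper for illustration; not taken from the paper's scripts.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        docstring = ast.get_docstring(node)
        if not docstring:
            # Functions without a docstring have no natural language side.
            continue
        # Shallow-copy the node so the original tree is left untouched,
        # then drop the leading docstring statement from the body.
        clone = copy.copy(node)
        clone.body = node.body[1:] or [ast.Pass()]
        pairs.append((node.name, docstring, ast.unparse(clone)))
    return pairs


sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def undocumented(x):
    return x
'''

for name, docstring, code in extract_pairs(sample):
    print(name, "|", docstring)
    print(code)
```

Each resulting pair gives one side for a code documentation model (code to docstring) and, with the direction reversed, one for a code generation model (docstring to code).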
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Computational Physics and Python Applications
