CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance
Kunal Pai, Premkumar Devanbu, Toufique Ahmed

TL;DR
This paper introduces CoDocBench, a large dataset of code and documentation changes from GitHub commits, designed to facilitate training and evaluating models on software maintenance tasks involving code-documentation alignment.
Contribution
The paper presents a new dataset for code-documentation alignment in software maintenance, along with methodology for data collection and initial evaluation of model performance.
Findings
Current models struggle with maintenance-related tasks
The dataset enables realistic training and evaluation
Challenges remain for neural models in code-documentation tasks
Abstract
One of the central tasks in software maintenance is being able to understand and develop code changes. Thus, given a natural language description of the desired new operation of a function, an agent (human or AI) might be asked to generate the set of edits to that function to implement the desired new operation; likewise, given a set of edits to a function, an agent might be asked to generate a changed description of that function's new workings. There is therefore an incentive to train a neural model for change-related tasks. Motivated by this, we offer a new, "natural", large dataset of coupled changes to code and documentation mined from actual high-quality GitHub projects, where each sample represents a single commit in which the code and the associated docstring were changed together. We present the methodology for gathering the dataset, and some sample, challenging (but realistic) tasks…
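The notion of a "coupled" change in the abstract can be made concrete with a small sketch. The code below is not the authors' mining pipeline; it is a minimal illustration, assuming Python sources and using only the standard `ast` module, of how one might check that both the docstring and the code of a function differ between two versions of a file. The helper names `split_function` and `is_coupled_change` are hypothetical.

```python
import ast


def split_function(source: str, name: str):
    """Return (docstring, structural dump of signature and body) for `name`.

    Hypothetical helper: returns None if the function is not found.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_func = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and node.name == name:
            doc = ast.get_docstring(node)
            body = node.body
            # When a docstring exists it is the first statement; drop it so
            # that the "code" part reflects only executable statements.
            if doc is not None:
                body = body[1:]
            # ast.dump omits line numbers by default, so comment-only and
            # whitespace-only edits do not register as code changes.
            code = "\n".join(ast.dump(stmt) for stmt in body)
            signature = ast.dump(node.args)
            return doc, signature + "\n" + code
    return None


def is_coupled_change(old_source: str, new_source: str, name: str) -> bool:
    """True if both the docstring and the code of `name` changed."""
    old = split_function(old_source, name)
    new = split_function(new_source, name)
    if old is None or new is None:
        return False
    doc_changed = old[0] != new[0]
    code_changed = old[1] != new[1]
    return doc_changed and code_changed
```

Under these assumptions, a commit that edits only the docstring, or only the body, would be rejected, while one that edits both would be kept as a candidate sample. A real pipeline would also need to handle renamed functions, methods with the same name in different classes, and non-Python files.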
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Software Engineering Techniques and Practices
