Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?
Antonio Vitale, Antonio Mastropaolo, Rocco Oliveto, Massimiliano Di Penta, and Simone Scalabrino

TL;DR
This paper investigates whether removing incoherent code-comment pairs based on a coherence metric can optimize datasets for code summarization without sacrificing model performance, revealing many irrelevant examples in current datasets.
Contribution
It demonstrates that filtering datasets by code-comment coherence does not significantly impact summarization model effectiveness, highlighting the presence of many irrelevant training instances.
Findings
Dataset filtering by coherence does not reduce model accuracy.
Halving training data size maintains summarization quality.
Current datasets contain many irrelevant or noisy examples.
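To make the filtering idea above concrete, here is a minimal sketch of dropping code-comment pairs whose coherence score falls below a threshold. It is not the paper's method: `token_overlap_score` is a toy, hypothetical stand-in for a real coherence metric, and the function names, the example pairs, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re
from typing import Callable, List, Set, Tuple

Pair = Tuple[str, str]  # (code, comment)


def _tokens(text: str) -> Set[str]:
    """Split text into lowercase word pieces, breaking camelCase identifiers."""
    return {t.lower() for t in re.findall(r"[A-Za-z][a-z]*", text)}


def token_overlap_score(code: str, comment: str) -> float:
    """Toy stand-in for a coherence metric (an assumption, not the paper's):
    the fraction of comment words that also occur among the code's tokens."""
    comment_tokens = _tokens(comment)
    if not comment_tokens:
        return 0.0
    return len(comment_tokens & _tokens(code)) / len(comment_tokens)


def filter_by_coherence(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Keep only the pairs whose score is at least the threshold."""
    return [(code, comment) for code, comment in pairs
            if score_fn(code, comment) >= threshold]


# Hypothetical examples: the first comment matches its code, the second does not.
pairs = [
    ("def getUserName(self): return self.userName", "Returns the user name"),
    ("def add(a, b): return a + b", "Opens the database connection"),
]
kept = filter_by_coherence(pairs, token_overlap_score, threshold=0.5)
```

With any real metric plugged in as `score_fn`, the threshold controls how aggressively the training set shrinks, which is the trade-off the findings above are about.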
Abstract
Automated code summarization is a long-standing goal for code comprehension. This task automatically generates documentation for a given method. Deep Learning (DL)-based approaches have been proven beneficial for various software engineering (SE) tasks, including this one. Most state-of-the-art datasets for code summarization are automatically mined from GitHub and, thus, might contain erroneous or sub-optimal examples. Previous work showed that using a simple rule-based approach for removing noisy instances allows for a tangible reduction of the training set size while not reducing the effectiveness of the trained models. Motivated by this finding, we conjecture that it is possible to further reduce the dataset size by removing instances that contain different issues. In this paper, we explore the extent to which code-comment coherence, a specific quality attribute of code summaries,…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
