Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?

Antonio Vitale, Antonio Mastropaolo, Rocco Oliveto, Massimiliano Di Penta, and Simone Scalabrino

arXiv:2502.07611·cs.SE·February 12, 2025

Open Access

TL;DR

This paper investigates whether removing incoherent code-comment pairs based on a coherence metric can optimize datasets for code summarization without sacrificing model performance, revealing many irrelevant examples in current datasets.

Contribution

It demonstrates that filtering datasets by code-comment coherence does not significantly impact summarization model effectiveness, highlighting the presence of many irrelevant training instances.
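The filtering idea can be illustrated with a minimal sketch, under loud assumptions: the scorer below (`overlap_coherence`) is a hypothetical token-overlap placeholder, not the coherence metric used in the paper, and the function names and the threshold value are invented for illustration only. It shows only the general shape of the procedure: score each code-comment pair, then keep those at or above a cutoff.

```python
import re
from typing import Callable, List, Set, Tuple

Pair = Tuple[str, str]  # (code, comment)


def _tokens(text: str) -> Set[str]:
    # Split on non-alphanumerics and camelCase boundaries, then lowercase.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", text)
    return {p.lower() for p in parts}


def overlap_coherence(code: str, comment: str) -> float:
    """Hypothetical stand-in score: share of comment tokens that occur in the code.

    This is NOT the paper's coherence metric; it only makes the sketch runnable.
    """
    comment_tokens = _tokens(comment)
    if not comment_tokens:
        return 0.0
    return len(comment_tokens & _tokens(code)) / len(comment_tokens)


def filter_by_coherence(
    pairs: List[Pair],
    threshold: float,
    score: Callable[[str, str], float] = overlap_coherence,
) -> List[Pair]:
    # Keep only the pairs whose score reaches the (assumed) threshold.
    return [pair for pair in pairs if score(*pair) >= threshold]


pairs = [
    ("def get_user_name(user): return user.name", "Get the user name"),
    ("def add(a, b): return a + b", "TODO fix later"),
]
kept = filter_by_coherence(pairs, threshold=0.5)
```

Any real scorer with the same `(code, comment) -> float` signature could be passed as `score`; the threshold would have to be chosen empirically.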

Findings

01 Dataset filtering by coherence does not reduce model accuracy.

02 Halving training data size maintains summarization quality.

03 Current datasets contain many irrelevant or noisy examples.

Abstract

Automated code summarization is a long-standing goal for code comprehension. This task automatically generates documentation for a given method. Deep Learning (DL)-based approaches have been proven beneficial for various software engineering (SE) tasks, including this one. Most state-of-the-art datasets for code summarization are automatically mined from GitHub and, thus, might contain erroneous or sub-optimal examples. Previous work showed that using a simple rule-based approach for removing noisy instances allows for a tangible reduction of the training set size while not reducing the effectiveness of the trained models. Motivated by this finding, we conjecture that it is possible to further reduce the dataset size by removing instances that contain different issues. In this paper, we explore the extent to which code-comment coherence, a specific quality attribute of code summaries,…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research