CupCleaner: A Hybrid Data Cleaning Approach for Comment Updating
Qingyuan Liang, Zeyu Sun, Qihao Zhu, Junhao Hu, Yifan Zhao, Yakun Zhang, Lu Zhang

TL;DR
This paper introduces CupCleaner, a hybrid data cleaning method that improves comment updating models by filtering noisy data through static semantic analysis and dynamic loss monitoring, leading to better model performance.
Contribution
The paper presents a hybrid data cleaning approach that combines static and dynamic strategies to clean comment updating datasets, improving the quality of the data used for model training.
Findings
Both static and dynamic strategies effectively filter noisy data.
Combining the two strategies further improves model performance.
Training on the cleaned data improves comment updating accuracy.
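As a rough illustration of how a static filter and a loss-based dynamic filter might be combined, here is a minimal Python sketch. It is not the authors' implementation: the token-overlap score, the mean-plus-k-standard-deviations loss threshold, and the names `static_score`, `dynamic_keep_mask`, `hybrid_clean`, `min_static`, and `k` are all assumptions made for this example.

```python
from statistics import mean, pstdev


def static_score(code_change, old_comment, new_comment):
    """Hypothetical static score: the fraction of edited comment tokens
    that also appear in the code change (an assumed proxy for relevance)."""
    edited = set(old_comment) ^ set(new_comment)  # tokens added or removed
    if not edited:
        return 0.0
    return len(edited & set(code_change)) / len(edited)


def dynamic_keep_mask(losses, k=1.0):
    """Hypothetical dynamic rule: keep a sample only if its training loss
    is at most mean + k * std of all observed per-sample losses."""
    threshold = mean(losses) + k * pstdev(losses)
    return [loss <= threshold for loss in losses]


def hybrid_clean(samples, losses, min_static=0.5, k=1.0):
    """Keep samples that pass both the static and the dynamic check."""
    keep_dynamic = dynamic_keep_mask(losses, k)
    return [
        s
        for s, ok in zip(samples, keep_dynamic)
        if ok
        and static_score(s["code_change"], s["old_comment"], s["new_comment"])
        >= min_static
    ]


# Toy data (invented for illustration only).
samples = [
    # Comment edit mirrors the code change: kept.
    {"code_change": ["count", "total"],
     "old_comment": ["return", "count"],
     "new_comment": ["return", "total"]},
    # Comment edit is mostly unrelated to the code change: fails the static check.
    {"code_change": ["count", "total"],
     "old_comment": ["return", "count"],
     "new_comment": ["fix", "typo", "later"]},
    # Relevant edit, but an unusually high loss: fails the dynamic check.
    {"code_change": ["x", "y"],
     "old_comment": ["uses", "x"],
     "new_comment": ["uses", "y"]},
]
losses = [0.4, 0.5, 3.0]  # assumed per-sample losses from a training run

cleaned = hybrid_clean(samples, losses)
```

In this sketch a sample must pass both checks to survive; other ways of combining the two signals are equally plausible, and the thresholds would need tuning on real data.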
Abstract
Comment updating is an emerging task in software evolution that aims to automatically revise source code comments in accordance with code changes. This task plays a vital role in maintaining code-comment consistency throughout software development. Recently, deep learning-based approaches have shown great potential in addressing comment updating by learning complex patterns between code edits and corresponding comment modifications. However, the effectiveness of these learning-based approaches heavily depends on the quality of training data. Existing datasets are typically constructed by mining version histories from open-source repositories such as GitHub, where there is often a lack of quality control over comment edits. As a result, these datasets may contain noisy or inconsistent samples that hinder model learning and generalization. In this paper, we focus on cleaning existing…
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Topic Modeling
