Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering
Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh

TL;DR
This paper empirically investigates how data bugs in code, text, and metric datasets affect deep learning models in software engineering, revealing symptoms like bias, overfitting, and gradient instability, and emphasizing the importance of data quality.
Contribution
The paper analyzes how data bugs affect DL models in software engineering, identifies the symptoms and effects specific to each data type, and validates the findings on multiple datasets.
Findings
Data bugs cause biased learning and gradient instability in code data.
Text data issues lead to overfitting and poor generalization.
Metric data problems result in exploding gradients and overfitting.
Abstract
Deep learning (DL) techniques have achieved significant success in various software engineering tasks (e.g., code completion by Copilot). However, DL systems are prone to bugs from many sources, including training data. Existing literature suggests that bugs in training data are highly prevalent, but little research has focused on understanding their impacts on the models used in software engineering tasks. In this paper, we address this research gap through a comprehensive empirical investigation focused on three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based. Using state-of-the-art baselines, we compare the models trained on clean datasets with those trained on datasets with quality issues and without proper preprocessing. By analysing the gradients, weights, and biases from neural networks under training, we identify the symptoms of…
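The abstract describes inspecting gradients, weights, and biases during training to spot symptoms of data bugs. The sketch below illustrates that general idea on a toy problem and is not the authors' instrumentation. It assumes a plain linear model, synthetic metric-style data, and a simple threshold heuristic. The helper names (train_and_track, looks_exploding, standardize) and the two feature names are hypothetical.

```python
import math
import random

def train_and_track(features, targets, lr=0.1, epochs=20):
    """Fit y ~ w.x + b by full-batch gradient descent on mean squared error.

    Returns the L2 norm of the gradient recorded at every epoch.
    """
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    norms = []
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(d):
                grad_w[j] += 2.0 * err * x[j] / n
            grad_b += 2.0 * err / n
        norm = math.sqrt(sum(g * g for g in grad_w) + grad_b * grad_b)
        norms.append(norm)
        if not math.isfinite(norm):
            break  # training has already blown up; stop recording
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return norms

def looks_exploding(norms, factor=1e3):
    """Heuristic (an assumption, not from the paper): flag a run whose
    gradient norm becomes non-finite or exceeds `factor` times its first value."""
    first = norms[0]
    return any((not math.isfinite(g)) or g > factor * first for g in norms)

def standardize(features):
    """Rescale every feature column to zero mean and unit variance."""
    n, d = len(features), len(features[0])
    means = [sum(x[j] for x in features) / n for j in range(d)]
    stds = [math.sqrt(sum((x[j] - means[j]) ** 2 for x in features) / n)
            for j in range(d)]
    return [[(x[j] - means[j]) / stds[j] for j in range(d)] for x in features]

# Synthetic metric-style data: two hypothetical features on very different
# scales (lines of code, cyclomatic complexity) and a noisy linear target.
rng = random.Random(0)
raw = [[rng.uniform(0, 1000), rng.uniform(0, 50)] for _ in range(200)]
targets = [0.003 * loc + 0.05 * cc + rng.gauss(0, 0.1) for loc, cc in raw]

raw_norms = train_and_track(raw, targets)                 # preprocessing skipped
clean_norms = train_and_track(standardize(raw), targets)  # features standardized

print("unscaled features flagged:", looks_exploding(raw_norms))
print("standardized features flagged:", looks_exploding(clean_norms))
```

With the same learning rate, the only difference between the two runs is whether the features were standardized, so any divergence in the gradient norms can be attributed to the missing preprocessing step. For a real network, the same per-epoch logging would be applied to each layer's gradients rather than to a single weight vector.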
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Software Reliability and Analysis Research
