Towards Understanding the Impacts of Textual Dissimilarity on Duplicate Bug Report Detection
Sigma Jahan, Mohammad Masudur Rahman

TL;DR
This study investigates how textual dissimilarity affects duplicate bug report detection, revealing that current methods struggle with dissimilar reports and highlighting the need for more effective solutions.
Contribution
The paper provides a large-scale empirical analysis of how textual dissimilarity affects duplicate bug report detection and evaluates domain-specific embeddings for improved performance.
Findings
Existing techniques perform poorly on textually dissimilar duplicates.
Textually dissimilar reports often lack key components like expected behaviors.
Domain-specific embeddings show mixed results in improving detection.
Abstract
About 40% of software bug reports are duplicates of one another, which poses a major overhead during software maintenance. Traditional techniques often focus on detecting duplicate bug reports that are textually similar. However, in bug tracking systems, many duplicate bug reports might not be textually similar, and traditional techniques might fall short on them. In this paper, we conduct a large-scale empirical study to better understand the impacts of textual dissimilarity on the detection of duplicate bug reports. First, we collect a total of 92,854 bug reports from three open-source systems and construct two datasets containing textually similar and textually dissimilar duplicate bug reports. Then we determine the performance of three existing techniques in detecting duplicate bug reports and show that their performance is significantly poor for textually dissimilar duplicate…
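To make the idea of splitting duplicates into textually similar and textually dissimilar sets concrete, here is a minimal sketch. It is an illustration only, not the authors' procedure: the similarity measure (cosine similarity over bag-of-words term counts), the `threshold` value of 0.5, and the function names `tokenize`, `cosine_similarity`, and `split_duplicate_pairs` are all assumptions introduced for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the shared vocabulary only.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty report shares nothing with any other report
    return dot / (norm_a * norm_b)


def split_duplicate_pairs(pairs, threshold=0.5):
    """Partition known duplicate pairs by textual similarity.

    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    similar, dissimilar = [], []
    for report, duplicate in pairs:
        if cosine_similarity(report, duplicate) >= threshold:
            similar.append((report, duplicate))
        else:
            dissimilar.append((report, duplicate))
    return similar, dissimilar


if __name__ == "__main__":
    pairs = [
        ("app crashes on save", "app crashes on save file"),
        ("app crashes on save", "editor freezes when writing document"),
    ]
    similar, dissimilar = split_duplicate_pairs(pairs)
    print(len(similar), "similar pair(s);", len(dissimilar), "dissimilar pair(s)")
```

The second pair in the usage example describes what could plausibly be the same underlying fault with no shared vocabulary, which is the kind of duplicate that a purely lexical measure cannot link.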
Taxonomy
Topics: Software Engineering Research · Software System Performance and Reliability · Advanced Malware Detection Techniques
