DeepLSH: Deep Locality-Sensitive Hash Learning for Fast and Efficient Near-Duplicate Crash Report Detection
Youcef Remil, Anes Bendimerad, Romain Mathonat, Chedy Raissi, and Mehdi Kaytoue

TL;DR
DeepLSH introduces a novel deep learning approach to approximate locality-sensitive hashing for near-duplicate crash report detection, enabling efficient real-time bug bucketing with high accuracy.
Contribution
This paper presents DeepLSH, a Siamese neural network architecture that learns to approximate LSH for metrics like Jaccard and Cosine, filling a gap in crash bucketing literature.
Findings
DeepLSH achieves high approximation accuracy for LSH on crash report metrics.
Experimental results demonstrate improved efficiency in near-duplicate detection.
The dataset used for evaluation is made publicly available.
Abstract
Automatic crash bucketing is a crucial phase in the software development process for efficiently triaging bug reports. It generally consists of grouping similar reports through clustering techniques. However, with real-time streaming bug collection, systems are needed to quickly answer the question "What are the most similar bugs to a new one?", that is, to efficiently find near-duplicates. It is thus natural to consider nearest-neighbor search to tackle this problem, and especially the well-known locality-sensitive hashing (LSH), which handles large datasets thanks to its sublinear performance and theoretical guarantees on similarity search accuracy. Surprisingly, LSH has not been considered in the crash bucketing literature. It is indeed not trivial to derive hash functions that satisfy the so-called locality-sensitive property for the most advanced crash bucketing metrics. Consequently,…
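To make the locality-sensitive property mentioned in the abstract concrete, here is a minimal sketch of classical MinHash LSH for the Jaccard similarity between sets of stack frames. This is the textbook hand-derived scheme, not the learned DeepLSH model; the frame names, the helper names (`minhash_signature`, `lsh_buckets`), and the parameter values (128 hashes, 32 bands) are illustrative assumptions and do not come from the paper.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity between two collections of stack frames."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, parameterised by a seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(frames, num_hashes=128):
    """MinHash signature: for each seeded hash, keep the minimum over the set.

    For one seed, two sets share the same minimum with probability equal to
    their Jaccard similarity, which is the locality-sensitive property.
    """
    frames = set(frames)
    if not frames:
        raise ValueError("cannot hash an empty stack trace")
    return [min(_hash(seed, f) for f in frames) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signature, bands=32):
    """Split a signature into bands; each band is one hash-table key.

    Two reports become candidate near-duplicates if any band key matches.
    """
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


# Hypothetical stack traces (frame names are made up for illustration).
trace_a = ["main", "parse_config", "read_file", "open"]
trace_b = ["main", "parse_config", "read_file", "close"]
trace_c = ["worker", "render", "draw", "flush"]

sig_a = minhash_signature(trace_a)
sig_b = minhash_signature(trace_b)
sig_c = minhash_signature(trace_c)

print("exact J(a, b):    ", jaccard(trace_a, trace_b))
print("estimated J(a, b):", estimated_jaccard(sig_a, sig_b))
print("shared buckets a/b:", len(lsh_buckets(sig_a) & lsh_buckets(sig_b)))
print("shared buckets a/c:", len(lsh_buckets(sig_a) & lsh_buckets(sig_c)))
```

A new report is looked up only in the buckets its own band keys point to, which is where the sublinear query cost comes from. Set-based Jaccard admits this simple construction; the abstract's point is that more advanced crash bucketing metrics do not have such a hand-derived hash family, which is the gap a learned approximation is meant to fill.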
Taxonomy
Topics: Text and Document Classification Technologies · Advanced Image and Video Retrieval Techniques · Hepatitis B Virus Studies
