Stack Trace Deduplication: Faster, More Accurately, and in More Realistic Scenarios
Egor Shibaev, Denis Sushentsev, Yaroslav Golubev, Aleksandr Khvorov

TL;DR
This paper introduces a new stack trace deduplication model, a large industry dataset, and a comprehensive evaluation, demonstrating improved accuracy and speed in real-world scenarios for large-scale software error analysis.
Contribution
The work presents a novel two-part model, a new industry dataset, and a realistic evaluation framework for stack trace deduplication at scale.
Findings
The proposed model outperforms existing models on both open-source and industry datasets.
It balances accuracy against operational speed.
It processes large-scale, real-world stack trace data efficiently.
Abstract
In large-scale software systems, there are often no fully-fledged bug reports with human-written descriptions when an error occurs. In this case, developers rely on stack traces, i.e., the series of function calls that led to the error. Since tens or even hundreds of thousands of stack traces from different users can describe the same issue, automatic deduplication into categories is necessary to make processing feasible. Recent works have proposed powerful deep learning-based approaches for this, but they are evaluated and compared in isolation from real-life workflows, and it is not clear whether they will actually work well at scale. To bridge this gap, this work presents three main contributions: a novel model, an industry-based dataset, and a multi-faceted evaluation. Our model consists of two parts: (1) an embedding model with byte-pair encoding and approximate nearest neighbor search…
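To make the retrieval idea in the abstract concrete, here is a minimal sketch of deduplication by embedding similarity. It is not the authors' model: a regex tokenizer stands in for byte-pair encoding, a normalized bag-of-tokens vector stands in for the learned embedding, and a brute-force scan stands in for approximate nearest neighbor search. The names `tokenize`, `embed`, `Deduplicator`, and the `threshold` value are all hypothetical. The second part of the model is cut off in the text above and is therefore not sketched.

```python
import math
import re
from collections import Counter


def tokenize(frame: str) -> list[str]:
    # Toy stand-in for byte-pair encoding: split a frame such as
    # "com.app.Parser.parse" into lowercase sub-identifier tokens.
    return [tok.lower() for tok in re.findall(r"[A-Za-z][a-z0-9]*", frame)]


def embed(trace: list[str]) -> dict[str, float]:
    # Toy stand-in for a learned embedding: L2-normalized token counts,
    # stored sparsely as {token: weight}.
    counts = Counter(tok for frame in trace for tok in tokenize(frame))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are already unit length, so the dot product is the cosine.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


class Deduplicator:
    """Assigns each incoming stack trace to an existing group or a new one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # hypothetical similarity cut-off
        self.vectors: list[dict[str, float]] = []
        self.groups: list[int] = []
        self.next_group = 0

    def assign(self, trace: list[str]) -> int:
        vec = embed(trace)
        best_sim, best_group = -1.0, None
        # Brute-force scan; a real system would query an ANN index here.
        for stored, group in zip(self.vectors, self.groups):
            sim = cosine(vec, stored)
            if sim > best_sim:
                best_sim, best_group = sim, group
        if best_group is None or best_sim < self.threshold:
            best_group = self.next_group  # no close match: open a new group
            self.next_group += 1
        self.vectors.append(vec)
        self.groups.append(best_group)
        return best_group


dedup = Deduplicator(threshold=0.8)
g1 = dedup.assign(["com.app.Parser.parse", "com.app.Main.run"])
g2 = dedup.assign(["com.app.Parser.parse", "com.app.Main.start"])
g3 = dedup.assign(["org.db.Pool.acquire", "org.db.Query.execute"])
print(g1, g2, g3)
```

In this sketch the first two traces share most of their tokens and land in the same group, while the third shares none and opens a new one. The linear scan costs time proportional to the number of stored traces per query, which is the cost an approximate nearest neighbor index is meant to avoid at scale.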
Taxonomy
Topics: Advanced Data Storage Technologies
