Reducing Maintenance Burden in Behaviour-Driven Development: A Paraphrase-Robust Duplicate-Step Detector with a 1.1M-Step Open Benchmark
Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

TL;DR
This paper introduces a large, cross-organizational benchmark and a paraphrase-robust detector for identifying duplicate steps in Behaviour-Driven Development (BDD) Gherkin files, aiming to reduce maintenance effort.
Contribution
It provides the largest publicly available BDD step corpus, a new multi-strategy detection method, and a calibration benchmark to improve duplicate detection accuracy.
Findings
The detector achieves an F1 score of 0.822 on near-exact duplicates.
Semantic detection reaches an F1 of 0.906, outperforming lexical baselines.
In the median repository, an estimated 62.5% of step lines are eliminable.
Abstract
Context. Behaviour-Driven Development (BDD) suites in Gherkin accumulate step-text duplication with documented maintenance cost. Prior detectors either require runnable tests or are single-organisation, leaving a gap: a static, paraphrase-robust, step-level detector and a public benchmark to calibrate it. Objective. We release (i) the largest cross-organisational BDD step corpus to date, (ii) a labelled pair-level calibration benchmark, and (iii) a four-strategy detector with a consolidation-savings model linking clusters to ISO/IEC 25010 maintainability sub-characteristics. Method. The corpus contains 347 public GitHub repositories, 23,667 .feature files, and 1,113,616 Gherkin steps, SPDX-tagged. The detector layers exact hashing, normalised Levenshtein, sentence-transformer cosine, and a Levenshtein-banded hybrid. Calibration uses 1,020 manually labelled step…
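The first two layers named in the abstract, exact hashing and normalised Levenshtein, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. The normalisation rule (lowercasing, dropping the leading Gherkin keyword, collapsing whitespace), the function names, and the 0.9 threshold are all assumptions made for illustration; the abstract does not state the paper's calibrated thresholds.

```python
import hashlib
import re

# Assumption: these leading keywords are ignored when comparing steps.
GHERKIN_KEYWORDS = ("given", "when", "then", "and", "but")


def normalise(step: str) -> str:
    """Lowercase, collapse whitespace, and drop a leading Gherkin keyword."""
    text = re.sub(r"\s+", " ", step.strip().lower())
    first, _, rest = text.partition(" ")
    return rest if first in GHERKIN_KEYWORDS and rest else text


def exact_key(step: str) -> str:
    """Hash of the normalised step text; equal keys mean exact duplicates."""
    return hashlib.sha256(normalise(step).encode("utf-8")).hexdigest()


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using two rolling rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        previous = current
    return previous[-1]


def normalised_similarity(a: str, b: str) -> float:
    """1 - distance / max length, computed on the normalised step texts."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def classify(a: str, b: str, threshold: float = 0.9) -> str:
    """Hypothetical two-layer check; the 0.9 threshold is an assumed value."""
    if exact_key(a) == exact_key(b):
        return "exact"
    if normalised_similarity(a, b) >= threshold:
        return "near-exact"
    return "distinct"
```

Under these assumptions, `classify("Given the user is logged in", "And the user is logged in")` falls into the exact layer because both steps normalise to the same text. Paraphrases that share few characters would need the embedding-based layers, which this sketch leaves out.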
