Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

Ali Hassaan Mughal; Noor Fatima; Muhammad Bilal

arXiv:2605.14568 · cs.SE · May 15, 2026

TL;DR

This paper presents a machine learning approach to identifying and classifying refactoring opportunities in Behaviour-Driven Development (BDD) test suites. It relies on paraphrase-robust clustering of steps, compares the resulting classifier against LLM-judge baselines, and openly releases its tools and data.

Contribution

It introduces a novel pipeline combining paraphrase clustering, human labeling, and ML classification to automate detection of refactoring patterns in BDD test suites.

Findings

01. The classifier achieved an F1 score of 0.891, outperforming rule-based and LLM judges.

02. The study identified that 75% of scenarios contain within-file Background candidates.

03. Over 5 million slices were processed, revealing 692,020 recurring patterns.

Abstract

Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN)…
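The windowing step in the Method can be illustrated with a minimal sketch. It assumes each scenario has already been reduced to a list of integer cluster identifiers, one per step, and that the three scopes are file, repository, and owner, mirroring the three refactoring patterns; the abstract does not name the scopes, so that mapping is a guess. The names `windows`, `count_slices`, and the `Scenario` tuple layout are hypothetical and are not taken from the authors' released tooling.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical record layout: (owner, repo, file path, cluster ID of each step).
Scenario = Tuple[str, str, str, List[int]]
Slice = Tuple[int, ...]


def windows(steps: List[int], min_len: int = 2, max_len: int = 18) -> Iterable[Slice]:
    """Yield every contiguous window of min_len..max_len cluster IDs."""
    for length in range(min_len, min(max_len, len(steps)) + 1):
        for start in range(len(steps) - length + 1):
            yield tuple(steps[start:start + length])


def count_slices(scenarios: Iterable[Scenario]) -> Dict[Slice, Dict[str, int]]:
    """For each slice, count the distinct files, repos, and owners containing it.

    The three scope names are an assumption made for this sketch.
    """
    seen: Dict[Slice, Dict[str, Set]] = defaultdict(
        lambda: {"files": set(), "repos": set(), "owners": set()}
    )
    for owner, repo, path, steps in scenarios:
        for s in windows(steps):
            seen[s]["files"].add((owner, repo, path))
            seen[s]["repos"].add((owner, repo))
            seen[s]["owners"].add(owner)
    return {s: {scope: len(ids) for scope, ids in scopes.items()}
            for s, scopes in seen.items()}


# Toy corpus with made-up owners, repos, and cluster IDs.
scenarios = [
    ("acme", "shop", "login.feature", [7, 3, 9]),
    ("acme", "shop", "login.feature", [7, 3, 5]),
    ("acme", "shop", "cart.feature", [7, 3, 9, 4]),
    ("globex", "portal", "auth.feature", [1, 7, 3]),
]
counts = count_slices(scenarios)
print(counts[(7, 3)])  # {'files': 3, 'repos': 2, 'owners': 2}
```

In this toy corpus the slice `(7, 3)` recurs across owners, while `(7, 3, 9)` recurs only across two files of one repository. A real pipeline would differ in how it derives cluster identifiers and in what it counts per scope.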
