NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
Khondaker Tasnia Hoque, Toukir Ahammed

TL;DR
NeuroFlake is a neuro-symbolic framework that improves flaky test classification by integrating high-fidelity code tokens into LLMs, achieving better accuracy and robustness on imbalanced real-world datasets.
Contribution
It introduces Discriminative Token Mining to enhance LLM attention with symbolic signals, improving flaky test classification performance and robustness.
Findings
F1-score improved from 65.79% to 69.34% (a gain of 3.55 percentage points, pp).
NeuroFlake remains stable under adversarial augmentations, with a performance drop of only 4-7 pp.
Baseline models degrade by 8-18 pp on perturbed tests.
Abstract
Flaky tests, which exhibit non-deterministic pass/fail behavior for the same version of code, pose significant challenges to reliable regression testing. While large language models (LLMs) show promise for automated flaky test classification, they often fail to comprehend the actual logic behind test flakiness, instead overfitting to superficial textual artifacts (e.g., specific variable names). This semantic fragility leads to poor generalization on imbalanced real-world datasets and vulnerability to perturbations. In this paper, we introduce NeuroFlake, a novel neuro-symbolic framework for classifying flaky tests on highly imbalanced, real-world datasets (FlakeBench). Unlike prior approaches that rely on brittle manual rules and black-box learning, NeuroFlake integrates a Discriminative Token Mining (DTM) module to automate the discovery of high-fidelity, statistically significant source code…
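The abstract is cut off before it says how DTM scores tokens, so the following is only an illustrative sketch of the general idea of mining tokens that separate flaky from stable tests. It assumes a smoothed log-odds ratio over per-test token presence. The scoring statistic, the function names (`tokenize`, `mine_discriminative_tokens`), the parameters (`top_k`, `min_support`), and the toy test bodies are all hypothetical and not taken from the paper. How the mined tokens are then fed to the LLM is not covered here.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Return the set of identifier-like tokens in a test body (hypothetical tokenizer)."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code))


def mine_discriminative_tokens(tests, labels, top_k=5, min_support=2):
    """Rank tokens by a smoothed log-odds ratio of appearing in flaky vs. stable tests.

    Assumption: labels use 1 for flaky and 0 for stable. This statistic is a
    stand-in, not the scoring used by NeuroFlake's DTM module.
    """
    flaky_df, stable_df = Counter(), Counter()
    n_flaky = sum(1 for y in labels if y == 1)
    n_stable = len(labels) - n_flaky

    # Count, per class, how many tests contain each token (document frequency).
    for code, y in zip(tests, labels):
        (flaky_df if y == 1 else stable_df).update(tokenize(code))

    scores = {}
    for tok in set(flaky_df) | set(stable_df):
        f, s = flaky_df[tok], stable_df[tok]
        if f + s < min_support:
            continue  # skip tokens too rare to be reliable
        # Add-one smoothing keeps both probabilities strictly inside (0, 1).
        p_f = (f + 1) / (n_flaky + 2)
        p_s = (s + 1) / (n_stable + 2)
        scores[tok] = math.log(p_f / (1 - p_f)) - math.log(p_s / (1 - p_s))

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy data (hypothetical): two timing-dependent flaky tests, two stable tests.
tests = [
    "void testA() { Thread.sleep(100); assertTrue(done); }",
    "void testB() { Thread.sleep(50); assertEquals(1, count); }",
    "void testC() { assertEquals(2, add(1, 1)); }",
    "void testD() { assertTrue(isEmpty(list)); }",
]
labels = [1, 1, 0, 0]

top = mine_discriminative_tokens(tests, labels, top_k=2)
print(top)
```

On this toy input the two top-ranked tokens are `Thread` and `sleep`, which appear only in the flaky tests, while tokens shared evenly across both classes (such as `void`) score zero. Because the ranking depends on token presence per class and not on incidental names that occur once, a statistic of this kind is one plausible way to surface signals that survive renaming-style perturbations.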
