Comment on Iacobescu et al. Evaluating Binary Classifiers for Cardiovascular Disease Prediction: Enhancing Early Diagnostic Capabilities. J. Cardiovasc. Dev. Dis. 2024, 11, 396

Mohamed Eltawil; Laura Byham-Gray; Yuane Jia; Neil Mistry; James Parrott; Suril Gohel

PMC · DOI:10.3390/jcdd13010046·January 13, 2026

Open Access

TL;DR

This paper critiques a study that claimed nearly perfect accuracy in predicting cardiovascular disease, showing the results were due to flawed methods that leaked data.

Contribution

The paper identifies and explains how data leakage occurred through improper use of SMOTE-ENN and small kNN parameters, offering practical guidelines to prevent such errors.

Findings

01

Applying SMOTE-ENN before train/test split caused synthetic data to leak into the test set, inflating accuracy to nearly 99%.

02

Using k = 2 in kNN with leaked data further amplified the misleading performance metrics.

03

Correcting the workflow reduced accuracy to realistic levels (~80%), aligning with standard benchmarks.

Abstract

Machine learning is increasingly applied to cardiovascular disease prediction, yet reported performance metrics often appear implausibly high due to methodological errors. Recent work has reported nearly perfect predictive accuracy (≈99%) using a k-Nearest Neighbors (kNN) model on CDC heart-disease data. Such performance greatly exceeds typical BRFSS-based benchmarks and strongly indicates data leakage. In this commentary, we replicate and re-analyze the original workflow, showing that the authors applied the SMOTE-ENN resampling method prior to the train/test split, thereby allowing synthetic data generated from the full dataset to contaminate the test set. Combined with an excessively small neighborhood parameter (k = 2), this produced misleadingly high accuracy. We note that (1) with SMOTE-ENN performed globally, synthetic samples appear nearly identical to test points, leading to…

Keywords

machine learning; cardiovascular disease; data leakage; k-Nearest Neighbors (kNN); SMOTE-ENN; model validation; class imbalance; reproducibility; medical AI; performance evaluation

Taxonomy

Topics: Artificial Intelligence in Healthcare · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education

Full text

1. Introduction

Machine learning shows promise for cardiovascular risk prediction and early diagnosis of heart disease [1,2]. A recent study by Iacobescu et al. [3] evaluated several classifiers on a widely used CDC dataset and reported extraordinarily high performance, with k-Nearest Neighbors (kNN) achieving 99% accuracy and AUC ≈ 0.99. These near-perfect metrics far exceed typical results in heart disease prediction and raise serious concerns [4,5]. In this commentary, we show that data leakage likely inflated these results, and we raise concerns about the unusually small neighborhood size (k = 2) used in the kNN model. We then present a re-analysis that corrects these issues and yields more realistic performance, and we discuss lessons for robust model evaluation in medical AI.

2. Unusually High Performance Raises Concerns

Near-100% accuracy in predicting cardiovascular disease is prima facie implausible and inconsistent with the existing literature [4,5]. Most machine learning models for heart disease, even those using advanced techniques, report far more modest accuracy and AUC values [2,3]. Iacobescu et al.’s claim that a kNN model “stood out with the highest accuracy and F1-score” at essentially 99% is therefore an extreme outlier. The authors explicitly describe the sequence of their workflow: “Following data cleaning, feature engineering, and addressing class imbalance, the next step was data transformation … [min–max] normalization … Finally, the dataset was divided into training and testing subsets (70:30)”. Absent a trivial explanation (e.g., intrinsic duplication in the dataset that makes the training and testing sets nearly identical), such performance suggests leakage or overfitting in the evaluation procedure, as explained in the sections that follow.

Two red flags stand out in the study’s methodology: (1) the handling of data resampling for class imbalance, and (2) the choice of kNN with k = 2 neighbors. We discuss each below.

3. Data Leakage Through Improper Resampling

Data leakage occurs when there is unintended sharing of information between training and test data, leading to over-optimistic model performance. A common pitfall is performing data pre-processing (e.g., normalization, feature selection, or oversampling) before splitting the dataset, which can allow knowledge of the test set to influence the model [6].

Several concrete examples from the healthcare domain illustrate the deceptive performance caused by data leakage. In neuroimaging analysis, Yagis et al. [7] demonstrated that using a flawed validation strategy can drastically inflate results: they found that splitting MRI scans on a per-image basis (rather than by unique patients) erroneously boosted diagnostic accuracy by 30–50%, because the same patient’s images appeared in both training and testing sets. Another study showed that allowing any overlap in subjects or using feature selection with knowledge of test data inflated prediction performance significantly, whereas properly segregating data eliminated this illusion of high accuracy [8]. In clinical prediction tasks, including variables that are definitive for the diagnosis can lead to nearly perfect performance during testing—an obvious red flag.
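The per-patient splitting described for the MRI case can be sketched in a few lines. This is a minimal illustration, not code from any of the cited studies: the `group_split` helper, the `patient_XX` identifiers, and the slice counts are all hypothetical stand-ins.

```python
import random

def group_split(records, group_of, test_frac, rng):
    """Split so that every record of a given group lands on one side only."""
    groups = sorted({group_of(r) for r in records})
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_frac))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_of(r) not in test_groups]
    test = [r for r in records if group_of(r) in test_groups]
    return train, test

# Hypothetical records: (patient_id, slice_index) pairs standing in for MRI slices.
records = [(f"patient_{p:02d}", s) for p in range(20) for s in range(8)]

train, test = group_split(records, lambda r: r[0], 0.3, random.Random(0))

# Patients that appear on both sides; a per-image shuffle would not keep this empty.
shared = {p for p, _ in train} & {p for p, _ in test}
print(len(train), len(test), len(shared))
```

Splitting on the patient identifier rather than on individual images is what keeps one person's scans from appearing in both sets.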

In Iacobescu et al.’s study, the dataset (drawn from a large health survey) was heavily imbalanced (only ~8% positive for heart disease). The authors employed the SMOTE-ENN technique (Synthetic Minority Oversampling followed by Edited Nearest Neighbors) to balance classes, expanding the data from 306,939 to 472,485 records. Crucially, this resampling was performed on the entire dataset prior to the train–test split, as stated by the authors: “Following … addressing class imbalance, the next step was … the dataset was divided into training and testing subsets (70:30)” (i.e., the pipeline follows these steps: cleaning/feature-engineering → SMOTE-ENN → normalization → split). In other words, synthetic minority examples were generated using information from all data, including the eventual test set, and only afterward was the data split into training and test sets. This constitutes a textbook example of data leakage [6,9,10,11]. It means the model was effectively trained on a dataset that already contained engineered points influenced by test-set examples. As a result, the test data were no longer truly independent.

When oversampling is applied before the train/test split, some synthetic samples in the training set can be near-duplicates of test instances (or vice versa), making classification trivial. The model can “cheat” by learning from those synthetic points. This directly resonates with Iacobescu et al.’s situation—a nearly perfect kNN result likely masks a methodological flaw.
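The mechanism can be reproduced on purely synthetic data. The sketch below is not the authors' pipeline and does not use BRFSS data. It assumes a simplified SMOTE-style interpolation (the ENN cleaning step is omitted), a 1-nearest-neighbor classifier, and random features whose labels carry no signal at all, so any score clearly above 0.5 balanced accuracy can only come from the evaluation procedure. All function names and sizes are illustrative.

```python
import math
import random

def smote_like(minority, n_new, k, rng):
    """Simplified SMOTE: interpolate between a minority point and one of its
    k nearest minority neighbours. The ENN cleaning step is omitted."""
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not a),
                            key=lambda p: math.dist(a, p))[:k]
        b = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

def balance(rows, rng, k=5):
    """Oversample label 1 until both classes have the same size."""
    minority = [x for x, y in rows if y == 1]
    n_new = sum(1 for _, y in rows if y == 0) - len(minority)
    return rows + [(x, 1) for x in smote_like(minority, n_new, k, rng)]

def split(rows, rng, test_frac=0.3):
    """Shuffle a copy and return (train, test)."""
    rows = rows[:]
    rng.shuffle(rows)
    cut = int(len(rows) * test_frac)
    return rows[cut:], rows[:cut]

def balanced_accuracy(train, test):
    """Score 1-NN predictions as the mean of the two per-class recalls."""
    hits, totals = {0: 0, 1: 0}, {0: 0, 1: 0}
    for x, y in test:
        _, pred = min(train, key=lambda row: math.dist(row[0], x))
        totals[y] += 1
        hits[y] += pred == y
    return sum(hits[c] / totals[c] for c in (0, 1)) / 2

rng = random.Random(0)
DIM = 8  # assumed feature count for the toy data
# Labels are assigned independently of the features: there is nothing to learn.
data = [(tuple(rng.random() for _ in range(DIM)), 0) for _ in range(1000)]
data += [(tuple(rng.random() for _ in range(DIM)), 1) for _ in range(100)]

# Leaky order: oversample the whole dataset, then split.
train, test = split(balance(data, rng), rng)
leaky = balanced_accuracy(train, test)

# Leak-free order: split first, oversample the training part only.
train, test = split(data, rng)
clean = balanced_accuracy(balance(train, rng), test)

print(f"leaky: {leaky:.2f}  leak-free: {clean:.2f}")
```

Because the labels are noise, the leak-free score should hover around 0.5, while the leaky ordering places synthetic minority points in the test set right next to their training-set parents and siblings, which a nearest-neighbor rule picks up immediately.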

4. The k = 2 Issue and Overfitting

The grid search used for hyperparameter tuning selected a kNN model with k = 2 (using Euclidean distance and uniform weights). A very small neighborhood size like k = 2 (or the extreme case k = 1) often indicates a high-variance, overly complex model that can overfit noise or idiosyncratic patterns in the training data [12,13]. With k = 2, the classifier essentially makes decisions based on the two closest data points in the feature space [14]. In a massive dataset (over 300k samples), one would expect a somewhat larger k to be optimal for smooth generalization; a tiny k implies the model is leveraging extremely local relationships. If the training data contain near-duplicate or synthetically generated points, a k = 2 classifier can exploit them to classify those instances perfectly, but this performance will not carry over to truly new data. In other words, the choice of k = 2 likely compounded the leakage: the model zeroed in on nearest neighbors that were often artificially similar to the query (due to SMOTE-ENN across the whole dataset), yielding a falsely high accuracy.
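A small experiment illustrates why a tiny k behaves like memorization. The sketch below assumes a from-scratch uniform-weight kNN and a particular tie rule (ties go to the closest neighbor); library implementations may break ties differently. Under that assumed rule, k = 2 gives exactly the same predictions as k = 1. The data are coin-flip labels on random features, so there is no pattern to learn.

```python
import math
import random
from collections import Counter

def knn_predict(train, x, k):
    """Uniform-weight kNN vote. Ties go to the label of the closest
    neighbour (an assumed tie rule; libraries may differ)."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    top = max(votes.values())
    return next(label for _, label in nearest if votes[label] == top)

def accuracy(train, rows, k):
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)

rng = random.Random(1)
# Pure-noise data: labels are coin flips, unrelated to the 5 features.
rows = [(tuple(rng.random() for _ in range(5)), rng.randint(0, 1))
        for _ in range(600)]
train, held_out = rows[:400], rows[400:]

for k in (1, 2, 25):
    on_train = accuracy(train, train, k)      # each query is also in `train`
    on_new = accuracy(train, held_out, k)     # queries never seen before
    print(f"k={k:2d}  train={on_train:.2f}  held-out={on_new:.2f}")
```

With k = 1 or k = 2, every training point finds itself (or a near-copy) as its closest neighbor, so accuracy on points that overlap the training set is perfect even though nothing generalizable was learned. Held-out accuracy on this noise stays near chance.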

5. Results After Correcting the Methodology

We used BRFSS 2021 and the same 18 features described by Iacobescu et al. (Table 1), starting from n ≈ 306,939 after cleaning, to reconstruct their analysis and to test alternative corrections [3]. We implemented three pipelines:

(1) Reconstruction (global SMOTE-ENN + min–max, then 70/30 split): follows the same workflow as [3], with the aim of replicating the reported results as closely as possible;
(2) Leak-free resampling (split first; SMOTE-ENN and scalers fit only inside training folds): restricts SMOTE-ENN to the training set, keeps the model blind to the test set, and evaluates it on a truly held-out 30% test set;
(3) Undersampling (split first; RandomUnderSampler within folds): follows the same workflow as [3] but replaces oversampling with undersampling, thereby avoiding data synthesis and leakage into the test set.

Once the leakage was eliminated, the findings from approaches (2) and (3) differed substantially from those of (1) (Table 1). All classifiers perform at much lower levels that are more consistent with prevailing results in the literature, and the reported gap between kNN and the other algorithms closes. In fact, algorithms like Random Forest and Gradient Boosting, which are often strong performers on tabular data, achieve comparable, if not better, balanced accuracy once proper validation is in place. The kNN model’s accuracy falls to a level commensurate with known benchmarks, roughly in the 80% range. The dramatic drop in kNN’s metric (around 15 points) underscores how misleading the original evaluation was. It was not that kNN is an unexpectedly high-performing model for heart disease, but rather that the evaluation leak gave it an unrealistic advantage. Once leakage is corrected, no model in our experiment approached 99% accuracy on the test set, which is the expected result given the complexity of cardiovascular risk prediction.

It is important to note that class imbalance handling itself is not the culprit; indeed, techniques like SMOTE, when correctly applied to training data only, can improve model learning for minority classes. The key lesson is that the train–test split must be a sacrosanct boundary. Any operation that learns from the entire dataset (especially one that creates synthetic data points) must be confined to the training portion of each fold. Iacobescu et al.’s failure to maintain the train/test separation invalidated their otherwise impressive numbers.
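As a rough sketch of what "confined to the training portion of each fold" looks like in practice, the loop below learns min–max scaling parameters from each training fold and merely reuses them on the matching test fold. The helper names (`fit_minmax`, `apply_minmax`, `scaled_folds`) and the toy two-feature data are assumptions made for illustration; a resampler would sit at the commented position and see only the training rows.

```python
import random

def fit_minmax(rows):
    """Learn per-feature (min, max) from exactly the rows passed in."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_minmax(rows, params):
    """Scale rows with previously learned parameters; nothing is refit here."""
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, (lo, hi) in zip(row, params))
        for row in rows
    ]

def scaled_folds(X, n_folds, rng):
    """Yield (train_idx, test_idx, X_train, X_test) with per-fold scaling."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        params = fit_minmax([X[j] for j in train_idx])   # training fold only
        X_train = apply_minmax([X[j] for j in train_idx], params)
        X_test = apply_minmax([X[j] for j in test_idx], params)
        # A resampler such as SMOTE-ENN would be applied here, to X_train only.
        yield train_idx, test_idx, X_train, X_test

rng = random.Random(0)
# Hypothetical data: two numeric features on very different scales.
X = [(rng.gauss(0, 1), rng.gauss(50, 10)) for _ in range(100)]

for train_idx, test_idx, X_train, X_test in scaled_folds(X, 5, rng):
    outside = sum(1 for row in X_test for v in row if not 0.0 <= v <= 1.0)
    print(len(train_idx), len(test_idx), outside)
```

Scaled test values can fall slightly outside [0, 1] because the test rows played no part in choosing the minimum and maximum. That is the expected signature of a leak-free transform, not a defect.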

Our replication shows that the reported 99% accuracy is not achievable under a leakage-free pipeline. This aligns with broader concerns about methodological pitfalls that inflate performance [10], and with the identification of leakage as a root cause of the reproducibility crisis in ML-based science [11].

6. Other Issues Noted

In addition to the leakage, the paper used mean squared error (MSE) as the loss function for a neural network performing binary classification, an unconventional choice for a task where binary cross-entropy is more standard [15]. Furthermore, the terms “validation” and “testing” are used interchangeably without describing a held-out test set or nested validation, limiting reproducibility [16].
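One commonly cited reason for preferring cross-entropy, which the commentary itself does not spell out, is the behavior of the gradient when a sigmoid output is confidently wrong. The sketch below assumes a single sigmoid output unit with logit z and compares the derivative of each loss with respect to z for a positive example (y = 1). It is a textbook illustration, not a reconstruction of the network in [3].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_bce(z, y):
    """d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    return sigmoid(z) - y

def grad_mse(z, y):
    """d/dz of (p - y)**2 with p = sigmoid(z)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

# True label is 1; a strongly negative logit means "confidently wrong".
for z in (-6.0, -2.0, 0.0, 2.0):
    print(f"z={z:+.0f}  bce={grad_bce(z, 1):+.4f}  mse={grad_mse(z, 1):+.4f}")
```

The extra factor p(1 - p) in the MSE gradient shrinks toward zero exactly when the prediction is most wrong, so learning can stall, whereas the cross-entropy gradient stays close to -1 in that regime.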

7. Discussion and Recommendations

This case shows how inadvertent data leakage can lead to deceptively high performance and misguided conclusions. It also highlights the responsibility of researchers to ensure rigorous evaluation, especially in medical AI, where overestimating a model’s accuracy could have real clinical repercussions. Given how critical medical AI applications are, several safeguards are needed for such analyses to be useful [10,17,18,19]. We offer the following recommendations to help future studies avoid similar pitfalls:

Split Data Early and Properly: Always separate the test set (or use cross-validation folds) before any resampling, normalization, or feature engineering steps. This ensures the model is evaluated on truly unseen data.
Use Nested Validation for Tuning: Hyperparameter tuning (e.g., GridSearchCV) should be performed within a training fold, with an independent validation mechanism, rather than on the full dataset. This prevents “peeking” at test data during model selection.
Apply Oversampling Only to Training Data: Techniques like SMOTE should never have knowledge of the entire dataset. Generate synthetic samples after splitting, within the training subset (and if using cross-validation, redo it for each fold). This avoids contaminating the test set with synthetic points derived from it.
Be Wary of Extreme Metrics: Treat near-100% results with healthy skepticism. Examine whether any feature or preprocessing step could be unintentionally leaking information. Often, a deep dive will reveal either data leakage, label proxy features, or an overly simplistic dataset if performance is too good to be true.
Cross-Check Model Complexity: If an automated search selects an unusual hyperparameter (e.g., k = 1 or 2 in kNN, very deep trees, etc.), consider if this may be overfitting. Manually inspect performance on validation vs. training sets. A small k in kNN yielding huge accuracy gains is a hint to double-check the data pipeline for leaks or anomalies. Hyperparameter choices optimized purely by an algorithm should be interpreted in the context of clinical goals—ensuring that the resulting model serves meaningful and generalizable predictions rather than just maximizing mathematical metrics.
Report Methodology Transparently: Provide clear details on when each preprocessing step was performed relative to splitting. Ambiguity in this can hide leakage. Diagrams are helpful, but they must include these details, not just a high-level pipeline. Transparent reporting allows others to trust and reproduce the findings or catch issues if present.
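The tuning recommendation in the list above can be sketched as a two-level split: the test set is set aside first, k is chosen on an inner validation split carved out of the remaining data, and the test set is scored exactly once. This is a simplified hold-out version of nested validation (a full nested cross-validation would repeat it over folds). The function names, the candidate k values, and the two-blob toy data are assumptions for illustration.

```python
import math
import random
from collections import Counter

def knn_accuracy(train, rows, k):
    """Accuracy of a uniform-weight kNN majority vote."""
    correct = 0
    for x, y in rows:
        nearest = sorted(train, key=lambda r: math.dist(r[0], x))[:k]
        pred = Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
        correct += pred == y
    return correct / len(rows)

def tune_then_test(rows, candidate_ks, rng):
    rows = rows[:]
    rng.shuffle(rows)
    n_test = len(rows) // 5
    test, dev = rows[:n_test], rows[n_test:]     # test set is locked away
    n_val = len(dev) // 4
    val, fit = dev[:n_val], dev[n_val:]          # inner split used for tuning
    best_k = max(candidate_ks, key=lambda k: knn_accuracy(fit, val, k))
    return best_k, knn_accuracy(dev, test, best_k)   # single final evaluation

rng = random.Random(0)
# Hypothetical two-class data: Gaussian blobs centred at (0, 0) and (2, 2).
rows = [((rng.gauss(c, 1), rng.gauss(c, 1)), label)
        for label, c in ((0, 0.0), (1, 2.0)) for _ in range(150)]

# Odd candidate values avoid tied votes in a two-class problem.
best_k, test_acc = tune_then_test(rows, (1, 5, 15, 31), rng)
print(best_k, round(test_acc, 3))
```

Because the test rows never influence the choice of k, the final score is an honest estimate. Selecting k by looking at test accuracy would reintroduce the "peeking" the recommendation warns about.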

The impressive results of Iacobescu et al.’s CVD prediction model were likely an artifact of evaluation errors rather than a breakthrough in classifier capability. Once corrected, the kNN model does not in fact vastly outperform more established algorithms, nor does it reach the virtually perfect accuracy originally claimed. The dataset and approach to combining risk factors remain valuable despite the methodological flaws. But the case demonstrates that rigorous validation matters in machine learning studies. Especially in healthcare applications, we must ensure our models are truly generalizing and not just “learning the test by heart.” Leakage-induced overestimation in medical AI is not a technical nuisance but a patient-safety risk—misleading clinicians and overstating model readiness for deployment. By avoiding leakage and adhering to sound evaluation practices, future research can build on these results in a reliable way, helping translate AI advances into genuine clinical utility rather than illusory performance.

Bibliography (19)

1. Dey D., Slomka P.J., Leeson P., Comaniciu D., Shrestha S., Sengupta P.P., Marwick T.H. Artificial Intelligence in Cardiovascular Imaging. J. Am. Coll. Cardiol. 2019, 73, 1317–1335. doi:10.1016/j.jacc.2018.12.054. PMID: 30898208. PMCID: PMC6474254.
2. Krittanawong C., Virk H.U.H., Bangalore S., Wang Z., Johnson K.W., Pinotti R., Zhang H., Kaplin S., Narasimhan B., Kitai T. Machine learning prediction in cardiovascular diseases: A meta-analysis. Sci. Rep. 2020, 10, 16057. doi:10.1038/s41598-020-72685-1. PMID: 32994452. PMCID: PMC7525515.
3. Iacobescu P., Marina V., Anghel C., Anghele A.-D. Evaluating Binary Classifiers for Cardiovascular Disease Prediction: Enhancing Early Diagnostic Capabilities. J. Cardiovasc. Dev. Dis. 2024, 11, 396. doi:10.3390/jcdd11120396. PMID: 39728286. PMCID: PMC11678659.
4. Van Calster B., Nieboer D., Vergouwe Y., De Cock B., Pencina M.J., Steyerberg E.W. A calibration hierarchy for risk models was defined: From utopia to empirical data. J. Clin. Epidemiol. 2016, 74, 167–176. doi:10.1016/j.jclinepi.2015.12.005. PMID: 26772608.
5. Ioannidis J.P.A. Why Most Published Research Findings Are False. PLoS Med. 2005, 2, e124. doi:10.1371/journal.pmed.0020124. PMID: 16060722. PMCID: PMC1182327.
6. Alturayeif N., Hassine J. Data leakage detection in machine learning code: Transfer learning, active learning, or low-shot prompting? PeerJ Comput. Sci. 2025, 11, e2730. doi:10.7717/peerj-cs.2730. PMID: 40134878. PMCID: PMC11935776.
7. Yagis E., Atnafu S.W., de Herrera A.G.S., Marzi C., Scheda R., Giannelli M., Tessa C., Citi L., Diciotti S. Effect of data leakage in brain MRI classification using 2D convolutional neural networks. Sci. Rep. 2021, 11, 22544. doi:10.1038/s41598-021-01681-w. PMID: 34799630. PMCID: PMC8604922.
8. Rosenblatt M., Tejavibulya L., Jiang R., Noble S., Scheinost D. Data leakage inflates prediction performance in connectome-based machine learning models. Nat. Commun. 2024, 15, 1829. doi:10.1038/s41467-024-46150-w. PMID: 38418819. PMCID: PMC10901797.