Histological Grade, Tumor Breadth, and Hypertension Predict Early Recurrence in Pediatric Sarcoma: A LASSO-Regularized Micro-Cohort Study
Alexander Fiedler, Mehran Dadras, Marius Drysch, Sonja Verena Schmidt, Flemming Puscz, Felix Reinkemeier, Marcus Lehnhardt, Christoph Wallner

TL;DR
This study identifies clinical features like tumor grade, size, and hypertension that may predict early recurrence in pediatric sarcoma, using a machine learning approach.
Contribution
Hypertension is identified as a novel, potentially modifiable factor associated with early recurrence in pediatric sarcoma.
Findings
A four-variable risk signature including histological grade, tumor width, hypertension, and limb location was identified.
Tumor grade and width each approximately doubled the odds of recurrence.
PCA revealed distinct outlier patterns linked to high-risk profiles.
Abstract
Background/Objectives: Pediatric sarcomas are a biologically diverse group of mesenchymal tumors associated with morbidity due to recurrence, despite aggressive multimodal treatment. Reliable predictors of early recurrence remain limited. This exploratory study aimed to identify clinical features associated with first tumor recurrence using a machine learning approach tailored to low-event settings. Methods: We conducted a retrospective, single-center cohort study of 23 pediatric patients with histologically confirmed sarcoma. Forty-six baseline variables were extracted per patient, including clinical, histological, and comorbidity data. Tumor recurrence was the primary binary endpoint. A LASSO-regularized logistic regression model was developed using leave-one-out cross-validation (LOOCV) to identify the most informative predictors. Dimensionality reduction (PCA) and SHAP-value…
1. Introduction
Sarcomas are a biologically heterogeneous group of malignant tumors originating from mesenchymal tissues. Although they account for only about 1% of all adult cancers [1], they represent approximately 8% to 20% of all cancers diagnosed in children under the age of 15, underscoring their clinical relevance [2,3].
Despite advances in multimodal treatment, pediatric sarcomas, especially synovial sarcoma, remain prone to early relapse, and salvage options are limited [4,5,6]. Soole et al. reported a median time to first relapse of 24 months, with 73% of recurrences localized and 24.3% metastatic [7]. Although most patients received aggressive multimodal salvage therapy, including resection (75.7%), second-line chemotherapy (73.0%), and radiotherapy (48.6%), response rates remained low (for example, 36.4% for ifosfamide-based regimens), and five-year event-free and overall survival were only 32.8% and 42.1%, respectively [7].
Complete surgical resection continues to be the cornerstone of curative treatment [8,9,10], as incomplete R1 resections have been associated with both shorter disease-free survival and overall survival [11]. In pediatric cases, ensuring adequate resection is especially important to minimize the need for adjuvant radiotherapy, which can cause long-term complications, including secondary malignancies [12,13]. Moreover, early detection of recurrent disease at a smaller tumor size could allow for more conservative surgery and better functional outcomes [14].
Accordingly, numerous clinical and pathological predictors of recurrence have been proposed, including histological subtypes, tumor size, and surgical margins. However, findings remain inconsistent. For example, Sawamura et al. reported that while radiotherapy reduced local recurrence in pediatric extremity sarcomas, traditional prognostic factors were not statistically significant predictors of survival [15].
Given the complexity and heterogeneity of sarcoma behavior, particularly in pediatric patients, classical statistical approaches may be insufficient. Machine learning (ML) techniques therefore offer a promising alternative for detecting factors associated with early sarcoma recurrence. Weskamp et al., for instance, used machine learning to analyze data from over 6000 patients with soft tissue sarcomas, identifying a range of lifestyle, genetic, and treatment-related predictors [16].
Beyond general ML, advanced methods, such as LASSO (least absolute shrinkage and selection operator) and selective inference, have proven effective in identifying predictive variables in high-dimensional, low-sample-size datasets [17]. Further developments in ML and deep learning (DL) have enabled more accurate modeling of complex clinical outcomes. Byeon et al. demonstrated DL’s applicability in pediatric oncology settings [18], while Placido et al. and Miotto et al. highlighted how these methods can detect clinically relevant recurrence patterns often missed by conventional models [19,20].
Given the high recurrence burden and the need for personalized follow-up strategies, this study aims to identify clinical predictors of recurrence in pediatric sarcoma using a machine learning approach optimized for low-event-rate datasets. By applying LASSO-regularized logistic regression and interpretability techniques, such as SHAP analysis, we seek to uncover robust, clinically meaningful patterns that may support early risk stratification and inform individualized care. To our knowledge, this is the first study to assess risk factors for sarcoma recurrence in children using a machine learning approach.
2. Materials and Methods
2.1. Study Design and Dataset
We performed an exploratory, retrospective analysis of pediatric sarcoma patients treated at a single tertiary-care center. After quality control, the analytic dataset comprised 23 patients and 46 baseline variables (numeric and categorical). Variables were selected based on clinical relevance, accessibility, and the literature. The primary endpoint was the first tumor recurrence, recorded as a binary outcome. The study was conducted in compliance with the Declaration of Helsinki and approved by the responsible ethics committee of Ruhr University Bochum.
2.2. Data Cleaning and Encoding
The following processing steps were implemented in Python 3.11:
- Missing-value handling:
  - Numeric features: median imputation.
  - Categorical features: mode imputation.
- Feature encoding:
  - Numeric variables were z-standardized (μ = 0, σ = 1).
  - Categorical variables underwent one-hot encoding with drop_first = True; unseen levels were handled via handle_unknown = 'ignore'.
A unified ColumnTransformer (scikit-learn 1.4) ensured that all preprocessing was carried out inside the cross-validation loop, preventing information leakage.
2.3. Dimensionality Reduction and Outlier Detection
To visualize latent structure and identify atypical cases, we applied principal component analysis (PCA) to the fully processed design matrix (n_components = 2, svd_solver = ‘full’). Outlierness was quantified as the Euclidean distance of each observation from the multivariate centroid in PC space; the three largest distances were flagged a priori as potential outliers for clinical chart review.
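The outlier rule described above can be sketched in a few lines. The function name `flag_pca_outliers` and the synthetic matrix are assumptions for illustration only; the PCA settings mirror those stated in the text.

```python
import numpy as np
from sklearn.decomposition import PCA


def flag_pca_outliers(X, n_flag=3):
    """Project onto two principal components and flag the n_flag rows
    that lie farthest (Euclidean distance) from the centroid in PC space."""
    scores = PCA(n_components=2, svd_solver="full").fit_transform(X)
    centroid = scores.mean(axis=0)
    dist = np.linalg.norm(scores - centroid, axis=1)
    flagged = np.argsort(dist)[::-1][:n_flag]   # indices of largest distances
    return scores, dist, flagged


# Synthetic stand-in for the processed design matrix (not study data):
# 23 rows, of which the first three are shifted far from the rest.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(23, 10))
X_demo[:3] += 8.0
scores, dist, flagged = flag_pca_outliers(X_demo)
print(sorted(flagged.tolist()))
```

Fixing the number of flagged cases in advance, as in the text, avoids choosing a distance threshold after looking at the data.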
2.4. Predictive Modeling
Traditional stepwise models would overfit at ≤5 events; LASSO shrinks noisy coefficients to zero and is recommended for rare-event settings [21]. Given the low events-per-variable ratio, we chose a logistic regression with an L1-penalty (LASSO) to enforce sparsity. The solver was saga, with a regularization path of 100 λ values on a log scale, and the class weights were balanced. The optimal inverse-penalty parameter was C = 1.0, corresponding to λ = 1.0 (selected by leave-one-out minimization of the negative log-likelihood across a log-spaced grid from 10⁻² to 10²). All coefficients reported in Table 1 were estimated with this penalty strength.
Hyper-parameter selection and performance estimation employed leave-one-out cross-validation (LOOCV), which is nearly unbiased and retains the largest possible training set at n = 23, although its estimates have high variance. Performance metrics included balanced accuracy and area under the ROC curve (AUROC).
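A minimal sketch of the modeling step, under stated assumptions: the data are synthetic, the helper `loocv_lasso` is hypothetical, and a plain `StandardScaler` stands in for the full preprocessing. In scikit-learn, `C` is the inverse of the penalty strength (C = 1/λ). Because each LOOCV fold holds out a single patient, AUROC cannot be computed per fold; the sketch therefore pools the 23 held-out probabilities before scoring, which is one common convention and not necessarily the procedure used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def loocv_lasso(X, y, C=1.0):
    """L1-penalized logistic regression scored on pooled LOOCV predictions."""
    model = make_pipeline(
        StandardScaler(),  # refitted inside every fold, so no leakage
        LogisticRegression(penalty="l1", solver="saga", C=C,
                           class_weight="balanced", max_iter=5000,
                           random_state=0),
    )
    # One held-out probability per patient
    proba = cross_val_predict(model, X, y, cv=LeaveOneOut(),
                              method="predict_proba")[:, 1]
    pred = (proba >= 0.5).astype(int)
    return balanced_accuracy_score(y, pred), roc_auc_score(y, proba), proba


# Synthetic cohort (not study data): 23 rows, 5 events, 6 features
rng = np.random.default_rng(0)
y_demo = np.array([1] * 5 + [0] * 18)
X_demo = rng.normal(size=(23, 6))
X_demo[:, 0] += 1.5 * y_demo          # make the first feature informative
bal_acc, auroc, proba = loocv_lasso(X_demo, y_demo)
print(round(bal_acc, 2), round(auroc, 2))
```

Selecting C would wrap this function in a loop over a grid such as `np.logspace(-2, 2, 100)` and keep the value with the lowest pooled negative log-likelihood.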
2.5. Variable Importance
Final model coefficients were extracted at the λ yielding the minimum LOOCV deviance. Non-zero coefficients were interpreted as putative risk factors. To aid clinical interpretation, we supplied SHAP additive explanations for each patient (TreeExplainer on the linear model) and ranked the features by mean SHAP value.
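For a linear model, SHAP values have a closed form when features are treated as independent: the contribution of feature j for a patient is the coefficient times the deviation of that feature from its cohort mean, expressed in log-odds units. The sketch below uses this closed form directly rather than any particular explainer class from the shap package, since tree-specific explainers target tree ensembles. The feature names, the toy matrix, and both helper functions are hypothetical; the first two coefficients echo the values reported in the Results. Ranking uses the mean absolute value, because the signed mean of these contributions is zero by construction.

```python
import numpy as np


def linear_shap_values(coef, X):
    """Closed-form SHAP values for a linear model under feature independence:
    phi[i, j] = coef[j] * (X[i, j] - mean_j), in log-odds units."""
    coef = np.asarray(coef, dtype=float)
    X = np.asarray(X, dtype=float)
    return coef * (X - X.mean(axis=0))


def rank_by_mean_abs(shap_values, names):
    """Rank features by mean absolute SHAP value, largest first."""
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(names[i], float(importance[i])) for i in order]


names = ["grade", "width", "other"]          # hypothetical feature names
coef = np.array([0.78, 0.71, 0.0])           # third coefficient shrunk to zero
X_toy = np.array([[1.0, 2.0, 5.0],
                  [3.0, 2.0, 1.0],
                  [2.0, 5.0, 3.0]])          # toy rows, not study data
phi = linear_shap_values(coef, X_toy)
ranking = rank_by_mean_abs(phi, names)
print(ranking)
```

Each row of `phi` sums to that patient's linear predictor minus the cohort mean of the linear predictor, which is the additivity property that makes the per-patient explanations interpretable.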
2.6. Software and Reproducibility
All analyses were executed in a reproducible Jupyter environment (Ubuntu 22.04, 4 vCPU, 8 GB RAM). Pandas version 2.2.1, scikit-learn version 1.4.2, numpy version 1.26.4, matplotlib 3.8.4, and shap 0.45.0 were utilized.
3. Results
During the study window, 23 consecutively treated children and adolescents (all < 18 years) with histologically confirmed sarcoma were evaluable. The cohort had a mean age of 13.0 ± 4.7 years (mean ± SD) at the time of primary resection. The ages ranged from 2 to 17 years, with an inter-quartile range of 11.5–16.3 years. The median observational period was 76 months (6.3 years), yielding an aggregate of approx. 158 patient-years of follow-up. Importantly, no patient died of disease- or treatment-related causes during this interval; the overall survival, therefore, remained at 100%. In view of this zero-mortality backdrop, first tumor recurrence was adopted as the primary endpoint for all subsequent modelling exercises. Recurrences were documented in five patients (22%), providing an events-per-variable ratio of roughly 1:9 for the 46 baseline covariates analyzed. The distribution of continuous tumor metrics was wide—the primary width ranged from 0.2 cm to 8.7 cm, with a median of 1.5 cm. Figure 1 depicts this heterogeneity in a two-dimensional principal component map, with the three extreme outliers highlighted.
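The recurrence rate and the events-per-variable ratio quoted above follow from three counts given in the text; the variable names are illustrative.

```python
# Counts taken from the text
n_patients = 23
n_events = 5        # first recurrences
n_covariates = 46   # baseline variables

recurrence_rate = n_events / n_patients          # share of patients who recurred
events_per_variable = n_events / n_covariates    # events available per covariate
covariates_per_event = n_covariates / n_events   # the "1:9" ratio, inverted

print(f"recurrence rate: {recurrence_rate:.1%}")
print(f"events per variable: {events_per_variable:.3f}")
print(f"covariates per event: {covariates_per_event:.1f}")
```

With roughly nine candidate covariates for every event, an unpenalized model would be badly underdetermined, which is the motivation for the L1 penalty.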
Among the 23 evaluable patients, synovial sarcoma was the leading subtype, occurring in seven cases (30%). The next most common diagnosis was dermatofibrosarcoma protuberans (DFSP), seen in four patients (17%), followed by epithelioid sarcoma in three patients (13%). Three entities were less frequent yet clinically relevant, including aggressive fibromatosis, alveolar rhabdomyosarcoma, and low-grade fibromyxoid sarcoma, each diagnosed twice (9% each). Finally, single-patient occurrences (4% each) were noted for chondrosarcoma, dedifferentiated liposarcoma, fibrosarcoma, and an osteoblastoma-like variant of osteosarcoma. This spectrum underscores the biological heterogeneity that underpins the recurrence-risk analysis presented below.
Despite the statistical head-winds posed by a 1:9 events-per-variable ratio, a rigorously regularized, leave-one-out-validated LASSO model isolated a four-variable clinical signature that hinted at meaningful biology. The corresponding coefficient profile is visualized in Figure 2, underscoring the dominant weight of histological grade and primary width.
Tumor histological grade and primary width stood out: each incremental step in grade or every additional centimeter of width was associated with a little more than a two-fold increase in the odds of recurrence (β ≈ +0.78 and +0.71, translating to OR ≈ 2.2 and 2.0, respectively). Intriguingly, the presence of arterial hypertension, recorded in only a handful of patients, conferred a 70% relative increase in the odds of recurrence, an observation that aligns with emerging work linking vascular dysregulation to sarcoma aggressiveness. Localization-related parameters carried smaller, directionally inconsistent weights and were interpreted as surrogate markers of surgical complexity rather than true biological drivers.
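The conversion from logistic coefficients to odds ratios is OR = exp(β). The short sketch below reproduces the two reported values; the helper name `odds_ratio` is illustrative. One caveat is an assumption on our part: if the coefficients are on the z-standardized scale described in Section 2.2, "one unit" of a numeric feature means one standard deviation rather than one raw centimeter.

```python
import math


def odds_ratio(beta):
    """Odds ratio implied by a logistic-regression coefficient."""
    return math.exp(beta)


reported_betas = {"histological grade": 0.78, "primary width": 0.71}
odds_ratios = {name: odds_ratio(b) for name, b in reported_betas.items()}
for name, value in odds_ratios.items():
    print(f"{name}: OR = {value:.2f}")

# Going the other way: a 70% increase in odds (OR = 1.7) corresponds to
# a coefficient of ln(1.7).
beta_for_or_1_7 = math.log(1.7)
print(f"beta for OR 1.7: {beta_for_or_1_7:.2f}")
```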
To contextualize these signals, we projected all 46 baseline variables into a two-dimensional latent space (PCA, 50% explained variance). Three patients emerged as conspicuous outliers, occupying the periphery of the manifold. Two of them encapsulated the full “high-risk triad” (high grade, large width, and hypertension) and indeed recurred early. The third, conversely, harbored the smallest recorded tumor volume and remained disease-free, raising the twin possibilities of data mis-entry or exceptional therapeutic response.
Predictive Performance
Using an L1-penalized logistic regression model that was entirely trained and evaluated under a leave-one-out cross-validation (LOOCV) scheme, we obtained a balanced accuracy of 0.61 ± 0.08 and an area under the receiver operating characteristic curve (AUROC) of 0.47 ± 0.12.
In practical terms, the model correctly identified recurrence and non-recurrence cases 61% of the time once class imbalance was accounted for, exceeding the 50% benchmark of random guessing. However, its ability to rank-order patients by risk was poor: an AUROC of ≈0.5 indicates that, when a randomly chosen child who recurred is paired with one who did not, the algorithm assigns the higher risk score to the correct patient only about half the time, which is no better than chance. The discrepancy between balanced accuracy (a threshold-dependent metric) and AUROC (threshold-free) is not unexpected in micro-datasets where a single misclassified event can shift the ROC curve substantially.
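The pairwise reading of AUROC used above can be made concrete. The function `pairwise_auroc` and the toy scores are illustrative; ties are counted as half a win, which matches the usual Mann-Whitney convention.

```python
from itertools import product


def pairwise_auroc(scores_pos, scores_neg):
    """AUROC as the fraction of (recurred, not recurred) pairs in which the
    patient who recurred receives the higher score; ties count as 0.5."""
    wins = 0.0
    for p, n in product(scores_pos, scores_neg):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Toy risk scores (not study data)
auc_toy = pairwise_auroc([0.9, 0.4], [0.8, 0.3, 0.1])
print(auc_toy)

# With 5 recurrences and 18 non-recurrences there are only 90 such pairs,
# so every pair that flips order moves the AUROC by 1/90.
n_pairs = 5 * 18
print(n_pairs, round(1 / n_pairs, 3))
```

The small number of comparable pairs is one reason the AUROC estimate is so unstable in a cohort of this size.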
The relatively wide standard deviations (±0.08 and ±0.12, respectively) reflect the inherent volatility of LOOCV with only five positive events. Taken together, these figures suggest that while the LASSO constraint prevented gross over-fitting, the model remains little more than a hypothesis generator, a quantitative scaffold that highlights potentially relevant variables rather than a clinically deployable risk score.
A calibration plot grouped by risk quartiles (Table A1) showed close agreement between the observed and expected recurrence rates (Hosmer–Lemeshow χ² = 1.9, df = 2, p = 0.39), suggesting that the LASSO model, while exploratory, is not over-confident.
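The reported p-value can be checked by hand. With four risk groups, the Hosmer–Lemeshow statistic is referred to a chi-square distribution with 4 − 2 = 2 degrees of freedom, and for df = 2 the upper-tail probability has the closed form exp(−x/2). The helper name below is illustrative.

```python
import math


def chi2_upper_tail_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of
    freedom, using the closed form exp(-x / 2)."""
    return math.exp(-x / 2.0)


groups = 4                      # risk quartiles
df = groups - 2                 # Hosmer-Lemeshow degrees of freedom
p_value = chi2_upper_tail_df2(1.9)
print(df, round(p_value, 2))
```

A non-significant result here should be read cautiously, because the test has little power to detect miscalibration with 23 patients.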
Bias-reduced (Firth) logistic regression reproduced the rank order and sign of all four non-zero LASSO predictors, with odds-ratio deviations < 5%. This concordance indicates that the signal is not an artifact of small-sample separation (see Table A2).
Taken together, our findings suggest that even within a micro-cohort, grade and gross tumor width retain their primacy as harbingers of early failure, while comorbid hypertension surfaces as a provocative, potentially modifiable cofactor. Although these results are inevitably hypothesis-generating, they argue for systematic blood pressure surveillance in forthcoming pediatric sarcoma trials and underscore the value of latent-space mapping for flagging clinically actionable outliers.
4. Discussion
This exploratory study applied LASSO-regularized logistic regression to a micro-cohort of pediatric sarcoma patients to identify factors associated with early tumor recurrence. Despite the small sample size and low event rate, the model identified a four-variable clinical signature comprising histological grade, primary tumor width, extremity localization, and arterial hypertension. These features may help refine recurrence risk stratification in children with soft tissue sarcomas, pending validation in a multicenter controlled trial.
Histological grade and tumor width emerged as the strongest predictors, each associated with more than a two-fold increase in recurrence odds. These findings are consistent with established oncological principles: higher-grade tumors typically demonstrate increased proliferative capacity and metastatic potential [22,23], while larger tumors pose surgical challenges and are more likely to harbor microscopic spread [24]. Their prognostic relevance is well-documented across both pediatric and adult populations [24,25,26], reaffirming the biological relevance of these variables across age groups [27,28,29], and they remain central to clinical tools such as the Sarculator [30].
The identification of arterial hypertension as a potential predictor is both novel and biologically plausible. Although its causal role remains speculative and the model's balanced accuracy is modest, secondary hypertension has been linked to aggressive tumor biology in other malignancies, such as adrenal cancers, via activation of the renin–angiotensin–aldosterone system [31,32]. In our cohort, hypertension was uncommon but consistently associated with recurrence across modeling strategies. This may suggest a role in sarcoma pathophysiology, potentially through mechanisms involving chronic inflammation, endothelial dysfunction, and elevated vascular endothelial growth factor (VEGF) expression [33,34]. Furthermore, tumors with high angiogenic activity often exhibit elevated VEGF expression, a known contributor to hypertension through capillary rarefaction and increased vascular tone [34,35,36]. Hypoxia-inducible factor 1-alpha (HIF-1α), often upregulated in sarcomas, is also linked to both VEGF expression and systemic hypertension [37,38]. Although these biological pathways offer a plausible mechanistic link, we did not assess VEGF, HIF-1α, or other molecular markers in our study. Therefore, the observed association between hypertension and recurrence should be regarded as hypothesis-generating. Whether hypertension reflects a true biological driver, a surrogate marker of tumor aggressiveness, or an iatrogenic effect (e.g., from corticosteroids or pain) remains unclear. Future prospective studies incorporating molecular profiling and systematic blood pressure surveillance are needed to clarify the causal direction and clinical relevance of this finding.
Additionally, tumor localization to the extremities was associated with higher recurrence risk, echoing prior reports that found local recurrence rates as high as 25% in limb sarcomas [39,40], compared to 13.6% for tumors of the trunk wall and head/neck regions [41]. This underscores the surgical complexity of achieving clear margins in anatomically constrained sites.
While the model’s predictive performance was modest (balanced accuracy = 0.61; AUROC ≈ 0.5), this is not unexpected given the low event rate, which makes the model more susceptible to perturbations. Importantly, model calibration was acceptable, and the risk factor ranking remained stable under bias-reduced logistic regression, indicating that the findings, though exploratory, are not artifacts of statistical instability. The use of leave-one-out cross-validation (LOOCV) ensured that each prediction was tested on an unseen patient, a critical safeguard in low-sample settings that helps prevent overfitting [42].
Additionally, dimensionality reduction using principal component analysis (PCA) proved valuable in highlighting patient-level outliers, such as one individual exhibiting all three major risk factors who experienced early recurrence. In rare cancers with limited sample sizes, such visual tools can augment clinical interpretation and identify high-risk phenotypes that warrant closer monitoring or intervention. By reducing noise from high-dimensional data, PCA also supported the robustness of model interpretation [43].
These findings must be interpreted in light of several limitations. The retrospective, single-center design limits generalizability, because treatment protocols, patient populations, and data collection practices vary between centers. The small sample size (n = 23) and low event count (n = 5 recurrences) constrained statistical power and model complexity. Although LASSO regularization addresses overfitting, the model may still be sensitive to minor data perturbations. Moreover, rare covariates, such as hypertension, require cautious interpretation, as measurement may be confounded by situational factors (e.g., white coat effect [44], pain [45], anxiety [46]).
5. Conclusions
This exploratory study demonstrates that classical histopathological features—tumor grade and size—remain central to relapse risk in pediatric sarcoma. Additionally, it introduces arterial hypertension as a novel, potentially modifiable cofactor associated with recurrence. While predictive accuracy was limited by sample size, the model showed internal consistency and biological plausibility, warranting validation in larger prospective cohorts.
Machine learning methods, such as LASSO regression, can yield meaningful insights even in small datasets, provided their exploratory nature is recognized. Future studies should aim to validate these exploratory findings in larger cohorts and explore the mechanistic links between hypertension and sarcoma biology.
References
- 1. Zahm S.H., Fraumeni J.F. Jr. The epidemiology of soft tissue sarcoma. Semin. Oncol. 1997, 24, 504–514. PMID: 9344316.
- 2. Linabery A.M., Ross J.A. Childhood and adolescent cancer survival in the US by race and ethnicity for the diagnostic period 1975–1999. Cancer 2008, 113, 2575–2596. doi: 10.1002/cncr.23866. PMID: 18837040. PMC 2765225.
- 3. Burningham Z., Hashibe M., Spector L., Schiffman J.D. The epidemiology of sarcoma. Clin. Sarcoma Res. 2012, 2, 14. doi: 10.1186/2045-3329-2-14. PMID: 23036164. PMC 3564705.
- 4. Loeb D.M., Thornton K., Shokek O. Pediatric soft tissue sarcomas. Surg. Clin. N. Am. 2008, 88, 615–627. doi: 10.1016/j.suc.2008.03.008. PMID: 18514702. PMC 4273573.
- 5. Zagars G.K., Ballo M.T., Pisters P.W.T., Pollock R.E., Patel S.R., Benjamin R.S., Evans H.L. Prognostic factors for patients with localized soft-tissue sarcoma treated with conservation surgery and radiation therapy: An analysis of 1225 patients. Cancer 2003, 97, 2530–2543. doi: 10.1002/cncr.11365. PMID: 12733153.
- 6. Fernebro J., Bladström A., Rydholm A., Gustafson P., Olsson H., Engellau J., Nilbert M. Increased risk of malignancies in a population-based study of 818 soft-tissue sarcoma patients. Br. J. Cancer 2006, 95, 986–990. doi: 10.1038/sj.bjc.6603401. PMID: 17008869. PMC 2360699.
- 7. Soole F., Maupain C., Defachelles A.-S., Taque S., Minard-Colin V., Bergeron C., De Rycke Y., Orbach D. Synovial sarcoma relapses in children and adolescents: Prognostic factors, treatment, and outcome. Pediatr. Blood Cancer 2014, 61, 1387–1393. doi: 10.1002/pbc.25001. PMID: 24664883.
- 8. Garcia-Ortega D.Y. Comprehensive treatment strategy for improving surgical resection rate of retroperitoneal sarcomas: A histology-specific approach narrative review. Front. Oncol. 2024, 14, 1432900. doi: 10.3389/fonc.2024.1432900. PMID: 39435281. PMC 11491436.
