# Phylogenomics to structure: evolutionary and clinical signals in the TP53 DNA-binding core through LOOCV-validated ensemble learning

**Authors:** Syed Raza Abbas, Zeeshan Abbas, Arifa Zahir, Mobeen Ur Rehman, Seung Won Lee

PMC · DOI: 10.1093/bib/bbag087 · Briefings in Bioinformatics · 2026-02-26

## TL;DR

This study combines evolutionary data and machine learning to identify critical residues in the TP53 tumor suppressor protein that are important for DNA binding and disease.

## Contribution

A novel computational framework integrating phylogenomics, structural biology, and clinical data to prioritize functionally critical TP53 residues.

## Key findings

- Human codon 129 (alignment position 606), in the DNA-binding domain, is the single site flagged as under positive selection by both selection tests (FEL and FUBAR).
- Residues 239–248 are identified as the primary pathogenic hotspot based on structural and clinical data integration.
- Under LOOCV, a Ridge–ExtraTrees ensemble and a multi-branch deep neural network outperformed Random Forest on phylogenetic-distance regression and clade classification.
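
The abstract describes a composite score that integrates evolutionary constraint, pathogenic burden, hotspot density, and domain importance to rank residues. The sketch below shows one way such a score could be combined and scanned over contiguous windows. It is a minimal illustration under stated assumptions: the equal weights, the `[0, 1]` scaling, the `ResidueFeatures` type, the `best_window` helper, and all input values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ResidueFeatures:
    position: int
    constraint: float          # evolutionary constraint, assumed scaled to [0, 1]
    pathogenic_burden: float   # pathogenic variant burden, assumed scaled to [0, 1]
    hotspot_density: float     # local hotspot density, assumed scaled to [0, 1]
    domain_importance: float   # domain weight, assumed scaled to [0, 1]


# Hypothetical equal weights; the paper's actual weighting is not stated here.
WEIGHTS = {
    "constraint": 0.25,
    "pathogenic_burden": 0.25,
    "hotspot_density": 0.25,
    "domain_importance": 0.25,
}


def composite_score(r: ResidueFeatures) -> float:
    """Weighted sum of the four per-residue components."""
    return (
        WEIGHTS["constraint"] * r.constraint
        + WEIGHTS["pathogenic_burden"] * r.pathogenic_burden
        + WEIGHTS["hotspot_density"] * r.hotspot_density
        + WEIGHTS["domain_importance"] * r.domain_importance
    )


def best_window(residues: list[ResidueFeatures], width: int) -> tuple[int, float]:
    """Return (start position, mean score) of the top contiguous window."""
    scores = [composite_score(r) for r in residues]
    best_start, best_mean = residues[0].position, float("-inf")
    for i in range(len(scores) - width + 1):
        mean = sum(scores[i:i + width]) / width
        if mean > best_mean:
            best_start, best_mean = residues[i].position, mean
    return best_start, best_mean


# Synthetic toy input: six residues, each with all four components equal to v.
toy_values = [0.1, 0.2, 0.9, 0.8, 0.3, 0.1]
toy = [ResidueFeatures(i + 1, v, v, v, v) for i, v in enumerate(toy_values)]
start, mean_score = best_window(toy, width=2)
```

With real inputs, each component would come from the corresponding analysis step (selection testing, ClinVar and UniProt variant counts, structural mapping), and the window width would be chosen to match the hotspot size of interest.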

## Abstract

TP53 encodes a master tumor suppressor, and understanding its evolutionary constraints is critical for interpreting pathogenic variation. We developed a fully reproducible computational pipeline integrating evolutionary genomics, structural biology, and clinical variant analysis to systematically prioritize functionally critical residues in TP53. We used fixed effects likelihood and fast unconstrained Bayesian approximation to perform genome-wide alignment, maximum-likelihood phylogenetic estimation, and site-specific selection testing over 19 vertebrate orthologs. We mapped these evolutionary signals onto the AlphaFold-predicted structure and integrated 3936 human variants from ClinVar and UniProt. Selection analysis identified five sites under positive or diversifying selection, with a single consensus position detected by both methods: multiple-sequence-alignment position 606 (human codon 129) in the DNA-binding domain. Structural mapping revealed that pathogenic variants concentrate at the DNA-contacting interface, with residues 239–248 emerging as the highest-priority targets based on our composite scoring system that integrates evolutionary constraint, pathogenic burden, hotspot density, and domain importance. Machine learning validation under leave-one-out cross-validation (LOOCV) demonstrated robust predictive performance. A Ridge–ExtraTrees ensemble achieved MAE (mean absolute error) $= 2.84$, RMSE (root mean squared error) $= 3.72$, $R^{2} = 0.91$ for phylogenetic-distance regression and 89.5% accuracy (17/19) for clade classification. A multi-branch deep neural network attained comparable results ($\mathrm{MAE} = 2.33$, $\mathrm{RMSE} = 2.56$, $R^{2} = 0.86$), while Random Forest substantially underperformed ($\mathrm{MAE} \approx 7.19$, $\mathrm{RMSE} \approx 8.82$, $R^{2} \approx 0.47$, accuracy $\approx 63\%$) due to shrinkage and class-imbalance bias. Our findings show that evolutionary signals and clinical variants converge within the structurally constrained DNA-binding core of TP53, with codon 129 representing a robust positive-selection site and residues 239–248 constituting the primary pathogenic hotspot. This AlphaFold-anchored, LOOCV-validated framework offers a systematic, generalizable approach for residue-level prioritization to guide mechanistic studies and potentially inform precision oncology applications pending experimental validation.
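
To make the reported metrics concrete, the sketch below shows how MAE, RMSE, and $R^{2}$ can be computed from leave-one-out predictions, where each sample is predicted by a model fitted on all the others. It is an illustration only: a single-feature ridge regression stands in for the paper's Ridge–ExtraTrees ensemble, the data are synthetic, the function names are hypothetical, and pooling all held-out predictions before computing $R^{2}$ is an assumption about how the metrics were aggregated.

```python
import math


def fit_ridge_1d(xs, ys, lam=1.0):
    """Closed-form ridge fit for one feature with an unpenalized intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + lam)  # lam shrinks the slope toward zero
    return slope, y_bar - slope * x_bar


def loocv_predictions(xs, ys, lam=1.0):
    """Predict each sample from a model trained on all remaining samples."""
    preds = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_ridge_1d(train_x, train_y, lam)
        preds.append(slope * xs[i] + intercept)
    return preds


def regression_metrics(ys, preds):
    """MAE, RMSE, and R^2 over the pooled held-out predictions."""
    n = len(ys)
    errors = [y - p for y, p in zip(ys, preds)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    y_bar = sum(ys) / n
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot
    return mae, rmse, r2


# Synthetic toy data (not from the paper): a nearly linear relationship.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
preds = loocv_predictions(xs, ys, lam=0.1)
mae, rmse, r2 = regression_metrics(ys, preds)

# The classification figure in the abstract is plain arithmetic: 17 of 19 correct.
clade_accuracy_pct = round(100 * 17 / 19, 1)
```

With 19 orthologs, LOOCV yields exactly 19 held-out predictions, which is why the clade-classification accuracy is reported as a fraction of 19.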

## Linked entities

- **Genes:** TP53 (tumor protein p53) [NCBI Gene 7157]

## Full-text entities

- **Genes:** TP73 (tumor protein p73) [NCBI Gene 7161] {aka CILD47, P73}, TP63 (tumor protein p63) [NCBI Gene 8626] {aka AIS, B(p51A), B(p51B), EEC3, KET, LMS}, TP53 (tumor protein p53) [NCBI Gene 7157] {aka BCC7, BMFS5, LFS1, P53, TRP53}
- **Diseases:** cancer (MESH:D009369), DL (MESH:C537113), hereditary cancer predisposition disorder (MESH:D009386), Li-Fraumeni syndrome (MESH:D016864), breast, lung, colorectal, and ovarian malignancies (MESH:D010051)
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606], Canis lupus familiaris (dog, subspecies) [taxon 9615], teleost fish (species) [taxon 70862], Macaca mulatta (rhesus macaque, species) [taxon 9544], Danio rerio (leopard danio, species) [taxon 7955], Pan troglodytes (chimpanzee, species) [taxon 9598], Gallus gallus (bantam, species) [taxon 9031], Taeniopygia guttata (zebra finch, species) [taxon 59729], Takifugu (genus) [taxon 31032], Xenopus laevis (African clawed frog, species) [taxon 8355]
- **Mutations:** G245, R248, G245S, R273, R175H

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12936793/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12936793/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12936793/full.md

---
Source: https://tomesphere.com/paper/PMC12936793