# Integrated phenotypic analysis, predictive modeling, and identification of novel trait-associated loci in a diverse Theobroma cacao collection

**Authors:** Insuck Baek, Minhyeok Cha, Seunghyun Lim, Brian M. Irish, Sookyung Oh, Jishnu Bhatt, Rakesh K. Upadhyay, Moon S. Kim, Lyndel W. Meinhardt, Sunchung Park, Ezekiel Ahn

PMC · DOI: 10.1186/s12870-025-07128-y · BMC Plant Biology · 2025-08-09

## TL;DR

This study analyzed a diverse cacao collection to understand trait diversity, predict yield, and identify genetic markers linked to important traits for breeding.

## Contribution

The study provides novel trait-linked genetic markers and robust yield prediction models for cacao breeding.

## Key findings

- Significant phenotypic variation and strong trait correlations were observed in the cacao collection.
- A genetic marker on chromosome 5 was associated with both 'Total pods' and 'Yield', offering a target for marker-assisted selection.
- Machine learning models identified 'Total pods', 'Infection rate', and 'Pod weight' as the most influential yield predictors.
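
The summary does not say how predictor influence was scored. One common, model-agnostic option is permutation importance: shuffle one predictor at a time and measure how much the prediction error grows. The sketch below is only an illustration under stated assumptions. The data are synthetic, `toy_model` is a hypothetical stand-in for a fitted model, and the `noise` column is an invented control feature; none of this comes from the paper.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = {}
    for name in rows[0]:
        increases = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Copy each row, replacing only the feature under test.
            permuted = [dict(r, **{name: v}) for r, v in zip(rows, shuffled)]
            increases.append(mse(y, [predict(r) for r in permuted]) - baseline)
        importances[name] = sum(increases) / n_repeats
    return importances


def toy_model(row):
    # Hypothetical yield function, NOT the model from the paper.
    return row["total_pods"] * row["pod_weight"] * (1 - row["infection_rate"])


# Synthetic accessions (assumption: value ranges are arbitrary).
data_rng = random.Random(1)
rows = [
    {
        "total_pods": data_rng.uniform(5, 60),
        "infection_rate": data_rng.uniform(0.0, 0.5),
        "pod_weight": data_rng.uniform(0.3, 0.9),
        "noise": data_rng.random(),  # control feature the model ignores
    }
    for _ in range(50)
]
y = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, y)
ranking = sorted(importances, key=importances.get, reverse=True)
```

Because `toy_model` never reads `noise`, shuffling that column leaves the error unchanged and its importance is zero, while the three predictors it does use score above zero. With a real fitted model the same loop applies, ideally on held-out accessions.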

## Abstract

Cacao (Theobroma cacao L.) breeding and improvement rely on understanding germplasm diversity and trait architecture. This study characterized a cacao collection (173 accessions) evaluated in Puerto Rico, examining phenotypic diversity and trait interrelationships and comparing the results with published Trinidad and Colombia datasets. We also developed machine learning (ML) models for yield prediction and identified yield-associated SNP markers.

The cacao collection showed significant phenotypic variation and strong intra-collection trait correlations. Comparative analyses revealed conserved trait responses across environments, notably linking susceptibility to black pod rot in Puerto Rico with Witches' Broom Disease in Colombia, suggesting a broad-spectrum disease response mechanism. Machine learning models predicted yield effectively and quantified a hierarchy of predictor importance, with ‘Total pods’, ‘Infection rate’, and ‘Pod weight’ being the most influential. Integrating existing SNP data for 28 common accessions identified multiple SNPs significantly associated with key horticultural traits, including ‘Total pods’, ‘Infection rate’, and ‘Yield’ (FDR < 0.01). Notably, a single genetic marker on chromosome 5 (TcSNP475), located within a putative zinc finger stress-associated protein gene (Tc05_t008610), was associated with both ‘Total pods’ and ‘Yield’, representing a prime target for marker-assisted selection.
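
The abstract reports an FDR < 0.01 threshold but does not name the correction procedure. As a minimal sketch, assuming the widely used Benjamini-Hochberg method, the adjustment could look like the following. The marker names and p-values are hypothetical placeholders, not results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-marker p-values (illustrative only).
p_values = {"snp_a": 0.0001, "snp_b": 0.004, "snp_c": 0.03, "snp_d": 0.5}
q_values = dict(zip(p_values, benjamini_hochberg(list(p_values.values()))))

# Keep markers whose adjusted p-value falls below the FDR threshold.
significant = [name for name, q in q_values.items() if q < 0.01]
```

With these made-up inputs, `snp_a` and `snp_b` pass the 0.01 threshold and the other two do not. In a real analysis the correction would be applied across every marker-trait test, so the number of tests `m` would be much larger than four.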

This research provides a detailed characterization of a wide germplasm collection, robust yield predictors, and a suite of novel trait-linked genetic markers, offering valuable resources for cacao breeding. These integrated findings will provide a solid foundation for targeted breeding strategies and deeper molecular investigations into the mechanisms underpinning yield and stress resilience in this vital global crop.

The online version contains supplementary material available at 10.1186/s12870-025-07128-y.

## Linked entities

- **Species:** Theobroma cacao (taxon 3641)

## Full-text entities

- **Diseases:** Witches' Broom Disease (MESH:D004194), black (MESH:D007898), Infection (MESH:D007239)
- **Species:** Theobroma cacao (cacao, species) [taxon 3641]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12335022/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12335022/full.md

## References

1 reference (full list in the complete paper): https://tomesphere.com/paper/PMC12335022/full.md

---
Source: https://tomesphere.com/paper/PMC12335022