# AncestryGeni: a novel genetic ancestry classification pipeline for small and noisy sequence data

**Authors:** Eran Elhaik, Sara Behnamian, Michael Howe, Hongwei Tang, Huihuang Yan, Shulan Tian, Suganti Shivaram, Cinthya Zepeda Mendoza, Kylee MacLachlan, Saad Usmani, Mehdi Pirooznia, Gareth Morgan, Patrick Blaney, Francesco Maura, Linda B Baughn

PMC · DOI: 10.1093/bioinformatics/btaf391 · Bioinformatics · 2025-07-08

## TL;DR

AncestryGeni is a new machine-learning tool that accurately infers genetic ancestry from small and noisy genomic datasets, supporting health disparity research.

## Contribution

AncestryGeni introduces a supervised machine-learning pipeline for genetic ancestry classification using as few as 100 markers and diverse genomic data types.

## Key findings

- AncestryGeni outperforms FastNGSadmix in accuracy when using nonstandard genomic data.
- Tumor-derived WES and RNA-Seq data can reliably estimate genetic similarity to continental groups.
- The tool works with as few as a hundred AIMs per sample and with variants called by different software.

## Abstract

Efforts to address health disparities are often limited by the lack of robust computational tools for inferring genetic ancestry, that is, for calculating an individual’s genetic similarity to continental groups. We have previously shown that a preferred alternative to self-described race is to use ancestry-informative markers (AIMs), which can be classified into ancestral components and compared with those of known populations to identify continental groups. However, real-world genomic data present challenges, including limited availability of germline DNA, a small number of AIMs per sample, and the use of different variant calling software, all of which limit the application of existing solutions.

Here, we describe a novel supervised machine-learning tool, AncestryGeni, which infers genetic ancestry for samples with as few as a hundred markers and is applicable to any genomic data, including whole exome sequencing (WES) and RNA sequencing (RNA-Seq) data. Applying AncestryGeni to a real-world genomic dataset obtained from the Multiple Myeloma Research Foundation (MMRF) CoMMpass study, we show that it is more accurate than the commonly used FastNGSadmix when using nonstandard genomic material. We also demonstrate that, when using AncestryGeni, tumor-derived sequence obtained from WES and RNA-Seq can be a robust data source for accurately estimating an individual’s genetic similarity to a continental group.

The AncestryGeni pipeline is available at https://github.com/eelhaik/AncestryGeni/tree/main.
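
To make the idea of "genetic similarity to continental groups" concrete, the sketch below assigns a sample to the reference group whose ancestral-component profile is closest to its own. It is an illustration only and not the AncestryGeni algorithm, which this summary does not describe in detail. The nearest-centroid rule, the three-component vectors, the `REFERENCE_CENTROIDS` values, and the function names are all assumptions made up for the example.

```python
import math

# Hypothetical reference panel: mean ancestral-component proportions for each
# continental group. These numbers are invented for illustration and do not
# come from the paper or from any real dataset.
REFERENCE_CENTROIDS = {
    "Africa": [0.85, 0.05, 0.10],
    "Europe": [0.05, 0.85, 0.10],
    "East Asia": [0.05, 0.10, 0.85],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length component vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(sample, centroids=REFERENCE_CENTROIDS):
    """Return the closest reference group and the distance to every group.

    `sample` is a list of ancestral-component proportions, assumed to have
    been estimated beforehand from the sample's AIM genotypes.
    """
    distances = {group: euclidean(sample, c) for group, c in centroids.items()}
    best = min(distances, key=distances.get)
    return best, distances


if __name__ == "__main__":
    group, distances = classify([0.10, 0.75, 0.15])
    print(group)
    for name, dist in sorted(distances.items(), key=lambda kv: kv[1]):
        print(f"{name}: {dist:.3f}")
```

A supervised pipeline of the kind the abstract describes would learn its decision rule from labeled reference samples rather than use fixed centroids; the nearest-centroid rule here is simply the smallest stand-in that shows a similarity-based assignment.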

## Linked entities

- **Diseases:** Multiple Myeloma (MONDO:0009693)

## Full-text entities

- **Diseases:** Myeloma (MESH:D009101), tumor (MESH:D009369)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12289551/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12289551/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/PMC12289551/full.md

---
Source: https://tomesphere.com/paper/PMC12289551