# Learning genotype–phenotype associations from gaps in multi-species sequence alignments

**Authors:** Uwaise Ibna Islam, Andre Luiz Campelo dos Santos, Ria Kanjilal, Raquel Assis

PMC · DOI: 10.1093/bib/bbaf022 · Briefings in Bioinformatics · 2025-02-20

## TL;DR

This paper introduces GAP, a machine learning tool that predicts binary phenotypes from gaps in multi-species sequence alignments, offering a new way to study genotype–phenotype associations.

## Contribution

GAP predicts binary phenotypes using only alignment gaps, in contrast to existing tools that require additional and often inaccessible input data.

## Key findings

- GAP achieved perfect prediction accuracy for vitamin C synthesis in 34 vertebrates.
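- Positions in the Gulo gene that GAP found to have high predictive importance are consistent with those identified by previous studies.
- A genome-wide application of GAP identified many additional genes that may be associated with vitamin C synthesis; these candidates are functionally enriched for immunity, a widely recognized role of vitamin C.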

## Abstract

Understanding the genetic basis of phenotypic variation is fundamental to biology. Here we introduce GAP, a novel machine learning framework for predicting binary phenotypes from gaps in multi-species sequence alignments. GAP employs a neural network to predict the presence or absence of phenotypes solely from alignment gaps, contrasting with existing tools that require additional and often inaccessible input data. GAP can be applied to three distinct problems: predicting phenotypes in species from known associated genomic regions, pinpointing positions within such regions that are important for predicting phenotypes, and extracting sets of candidate regions associated with phenotypes. We showcase the utility of GAP by exploiting the well-known association between the L-gulonolactone oxidase (Gulo) gene and vitamin C synthesis, demonstrating its perfect prediction accuracy in 34 vertebrates. This exceptional performance also applies more generally, with GAP achieving high accuracy and power on a large simulated dataset. Moreover, predictions of vitamin C synthesis in species with unknown status mirror their phylogenetic relationships, and positions with high predictive importance are consistent with those identified by previous studies. Last, a genome-wide application of GAP identifies many additional genes that may be associated with vitamin C synthesis, and analysis of these candidates uncovers functional enrichment for immunity, a widely recognized role of vitamin C. Hence, GAP represents a simple yet useful tool for predicting genotype–phenotype associations and addressing diverse evolutionary questions from data available in a broad range of study systems.
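To make the idea of "predicting from alignment gaps alone" concrete, the following is a minimal sketch of how an alignment might be turned into gap-only features: one binary indicator per alignment column, set to 1 where a species has a gap. This is an illustration under stated assumptions, not GAP's actual preprocessing; the function name `encode_gaps`, the `-` gap character, and the toy sequences are all hypothetical, and the abstract does not specify how GAP encodes its input or how its neural network is structured.

```python
from typing import Dict, List


def encode_gaps(alignment: Dict[str, str], gap_char: str = "-") -> Dict[str, List[int]]:
    """Map each species' aligned sequence to a 0/1 vector (1 = gap at that column).

    Hypothetical helper for illustration; GAP's real featurization may differ.
    """
    # All rows of a multiple sequence alignment must share the same length.
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return {
        species: [1 if base == gap_char else 0 for base in seq]
        for species, seq in alignment.items()
    }


# Toy alignment (made-up sequences, not real Gulo data).
toy_alignment = {
    "species_a": "ATG--CCA",
    "species_b": "ATGGTCCA",
    "species_c": "A----CCA",
}

features = encode_gaps(toy_alignment)
for species, vector in features.items():
    print(species, vector)
```

Vectors like these, paired with a 0/1 label per species for presence or absence of the phenotype, would be the kind of input a binary classifier could be trained on. Nucleotide identity is discarded entirely, which is the point of a gap-only representation.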

## Linked entities

- **Genes:** GULOP (gulonolactone (L-) oxidase, pseudogene) [NCBI Gene 2989]
- **Chemicals:** vitamin C (PubChem CID 54670067)

## Full-text entities

- **Chemicals:** vitamin C (MESH:D001205)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11840556/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11840556/full.md

## References

104 references — full list in the complete paper: https://tomesphere.com/paper/PMC11840556/full.md

---
Source: https://tomesphere.com/paper/PMC11840556