A compact encoding of the genome suitable for machine learning prediction of traits and genetic risk scores
Yasaman Fatapour, James P. Brody

TL;DR
This paper introduces a compact genotype encoding that uses a limited number of predictors, allowing machine learning models to predict traits such as gender and race with high accuracy.
Contribution
The novel contribution is an encoding based on chromosome-scale length variation, which reduces a genotype to a small set of predictors suitable for machine learning.
Findings
The compact genotype encoding achieved high accuracy in classifying gender (AUC = 0.9988) and race (AUCs up to 0.970).
The method effectively predicted human height using genotype data and age.
This approach uses fewer predictors than samples, avoiding the usual situation in genetics where predictors far outnumber patients.
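For readers unfamiliar with the AUC values quoted above, the following is a minimal sketch of how an area under the ROC curve can be computed from labels and classifier scores. It is not the authors' evaluation code: the function name `roc_auc` and the toy labels and scores are invented for illustration, and the pairwise formulation shown here is one standard way to define AUC.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive sample scores
    higher than a random negative sample; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data, invented for illustration only (not from the paper).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(toy_labels, toy_scores)
```

An AUC of 1.0 means every positive sample outranks every negative one, and 0.5 corresponds to chance, so the reported 0.9988 for gender indicates near-perfect separation.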
Abstract
Genotype to phenotype prediction is a central problem in biology and medicine. Machine learning is a natural tool to address this problem. However, a person’s genotype is usually represented by a few million single-nucleotide polymorphisms and most datasets only have a few thousand patients. Thus, this problem typically has many more predictors than the number of samples (patients), making it unsuitable for machine learning. The objective of this paper is to examine the efficacy of a compact genotype representation, which employs a limited number of predictors, in predicting a person’s phenotype through the application of machine learning. We characterized a person’s genotype using chromosome-scale length variation, a measure that is computed as the average value of reported log R ratios across a portion of a chromosome. We computed these numbers from data collected by the NIH All of Us…
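The encoding described in the abstract, an average of log R ratios across a portion of a chromosome, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `encode_genotype`, the equal-width split into `n_segments` portions, the input tuple layout, and the toy data are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def encode_genotype(probes, chrom_lengths, n_segments=4):
    """Average log R ratios over equal-width segments of each chromosome.

    probes: iterable of (chromosome, position, log_r_ratio) tuples.
    chrom_lengths: dict mapping chromosome name to its length in base pairs.
    n_segments: assumed number of equal-width portions per chromosome.
    Returns a dict mapping (chromosome, segment_index) to the mean log R ratio.
    """
    buckets = defaultdict(list)
    for chrom, pos, lrr in probes:
        width = chrom_lengths[chrom] / n_segments
        # Clamp so a probe at the very end falls in the last segment.
        idx = min(int(pos / width), n_segments - 1)
        buckets[(chrom, idx)].append(lrr)
    return {key: fmean(values) for key, values in sorted(buckets.items())}


# Toy data, invented for illustration only (not from the paper).
toy_probes = [
    ("chr1", 10, 0.10),
    ("chr1", 40, 0.30),
    ("chr1", 60, -0.20),
    ("chr1", 100, 0.00),
    ("chrX", 5, -0.50),
    ("chrX", 45, -0.30),
]
toy_lengths = {"chr1": 100, "chrX": 50}
features = encode_genotype(toy_probes, toy_lengths, n_segments=2)
```

Each person is then represented by one number per chromosome segment, so the feature count depends on the number of segments rather than on the millions of SNPs in the raw genotype.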
Taxonomy
Topics: Genetic Associations and Epidemiology · Genetic and phenotypic traits in livestock · Genetic Mapping and Diversity in Plants and Animals
