# Genomic perplexity and the evolution of context-dependent function

**Authors:** James O McInerney

PMC · DOI: 10.1093/molbev/msag041 · Molecular Biology and Evolution · 2026-02-25

## TL;DR

This paper proposes that genomes, like large language models, encode probability distributions over functional outcomes rather than fixed functions, and it introduces "genomic perplexity" to measure how poorly a genetic element fits its host context.

## Contribution

Introduces 'genomic perplexity', an information-theoretic measure of how statistically unexpected and incompatible a genetic element is within its host context, as a metric for gene integration potential.

## Key findings

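- Function is context-dependent, varying across genomic backgrounds and cellular states, even for core genes.
- Genomic perplexity is proposed as a quantifiable metric for the fitness cost of interspecies gene flow (horizontal gene transfer and introgression), where an incoming gene acts as a high-perplexity token.
- The framework is testable: it aims to predict the integration potential of accessory genes and to guide research in synthetic biology and evolutionary modeling.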

## Abstract

The fundamental principle that selection acts on a gene's function often assumes implicitly that this function is fixed and intrinsic. However, empirical evidence from pangenomics, synthetic biology, and GWAS consistently demonstrates that organismal function is highly context-dependent, varying across genomic backgrounds and cellular states, even for core genes. Drawing a conceptual parallel with modern large language models (LLMs), I propose that genomes, like LLMs, do not encode fixed functions but rather “probability distributions” over functional and phenotypic outcomes. This framework draws a conceptual analogy between epistasis and transformer-style “attention mechanisms,” suggesting that genomic context weights the influence of distant genetic elements. I also introduce the concept of “genomic perplexity”—an information-theoretic measure of the statistical unexpectedness and incompatibility of a genetic element within its host context. I demonstrate how perplexity serves as a quantifiable metric for the well-known fitness cost associated with interspecies gene flow (eg horizontal gene transfer (HGT) and introgression), where a new gene represents a high-perplexity token. This perspective formalizes long-standing observations of genomic fit and provides a testable framework for predicting the integration potential of accessory genes and directing future research in synthetic biology and evolutionary modeling.
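To make the perplexity idea concrete, the sketch below scores a candidate sequence against a host background using the standard language-model definition, perplexity = exp(-(1/N) Σ log p). It is an illustration only and not the paper's method: the choice of a smoothed k-mer Markov model, the function names `train_kmer_model` and `perplexity`, the parameters `k` and `alpha`, and the toy sequences are all assumptions introduced here.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: plain DNA alphabet, no ambiguity codes


def train_kmer_model(host_seqs, k=3):
    """Count how often each k-mer context is followed by each base in the host."""
    context_counts = Counter()
    transition_counts = Counter()
    for seq in host_seqs:
        for i in range(len(seq) - k):
            ctx, nxt = seq[i:i + k], seq[i + k]
            context_counts[ctx] += 1
            transition_counts[(ctx, nxt)] += 1
    return context_counts, transition_counts


def perplexity(seq, model, k=3, alpha=1.0):
    """Perplexity of seq under the host model, with add-alpha smoothing.

    Low values mean the sequence looks like the host; a value equal to
    len(ALPHABET) means the model is no better than a uniform guess.
    """
    context_counts, transition_counts = model
    log_prob = 0.0
    n = 0
    for i in range(len(seq) - k):
        ctx, nxt = seq[i:i + k], seq[i + k]
        p = (transition_counts[(ctx, nxt)] + alpha) / (
            context_counts[ctx] + alpha * len(ALPHABET)
        )
        log_prob += math.log(p)
        n += 1
    if n == 0:
        raise ValueError("sequence must be longer than k")
    return math.exp(-log_prob / n)


# Toy data (hypothetical): a repetitive "host" and two candidate elements.
host = ["ATG" * 20]
model = train_kmer_model(host, k=3)
native_like = "ATGATGATGATG"
foreign_like = "GGCCGGCCGGCC"
print("native-like :", perplexity(native_like, model))
print("foreign-like:", perplexity(foreign_like, model))
```

In this toy setting the foreign-like sequence contains only contexts the host model has never seen, so it scores higher than the native-like one. A real analysis would need a far richer background model and real genomes; whether such a score tracks fitness cost is the hypothesis the paper puts forward, not something this sketch demonstrates.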

## Full-text entities

- **Genes:** LYZ (lysozyme) [NCBI Gene 4069] {aka AMYLD5, LYZF1, LZM}, CRP [NCBI Gene 20468888]
- **Diseases:** intellectual disability (MESH:D008607), immune dysfunction (MESH:D007154), pigmentation defects (MESH:D010859)
- **Chemicals:** magnesium (MESH:D008274), Y (MESH:D015019), Tryptophan (MESH:D014364), CTG (-), fatty acids (MESH:D005227), leucine (MESH:D007930), tyrosine (MESH:D014443), valine (MESH:D014633), glutamate (MESH:D018698), ester (MESH:D004952), phenylacetate (MESH:C025136), alanine (MESH:D000409), lactose (MESH:D007785), isoleucine (MESH:D007532)
- **Species:** Homo sapiens (human, species) [taxon 9606], Mycobacterium tuberculosis (species) [taxon 1773], Streptococcus pneumoniae (species) [taxon 1313], Escherichia coli (E. coli, species) [taxon 562], Bacillus subtilis (species) [taxon 1423], Candida albicans (species) [taxon 5476], Saccharomyces cerevisiae (baker's yeast, species) [taxon 4932], Pseudomonas aeruginosa (species) [taxon 287]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12951670/full.md

## References

150 references — full list in the complete paper: https://tomesphere.com/paper/PMC12951670/full.md

---
Source: https://tomesphere.com/paper/PMC12951670