Entropy, Disagreement, and the Limits of Foundation Models in Genomics
Maxime Rochkoulets, Lovro Vrček, Mile Šikić

TL;DR
This paper investigates how the high entropy of genomic sequences limits what foundation models can learn from them, leading to near-uniform output distributions, disagreement across models, and unstable static embeddings, and challenging current training assumptions.
Contribution
It identifies entropy as a key factor affecting genomic foundation models and analyzes its impact through experiments on text and DNA sequences.
Findings
High entropy causes near-uniform output distributions.
Models disagree with one another under high entropy, even when matched in architecture, training, and data.
In models trained on DNA, Fisher information concentrates in embedding layers.
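To make the first two findings concrete, the following is a minimal sketch, not the authors' code. The four probability vectors are hypothetical next-token distributions over the nucleotides A, C, G, T, and the function names are invented for illustration. The sketch shows how entropy in bits measures closeness to uniform, and how two near-uniform distributions can be close to each other yet pick different top-1 tokens, which is one way disagreement could appear. The disagreement metric used in the paper is not stated here.

```python
import math

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def js_divergence_bits(p, q):
    """Jensen-Shannon divergence between two distributions, in bits."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with zero mass in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def top1(p):
    """Index of the most probable token."""
    return max(range(len(p)), key=lambda i: p[i])

# Hypothetical next-token distributions over (A, C, G, T); illustrative values only.
uniform = [0.25, 0.25, 0.25, 0.25]
model_a = [0.28, 0.24, 0.22, 0.26]   # near-uniform, slightly favors A
model_b = [0.23, 0.27, 0.26, 0.24]   # near-uniform, slightly favors C
peaked = [0.97, 0.01, 0.01, 0.01]    # low-entropy contrast case

max_entropy = math.log2(len(uniform))  # 2 bits for a 4-letter alphabet

for name, dist in [("uniform", uniform), ("model_a", model_a),
                   ("model_b", model_b), ("peaked", peaked)]:
    print(f"{name}: H = {entropy_bits(dist):.3f} bits of {max_entropy:.0f}, top-1 = {top1(dist)}")

print(f"JS(model_a, model_b) = {js_divergence_bits(model_a, model_b):.4f} bits")
print("top-1 agree:", top1(model_a) == top1(model_b))
```

When both distributions sit close to the 2-bit maximum, small differences between models are enough to change the predicted token, even though the distributions themselves are nearly identical.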
Abstract
Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences -- from the point of view of unseen token prediction -- leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding…
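As a rough illustration of what measuring Fisher information per layer can mean, the sketch below computes the diagonal of the empirical Fisher information (the mean squared gradient of the log-likelihood at the observed labels) for a toy two-layer model and reports each layer's share of the total. Everything in it is an assumption made for illustration: the model (an embedding table followed by a linear output layer), the sizes, the random data, and the use of the diagonal trace as the summary. It does not reproduce the paper's models, estimator, or results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-token vocabulary (A, C, G, T), embedding width 8, 512 samples.
V, d, N = 4, 8, 512

E = rng.normal(scale=0.1, size=(V, d))   # embedding table
W = rng.normal(scale=0.1, size=(d, V))   # output projection

x = rng.integers(0, V, size=N)   # context tokens (random, illustrative)
y = rng.integers(0, V, size=N)   # observed next tokens (random, illustrative)

fisher_E = np.zeros_like(E)      # diagonal empirical Fisher, embedding parameters
fisher_W = np.zeros_like(W)      # diagonal empirical Fisher, output parameters

for xi, yi in zip(x, y):
    logits = E[xi] @ W
    p = np.exp(logits - logits.max())
    p /= p.sum()                          # softmax over the vocabulary

    # Gradient of log p(y | x) with respect to the logits: onehot(y) - p.
    dlogits = -p
    dlogits[yi] += 1.0

    # Chain rule through logits = E[xi] @ W; accumulate squared gradients.
    fisher_E[xi] += (W @ dlogits) ** 2
    fisher_W += np.outer(E[xi], dlogits) ** 2

fisher_E /= N
fisher_W /= N

total = fisher_E.sum() + fisher_W.sum()
share_embedding = fisher_E.sum() / total
share_output = fisher_W.sum() / total

print(f"embedding share: {share_embedding:.3f}")
print(f"output share:    {share_output:.3f}")
```

In a deeper model the same bookkeeping would be repeated for every parameter group, giving a per-layer profile that can be compared between models trained on text and on DNA.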
