Long-range gene expression prediction with token alignment of large language model
Edouardo Honig, Huixin Zhan, Ying Nian Wu, Zijun Frank Zhang

TL;DR
This paper introduces GTA, a novel cross-modal approach that leverages pretrained large language models with token alignment to improve long-range gene expression prediction and interpretability.
Contribution
It presents Genetic sequence Token Alignment (GTA), a new method that aligns genetic sequences with language tokens, enabling in-context learning and better long-range regulatory grammar modeling.
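To make the idea of aligning sequence features with language tokens concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes, purely for illustration, that alignment means attention-style weighting: a sequence feature vector is scored against each token embedding and replaced by a weighted mix of those embeddings. The function names (`align_to_tokens`, `softmax`, `dot`) and the toy vectors are hypothetical, and the learned projection layers a real model would likely use are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def align_to_tokens(feature, token_embeddings):
    """Map one sequence feature vector onto a weighted mix of token embeddings.

    ASSUMPTION: scaled dot-product attention stands in for the alignment step;
    the summary above does not specify the actual mechanism.
    """
    dim = len(feature)
    scores = [dot(feature, tok) / math.sqrt(dim) for tok in token_embeddings]
    weights = softmax(scores)
    aligned = [
        sum(w * tok[k] for w, tok in zip(weights, token_embeddings))
        for k in range(dim)
    ]
    return aligned, weights


# Toy data: three hypothetical 2-d token embeddings and one sequence feature.
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
feature = [2.0, 0.0]
aligned, weights = align_to_tokens(feature, tokens)
print(weights)   # largest weight falls on the first token
print(aligned)   # a vector in the token-embedding space
```

Because the output lies in the span of the token embeddings, a frozen language model could in principle consume it like any other token representation, which is the intuition behind cross-modal alignment.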
Findings
GTA outperforms state-of-the-art models such as Enformer, with a reported 10% improvement in Spearman correlation.
GTA enhances interpretation of long-range interactions in gene regulation.
The approach demonstrates the power of cross-modal adaptation in genomics.
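Since the headline finding is stated in terms of Spearman correlation, a short sketch of how that metric is computed may help. Spearman correlation is the Pearson correlation of the ranks of two variables, so it rewards getting the ordering of expression levels right rather than their exact values. The expression values below are invented for illustration and are not data from the paper.

```python
import math


def rank(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(rank(x), rank(y))


# Made-up observed and predicted expression values for five genes.
observed = [0.1, 2.3, 1.5, 4.0, 3.2]
predicted = [0.3, 1.9, 2.1, 3.8, 3.0]
print(round(spearman(observed, predicted), 3))  # 0.9
```

In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead of a hand-written version.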
Abstract
Gene expression is a cellular process that plays a fundamental role in human phenotypical variations and diseases. Despite advances in deep learning models for gene expression prediction, recent benchmarks have revealed their inability to learn distal regulatory grammar. Here, we address this challenge by leveraging a pretrained large language model to enhance gene expression prediction. We introduce Genetic sequence Token Alignment (GTA), which aligns genetic sequence features with natural language tokens, allowing for symbolic reasoning of genomic sequence features via the frozen language model. This cross-modal adaptation learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts, enabling in-context learning that is not possible with existing models. Trained on lymphoblastoid cells, GTA was evaluated on cells from the Geuvadis…
Taxonomy
Topics: Gene expression and cancer classification
