# Introducing a foundational sequence transformer for range adaptive nucleotide decoding (STRAND)

**Authors:** Shant Ayanian, Collin Osborne, Clark Xu, Carl Molnar, Pravat Das, Xoab Perez, Natalia Vassilieva, Vinay Pondenkandath, Bhargav Kanakiya, Ganesh Venkatesh, May Levin, Matt Redlon, Marc Blasi, Vijay H Shah, Matthew Callstrom, Konstantinos N Lazaridis, Panos Korfiatis, Alexander Ryu, Elena Myasoedova

PMC · DOI: 10.1093/bib/bbaf618 · Briefings in Bioinformatics · 2025-11-24

## TL;DR

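STRAND, an exomic foundational model built on a short-range transformer architecture, outperforms existing models on variant effect prediction and disease state identification; its largest variant (1B parameters) reaches a mean accuracy of 0.880.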

## Contribution

A novel exomic foundational model with a short-range transformer architecture for improved variant detection and downstream genomic tasks.

## Key findings

- The model outperforms existing models in variant effect prediction and disease state identification.
- The largest model variant (1B parameters) achieves a mean accuracy of 0.880, an 8.2% improvement over the original nucleotide transformer (NT) and a 7% improvement over NT-v2.
- A unique exomic ClinVar dataset was constructed to evaluate pathogenicity and disease state performance.

## Abstract

The advent of high-throughput sequencing has led to an exponential increase in genomic data, highlighting the need for efficient and accurate models to analyze and interpret this information. In this study, we introduce a novel, exomic foundational model that leverages a combination of the human reference genome and multispecies data to improve variant detection and interpretation. Our model utilizes a short-range transformer architecture and is trained on a large dataset of human exomic sequences derived from the Tapestry study. Through a series of ablation studies and scaling experiments, we demonstrate the effectiveness of our model in predicting next token accuracy and identifying clinically pathogenic variants. We also show that our model outperforms existing models in a range of downstream tasks, including variant effect prediction and disease state identification. In fact, our largest sequence transformer for range adaptive nucleotide decoding variant (1B parameters) surpassed previous benchmarks, demonstrating a mean accuracy of 0.880 [an 8.2% improvement over the original nucleotide transformer (NT) and a 7% improvement over NT-v2]. Furthermore, we construct a unique exomic ClinVar dataset to evaluate the model’s performance on pathogenicity and disease states. Our results highlight the potential of this model to improve our understanding of the human exome and its role in disease. The model and its applications have significant implications for genomics-based diagnosis and personalized medicine, including tailored therapeutic development.
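
The abstract reports next-token accuracy as one of the evaluation signals. As a minimal sketch of what that metric measures, the snippet below scores predictions against a shifted copy of a nucleotide sequence. It assumes single-nucleotide tokens and uses hand-written toy predictions; the function names `shift_targets` and `next_token_accuracy` are hypothetical, and none of this reflects the paper's actual tokenizer, model, or evaluation code.

```python
from typing import List, Sequence, Tuple


def shift_targets(tokens: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Split a token sequence into model inputs and next-token targets.

    The target at position i is the token that follows input position i.
    """
    return list(tokens[:-1]), list(tokens[1:])


def next_token_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Return the fraction of positions where the predicted token matches the target."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy example: one token per nucleotide (an assumption, not the paper's tokenization).
sequence = list("ACGTAC")
inputs, targets = shift_targets(sequence)

# Hypothetical model outputs, written by hand for illustration only.
toy_predictions = ["C", "G", "A", "A", "C"]

accuracy = next_token_accuracy(toy_predictions, targets)
print(f"next-token accuracy: {accuracy:.2f}")
```

In practice the predictions would come from the model's highest-probability token at each position, and the accuracy would be averaged over a held-out set of sequences rather than a single toy string.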

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12641612/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12641612/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12641612/full.md

---
Source: https://tomesphere.com/paper/PMC12641612