Fast, memory-efficient genomic interval tokenizers for modern machine learning
Nathan J. LeRoy, Donald R. Campbell Jr, Seth Stadick, Oleksandr Khoroshevskyi, Sang-Hoon Park, Ziyang Hu, Nathan C. Sheffield

TL;DR
This paper introduces gtars-tokenizers, a high-performance library that efficiently converts genomic intervals into standardized tokens, enabling scalable and consistent deep learning analysis of large epigenomic datasets.
Contribution
The paper presents a novel, efficient tokenization method for genomic intervals that integrates seamlessly with modern machine learning frameworks, addressing heterogeneity in genomic data.
Findings
Achieves top efficiency for large-scale datasets
Enables standard ML workflows without ad hoc preprocessing
Supports scalable analysis across diverse environments
Abstract
Introduction: Epigenomic datasets from high-throughput sequencing experiments are commonly summarized as genomic intervals. As the volume of this data grows, so does interest in analyzing it through deep learning. However, the heterogeneity of genomic interval data, where each dataset defines its own regions, creates barriers for machine learning methods that require consistent, discrete vocabularies.
Methods: We introduce gtars-tokenizers, a high-performance library that maps genomic intervals to a predefined universe or vocabulary of regions, analogous to text tokenization in natural language processing. Built in Rust with bindings for Python, R, CLI, and WebAssembly, gtars-tokenizers implements two overlap methods (BITS and AIList) and integrates seamlessly with modern ML frameworks through Hugging Face-compatible APIs.
Results: The gtars-tokenizers package achieves top efficiency…
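The idea of mapping intervals onto a fixed universe can be illustrated with a minimal pure-Python sketch. This is not the gtars-tokenizers API, and it uses neither BITS nor AIList: it is a simple sorted scan that only shows what "tokenizing" an interval set means. The function names (`build_vocab`, `tokenize`), the tuple layout `(chrom, start, end)`, and the half-open coordinate convention are all assumptions made for illustration.

```python
from collections import defaultdict


def build_vocab(universe):
    """Assign an integer token ID to each universe region, in sorted order."""
    return {region: token_id for token_id, region in enumerate(sorted(universe))}


def tokenize(intervals, universe):
    """Map query intervals to the IDs of the universe regions they overlap.

    Intervals are (chrom, start, end) tuples with half-open coordinates
    [start, end). This is a naive scan for illustration only, not an
    optimized overlap index.
    """
    vocab = build_vocab(universe)

    # Group universe regions by chromosome, sorted by start position.
    by_chrom = defaultdict(list)
    for chrom, start, end in sorted(universe):
        by_chrom[chrom].append((start, end))

    tokens = []
    for chrom, q_start, q_end in intervals:
        for start, end in by_chrom.get(chrom, []):
            if start >= q_end:
                # Regions are sorted by start, so none after this can overlap.
                break
            if end > q_start:
                tokens.append(vocab[(chrom, start, end)])
    return tokens


# Hypothetical universe and query, invented for this example.
universe = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)]
query = [("chr1", 150, 350), ("chr2", 0, 60), ("chr3", 10, 20)]
token_ids = tokenize(query, universe)
```

Because every dataset is expressed as IDs drawn from the same universe, two datasets with different original regions end up in a shared, discrete vocabulary, which is the property that downstream ML models need.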
Taxonomy
Topics: Genomics and Rare Diseases · Environmental Monitoring and Data Management · Biomedical Text Mining and Ontologies
