# Optimizing genomic language models for promoter prediction: a comparative study of tokenization and cross-species learning

**Authors:** Eyal Hadad, Noia Kogman, Lina Golan, Anva Avraham, Reut Ben-Hamo, Zhi Wei, Lior Rokach, Isana Veksler-Lublinsky

PMC · DOI: 10.1093/nargab/lqag025 · NAR Genomics and Bioinformatics · 2026-03-12

## TL;DR

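This paper compares four tokenization methods for genomic language models on promoter classification and finds that non-overlapping 6-mer tokenization outperforms BPE and WordPiece across eight organisms. It also shows that training on phylogenetically related species improves performance, particularly when data are scarce.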

## Contribution

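The study systematically compares four tokenization methods (non-overlapping 6-mer, overlapping 6-mer, BPE, and WordPiece) for promoter classification under two negative-data strategies, and evaluates evolutionary-informed cross-species transfer learning, including external validation on an unseen organism.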

## Key findings

- Non-overlapping 6-mer tokenization outperforms BPE and WPC across eight organisms.
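- Training on phylogenetically related species significantly improves performance, particularly in low-data regimes.
- Positional SHAP analysis indicates that the model learns biologically plausible positional patterns rather than exploiting artifacts of negative-data generation.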
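
To make the difference between the two k-mer variants concrete, here is a minimal Python sketch. The function name `kmer_tokenize`, the toy sequence, and the choice to drop a trailing remainder shorter than k are illustrative assumptions, not details taken from the paper.

```python
def kmer_tokenize(seq: str, k: int = 6, overlapping: bool = False) -> list:
    """Split a DNA sequence into k-mers.

    overlapping=True  uses a sliding window with stride 1.
    overlapping=False uses disjoint chunks with stride k; a trailing
    remainder shorter than k is dropped (an assumption of this sketch).
    """
    seq = seq.upper()
    stride = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "TATAAAAGGCGCGCCATG"  # 18 bp toy sequence, not from the paper
non_overlapping = kmer_tokenize(seq)                # 3 disjoint 6-mers
overlapping = kmer_tokenize(seq, overlapping=True)  # 13 sliding-window 6-mers
```

For a sequence of length n, the overlapping variant yields n - k + 1 tokens while the non-overlapping variant yields about n / k, so the same promoter occupies roughly six times fewer positions in the model's input under non-overlapping 6-mers.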

## Abstract

Large Language Models (LLMs) are increasingly applied to genomic tasks, yet core challenges remain concerning tokenization, evaluation, and data scarcity. This study focuses on promoter classification and systematically evaluates four tokenization methods: non-overlapping 6-mer, overlapping 6-mer, Byte Pair Encoding (BPE), and WordPiece (WPC). We show that the commonly used k-mer approach, specifically the non-overlapping variant, outperforms BPE and WPC across eight organisms, challenging assumptions derived from natural language processing. To ensure robustness, we evaluated performance under two distinct negative data strategies: positive-promoter-shuffled and random-non-promoter-fragments. Using a positional SHAP framework, we demonstrate that the model learns biologically plausible positional patterns rather than exploiting artifacts from these negative data generation processes. Furthermore, evolutionary-informed transfer learning experiments and external validation on an unseen organism reveal that training on phylogenetically related species significantly improves performance, particularly in low-data regimes. These findings underscore the significant impact of tokenization and negative data design, providing practical guidance for refining genomic classifiers.
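
The two negative-data strategies named in the abstract can be sketched roughly as follows. This summary does not describe the authors' exact procedures, so everything here is an assumption for illustration: the chunk-wise shuffle, the rejection sampling of genomic windows, and the hypothetical names `shuffled_negative` and `random_fragment_negative`.

```python
import random
from typing import List, Optional, Tuple


def shuffled_negative(promoter: str, n_chunks: int = 8,
                      rng: Optional[random.Random] = None) -> str:
    """Build a negative by permuting chunks of a positive promoter.

    Nucleotide composition is preserved; positional structure is disrupted.
    Chunk-wise shuffling is an assumption, not the paper's stated method.
    """
    rng = rng or random.Random()
    size = max(1, -(-len(promoter) // n_chunks))  # ceiling division, at least 1
    chunks = [promoter[i:i + size] for i in range(0, len(promoter), size)]
    rng.shuffle(chunks)
    return "".join(chunks)


def random_fragment_negative(genome: str,
                             promoter_spans: List[Tuple[int, int]],
                             length: int,
                             rng: Optional[random.Random] = None,
                             max_tries: int = 1000) -> str:
    """Sample a window of `length` bp that overlaps no annotated promoter.

    `promoter_spans` holds half-open (start, end) intervals on `genome`.
    """
    rng = rng or random.Random()
    for _ in range(max_tries):
        start = rng.randrange(0, len(genome) - length + 1)
        end = start + length
        if all(end <= s or start >= e for s, e in promoter_spans):
            return genome[start:end]
    raise ValueError("no non-promoter window found")


rng = random.Random(0)
promoter = "TATAAAAGGCGCGCCATGACGTACGT"   # toy positive, not real data
genome = "ACGT" * 50                      # toy 200 bp "genome"
neg_shuffled = shuffled_negative(promoter, n_chunks=4, rng=rng)
neg_fragment = random_fragment_negative(genome, [(50, 100)], length=20, rng=rng)
```

The two strategies stress different things: shuffled negatives share composition with the positives, so a classifier has to rely on order and position, whereas random fragments differ in composition as well. That contrast is presumably why the abstract reports evaluating under both.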

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12980338/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12980338/full.md

## References

50 references — full list in the complete paper: https://tomesphere.com/paper/PMC12980338/full.md

---
Source: https://tomesphere.com/paper/PMC12980338