# PolyA-GLM: A comprehensive framework for De novo polyadenylation site prediction using genome language models

**Authors:** Sourav Saha, Naima Ahmed Fahmi, Jeongsik Yong, Wei Zhang

PMC · DOI: 10.1016/j.csbj.2025.12.011 · Computational and Structural Biotechnology Journal · 2025-12-17

## TL;DR

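This paper introduces PolyA-GLM, an end-to-end pipeline that applies genome language models (DNABERT-2, Nucleotide Transformer, and HyenaDNA) to predict polyadenylation sites de novo from genomic sequence, including position-wise site identification across extended gene regions.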

## Contribution

The paper applies genome language models to de novo polyadenylation site prediction, evaluates them under both few-shot classification and fine-tuning, and packages the approach as an end-to-end pipeline, PolyA-GLM.

## Key findings

- HyenaDNA achieved an AUC of 0.751 in few-shot poly(A) site prediction, with improved performance after fine-tuning.
- The evaluated GLMs recognize canonical polyadenylation signals (AATAAA and variants) and their 10-30 bp spacing relative to cleavage sites.
- A token-level classification approach enables precise position-wise site identification.
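
To make the token-level framing concrete, the following is a minimal sketch of how position-wise labels and site calls might be handled around a per-nucleotide classifier. It is not the authors' implementation: the function names, the `flank`, `threshold`, and `min_gap` parameters, and the greedy peak-picking rule are all assumptions chosen for illustration, and the per-position scores are taken as given rather than produced by a real model.

```python
from typing import List


def label_positions(seq_len: int, site_positions: List[int], flank: int = 0) -> List[int]:
    """Build a 0/1 label per nucleotide; positions within `flank` of a site get 1.

    `flank` is a hypothetical tolerance window, not a value from the paper.
    """
    labels = [0] * seq_len
    for pos in site_positions:
        for i in range(max(0, pos - flank), min(seq_len, pos + flank + 1)):
            labels[i] = 1
    return labels


def call_sites(scores: List[float], threshold: float = 0.5, min_gap: int = 10) -> List[int]:
    """Turn per-position scores into site calls.

    Positions scoring at least `threshold` are visited from highest to lowest
    score, and a position is kept only if it lies at least `min_gap` bases
    from every position already kept (an assumed, greedy de-duplication rule).
    """
    candidates = sorted(
        (i for i, s in enumerate(scores) if s >= threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept: List[int] = []
    for i in candidates:
        if all(abs(i - k) >= min_gap for k in kept):
            kept.append(i)
    return sorted(kept)
```

In this sketch, `label_positions` would supply training targets and `call_sites` would post-process model output; a real pipeline would also need to map model tokens back to nucleotide coordinates, which depends on the tokenizer of each GLM.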

## Abstract

Polyadenylation sites (poly(A) sites) play a key role in the post-transcriptional regulation of gene expression. Accurate prediction of poly(A) sites is essential for identifying RNA processing defects associated with cancer and developmental disorders. Traditional approaches based on sequence motifs and experimental validation often struggle to generalize across different cell types and species. To address this limitation, we investigate the use of genome language models (GLMs) for poly(A) site prediction, leveraging their ability to capture long-range dependencies within genomic sequences. Specifically, we evaluate three state-of-the-art GLMs, DNABERT-2, Nucleotide Transformer, and HyenaDNA, using both few-shot classification and fine-tuning strategies. These models effectively recognize canonical polyadenylation signals (PASs) (i.e., AATAAA or other variants) and their spatial relationship (10-30 bp) to cleavage sites, with HyenaDNA achieving an AUC of 0.751 in the few-shot setting and improved performance after fine-tuning. We further validate model interpretability through systematic signal perturbation experiments, confirming their capacity to detect canonical PASs. Additionally, we propose a token-level classification approach for precise position-wise poly(A) site identification across extended gene regions. Finally, we present PolyA-GLM, an end-to-end pipeline for discovering novel poly(A) sites, highlighting the potential of GLMs to reveal regulatory elements overlooked by conventional methods. Overall, this work demonstrates the promise of GLMs in advancing our understanding of RNA processing and regulatory element discovery.
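
The PAS spacing and perturbation ideas in the abstract can be illustrated with a short sketch. It scans for a hexamer whose start lies 10-30 bp upstream of a candidate cleavage site, then overwrites each hit with random bases so that a model's score on the original and perturbed sequences could be compared. The function names, the two-motif list, the choice to measure distance from the motif start, and the random-replacement scheme are assumptions; the summary does not specify the authors' variant list or perturbation protocol.

```python
import random
from typing import List, Sequence, Tuple

# Canonical hexamer plus one well-known variant. The full variant set used in
# the paper is not listed in this summary, so this tuple is an assumption.
PAS_HEXAMERS: Tuple[str, ...] = ("AATAAA", "ATTAAA")


def find_upstream_pas(
    seq: str,
    cleavage: int,
    min_dist: int = 10,
    max_dist: int = 30,
    motifs: Sequence[str] = PAS_HEXAMERS,
) -> List[Tuple[int, str]]:
    """Return (start, motif) for hexamers starting min_dist..max_dist bp upstream.

    Distance is measured from the motif start to the 0-based cleavage index,
    which is one possible convention rather than the paper's definition.
    """
    seq = seq.upper()
    lo = max(0, cleavage - max_dist)
    hi = cleavage - min_dist
    hits = []
    for start in range(lo, hi + 1):
        hexamer = seq[start:start + 6]
        if hexamer in motifs:
            hits.append((start, hexamer))
    return hits


def perturb_pas(seq: str, hits: List[Tuple[int, str]], rng: random.Random) -> str:
    """Overwrite each hit with random bases that are not themselves a listed motif.

    Note: the replacement could still form a motif together with flanking
    bases; this sketch does not guard against that.
    """
    bases = list(seq.upper())
    for start, motif in hits:
        while True:
            replacement = "".join(rng.choice("ACGT") for _ in range(len(motif)))
            if replacement not in PAS_HEXAMERS:
                break
        bases[start:start + len(motif)] = replacement
    return "".join(bases)


# Toy sequence: AATAAA starts at index 20, candidate cleavage site at index 40.
toy = "C" * 20 + "AATAAA" + "C" * 14 + "G" * 10
toy_hits = find_upstream_pas(toy, cleavage=40)
toy_perturbed = perturb_pas(toy, toy_hits, random.Random(0))
```

Scoring `toy` and `toy_perturbed` with a trained classifier and comparing the two outputs would be one way to probe whether the model relies on the signal, under the assumptions stated above.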

## Linked entities

- **Diseases:** cancer (MONDO:0004992)

## Full-text entities

- **Diseases:** developmental disorders (MESH:D002658), cancer (MESH:D009369)
- **Chemicals:** PolyA (MESH:D011061)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12799945/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12799945/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC12799945/full.md

---
Source: https://tomesphere.com/paper/PMC12799945