# A transformer-based method for the cap analysis of gene expression and gene expression tag associated capping region prediction in RNA

**Authors:** Dibya Kanti Haldar, Avik Pramanick, Chandrama Mukherjee, Pralay Mitra

PMC · DOI: 10.1080/15476286.2026.2629530 · RNA Biology · 2026-02-10

## TL;DR

This paper introduces a transformer-based method that predicts 5’ RNA capping-associated regions, corresponding to CAGE peaks, directly from DNA sequence, as a precursor to predicting transcriptional expression.

## Contribution

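A Llama-based transformer, pre-trained with ReLoRA and fine-tuned with LoRA, that classifies DNA sequences by whether they are capping-associated regions. The authors note that hardly any earlier computational technique has focused purely on predicting 5’ capping.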
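
For readers unfamiliar with the two adapter techniques named above, the standard LoRA formulation is sketched below. This is the general definition from the LoRA literature, not a detail taken from this paper: the rank $r$, scaling factor $\alpha$, and the layers that receive adapters are not stated in this summary and are left symbolic.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Here $W_0$ is frozen and only $A$ and $B$ are trained. ReLoRA, as generally described, periodically merges the learned update into the base weights, $W_0 \leftarrow W_0 + \frac{\alpha}{r} B A$, and then reinitializes $A$ and $B$, so that a sequence of low-rank updates can accumulate into a higher-rank change during pre-training.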

## Key findings

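- After fine-tuning, the model reached an average accuracy of 79.12% and F1-score of 78.11% on the human genome (hg19), and 78.09% and 76.17% on the mouse genome (mm9), under leave-one-chromosome-out cross-validation.
- Attention scores in peak regions differed significantly from those across the entire context window (aggregate Wilcoxon rank-sum p-value of 1.075e-10 for predicted positive motifs and 7.17e-18 for predicted negative motifs), and the attention peaks of predicted positives and negatives also differed from each other (p-value of 6.70e-08).
- Some motifs found in peak-attention regions have resident sites that match known transcription factor (TF) motifs, supporting the model's biological relevance.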
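
The attention-peak comparison in the findings above relies on the Wilcoxon rank-sum test. A minimal sketch of that test follows, using the normal approximation without tie or continuity correction. The function names and the toy attention values are hypothetical, and the way the paper aggregates p-values across motifs is not described in this summary, so it is not reproduced here.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). A positive z means values in x tend to rank higher.
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2          # expected rank sum under H0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided tail probability
    return z, p


# Hypothetical attention scores: inside a peak vs. the rest of the window.
peak_scores = [0.90, 0.80, 0.85, 0.95, 0.70]
window_scores = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05]

z, p = rank_sum_test(peak_scores, window_scores)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With real data the two samples would be the per-position attention scores inside a peak region and across the whole context window; for small samples an exact test would be preferable to the normal approximation used here.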

## Abstract

5’ RNA capping is one of the major post-transcriptional modifications for the mobility and stability of RNA molecules. Measuring 5’ caps of RNAs can help quantify expression levels of mRNAs and lncRNAs. One of the most successful RNA-seq methods that has used capping as a tool to quantify transcriptional expression is Cap Analysis of Gene Expression (CAGE). Computational prediction of capping can therefore be used as a precursor to the prediction of transcriptional expression. Unfortunately, there is hardly any computational technique that has focused purely on predicting 5’ capping. We have developed a transformer-based method for computational prediction of capping from DNA sequences. Our Llama and ReLoRA-based pre-training model and our Llama and LoRA-based fine-tuning model predict capping-associated regions. We have used leave-one-chromosome-out cross-validation for our model. The average accuracy and F1-score after fine-tuning on the human genome hg19 (mouse genome mm9) for sequence classification are 79.12% (78.09%) and 78.11% (76.17%), respectively. We noted attention peak-based motifs having an aggregate Wilcoxon rank-sum p-value of 1.075e-10 between the attention peak region and the entire context window for the predicted positive motifs; an aggregate p-value of 7.17e-18 for the predicted negative motifs; and an aggregate p-value of 6.70e-08 between the attention peaks of the predicted positive and the predicted negative motifs. Our Llama-based approach aims to create a sequence-based framework to identify capping-associated regions corresponding to CAGE peaks. Our analysis reveals statistically significant motifs from the regions of peak attention scores, some of which show biological relevance because their resident sites match known TF motifs.
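
The leave-one-chromosome-out cross-validation mentioned in the abstract can be illustrated with a small sketch. It assumes each example is a `(chromosome, sequence, label)` tuple, which is a hypothetical layout chosen for illustration; the paper's actual data format, window length, and training loop are not given in this summary. The helper `accuracy_and_f1` shows how the two reported metrics are computed for binary labels.

```python
from collections import defaultdict


def leave_one_chromosome_out(records):
    """Yield (held_out_chrom, train_idx, test_idx) once per chromosome.

    `records` is a list of (chromosome, sequence, label) tuples.
    """
    by_chrom = defaultdict(list)
    for i, (chrom, _seq, _label) in enumerate(records):
        by_chrom[chrom].append(i)
    for held_out in sorted(by_chrom):
        test_idx = by_chrom[held_out]
        # Every example from every other chromosome goes into training.
        train_idx = sorted(
            i for chrom, idx in by_chrom.items() if chrom != held_out for i in idx
        )
        yield held_out, train_idx, test_idx


def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1-score for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy examples; the sequences and labels are made up.
records = [
    ("chr1", "ACGT", 1),
    ("chr1", "TTGA", 0),
    ("chr2", "GGCA", 1),
    ("chr2", "CATG", 0),
    ("chrX", "ATAT", 1),
]

for held_out, train_idx, test_idx in leave_one_chromosome_out(records):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

In a full pipeline, a model would be fine-tuned on `train_idx` and scored on `test_idx` in each fold, and the per-fold accuracy and F1-score would then be averaged across chromosomes. Holding out whole chromosomes keeps sequences from the same chromosome out of both training and test sets within a fold.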

## Linked entities

- **Species:** Homo sapiens (taxon 9606), Mus musculus (taxon 10090)

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606], Mus musculus (house mouse, species) [taxon 10090]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12915862/full.md

## Figures

27 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12915862/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/PMC12915862/full.md

---
Source: https://tomesphere.com/paper/PMC12915862