# GENNUS: generative approaches for nucleotide sequences enhance mirtron classification

**Authors:** Alisson Gaspar Chiquitto, Liliane Santana Oliveira, Pedro Henrique Bugatti, Priscila Tiemi Maeda Saito, Mark Basham, Roberto Tadeu Raittz, Alexandre Rossi Paschoal

PMC12204755 · DOI: 10.1093/nargab/lqaf072 · 2025-06-20

## TL;DR

This paper introduces GENNUS, a new method using generative models to improve the classification of mirtrons and microRNAs by addressing data imbalance.

## Contribution

GENNUS introduces novel data augmentation strategies using GANs and SMOTE to enhance mirtron and miRNA classification.
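
To make the SMOTE side of this concrete, the following is a minimal sketch of SMOTE-style oversampling for a minority class of sequences. It is not the GENNUS implementation: the k-mer frequency features, the function names `kmer_frequencies` and `smote`, and the toy sequences are all assumptions chosen for illustration. The core idea is standard SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and interpolate between the two feature vectors.

```python
import random
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequency vector of an RNA sequence (illustrative features)."""
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = {km: 0 for km in kmers}
    seq = seq.upper().replace("T", "U")
    total = 0
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:  # skip k-mers containing ambiguous symbols such as N
            counts[km] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def smote(minority, n_new, k_neighbors=2, seed=0):
    """Create n_new synthetic vectors by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to the base.
        others = [v for j, v in enumerate(minority) if j != i]
        others.sort(key=lambda v: sum((a - b) ** 2 for a, b in zip(base, v)))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (made-up sequences, not real mirtrons).
minority_seqs = ["GUAAGUCCAGGCUAG", "GUGAGUCCCGGAUAG", "GUAAGACCUGGCCAG", "GUGAGUACAGGCUAG"]
X_minority = [kmer_frequencies(s) for s in minority_seqs]
X_synthetic = smote(X_minority, n_new=5)
```

Because each synthetic vector lies on the line segment between two real minority samples, SMOTE can only fill in the space the existing samples already span. The abstract credits the GAN-based strategies with capturing more of the diversity of real mirtron sequences.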

## Key findings

- GAN-based methods generate high-quality synthetic data that improve classification accuracy.
- Models trained with GAN-generated data outperform those using only real data or traditional SMOTE.
- The approach enhances generalization across different machine learning frameworks.

## Abstract

Classifying non-coding RNA (ncRNA) sequences, particularly mirtrons, is essential for elucidating gene regulation mechanisms. However, the prevalent class imbalance in ncRNA datasets presents significant challenges, often resulting in overfitting and diminished generalization in machine learning models. In this study, GENNUS (GENerative approaches for NUcleotide Sequences) is proposed, introducing novel data augmentation strategies using generative adversarial networks (GANs) and the synthetic minority over-sampling technique (SMOTE) to enhance mirtron and canonical microRNA (miRNA) classification performance. Our GAN-based methods effectively generate high-quality synthetic data that capture the intricate patterns and diversity of real mirtron sequences, eliminating the need for extensive feature engineering. Through four experiments, it is demonstrated that models trained on a combination of real and GAN-generated data achieve higher classification accuracy than models trained with traditional SMOTE techniques or with real data alone. Our findings reveal that GANs enhance model performance and provide a richer representation of minority classes, thus improving generalization capabilities across various machine learning frameworks. This work highlights the transformative potential of synthetic data generation in addressing data limitations in genomics, offering a pathway for more effective and scalable mirtron and canonical miRNA classification methodologies. GENNUS is available at https://github.com/chiquitto/GENNUS and https://doi.org/10.6084/m9.figshare.28207328.
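
The abstract's point about avoiding feature engineering can be illustrated with the kind of raw-sequence representation that sequence GANs commonly work with. The sketch below is an assumption for illustration, not the encoding documented for GENNUS: it one-hot encodes a sequence to a fixed length with a hypothetical padding symbol `P`, and decodes a (possibly soft) generator output back to a string by taking the highest-scoring symbol at each position.

```python
ALPHABET = "ACGU"
PAD = "P"  # hypothetical padding symbol, used only in this sketch
SYMBOLS = ALPHABET + PAD


def one_hot_encode(seq, length):
    """Encode an RNA sequence as a length x 5 one-hot matrix, padded on the right."""
    seq = seq.upper().replace("T", "U")
    if len(seq) > length:
        raise ValueError("sequence is longer than the fixed length")
    if any(ch not in ALPHABET for ch in seq):
        raise ValueError("sequence contains symbols outside ACGU")
    padded = seq + PAD * (length - len(seq))
    return [[1.0 if ch == s else 0.0 for s in SYMBOLS] for ch in padded]


def decode(matrix):
    """Turn a matrix of per-position scores into a sequence, cut at the first pad."""
    chars = []
    for row in matrix:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over symbols
        chars.append(SYMBOLS[best])
    return "".join(chars).split(PAD, 1)[0]


encoded = one_hot_encode("GUAAGU", length=8)
restored = decode(encoded)

# A generator usually emits soft scores rather than exact one-hot rows.
soft_output = [
    [0.1, 0.2, 0.6, 0.1, 0.0],   # mostly G
    [0.0, 0.1, 0.1, 0.7, 0.1],   # mostly U
    [0.1, 0.1, 0.1, 0.1, 0.6],   # mostly padding, so the sequence ends here
]
sampled = decode(soft_output)
```

In a setup like this, decoded synthetic sequences would be added to the real minority-class sequences before training the classifier, which mirrors the "real plus GAN-generated data" condition described in the abstract.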


## Full-text entities

- **Genes:** MIR7107 (microRNA 7107) [NCBI Gene 102465665] {aka hsa-mir-7107}, MIR6839 (microRNA 6839) [NCBI Gene 102465505] {aka hsa-mir-6839}, MIR5004 (microRNA 5004) [NCBI Gene 100847012] {aka mir-5004}, MIR7159 (microRNA 7159) [NCBI Gene 102466816] {aka hsa-mir-7159}, MIR6891 (microRNA 6891) [NCBI Gene 102465537] {aka hsa-mir-6891}, MIR6859-3 (microRNA 6859-3) [NCBI Gene 102465910] {aka hsa-mir-6859-3}, MIRLET7E (microRNA let-7e) [NCBI Gene 406887] {aka LET7E, MIRNLET7E, hsa-let-7e, let-7e}, MIR6853 (microRNA 6853) [NCBI Gene 102466201] {aka hsa-mir-6853}, MIR4695 (microRNA 4695) [NCBI Gene 100616120], MIR4700 (microRNA 4700) [NCBI Gene 100616329] {aka mir-4700}, MIR4800 (microRNA 4800) [NCBI Gene 100616358] {aka mir-4800}, MIR6770-1 (microRNA 6770-1) [NCBI Gene 102465461] {aka hsa-mir-6770-1}, MIR4638 (microRNA 4638) [NCBI Gene 100616342] {aka mir-4638}, MIR6740 (microRNA 6740) [NCBI Gene 102465443] {aka hsa-mir-6740}
- **Species:** Mus musculus (house mouse, species) [taxon 10090], Homo sapiens (human, species) [taxon 9606], Macaca mulatta (rhesus macaque, species) [taxon 9544]

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12204755/full.md

---
Source: https://tomesphere.com/paper/PMC12204755