# Transformer-Based Classification of Transposable Element Consensus Sequences with TEclass2

**Authors:** Lucas Bickmann, Matias Rodriguez, Xiaoyi Jiang, Wojciech Makałowski

PMC · DOI: 10.3390/biology15010059 · Biology · 2025-12-29

## TL;DR

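TEclass2 is a transformer-based deep learning tool that classifies transposable element (TE) consensus sequences into sixteen superfamilies. It is available through a web interface with pre-trained models, and its source code can be used to train custom models.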

## Contribution

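TEclass2 introduces a deep learning classifier for TE consensus sequences built on a linear transformer architecture with k-mer tokenization and sequence-specific adaptations. The authors report improved classification performance and support for training on custom datasets.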

## Key findings

- TEclass2 classifies TE consensus sequences into sixteen superfamilies of the Wicker classification system using a linear transformer architecture.
- The tool offers a web interface and pre-trained models for rapid and reliable classification.
- It supports custom model training, enhancing flexibility for genomic annotation.

## Abstract

Transposable elements (TEs) are mobile genetic elements that are present in great numbers in the majority of eukaryotic genomes. They are major drivers of genome evolution: they can facilitate chromosome rearrangements, provide mechanisms for genomic shuffling, and contribute to genome expansion, thereby altering genome architecture. TEs can also modify gene expression by disrupting regulatory sequences, impairing genes, and promoting the emergence of new sequences. Despite their abundance, identifying TEs is challenging and time-consuming because of their extreme diversity in DNA sequence. Many TE families are ancient, and most of their sequences have become inactive through accumulated mutations and fragmentation. Consequently, different copies of the same TE can differ greatly from each other, which makes identifying decayed copies particularly difficult. In this work, we employ a transformer architecture for TE classification. TEclass2 is an integrated classifier that rapidly predicts TE orders and superfamilies using models built on this machine learning approach. The software is available through a web interface, which allows users to classify sequences into sixteen superfamilies according to the Wicker classification system. Alternatively, users can download the source code to train and build custom classification models.

Transposable elements (TEs) constitute a significant portion of eukaryotic genomes and play crucial roles in genome evolution, yet their diverse and complex sequences pose challenges for accurate classification. Existing tools often lack reliability in TE classification, limiting genomic analyses. Here, we present TEclass2, a software employing a deep learning approach based on a linear transformer architecture with k-mer tokenization and sequence-specific adaptations to classify TE consensus sequences into sixteen superfamilies. TEclass2 demonstrates improved classification performance and offers flexible model training on custom datasets. Accessible via a web interface with pre-trained models, TEclass2 facilitates rapid and reliable TE classification. These advancements provide a foundation for enhanced genomic annotation and support further bioinformatics research involving transposable elements.
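The k-mer tokenization mentioned above can be illustrated with a minimal sketch. This is not the TEclass2 implementation: the function names (`build_vocab`, `kmer_tokenize`, `encode`), the choice of k = 3, the stride of 1, the `ACGT` alphabet, and the `<unk>` token are all assumptions made for illustration, and the summary does not state which values the authors use.

```python
from itertools import product


def build_vocab(k, alphabet="ACGT"):
    """Map every possible k-mer over the alphabet to an integer id.

    Id 0 is reserved for k-mers containing other symbols (e.g. N).
    Hypothetical helper, not part of TEclass2.
    """
    vocab = {"<unk>": 0}
    for idx, letters in enumerate(product(alphabet, repeat=k), start=1):
        vocab["".join(letters)] = idx
    return vocab


def kmer_tokenize(seq, k, stride=1):
    """Split a DNA sequence into overlapping k-mers (sliding window)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def encode(seq, vocab, k, stride=1):
    """Convert a sequence into the list of token ids a model would consume."""
    unk = vocab["<unk>"]
    return [vocab.get(kmer, unk) for kmer in kmer_tokenize(seq, k, stride)]


if __name__ == "__main__":
    vocab = build_vocab(3)           # 4**3 k-mers plus the unknown token
    print(kmer_tokenize("ACGTN", 3))
    print(encode("ACGTN", vocab, 3))
```

With a stride of 1, a sequence of length n yields n − k + 1 tokens, so the token sequence is nearly as long as the input. That is one reason an attention mechanism whose cost grows linearly with sequence length is attractive for long TE consensus sequences.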

## Full-text entities

- **Genes:** hat (half out) [NCBI Gene 47804]
- **Diseases:** COVID-19 (MESH:D000086382), SINEs (MESH:D031368), injury to (MESH:D014947), TEs (MESH:C565217)
- **Species:** Homo sapiens (human, species) [taxon 9606], Danio rerio (leopard danio, species) [taxon 7955], Drosophila melanogaster (fruit fly, species) [taxon 7227], Oryza sativa (Asian cultivated rice, species) [taxon 4530], Mus musculus (house mouse, species) [taxon 10090], Platyhelminthes (flatworm, phylum) [taxon 6157]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12785036/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12785036/full.md

## References

50 references — full list in the complete paper: https://tomesphere.com/paper/PMC12785036/full.md

---
Source: https://tomesphere.com/paper/PMC12785036