DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units

Lei Mao; Yuanhe Tian; Yan Song

arXiv:2505.02206·cs.LG·May 6, 2025

TL;DR

DNAZEN introduces a novel genomic representation framework that leverages mixed granularities of coding units, including G-grams, to improve gene sequence modeling and downstream task performance.

Contribution

The paper proposes a new method to incorporate multiple granularities of gene sequence units, especially G-grams, into Transformer-based models for enhanced genomic representations.

Findings

01. DNAZEN outperforms existing models on benchmark datasets.

02. Whole G-gram masking improves training effectiveness.

03. Incorporating G-grams enhances downstream task accuracy.
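
Whole G-gram masking is not spelled out in this summary. As a rough illustration only, the sketch below assumes that a G-gram is a contiguous run of tokens listed in a predefined vocabulary, and that masking a G-gram means replacing all of its tokens with a mask symbol together, by analogy with whole-word masking in language models. The function names, the `[MASK]` symbol, the greedy longest-match rule, the example tokens, and the masking rate are assumptions made for illustration, not details taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol


def find_ggram_spans(tokens, ggram_vocab, max_len=4):
    """Greedy longest-match scan (an assumed matching rule).

    Returns non-overlapping (start, end) index pairs, end exclusive,
    whose tokens form an entry of ggram_vocab (a set of token tuples).
    """
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first; spans need at least 2 tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in ggram_vocab:
                spans.append((i, i + n))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


def whole_ggram_mask(tokens, ggram_vocab, mask_prob=0.15, rng=None):
    """Mask every token of a matched G-gram together, or none of them.

    Tokens outside any matched span are masked independently.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    covered = set()
    for start, end in find_ggram_spans(tokens, ggram_vocab):
        covered.update(range(start, end))
        if rng.random() < mask_prob:  # one draw per span
            for j in range(start, end):
                masked[j] = MASK
    for j in range(len(tokens)):
        if j not in covered and rng.random() < mask_prob:
            masked[j] = MASK
    return masked


# Hypothetical tokens and G-gram vocabulary, for illustration only.
tokens = ["ACG", "TT", "GGA", "C", "TATA", "AT"]
vocab = {("ACG", "TT"), ("C", "TATA", "AT")}
print(whole_ggram_mask(tokens, vocab, mask_prob=0.5, rng=random.Random(0)))
```

The point of the sketch is the single random draw per span: the tokens of a matched G-gram are never split between masked and unmasked, so a model trained this way cannot recover a masked token from the rest of its own G-gram.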

Abstract

Genome modeling conventionally treats gene sequences as a language, reflecting their structured motifs and long-range dependencies, which are analogous to linguistic units and organizing principles such as words and syntax. Recent studies use advanced neural networks, ranging from convolutional and recurrent models to Transformer-based models, to capture the contextual information of gene sequences, with the primary goal of obtaining effective gene sequence representations and thus enhancing the models' understanding of various gene samples. However, these approaches often apply language modeling techniques to gene sequences directly and do not fully consider the intrinsic organization of information in them, in particular how units at different granularities contribute to the representation. In this paper, we propose DNAZEN, an enhanced genomic representation framework designed to learn…

Code & Models

Repositories

oomics/dnazen
PyTorch · Official

Taxonomy

Topics: Algorithms and Data Compression · Gene expression and cancer classification · DNA and Biological Computing