DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units
Lei Mao, Yuanhe Tian, Yan Song

TL;DR
DNAZEN introduces a novel genomic representation framework that leverages mixed granularities of coding units, including G-grams, to improve gene sequence modeling and downstream task performance.
Contribution
The paper proposes a new method to incorporate multiple granularities of gene sequence units, especially G-grams, into Transformer-based models for enhanced genomic representations.
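One way to picture the mixed-granularity idea is to tag, on top of the base token sequence, every contiguous run of tokens that appears in a precomputed G-gram vocabulary. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. It assumes a G-gram is a contiguous run of base tokens, and the function name `find_ggram_spans`, the toy vocabulary, and the `max_len` cap are all hypothetical.

```python
from typing import List, Set, Tuple


def find_ggram_spans(
    tokens: List[str],
    ggram_vocab: Set[Tuple[str, ...]],
    max_len: int = 4,  # hypothetical cap on G-gram length, in tokens
) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, end exclusive, for every contiguous
    run of 2..max_len tokens that appears in ggram_vocab."""
    spans = []
    for start in range(len(tokens)):
        for length in range(2, max_len + 1):
            end = start + length
            if end > len(tokens):
                break  # longer runs would also overshoot the sequence
            if tuple(tokens[start:end]) in ggram_vocab:
                spans.append((start, end))
    return spans


# Toy example: base tokens and a hand-written (hypothetical) G-gram vocabulary.
tokens = ["ATG", "CGT", "TTA", "GGC", "ATG", "CGT"]
vocab = {("ATG", "CGT"), ("CGT", "TTA", "GGC")}
spans = find_ggram_spans(tokens, vocab)
print(spans)
```

The matched spans may overlap. How overlapping matches are resolved and how the larger units are fed to the Transformer are design choices this summary does not specify.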
Findings
DNAZEN outperforms existing models on benchmark datasets.
Whole G-gram masking improves training effectiveness.
Incorporating G-grams enhances downstream task accuracy.
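Whole G-gram masking can be read by analogy with whole-word masking in text models: when a G-gram is chosen for masking, all of its tokens are masked together, so the model cannot recover a hidden token from its immediate neighbours inside the same unit. The sketch below is a simplified illustration under stated assumptions, not the paper's training code. The function `whole_span_mask`, the `[MASK]` string, the 0.15 default rate, and the rule of skipping overlapping spans are hypothetical, and the sketch always substitutes the mask token rather than using a mixed replacement scheme.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"  # hypothetical mask token string


def whole_span_mask(
    tokens: List[str],
    spans: List[Tuple[int, int]],  # (start, end) pairs, end exclusive
    mask_prob: float = 0.15,       # hypothetical masking rate
    seed: int = 0,
) -> Tuple[List[str], List[Optional[str]]]:
    """Mask each G-gram span as one unit, then mask leftover tokens singly.

    Returns the masked sequence and per-position labels (the original
    token where masked, None elsewhere).
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    covered = set()

    for start, end in spans:
        if any(i in covered for i in range(start, end)):
            continue  # assumption: skip spans overlapping an earlier one
        covered.update(range(start, end))
        if rng.random() < mask_prob:  # one draw decides the whole span
            for i in range(start, end):
                labels[i] = tokens[i]
                masked[i] = MASK

    for i in range(len(tokens)):
        if i not in covered and rng.random() < mask_prob:
            labels[i] = tokens[i]
            masked[i] = MASK

    return masked, labels


tokens = ["ATG", "CGT", "TTA", "GGC", "ATG", "CGT"]
spans = [(0, 2), (4, 6)]
masked, labels = whole_span_mask(tokens, spans, mask_prob=0.5, seed=1)
print(masked)
```

Because a single random draw decides each span, every G-gram in the output is either fully masked or fully visible.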
Abstract
Genome modeling conventionally treats gene sequence as a language, reflecting its structured motifs and long-range dependencies analogous to linguistic units and organization principles such as words and syntax. Recent studies utilize advanced neural networks, ranging from convolutional and recurrent models to Transformer-based models, to capture contextual information of gene sequence, with the primary goal of obtaining effective gene sequence representations and thus enhancing the models' understanding of various running gene samples. However, these approaches often directly apply language modeling techniques to gene sequences and do not fully consider the intrinsic information organization in them; in particular, they do not consider how units at different granularities contribute to representation. In this paper, we propose DNAZEN, an enhanced genomic representation framework designed to learn…
Taxonomy
Topics: Algorithms and Data Compression · Gene expression and cancer classification · DNA and Biological Computing
