mRNA2vec: mRNA Embedding with Language Model in the 5'UTR-CDS for mRNA Design
Honggen Zhang, Xiangrui Gao, June Zhang, Lipeng Lai

TL;DR
mRNA2vec introduces a novel language model-based embedding method for mRNA sequences, improving prediction of translation efficiency, expression levels, and stability, aiding mRNA therapeutic design.
Contribution
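The paper presents a self-supervised embedding approach tailored to mRNA that combines location-aware probabilistic masking with Minimum Free Energy (MFE) prediction and secondary structure tasks. It outperforms existing methods on translation efficiency and expression level prediction and is competitive on stability and protein production.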
Findings
Enhanced translation efficiency prediction in UTRs
Improved expression level prediction in mRNA sequences
Competitive performance in stability and protein production tasks
Abstract
Messenger RNA (mRNA)-based vaccines are accelerating the discovery of new drugs and revolutionizing the pharmaceutical industry. However, selecting particular mRNA sequences for vaccines and therapeutics from extensive mRNA libraries is costly. Effective mRNA therapeutics require carefully designed sequences with optimized expression levels and stability. This paper proposes a novel contextual language model (LM)-based embedding method: mRNA2vec. In contrast to existing mRNA embedding approaches, our method is based on the self-supervised teacher-student learning framework of data2vec. We jointly use the 5' untranslated region (UTR) and coding sequence (CDS) region as the input sequences. We adapt our LM-based approach specifically to mRNA by 1) considering the importance of location on the mRNA sequence with probabilistic masking, 2) using Minimum Free Energy (MFE) prediction and…
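The abstract names two mechanisms that a short sketch may make more concrete: location-dependent probabilistic masking and the teacher-student setup inherited from data2vec. The Python below is an illustration under stated assumptions, not the authors' implementation. The weighting function (more masking near the 5'UTR-CDS boundary), the mask ratio, the "N" placeholder for a mask token, the example sequences, and all function names are hypothetical; the abstract does not specify them. The only part taken from data2vec itself is the exponential moving average (EMA) form of the teacher update.

```python
import random

def position_weights(seq_len, boundary, width=10.0):
    """Hypothetical weighting: positions nearer `boundary` get a larger weight."""
    return [1.0 / (1.0 + abs(i - boundary) / width) for i in range(seq_len)]

def sample_mask(seq_len, boundary, mask_ratio=0.15, seed=None):
    """Pick distinct positions to mask, biased by position_weights (assumed scheme)."""
    rng = random.Random(seed)
    weights = position_weights(seq_len, boundary)
    n_mask = max(1, round(mask_ratio * seq_len))
    # Weighted sampling without replacement (Efraimidis-Spirakis):
    # draw key = u ** (1 / w) per position and keep the n_mask largest keys.
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    return sorted(i for _, i in sorted(keys, reverse=True)[:n_mask])

def ema_update(teacher, student, tau=0.999):
    """data2vec-style teacher update: teacher <- tau * teacher + (1 - tau) * student."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

# Made-up example sequences; the 5'UTR and CDS are concatenated as one input.
utr = "GGGAAAUAAGAGAGAAAAGAAGAGUAAGAAGAAAUAUAAGAGCCACC"
cds = "AUGGCUAGCAAAGGAGAAGAACUUUUCACUGGAGUUGUCCCAAUU"
seq = utr + cds

masked_positions = sample_mask(len(seq), boundary=len(utr), seed=0)
# "N" stands in for a mask token in this sketch.
masked = "".join("N" if i in masked_positions else c for i, c in enumerate(seq))

# Toy parameter vectors showing one EMA step of the teacher toward the student.
teacher = ema_update([1.0, 2.0], [0.0, 0.0], tau=0.9)
```

In a full teacher-student setup, the student would see `masked` while the teacher sees the unmasked `seq`, and the student would be trained to predict the teacher's representations at the masked positions.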
Taxonomy
Topics: RNA and protein synthesis mechanisms · Natural Language Processing Techniques · RNA Research and Splicing
