Character-level Tokenizations as Powerful Inductive Biases for RNA Foundational Models

Adrián Morales-Pastor; Raquel Vázquez-Reza; Miłosz Wieczór; Clàudia Valverde; Manel Gil-Sorribes; Bertran Miquel-Oliver; Álvaro Ciudad; Alexis Molina

arXiv:2411.11808·q-bio.QM·November 19, 2024

Open Access

TL;DR

This paper introduces ChaRNABERT, a suite of RNA foundational models built on a learnable tokenization process, which reaches state-of-the-art performance on several RNA-related tasks and narrows the gap between RNA models and their protein counterparts.

Contribution

The paper presents ChaRNABERT, a sample- and parameter-efficient RNA foundational model with learnable tokenization, improving performance on multiple RNA benchmarks and interaction prediction tasks.

Findings

01

Achieved state-of-the-art performance on RNA benchmarks.

02

Effective in RNA-protein and aptamer-protein interaction prediction.

03

Models are sample- and parameter-efficient.

Abstract

RNA is a vital biomolecule with numerous roles and functions within cells, and interest in targeting it for therapeutic purposes has grown significantly in recent years. However, fully understanding and predicting RNA behavior, particularly for applications in drug discovery, remains a challenge due to the complexity of RNA structures and interactions. While foundational models in biology have demonstrated success in modeling several biomolecules, especially proteins, achieving similar breakthroughs for RNA has proven more difficult. Current RNA models have yet to match the performance observed in the protein domain, leaving an important gap in computational biology. In this work, we present ChaRNABERT, a suite of sample- and parameter-efficient RNA foundational models that, through a learnable tokenization process, reach state-of-the-art performance on several tasks in…

Taxonomy

Topics: RNA and protein synthesis mechanisms · DNA and Nucleic Acid Chemistry · Genomics and Chromatin Dynamics