Unlocking RNA mysteries: Predicting subcellular localizations with AI
Nguyen Quoc Khanh Le

Main text
Long non-coding RNAs (lncRNAs) have emerged as pivotal regulators of cellular processes, with their subcellular localization offering critical insights into their functions.1 Aberrant lncRNA localization has been linked to a range of diseases, from cancers to neurodegenerative disorders, underscoring the importance of understanding how lncRNAs are distributed across cellular compartments. Traditional methods, such as fluorescence in situ hybridization (FISH), are resource-intensive and do not scale to high-throughput studies. To address this challenge, Li et al. introduce LncDNN, a machine learning model tailored to predict whether an lncRNA localizes to the nucleolus or the nucleoplasm, offering a transformative approach to RNA biology.2
Their findings, published in Molecular Therapy Nucleic Acids,2 showcase the potential of LncDNN to revolutionize our understanding of lncRNA biology. By integrating diverse feature sets and leveraging Shapley additive explanations (SHAP) interpretability analysis (Figure 1), the model achieved exceptional predictive performance, with AUCs of 0.873 and 0.831 on validation and test datasets, respectively. Beyond prediction, the study sheds light on the sequence features influencing localization, opening avenues for experimental exploration.
Figure 1. Overview of lncRNA localization prediction using LncDNN
ENAC, enhanced nucleic acid composition; ANF, accumulated nucleotide frequency; SHAP, Shapley additive explanations.
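For readers less familiar with the metric, an AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal Python sketch of that pairwise definition. The labels and scores are made-up illustrative values, not data from Li et al., `auc_score` is a hypothetical helper name, and which compartment is treated as the positive class is an assumption.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Illustrative values only: 1 and 0 stand for the two compartments.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(auc_score(labels, scores))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; library implementations typically use a sort-based equivalent.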
The complexity of lncRNAs lies not only in their sequence diversity but also in their spatial dynamics within cells. Localization patterns often dictate lncRNA function, influencing processes such as transcriptional regulation, RNA stability, and protein interactions. In diseases like cancer, mislocalized lncRNAs can act as oncogenes or tumor suppressors, making them attractive targets for therapeutic intervention. Yet the limited scalability of traditional methods restricts their utility in large-scale studies. The advent of computational approaches, especially those employing machine learning, has transformed genomics and transcriptomics research. These tools offer a scalable, cost-effective means to analyze vast datasets, predict functional attributes, and infer biological mechanisms. LncDNN exemplifies this paradigm shift, providing both predictive accuracy and biological interpretability.
Li et al. adopted a data-driven approach, harnessing the RNALocate v.2.0 database3 to curate a high-quality dataset of lncRNA sequences localized to the nucleolus and nucleoplasm. To improve model generalizability, sequences were clustered at 90% similarity to remove redundant entries, so that training and testing used distinct samples. The model incorporated three complementary feature sets:
- (1) Enhanced nucleic acid composition (ENAC): captures nucleotide composition patterns within sliding windows along the sequence.
- (2) Binary encoding: encodes positional information of nucleotides.
- (3) Accumulated nucleotide frequency (ANF): reflects the cumulative frequency of each nucleotide along the sequence.
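To make the three feature sets more concrete, here is a minimal Python sketch of the encodings as they are commonly defined in sequence-feature toolkits. It is not taken from the LncDNN code: the function names, the window size of 5, and the restriction to the four bases A, C, G, and U are assumptions, and the handling of variable-length sequences is left out.

```python
NUCLEOTIDES = "ACGU"  # assumes sequences contain only these four bases


def binary_encoding(seq):
    """One-hot encode each position: A=1000, C=0100, G=0010, U=0001."""
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq]


def anf(seq):
    """Accumulated nucleotide frequency.

    Position i (1-based) gets the fraction of the first i bases
    that equal the base found at position i.
    """
    counts = {n: 0 for n in NUCLEOTIDES}
    densities = []
    for i, base in enumerate(seq, start=1):
        counts[base] += 1
        densities.append(counts[base] / i)
    return densities


def enac(seq, window=5):
    """Nucleotide composition inside each sliding window.

    window=5 is an assumed setting, not one reported in the text.
    """
    features = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        features.append([chunk.count(n) / window for n in NUCLEOTIDES])
    return features


seq = "ACGUAC"  # toy sequence, not from the dataset
print(binary_encoding(seq)[0])
print(anf(seq))
print(enac(seq))
```

In practice the three outputs would be flattened and concatenated into one feature vector per sequence before being passed to a model.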
The integration of these features enabled LncDNN to capture intricate sequence patterns influencing localization. Importantly, the use of SHAP analysis provided interpretability, identifying key sequence motifs and positional features driving predictions. This interpretability bridges the gap between computational predictions and experimental validation, fostering confidence among biologists.
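SHAP attributes a single prediction to individual input features using Shapley values. The sketch below computes exact Shapley values for a tiny toy model by enumerating every feature ordering. It illustrates the underlying idea only: the toy scoring function, the all-zero baseline, and the function names are hypothetical, and the SHAP library relies on much faster approximations rather than this brute-force enumeration.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features absent from a coalition are set to their baseline value.
    Enumerates all orderings, so only practical for a handful of features.
    """
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for feature in order:
            # Switch this feature from its baseline to its observed value
            current[feature] = x[feature]
            output = model(current)
            contributions[feature] += output - previous_output
            previous_output = output
    return [c / len(orderings) for c in contributions]


def toy_model(features):
    # Hypothetical scoring function with an interaction between features 0 and 1
    return 2.0 * features[0] + features[0] * features[1] + 0.5 * features[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The interaction term is split equally between features 0 and 1, and the attributions sum to the difference between the prediction and the baseline output, which is the property that makes such values usable as per-feature explanations.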
The predictive capabilities of LncDNN extend beyond academic curiosity. Accurate localization predictions could aid in identifying lncRNAs as biomarkers or therapeutic targets. For instance, nucleolar-localized lncRNAs often influence ribosome biogenesis and cellular stress responses, pathways frequently dysregulated in cancer.4 Similarly, nucleoplasmic lncRNAs can modulate transcriptional networks, with implications for developmental disorders and neurodegeneration.5 Moreover, the interpretability of LncDNN addresses a long-standing challenge in AI-driven biology: understanding the “why” behind predictions. By identifying sequence features linked to localization, the model provides testable hypotheses, enabling targeted experimental validation and functional studies.
While LncDNN represents a significant advancement, certain limitations warrant discussion. Although RNALocate v.2.0 is comprehensive,3 relying on it alone may introduce biases inherent to the database. Expanding the dataset to include diverse cell types, species, and experimental conditions would enhance model robustness and generalizability. Another limitation is the lack of experimental validation of the model’s predictions. Future studies could integrate in vitro and in vivo assays to confirm the localization and functional roles of predicted lncRNAs. Additionally, extending the model to predict localization across more subcellular compartments or dynamic cellular states could further broaden its applicability.
Li et al.’s work exemplifies the potential of machine learning in RNA biology, offering a roadmap for future innovations.2 By combining predictive accuracy with biological interpretability, LncDNN sets a benchmark for computational tools in genomics. The integration of such models with multi-omics data could unlock deeper insights into RNA biology, revealing novel therapeutic targets and biomarkers.
As computational tools evolve, their synergy with experimental workflows will be crucial. Collaborative efforts between computational scientists and experimental biologists can bridge the gap between prediction and validation, accelerating discoveries in RNA biology. The journey to fully understand lncRNA localization and function is far from over, but LncDNN represents a significant step forward, illuminating new paths for exploration and innovation.
Acknowledgments
This work is supported by the National Science and Technology Council, Taiwan (grant number MOST111-2628-E-038-002-MY3).
Declaration of interests
The author declares no competing interests.
References
- 1. Mattick J.S., Amaral P.P., Carninci P., Carpenter S., Chang H.Y., Chen L.-L., Chen R., Dean C., Dinger M.E., Fitzgerald K.A. Long non-coding RNAs: definitions, functions, challenges and recommendations. Nat. Rev. Mol. Cell Biol. 24 (2023), 430-447. doi: 10.1038/s41580-022-00566-8. PMID: 36596869. PMCID: PMC10213152.
- 2. Li J., Ju Y., Zou Q., Ni F. lncRNA localization and feature interpretability analysis. Mol. Ther. Nucleic Acids 36 (2025), 102425. doi: 10.1016/j.omtn.2024.102425. PMID: 39926317. PMCID: PMC11803160.
- 3. Cui T., Dou Y., Tan P., Ni Z., Liu T., Wang D., Huang Y., Cai K., Zhao X., Xu D. RNALocate v2.0: an updated resource for RNA subcellular localization with increased coverage and annotation. Nucleic Acids Res. 50 (2022), D333-D339. doi: 10.1093/nar/gkab825. PMID: 34551440. PMCID: PMC8728251.
- 4. Hwang S.-P., Denicourt C. The impact of ribosome biogenesis in cancer: from proliferation to metastasis. NAR Cancer 6 (2024), zcae017. doi: 10.1093/narcan/zcae017. PMID: 38633862. PMCID: PMC11023387.
- 5. Statello L., Guo C.-J., Chen L.-L., Huarte M. Gene regulation by long non-coding RNAs and its biological functions. Nat. Rev. Mol. Cell Biol. 22 (2021), 96-118. doi: 10.1038/s41580-020-00315-9. PMID: 33353982. PMCID: PMC7754182.
