Growing connections: PlantConnectome maps molecular networks in Arabidopsis
Jan Wilhelm Huebbers

One of the core strengths of large language models (LLMs) is their ability to reason across vast amounts of text, enabling them to extract functional relationships between entities and words. In essence, an LLM is like someone who has read millions of stories and can map how characters and ideas are connected based on how they appear together.
In new work, Shan Chun Lim and colleagues (2025) used this feature of LLMs to connect the dots between various entities in plant biology, including gene products, metabolites, tissues, and organs. The authors recognized that transforming the vast amount of unstructured information in the literature into a structured knowledge graph is one of the central challenges in biology. They also noted that the ever-growing body of literature makes it increasingly difficult to formulate strong hypotheses about functional biological relationships. The situation is much like solving a 100-piece versus a 1000-piece puzzle: while the larger puzzle provides a more detailed picture, it also takes considerably more time to complete. Computational tools can help solve the puzzle. Several such tools already exist, but many require laborious and inflexible preprocessing of input data, or they provide limited outputs (e.g. no gene–gene relationships or no information on interaction types), or both. To address these limitations, the authors used carefully fine-tuned prompts to text-mine over 71,000 plant research papers and compiled the functional relationships into a user-friendly database, PlantConnectome.
For their analysis, the authors focused on articles that mentioned Arabidopsis thaliana genes in their abstracts, totaling 71,136 articles, of which 19,809 were programmatically accessible as full texts. As a first step, they conducted a meta-analysis of the downloaded abstracts. This included an overview of the current plant literature landscape, revealing distinct groupings, for example, by experimental procedures, plant organs, and biological processes.
To text-mine the retrieved full-text articles and abstracts, they used OpenAI's ChatGPT models to identify (i) the functional relationships between pairs of entities (e.g. gene products, metabolites, tissues), (ii) the experimental basis for each relationship, and (iii) the definition of the extracted entities (Fig.). This pipeline ultimately yielded a knowledge graph representing 4.8 million functional relationships among more than 2.7 million entities.
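The three kinds of extracted information (relationship, experimental basis, entity) map naturally onto edges in a graph. The following is a minimal sketch of such a structure, not the authors' implementation: the `Edge` fields, the helper functions, and the toy records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One extracted relationship; field names are illustrative assumptions."""
    source: str
    relation: str
    target: str
    evidence: str  # experimental basis reported for the relationship


def build_graph(edges):
    """Index edges by source entity so outgoing relationships can be looked up."""
    graph = defaultdict(list)
    for edge in edges:
        graph[edge.source].append(edge)
    return graph


def entities(edges):
    """Collect every distinct entity that appears as a source or a target."""
    return {e.source for e in edges} | {e.target for e in edges}


# Toy records invented for illustration; they are not taken from PlantConnectome.
toy_edges = [
    Edge("gene A", "interacts with", "gene B", "yeast two-hybrid"),
    Edge("gene A", "is expressed in", "root", "reporter assay"),
    Edge("gene B", "regulates", "metabolite X", "mutant analysis"),
]

graph = build_graph(toy_edges)
for edge in graph["gene A"]:
    print(f"{edge.source} --{edge.relation}--> {edge.target} [{edge.evidence}]")
print(len(entities(toy_edges)), "entities,", len(toy_edges), "relationships")
```

Storing the experimental basis on each edge, rather than on the entities, keeps every relationship traceable to the kind of evidence that supports it.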
The authors also acknowledged the limitations of LLMs, including their tendency to hallucinate and misinterpret text. To quantify this, they manually evaluated a subset of extracted edges (i.e. relationships between two entities) and observed an accuracy of over 85%. Most remaining errors involved misinterpretation of hypotheses as facts, which could largely be addressed by fine-tuning models with manually curated input. They also observed that redundancies, such as the use of “Arabidopsis” versus “A. thaliana,” inflate the size of the central knowledge graph (Fig.). To address this, they implemented a protocol to resolve these synonym-related issues.
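To illustrate why synonyms inflate a knowledge graph, the sketch below collapses name variants onto one canonical label and counts entities before and after. It is an assumption-laden toy, not the protocol used for PlantConnectome: the `SYNONYMS` table, the function names, and the example edges are hypothetical.

```python
# Hypothetical synonym table; a real one would be curated or derived from the data.
SYNONYMS = {
    "a. thaliana": "Arabidopsis thaliana",
    "arabidopsis": "Arabidopsis thaliana",
    "arabidopsis thaliana": "Arabidopsis thaliana",
}


def canonical(name):
    """Return the canonical label for a name, or the stripped name if unknown."""
    key = " ".join(name.lower().split())
    return SYNONYMS.get(key, name.strip())


def merge_synonyms(edges):
    """Rewrite both endpoints of every edge and drop exact duplicates, keeping order."""
    seen = set()
    merged = []
    for source, relation, target in edges:
        edge = (canonical(source), relation, canonical(target))
        if edge not in seen:
            seen.add(edge)
            merged.append(edge)
    return merged


def count_entities(edges):
    """Count distinct names appearing at either end of any edge."""
    return len({s for s, _, _ in edges} | {t for _, _, t in edges})


# Toy edges invented for illustration.
raw = [
    ("gene A", "is found in", "Arabidopsis"),
    ("gene A", "is found in", "A. thaliana"),
    ("gene B", "is found in", "Arabidopsis thaliana"),
]
merged = merge_synonyms(raw)
print(count_entities(raw), "entities before,", count_entities(merged), "after")
```

In this toy case, three spellings of the same species become one node, and two edges that differed only in spelling become one.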
PlantConnectome is a community-driven effort designed to systematically advance plant biology. It builds on a carefully analyzed foundation of user needs and was crafted by developers of other web applications such as GeneCAT (Mutwil et al. 2008), PlaNet (Mutwil et al. 2011), and CoNekT (Proost and Mutwil 2018). This study not only contributes a valuable tool to the field but also demonstrates what is possible when scientific research is accessible.
As with other machine learning approaches, the depth and accuracy of PlantConnectome's insights scale with the size and quality of its input data. In this release, 71,136 research articles were analyzed, but approximately 72% (51,327) were available only as abstracts. The full-text versions were either not published open access or were unavailable for high-throughput download. This limitation serves as a clear reminder of the importance of open science and raises the question: How much more powerful could tools like PlantConnectome be if research were truly open?
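The abstract-only share follows directly from the two counts reported in the text (71,136 articles in total, 19,809 of them accessible as full texts). As a quick check of that arithmetic, with variable names chosen here for illustration:

```python
# Counts as reported in the text.
total_articles = 71_136
full_text = 19_809

# Articles for which only the abstract could be mined.
abstract_only = total_articles - full_text
share = abstract_only / total_articles

print(abstract_only)       # 51327
print(f"{share:.1%}")      # 72.2%
```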
Recent related articles in The Plant Cell:
- Grones et al. (2024) proposed general guidelines for executing single-cell transcriptomics and analyzing and storing the retrieved data.
- Lasky et al. (2023) utilized genotype–environment association analyses to uncover the molecular basis of environmental adaptation in Arabidopsis thaliana across diverse environmental gradients.
- VanBuren et al. (2022) presented “Plants & Python,” a series of lessons integrating plant biology and Python programming to enhance computational literacy in plant science education.
- Li et al. (2022) developed “LeafNet,” a deep learning–based tool for automated localization and quantification of stomata and pavement cells from diverse leaf epidermal micrographs.
References
- Grones C, Eekhout T, Shi D, Neumann M, Berg LS, Ke Y, Shahan R, Cox KL, Gomez-Cano F, Nelissen H, et al. Best practices for the execution, analysis, and data storage of plant single-cell/nucleus transcriptomics. Plant Cell. 2024:36(4):812–828. doi:10.1093/plcell/koae003. PMID: 38231860. PMCID: PMC10980355.
- Lasky JR, Josephs EB, Morris GP. Genotype-environment associations to reveal the molecular basis of environmental adaptation. Plant Cell. 2023:35(1):125–138. doi:10.1093/plcell/koac267. PMID: 36005926. PMCID: PMC9806588.
- Li S, Li L, Fan W, Ma S, Zhang C, Kim JC, Wang K, Russinova E, Zhu Y, Zhou Y. LeafNet: a tool for segmenting and quantifying stomata and pavement cells. Plant Cell. 2022:34(4):1171–1188. doi:10.1093/plcell/koac021. PMID: 35080620. PMCID: PMC8972303.
- Lim SC, Itharajula M, Møller MH, Fo K, Chuah YS, Foo H, Davey EE, Fullwood M, Thibault G, Mutwil M. PlantConnectome: knowledge graph database encompassing >71,000 plant articles. Plant Cell. 2025. doi:10.1093/plcell/koaf169. PMID: 40700523. PMCID: PMC12290883.
- Mutwil M, Klie S, Tohge T, Giorgi FM, Wilkins O, Campbell MM, Fernie AR, Usadel B, Nikoloski Z, Persson S. PlaNet: combined sequence and expression comparisons across plant networks derived from seven species. Plant Cell. 2011:23(3):895–910. doi:10.1105/tpc.111.083667. PMID: 21441431. PMCID: PMC3082271.
- Mutwil M, Obro J, Willats WGT, Persson S. GeneCAT–novel webtools that combine BLAST and co-expression analyses. Nucleic Acids Res. 2008:36(suppl_2):W320–W326. doi:10.1093/nar/gkn292. PMID: 18480120. PMCID: PMC2447783.
- Proost S, Mutwil M. CoNekT: an open-source framework for comparative genomic and transcriptomic network analyses. Nucleic Acids Res. 2018:46(W1):W133–W140. doi:10.1093/nar/gky336. PMID: 29718322. PMCID: PMC6030989.
- VanBuren R, Rougon-Cardoso A, Amézquita EJ, Coss-Navarrete EL, Espinosa-Jaime A, Gonzalez-Iturbe OA, Luckie-Duque AC, Mendoza-Galindo E, Pardo J, Rodríguez-Guerrero G, et al. Plants & Python: a series of lessons in coding, plant biology, computation, and bioinformatics. Teaching tools in plant biology: lecture notes. Plant Cell. 2022:34(7):e1. doi:10.1093/plcell/koac187. PMID: 35781826. PMCID: PMC9252484.
