# Pipeline to explore information on genome editing using large language models and genome editing meta-database

**Authors:** Takayuki Suzuki, Hidemasa Bono

PMC · DOI: 10.1093/database/baaf022 · Database: The Journal of Biological Databases and Curation · 2025-03-08

## TL;DR

This paper introduces a pipeline that combines large language models with the Genome Editing Meta-database (GEM) to extract genome editing (GE) information from GE-related articles and convert it into scores for prioritizing target genes.

## Contribution

The paper contributes a pipeline that combines large language models with a genome editing meta-database to extract GE information and prioritize genes.

## Key findings

- A systematic method was developed to extract GE information using large language models.
- Extracted GE information was converted into metrics to prioritize genes for future research.
- The extracted information and scores are expected to make the selection of target genes for genome editing more efficient.
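
The second finding, turning extracted GE information into metrics, can be illustrated with a minimal sketch. The record fields, the counting scheme, and the sort order below are assumptions chosen for illustration; they are not the scores defined in the paper, and the gene and article identifiers are placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GERecord:
    """One extracted GE event (hypothetical schema, not the paper's)."""
    gene: str
    species: str
    tool: str
    article_id: str


def rank_genes(records):
    """Count distinct articles, species, and tools per gene, then rank."""
    articles = defaultdict(set)
    species = defaultdict(set)
    tools = defaultdict(set)
    for r in records:
        articles[r.gene].add(r.article_id)
        species[r.gene].add(r.species)
        tools[r.gene].add(r.tool)
    table = [
        {
            "gene": g,
            "n_articles": len(articles[g]),
            "n_species": len(species[g]),
            "n_tools": len(tools[g]),
        }
        for g in articles
    ]
    # Illustrative ordering: most-reported genes first, ties broken by name.
    table.sort(key=lambda row: (-row["n_articles"], -row["n_species"], row["gene"]))
    return table


# Placeholder records; the last one repeats an article and is counted once.
records = [
    GERecord("geneA", "Homo sapiens", "CRISPR-Cas9", "art1"),
    GERecord("geneA", "Mus musculus", "CRISPR-Cas9", "art2"),
    GERecord("geneB", "Mus musculus", "TALEN", "art3"),
    GERecord("geneA", "Mus musculus", "CRISPR-Cas9", "art2"),
]
ranking = rank_genes(records)
```

Sorting in the opposite direction would instead surface genes with little prior GE evidence, which may be the more useful view when looking for unexplored targets.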

## Abstract

Genome editing (GE) is widely recognized as an effective and valuable technology in life sciences research. However, certain genes are difficult to edit, depending on factors such as the species, the target sequence, and the GE tool. Therefore, confirming whether GE has been performed in previous publications is crucial for effectively designing and establishing GE-based research. Although the Genome Editing Meta-database (GEM: https://bonohu.hiroshima-u.ac.jp/gem/) aims to provide GE information that is as comprehensive as possible, it does not indicate how each registered gene is involved in GE. In this study, we developed a systematic method that uses large language models to extract essential GE information from GEM-based data and GE-related articles. This approach allows for a systematic and efficient investigation of GE information that cannot be achieved using the current GEM alone. In addition, by converting the extracted GE information into metrics, we propose a potential application of this method to prioritize genes for future research. The extracted GE information and novel GE-related scores are expected to facilitate the efficient selection of target genes for GE and support the design of research using GE.

Database URLs: https://github.com/szktkyk/extract_geinfo, https://github.com/szktkyk/visualize_geinfo
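
As a rough illustration of the extraction step described in the abstract, the sketch below builds a prompt for an article excerpt and validates a JSON reply from a language model. The prompt wording, the field names, and the `fake_llm` stand-in are all assumptions; the actual prompts and schema are in the authors' repositories and may differ.

```python
from __future__ import annotations

import json

# Hypothetical fields for one GE event; not taken from the paper.
REQUIRED_FIELDS = ("gene", "species", "tool")


def build_prompt(article_text: str) -> str:
    """Ask the model for a JSON list of GE events found in the text."""
    return (
        "List every genome editing event in the text below as a JSON array. "
        "Each item must have the keys: " + ", ".join(REQUIRED_FIELDS) + ".\n\n"
        + article_text
    )


def parse_response(raw: str) -> list[dict]:
    """Keep only well-formed items; return an empty list on invalid JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    records = []
    for item in data:
        if not isinstance(item, dict):
            continue
        values = [item.get(field) for field in REQUIRED_FIELDS]
        if all(isinstance(v, str) and v.strip() for v in values):
            records.append(dict(zip(REQUIRED_FIELDS, (v.strip() for v in values))))
    return records


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned reply."""
    return json.dumps([
        {"gene": "geneA", "species": "Mus musculus", "tool": "CRISPR-Cas9"},
        {"gene": "", "species": "Homo sapiens", "tool": "TALEN"},  # dropped: empty gene
    ])


extracted = parse_response(fake_llm(build_prompt("placeholder article text")))
```

Validating the reply before use matters because model output is not guaranteed to be valid JSON or to contain every requested key.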

## Full-text entities

- **Databases:** GEM (Genome Editing Meta-database)
- **Diseases:** FP (MESH:D017541), toxicity (MESH:D064420), LLMs (MESH:D007806), PD (MESH:D010300), TN (MESH:C579935), GE (MESH:D042822), OS (MESH:D000079225)
- **Chemicals:** FP (-)
- **Species:** Homo sapiens (human, species) [taxon 9606], Mus musculus (house mouse, species) [taxon 10090]
- **Models:** Llama3-70b (large language model)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC11890094/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11890094/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC11890094/full.md

---
Source: https://tomesphere.com/paper/PMC11890094