# Development of Large Language Model Specialized into Microbiome Datasets: an Application of Self-Evaluation and Scoring Comparison with Conventional Natural Language Processing Markers

**Authors:** Chan Kyu Park, Sung Hwan Bae, Hyeon Woo Park, Nam Su Oh, Young Jun Kim, Young-Wan Kim, Tae Jin Cho, Ying Li, Jianmin Chai, Jiangchao Zhao, Hyung Taek Cho, Ji Hoon Jung, Jinbong Park, Tae Gyun Kim, Jae Kyeom Kim

PMC · DOI: 10.4014/jmb.2511.11050 · Journal of Microbiology and Biotechnology · 2026-01-26

## TL;DR

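METABOLISM is a microbiome-specialized large language model, fine-tuned on 160,000 scientific abstracts, that answers questions about the gut microbiome and its interactions with the liver with higher relevance and clarity than general-purpose LLMs.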

## Contribution

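The main contribution is METABOLISM, a large language model fine-tuned with LoRA on microbiome literature and optimized for domain-specific biological reasoning. The study also compares conventional NLP metrics against Phi-4 automated scoring and a rubric assessment by 20 domain experts.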

## Key findings

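- METABOLISM scored higher on relevance and clarity (mean > 7.5 ± 0.06) than general-purpose LLMs such as Gemma-3-12B-IT and ChatGPT-4o on microbiome-related questions.
- Traditional NLP metrics (BLEU, ROUGE) correlated negatively, at weak to moderate strength (reported R = -0.65, p < 0.0001), with human expert rubric scores; correlations with Phi-4 automated scores followed a similar trend.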
- The model demonstrates potential for synthesizing complex microbiome data into interpretable biological insights.
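The comparison between surface-overlap metrics and rubric scores in the findings above can be illustrated with a minimal sketch. It is not the authors' evaluation pipeline: `rouge1_f1` is a simplified unigram-overlap stand-in for ROUGE-1, `pearson_r` is a plain Pearson coefficient, and the answers and rubric scores in the demo are hypothetical toy values, not data from the paper.

```python
from collections import Counter
from math import sqrt
from typing import Sequence


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased,
    whitespace-split tokens (no stemming, no stopword handling)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical toy inputs for illustration only (NOT from the paper).
    reference = "short-chain fatty acids produced by gut bacteria reach the liver"
    answers = [
        "gut bacteria produce short-chain fatty acids that travel to the liver",
        "short-chain fatty acids reach the liver",
        "short-chain fatty acids produced by gut bacteria reach the liver",
    ]
    rubric_scores = [8.0, 6.5, 3.0]  # made-up expert ratings
    overlap_scores = [rouge1_f1(a, reference) for a in answers]
    print(overlap_scores, pearson_r(overlap_scores, rubric_scores))
```

A well-reasoned answer can paraphrase the reference and earn a low overlap score, while a near-verbatim copy scores highly without being more useful. That is one way a negative correlation between overlap metrics and expert ratings can arise.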

## Abstract

The gut microbiome plays a fundamental role in host metabolism, immune regulation, and disease development. With the rapid accumulation of multi-omics and literature data, the microbiome field now faces the challenge of efficiently extracting scientific insights from massive, heterogeneous datasets. Artificial intelligence (AI) and large language models (LLMs) provide promising tools to address this complexity by enabling integrative analysis and knowledge synthesis across diverse biological sources. In this study, we developed METABOLISM, a microbiome-specialized LLM fine-tuned on 160,000 scientific abstracts to enhance literature-based contextual understanding of microbiome–liver interactions and related biological mechanisms. Using LoRA-based parameter-efficient training, METABOLISM was optimized for domain-specific reasoning and response generation. Model performance was evaluated through both automated Phi-4 scoring (a large language model–based evaluator for relevance, informativeness, and fluency) and structured human expert rubric assessments involving 20 domain specialists. The fine-tuned METABOLISM achieved superior relevance and clarity scores (mean > 7.5 ± 0.06) compared with general-purpose LLMs such as Gemma-3-12B-IT and ChatGPT-4o. Correlation analysis revealed weak to moderate negative relationships (R = –0.65, p < 0.0001) between traditional NLP metrics (BLEU, ROUGE) and human expert rubric scores, with a similar trend observed for correlations with Phi-4–based automated evaluation scores, indicating the limitations of surface-level similarity measures in biomedical contexts. Overall, our findings demonstrate that microbiome-adapted LLMs can effectively distill high-volume scientific data into biologically meaningful insights, supporting more efficient and interpretable research in microbiology and systems biology.
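The abstract names LoRA as the parameter-efficient training method without giving hyperparameters in this summary. As a rough sketch of the general technique, and not of the authors' configuration, the code below shows the standard LoRA update, in which a frozen weight matrix `W` is augmented by a scaled low-rank product `B @ A`, and counts the trainable parameters this adds. The layer size (4096 × 4096), rank (8), and `alpha` values are assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector,
                 alpha: float, rank: int) -> Vector:
    """y = W x + (alpha / rank) * B (A x).

    W (d_out x d_in) stays frozen; only A (rank x d_in) and
    B (d_out x rank) would be trained.
    """
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors A and B."""
    return rank * (d_in + d_out)


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out


if __name__ == "__main__":
    # Assumed dimensions for illustration; the paper's settings are not given here.
    d_in, d_out, rank = 4096, 4096, 8
    trainable = lora_trainable_params(d_in, d_out, rank)
    print(trainable, trainable / full_params(d_in, d_out))
```

With `B` initialized to zeros, as in the original LoRA formulation, the adapted layer starts out identical to the frozen base layer, so fine-tuning begins from the pretrained model's behavior.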

## Full-text entities

- **Diseases:** liver cancer (MESH:D006528), cirrhosis (MESH:D005355), hepatic inflammation (MESH:D007249), alcoholic liver disease (MESH:D008108), liver disease (MESH:D008107), hepatitis (MESH:D056486), NAFLD (MESH:D065626)
- **Chemicals:** short-chain fatty acid (MESH:D005232), lipopolysaccharides (MESH:D008070), bile acid (MESH:D001647)
- **Species:** gut metagenome (species) [taxon 749906], Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12868943/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12868943/full.md

## References

15 references — full list in the complete paper: https://tomesphere.com/paper/PMC12868943/full.md

---
Source: https://tomesphere.com/paper/PMC12868943