An Empirical Analysis of Fine-Tuning Large Language Models on Bioinformatics Literature: PRSGPT and BioStarsGPT
Muhammad Muneeb, David B. Ascher

TL;DR
This paper introduces a reproducible pipeline for fine-tuning large language models on bioinformatics data. It is demonstrated through two specialized models, PRSGPT and BioStarsGPT, which show improved benchmark performance and yield rich datasets for domain-specific applications.
Contribution
The paper presents a novel, scalable pipeline for domain-specific fine-tuning of LLMs using diverse bioinformatics data sources and prompt-based QA generation, with extensive benchmarking and human evaluation.
Findings
Qwen2.5-7B was the best performer of the three fine-tuned models (LLaMA-3.2-3B, Qwen2.5-7B, Gemma) on the lexical and semantic benchmarks.
PRSGPT achieved 61.9% accuracy in human evaluation.
BioStarsGPT demonstrated 59% conceptual accuracy.
Abstract
Large language models (LLMs) often lack specialized knowledge for complex bioinformatics applications. We present a reproducible pipeline for fine-tuning LLMs on specialized bioinformatics data, demonstrated through two use cases: PRSGPT, focused on polygenic risk score (PRS) tools, and BioStarsGPT, trained on community forum discussions. The nine-step pipeline integrates diverse data sources, structured preprocessing, prompt-based question-answer (QA) generation (via Google Gemini), natural language inference (NLI) for quality control, semantic deduplication, clustering-based data splitting, and parameter-efficient fine-tuning using LoRA. We fine-tuned three LLMs (LLaMA-3.2-3B, Qwen2.5-7B, Gemma) and benchmarked them on over 14 lexical and semantic metrics. Qwen2.5-7B emerged as the best performer, with BLEU-4 and ROUGE-1 improvements of 82% and 70% for PRSGPT and 6% and 18% for…
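Two of the pipeline steps named in the abstract, semantic deduplication and clustering-based data splitting, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the embedding model, the similarity threshold, or the clustering algorithm, so the toy 2-D vectors, the 0.95 threshold, the precomputed cluster labels, and the function names `semantic_dedup` and `cluster_split` are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_dedup(embeddings, threshold=0.95):
    """Greedy deduplication (hypothetical helper).

    Keeps an item only if its cosine similarity to every item kept so far
    is below `threshold`. Returns the indices of the kept items.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def cluster_split(cluster_ids, test_fraction=0.2, seed=0):
    """Cluster-level train/test split (hypothetical helper).

    Whole clusters go to one side, so similar QA pairs never straddle
    the train/test boundary. `cluster_ids[i]` is the cluster of item i.
    """
    groups = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        groups[cid].append(idx)

    clusters = sorted(groups)
    random.Random(seed).shuffle(clusters)

    target = test_fraction * len(cluster_ids)
    train, test = [], []
    for cid in clusters:
        # Fill the test side first, then send remaining clusters to train.
        if len(test) < target:
            test.extend(groups[cid])
        else:
            train.extend(groups[cid])
    return sorted(train), sorted(test)


# Toy data: item 1 is a near-duplicate of item 0.
toy_embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.7, 0.7]]
kept = semantic_dedup(toy_embeddings)

# Toy cluster labels for ten items, two items per cluster.
toy_clusters = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
train_idx, test_idx = cluster_split(toy_clusters, test_fraction=0.2)
```

Splitting at the cluster level rather than the item level is the point of the second function: if near-paraphrased questions landed on both sides of the split, lexical metrics such as BLEU and ROUGE could be inflated by memorization rather than generalization.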
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare
