Plain language adaptations of biomedical text using LLMs: Comparison of evaluation metrics
Primoz Kocbek, Leon Kopitar, Gregor Stiglic

TL;DR
This study compares different LLM-based methods for simplifying biomedical texts to improve health literacy, evaluating their performance with various quantitative and qualitative metrics.
Contribution
It introduces and compares prompt-based, multi-agent, and fine-tuning approaches for biomedical text simplification using LLMs, highlighting the effectiveness of gpt-4o-mini.
Findings
gpt-4o-mini outperformed other models
Fine-tuning approaches underperformed
G-Eval metric aligned well with qualitative assessments
Abstract
This study investigated the application of Large Language Models (LLMs) for simplifying biomedical texts to enhance health literacy. Using a public dataset that included plain language adaptations of biomedical abstracts, we developed and evaluated several approaches: a baseline approach using a prompt template, a two-AI-agent approach, and a fine-tuning (FT) approach. We selected the OpenAI gpt-4o and gpt-4o-mini models as baselines for further research. We evaluated our approaches with quantitative metrics (Flesch-Kincaid grade level, SMOG Index, SARI, BERTScore, and G-Eval) as well as with a qualitative metric, namely 5-point Likert scales for simplicity, accuracy, completeness, and brevity. Results showed superior performance of gpt-4o-mini and underperformance of the FT approaches. G-Eval, an LLM-based quantitative metric, showed promising results, ranking the…
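The two readability metrics named in the abstract, Flesch-Kincaid grade level and SMOG Index, are closed-form formulas over sentence, word, and syllable counts. The sketch below is a minimal illustration of those standard formulas, not the authors' evaluation code. The helper names (`count_syllables`, `split_sentences`, `split_words`) are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for simplicity; a real evaluation would more likely use a dictionary-based counter or an established readability library. SMOG is formally defined on 30-sentence samples, so the formula rescales the polysyllable count for shorter texts.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as groups of consecutive vowels (heuristic assumption)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def split_sentences(text: str) -> list[str]:
    """Naive sentence split on terminal punctuation (assumption: no abbreviations)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def split_words(text: str) -> list[str]:
    """Treat runs of ASCII letters as words."""
    return re.findall(r"[A-Za-z]+", text)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39 * words/sentence + 11.8 * syllables/word - 15.59."""
    sentences = split_sentences(text)
    words = split_words(text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def smog_index(text: str) -> float:
    """SMOG: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291."""
    sentences = split_sentences(text)
    if not sentences:
        raise ValueError("text must contain at least one sentence")
    # Polysyllables are words with three or more syllables.
    polysyllables = sum(1 for w in split_words(text) if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


if __name__ == "__main__":
    original = "Myocardial infarction necessitates immediate pharmacological intervention."
    adapted = "A heart attack needs quick care. Drugs can help."
    for label, text in [("original", original), ("adapted", adapted)]:
        print(label, round(flesch_kincaid_grade(text), 2), round(smog_index(text), 2))
```

Lower scores on both metrics indicate text that should be readable at a lower school grade, so a successful plain language adaptation would be expected to score below its source. Neither metric checks whether the meaning was preserved, which is why the study pairs them with SARI, BERTScore, G-Eval, and human Likert ratings.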
Taxonomy
Topics: Text Readability and Simplification · Health Literacy and Information Accessibility · Artificial Intelligence in Healthcare and Education
