# What Large Language Models offer about Familial Mediterranean Fever: An Analysis of Quality, Readability, Completeness, and Accuracy

**Authors:** Burak Tayyip Dede, Didem Erdem Gürsoy, Muhammed Oğuz, Bülent Alyanak, Fatih Bağcıer

PMC · DOI: 10.31138/mjr.261224.hfm · Mediterranean Journal of Rheumatology · 2025-08-20

## TL;DR

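This study evaluates how well three large language models (ChatGPT-4, Copilot, Gemini) answer 25 popular questions about Familial Mediterranean Fever. Responses were accurate and acceptably complete, but the authors raise serious concerns about their readability and quality.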

## Contribution

The study compares three LLMs (ChatGPT-4, Copilot, Gemini) on 25 popular FMF questions, scoring each response for readability (FRES, FKG), quality (EQIP), and accuracy and completeness (Likert scales).

## Key findings

- LLMs showed high accuracy (means 4.88 to 4.96 on a 5-point scale) and acceptable completeness (means 2.36 to 2.84 on a 3-point scale); ChatGPT-4 led in completeness (p=0.006).
- Gemini outperformed the other LLMs in the quality assessment with the EQIP tool (p<0.001).
- Responses were hard to read (mean FRES 29.80 to 35.66; FKG 12.36 to 13.72), with no significant differences between models in accuracy or readability (p>0.05).

## Abstract

The aim of this study was to evaluate the quality, completeness, accuracy, and readability of large language model (LLM) responses to 25 popular questions about Familial Mediterranean Fever (FMF).

The readability of the responses of three LLMs (ChatGPT-4, Copilot, Gemini) was assessed with the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade (FKG). The Ensuring Quality Information for Patients (EQIP) tool was used to assess quality. Completeness was rated on a 3-point Likert scale and accuracy on a 5-point Likert scale.
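For readers unfamiliar with the two readability metrics, the sketch below shows the standard published Flesch formulas that FRES and FKG refer to. It is only an illustration: the function names are hypothetical, the counts in the usage lines are made up, and this summary does not state which tool or syllable-counting method the authors used, so scores from a given calculator may differ slightly.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula; approximates a US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for one response (not data from the study):
# 100 words, 5 sentences, 150 syllables.
fres = flesch_reading_ease(words=100, sentences=5, syllables=150)
fkg = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"FRES={fres:.2f}, FKG={fkg:.2f}")
```

Both formulas depend only on average sentence length and average syllables per word, which is why longer sentences and more polysyllabic medical terms push FRES down and FKG up.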

The mean FRES scores of LLMs ranged between 29.80 and 35.66. The FKG scores ranged between 12.36 and 13.72. The mean accuracy scores of LLMs ranged between 4.88 and 4.96. No significant difference was found between the LLM groups regarding accuracy and readability scores (p>0.05). The mean completeness scores of LLMs ranged between 2.36 and 2.84. ChatGPT-4 was the leading LLM in completeness scores according to the Likert scale, and the difference between LLM groups was statistically significant (p=0.006). Gemini performed better in the quality analysis with the EQIP tool, and there was a statistically significant difference between the LLM groups (p<0.001).
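To put the reported FRES range in context, the following sketch maps a score to the conventional Flesch interpretation bands. The band labels are the commonly cited Flesch categories, not a classification taken from this paper, and `fres_band` is a hypothetical helper name.

```python
# Conventional Flesch Reading Ease bands as (lower bound, label), highest first.
# These labels are the commonly cited interpretation, not the paper's own.
FRES_BANDS = [
    (90.0, "very easy"),
    (80.0, "easy"),
    (70.0, "fairly easy"),
    (60.0, "standard"),
    (50.0, "fairly difficult"),
    (30.0, "difficult"),
]


def fres_band(score: float) -> str:
    """Return the conventional difficulty label for a FRES value."""
    for lower_bound, label in FRES_BANDS:
        if score >= lower_bound:
            return label
    return "very difficult"


# The lowest and highest mean FRES values reported in the abstract.
for mean_score in (29.80, 35.66):
    print(mean_score, fres_band(mean_score))
```

Under this convention the reported means fall in the two hardest bands, which is consistent with FKG values above grade 12 and with the authors' concern about readability.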

In this study, LLMs performed acceptably in accuracy and completeness. However, there are serious concerns about their readability and quality. To improve health information, LLM developers should include more diverse data sources in the training sets of the models. Moreover, the ability of LLMs to provide readability features that are adaptable to the level of education could be an important innovation in this field.

## Linked entities

- **Diseases:** Familial Mediterranean Fever (MONDO:0009572)

## Full-text entities

- **Diseases:** FMF (MESH:D010505)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12536757/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12536757/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC12536757/full.md

---
Source: https://tomesphere.com/paper/PMC12536757