# Benchmarking large language models for pathogen–disease classification in post-acute infection syndromes

**Authors:** Syed Mohammed Khalid, Tom Wölker, Leidy-Alejandra G Molano, Simon Graf, Andreas Keller

PMC · DOI: 10.1093/bib/bbag089 · Briefings in Bioinformatics · 2026-03-06

## TL;DR

This paper benchmarks open-source large language models on identifying pathogen–disease associations in post-acute infection syndromes (PAIS), framed as binary classification over 1000 manually labeled PubMed abstracts.

## Contribution

The study introduces a benchmark for evaluating LLMs in pathogen-disease classification within PAIS-related biomedical literature.

## Key findings

- Zero-shot prompting with Mistral-Small-Instruct-2409 and Llama-3.1-Nemotron-70B-Instruct achieved balanced accuracy scores of 0.81 and 0.80, respectively, with macro-F1 scores of up to 0.80 and minimal invalid outputs.
- Few-shot and chain-of-thought (CoT) prompting often degraded performance in generalist models but improved accuracy and consistency in reasoning models such as DeepSeek-R1-Distill-Llama-70B and QwQ-32B.
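
For reference, both headline metrics have standard definitions for a binary task. This summary does not restate them, so the notation below (TP, FN, TN, FP, and the class index c) is assumed rather than taken from the paper.

$$
\text{Balanced accuracy} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right),
\qquad
\text{macro-F1} = \frac{1}{2}\left(F_1^{+} + F_1^{-}\right),
\quad
F_1^{c} = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}
$$

Under these definitions a classifier that always predicts the same class has a balanced accuracy of 0.5, which is the natural baseline for the reported 0.81 and 0.80.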

## Abstract

Post-Acute Infection Syndromes (PAIS) are medical conditions that persist following acute infections from pathogens such as SARS-CoV-2, Epstein–Barr virus, and Influenza virus. Despite growing global awareness of PAIS and the exponential increase in biomedical literature, only a small fraction of this literature pertains specifically to PAIS, making the identification of pathogen–disease associations within such a vast, heterogeneous, and unstructured corpus a significant challenge for researchers. This study evaluated the effectiveness of large language models (LLMs) in extracting these associations through a binary classification task using a curated dataset of 1000 manually labeled PubMed abstracts. We benchmarked a wide range of open-source LLMs of varying sizes (4B–70B parameters), including generalist, reasoning, and biomedical-specific models. We also investigated the extent to which prompting strategies such as zero-shot, few-shot, and Chain of Thought (CoT) methods can improve classification performance. Our results indicate that model performance varied by size, architecture, and prompting strategy. Zero-shot prompting produced the most reliable results: Mistral-Small-Instruct-2409 and Llama-3.1-Nemotron-70B-Instruct achieved balanced accuracy scores of 0.81 and 0.80, respectively, along with macro-F1 scores of up to 0.80, while maintaining minimal invalid outputs. While few-shot and CoT prompting often degraded performance in generalist models, reasoning models such as DeepSeek-R1-Distill-Llama-70B and QwQ-32B demonstrated improved accuracy and consistency when provided with additional context.
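
The evaluation described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline. The prompt wording, the `parse_label` and `evaluate` helpers, and the choice to count an invalid reply as a misclassification are all assumptions made for the example, because this summary does not state how the paper handled them.

```python
from typing import Optional, Sequence

# Hypothetical zero-shot prompt; the paper's actual wording is not given in this summary.
ZERO_SHOT_PROMPT = (
    "Does the following abstract report an association between a pathogen "
    "and a post-acute disease? Answer 'yes' or 'no'.\n\nAbstract: {abstract}"
)


def parse_label(raw: str) -> Optional[int]:
    """Map a raw model reply to 1 (association), 0 (none), or None (invalid)."""
    text = raw.strip().lower()
    if text.startswith("yes"):
        return 1
    if text.startswith("no"):
        return 0
    return None


def evaluate(y_true: Sequence[int], y_pred: Sequence[Optional[int]]) -> dict:
    """Balanced accuracy and macro-F1 for binary labels.

    Assumption: an invalid prediction (None) is scored as the wrong class.
    """
    tp = fn = tn = fp = invalid = 0
    for truth, pred in zip(y_true, y_pred):
        if pred is None:
            invalid += 1
        if truth == 1:
            if pred == 1:
                tp += 1
            else:
                fn += 1
        else:
            if pred == 0:
                tn += 1
            else:
                fp += 1

    def ratio(num: float, den: float) -> float:
        # Guard against empty classes instead of dividing by zero.
        return num / den if den else 0.0

    recall_pos = ratio(tp, tp + fn)
    recall_neg = ratio(tn, tn + fp)
    f1_pos = ratio(2 * tp, 2 * tp + fp + fn)
    f1_neg = ratio(2 * tn, 2 * tn + fn + fp)
    return {
        "balanced_accuracy": (recall_pos + recall_neg) / 2,
        "macro_f1": (f1_pos + f1_neg) / 2,
        "invalid_outputs": invalid,
    }


if __name__ == "__main__":
    gold = [1, 1, 1, 0, 0, 0, 0, 1]
    replies = ["Yes", "no", "yes.", "No", "maybe", "no", "yes", "YES"]
    print(evaluate(gold, [parse_label(r) for r in replies]))
```

Reporting the invalid-output count alongside the scores matters because a model that often fails to answer in the requested format can look worse (or, if invalid replies are silently dropped, better) than its true classification ability.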

## Linked entities

- **Diseases:** SARS-CoV-2 (MONDO:0100096)

## Full-text entities

- **Diseases:** infectious diseases (MESH:D003141), COVID-19 (MESH:D000086382), infection (MESH:D007239), encephalitis (MESH:D004660), death (MESH:D003643), PAIS (MESH:D013313), diarrhea (MESH:D003967), leishmaniasis (MESH:D007896), Epstein-Barr virus (MESH:D020031)
- **Species:** Homo sapiens (human, species) [taxon 9606], human gammaherpesvirus 4 (Epstein Barr virus, no rank) [taxon 10376], Leishmania (subgenus) [taxon 38568], Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049], Rickettsia africae (species) [taxon 35788], Ebola virus [taxon 186536], Japanese encephalitis virus (no rank) [taxon 11072]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12963971/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12963971/full.md

## References

51 references — full list in the complete paper: https://tomesphere.com/paper/PMC12963971/full.md

---
Source: https://tomesphere.com/paper/PMC12963971