Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study

Mohammed Rakibul Hasan

arXiv:2603.20514·cs.CL·March 24, 2026

TL;DR

This study assesses how reliably four large language models (GPT-4, Gemini Pro, Llama 3, and Mistral-7B) provide accurate health crisis information in resource-limited settings such as Bangladesh, using a hybrid multi-metric evaluation approach.

Contribution

The paper introduces an evaluation framework for LLMs on health crisis knowledge in low-resource contexts, combining semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI) to highlight model strengths and limitations.

Findings

01

LLMs show promise in understanding epidemiological history.

02

Models exhibit limitations in accuracy and consistency.

03

Evaluation framework effectively identifies model strengths and weaknesses.

Abstract

Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question-answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.
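
To make the semantic-similarity step concrete, the following is a minimal sketch of scoring model answers against a reference answer. It is not the authors' pipeline: the abstract does not say which similarity measure or embedding model was used, so this sketch assumes a plain bag-of-words cosine similarity as a stand-in. The function names (`tokenize`, `cosine_similarity`, `score_answers`), the model labels, and the sample answers are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between token-count vectors of two strings.

    Assumption: token counts stand in for the sentence embeddings a real
    semantic-similarity evaluation would normally use.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty answer has no overlap with anything
    return dot / (norm_a * norm_b)


def score_answers(reference: str, answers: Dict[str, str]) -> Dict[str, float]:
    """Score each model's answer against the reference answer."""
    return {model: cosine_similarity(reference, ans) for model, ans in answers.items()}


# Hypothetical reference answer and model outputs, for illustration only.
reference = "Dengue is transmitted by Aedes mosquitoes"
answers = {
    "model_a": "Aedes mosquitoes transmit dengue",
    "model_b": "Dengue spreads through contaminated water",
}
scores = score_answers(reference, answers)
for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.3f}")
```

Lexical overlap of this kind misses paraphrases and cannot detect contradictions, which is one reason an evaluation like the one described would pair similarity scores with NLI and expert review.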

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · COVID-19 epidemiological studies