Evaluating Large Language Models on Historical Health Crisis Knowledge in Resource-Limited Settings: A Hybrid Multi-Metric Study
Mohammed Rakibul Hasan

TL;DR
This study assesses how reliably several large language models provide accurate health crisis information in resource-limited settings such as Bangladesh, using a hybrid multi-metric evaluation approach.
Contribution
It introduces a comprehensive evaluation framework for LLMs on health crisis knowledge in low-resource contexts, highlighting their strengths and limitations.
Findings
LLMs show promise in understanding epidemiological history.
Models exhibit limitations in accuracy and consistency.
The evaluation framework effectively identifies model strengths and weaknesses.
Abstract
Large Language Models (LLMs) offer significant potential for delivering health information. However, their reliability in low-resource contexts remains uncertain. This study evaluates GPT-4, Gemini Pro, Llama 3, and Mistral-7B on health crisis-related enquiries concerning COVID-19, dengue, the Nipah virus, and Chikungunya in the low-resource context of Bangladesh. We constructed a question-answer dataset from authoritative sources and assessed model outputs through semantic similarity, expert-model cross-evaluation, and Natural Language Inference (NLI). Findings highlight both the strengths and limitations of LLMs in representing epidemiological history and health crisis knowledge, underscoring their promise and risks for informing policy in resource-constrained environments.
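As a rough illustration of the semantic-similarity component mentioned in the abstract, the sketch below scores each model's answer against a reference answer. It is not the authors' pipeline: the abstract does not say which similarity measure or embedding model was used, so this example substitutes a plain bag-of-words cosine similarity, and the function names (`tokenize`, `cosine_similarity`, `score_answers`) and the placeholder strings are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep alphanumeric runs; a crude stand-in for a real tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    # Cosine similarity between bag-of-words count vectors (an assumption;
    # an embedding model would normally replace these count vectors).
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    # Return 0.0 when either text has no tokens, to avoid division by zero.
    return dot / norm if norm else 0.0


def score_answers(reference: str, model_answers: dict[str, str]) -> dict[str, float]:
    # Score every model's answer against the same reference answer.
    return {name: cosine_similarity(reference, answer)
            for name, answer in model_answers.items()}


if __name__ == "__main__":
    # Placeholder strings only; no real dataset entries or model outputs.
    reference = "placeholder reference answer text"
    answers = {
        "model_a": "placeholder reference answer text",
        "model_b": "unrelated words entirely",
    }
    for name, score in score_answers(reference, answers).items():
        print(f"{name}: {score:.3f}")
```

Word-overlap scores like these miss paraphrases, which is one reason a multi-metric setup would pair a similarity score with NLI and expert judgment rather than rely on it alone.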
Taxonomy
Topics: Computational and Text Analysis Methods · Topic Modeling · COVID-19 epidemiological studies
