Evaluating Large Language Models for Public Health Classification and Extraction Tasks
Joshua Harris, Timothy Laurence, Leo Loman, Fan Grayson, Toby Nonnenmacher, Harry Long, Loes WalsGriffith, Amy Douglas, Holly Fountain, Stelios Georgiou, Jo Hardstaff, Kathryn Hopkins, Y-Ling Chi, Galena Kuyumdzhieva, Lesley Larkin, Samuel Collins, Hamish Mohammed

TL;DR
This study evaluates the performance of various large language models on public health classification and extraction tasks, highlighting their potential to assist experts despite some challenges on complex tasks.
Contribution
The paper provides a comprehensive evaluation of open-weight LLMs and GPT-4 models on public health tasks, revealing their strengths and limitations in this domain.
Findings
Llama-3.3-70B-Instruct performs best among open-weight LLMs.
Significant variation in model performance across different tasks.
LLMs achieve over 80% micro-F1 on some tasks, indicating practical utility.
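For readers unfamiliar with the metric behind these findings, the following is a minimal sketch of how micro-F1 is commonly computed: true positives, false positives, and false negatives are pooled across all documents and labels before a single F1 score is taken. The function name `micro_f1` and the toy labels are illustrative assumptions, not the paper's evaluation code.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over per-document label sets.

    gold, predicted: equal-length lists of sets of labels.
    Counts are pooled over all documents before computing F1.
    """
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - gold_labels)   # predicted but not in gold
        fn += len(gold_labels - pred_labels)   # in gold but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with hypothetical labels (not data from the paper).
gold = [{"flu"}, {"covid", "flu"}, {"none"}]
pred = [{"flu"}, {"covid"}, {"flu"}]
score = micro_f1(gold, pred)  # pooled counts: tp=2, fp=1, fn=2
```

Because counts are pooled, frequent labels weigh more heavily than rare ones; for single-label tasks where every document receives exactly one prediction, micro-F1 coincides with accuracy.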
Abstract
Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to: health burden, epidemiological risk factors, and public health interventions. We evaluate eleven open-weight LLMs (7-123 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3.3-70B-Instruct is the highest performing model, achieving the best results on 8/16 tasks (using micro-F1 scores). We see significant variation across tasks with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks,…
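To make "zero-shot in-context learning" concrete, here is a rough sketch of how a classification prompt might be built with no worked examples and how a free-text reply might be mapped back to a label. The prompt wording, the helper names `build_zero_shot_prompt` and `parse_label`, and the category names are all assumptions for illustration; the paper's actual prompts are not reproduced here, and no model is called.

```python
def build_zero_shot_prompt(text, labels):
    """Build a classification prompt containing only instructions
    and the candidate categories (no labelled examples)."""
    options = "\n".join(f"- {label}" for label in labels)
    return (
        "Classify the text into exactly one of the categories below.\n"
        f"Categories:\n{options}\n\n"
        f"Text: {text}\n"
        "Answer with the category name only."
    )


def parse_label(response, labels):
    """Return the first candidate label found in the model's reply,
    or None if no label is mentioned. Matching is a simple
    case-insensitive substring check, so overlapping label names
    would need stricter handling."""
    cleaned = response.strip().lower()
    for label in labels:
        if label.lower() in cleaned:
            return label
    return None


# Hypothetical categories and text, chosen only for illustration.
labels = ["health burden", "risk factor", "intervention"]
prompt = build_zero_shot_prompt("Vaccination clinics opened in March.", labels)
```

In practice the reply passed to `parse_label` would come from whichever model is being evaluated, and replies that match no label are usually scored as incorrect.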
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout
