Evaluating Large Language Models for Public Health Classification and Extraction Tasks

Joshua Harris; Timothy Laurence; Leo Loman; Fan Grayson; Toby Nonnenmacher; Harry Long; Loes WalsGriffith; Amy Douglas; Holly Fountain; Stelios Georgiou; Jo Hardstaff; Kathryn Hopkins; Y-Ling Chi; Galena Kuyumdzhieva; Lesley Larkin; Samuel Collins; Hamish Mohammed; Thomas Finnie; Luke Hounsome; Michael Borowitz; Steven Riley

arXiv:2405.14766 · cs.CL · February 20, 2025 · 5 cites

Open Access

TL;DR

This study evaluates eleven open-weight large language models on public health classification and extraction tasks, highlighting their potential to assist experts despite weaker results on some challenging tasks.

Contribution

The paper provides a comprehensive evaluation of open-weight LLMs and GPT-4 models on public health tasks, revealing their strengths and limitations in this domain.

Findings

01

Llama-3.3-70B-Instruct performs best among open-weight LLMs.

02

Model performance varies significantly across tasks, with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks.

03

LLMs achieve over 80% micro-F1 on some tasks, indicating practical utility.
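
The findings above are reported as micro-F1. As a rough illustration of what that metric measures, here is a minimal sketch that pools true positives, false positives, and false negatives over all documents before computing F1. The function name `micro_f1` and the toy labels are assumptions for illustration only; this is not the paper's evaluation code.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 for multi-label predictions.

    gold, pred: equal-length lists of sets of labels, one set per document.
    Counts are pooled across all documents and labels before F1 is computed.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in the gold set
        fn += len(g - p)   # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example with hypothetical labels (not taken from the paper's datasets).
gold = [{"a"}, {"b", "c"}, {"a"}]
pred = [{"a"}, {"b"}, {"c"}]
score = micro_f1(gold, pred)  # tp=2, fp=1, fn=2 -> 4/7
```

Because counts are pooled, frequent labels weigh more heavily than rare ones. For single-label classification, where every document gets exactly one prediction, micro-F1 reduces to accuracy.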

Abstract

Advances in Large Language Models (LLMs) have led to significant interest in their potential to support human experts across a range of domains, including public health. In this work we present automated evaluations of LLMs for public health tasks involving the classification and extraction of free text. We combine six externally annotated datasets with seven new internally annotated datasets to evaluate LLMs for processing text related to: health burden, epidemiological risk factors, and public health interventions. We evaluate eleven open-weight LLMs (7-123 billion parameters) across all tasks using zero-shot in-context learning. We find that Llama-3.3-70B-Instruct is the highest performing model, achieving the best results on 8/16 tasks (using micro-F1 scores). We see significant variation across tasks with all open-weight LLMs scoring below 60% micro-F1 on some challenging tasks,…
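
The abstract states that models are evaluated with zero-shot in-context learning, meaning the prompt carries a task description and the input text but no worked examples. The sketch below shows one way such a prompt could be assembled and a reply mapped back to a label. The prompt wording, the helper names `build_zero_shot_prompt` and `parse_label`, and the example labels are all hypothetical; the listing does not give the paper's actual prompts or parsing rules.

```python
def build_zero_shot_prompt(task, labels, text):
    """Assemble a zero-shot classification prompt (no demonstrations)."""
    label_list = ", ".join(labels)
    return (
        f"{task}\n"
        f"Choose exactly one label from: {label_list}.\n\n"
        f"Text: {text}\n"
        f"Label:"
    )


def parse_label(response, labels):
    """Return the allowed label that appears earliest in the response, else None.

    Uses crude case-insensitive substring matching; a real evaluation
    would likely need stricter parsing.
    """
    lowered = response.lower()
    hits = [(lowered.find(l.lower()), l) for l in labels if l.lower() in lowered]
    return min(hits)[1] if hits else None


# Hypothetical task and labels, for illustration only.
labels = ["yes", "no"]
prompt = build_zero_shot_prompt(
    "Does the text mention an infectious disease outbreak?",
    labels,
    "Several cases of measles were reported at a local school.",
)
label = parse_label("Yes, it describes an outbreak.", labels)
```

The model call itself is omitted: any open-weight or hosted model could be queried with `prompt`, and its reply passed to `parse_label`.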

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Dropout