Efficient information extraction using LLMs and knowledge distillation: A study on HPV health communication

Saadat Hasan Khan; Kevin Lybarger; MinJae Woo; Dhiya Al-Jumeily OBE

PMC · DOI: 10.1371/journal.pdig.0001275 · March 10, 2026

Open Access

TL;DR

This study distills large language models into smaller student models to evaluate how state Department of Health websites communicate HPV and HPV vaccination information, with the aim of improving public health messaging.

Contribution

A computationally efficient framework, built on knowledge distillation, for evaluating HPV health communication on state Department of Health websites.

Findings

01

A fine-tuned RoBERTa Large model achieved an F1 score of 0.74, nearly matching the teacher model's performance.

02

The framework was applied to evaluate content from 48 state health department websites.

03

The method can identify gaps in health communication and improve messaging for public health.
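The F1 score quoted in the findings is the harmonic mean of precision and recall. The sketch below shows that arithmetic for a single positive class. The summary does not say which averaging scheme (micro, macro, or per-label) the paper reports, so the function name `f1_score`, the binary setup, and the toy labels are illustrative assumptions rather than the authors' evaluation code.

```python
def f1_score(gold, pred, positive=1):
    """Binary F1 for one positive class (an illustrative assumption;
    the paper's exact averaging scheme is not stated in this summary)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration only.
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 0, 1, 1, 0, 1]
score = f1_score(gold, pred)
```

Here three of the four predicted positives are correct and three of the four gold positives are recovered, so precision and recall are both 0.75 and so is the F1.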

Abstract

State Department of Health (DOH) websites serve as authoritative sources of HPV-related health communications, presenting state-specific content that influences public awareness and vaccination decisions. We develop a computationally efficient framework to systematically evaluate these information repositories based on their content quality, completeness, and their motivational impact on vaccination behavior. We propose a dataset consolidating 48 different DOH websites’ data targeted towards HPV and HPV vaccination. By developing an annotated dataset (n = 400), efficient prompting techniques and a Knowledge Distillation framework, we develop and evaluate efficient student models based on the Llama family of Large Language Models (LLMs) and the RoBERTa Large encoder architecture. We finally deploy the best-performing student model for a computationally feasible evaluation of the content…
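To make the knowledge distillation idea concrete, here is a minimal sketch of the classic soft-target formulation, in which a student is trained to match the teacher's temperature-softened output distribution. The abstract does not say whether the students were trained on teacher probabilities or on teacher-generated labels, so this variant, the function names, the temperature value, and the toy logits are all assumptions for illustration, not the authors' method.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    This is the generic soft-target loss; it is an assumed stand-in,
    not the loss reported in the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    return kl * temperature ** 2


# Toy logits for a three-class problem, made up for illustration.
teacher = [2.0, 0.5, -1.0]
matching_student = [2.0, 0.5, -1.0]
diverging_student = [-1.0, 0.5, 2.0]

loss_same = distillation_loss(matching_student, teacher)
loss_diff = distillation_loss(diverging_student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice this term is often combined with a standard cross-entropy loss on the human-annotated labels.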

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Hate Speech and Cyberbullying Detection