# Efficient information extraction using LLMs and knowledge distillation: A study on HPV health communication

**Authors:** Saadat Hasan Khan, Kevin Lybarger, MinJae Woo, Dhiya Al-Jumeily OBE

PMC · DOI: 10.1371/journal.pdig.0001275 · 2026-03-10

## TL;DR

This study uses knowledge distillation to train a compact RoBERTa Large model, then applies it to evaluate how 48 state health department websites communicate HPV and vaccination information, aiming to improve public health messaging.

## Contribution

A computationally efficient framework using knowledge distillation to evaluate health communication on HPV from state websites.

## Key findings

- A fine-tuned RoBERTa Large model achieved an F1 score of 0.74 on the test set, approaching the teacher model's F1 of 0.77.
- The framework was applied to evaluate content from 48 state health department websites.
- The method can identify gaps in health communication and improve messaging for public health.
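To make the reported metric concrete, here is a minimal sketch of how an F1 score is computed from gold labels and model predictions. The label lists are invented toy data, not the paper's test set, and the sketch assumes a binary task with a single positive class; the summary does not state which F1 variant (binary, micro, or macro) the authors report.

```python
def f1_score(gold, pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels only (assumption): 1 = passage contains the target information.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
teacher = [1, 1, 1, 0, 1, 0, 0, 0]  # hypothetical teacher predictions
student = [1, 1, 0, 0, 1, 0, 0, 0]  # hypothetical student predictions

teacher_f1 = f1_score(gold, teacher)
student_f1 = f1_score(gold, student)
print(f"teacher F1 = {teacher_f1:.2f}, student F1 = {student_f1:.2f}")
```

Comparing the two numbers on the same held-out set is what lets a statement such as "the student approaches the teacher" be checked directly.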

## Abstract

State Department of Health (DOH) websites serve as authoritative sources of HPV-related health communications, presenting state-specific content that influences public awareness and vaccination decisions. We develop a computationally efficient framework to systematically evaluate these information repositories based on their content quality, completeness, and motivational impact on vaccination behavior. We introduce a dataset consolidating HPV and HPV vaccination content from 48 DOH websites. Using an annotated dataset (n = 400), efficient prompting techniques, and a Knowledge Distillation framework, we develop and evaluate efficient student models based on the Llama family of Large Language Models (LLMs) and the RoBERTa Large encoder architecture. We then deploy the best-performing student model for a computationally feasible evaluation of the content of DOH websites. We show that the fine-tuned RoBERTa Large model achieves an F1 score of 0.74 on the test set, outperforming all other student models and approaching the teacher model's performance (F1 = 0.77). The fine-tuned RoBERTa Large model is subsequently applied to data from various state DOH websites to evaluate the information presented. We also discuss the broader implications, limitations, and ethical and legal considerations of the proposed approach.

We studied how state health department websites communicate information about human papillomavirus (HPV) and vaccination. These websites are important because they shape public understanding and influence people’s decisions about getting vaccinated. We collected information from 48 state websites and looked at how complete, clear, and persuasive their content was. To do this, we created a smaller, efficient computer model that could quickly read and evaluate large amounts of website text. We trained this model using a teaching approach where a larger, more powerful model showed it how to make good decisions. Our smaller model was almost as accurate as the larger one, but much faster and easier to use. We then used it to review each state’s website. This approach can help public health organizations identify gaps in the information they provide and improve how they communicate important health messages. While we focused on HPV, the same method could be used to study how other health topics are presented online.
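The teaching approach described above can be sketched as hard-label distillation: a teacher labels unlabeled text, and a smaller student is trained on those labels. This is a toy illustration under loud assumptions. The paper's students are Llama-family and RoBERTa Large models, and the summary does not say how teacher outputs are transferred; here a keyword rule (`teacher_label`) stands in for the teacher, a tiny naive Bayes classifier (`StudentNB`) stands in for the student, and all sentences are invented.

```python
import math
from collections import Counter


def teacher_label(text):
    """Hypothetical stand-in for the large teacher model: a keyword rule."""
    return 1 if "vaccin" in text.lower() else 0


class StudentNB:
    """Tiny multinomial naive Bayes student (a stand-in, not RoBERTa)."""

    def fit(self, texts, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        scores = {}
        for c in (0, 1):
            n_words = sum(self.word_counts[c].values())
            # Log prior plus add-one smoothed log likelihood of each token.
            score = math.log((self.class_counts[c] + 1) / (total + 2))
            for w in text.lower().split():
                count = self.word_counts[c][w]
                score += math.log((count + 1) / (n_words + len(self.vocab) + 1))
            scores[c] = score
        return max(scores, key=scores.get)


# Invented, unlabeled website-style sentences (assumption, not paper data).
unlabeled = [
    "The HPV vaccine is recommended at ages 11 to 12",
    "Vaccination prevents most HPV related cancers",
    "Ask your provider about the vaccine schedule",
    "HPV is a common virus spread by skin contact",
    "Most HPV infections clear without treatment",
    "Contact the clinic for office hours",
]

# Step 1: the teacher produces pseudo-labels for the unlabeled text.
pseudo_labels = [teacher_label(t) for t in unlabeled]

# Step 2: the student is trained only on the teacher's labels.
student = StudentNB().fit(unlabeled, pseudo_labels)

# Step 3: the cheap student is used on new text in place of the teacher.
print(student.predict("Get the vaccine today"))
```

The design point is that the expensive model runs once to create training labels, after which only the inexpensive student is needed to process every page.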

## Full-text entities

- **Genes:** MAP3K8 (mitogen-activated protein kinase kinase kinase 8) [NCBI Gene 1326] {aka AURA2, COT, EST, ESTF, MEKK8, TPL2}
- **Diseases:** influenza (MESH:D007251), Cervical Cancer (MESH:D002583), diabetes (MESH:D003920), Cancer (MESH:D009369), LLMs (MESH:D007806), NCDs (MESH:D000073296), Diseases and Infections (MESH:D007239), COVID-19 (MESH:D000086382), chronic pain (MESH:D059350), chronic disease (MESH:D002908), communicable diseases (MESH:D003141)
- **Species:** Human papillomavirus (species) [taxon 10566], Lama glama (llama, species) [taxon 9844], Homo sapiens (human, species) [taxon 9606]

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12974803/full.md

---
Source: https://tomesphere.com/paper/PMC12974803