Efficient information extraction using LLMs and knowledge distillation: A study on HPV health communication
Saadat Hasan Khan, Kevin Lybarger, MinJae Woo, Dhiya Al-Jumeily OBE

TL;DR
This study uses efficient AI models to evaluate how state health websites communicate HPV and vaccination information, aiming to improve public health messaging.
Contribution
A computationally efficient framework using knowledge distillation to evaluate health communication on HPV from state websites.
Findings
A fine-tuned RoBERTa Large model achieved an F1 score of 0.74, nearly matching the teacher model's performance.
The framework was applied to evaluate content from 48 state health department websites.
The method can identify gaps in health communication and help improve public health messaging.
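For readers unfamiliar with the metric, the F1 score reported above is the harmonic mean of precision and recall. The sketch below computes a binary F1 on made-up labels. The averaging scheme behind the paper's 0.74 (binary, micro, or macro) is not stated in this summary, so the function `f1_score` and the toy data are illustrative assumptions only.

```python
def f1_score(gold, pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for one class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration (not data from the study).
gold = [1, 1, 1, 0, 0, 1]
pred = [1, 0, 1, 1, 0, 1]
score = f1_score(gold, pred)  # 3 TP, 1 FP, 1 FN
```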
Abstract
State Department of Health (DOH) websites serve as authoritative sources of HPV-related health communication, presenting state-specific content that influences public awareness and vaccination decisions. We develop a computationally efficient framework to systematically evaluate these information repositories on content quality, completeness, and motivational impact on vaccination behavior. We introduce a dataset consolidating content on HPV and HPV vaccination from 48 DOH websites. Using an annotated dataset (n = 400), efficient prompting techniques, and a knowledge distillation framework, we train and evaluate efficient student models based on the Llama family of Large Language Models (LLMs) and the RoBERTa Large encoder architecture. Finally, we deploy the best-performing student model for a computationally feasible evaluation of the content…
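One common form of knowledge distillation for text classification is pseudo-labeling: a large teacher model labels unannotated text, and a smaller student is trained on those labels. The sketch below illustrates that general idea only. It is not the authors' pipeline: `teacher_label` is a hypothetical keyword rule standing in for a prompted LLM, `TinyStudent` is a toy bag-of-words classifier standing in for a fine-tuned encoder, and the label names and sentences are invented.

```python
from collections import Counter, defaultdict


def teacher_label(text):
    """Hypothetical teacher: stands in for prompting a large LLM."""
    return "vaccine" if "vaccin" in text.lower() else "other"


class TinyStudent:
    """Toy bag-of-words student; a placeholder for a fine-tuned encoder."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def fit(self, texts, labels):
        # Count how often each word appears under each teacher-assigned label.
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        # Score each label by summed counts of the words seen in training.
        words = text.lower().split()
        scores = {
            label: sum(counts[w] for w in words)
            for label, counts in self.word_counts.items()
        }
        return max(scores, key=scores.get)


# Invented, unannotated sentences (not taken from any DOH website).
unlabeled = [
    "The HPV vaccine is recommended at ages 11 to 12",
    "Vaccination protects against several cancers",
    "HPV is a common virus",
    "Most infections clear without treatment",
]

# Step 1: the teacher produces pseudo-labels.
pseudo_labels = [teacher_label(t) for t in unlabeled]

# Step 2: the cheaper student is trained on the teacher's labels.
student = TinyStudent()
student.fit(unlabeled, pseudo_labels)

# Step 3: the student alone labels new content.
prediction = student.predict("Ask your provider about the vaccine")
```

In a realistic setting the student would be evaluated against a human-annotated set, such as the n = 400 annotated dataset mentioned in the abstract, before being deployed on the full corpus.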
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Hate Speech and Cyberbullying Detection
