# Automated Multitier Tagging of Chinese Online Health Education Resources Using a Large Language Model: Development and Validation Study

**Authors:** Jialin Meng, Ruiming Dai, Xiaolan Huang, Yi Gu, Shixing Yan, Xiaoke Wang, Jingrong Gao, Tian-Tian Zhang

PMC · DOI: 10.2196/83219 · Journal of Medical Internet Research · 2025-12-17

## TL;DR

This study developed and validated an LLM-based system that automatically tags Chinese online health education resources. Its agreement with human annotators (Cohen κ=0.54) exceeded the agreement among nonexpert human annotators (Cohen κ=0.32).

## Contribution

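A hybrid pipeline for automated, scalable health content tagging that combines a Baichuan2-7B LLM fine-tuned with low-rank adaptation, a domain-specific named entity recognition model, and standardization against a vector database. Its agreement with human annotators exceeded the nonexpert human-human baseline.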

## Key findings

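- The AI-human agreement (Cohen κ=0.54, 95% CI 0.53-0.56) was higher than the baseline nonexpert human-human agreement (Cohen κ=0.32, 95% CI 0.29-0.35).
- In 15.9% (159/1000) of the test resources, the AI supplied additional tags that human annotators had not identified; expert adjudication found these tags correct and relevant with a precision of 90% (45/50).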
- The system provides a scalable blueprint for precision health communication using AI-enhanced metadata.

## Abstract

Precision health promotion, which aims to tailor health messages to individual needs, is hampered by the lack of structured metadata in vast digital health resource libraries. This bottleneck prevents scalable, personalized content delivery and exacerbates information overload for the public.

This study aimed to develop, deploy, and validate an automated tagging system using a large language model (LLM) to create the foundational metadata infrastructure required for tailored health communication at scale.

We developed a comprehensive, 3-tier health promotion taxonomy (10 primary, 34 secondary, and 90,562 tertiary tags) using a hybrid Delphi and corpus-mining methodology. We then constructed a hybrid inference pipeline by fine-tuning a Baichuan2-7B LLM with low-rank adaptation for initial tag generation. This was then refined by a domain-specific named entity recognition model and standardized against a vector database. The system’s performance was evaluated against manual annotations from nonexpert staff on a test set of 1000 resources. We used a “no gold standard” framework, comparing the artificial intelligence–human (A-H) interrater reliability (IRR) with a supplemental human-human (H-H) IRR baseline and expert adjudication for cases where artificial intelligence provided additional tags (“AI Additive”).
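The two agreement measures named above can be illustrated with a minimal sketch. The abstract does not say how Cohen κ was adapted to multilabel tagging, so this example assumes one common convention: every (resource, tag) pair is treated as a binary decision. The function names, the toy vocabulary, and the handling of empty tag sets are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set


def mean_jaccard(a: Dict[str, Set[str]], b: Dict[str, Set[str]]) -> float:
    """Average per-resource Jaccard similarity between two annotators' tag sets."""
    scores = []
    for rid in a:
        union = a[rid] | b[rid]
        # Assumption: two empty tag sets count as perfect agreement.
        scores.append(len(a[rid] & b[rid]) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def cohen_kappa(a: Dict[str, Set[str]], b: Dict[str, Set[str]],
                vocabulary: Set[str]) -> float:
    """Cohen's kappa over binary (resource, tag) decisions (assumed convention)."""
    n = both = only_a = only_b = 0
    for rid in a:
        for tag in vocabulary:
            in_a, in_b = tag in a[rid], tag in b[rid]
            n += 1
            both += int(in_a and in_b)
            only_a += int(in_a and not in_b)
            only_b += int(in_b and not in_a)
    neither = n - both - only_a - only_b
    p_observed = (both + neither) / n
    rate_a = (both + only_a) / n          # how often annotator A assigns a tag
    rate_b = (both + only_b) / n          # how often annotator B assigns a tag
    p_expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_expected == 1.0:                 # degenerate case: no variation at all
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy data, purely illustrative.
vocab = {"diet", "exercise", "sleep", "smoking"}
ai_tags = {"r1": {"diet", "exercise"}, "r2": {"sleep"}}
human_tags = {"r1": {"diet"}, "r2": {"sleep", "smoking"}}
print(round(mean_jaccard(ai_tags, human_tags), 3))
print(round(cohen_kappa(ai_tags, human_tags, vocab), 3))
```

With a very large tag vocabulary, most (resource, tag) pairs are joint absences, which inflates observed agreement; this is one reason a set-overlap measure such as Jaccard is a useful companion to κ.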

The A-H agreement was moderate (Cohen κ=0.54, 95% CI 0.53-0.56; Jaccard similarity coefficient=0.48, 95% CI 0.46-0.50). Critically, this was higher than the baseline nonexpert H-H agreement (Cohen κ=0.32, 95% CI 0.29-0.35; Jaccard similarity coefficient=0.35, 95% CI 0.27-0.43). A granular analysis of disagreements revealed that in 15.9% (159/1000) of the cases, the artificial intelligence supplied additional tags (“AI Additive”) that human annotators had not identified. Expert adjudication of these cases confirmed that the “AI Additive” tags were correct and relevant with a precision of 90% (45/50; 95% CI 78.2%-96.7%).
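The interval reported for the adjudicated precision can be checked with a short calculation. The abstract does not name the interval method; the sketch below assumes an exact (Clopper-Pearson) binomial interval, which appears consistent with the reported 78.2%-96.7% for 45 correct out of 50. The helper names are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root_of_decreasing(f, lo: float = 0.0, hi: float = 1.0, iters: int = 60) -> float:
    """Bisection for a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple:
    """Exact two-sided binomial confidence interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k | p) equals alpha/2.
    lower = 0.0 if k == 0 else _root_of_decreasing(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper bound: the p at which P(X <= k | p) equals alpha/2.
    upper = 1.0 if k == n else _root_of_decreasing(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


low, high = clopper_pearson(45, 50)
print(f"precision = {45 / 50:.0%}, 95% CI {low:.1%}-{high:.1%}")
```

The exact interval is wider than a normal-approximation interval would be at this sample size, which is the usual reason to prefer it when only 50 items are adjudicated.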

A fine-tuned LLM, integrated into a hybrid pipeline, can function as a powerful augmentation tool for health content annotation. The system’s consistency (A-H κ=0.54) was found to be superior to the baseline human workflow (H-H κ=0.32). By moving beyond simple automation to reliably identify relevant health topics missed by manual annotators with high, expert-validated accuracy, this study provides a robust technical and methodological blueprint for implementing artificial intelligence to enhance precision health communication in public health settings.

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12756663/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12756663/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12756663/full.md

---
Source: https://tomesphere.com/paper/PMC12756663