SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data
Yiwei Yan, Hao Li, Hua He, Gong Kai, Zhengyi Yang, Guanfeng Liu

TL;DR
SALP-CG is an LLM-based pipeline that classifies and grades privacy risks in online conversational health data, with rules aligned to GB/T 39725-2020 to support health data governance.
Contribution
This work introduces SALP-CG, a backend-agnostic extraction pipeline that combines few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules to classify and grade the sensitivity of conversational health data with LLMs.
Findings
Achieves high schema compliance and accurate sensitivity grading.
Strong model performance, with a micro-F1 of 0.900 for maximum-level prediction.
Effectively stratifies data by sensitivity, aiding privacy risk management.
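The maximum-level metric in the findings can be illustrated with a minimal Python sketch. It assumes that each dialogue is reduced to the highest sensitivity level among its extracted items and that micro-F1 pools true positives, false positives, and false negatives over all levels. The item layout (a `level` field), the helper names, and the sample data are hypothetical and are not taken from the paper.

```python
def max_level(items):
    """Highest sensitivity level among a dialogue's extracted items (0 if none)."""
    return max((item["level"] for item in items), default=0)


def micro_f1(gold, pred):
    """Micro-averaged F1 over single-label predictions, pooling counts across labels."""
    tp = fp = fn = 0
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label:
                fp += 1  # predicted this level, but gold differs
            elif g == label:
                fn += 1  # gold is this level, but prediction differs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-dialogue items; levels are illustrative only.
gold_dialogues = [[{"level": 2}, {"level": 3}], [{"level": 5}], [{"level": 2}], [{"level": 4}]]
pred_dialogues = [[{"level": 3}], [{"level": 4}, {"level": 1}], [{"level": 2}], [{"level": 4}]]

gold = [max_level(d) for d in gold_dialogues]  # [3, 5, 2, 4]
pred = [max_level(d) for d in pred_dialogues]  # [3, 4, 2, 4]
score = micro_f1(gold, pred)
```

With exactly one predicted level per dialogue, this pooled micro-F1 coincides with plain accuracy, so a score of 0.900 would correspond to nine in ten dialogues receiving the correct maximum level under this assumption.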
Abstract
Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods for sensitivity classification of such data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We derived health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity grading across diverse LLMs. On the MedDialog-CN…
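As a rough illustration of how schema checking and deterministic high-risk rules could fit together after the LLM call, the Python sketch below validates a model's JSON output and then applies a rule that can only raise a level, never lower it. Everything specific here is an assumption rather than the paper's implementation: the item fields (`category`, `span`, `level`), the five-level scale, the ID-number rule, and its floor of level 4 are all hypothetical. A hand-written check stands in for true constrained decoding, which would enforce the schema during generation.

```python
import json
import re

LEVELS = range(1, 6)                       # assumed five-level sensitivity scale
REQUIRED_KEYS = {"category", "span", "level"}  # hypothetical item fields
ID_PATTERN = re.compile(r"\d{17}[\dXx]")   # 18-character resident ID number shape
ID_MIN_LEVEL = 4                           # hypothetical floor for spans containing an ID


def parse_items(raw):
    """Parse LLM output and reject anything that does not match the expected shape."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of items")
    for item in data:
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            raise ValueError(f"unexpected item shape: {item!r}")
        if type(item["level"]) is not int or item["level"] not in LEVELS:
            raise ValueError(f"level out of range: {item!r}")
    return data


def apply_high_risk_rules(items):
    """Deterministically raise the level of items whose span matches a high-risk pattern."""
    graded = []
    for item in items:
        level = item["level"]
        if ID_PATTERN.search(item["span"]):
            level = max(level, ID_MIN_LEVEL)  # rules only escalate, never downgrade
        graded.append({**item, "level": level})
    return graded


raw_output = json.dumps([
    {"category": "identity", "span": "ID 11010519900101123X", "level": 2},
    {"category": "symptom", "span": "cough for three days", "level": 2},
])
graded = apply_high_risk_rules(parse_items(raw_output))
```

Keeping the rule pass separate from the model call means the escalation behaves the same regardless of which LLM backend produced the items, which is one plausible reading of "backend-agnostic" in the abstract.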
Taxonomy
Topics: Mental Health via Writing · Data-Driven Disease Surveillance · Machine Learning in Healthcare
