Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries
Liqi Zhou, Jiafu Li

TL;DR
This study evaluates the effectiveness of prompted large language models in classifying online patient inquiries into four triage categories, demonstrating potential support for clinical routing with few-shot learning.
Contribution
The paper evaluates LLMs on actionable triage of patient inquiries using a newly constructed dataset, and compares multiple models and prompting strategies.
Findings
The strongest LLM achieved a macro-F1 of 0.475, surpassing the supervised BioBERT baseline.
Few-shot prompting and model agreement improve classification reliability in certain categories.
LLMs can support triage prioritization but are not suitable for autonomous decision-making.
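For readers unfamiliar with the metric quoted above, macro-F1 is the unweighted mean of the per-class F1 scores, so each of the four triage categories counts equally regardless of how many examples it has. The following is a minimal sketch of that computation in plain Python. The function name `macro_f1` and the label strings are illustrative assumptions, not code or identifiers from the paper.

```python
# Illustrative label names based on the four categories named in the abstract;
# the exact strings used by the authors are an assumption.
LABELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]


def macro_f1(gold, pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores.

    A class with no true positives, false positives, or false negatives
    contributes an F1 of 0.0 here (a convention; other tools may differ).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(labels)
```

Because every class is weighted equally, a model that never predicts a rare category such as emergency-referral scores 0 on that class and loses a quarter of the achievable macro-F1, which is one reason the metric suits imbalanced triage data.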
Abstract
Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions.…
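The 0-shot, 4-shot, and 12-shot conditions described in the abstract can be pictured as drawing k labeled demonstrations from the few-shot pool and placing them ahead of the inquiry to be classified. The sketch below shows one way this might be done. The prompt wording, the `build_prompt` helper, the random sampling strategy, and the label strings are all assumptions for illustration; the paper's actual prompts and selection procedure are not given in this summary.

```python
import random

# Illustrative label names based on the four categories in the abstract.
LABELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]


def build_prompt(inquiry, pool, k, seed=0):
    """Assemble a k-shot classification prompt (hypothetical format).

    pool is a list of (inquiry_text, label) pairs, standing in for the
    40-example few-shot pool; k=0 yields a zero-shot prompt.
    """
    if k > len(pool):
        raise ValueError("k exceeds the size of the few-shot pool")

    # Seeded sampling without replacement keeps the demonstrations reproducible.
    rng = random.Random(seed)
    shots = rng.sample(pool, k) if k else []

    blocks = ["Classify the patient inquiry into one of: " + ", ".join(LABELS) + "."]
    for text, label in shots:
        blocks.append(f"Inquiry: {text}\nCategory: {label}")
    # The final block leaves the category blank for the model to complete.
    blocks.append(f"Inquiry: {inquiry}\nCategory:")
    return "\n\n".join(blocks)
```

Calling `build_prompt(text, pool, k)` with k set to 0, 4, or 12 would produce the three prompt variants; a class-balanced selection of demonstrations is another plausible design, and nothing here indicates which the authors used.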
