Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment
Huy Nghiem, Swetasudha Panda, Devashish Khatwani, Huy V. Nguyen, Krishnaram Kenthapadi, Hal Daumé III

TL;DR
This paper introduces an iterative alignment framework that uses KTO and DPO to improve the safety and helpfulness of healthcare AI assistants, reporting up to 42% improvement in safety-related metrics for harmful query detection and analyzing architecture-dependent calibration biases.
Contribution
The paper presents an iterative post-deployment alignment method for healthcare LLMs that applies KTO and DPO, and evaluates it on four LLMs with the CARES-18K benchmark, covering both safety and calibration biases.
Findings
Up to 42% improvement in safety metrics for harmful query detection
Identified architecture-dependent calibration biases affecting safety
Ablation studies on self-evaluation reliability and judge types
Abstract
Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman-Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when…
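As a rough illustration of the preference objective named in the abstract, the standard DPO loss for a single (chosen, rejected) response pair can be sketched as below. This is the generic published formulation, not the authors' training code. The function names, the `beta` value, and the example log-probabilities are assumptions made for illustration. KTO, which learns from unpaired desirable/undesirable labels rather than pairs, is not covered here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; the source does not state one
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen (e.g. safe) and
    rejected (e.g. unsafe) responses under the policy being trained and
    under a frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


# Hypothetical numbers: the policy already prefers the chosen response
# slightly more than the reference does, so the loss is below log(2).
example = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

When the policy and the reference agree on both responses, the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.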
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)
