Balancing Safety and Helpfulness in Healthcare AI Assistants through Iterative Preference Alignment

Huy Nghiem; Swetasudha Panda; Devashish Khatwani; Huy V. Nguyen; Krishnaram Kenthapadi; Hal Daumé III

arXiv:2512.04210·cs.AI·December 5, 2025

Open Access

TL;DR

This paper introduces an iterative alignment framework that uses KTO and DPO to improve the safety and helpfulness of healthcare AI assistants, reporting up to 42% improvement in safety-related metrics for harmful query detection and analyzing architecture-dependent calibration biases.

Contribution

The paper presents an iterative post-deployment alignment method for healthcare LLMs that combines KTO and DPO, and evaluates it on safety metrics and calibration biases.

Findings

01 Up to 42% improvement in safety metrics for harmful query detection

02 Identified architecture-dependent calibration biases affecting safety

03 Ablation studies on self-evaluation reliability and judge types

Abstract

Large Language Models (LLMs) are increasingly used in healthcare, yet ensuring their safety and trustworthiness remains a barrier to deployment. Conversational medical assistants must avoid unsafe compliance without over-refusing benign queries. We present an iterative post-deployment alignment framework that applies Kahneman-Tversky Optimization (KTO) and Direct Preference Optimization (DPO) to refine models against domain-specific safety signals. Using the CARES-18K benchmark for adversarial robustness, we evaluate four LLMs (Llama-3B/8B, Meditron-8B, Mistral-7B) across multiple cycles. Our results show up to 42% improvement in safety-related metrics for harmful query detection, alongside interesting trade-offs against erroneous refusals, thereby exposing architecture-dependent calibration biases. We also perform ablation studies to identify when self-evaluation is reliable and when…
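For readers unfamiliar with DPO, the following is a minimal sketch of the standard per-example DPO loss as it is commonly defined in the preference-optimization literature. It is not taken from this paper: the function names, the `beta` value, and the example log-probabilities are all illustrative assumptions, and the paper's actual training setup (including its KTO stage) may differ.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; a hyperparameter in practice
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # The loss shrinks as the chosen response gains margin over the rejected one.
    return -log_sigmoid(chosen_reward - rejected_reward)


# Hypothetical example: a safe refusal (chosen) vs. unsafe compliance (rejected)
# for a harmful medical query. The log-probabilities are made up.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)  # policy equals reference
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)  # policy now prefers the chosen response
```

When the policy and reference assign identical log-probabilities, the loss equals log 2 (about 0.693); it falls below that value as the policy shifts probability toward the chosen response and rises above it when the policy favors the rejected one. In a safety setting, the same objective can also push a model toward over-refusal if benign queries are underrepresented among the preference pairs, which is the kind of trade-off the abstract describes.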

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)