The Alignment Paradox of Medical Large Language Models in Infertility Care: Decoupling Algorithmic Improvement from Clinical Decision-making Quality

Dou Liu; Ying Long; Sophia Zuoqiu; Kaipeng Xie; Runze Yang; Di Liu; Kang Li; Yiting Lin; Hanyi Liu; Rong Yin; Tian Tang

arXiv:2511.18084 · cs.LG · November 25, 2025

TL;DR

This study evaluates four alignment strategies for medical large language models in infertility care and finds that higher algorithmic accuracy does not always translate into greater clinician trust or interpretability, a gap the authors call the alignment paradox.

Contribution

The paper systematically compares four alignment methods (SFT, DPO, GRPO, and ICL) and shows that reinforcement-based optimization improves accuracy but may reduce clinical trust and interpretability.

Findings

01. GRPO achieves the highest algorithmic accuracy

02. Clinicians prefer the SFT model for interpretability and therapeutic feasibility

03. Algorithmic improvements do not always increase clinical trust

Abstract

Large language models (LLMs) are increasingly adopted in clinical decision support, yet aligning them with the multifaceted reasoning pathways of real-world medicine remains a major challenge. Using more than 8,000 infertility treatment records, we systematically evaluate four alignment strategies, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Group Relative Policy Optimization (GRPO), and In-Context Learning (ICL), through a dual-layer framework that combines automatic benchmarks with blinded doctor-in-the-loop assessments. GRPO achieves the highest algorithmic accuracy across multiple decision layers, confirming the value of reinforcement-based optimization for structured prediction tasks. However, clinicians consistently prefer the SFT model, citing clearer reasoning processes (p = 0.035) and higher therapeutic feasibility (p = 0.019). In blinded pairwise…
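For readers unfamiliar with GRPO, the "group relative" part of the name refers to scoring each sampled response against the other responses drawn for the same prompt, rather than against a learned value function. The sketch below illustrates only that normalization step. The function name, the `eps` constant, and the reward values are hypothetical; the excerpt above does not describe the paper's reward design, so this is not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per response sampled for a single
    prompt. This sketch uses the population standard deviation; real
    implementations may use the sample standard deviation instead.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four treatment plans sampled for one record.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and are reinforced, while those below the mean receive a negative one. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.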

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Genomics and Rare Diseases