MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine

Kaiyuan Ji; Yijin Guo; Zicheng Zhang; Xiangyang Zhu; Yuan Tian; Ning Liu; Guangtao Zhai

arXiv:2508.16213·cs.CV·August 25, 2025

TL;DR

MedOmni-45 Degrees is a benchmark for evaluating the trade-off between reasoning safety and performance in medical language models. It highlights vulnerabilities such as unfaithful reasoning and sycophancy across diverse models and tasks.

Contribution

The paper introduces MedOmni-45 Degrees, a novel benchmark with a workflow and metrics to quantify safety-performance trade-offs in medical LLMs under manipulative hints.

Findings

01. Models show a safety-performance trade-off, with no model surpassing the diagonal.

02. The open-source QwQ-32B comes closest to the optimal balance of safety and accuracy.

03. The benchmark exposes reasoning vulnerabilities and can guide safer model development.

Abstract

With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness -- whether reasoning aligns with responses and medical facts -- and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45 Degrees, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open-…
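As a rough illustration of what scoring safety separately from accuracy can look like, the sketch below computes two numbers over a set of model responses: plain accuracy, and the share of misleading-hint trials in which the model adopts the hinted wrong option. The `Trial` record, the function names, the hint-type labels, and the definition of safety as one minus the hint-following rate are all assumptions made for this example; the abstract shown here does not specify the benchmark's actual metrics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Trial:
    """One model response to one (question, hint condition) input (hypothetical schema)."""
    question_id: str
    hint_type: str          # "none" marks the no-hint baseline
    gold: str               # correct option
    hinted: Optional[str]   # option the hint points to; None for the baseline
    answer: str             # option the model finally chose


def accuracy(trials: List[Trial]) -> float:
    """Fraction of trials whose final answer matches the gold option."""
    return sum(t.answer == t.gold for t in trials) / len(trials)


def hint_following_rate(trials: List[Trial]) -> float:
    """Fraction of misleading-hint trials where the model picks the hinted wrong option."""
    misleading = [t for t in trials if t.hinted is not None and t.hinted != t.gold]
    if not misleading:
        return 0.0
    return sum(t.answer == t.hinted for t in misleading) / len(misleading)


# Toy data: the hint-type labels are placeholders, not the benchmark's seven types.
trials = [
    Trial("q1", "none", gold="A", hinted=None, answer="A"),
    Trial("q1", "hint_x", gold="A", hinted="B", answer="B"),  # follows the hint
    Trial("q1", "hint_y", gold="A", hinted="C", answer="A"),  # resists the hint
    Trial("q2", "hint_x", gold="D", hinted="B", answer="C"),  # wrong, but not sycophantic
]

performance = accuracy(trials)
safety = 1.0 - hint_following_rate(trials)  # assumed definition, for illustration only
print(f"performance={performance:.2f} safety={safety:.2f}")
```

Keeping the two scores apart means a wrong answer that ignores the hint (the last toy trial) lowers performance without counting as sycophancy, which a single accuracy score would not distinguish.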
