Divergent Realities: A Comparative Analysis of Human Expert vs. Artificial Intelligence Based Generation and Evaluation of Treatment Plans in Dermatology
Dipayan Sengupta, Saumya Panda

TL;DR
This study finds a significant evaluator effect in the assessment of dermatology treatment plans: human experts favor plans written by their peers, whereas an AI judge favors AI-generated plans, pointing to a gap between human and AI reasoning.
Contribution
The paper shows that human experts and an AI judge evaluate human-written and AI-generated treatment plans in opposite ways, and it emphasizes the need for explainable AI to bridge this reasoning gap in clinical decision-making.
Findings
Human experts scored peer plans higher than AI plans (mean 7.62 vs. 7.16; p=0.0313)
The AI judge preferred AI-generated plans over human plans
The reasoning model (o3) was rated highest by the AI judge but ranked 11th by human experts (mean 6.97)
Abstract
Background: Evaluating AI-generated treatment plans is a key challenge as AI expands beyond diagnostics, especially with new reasoning models. This study compares plans from human experts and two AI models (a generalist and a reasoner), assessed by both human peers and a superior AI judge. Methods: Ten dermatologists, a generalist AI (GPT-4o), and a reasoning AI (o3) generated treatment plans for five complex dermatology cases. The anonymized, normalized plans were scored in two phases: 1) by the ten human experts, and 2) by a superior AI judge (Gemini 2.5 Pro) using an identical rubric. Results: A profound 'evaluator effect' was observed. Human experts scored peer-generated plans significantly higher than AI plans (mean 7.62 vs. 7.16; p=0.0313), ranking GPT-4o 6th (mean 7.38) and the reasoning model, o3, 11th (mean 6.97). Conversely, the AI judge produced a complete inversion,…
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Cutaneous Melanoma Detection and Management
