Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates

Liam G. McCoy; Fateme Nateghi Haredasht; Kanav Chopra; David Wu; David JH Wu; Abass Conteh; Sarita Khemani; Saloni Kumar Maharaj; Vishnu Ravi; Arth Pahwa; Yingjie Weng; Leah Rosengaus; Lena Giang; Kelvin Zhenghao Li; Olivia Jee; Daniel Shirvani; Ethan Goh; and Jonathan H. Chen

arXiv:2508.01159·cs.CL·November 13, 2025

Open Access

TL;DR

This paper benchmarks large language models' ability to generate structured clinical consultation templates, revealing strengths in comprehensiveness but challenges in prioritization and specialty-specific performance, underscoring the need for improved evaluation methods.

Contribution

It introduces a multi-agent pipeline for evaluating LLMs in clinical template generation and highlights their potential and limitations in real-world medical communication.

Findings

01

Models like o3 achieve high comprehensiveness (up to 92.2%)

02

Models often generate excessively long templates and struggle with prioritization

03

Performance varies across medical specialties, with lower accuracy in psychiatry and pain medicine

Abstract

This study evaluates the capacity of large language models (LLMs) to generate structured clinical consultation templates for electronic consultation. Using 145 expert-crafted templates developed and routinely used by Stanford's eConsult team, we assess frontier models (including o3, GPT-4o, Kimi K2, Claude 4 Sonnet, Llama 3 70B, and Gemini 2.5 Pro) for their ability to produce clinically coherent, concise, and prioritized clinical question schemas. Through a multi-agent pipeline combining prompt optimization, semantic autograding, and prioritization analysis, we show that while models like o3 achieve high comprehensiveness (up to 92.2%), they consistently generate excessively long templates and fail to correctly prioritize the most clinically important questions under length constraints. Performance varies across specialties, with significant degradation in narrative-driven fields…
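The interplay between comprehensiveness and length constraints described in the abstract can be illustrated with a small sketch. This is not the paper's autograder: the abstract describes semantic autograding, whereas the code below substitutes a simple token-overlap (Jaccard) match as a stand-in. The function names, the 0.5 threshold, and the `max_questions` cap are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a question, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two questions (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def comprehensiveness(reference, generated, threshold=0.5, max_questions=None):
    """Fraction of reference questions matched by at least one generated question.

    Hypothetical stand-in for a semantic autograder: a reference question
    counts as covered when its Jaccard similarity to some generated question
    reaches `threshold`. `max_questions` truncates the generated template to
    mimic a length constraint, so coverage then depends on question order.
    """
    if max_questions is not None:
        generated = generated[:max_questions]
    if not reference:
        return 0.0
    covered = sum(
        1
        for ref in reference
        if any(jaccard(ref, gen) >= threshold for gen in generated)
    )
    return covered / len(reference)


# Illustrative, made-up questions (not taken from the Stanford templates).
reference = [
    "What is the patient's current blood pressure?",
    "List current medications",
]
generated = [
    "List current medications",
    "Any family history of cancer?",
    "What is the patient's current blood pressure?",
]

full_score = comprehensiveness(reference, generated)                     # 1.0
capped_score = comprehensiveness(reference, generated, max_questions=2)  # 0.5
```

In this toy case the uncapped template covers both reference questions, but capping it at two questions drops coverage to half because a lower-value question was placed ahead of a reference one. That is the kind of prioritization failure under length constraints the abstract reports, shown here only in miniature.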

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare