Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results
Jonathan Liu, Haoling Qiu, Jonathan Lasko, Damianos Karakos, Mahsa Yarmohammadi, Mark Dredze

TL;DR
This paper develops an infrastructure to probe and evaluate medical chatbots using multiple LLMs, revealing low inter-LLM agreement and emphasizing the need for diverse evaluators to ensure generalizable results.
Contribution
It introduces a pipeline for generating realistic medical queries and evaluating LLM responses with multiple LLMs, highlighting issues in evaluation consistency and proposing best practices.
Findings
Low inter-LLM agreement (average Cohen's Kappa 0.118)
Statistically significant differences across writing styles, genders, and races for specific LLM pairs
Recommendations for using multiple LLMs for evaluation to ensure generalizability
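The agreement figure above is an average of Cohen's Kappa over judge pairs. The following is a minimal sketch of how such an average could be computed, not the authors' implementation. The function names and the `judgments` layout (judge name mapped to a list of labels over the same items) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(judgments):
    """Average kappa over every unordered pair of judges.

    `judgments` is a hypothetical layout: judge name -> list of labels.
    """
    pairs = list(combinations(sorted(judgments), 2))
    if not pairs:
        raise ValueError("need at least two judges")
    total = sum(cohens_kappa(judgments[a], judgments[b]) for a, b in pairs)
    return total / len(pairs)
```

A kappa near 0 means two judges agree about as often as chance would predict, so an average of 0.118 indicates that the LLM judges rarely agree beyond chance.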
Abstract
Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to…
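Step 1) of the abstract describes sampling the space of patient demographics, histories, disorders, and writing styles. The sketch below illustrates that idea under stated assumptions: every attribute value, the `ATTRIBUTE_SPACE` table, and the query template are hypothetical placeholders, not the authors' data or prompts.

```python
import random

# Hypothetical attribute space; the paper's actual categories are not
# given in this summary.
ATTRIBUTE_SPACE = {
    "gender": ["female", "male"],
    "race": ["Asian", "Black", "White"],
    "history": ["no prior conditions", "type 2 diabetes"],
    "disorder": ["insomnia", "migraine"],
    "writing_style": ["formal", "casual"],
}


def sample_profiles(n, seed=0):
    """Draw n patient profiles uniformly at random, one value per attribute."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return [
        {name: rng.choice(values) for name, values in ATTRIBUTE_SPACE.items()}
        for _ in range(n)
    ]


def render_query(profile):
    """Fill a hypothetical instruction template used to request a patient query."""
    return (
        f"Write a {profile['writing_style']} question from a "
        f"{profile['race']} {profile['gender']} patient with "
        f"{profile['history']} who is asking about {profile['disorder']}."
    )
```

Holding the disorder and history fixed while varying only gender, race, or writing style would yield matched queries whose answers can then be compared for consistency.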
Taxonomy
Topics: Artificial Intelligence in Healthcare and Education · Digital Mental Health Interventions · AI in Service Interactions
