Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results

Jonathan Liu; Haoling Qiu; Jonathan Lasko; Damianos Karakos; Mahsa Yarmohammadi; Mark Dredze

arXiv:2511.02246 · cs.CL · November 5, 2025

TL;DR

This paper develops an infrastructure to probe and evaluate medical chatbots using multiple LLMs, revealing low inter-LLM agreement and emphasizing the need for diverse evaluators to ensure generalizable results.

Contribution

It introduces a pipeline for generating realistic medical queries and evaluating LLM responses with multiple LLMs, highlighting issues in evaluation consistency and proposing best practices.

Findings

01

Low inter-LLM agreement (average Cohen's Kappa 0.118)

02

Statistically significant differences across writing styles, genders, and races appear with specific LLM pairs

03

Recommendations for using multiple LLMs for evaluation to ensure generalizability
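The agreement statistic in the first finding can be illustrated with a minimal sketch. It computes Cohen's Kappa for two judges and averages it over all judge pairs. The function names, the judge labels, and the choice to average over every pair are assumptions made for illustration; the summary does not say how the paper aggregated its 0.118 figure.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1 - expected)


def average_pairwise_kappa(judgments):
    """Mean Kappa over all pairs of judges (hypothetical aggregation)."""
    pairs = list(combinations(judgments, 2))
    return sum(cohen_kappa(judgments[a], judgments[b]) for a, b in pairs) / len(pairs)


# Made-up binary verdicts (1 = hallucination flagged) from two hypothetical judges.
judge_a = [1, 1, 0, 0, 1, 0, 1, 0]
judge_b = [1, 0, 0, 1, 1, 0, 0, 0]
kappa = cohen_kappa(judge_a, judge_b)
```

Here the judges agree on 5 of 8 items (0.625) while chance agreement is 0.5, so Kappa is 0.25. A value near 0.118 would indicate agreement only slightly above chance.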

Abstract

Recent research has shown that hallucinations, omissions, and biases are prevalent in everyday use-cases of LLMs. However, chatbots used in medical contexts must provide consistent advice in situations where non-medical factors are involved, such as when demographic information is present. In order to understand the conditions under which medical chatbots fail to perform as expected, we develop an infrastructure that 1) automatically generates queries to probe LLMs and 2) evaluates answers to these queries using multiple LLM-as-a-judge setups and prompts. For 1), our prompt creation pipeline samples the space of patient demographics, histories, disorders, and writing styles to create realistic questions that we subsequently use to prompt LLMs. In 2), our evaluation pipeline provides hallucination and omission detection using LLM-as-a-judge as well as agentic workflows, in addition to…
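The query-generation step described in the abstract can be pictured with a rough sketch that samples one value per attribute and renders a question. Every attribute value, the template wording, and the names `ATTRIBUTES`, `sample_profiles`, and `render_query` are hypothetical placeholders, not the authors' pipeline.

```python
import random

# Hypothetical attribute space; the real pipeline's values are not given here.
ATTRIBUTES = {
    "gender": ["female", "male"],
    "history": ["no prior conditions", "type 2 diabetes"],
    "disorder": ["migraine", "insomnia"],
    "style": ["formal", "casual"],
}


def sample_profiles(k, seed=0):
    """Draw k patient profiles, one value per attribute, reproducibly."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in ATTRIBUTES.items()}
        for _ in range(k)
    ]


def render_query(profile):
    """Turn a sampled profile into a question (illustrative template only)."""
    return (
        f"[{profile['style']} style] I am a {profile['gender']} patient with "
        f"{profile['history']}. What should I do about my {profile['disorder']}?"
    )


queries = [render_query(p) for p in sample_profiles(3)]
```

Fixing the seed keeps the sampled set reproducible, so the same queries can be sent to every model and judge under comparison.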

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Digital Mental Health Interventions · AI in Service Interactions