MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations

Congbo Ma; Yichun Zhang; Yousef Al-Jazzazi; Ahamed Foisal; Laasya Sharma; Yousra Sadqi; Khaled Saleh; Jihad Mallat; Farah E. Shamout

arXiv:2602.05692 · cs.CL · February 6, 2026

Open Access

TL;DR

MedErrBench is a comprehensive multilingual benchmark for detecting, localizing, and correcting medical errors in clinical texts, developed with expert annotations across English, Arabic, and Chinese to improve AI healthcare safety.

Contribution

This paper introduces MedErrBench, the first multilingual clinical error detection and correction benchmark with expert annotations, covering diverse languages and error types to advance clinical NLP evaluation.

Findings

01. Significant performance gaps in non-English clinical texts
02. Language-specific models outperform general models in error tasks
03. Benchmark promotes development of safer, multilingual healthcare AI systems

Abstract

Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially when they involve a misdiagnosis or an incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling