MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts

Naoto Iwase; Hiroki Okuyama; Junichiro Iwasawa

arXiv:2511.00421 · cs.CL · November 4, 2025

TL;DR

MedRECT is a cross-lingual benchmark (English and Japanese) for evaluating and improving how well large language models detect, localize, and correct errors in clinical texts, a capability needed for safe medical AI deployment.

Contribution

This paper introduces MedRECT, the first comprehensive cross-lingual benchmark for medical error correction, together with a scalable data-generation pipeline and an evaluation of diverse LLMs, including fine-tuned variants.

Findings

01. Reasoning models outperform standard architectures in error detection and localization.

02. Cross-lingual evaluation shows 5-10% performance gaps between English and Japanese.

03. Fine-tuning improves error correction, surpassing human experts in structured tasks.

Abstract

Large language models (LLMs) show increasing promise in medical applications, but their ability to detect and correct errors in clinical texts -- a prerequisite for safe deployment -- remains under-evaluated, particularly beyond English. We introduce MedRECT, a cross-lingual benchmark (Japanese/English) that formulates medical error handling as three subtasks: error detection, error localization (sentence extraction), and error correction. MedRECT is built with a scalable, automated pipeline from the Japanese Medical Licensing Examinations (JMLE) and a curated English counterpart, yielding MedRECT-ja (663 texts) and MedRECT-en (458 texts) with comparable error/no-error balance. We evaluate 9 contemporary LLMs spanning proprietary, open-weight, and reasoning families. Key findings: (i) reasoning models substantially outperform standard architectures, with up to 13.5% relative improvement…
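To make the three subtasks concrete, the sketch below scores the first two (detection and localization) on toy data. It is a minimal illustration, not the benchmark's evaluation code: the `Example` and `Prediction` schemas, the function names, and the choice of plain accuracy as the metric are all assumptions introduced here, and the abstract does not specify how MedRECT scores each subtask. Correction is left unscored because judging a rewritten sentence typically needs a text-similarity or judged metric.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One clinical text with its gold annotation (hypothetical schema)."""
    sentences: List[str]
    error_index: Optional[int]        # None means the text contains no error
    corrected: Optional[str] = None   # gold corrected sentence, if any


@dataclass
class Prediction:
    """One model output covering the three subtasks (hypothetical schema)."""
    has_error: bool                   # subtask 1: error detection
    error_index: Optional[int] = None # subtask 2: localization (sentence extraction)
    corrected: Optional[str] = None   # subtask 3: correction (not scored here)


def detection_accuracy(examples, predictions):
    """Fraction of texts whose error / no-error flag is predicted correctly."""
    hits = sum(
        (ex.error_index is not None) == pred.has_error
        for ex, pred in zip(examples, predictions)
    )
    return hits / len(examples)


def localization_accuracy(examples, predictions):
    """Among texts that do contain an error, fraction where the model
    flags an error and points at the right sentence."""
    with_error = [
        (ex, pred)
        for ex, pred in zip(examples, predictions)
        if ex.error_index is not None
    ]
    if not with_error:
        return 0.0
    hits = sum(
        pred.has_error and pred.error_index == ex.error_index
        for ex, pred in with_error
    )
    return hits / len(with_error)


# Toy data with placeholder sentences (not drawn from MedRECT).
examples = [
    Example(["s0", "s1", "s2"], error_index=1, corrected="s1 fixed"),
    Example(["s0", "s1"], error_index=None),
    Example(["s0", "s1"], error_index=0, corrected="s0 fixed"),
]
predictions = [
    Prediction(has_error=True, error_index=1, corrected="s1 fixed"),
    Prediction(has_error=True, error_index=0, corrected="s0 changed"),
    Prediction(has_error=True, error_index=1, corrected="s1 changed"),
]

print(detection_accuracy(examples, predictions))
print(localization_accuracy(examples, predictions))
```

Scoring localization only on texts that actually contain an error keeps it from being inflated by the error-free half of a balanced dataset; a real evaluation might condition it differently.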

Code & Models

Models

pfnet/Preferred-MedRECT-32B (Hugging Face model)

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare