MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports

Sunggu Kyung; Hyungbin Park; Jinyoung Seo; Jimin Sung; Jihyun Kim; Dongyeong Kim; Wooyoung Jo; Yoojin Nam; Sangah Park; Taehee Kwon; Sang Min Lee; Namkug Kim

arXiv:2506.19217 · cs.CV · June 25, 2025

TL;DR

MedErr-CT is a benchmark for evaluating how well medical multimodal large language models (MLLMs) identify and correct errors in CT reports. It targets a gap left by existing medical VQA benchmarks, which focus on simple visual recognition, lack clinical relevance, and do not assess expert-level knowledge.

Contribution

It introduces a VQA benchmark with six error categories and three task levels for assessing and improving the diagnostic accuracy of medical MLLMs.

Findings

01. Significant variation in model performance across error types

02. Benchmark reveals strengths and weaknesses of current medical MLLMs

03. Provides a foundation for developing more reliable clinical AI tools

Abstract

Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels:…
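The error taxonomy named in the abstract can be written down as a small data structure. The following is only an illustrative sketch: the class and function names (`ErrorGroup`, `ErrorCategory`, `categories_in`) are hypothetical and do not come from the benchmark's code, and the three task levels are left out because the abstract text above is truncated before naming them.

```python
from enum import Enum


class ErrorGroup(Enum):
    """The two groups of error types named in the abstract."""
    VISION_CENTRIC = "vision-centric"
    LEXICAL = "lexical"


class ErrorCategory(Enum):
    """The six error categories, each tagged with its group.

    Names are taken from the abstract; the structure is an assumption.
    """
    OMISSION = ("Omission", ErrorGroup.VISION_CENTRIC)
    INSERTION = ("Insertion", ErrorGroup.VISION_CENTRIC)
    DIRECTION = ("Direction", ErrorGroup.VISION_CENTRIC)
    SIZE = ("Size", ErrorGroup.VISION_CENTRIC)
    UNIT = ("Unit", ErrorGroup.LEXICAL)
    TYPO = ("Typo", ErrorGroup.LEXICAL)

    def __init__(self, label, group):
        # Enum unpacks each tuple value into these two attributes.
        self.label = label
        self.group = group


def categories_in(group):
    """Return the categories belonging to one group, in definition order."""
    return [c for c in ErrorCategory if c.group is group]


if __name__ == "__main__":
    for group in ErrorGroup:
        labels = ", ".join(c.label for c in categories_in(group))
        print(f"{group.value}: {labels}")
```

Grouping the categories this way makes it straightforward to report results per group, which matches the finding that model performance varies across error types.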

Taxonomy

Topics: Radiology practices and education · Topic Modeling · Artificial Intelligence in Healthcare and Education