MedErr-CT: A Visual Question Answering Benchmark for Identifying and Correcting Errors in CT Reports
Sunggu Kyung, Hyungbin Park, Jinyoung Seo, Jimin Sung, Jihyun Kim, Dongyeong Kim, Wooyoung Jo, Yoojin Nam, Sangah Park, Taehee Kwon, Sang Min Lee, Namkug Kim

TL;DR
MedErr-CT is a benchmark that evaluates how well medical multimodal large language models (MLLMs) identify and correct errors in CT reports, targeting the limited clinical relevance and lack of expert-level assessment in existing medical VQA benchmarks.
Contribution
It introduces a VQA benchmark with six error categories and three task levels for assessing and improving the diagnostic accuracy of medical MLLMs.
Findings
Significant variation in model performance across error types
Benchmark reveals strengths and weaknesses of current medical MLLMs
Provides a foundation for developing more reliable clinical AI tools
Abstract
Computed Tomography (CT) plays a crucial role in clinical diagnosis, but the growing demand for CT examinations has raised concerns about diagnostic errors. While Multimodal Large Language Models (MLLMs) demonstrate promising comprehension of medical knowledge, their tendency to produce inaccurate information highlights the need for rigorous validation. However, existing medical visual question answering (VQA) benchmarks primarily focus on simple visual recognition tasks, lacking clinical relevance and failing to assess expert-level knowledge. We introduce MedErr-CT, a novel benchmark for evaluating medical MLLMs' ability to identify and correct errors in CT reports through a VQA framework. The benchmark includes six error categories - four vision-centric errors (Omission, Insertion, Direction, Size) and two lexical error types (Unit, Typo) - and is organized into three task levels:…
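The six error categories named in the abstract can be written down as a small data structure. The following is only an illustrative sketch, not the benchmark's actual schema or scoring code: the names `ErrorCategory`, `error_group`, and `accuracy_by_category` are hypothetical, and the per-category accuracy is a generic tally rather than the metric used in the paper. The three task levels are not modeled because the abstract excerpt is cut off before listing them.

```python
from collections import defaultdict
from enum import Enum


class ErrorCategory(Enum):
    """The six error categories listed in the abstract (class name is hypothetical)."""
    OMISSION = "Omission"
    INSERTION = "Insertion"
    DIRECTION = "Direction"
    SIZE = "Size"
    UNIT = "Unit"
    TYPO = "Typo"


# Grouping taken from the abstract: four vision-centric and two lexical categories.
VISION_CENTRIC = frozenset({
    ErrorCategory.OMISSION,
    ErrorCategory.INSERTION,
    ErrorCategory.DIRECTION,
    ErrorCategory.SIZE,
})
LEXICAL = frozenset({ErrorCategory.UNIT, ErrorCategory.TYPO})


def error_group(category):
    """Return 'vision-centric' or 'lexical' for a given category."""
    return "vision-centric" if category in VISION_CENTRIC else "lexical"


def accuracy_by_category(records):
    """Compute the fraction of correct answers per category.

    `records` is an iterable of (ErrorCategory, bool) pairs. This is an
    assumed, generic tally and not the paper's evaluation protocol.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        if is_correct:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}
```

A breakdown of this kind is one way to surface the variation in model performance across error types mentioned in the findings.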
Taxonomy
Topics: Radiology practices and education · Topic Modeling · Artificial Intelligence in Healthcare and Education
