OralMLLM-Bench: Evaluating Cognitive Capabilities of Multimodal Large Language Models in Dental Practice
Rongyang Wang, Shuang Zhou, Jiashuo Wang, Wenya Xie, Xiaoxia Che

TL;DR
This paper introduces a comprehensive benchmark to evaluate the cognitive abilities of multimodal large language models in dental radiographic analysis across multiple imaging modalities.
Contribution
It defines a new benchmark of 27 clinically grounded tasks, evaluated against 3,820 clinician assessments, to measure MLLM performance in dental radiology and to highlight current gaps and areas for improvement.
Findings
MLLMs lag behind clinicians in dental radiographic tasks.
Performance varies across different cognitive categories and imaging modalities.
The benchmark reveals specific failure patterns of current models.
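A per-category, per-modality breakdown like the one these findings refer to could be computed roughly as follows. This is a minimal sketch, not the authors' evaluation code: the record layout (`category`, `modality`, `score`), the function name `mean_score_by_group`, and the sample values are all hypothetical. Only the modality and category names are taken from the abstract.

```python
from collections import defaultdict

# Taxonomy as named in the abstract.
MODALITIES = ("periapical", "panoramic", "lateral cephalometric")
CATEGORIES = ("perception", "comprehension", "prediction", "decision-making")


def mean_score_by_group(records):
    """Return the mean score for each (category, modality) pair.

    Each record is a dict with the hypothetical keys
    'category', 'modality', and 'score' (a float).
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum of scores, count]
    for record in records:
        category, modality = record["category"], record["modality"]
        if category not in CATEGORIES or modality not in MODALITIES:
            raise ValueError(f"unknown group: {category!r}, {modality!r}")
        totals[(category, modality)][0] += record["score"]
        totals[(category, modality)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Made-up records for illustration only; these are not results from the paper.
records = [
    {"category": "perception", "modality": "panoramic", "score": 1.0},
    {"category": "perception", "modality": "panoramic", "score": 0.0},
    {"category": "decision-making", "modality": "periapical", "score": 1.0},
]
result = mean_score_by_group(records)
```

Grouping on the (category, modality) pair rather than on either axis alone keeps both sources of variation visible in a single table.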
Abstract
Multimodal large language models (MLLMs) have emerged as a promising paradigm for dental image analysis. However, their ability to capture the multi-level cognitive processes required for radiographic analysis remains unclear. Here, we present a comprehensive benchmark to evaluate the cognitive capabilities of MLLMs in dental radiographic analysis. It spans three critical imaging modalities, i.e., periapical, panoramic, and lateral cephalometric radiographs, and defines four cognitive categories: perception, comprehension, prediction, and decision-making. The benchmark comprises 27 clinically grounded tasks derived from public datasets, with manually curated annotations and 3,820 clinician assessments for evaluation. Six frontier MLLMs, including GPT-5.2 and GLM-4.6, are evaluated. We demonstrate the performance gap between MLLMs and clinicians in dental practice, delineate model…
