MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry
Meng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu, Chun-Xiao Jin, Jia-Min Wu, Yue Han, James Kit Hon Tsoi, Gui-Song Xia, Cui Huang

TL;DR
MetaDent introduces a large-scale, annotated dental image dataset and benchmark suite to evaluate vision-language models in dentistry, highlighting current model limitations in fine-grained intraoral image understanding.
Contribution
The paper presents a novel, comprehensive dental image dataset with hierarchical annotations and benchmarks for evaluating state-of-the-art vision-language models in clinical dentistry.
Findings
Models achieve moderate accuracy on dental VQA tasks.
State-of-the-art models produce inconsistent descriptions in image captioning.
The dataset and benchmarks are publicly available for research.
Abstract
Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images…
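The abstract describes the labeling approach as a high-level image summary plus point-by-point, free-text descriptions of abnormalities. As a rough illustration only, one such record might be represented as in the sketch below. The class name, field names, and example values are hypothetical and are not taken from the MetaDent release, whose actual schema may differ.

```python
from dataclasses import dataclass, field, asdict
from typing import List
import json


@dataclass
class ImageAnnotation:
    """Hypothetical record: one image summary plus free-text abnormality points."""
    image_id: str   # hypothetical identifier, not a real dataset key
    summary: str    # high-level description of the whole image
    abnormalities: List[str] = field(default_factory=list)  # one free-text entry per finding

    def to_caption(self) -> str:
        """Flatten the record into a single caption-style string."""
        if not self.abnormalities:
            return self.summary
        points = "; ".join(self.abnormalities)
        return f"{self.summary} Findings: {points}."

    def to_json(self) -> str:
        """Serialize the record so it can be stored or passed to other tools."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Made-up example values for illustration; not drawn from the dataset.
example = ImageAnnotation(
    image_id="img_0001",
    summary="Frontal intraoral view of the anterior teeth.",
    abnormalities=[
        "discoloration on an upper incisor",
        "gingival redness near a lower canine",
    ],
)

print(example.to_caption())
print(example.to_json())
```

Keeping the abnormalities as a list of free-text points, rather than a fixed label set, is one way to reflect the task-agnostic intent stated in the abstract: the same record could be flattened into a caption or used to derive question-answer pairs.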
