MetaDent: Labeling Clinical Images for Vision-Language Models in Dentistry

Meng-Xun Li; Wen-Hui Deng; Zhi-Xing Wu; Chun-Xiao Jin; Jia-Min Wu; Yue Han; James Kit Hon Tsoi; Gui-Song Xia; Cui Huang

arXiv:2604.14866·cs.CV·April 17, 2026

TL;DR

MetaDent introduces a large-scale, annotated dental image dataset and benchmark suite to evaluate vision-language models in dentistry, highlighting current model limitations in fine-grained intraoral image understanding.

Contribution

The paper presents a novel, comprehensive dental image dataset with hierarchical annotations and benchmarks for evaluating state-of-the-art vision-language models in clinical dentistry.

Findings

1. Models achieve moderate accuracy on dental VQA tasks.
2. State-of-the-art models produce inconsistent descriptions in image captioning.
3. The dataset and benchmarks are publicly available for research.

Abstract

Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images…
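
To make the labeling approach in the abstract more concrete (a high-level image summary combined with point-by-point, free-text descriptions of abnormalities), here is a minimal Python sketch of how one such record might be represented. This is not the MetaDent schema: the class name, field names, caption format, and example text are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DentalImageAnnotation:
    """Hypothetical record: one image summary plus free-text abnormality points.

    Field names are assumptions for illustration, not the dataset's real schema.
    """
    image_id: str
    summary: str
    abnormalities: List[str] = field(default_factory=list)

    def to_caption(self) -> str:
        """Flatten the semi-structured record into a single caption string."""
        if not self.abnormalities:
            return self.summary
        # Number each free-text point so the flattened caption keeps their order.
        points = " ".join(
            f"({i}) {text}" for i, text in enumerate(self.abnormalities, start=1)
        )
        return f"{self.summary} Abnormalities: {points}"


# Invented example content, not taken from the dataset.
example = DentalImageAnnotation(
    image_id="img_0001",
    summary="Frontal intraoral view of the anterior teeth.",
    abnormalities=[
        "Dark discoloration on one incisor.",
        "Gingival redness near the lower canines.",
    ],
)
print(example.to_caption())
```

Keeping the abnormality points as a plain list of strings, rather than fixed categories, mirrors the free-text and task-agnostic character the abstract describes: the same record could in principle be flattened into a caption, as above, or turned into question-answer pairs.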
