Prompt-Based Caption Generation for Single-Tooth Dental Images Using Vision-Language Models

Anastasiia Sukhanova; Aiden Taylor; Julian Myers; Zichun Wang; Kartha Veerya Jammuladinne; Satya Sri Rajiteswari Nimmagadda; Aniruddha Maiti; Ananya Jana

arXiv:2603.07403 · cs.CV · March 10, 2026

TL;DR

This paper explores using vision-language models with guided prompts to generate meaningful captions for single-tooth dental images, addressing the lack of specialized datasets and holistic descriptions in dental image analysis.

Contribution

The paper introduces a framework that uses vision-language models and guided prompts to generate captions for single-tooth dental images, addressing the scarcity of captioned dental image datasets.
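As a rough illustration of what prompt-guided captioning can look like, the sketch below builds a structured prompt and passes it, together with an image path, to a vision-language model. It is not the authors' implementation: the aspect list, the prompt wording, the function names `build_guided_prompt` and `caption_image`, and the `vlm` callable are all assumptions made for this example, and any real model client would have to be wrapped to match the `(image_path, prompt) -> str` shape used here.

```python
from typing import Callable, Sequence


def build_guided_prompt(aspects: Sequence[str], max_sentences: int = 3) -> str:
    """Assemble a captioning prompt that lists the visual aspects to cover.

    The aspects and the wording are hypothetical; the paper's actual
    prompts are not reproduced here.
    """
    if not aspects:
        raise ValueError("at least one aspect is required")
    bullet_list = "\n".join(f"- {aspect}" for aspect in aspects)
    return (
        "You are describing a photograph of a single tooth.\n"
        "Describe only what is visible, covering these aspects:\n"
        f"{bullet_list}\n"
        f"Answer in at most {max_sentences} sentences."
    )


def caption_image(image_path: str, prompt: str,
                  vlm: Callable[[str, str], str]) -> str:
    """Call a vision-language model and return its caption, whitespace-trimmed.

    `vlm` is a placeholder for whatever model client is available; it is
    assumed to take an image path and a text prompt and return a string.
    """
    return vlm(image_path, prompt).strip()


if __name__ == "__main__":
    # Hypothetical aspects chosen only to make the example concrete.
    prompt = build_guided_prompt(["tooth type", "color", "surface condition"])

    # Stand-in model so the example runs without any external service.
    def fake_vlm(image_path: str, text_prompt: str) -> str:
        return f"  Placeholder caption for {image_path}.  "

    print(prompt)
    print(caption_image("tooth_001.png", prompt, fake_vlm))
```

Keeping the model behind a plain callable keeps the prompt construction independent of any particular model API, so the same guided prompt could be tried against different vision-language models.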

Findings

01 Guided prompts improve caption relevance and quality.

02 Using RGB images increases the approach's potential for consumer applications.

03 The framework produces better visual descriptions of dental images.

Abstract

Digital dentistry has made significant advances with the advent of deep learning. However, the majority of these deep learning-based dental image analysis models focus on very specific tasks such as tooth segmentation, tooth detection, cavity detection, and gingivitis classification. There is a lack of a specialized model that has holistic knowledge of teeth and can perform dental image analysis tasks based on that knowledge. Datasets of dental images with captions can help build such a model. To the best of our knowledge, existing dental image datasets with captions are few in number and limited in scope. In many of these datasets, the captions describe the entire mouth, while the images are limited to the anterior view. As a result, posterior teeth such as molars are not clearly visible, limiting the usefulness of the captions for training vision-language models. Additionally, the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning