# DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding

**Authors:** Hengchuan Zhu, Yihuan Xu, Yichen Li, Zijie Meng, Zuozhu Liu

arXiv: 2508.20416 · 2025-08-29

## TL;DR

DentalBench is a bilingual (English-Chinese) benchmark for evaluating and improving large language models in dentistry, filling a gap in evaluation resources for specialized medical fields.

## Contribution

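Introduces DentalBench, presented as the first comprehensive bilingual benchmark for the dental domain. It pairs DentalQA, an English-Chinese QA benchmark with 36,597 questions across 4 tasks and 16 dental subfields, with DentalCorpus, a 337.35-million-token corpus for domain adaptation through supervised fine-tuning (SFT) and retrieval-augmented generation (RAG).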

## Key findings

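- The 14 evaluated LLMs (proprietary, open-source, and medical-specific) show significant performance gaps across task types and languages.
- Domain adaptation, tested with Qwen-2.5-3B, substantially improves performance, particularly on knowledge-intensive and terminology-focused tasks.
- The results highlight the need for domain-specific benchmarks in healthcare AI.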

## Abstract

Recent advances in large language models (LLMs) and medical LLMs (Med-LLMs) have demonstrated strong performance on general medical benchmarks. However, their capabilities in specialized medical fields such as dentistry, which requires deeper domain-specific knowledge, remain underexplored due to the lack of targeted evaluation resources. In this paper, we introduce DentalBench, the first comprehensive bilingual benchmark designed to evaluate and advance LLMs in the dental domain. DentalBench consists of two main components: DentalQA, an English-Chinese question-answering (QA) benchmark with 36,597 questions spanning 4 tasks and 16 dental subfields; and DentalCorpus, a large-scale, high-quality corpus with 337.35 million tokens curated for dental domain adaptation, supporting both supervised fine-tuning (SFT) and retrieval-augmented generation (RAG). We evaluate 14 LLMs, covering proprietary, open-source, and medical-specific models, and reveal significant performance gaps across task types and languages. Further experiments with Qwen-2.5-3B demonstrate that domain adaptation substantially improves model performance, particularly on knowledge-intensive and terminology-focused tasks, and highlight the importance of domain-specific benchmarks for developing trustworthy and effective LLMs tailored to healthcare applications.
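
The abstract reports performance gaps across task types and languages. As a rough illustration of how such a breakdown could be computed for a bilingual QA benchmark, the sketch below groups graded answers by task and language and reports exact-match accuracy per group. The record fields (`task`, `language`, `answer`, `prediction`), the task names, and the toy records are all assumptions made for this example; they are not taken from DentalQA, and the paper's actual metrics may differ.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(task, language): accuracy} for a list of graded QA records.

    Each record is a dict with hypothetical keys: "task", "language",
    "answer" (gold label) and "prediction" (model output).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        key = (record["task"], record["language"])
        total[key] += 1
        # Exact-match scoring; an assumption, not the paper's stated metric.
        if record["prediction"] == record["answer"]:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


# Toy, made-up records purely for illustration (not DentalQA data).
records = [
    {"task": "multiple_choice", "language": "en", "answer": "B", "prediction": "B"},
    {"task": "multiple_choice", "language": "en", "answer": "C", "prediction": "A"},
    {"task": "multiple_choice", "language": "zh", "answer": "D", "prediction": "D"},
    {"task": "terminology", "language": "zh", "answer": "A", "prediction": "A"},
]

scores = accuracy_by_group(records)
for (task, language), acc in sorted(scores.items()):
    print(f"{task:16s} {language}  {acc:.2f}")
```

Keeping task and language as a joint key makes it easy to see whether a gap comes from the task type, the language, or a combination of the two.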

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20416/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20416/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/2508.20416/full.md

---
Source: https://tomesphere.com/paper/2508.20416