# Camera-Based Monocular Depth Estimation in Orthodontics: Vision Transformer vs. CNN Model Performance

**Authors:** Arda Arısan, Gökhan Serhat Duran

PMC · DOI: 10.3390/s25216512 · Sensors (Basel, Switzerland) · 2025-10-22

## TL;DR

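This study compares one vision transformer model (DPT-Large) with two CNN-based models (DepthAnything-v2 and ZoeDepth) for estimating facial depth from frontal photographs in orthodontics. DPT-Large ranked three soft-tissue landmarks correctly in 92.7% of cases, while both CNN-based models scored below the 16.7% chance level.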

## Contribution

Shows that a vision transformer model (DPT-Large) outperforms two CNN-based models at extracting clinically meaningful depth information from frontal facial images for orthodontic profile assessment.

## Key findings

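- The vision transformer DPT-Large reproduced the reference anteroposterior ranking of three soft-tissue landmarks in 92.7% of cases (76/82; 95% CI: 84.8–97.3).
- The CNN-based models DepthAnything-v2 (9.8%) and ZoeDepth (4.9%) both performed below the theoretical chance level (16.7%) on the same task.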
- Depth estimation from frontal photos could support facial profile evaluation in orthodontics.
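
The 16.7% chance level can be checked with a short enumeration. The sketch below assumes that all orderings of the three landmarks named in the abstract are equally likely under random guessing and that ties do not occur; neither assumption is stated in this summary.

```python
from itertools import permutations

# Landmark labels as named in the abstract (ASCII apostrophe used for Pog').
landmarks = ("ULA", "LLA", "Pog'")

# Every possible anteroposterior ordering of the three landmarks.
orderings = list(permutations(landmarks))

# Assumption: a random guess picks each ordering with equal probability,
# so the chance of matching the reference ranking exactly is 1 / 3!.
chance_level = 1 / len(orderings)

print(f"{len(orderings)} orderings, chance level = {chance_level:.1%}")
```

With three landmarks there are 3! = 6 orderings, so an exact match by chance occurs in about one case out of six.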

## Abstract

**Background:** Monocular Depth Estimation (MDE) is a computer vision approach that predicts spatial depth information from a single two-dimensional image. In orthodontics, where facial soft-tissue evaluation is integral to diagnosis and treatment planning, such methods offer new possibilities for obtaining sagittal profile information from standard frontal photographs. This study aimed to determine whether MDE can extract clinically meaningful information for facial profile assessment.

**Methods:** Standardized frontal photographs and lateral cephalometric radiographs from 82 adult patients (48 Class I, 28 Class II, 6 Class III) were retrospectively analyzed. Three clinically relevant soft-tissue landmarks, Upper Lip Anterior (ULA), Lower Lip Anterior (LLA), and Soft Tissue Pogonion (Pog′), were annotated on frontal photographs, while true vertical line (TVL) analysis from cephalograms served as the reference standard. For each case, anteroposterior (AP) relationships among the three landmarks were represented as ordinal rankings based on predicted depth values, and accuracy was defined as complete agreement between model-derived and reference rankings. Depth maps were generated using one vision transformer model (DPT-Large) and two CNN-based models (DepthAnything-v2 and ZoeDepth). Model performance was evaluated using accuracy, 95% confidence intervals, and effect size measures.

**Results:** The transformer-based DPT-Large achieved clinically acceptable accuracy in 92.7% of cases (76/82; 95% CI: 84.8–97.3), significantly outperforming the CNN-based models DepthAnything-v2 (9.8%) and ZoeDepth (4.9%), both of which performed below the theoretical chance level (16.7%).

**Conclusions:** Vision transformer-based Monocular Depth Estimation demonstrates the potential for clinically meaningful soft-tissue profiling from frontal photographs, suggesting that depth information derived from two-dimensional images may serve as a supportive tool for facial profile evaluation. These findings provide a foundation for future studies exploring the integration of depth-based analysis into digital orthodontic diagnostics.
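
The evaluation described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the toy depth values, and the `larger_is_anterior` flag are hypothetical, and the exact (Clopper-Pearson) interval is an assumed choice because the abstract does not say which confidence interval method was used.

```python
from math import comb

# Landmark labels follow the abstract; everything else here is hypothetical.
LANDMARKS = ("ULA", "LLA", "Pog'")


def ordinal_ranking(depths, larger_is_anterior=True):
    """Order landmarks from most anterior to most posterior.

    Whether a larger depth value means "closer to the camera" depends on
    the model's output convention, so the direction is an explicit flag.
    """
    return tuple(sorted(LANDMARKS, key=lambda name: depths[name],
                        reverse=larger_is_anterior))


def ranking_accuracy(predicted, reference):
    """Fraction of cases whose predicted ranking matches the reference exactly."""
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_root(f, lo=0.0, hi=1.0, iters=60):
    """Root of a decreasing function f on [lo, hi], found by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Two-sided Clopper-Pearson interval for k successes out of n trials."""
    half = alpha / 2
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: half - (1 - binom_cdf(k - 1, n, p)))
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p) - half)
    return lower, upper


# Toy, made-up depth values (not data from the study).
predicted = [
    ordinal_ranking({"ULA": 0.82, "LLA": 0.79, "Pog'": 0.71}),
    ordinal_ranking({"ULA": 0.60, "LLA": 0.66, "Pog'": 0.58}),
]
reference = [("ULA", "LLA", "Pog'"), ("ULA", "LLA", "Pog'")]
toy_accuracy = ranking_accuracy(predicted, reference)  # one of two cases matches

# Counts reported for DPT-Large in the abstract: 76 correct out of 82.
ci_low, ci_high = exact_ci(76, 82)
print(f"toy accuracy = {toy_accuracy:.2f}, 95% CI = {ci_low:.3f} to {ci_high:.3f}")
```

Because accuracy requires complete agreement, a single swapped pair counts as a miss, as in the second toy case. Whether this interval reproduces the reported 84.8–97.3 depends on the method the authors actually applied.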

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12610820/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12610820/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/PMC12610820/full.md

---
Source: https://tomesphere.com/paper/PMC12610820