Accuracy of dentalmonitoring’s artificial intelligence in detecting aligner tracking issues: a retrospective multi-centric study
Julie Fahl McCray, William Dabney, Dylan Handlin, Logan Smith, Jessie Zhu, Maria Roxas Regina Trinidad, Naurine Shah, Sarah Zaki, Rayan Skafi, Mohammed H. Elnagar

TL;DR
This study evaluated an AI system's ability to detect dental aligner tracking issues and found it to be highly accurate in identifying misfits.
Contribution
The study provides empirical validation of DentalMonitoring’s AI performance in detecting aligner tracking issues across multiple clinical centers.
Findings
The AI showed high sensitivity and specificity in detecting aligner unseats in both binary and three-level models.
High negative predictive values indicate the AI is reliable in ruling out clinically meaningful unseats.
Lower PPV in the noticeable-unseat category was attributed to the low prevalence of such cases in the dataset.
Abstract
This study aimed to test the performance of DentalMonitoring’s (DM) artificial intelligence (AI) in detecting aligner tracking issues. This multicenter retrospective comparative study analyzed 3,323 assessments from 623 patients treated at multiple U.S. sites. DM’s AI performance was evaluated using a binary model (seated vs. unseated) and a three-level model (seated, slight unseat, noticeable unseat). AI outputs were compared against a reference standard established through independent case reviews performed by a panel of three U.S.-based orthodontic residents. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated. For the binary comparison (seat vs. unseat), sensitivity was 93.2% and specificity 86.2%, with a PPV of 89.2% and an NPV of 94.4%. For the three-level comparison, the noticeable-unseat category demonstrated a…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4- —Dental Monitoring
- —DentalMonitoring
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsOrthodontics and Dentofacial Orthopedics · Dental Radiography and Imaging · dental development and anomalies
Introduction
With the ongoing advancements in clear aligner techniques, Clear Aligner Therapy (CAT) has been increasingly favored within the orthodontic community over traditional fixed appliances. CAT is attractive to both practitioners and patients due to its numerous benefits, including enhanced comfort, improved aesthetics, ease of oral hygiene, better maintenance of dental and periodontal health, and reduced chairside time [1–3]. Nevertheless, this gain in popularity has raised concerns regarding the efficiency of tooth movement, particularly in the older population. Regardless of the system used, the companies that promote the use of Clear Aligners (CA) make numerous claims about the effectiveness and predictability of their systems, despite the limited availability of high-quality evidence that substantiates those claims [4]. In fact, the literature shows that the clinical realization of the intended movements typically falls short of the prescribed and approved corrections [5–7]. In addition, the predictability of difficult-to-be-achieved tooth movement through CAT is often overestimated by practitioners with limited clinical experience [8]. Consequently, additional series of refinement trays are frequently required, resulting in extra financial and time commitments for both patients and providers [9, 10]. This discrepancy between simulated and actual outcomes can be attributed to three potential factors.
The first reason is insufficient daily wear time. A retrospective cohort study of 2,644 adults treated with clear aligners used an app-based questionnaire to track aligner-change dates and self-reported daily wear time [11]. Only 36% met recommended wear-time targets, and 25% showed poor compliance. Women were less compliant than men, and patients without prior orthodontic treatment adhered better.
Second, Patients may also switch to the next aligner before completing the intended movement. Wear-time recommendations have long been debated. Early CAT guidance suggested 2-week wear per tray [12]. Subsequent work proposed 10-day intervals [13]. In 2016, Invisalign^®^ (Align Technology) advised 7-day wear (“How Does Invisalign Work | Invisalign Braces,” n.d.). As a result, orthodontist instructions typically range from 7 to 14 days or more. Even though previous literature has been limited in addressing an appropriate aligner change interval, one randomized clinical study has directly evaluated the efficacy of tooth movement with different aligner wear protocols of 7, 10, and 14 days [9]. This study demonstrated the comparable accuracy of the 7-day and 14-day protocols, with the 7-day protocol offering shorter treatment times, and therefore, a viable option for clinical treatment; additionally, the study specifically recommended to follow the 14-day protocol for complex posterior movements or angular movements (i.e. torque, tip, rotation) due to a statistically significant difference, however, they demonstrated no clinically significant difference (Al-Nadawi et al. 2021).
Third, and linked to the two reasons mentioned above, the aligners may not fit properly. This lack of intimate contact between the aligner and the tooth compromises the necessary forces required for effective tooth movement. Given the long appointment intervals for clear aligner patients, fitting issues may go undetected for several months, worsening progressively over several sets of aligners [10]. Remote monitoring has been shown to improve compliance among patients treated with clear aligners. In an interrupted-time-series analysis using a mobile app with e-reminders, the proportion of poorly compliant aligner patients dropped from 24.5% to 9.3% [14]. In a separate retrospective study of 100 tele-orthodontic users, reported aligner wear increased from 14 h to 22 h per day [15].
As remote monitoring technologies expand, many of these systems increasingly incorporate artificial intelligence (AI) to automate image analysis and support clinical decision-making. AI-assisted remote monitoring and big-data predictive analytics are already being used in medicine to interpret patient-submitted images and detect early complications [16–18]. In dentistry, deep learning models have been applied to remotely evaluate caries and early signs of oral cancer using smartphones [19, 20]. In orthodontics, emerging AI systems like DM enable patients to capture high-quality images for remote clinical assessment, track tooth movement between visits, and generate 3D models with accuracy comparable to those produced by intraoral scanners [21, 22]. One study evaluated DM’s ability to detect aligner tracking issues and found that it accurately identifies noticeable aligner unseats, with an overall accuracy of 93.1% [23].
Currently, there is a lack of independent, rigorous validation for AI-driven remote monitoring systems like DentalMonitoring (DM) in detecting aligner tracking issues. The primary aim of this study was to evaluate the sensitivity and specificity of DM’s AI algorithm in detecting aligner tracking issues, using a consensus of expert orthodontists as the reference standard.
Materials and methods
Study design & rationale
This was a multi-centric retrospective qualitative comparative study that included a dataset from 623 patients. The study was conducted in the scope of an FDA approval process, and the data were extracted from DM’s data pool.
This study was conducted in accordance with the ethical principles that have their origin in the Declaration of Helsinki. Applicable data protection regulations were also followed, namely Regulation (EU) 2017/745 and ISO 14155:2020. An exemption from the Western Institutional Review Board (WITH^®^) known as WCG IRB (WIRB-Copernicus Group IRB), was granted under 45 CFR § 46.104(d)(4). Because this study was a retrospective study the WCG IRP granted exemption for a Consent form (Human Ethics and Consent to Participate declarations: not applicable).
The patient inclusion criteria included: Adolescents & adults of 13 years old and older and patients treated with aligners with at least one attachment. The patient exclusion criteria included: Patients with at least one primary tooth, patients ≤ 12 years old and non-US and non-European patients. With regard to the picture sets generated from the DM scans, the inclusion criteria were: De-identified DM picture sets, picture set acquired using the DM app (on a phone with at least Android 6 and up, or iOS 11 and up), a DM Cheek Retractor and a DM ScanBox (Fig. 1) and lastly a DM picture set with at least eight images processable by DM respecting the following reparation: At least three closed-view images (left, right, and front), at least three open-view images (left, right, and front), and two occlusal-view images (up and down).Fig. 1A The DM ScanBox Pro to support remote intraoral image acquisition by patients. B A patient performing a scan using the DM ScanBox Pro with a smartphone inserted, allowing image capture of occlusal and intraoral structures
The study aimed to evaluate DM’s performance in detecting aligner tracking issues based on predefined clinical categories and system outputs. Aligner tracking was assessed according to three classifications: satisfactory aligner tracking (seat), slight unseat, and noticeable unseat. A seat was defined as a satisfactory fit of the aligner with no visible unseats or damage. A slight unseat occurred when the aligner was not in close contact with the tooth, showing a minor gap of less than 1 mm between the incisal or occlusal edge and the aligner. A noticeable unseat referred to a poor aligner fit with a visibly greater gap exceeding 1 mm.
These categories were evaluated under the clinical context of aligner and thermoformed retainer tracking. The possible outputs generated by the system for this parameter were: satisfactory tracking, slight unseat, noticeable unseat, cannot identify, and not applicable.Teeth presenting power ridges or bite ramps were excluded because these built-in aligner features can alter crown contours, and the standard DM workflow instructs users to exclude such teeth to avoid measurement noise. Although the DM algorithms are trained to recognize these features, that functionality was not assessed in this study. Examples of these outputs are shown in Figs. 2 and 3.Fig. 2. Examples of aligner seating conditions captured using the DM ScanBox and DM App, illustrating satisfactory aligner tracking (A), a slight unseat on UL2 (B), and a noticeable unseat on UR2 (C)Fig. 3. Examples of clear aligner features captured using the DM ScanBox and DM App, showing power ridges (A) and bite ramps (B)
Prior to study initiation, DM’s Clinical Affairs team provided standardized training to all participating sites and expert panel members. The objective was to ensure consistent application of the study protocol, diagnostic definitions, and the use of the established reading scale, which has been widely used in radiology research studies [24]. To minimize potential bias, the Clinical Affairs team’s involvement was strictly limited to procedural training and troubleshooting prior to the formal initiation of the study. They were not involved in data selection, or data analysis during the study.
Following the initial instruction, a familiarization phase was conducted to validate investigators’ readiness to perform the assessments. During this phase, the investigators reviewed a set of ten representative mock cases, distinct from the study dataset, to practice labeling procedures, apply study group definitions, and complete data entry according to protocol. Each investigator used a dedicated computer with identical screen specifications, ensuring standardized assessment conditions.
For each case, a DM intraoral picture set consisting of at least eight images from a single scan was processed by the DM software to generate an automated result. The same image set was independently reviewed by a panel of three experts who are US-based orthodontic residents. Each expert provided an assessment for every tooth and parameter, yielding three evaluations per item. These assessments were then examined to establish a panel consensus (Fig. 4). When all three experts agreed, that response was accepted as the final panel result. If two experts provided the same answer, the majority opinion was used. Cases with no agreement (three different answers) were discussed collectively until a consensus was reached.Fig. 4. Workflow for establishing the Ground Truth, including independent expert review, consensus determination, DM result comparison, and external expert adjudication for discrepant cases
The panel’s consensus result was subsequently compared with the DM-generated result. When both matched, the consensus was accepted as the Ground Truth. For cases in which the two results differed, an additional adjudication step was implemented: an external expert, who is an experienced board-certified orthodontist, reviewed the case to determine the final Ground Truth. The external reviewer was provided with detailed definitions and examples of the parameter under evaluation, and the complete intraoral image set.
The final Ground Truth dataset served as the Reference Standard for all statistical analyses.
Statistical analysis
The sensitivity and specificity, with their 95% confidence intervals, were calculated using a Generalized Estimating Equations (GEE) model. This approach accounts for correlations between multiple teeth from the same patient, ensuring robust estimates. The model is particularly suited for this study, where each patient contributes multiple results due to the inclusion of several teeth.
To evaluate the diagnostic performance of DM, results were classified in comparison to the ground truth using a standard 2 × 2 contingency framework. A result was considered a true positive (TP) when the method correctly identified a positive case, and a true negative (TN) when it correctly identified a negative case. A false positive (FP) indicated an incorrect identification of a negative case as positive, while a false negative (FN) referred to a missed positive case.
Based on this classification, sensitivity was calculated as TP/(TP + FN), representing the method’s ability to correctly detect true positive findings. Specificity was calculated as TN/(FP + TN), reflecting the method’s accuracy in identifying true negatives. These metrics were used to quantify the reliability of the system in detecting clinical events.
Positive predictive value (PPV) and negative predictive value (NPV) were also calculated to quantify the probability that positive and negative DM findings corresponded to the ground truth. PPV was computed as TP/(TP + FP), indicating the proportion of positive detections that were truly positive. NPV was calculated as TN/(TN + FN), representing the proportion of negative detections that were truly negative. These values reflect the diagnostic performance of the system within the prevalence distribution of the study sample and provide complementary information on the clinical trustworthiness of positive and negative DM outputs.
Results
The total number of results included in the final statistical analysis was 3,323 results from over 623 patients. The final comparisons and results for the two-level and three-level parameters are presented in Tables 1 and 2.Table 1. Final comparison between DM outputs and ground truth for the two-level and three-level unseat classifications. The upper section (“Seat vs Unseat”) includes all evaluated teeth, with “Positive” representing either slight or noticeable unseat. The lower section (“Noticeable vs slight Unseat”) is a subset of the positive cases from the upper section and reports the differentiation between noticeable and slight unseats only within that positive subsetSeat vs. UnseatResult95% CI lower bound95% CI upper boundSensitivity93.2%91.3%94.7%Specificity86.2%83.4%88.6%Noticeable vs. Slight UnseatResult95% CI lower bound95% CI upper boundSensitivity91.10%85.90%94.50%Specificity90.50%87.80%92.70%Table 2. Final sensitivity and specificity results for the two-level and three-level unseat parameters. The upper section (“Seat vs Unseat”) reports performance across all evaluated teeth, with unseats defined as either slight or noticeable. The lower section (“Noticeable vs slight Unseat”) corresponds to the subset of teeth classified as unseated in the two-level analysis and presents the model’s ability to distinguish noticeable from slight unseats within that subsetSeat vs. UnseatGround truthPositive (noticeable or slight)NegativeTotalDM resultsPositive (noticeable or slight)13601651525Negative10116971798Total146118623323Noticeable vs. Slight UnseatGround truthNoticeableSlightTotalDM resultNoticeable19399292Slight1810501068Total21111491360
For the two-level classification task (Seat vs. Unseat), the model demonstrated a sensitivity of 93.2% (95% CI: 91.3%–94.7%) and a specificity of 86.2% (95% CI: 83.4%–88.6%). Based on the observed prevalence within the study sample, the corresponding positive predictive value (PPV) was 89.2%, while the negative predictive value (NPV) reached 94.4%.
For the three-level comparison between noticeable and slight unseat, the model achieved a sensitivity of 91.1% (95% CI: 85.9%–94.5%) and a specificity of 90.5% (95% CI: 87.8%–92.7%). In this subset, the PPV was 66.1%, and the NPV was 98.3%, reflecting the lower prevalence of noticeable unseats within this dataset.
Finally, to make sure that the study covers all the teeth of a patient’s dentition, it was verified that each tooth number represented at least 1% of the total number analyzed. This distribution is presented in Table 3.Table 3. Distribution of individual teeth included in the analysis, showing the percentage of total assessments contributed by each toothTooth #FrequencyTooth #Frequency115.5%316.4%126.0%325.6%133.1%334.0%142.7%343.3%152.9%352.8%162.2%362.3%172.1%372.0%214.8%416.3%226.2%425.5%233.2%433.3%242.2%443.6%252.6%452.5%262.6%461.9%271.9%472.3%
Descriptive statistics for all recruited subjects who completed the study were performed for gender, age (years), and location (Table 4).Table 4. Distribution of individual teeth included in the analysis, showing the percentage of total assessments contributed by each toothCharacteristic# (%)Age group 13; 22 years of age291 (46.7%) ≥ 22 years of age332 (53.3%)Location Central286 (45.9%) East Coast192 (30.8%) West Coast17 (2.7%) Out of US8 (1.3%) Western85 (13.6%) US - Other35 (5.6%)
Discussion
This study aimed to evaluate the performance of Artificial Intelligence–Driven Remote Monitoring (AIDRM) in detecting aligner tracking issues using two classification schemes: a binary model (seat vs. unseat) and a three-level model (seated, slight unseat, and noticeable unseat). The study cohort included patients from multiple geographic regions across the United States, spanning different age groups and treated with several commercial aligner brands. To minimize positional bias and reflect real-world clinical conditions, all teeth were included in the analysis.
The results demonstrated that DM achieved high diagnostic performance across both classification tasks. In the binary comparison, sensitivity was 93.2% and specificity 86.2%, with a PPV of 89.2% and an NPV of 94.4%. These values indicate that DM was highly reliable in identifying true unseat events and particularly effective at ruling them out. In the three-level model, sensitivity and specificity remained high (91.1% and 90.5%, respectively). The NPV was exceptionally high at 98.3%, reflecting the system’s strong ability to exclude noticeable misfits. The PPV for noticeable unseat was lower (66.1%), which can be explained by the low prevalence of noticeable unseats within the sample and the known dependence of PPV on event prevalence. Together, these findings suggest that DM functions as a highly sensitive, safety-oriented system that minimizes false negatives while preserving strong discriminative ability across different levels of aligner misfit severity.
When compared with the study by Tahir A., meaningful parallels and distinctions emerge [25]. Tahir reported overall detection accuracies exceeding 90% for clear-aligner tracking parameters, but noted reduced sensitivity for subtle findings such as slight unseat (76.8%). This pattern aligns with our multi-level results, in which the noticeable-unseat category showed a lower PPV due to the low prevalence of noticeable cases, while still maintaining high sensitivity and an excellent NPV. In contrast to Tahir’s study, which evaluated a more uniform set of aligner systems, the present study included multiple commercial aligner brands, introducing greater variability and enhancing the generalizability of the findings. Because this was a retrospective study based on randomly selected records, specific information on which aligner brands were used for each patient was not available in the extracted dataset and therefore could not be reported. Although all teeth were included in the analysis to minimize positional bias, tooth-specific performance metrics (e.g., sensitivity or specificity per tooth type) were not computed in this study. Given the clinical relevance of tooth-level variability and prior indications from Tahir’s work that certain teeth may be more challenging to detect, this represents an area for future investigation [18].
The literature indicates that the predictability of OTM with CA is influenced by numerous factors, including the clinician’s experience [8], the nature of tooth movement [5, 26, 27], patient compliance and aligner wear time, the interval between aligner changes, the amount of programmed tooth movement [28], and even the type of aligner used. Studies have shown that with in-house aligners, there is a significant discrepancy between achieved and predicted tooth movements [29]. Additionally, other biological factors such as age, sex, systemic diseases, and medications also play a role [30].
These limitations in expressing predicted tooth movements ultimately pose a challenge in achieving clinically acceptable results and highlight the need for careful consideration and planning in CAT protocols. Given the complexity and variability of these factors, some of which are beyond the clinician’s control, advancements in artificial intelligence and remote monitoring present a significant opportunity for developing individualized aligner change protocols that account for the patient’s biological response [31]. A recent study conducted at Harvard University evaluated the ability of DM to personalize aligner wear time and its impact on treatment duration. It found that patients using DM progressed through aligners more quickly than the standard 7- or 14-day protocols [32]. For the four groups monitored with DM, the average frequency of aligner changes ranged from 4.315 to 5.370 days per tray. The study’s author emphasized that this does not mean every patient can change trays every 4 to 5 days; for instance, one patient took 25 days to receive a “GO” on the 9th aligner. Consequently, this personalization of the treatment schedule not only enhances efficiency but also maintains treatment quality. As found in Tahir A. et al. study, the DM group achieved similar treatment outcomes compared to conventional treatment groups, based on the American Board of Orthodontics (ABO) Discrepancy Index (DI) and Objective Grading System (OGS).
This study has several limitations that should be considered when interpreting the findings. First, the retrospective design may introduce selection bias and limit control over confounding variables. Second, the study was funded by DentalMonitoring as part of an FDA submission, and some investigators were compensated for their time, which introduces a potential conflict of interest. Third, although all teeth were included to minimize positional bias, tooth-specific performance metrics such as sensitivity or specificity by tooth type were not computed and should be evaluated in future work. Fourth, the consistency of the AI’s performance across repeated assessments over the course of treatment was not investigated and longitudinal analyses are needed to confirm consistency and reliability of the findings.
As AIDRM gains in popularity, future research endeavors should meticulously assess the systems’ reliability while examining the ramifications of this innovative approach on patient care quality, experience, and treatment efficiency.
Conclusion
The findings of this study indicate that DM’s AI demonstrates strong diagnostic performance in detecting aligner tracking issues across both binary and three-level classifications. The high sensitivity and high NPV observed in this analysis suggest that the system is particularly effective at identifying and ruling out clinically meaningful unseat events. By providing timely notifications following patient scans, DM supports clinicians in making prompt, informed decisions during treatment.
Given the absence of evidence supporting a universal fixed interval for clear aligner changes, AI-driven remote monitoring offers a valuable opportunity to develop individualized aligner progression protocols. Such an approach has the potential to better reflect each patient’s biological response and may contribute to more efficient and adaptive clear aligner therapy in future clinical applications.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Activation time and material stiffness of sequential removable orthodontic appliances. Part 2: Dental improvements - Pub Med. https://pubmed.ncbi.nlm.nih.gov/14614416/ . Accessed 7 Nov 2024.10.1016/s 0889-5406(03)00577-814614416 · doi ↗ · pubmed ↗
- 2Orthodontic Practitioners’ Knowledge and Education Demand on Clear Aligner Therapy - Science Direct. https://www.sciencedirect.com/science/article/pii/S 002065392300120 X . Accessed 7 Nov 2024.10.1016/j.identj.2023.07.001PMC 1082936037500450 · doi ↗ · pubmed ↗
- 3Sn T. S G. Is remote monitoring a reliable method to assess compliance in clear aligner orthodontic treatment? Pub Med. https://pubmed.ncbi.nlm.nih.gov/34916648/ . Accessed 27 Nov 2025.10.1038/s 41432-021-0231-x 34916648 · doi ↗ · pubmed ↗
- 4Invisalign®: early experiences: Journal of Orthodontics: Vol 30, No 4 - Get Access. https://www.tandfonline.com/doi/full/10.1093/ortho/30.4.348 . Accessed 7 Nov 2024.
- 5Nd MG, Mh RSR. Y. Robust Methods for Real-Time Diabetic Foot Ulcer Detection and Localization on Mobile Devices. Pub Med. https://pubmed.ncbi.nlm.nih.gov/30188841/ . Accessed 27 Nov 2024.10.1109/JBHI.2018.286865630188841 · doi ↗ · pubmed ↗
- 6H MR, Sr M, Mh R et al. Deep learning for caries detection: A systematic review. Pub Med. https://pubmed.ncbi.nlm.nih.gov/35367318/ . Accessed 27 Nov 2024.
- 7N AR AS et al. B R,. The Effectiveness of Artificial Intelligence in Detection of Oral Cancer. Pub Med. https://pubmed.ncbi.nlm.nih.gov/35581039/ . Accessed 27 Nov 2024.
- 8Mh E, Mk HNB et al. B,. Assessing an AI-driven remote monitoring system for reliable clinical photo acquisition and patient usability. Pub Med. Published online 2025. https://pubmed.ncbi.nlm.nih.gov/41006147/ . Accessed 27 Nov 2024.10.1016/j.ejwf.2025.08.00241006147 · doi ↗ · pubmed ↗
