Automated classification of skeletal malocclusion in German orthodontic patients

Eva Paddenberg-Schubert; Kareem Midlej; Sebastian Krohn; Erika Kuchler; Nezar Watted; Peter Proff; Fuad A. Iraqi

PMC · DOI:10.1007/s00784-025-06485-0·August 5, 2025

Automated classification of skeletal malocclusion in German orthodontic patients

Eva Paddenberg-Schubert, Kareem Midlej, Sebastian Krohn, Erika Kuchler, Nezar Watted, Peter Proff, Fuad A. Iraqi

PDF

Open Access

TL;DR

This study shows that AI can accurately classify skeletal malocclusion in German orthodontic patients, potentially streamlining diagnostic workflows.

Contribution

The study introduces AI models that achieve 100% accuracy in classifying skeletal classes using cephalometric data from German patients.

Findings

01

Random forest models achieved 100% accuracy in classifying skeletal classes using all input parameters.

02

Calculated_ANB was identified as the most important parameter for accurate classification.

03

An artificial neural network achieved 95.31% accuracy with all parameters and 100% with Calculated_ANB alone.

Abstract

Precisely diagnosing skeletal class is mandatory for correct orthodontic treatment. Artificial intelligence (AI) could increase efficiency during diagnostics and contribute to automated workflows. So far, no AI-driven process can differentiate between skeletal classes I, II, and III in German orthodontic patients. This prospective cross-sectional study aimed to develop machine- and deep-learning models for diagnosing their skeletal class based on the gold-standard individualised ANB of Panagiotidis and Witt. Orthodontic patients treated in Germany contributed to the study population. Pre-treatment cephalometric parameters, sex, and age served as input variables. Machine-learning models performed were linear discriminant analysis (LDA), random forest (RF), decision tree (DT), K-nearest neighbours (KNN), support vector machine (SVM), Gaussian naïve Bayes (NB), and multi class logistic…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases1

skeletal malocclusion

Figures4

Click any figure to enlarge with its caption.

Performance of the deep-learning models to classify the patients as skeletal class I, II or III, including training loss function, validation loss, training accuracy and validation accuracy. In (A) the model includes all cephalometric variables, sex and age. In (B) the model is based on Calculated_ANB only. C shows the performance of model based on the variables SNA, SNB and ML-NSL.

Confusion matrices illustrating the agreement between the gold standard in diagnosing skeletal class (i.e., Calculated_ANB) and other classification methods- the Wits appraisal (A), and iANB_KP (B). Furthermore, Cohen Kappa describes the agreement between the methods included in each figure

Funding1

—Universität Regensburg (3161)

Keywords

Skeletal malocclusionArtificial intelligenceOrthodontic diagnosticsCephalometric analysisIndividualized ANBOrthodontic treatment planning

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsOrthodontics and Dentofacial Orthopedics · Dental Radiography and Imaging · Temporomandibular Joint Disorders

Full text

Introduction

Within the German population growing patients and adults present a high demand and need for orthodontic treatment to correct various types of malocclusion or dysgnathia [1, 2]. Prior to the actual orthodontic treatment, precise diagnoses are mandatory to allow for personalised treatment plans [3]. Therefore, various diagnostic procedures are necessary, often including, among others, lateral cephalograms and their analyses, which are comprised of skeletal, dental and soft-tissue parameters in the sagittal and vertical dimension.

One of the skeletal sagittal variables of high importance is skeletal class, describing the antero-posterior relation of the upper and lower jaw base. There is a variety of possibilities to determine an individual’s skeletal class. Although the method of measuring the ANB angle, introduced by Riedel [4], is well established and used by many orthodontists, this procedure bears the risk of distortion of the true skeletal class due to geometric dependencies of the anatomic structures concerned and the corresponding reference points [5]. Hence, both different cephalometric parameters and various individualisation methods, including graphical solutions and regression equations, have been developed to increase the precision in defining the true sagittal relation of the maxilla and mandible of the individual patient [5–8]. The advantage of individualising cephalometric norm values for each patient is the increased precision in diagnosis and therefore in treatment planning. Within this context, one frequently used regression equation is the individualised ANB, which was introduced by Panagiotidis and Witt [7]. Their method was updated and optimised by Paddenberg and Kirschneck and will be referred to as iANB_KP in this manuscript [5]. Another well spread technique applied to determine a patient’s skeletal class, which will be evaluated in this study, is the Wits appraisal, which was established by Jacobson [6].

In orthodontics, as well as in other fields of dentistry, artificial intelligence (AI) is of increasing interest and steadily improves [9–11]. Frequent applications of AI in orthodontics include, for example, the automated identification of cephalometric landmarks [12, 13]. Other implications include the prediction of the duration of orthodontic treatment [14] or assist the practitioner in treatment decisions, such as with extractions of teeth [15]. However, many AI systems still require further improvement and cannot substitute the orthodontist yet [15]. So far, various AI-models have been described to conduct cephalometric analyses [16, 17]. Some studies specifically focus on the automated determination of patients’ skeletal class. For example, Nan et al. established a deep-learning method in Chinese children for the classification of skeletal class I, II and III, and validated it by comparing it to their gold standard, which was a combination of ANB and Wits appraisal [18]. Ueda et al. evaluated, next to vertical parameters, the performance of various AI-models for the determination skeletal class based on ANB [19]. They considered only Japanese adults in their analysis. Midlej et al. assessed various machine-learning models to correctly classify Arab patients as skeletal class II and III [20] or as skeletal class I and II [21].

However, to the best of our knowledge there is no study available, which evaluates machine-and deep-learning models to precisely diagnose the skeletal class of German orthodontic patients of all ages. Hence, the main aim of this prospective cross-sectional study was the development and assessment of the performance of machine- and deep-learning models to correctly differentiate between skeletal class I, II and III in German orthodontic patients of all ages more accurately compared to the gold standard individualised ANB [7].

Materials and methods

Cephalometric parameters

In this study, 24 cephalometric parameters as well as the covariates sex and age were included in the analysis. The cephalometric parameters included in this study are summarised in Supplementary Tables 1 and Supplementary Fig. 1.

Inclusion and exclusion

Patients, receiving orthodontic treatment between 03/2023 and 12/2024 were screened for inclusion. All patients of the study centres were considered for inclusion, irrespective of their malocclusion, sex or age.

Inclusion criteria

Availability of pre-treatment lateral cephalograms with scale.
Skeletal class I, II or III according to the Calculated_ANB, which is defined as ANB angle- individualised ANB of Panagiotidis and Witt [7]:

Skeletal class I (−1.5 < Calculated_ANB < 1.5).
Skeletal class II (Calculated_ANB > 1.5).
Skeletal class III (Calculated_ANB<−1.5).
3.Demographic data including information about sex and age.

Exclusion criteria

Patients/guardians who were not able or refused to sign an informed consent form.
Missing pre-treatment lateral cephalogram with scale and/or information about sex and age.

Data analysis

High-resolution cephalograms of each patient were imported without loss of resolution using TIF format into Ivoris^®^ analyze pro (Computer konkret AG, Falkenstein, Germany; version 8.2.83.130). Precise calibration was performed within the software environment prior to analysis to ensure metric accuracy. Prior to the data analysis, interrater- and intrarater - reliability of the cephalometric analysis was tested by validating 50 randomly chosen lateral cephalograms, which were evaluated twice by two different raters (SK, EP). Besides, the same rater evaluated the images with interval of minimum two weeks. Data analysis was applied with Python software using the Google Colab platform [22]. In this study, different classification algorithms, including linear discriminant analysis (LDA), random forest (RF), decision tree (DT), K-nearest neighbors (KNN), support vector machine (SVM), and Gaussian naïve Bayes (NB), and multi-class logistic regression (MCLR) were used. All these machine-learning models were applied using the Scikit-Learn Python package [23]. Besides, a deep-learning model, more precisely an artificial neural network (ANN) was performed to classify the patients as skeletal class I, II, or III patterns based on the cephalometric parameters and the covariates sex and age. The ANN provides a means for dealing with complex pattern-oriented problems of categorisation and time-series (trend analysis) types [24].

Further analyses consisted of a comparison between different classification methods, i.e., between the gold standard Calculated_ANB and the alternative methods Wits appraisal [6] and iANB_KP [5], by a confusion matrix and Cohen’s Kappa. Finally, Wits appraisal was used to re-classify the study population and repeat both the machine- and deep-learning methods.

Sample size

This study enrolled the maximum available cases within the period of patient enrolment (03/2023-12/2024). Besides, the current sample size achieved the expected accuracy results on the validation data (20% of the data).

Data preprocessing

In the current study, the downsampling process was used to balance the classification groups, followed by the Python function- Random UnderSampler, where samples from the targeted classes are removed randomly [25]. In addition, due to the different scales of the parameters and variables, data was standardised before applying the models.

Machine-learning algorithms

For each model, based on the random test data that was selected using Python software code, that ensured reproducibility by using a fixed random seed (20% unseen data), we calculated the accuracy, as well as the precision, recall and F1 score. Accuracy determines the numbers in accordance with the actual and the predicted skeletal class using the complete data set. The F1 score evaluates the accuracy of a machine-learning model, but, contrary to accuracy, it focuses on the accuracy for each class (skeletal class I, II, III), which is of advantage in case of unequal sizes of the classes. Calculation of the F1 score is based on the precision and recall. Precision is also called positive predictive value, and recall expresses the sensitivity of the machine-learning model. Furthermore, the importance of the input parameters was determined and used to develop machine-learning models based on the most important variables only.

Applying both machine learning & deep learning models

In this study, we aimed to apply both machine learning and deep learning models. Machine learning needs algorithms, learn from that data and then classify the data, while deep learning consists of many hidden layers and multiple neurons per layer. Besides, deep learning takes a large amount of data while machine learning needs a small amount of data to work and arrive at a conclusion [26]. In this study we used both tools (i.e., machine learning and deep learning) to examine the ability of each tool to classify patients. Although this study demonstrated the ability of machine learning models to classify patients with very high accuracy, we also applied deep learning as an introduction to a general model that will include more data from our population and from other populations. In other words, the deep learning models introduced in this study will be a basis for more general and complex models in further studies.

Results

This study consisted of the coded data of 1277 German patients diagnosed as skeletal class I (n = 623, 48.79%), II (n = 352, 27.56%) or III (n = 302, 23.64%). Both interrater (0.92 to 0.99) and intrarater reliability (0.90 to 0.99) were almost perfect as required for subsequent data analysis. Among skeletal class I patients, mean age was 13 (M = 13, SD = 3.8), and most were females (n = 341, 54.73%). The mean age of skeletal class II patients was also 13 (M = 13, SD = 6), and 56.53% (n = 199, 56.53%) were females. Finally, among skeletal class III patients, the mean age was 14 (M = 14, SD = 5.6), and here also, 53.64% were females (n = 162, 53.64%). The full results of the demographic variables and the cephalometric parameters descriptive statistics are detailed in Table 1. After the downsampling process, the sample size consisted of 956 patients- 302 skeletal class I patients, 352 skeletal class II patients, and 302 skeletal class III patients. 80% (n = 764) of the total study population after the downsampling were used as a training set to develop the AI-models.

Table 1. Descriptive statistics of the skeletal and dental cephalometric parameters for skeletal class I, II, and III - sample size (N), mean, and standard deviationSkeletal classIIIIIIVariable N MeanSD N MeanSD N MeanSDAge623133.8352136302145.6Sex - Females341 (54.73%)199 (56.53%)162 (53.64%)NL/ML623245.9352225.7302245.6NL/NSL6237.43.43528.43.23026.93.2PFH/AFH623675352675.1302675.1Gonial angle6231226.23521196.73021256.5Facial axis623904.3352894.1302924.5SNA angle623813.7352813.4302813.8SNB angle623783.1352752.9302813.5ANB angle6233.61.63526.21.73020.422.1ANBind6233.61.43523.51.53023.51.5Calculated_ANB6230.0190.843522.81302−3.11.6SN-Ba6231324.93521334.63021305.2SNP-g623793.2352763302823.6SN (mm)623663.9352663.9302663.9GoMe (mm)623675.1352655.2302695.8Wits6230.122.23523.52.3302−3.92.8ML-NSL623315.9352315.9302316+ 1/NL angle623688.33527110302657.1+ 1/SNL angle623768.53527911302727.3+ 1/NA angle623238.33522011302276.8+ 1/NA (mm)6233.72.73522.23.330252.4−1/ML623846.7352807.3302897.4−1/NB angle623256.9352267.6302227.1−1/NB (mm)6233.82.43524.12.43022.92.3Interincisal angle623128123521281430213011

General machine-learning models

The general machine-learning model included 24 cephalometric parameters as well as sex and age as covariates. Among the seven machine-learning models performed, random forest was the most accurate to classify the patients as skeletal I, II or III. In this model, the results showed 100% accuracy. The model showed 100% precision, recall and F1 scores in each class. Concerning the other models, the results showed 99% accuracy in the decision tree, 98% accuracy in the support vector machine and multi-class logistic regression models, 93% accuracy in the linear discriminant analysis model, 91% accuracy in the Gaussian naïve Bayes model and 85% accuracy in the k-nearest neighbours model. The detailed results are presented in Table 2.

Table 2. Performance of different machine-learning models in predicting skeletal class I, II, and III - Linear discriminant analysis, random forest, decision tree, K-nearest neighbours, support vector machine, Gaussian Naïve bayes, and multi-class logistic regression. For every model, precision, recall, F1 score, and accuracy scores are presentedIncluded parameters: 24 cephalometries & 2 covariatesIncluded parameters: Calculated_ANB (ANB – ANB_ind_)Included parameters: SNA, SNB & ML-NSL anglesClassPrecisionRecallF1scorePrecisionRecallF1scorePrecisionRecallF1scoreLinear Discriminant AnalysisI0.870.910.890.980.960.970.960.960.96II0.990.950.970.971.000.990.971.000.99III0.930.930.931.000.980.991.000.970.98Accuracy0.930.980.98Random ForestI1.001.001.00---1.000.980.99II1.001.001.00---1.001.001.00III1.001.001.00---0.981.000.99Accuracy1.00-0.99Decision TreeI1.000.980.991.000.980.990.980.980.98II1.001.001.001.001.001.001.001.001.00III0.981.000.990.981.000.990.980.980.98Accuracy0.990.990.99K – Nearst NeighborsI0.770.720.751.001.001.001.000.980.99II0.900.950.921.001.001.001.001.001.00III0.860.860.861.001.001.000.981.000.99Accuracy0.851.000.99Support Vector MachineI0.960.960.961.000.980.991.000.960.98II0.990.990.991.001.001.001.001.001.00III0.980.980.980.981.000.990.971.000.98Accuracy0.980.990.99Gaussian Naive BayesI0.830.880.851.000.910.951.000.910.95II0.970.960.970.961.000.980.971.000.99III0.910.880.900.971.000.980.951.000.98Accuracy0.910.970.97Multi Class Logistic RegressionI0.960.960.961.000.930.961.000.910.95II1.000.970.990.971.000.990.971.000.99III0.971.000.980.971.000.980.951.000.98Accuracy0.980.980.97

Machine-learning parameters are important according to the most accurate general machine-learning model

In the next section, the importance of each input parameter in the most accurate model, i.e., the random forest model, which included 24 cephalometric parameters as well as sex and age, was examined. The results shown in Table 3 demonstrate that the essential variable contributing to the model was Calculated_ANB (0.429), followed by the parameters’ ANB, Wits appraisal, SNB, SNPg and − 1/ML, respectively.Table 3. Importance of the 24 cephalometric parameters and the covariates sex and age to classify an individual as skeletal class I, II, or III in the random forest machine-learning modelVariableImportanceCalculated_ANB0.429ANB0.164Wits0.131SNB0.045SNPg0.042-1/ML0.024Gonial angle0.020ANBind0.014+1/NA (mm)0.012+1/SN0.012+1/NA0.012ML-NSL0.011SNA0.009NL/ML0.009PFH/AFH0.009Go-Me (mm)0.008age0.007facial Axis0.006-1/NB0.006SN-Ba0.005+1/NL0.005-1/NB (mm)0.005interincisal angle0.005S-N (mm)0.005NL/NSL0.004Gender0.001

Calculated_ANB as a single predictor

According to the previous section, Calculated_ANB was the most important variable with a score of 0.429, and remarkably higher than the subsequent parameters. Therefore, in this section machine-learning models with Calculated_ANB as the only predictor were applied. In all models, accuracy was 97% or higher. For example, the k-nearest neighbours model was able to classify the patients with 100% accuracy, and perfect precision, recall and F1 scores. In addition, the decision tree model was able to classify the patients with 99% accuracy, with perfect precision among skeletal class I, and II, and 98% precision among skeletal class III patients (Table 2). To understand the decision tree limits for each class, the decision tree for Calculated_ANB is visualised in Fig. 1A. The decision tree demonstrates that patients were categorised as skeletal class II, when Calculated_ANB was greater than 1.27. Besides, the patients were categorized as skeletal class I if Calculated_ANB ranged between − 1.25 to + 1.27 and as skeletal class III if Calculated_ANB was smaller than − 1.25. The Gini index, which ranges from 0 to 1 and describes the inequality among a distribution, clarifies that along the decision tree, equality of the (remaining) sample increased.Fig. 1A presents the decision tree for the Calculated_ANB. Each node is labelled with the value of Calculated_ANB. In addition, in each node, the number of patients in each class is presented (samples) as well as the Gini score of purity (between 0 to 1). B presents the decision tree for the ANB angle

Machine-learning models based on the parameters that define the Calculated_ANB (SNA, SNB, ML-NSL)

In this section, we applied machine-learning models based on the parameters needed to determine the Calculated_ANB, i.e., SNA, SNB, and ML-NSL. The results show that all models could classify the patients with at least 97% accuracy. The best models were random forest, decision tree, k-nearest neighbors, and support vector machine models with 99% accuracy. The full results of all models are described in Table 2.

Can we gain accurate classification from the ANB angle (i.e., without the Calculated_ANB)? (Stepwise forward machine-learning models)

In the previous section, it was shown that the Calculated_ANB could predict an individual’s skeletal class precisely. This section aimed to examine machine-learning models, which do not include Calculated_ANB as an input parameter. Therefore, the importances of the input variables from the general model (Table 3) were used to perform a forward models’ system, which adds one more variable every time according to their importance. The results are presented in Table 4A-B. The results show that when including ANB angle only, the accuracy ranged between 71% in the decision tree model and 76% in the Gaussian naïve Bayes. The ANB angle decision tree, which is visualised in Fig. 1B, demonstrates the decision tree limits for each class. The decision tree faced difficulty in categorising the patients, which is also represented by the higher Gini scores in the lower nodes of the tree. In the next level, the parameter Wits appraisal was added to the ANB angle before applying the same machine-learning models. Here, accuracy ranged from 79% in the decision tree model to 82% in the model’s linear discriminant analysis and Gaussian naïve Bayes. The model’s accuracy improved by continuing the step-forward process and adding the next important parameter each time. This process was stopped when accuracy reached 92% in the model decision tree, support vector machine, and multi-class logistic regression with the input variables ANB angle, Wits appraisal, SNB, SNPg, −1/ML, and Gonial angle.Table 4A-B: Precision, recall, F1 score, and accuracy of the machine-learning model's Linear discriminant analysis, random forest, decision tree, K-nearest neighbours, support vector machine, Gaussian naïve Bayes, and multi-class logistic regressionAIncluded parameters: ANB angleIncluded parameters: ANB angle and Wits appraisalIncluded parameters: ANB angle, Wits appraisal and SNB angleIncluded parameters: ANB angle, Wits appraisal, SNB and SNPg anglesClassPrecisionRecallF1 scorePrecisionRecallF1 scorePrecisionRecallF1 scorePrecisionRecallF1 scoreLinear Discriminant Analysis10.560.540.550.690.740.710.760.740.750.740.790.7620.790.830.810.880.840.860.860.910.880.880.880.8830.840.810.830.900.880.890.930.900.910.950.880.91Accuracy0.740.820.85****0.85Random Forest1------0.750.700.730.760.770.772------0.870.910.890.900.910.903------0.880.900.890.910.880.90Accuracy**--0.840.86Decision Tree10.520.490.500.640.680.660.720.630.670.750.740.7420.830.750.790.880.800.840.820.890.860.860.920.8930.750.880.810.840.880.860.900.900.900.930.860.89Accuracy0.710.790.820.85K– Nearst Neighbors10.560.530.540.660.580.620.730.700.710.750.740.7420.790.790.790.860.830.850.870.880.880.890.890.8930.820.860.840.800.930.860.880.900.890.880.900.89Accuracy0.730.790.830.85Support Vector Machine10.560.510.530.680.680.680.750.700.730.750.740.7420.790.830.810.880.840.860.860.910.880.880.890.8930.820.830.820.850.900.880.900.900.900.900.900.90Accuracy0.730.810.840.85Gaussian Naive Bayes10.580.630.610.690.740.710.750.750.750.770.770.7720.790.820.810.880.840.860.870.890.880.880.910.9030.900.800.850.900.880.890.930.900.910.930.900.91Accuracy0.760.820.850.86Multi Class Logistic Regression10.550.510.530.670.670.670.750.720.730.780.750.7720.790.830.810.880.840.860.870.890.880.900.920.9130.810.810.810.840.880.860.900.900.900.900.900.90Accuracy0.730.800.840.86BIncluded parameters: ANB angle, Wits appraisal, SNB, SNPg and -1/MLIncluded parameters: ANB angle, Wits appraisal, SNB, SNPg, -1/ML and Gonial angleClassPrecisionRecallF1 scorePrecisionRecallF1 scoreLinear Discriminant Analysis10.810.880.840.800.890.8420.910.950.930.940.950.9431.000.860.930.980.850.91Accuracy0.900.90Random Forest10.820.880.850.870.850.8520.910.950.930.910.950.9331.000.880.940.950.950.95Accuracy0.910.91Decision Tree10.780.880.830.860.880.8720.910.950.930.930.930.9331.000.830.910.970.950.96Accuracy0.890.92K– Nearst Neighbors10.810.820.820.820.860.8420.890.920.900.920.950.9430.980.920.950.960.880.92Accuracy0.890.90Support Vector Machine10.870.840.860.850.880.8620.910.950.930.940.950.9430.970.950.960.960.920.94Accuracy0.920.92Gaussian Naive Bayes10.860.880.870.830.880.8520.910.960.940.940.950.9431.000.920.960.960.900.93Accuracy0.920.91Multi Class Logistic Regression10.850.790.820.860.880.8720.890.950.920.950.950.9530.950.930.940.950.930.94Accuracy0.900.92**

Deep-learning algorithms (ANN)

General deep-learning model

A general deep-learning model, an ANN, which included all cephalometric parameters as well as the covariates sex and age, was performed. The results demonstrate the ability of this model to predict an individual’s skeletal class with an accuracy of 95.31% in the validation data (training-accuracy = 98.17%, validation-accuracy = 95.31%). The performance of the model is presented in Fig. 2A.Fig. 2. Performance of the deep-learning models to classify the patients as skeletal class I, II or III, including training loss function, validation loss, training accuracy and validation accuracy. In (A) the model includes all cephalometric variables, sex and age. In (B) the model is based on Calculated_ANB only. C shows the performance of model based on the variables SNA, SNB and ML-NSL.

The deep-learning model that included Calculated_ANB as a single predictor

According to the results of the importance of input variables for the machine-learning model random forest, Calculated_ANB was the most crucial variable. The results of the deep-learning model, which included only Calculated_ANB as a predictor, demonstrate 100% validation accuracy (training-accuracy = 98.42%, validation-accuracy = 100%) (Fig. 2B).

Deep-learning models based on the parameters needed for the Calculated_ANB (SNA, SNB, and ML-NSL)

In the following model, the parameters, which are required to determine Calculated_ANB, i.e., the angles SNA, SNB and ML-NSL, were included. The model’s validation-accuracy was 97.40% (training-accuracy = 96.55%, validation-accuracy = 97.40%) (Fig. 2C).

ANB measured angle power to accurately classify orthodontic patients (Stepwise forward machine-learning models)

In the previous sections, the results showed a high accuracy when incorporating the Calculated_ANB in the deep-learning models. In addition, almost similar accuracy was achieved when using the parameters incorporated in the Calculated_ANB. In contrast, the results revealed insufficient accuracy of deep-learning models based on the ANB angle only. Furthermore, the addition of the next important variables, according to the random forest model presented in Table 3, did not significantly improve the accuracy of the models.

The use of different classifiers

In this study, Calculated_ANB was used as the gold standard to classify patients as skeletal class I, II and III. The agreement between the gold standard and other classifying methods was visualised applying a confusion matrix and Cohen’s Kappa. The agreement between the methods Calculated_ANB and Wits appraisal was moderate, with a Kappa score of 0.52. The confusion matrix in Fig. 3A illustrates that 391 (n = 391, 62.76%) patients were diagnosed as skeletal class I by both methods. 294 (n = 294, 83.2%) cases were diagnosed as skeletal class II, and 201 (n = 201, 66.55%) patients were classified as skeletal class III by both methods. Finally, the agreement between the methods Calculated_ANB and iANB_KP was 0.51. Here, the confusion matrix in Fig. 3B demonstrates that 467 (n = 467, 74.95%) patients were diagnosed as skeletal class I, and 286 (n = 286, 81.25%) individuals were classified as skeletal class II by both methods. In contrast, only 140 (n = 140, 46.35%) patients were diagnosed as skeletal class III by these two methods.Fig. 3. Confusion matrices illustrating the agreement between the gold standard in diagnosing skeletal class (i.e., Calculated_ANB) and other classification methods- the Wits appraisal (A), and iANB_KP (B). Furthermore, Cohen Kappa describes the agreement between the methods included in each figure

The performance of machine-learning & deep-learning algorithms based on other classification methods

The same machine-learning and deep-learning algorithms described above were repeated using the sample’s classification with the Wits appraisal. In the general models, which included all parameters, the random forest and decision tree models could classify the patients with 100% accuracy. Comparing the importance of all input variables in the random forest model, Wits appraisal was the most critical parameter, followed by Calculated_ANB, ANB, and − 1/ML, respectively (Table 5). Including only the most important input parameter, i.e. Wits appraisal, resulted in 100% accuracy in the decision tree and k-nearest neighbours models. Regarding the deep-learning models (ANN) based on this classification method, the general deep-learning model, which was based on all parameters, achieved a validation-accuracy of 94.09% (training-accuracy = 98.09%, validation-accuracy = 94.09%). The deep-learning models, which considered Wits appraisal only, resulted in a higher validation accuracy, being 99.51% (training-accuracy = 99.04%, validation-accuracy = 99.51%).

Table 5. Importance of the 24 cephalometric parameters and the covariates sex and age to classify an individual as skeletal class I, II, or III in the random forest machine-learning model. This model was based on a sample, which was classified according to the wits appraisalVariableImportanceWits appraisal0.53Calculated_ANB0.13ANB0.08−1/ML0.02SNB0.02SNPg0.02+ 1/NA (mm)0.02Gonial angle0.01PFH/AFH0.01+ 1/NA0.01Age0.01SNA0.01Go-Me (mm)0.01ANB indiv0.01+ 1/SN0.01ML-NSL0.01NL/NSL0.01S-N (mm)0.01−1/NB0.01−1/NB (mm)0.01Facial axis0.01NL/ML0.01SN-Ba0.01+ 1/NL0.01Interincisal angle0.01Sex0.00

Discussion

This prospective cross-sectional study aimed to develop both a machine-learning and a deep-learning model to correctly classify German orthodontic patients of all ages and sexes as skeletal class I, II or III. The machine- and deep-learning models, including all input parameters, achieved 100% and 95.31% accuracy, respectively. Secondary outcomes included the comparison of different classification methods (Wits appraisal, iANB_KP) as well as the repetition of the AI-methods for an alternative classification of the sample (Wits appraisal). Here, the general models of the machine- and deep-learning with all input variables resulted in an accuracy of 100% and 94.09%, respectively.

The study population was inhomogenously distributed with respect to skeletal class I, II and III, but representative of the German patients at the study centres. Based on Calculated_ANB, the majority of the participants had skeletal class I (48.79%), whereas 27.56% and 23.64% presented a class II and III, respectively. Depending on the population assessed, these distributions of skeletal class vary. For example, in a group of primarily Italian children skeletal class I was present in 49.14%, similar to skeletal class II (47.43%), whereas skeletal class III was found only in 3.43% [27]. Although a variety of classification methods is described in the literature, the primary outcome of this study was based on classifying the participants according to the Calculated_ANB, as reported by Panagiotidis and Witt [7]. This method was chosen due to increased precision in diagnostics [5, 7] and because of the ability to compare it with other publications [20]. Among all skeletal classes, females slightly dominated, but not to a clinically relevant extent (< 7%). According to the mean age of 13 and 14 years in class I and II, and III, respectively, most patients were adolescents. The patients in this investigations were younger than the Arab cohort of class II and III patients with mean ages of 17 and 18 years, respectively [20].

Various machine-learning models were performed with partly variable performances. When all input parameters were included, i.e. all cephalometric variables as well as sex and age, the RF model performed best with 100% accuracy, and 100% precision, recall and F1 scores in each class. This accuracy was clearly higher than the 87% accuracy achieved in machine-learning models for skeletal class I and II diagnosis in Arabs [21], but comparable to the 99% accuracy for skeletal class II and III classification in Arab patients [20]. Within the general model, the most important parameter was Calculated_ANB, followed by ANB, Wits appraisal, SNB, SNPg and − 1/ML, respectively. Interestingly, all of these skeletal parameters are sagittal ones, as is skeletal class. It appears, that the sagittal parameters of the mandible (SNB, SNPg) have a bigger impact on the automated classification than those of the maxilla. Among the most crucial variables, there is one dentoalveolar parameter, −1/ML. The importance of the inclination of the lower incisors’ could be explained by its relation with skeletal class, which is expressed in floating norms for the inclination and position of these teeth [28, 29].

The machine-learning models, based on Calculated_ANB only, resulted in 97–100% accuracy. Within the model DT, the lower and upper limits of Calculated_ANB for skeletal class I were − 1.25 ° and + 1.27 °, respectively. Compared to the manually applied limits in the gold standard (± 1.5 °) the AI-method used slightly narrower limits to achieve the high accuracy of 99% and final Gini scores of 0, indicating perfect equality of the automatically classified groups. Increasing the number of input variables by considering all those needed to determine Calculated_ANB, led to slightly less accuracy, ranging between 97 and 99%. In a work that was done by Suryakumar et al. [30] about the critical dimension in data mining, and demonstrated that shown that the phenomenon of critical dimension indeed exists for many datasets, and in each dataset was found that the elimination of irrelevant feature can gain a higher accuracy. In contrast, including ANB only, resulted in clearly lower accuracy (71–76%), which identifies this model as inappropriate to determine an individual’s skeletal class. This observation could be explained by the imprecise diagnosis of this parameter in case the maxillary prognathism (SNA) and/or mandible’s inclination deviate from the empirical norm values [7]. Hence, from our results, it can be recommended to include Calculated_ANB in the machine-learning model to avoid a (clearly) noticeable loss of accuracy.

The general deep-learning model (ANN) presented almost perfect accuracy during validation (95.31%), which was even better (100% validation-accuracy) with Calculated_ANB only. This accuracy was comparable with the mean-accuracies of 90.50–95.70% reported for the CNN-models of Yu et al. [31]. Similar to the machine-learning model, ANB as a single predictor achieved insufficient accuracy in the deep-learning algorithm. The advantage of the deep-learning model compared to the machine-learning algorithm is.

Comparing the gold standard of this study, Calculated_ANB, with two other classification methods, Wits appraisal and iANB_KP, resulted only in moderate accuracy (Kappa 0.52 and 0.51, respectively). This clarifies the need to carefully look at the classification method chosen, when comparing the data with other publications. For example, Nan et al. used a combination of ANB and Wits appraisal [18], which could have led to a different distribution within their study collective. The advantage of iANB_KP [5] is the update for a contemporary population compared to Calculated_ANB [7]. Wits appraisal is based on a linear measurement and frequently used [6, 18, 31], but does not present an individualised norm, which could potentially have a negative impact on the precision. AI models based on the sample’s classification according to the Wits appraisal resulted in perfect accuracy (100%) in the general model with complete input variables as well as with Wits appraisal, the most important parameter in this model only. This observation is identical to the 100% accuracy reported for Calculated_ANB-classified samples. Also comparable results were reached with the deep-learning models, which considered all variables (validation-accuracy 94.09%) or Wits appraisal only (validation-accuracy 99.51%).

Limitations

The limits chosen to differentiate between skeletal class I, II, and III (1.5 °) were slightly different from those suggested by Panagiotidis and Witt (1 °) [7], which should be considered during comparison with other studies. This definition was applied to exclude borderline cases from the skeletal classes II and III groups. Another limitation is the missing assessment of patients’ ethnicity. However, since only German study centers were used for recruitment, it appears unlikely that many patients presented different ethnic groups. Furthermore, the study size of 1277 patients should be increased in future investigations, allowing for validation of the AI algorithms presented even in a bigger population. Finally, the cephalometric analysis used as the gold standard could be incorrect. But, interrater and intrarater reliability testing performed revealed reproducible measurements.

Conclusion

Machine- and deep-learning models were successfully developed for classifying German orthodontic patients as skeletal class I, II and III. These findings could be implemented in a digital and automated workflow to relieve the orthodontic practitioner, and integrated along with the available tools and clinical experience, for a fast, efficient and accurate diagnosis. Depending on the gold standard applied to classify the data pool, few cephalometric parameters are especially crucial for automated classification, which clarifies the need for precise cephalometric analyses. Future studies should aim to increase the study size and validate the data in bigger cohorts. Besides, a user-friendly application that will be based on the models presented in this study, should be examined in a real-world orthodontist’s clinic, while adjusting these applications to better help the orthodontists’ preferences and needs.

Supplementary Information

Below is the link to the electronic supplementary material.ESM 1(PDF 201 KB)

Bibliography2

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Google (2025) Google Colaboratory. https://colab.research.google.com/. Accessed 8 Apr 2025
2Imbalanced-learn (2024) Under-sampling. https://imbalanced-learn.org/stable/under_sampling.html. Accessed 15 Mar 2025