Orthodontic treatment outcome predictive performance differences between artificial intelligence and conventional methods
ABSTRACT
Objectives
To evaluate an artificial intelligence (AI) model in predicting soft tissue and alveolar bone changes following orthodontic treatment and compare the predictive performance of the AI model with conventional prediction models.
Materials and Methods
A total of 1774 lateral cephalograms of 887 adult patients who had undergone orthodontic treatment were collected. Patients who had orthognathic surgery were excluded. On each cephalogram, 78 landmarks were detected using PIPNet-based AI. Prediction models consisted of 132 predictor variables and 88 outcome variables. Predictor variables were demographics (age, sex), clinical (treatment time, premolar extraction), and Cartesian coordinates of the 64 anatomic landmarks. Outcome variables were Cartesian coordinates of the 22 soft tissue and 22 hard tissue landmarks after orthodontic treatment. The AI prediction model was based on the TabNet deep neural network. Two conventional statistical methods, multivariate multiple linear regression (MMLR) and partial least squares regression (PLSR), were each implemented for comparison. Prediction accuracy among the methods was compared.
Results
Overall, MMLR demonstrated the most accurate results, while AI was least accurate. AI showed superior predictions in only 5 of the 44 anatomic landmarks, all of which were soft tissue landmarks inferior to menton to the terminal point of the neck.
Conclusions
When predicting changes following orthodontic treatment, AI was not as effective as conventional statistical methods. However, AI had an outstanding advantage in predicting soft tissue landmarks with substantial variability. Overall, results may indicate the need for a hybrid prediction model that combines conventional and AI methods.
INTRODUCTION
With the advent of high-speed computer technology, the use of artificial intelligence (AI) in research has become popular. With a wealth of new AI literature emerging, orthodontics is no exception to the rapid influx of AI. Recently, several studies have used AI to predict facial soft tissue changes following orthodontic treatment. However, these studies appear simply to repeat, with AI, analyses that had already been performed using conventional statistical methods (Table 1).1–3 Since AI requires significant computing resources even for simple tasks, it must demonstrate superior effectiveness over traditional methods to justify its use. If an AI model does not perform better than conventional methods, it is impractical and unnecessarily costly. To determine its practicality, it is therefore necessary to compare the accuracy of AI predictions with that of traditional methods.

To develop a clinically applicable method for predicting smooth soft tissue curves, it is necessary to analyze multiple predictor and outcome (response) variables of the soft tissue landmarks.4,5 Multivariate multiple linear regression (MMLR), which produces the ordinary least squares estimator, is one of the conventional methods available to do so.6 However, MMLR has limitations when there are numerous variables and those variables are significantly correlated. Partial least squares regression (PLSR) applies dimension reduction through latent variable modeling and has been preferred because it provided more accurate prediction results after combined surgical-orthodontic treatment5,7–10 and in predicting facial growth changes.11
In developing AI models, deep-learning algorithms based on convolutional neural network (CNN) architecture have been the most popular. A more recent deep-learning architecture designed for tabular data, the TabNet deep neural network (DNN), has been used to develop an individualized facial growth prediction model12 and to predict soft tissue changes following orthognathic surgery.13 TabNet DNN can model complex nonlinear relationships, incorporating multiple predictors and outcome variables.14 Previously, AI showed effectiveness in automatically identifying cephalometric landmarks and in subsequent analyses15–20 and in predicting facial growth in growing children.12 Contrary to the initial assumption that AI could provide a universal solution to diverse challenges, AI has not always been effective. For example, when predicting soft tissue changes following orthognathic surgery, the AI prediction was not as effective as the conventional PLSR method, particularly when predicting areas with small surgical changes.13
The purpose of this study was to develop and evaluate an AI model for predicting changes following orthodontic treatment. The specific aim was to compare the predictive performance of the AI prediction model with conventional prediction methods, MMLR and PLSR.
MATERIALS AND METHODS
Subjects
The institutional review board for the protection of human subjects of the Seoul National University School of Dentistry approved the research protocol (S-D20200036).
The subjects were 887 patients (604 females and 283 males; mean age = 24.2 ± 8.5 years) who had undergone orthodontic treatment at the Department of Orthodontics, Seoul National University Dental Hospital, Seoul, Korea, from January 2013 to December 2022. The inclusion criteria were (1) females aged > 15 years, males > 17 years, to exclude subjects during major growth spurts, and (2) treated with comprehensive orthodontic treatment using fixed appliances. The exclusion criteria were (1) a history of orthognathic surgery and (2) the presence of craniofacial syndromes.
Predictors and Outcome Variables
For all patients, lateral cephalograms taken before (T1) and after (T2) orthodontic treatment were collected. On 1774 images from 887 patients, 46 skeletal and 32 soft tissue landmarks were identified using automated landmark detection software (Ceppro, DDH Inc, Seoul, Korea) based on the PIPNet algorithm by Jin et al. (2021).21
The prediction models comprised 132 predictors and 88 outcome variables (Figure 1). The predictor variables were demographics (age, sex), clinical (treatment time, premolar extraction), and Cartesian coordinates of the 64 anatomic landmarks.



Citation: The Angle Orthodontist 94, 5; 10.2319/111823-767.1
Landmarks were chosen to reflect orthodontic treatment changes of the incisors (16 variables), the molars (24 variables), the soft tissue from subnasale to the terminal point of the neck (44 variables), the alveolar bone (12 variables), and the hard tissue of the mandible (32 variables).
The outcome variables included Cartesian coordinates of the 22 soft tissue landmarks from subnasale to the terminal point of the neck, 6 alveolar bone landmarks, and 16 skeletal landmarks on the mandible that could undergo changes according to orthodontic tooth movement (Figure 2; Table 2).
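The variable counts stated above can be checked with simple arithmetic. The following is a minimal bookkeeping sketch in Python; the group names used as dictionary keys are illustrative labels only and are not identifiers from the study.

```python
# Bookkeeping sketch for the predictor and outcome variable counts in the text.
# Dictionary keys are illustrative labels, not names used by the authors.

N_DEMOGRAPHIC = 2        # age, sex
N_CLINICAL = 2           # treatment time, premolar extraction
COORDS_PER_LANDMARK = 2  # Cartesian x and y

# Number of coordinate variables contributed by each predictor landmark group.
predictor_coordinate_groups = {
    "incisors": 16,
    "molars": 24,
    "soft_tissue_subnasale_to_neck": 44,
    "alveolar_bone": 12,
    "mandible_hard_tissue": 32,
}

# Number of landmarks in each outcome group.
outcome_landmark_groups = {
    "soft_tissue": 22,
    "alveolar_bone": 6,
    "mandible_skeletal": 16,
}

n_coordinate_predictors = sum(predictor_coordinate_groups.values())
n_predictor_landmarks = n_coordinate_predictors // COORDS_PER_LANDMARK
n_predictors = N_DEMOGRAPHIC + N_CLINICAL + n_coordinate_predictors

n_outcome_landmarks = sum(outcome_landmark_groups.values())
n_outcomes = n_outcome_landmarks * COORDS_PER_LANDMARK

print(n_predictor_landmarks, n_predictors, n_outcome_landmarks, n_outcomes)
```

The totals agree with the 64 predictor landmarks, 132 predictors, 44 outcome landmarks, and 88 outcome variables reported in the text.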




AI Prediction Model
The AI prediction model applied in the present study was based on the TabNet DNN algorithm by Arik and Pfister (2021).14 Training and testing were performed using Python (Python Software Foundation, Wilmington, Del) on a desktop computer running the Ubuntu 22.04 LTS Linux distribution.
To develop an optimal AI prediction model, various training conditions (hyperparameters) were tested, and the optimal conditions were selected by comparing the prediction errors of numerous combinations of hyperparameters. Regarding the early stopping number of training epochs, 50, 100, 1000, and 10,000 were tested. Subsequently, the AI model trained through 10,000 epochs was selected as the optimal AI model (Figure 3A).



An oversampling method based on the synthetic minority oversampling technique (SMOTE)22 was implemented. SMOTE values of 0.05, 0.1, 0.2, and 0.3 were tested. However, the results from varying values of SMOTE did not show significant differences (Figure 3B).
Two Conventional Statistical Prediction Models
MMLR is based on the ordinary least squares estimator. The stepwise variable selection method was used in constructing the MMLR prediction model.
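As an illustration of the ordinary least squares estimator underlying MMLR, the sketch below fits several outcome columns at once with NumPy. It is a minimal example on synthetic data, not the authors' implementation: the function names `fit_mmlr` and `predict_mmlr` are hypothetical, and the stepwise variable selection used in the study is omitted.

```python
import numpy as np


def fit_mmlr(X, Y):
    """Ordinary least squares fit for all outcome columns of Y at once.

    Returns a coefficient matrix of shape (p + 1, q); row 0 holds the intercepts.
    Stepwise variable selection is NOT performed in this sketch.
    """
    X1 = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # least squares solution
    return B


def predict_mmlr(B, X):
    """Predict all outcome columns for new predictor rows X."""
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ B


# Synthetic, noise-free data (assumed sizes, far smaller than the study's 132 x 88).
rng = np.random.default_rng(0)
n, p, q = 60, 5, 3
X = rng.normal(size=(n, p))
B_true = rng.normal(size=(p + 1, q))
Y = np.column_stack([np.ones(n), X]) @ B_true

B_hat = fit_mmlr(X, Y)
print(np.allclose(B_hat, B_true))
```

Because the synthetic outcomes contain no noise, the estimated coefficients recover the generating coefficients up to floating-point tolerance.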
PLSR combines the benefits of principal component analysis and MMLR through dimensional reduction latent modeling.23 The PLSR prediction model with 40 latent variables was selected.
Validation and Evaluation of Predictive Performance
To validate the prediction models and to avoid overfitting, the leave-one-out cross-validation, which is known for its superiority compared with other test/validation methods, was used.24 During the validation process, 2661 prediction models were built, with each algorithm (AI, MMLR, and PLSR) having 887 models that excluded one subject during model building.
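The leave-one-out procedure can be sketched as follows. This is a minimal illustration on synthetic data that uses an ordinary least squares model as a stand-in for any of the three algorithms; the function names and data sizes are assumptions made for the example.

```python
import numpy as np


def fit_ols(X, Y):
    """Least squares fit with an intercept (stand-in for any prediction model)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B


def predict_ols(B, X):
    X1 = np.column_stack([np.ones(len(X)), X])
    return X1 @ B


def loocv_predictions(X, Y, fit, predict):
    """For each subject i, fit on all other subjects and predict subject i."""
    n = len(X)
    Y_pred = np.empty_like(Y, dtype=float)
    for i in range(n):
        keep = np.arange(n) != i                 # mask that drops subject i
        model = fit(X[keep], Y[keep])
        Y_pred[i] = predict(model, X[i:i + 1])[0]
    return Y_pred


# Synthetic data with assumed sizes (the study used n = 887 subjects).
rng = np.random.default_rng(1)
n, p, q = 30, 4, 2
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=(p, q)) + 0.1 * rng.normal(size=(n, q))

Y_loo = loocv_predictions(X, Y, fit_ols, predict_ols)

# One model per left-out subject and per algorithm, as described in the text.
n_models_in_study = 3 * 887
print(Y_loo.shape, n_models_in_study)
```

Each subject's prediction comes from a model that never saw that subject, which is why 887 models per algorithm, and 2661 in total, were required.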
To compare predictive performance, analysis of variance was conducted. Scatterplots with 95% confidence ellipses were used to evaluate the prediction errors in two dimensions.25 Changes in the 22 skeletal and 22 soft tissue landmarks after orthodontic treatment were connected using spline curves overlaid on real patient photos and cephalometric images (Figure 1).
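One common way to construct a 95% confidence ellipse from two-dimensional prediction errors is sketched below. It assumes approximately bivariate normal errors and uses the sample covariance; the exact construction used for the figures in this study is not specified in the text, so this is an illustrative assumption, and the function name is hypothetical.

```python
import math

import numpy as np


def confidence_ellipse(errors_xy, level=0.95):
    """Return (center, semi_axes, angle_deg) of a covariance-based ellipse.

    Assumes roughly bivariate normal errors. For 2 degrees of freedom the
    chi-square quantile has the closed form -2 * ln(1 - level).
    """
    errors_xy = np.asarray(errors_xy, dtype=float)
    center = errors_xy.mean(axis=0)
    cov = np.cov(errors_xy, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    scale = -2.0 * math.log(1.0 - level)         # about 5.99 for level = 0.95
    semi_axes = np.sqrt(eigvals[::-1] * scale)   # major axis first
    major = eigvecs[:, -1]                       # direction of the major axis
    angle_deg = math.degrees(math.atan2(major[1], major[0]))
    return center, semi_axes, angle_deg


# Synthetic prediction errors in mm (x: horizontal, y: vertical), assumed values.
rng = np.random.default_rng(2)
errors = rng.multivariate_normal([0.2, -0.1], [[1.0, 0.6], [0.6, 2.0]], size=5000)

center, semi_axes, angle_deg = confidence_ellipse(errors)
print(center, semi_axes, angle_deg)
```

A smaller ellipse indicates less scattered prediction errors, and an ellipse centered away from the origin indicates a systematic bias in one direction.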
RESULTS
Among 887 patients, 31% had undergone premolar extraction treatment; 53.6%, 34.8%, and 11.6% had Class I, II, and III malocclusions, respectively. The mean treatment duration was 32 months.
The pooled average prediction errors of the 44 anatomical landmarks were 1.69 mm, 1.74 mm, and 2.12 mm from the MMLR, PLSR, and AI prediction methods, respectively.
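To make the notion of a prediction error in millimeters concrete, the sketch below computes the Euclidean distance between predicted and actual landmark positions and averages it. That the study defined error in exactly this way is an assumption; the coordinates are invented for illustration and are not study data.

```python
import math


def landmark_errors(predicted, actual):
    """Euclidean distance (mm) between predicted and actual (x, y) positions."""
    return [math.dist(p, a) for p, a in zip(predicted, actual)]


# Two hypothetical landmarks, coordinates in mm (illustrative values only).
predicted = [(10.0, 20.0), (13.0, 24.0)]
actual = [(10.0, 21.0), (10.0, 20.0)]

errors = landmark_errors(predicted, actual)
mean_error = sum(errors) / len(errors)
print(errors, mean_error)
```

Pooling such distances over all subjects and all 44 landmarks would give a single summary value per method, comparable in form to the averages reported above.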
Overall, MMLR demonstrated the most accurate results in all of the alveolar bone and skeletal landmarks. However, AI demonstrated superiority over MMLR and PLSR in predicting 5 among 22 soft tissue landmarks, all of which were landmarks on the face below menton to the terminal point of the neck that had been poorly predicted by MMLR and PLSR (Table 2).
From the point of view of statistical significance, MMLR showed more accurate results than PLSR in 14 landmarks. However, when the prediction errors were evaluated in two dimensions, the differences between MMLR and PLSR did not illustrate clinically significant differences (Figure 4). In certain areas, the differences between the AI and conventional methods were noticeable. Figure 4 illustrates several representative scatterplots of prediction errors where AI demonstrated greater prediction errors, such as at the upper lip, lower lip, soft tissue point B, soft tissue pogonion, and soft tissue menton. However, when predicting the cervical point, AI showed fewer prediction errors than the conventional methods (Figure 4).



To provide real case examples, the predicted outcomes were overlaid with the actual changes following orthodontic treatment. Figure 5A displays an orthodontic patient treated with four premolar extractions, which demonstrated that MMLR and PLSR are more accurate than AI when predicting changes in the alveolar process and lip curves. Figure 5B is a treatment case with an open-bite resolved by counterclockwise autorotation of the mandible through the intrusion of the maxillary posterior teeth, which illustrated poor accuracy by the AI model compared with MMLR and PLSR for the prediction of rotational movement of the mandible. Similar variations in the predictive outcomes were observed in all patients, particularly in the lip region and chin tip (Figure 5C,D). In predicting soft tissue curves below soft tissue menton, AI outperformed MMLR and PLSR for all patients (Figure 5).



DISCUSSION
Although recent AI studies have shown promising features of AI in predicting changes following orthodontic treatment,1–3 no studies have compared the predictive performance of AI with that of conventional statistical methods to determine whether AI was superior enough to deserve the spotlight over conventional methods in this area. In addition, previous literature was based on relatively small sample sizes with a limited number of outcome variables. The present study appears to be the first to compare the predictive performance of an AI prediction model with conventional prediction methods for orthodontic treatment outcomes, and it used the largest sample size so far. This study showed that AI was not effective in predicting changes after orthodontic treatment, except for the neck area. Based on these findings, it could be conjectured that, for this task, AI might not be as effective as conventional methods.
According to the literature, AI has not always been effective. According to Hwang et al. (2020),19 when detecting cephalometric landmarks, AI was poorer than human examiners in accurately identifying the nose tip, the nasal bone tip, incisal edges, and incisal root tips. These landmarks had a relatively clear form and shape that could be visually pinpointed with ease. Recently, according to Moon et al. (2024),12 when predicting facial growth, AI was the most accurate in 63 out of 78 landmarks (81%). However, when predicting cranial base landmarks, AI showed poorer results than PLSR. AI also did not outperform PLSR in the study by Park et al. (2024)13 on predicting changes after orthognathic surgery; there, AI provided the most accurate results for only 6 out of 32 landmarks (18.8%). In the present study, the proportion of landmarks for which AI showed the most accurate results decreased to 5 of 44 (11.4%), as summarized in Table 3. This suggests that, when changes were limited, variations were minimal, or a clear cause-and-effect relationship existed, conventional statistical prediction methods were more effective than AI. Although predicting changes after orthodontic treatment is complex, orthodontic treatment changes are not as large as those resulting from natural facial growth over time or orthognathic surgical procedures.

AI does have some disadvantages. First, AI cannot explain how it arrives at solutions, unlike conventional linear regression analysis, which can interpret and estimate the relationships between predictor and outcome variables. Since the internal operations of AI cannot be fully understood or interpreted, AI could in this sense be deemed a black box.13 Second, developing an AI model requires significant computational resources and time, ranging from several weeks to months, depending on the sample size and computer specifications.
The large sample size of this study may have influenced the finding that MMLR was more effective than PLSR. Without exception, all of the previous growth prediction studies11,12 and orthognathic surgery prediction studies5,7–10,13 showed MMLR to have a poorer predictive performance than PLSR. With some caution, it is surmised that this phenomenon might have been related to the ratio of the number of subjects (n) to the number of predictors (p), namely, the n/p ratio. PLSR is advantageous in the case of small n and large p situations,23 whereas MMLR models would be more robust when the n/p ratio was greater than 5.26 In those studies, the n/p ratios of growth prediction and orthognathic surgery prediction studies were 2.55 and 2.78, respectively. The consequence was that PLSR models were better methods than MMLR models. In contrast, the current study had a larger sample size and fewer variables than previous studies, resulting in an n/p ratio of 6.72. As shown in Table 3, this might have led to the finding that MMLR was the more accurate method. However, further clinical studies or simulations are required to validate this hypothesis.
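The n/p ratio for the present study follows directly from the sample size and the number of predictors given in the Materials and Methods. The short sketch below reproduces that arithmetic; the function name is illustrative, and the threshold of 5 is the value cited in the text.

```python
def n_to_p_ratio(n_subjects, n_predictors):
    """Ratio of the number of subjects (n) to the number of predictors (p)."""
    return n_subjects / n_predictors


N_SUBJECTS = 887         # patients in the present study
N_PREDICTORS = 132       # predictor variables in the present study
ROBUSTNESS_THRESHOLD = 5  # n/p value cited in the text for robust MMLR models

ratio = n_to_p_ratio(N_SUBJECTS, N_PREDICTORS)
print(round(ratio, 2), ratio > ROBUSTNESS_THRESHOLD)
```

The result, about 6.72, lies above the cited threshold, whereas the ratios of 2.55 and 2.78 reported for the earlier studies lie below it.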
On a similar note, AI seemed to have shown better predictive performance when the n/p ratio was low, ie, when there were limited numbers of subjects but many predictor variables. This may imply that, in the case of three-dimensional (3D) studies which involve 3 to 4 times more variables than 2D study formulations, AI is likely to play a more meaningful role than conventional statistical methods.4,15 Orthodontic treatment prediction results will become more sophisticated in the future as more 3D information is collected.
CONCLUSIONS
- When predicting changes following orthodontic treatment, AI was not as effective as the conventional statistical methods, suggesting that AI might not always be the best option for predicting everything.
- However, this does not necessarily mean that conventional methods should always be applied. The strength of the AI prediction method was apparent in predicting the soft tissue changes in the neck, whereas traditional methods had poorly predicted changes in that area.
- Applying multiple methods catered to the anatomic features and variability of response variables may be a viable option to improve predictive performance.

Figure 1. The experimental design.

Figure 2. The reference planes and 78 cephalometric landmarks used in this study: (A) pretreatment image, skeletal landmarks in capital letters; (B) posttreatment image, soft tissue landmarks in lowercase letters.

Figure 3. Searching for optimal artificial intelligence (AI) model training conditions by comparing 95% confidence ellipses of the AI prediction errors at the upper lip and lower lip: (A) according to the number of training epochs; (B) according to the amount of oversampling.

Figure 4. The prediction errors in several soft tissue landmarks obtained from the multivariate multiple linear regression (MMLR, green), partial least squares regression (PLSR, blue), and artificial intelligence (AI, red) prediction methods. In general, AI showed the least accurate prediction results except for the cervical point, where AI showed the smallest ellipses.

Figure 5. Comparison between actual changes after orthodontic treatment and prediction results according to multivariate multiple linear regression (MMLR), partial least squares regression (PLSR), and artificial intelligence (AI) prediction methods in patients with (A) Class III anterior crossbite, (B) Class II open-bite, (C) Class I open-bite, and (D) Class II open-bite.