Accuracy and reliability of automated landmark identification and cephalometric measurements on cone beam computed tomography using Invivo software
ABSTRACT
Objectives
To evaluate the accuracy and reliability of an automated landmark identification (ALI) system and the impact of ALI errors on cephalometric measurements on cone-beam computed tomography (CBCT) images.
Materials and Methods
Thirty-one landmarks were identified on 76 CBCT images using Invivo7 software (Anatomage, San Jose, Calif). Ground truth was established by averaging landmark coordinates from two calibrated human examiners. The accuracy of the ALI system was assessed by the mean absolute error (MAE, mm) across coordinate axes, the mean error distance (mm), and the successful detection rate (SDR) for each landmark. Interexaminer reliability between the ALI and manual landmark location was evaluated. Eighteen cephalometric measurements were computed from 25 landmarks. Accuracy of measurements from the ALI system was assessed with the MAE and successful measurement rates (SMR).
Results
The ALI system closely matched human examiners in landmark identification, with an average MAE of 0.94 ± 0.99 mm. Across all three coordinate axes, 87% of the landmarks had <2 mm MAE. ALI average MAE for conventional linear and angular cephalometric measurements were 1.35 ± 1.33 mm and 0.89 ± 0.89 degrees, respectively. Only one measurement, Intercondylar Width, showed MAE >3 mm.
Conclusions
The ALI system showed clinically acceptable accuracy and reliability for the majority of cephalometric landmarks and measurements. Clinicians are advised to critically evaluate ALI landmarks with substantial errors in order to use the capabilities of commercial software effectively.
INTRODUCTION
Cephalometric analysis provides a quantitative method to assess skeletal and dentoalveolar morphology, which are crucial for understanding dentofacial discrepancies and growth.1 Cone-beam computed tomography (CBCT) technology has enabled three-dimensional (3D) analysis, offering more precise anatomical representation.2 A key advantage of 3D analysis is the elimination of bilateral structure superimposition and distortion present in 2D imaging, improving accuracy.3
Despite the benefits offered by CBCT, challenges remain, especially in the time-consuming and less familiar process of landmark identification on 3D images. Computational advancements have propelled machine learning applications, including automated landmark identification (ALI), facilitating analysis in both two-dimensional (2D) and 3D imaging.4–16 Artificial intelligence (AI) integration aims to lessen clinician workload and expedite the process.17 ALI software reduces identification time to 1–2 minutes compared to the 15 minutes needed for manual CBCT identification by an experienced operator.1,2,4,7,8
Recent research has shown that deep learning methods accurately detect landmarks.1,8,9 Promising algorithms developed over the past two years have improved ALI accuracy on 3D images.14–16 Although previous studies demonstrated promising results, the impact on cephalometric analysis remains to be evaluated thoroughly.13,18,19 This study had two main objectives: to assess the accuracy and reliability of automated 3D CBCT landmark identification using the widely used commercial software, Invivo (Anatomage, San Jose, Calif), and to evaluate the impact of ALI errors on commonly used cephalometric measurements.
MATERIALS AND METHODS
This study was approved by the institutional review board of the University of the Pacific (#2021-95). The study sample was gathered retrospectively from the University of the Pacific Department of Orthodontics. The inclusion criteria were: (1) CBCT scans with a voxel size ≤0.3 mm³ and a field of view of at least 16 × 13 cm, and (2) the presence of all permanent teeth. Participants of all ages and genders, and with various skeletal conditions, were included. Exclusion criteria were: (1) restorative work causing significant scatter and (2) the presence of craniofacial abnormalities, syndromes, or cleft lip and palate.
The sample consisted of 76 CBCT volumes, acquired using an Imaging Sciences International CBCT scanner (Hatfield, Pa). These volumes were imported in Digital Imaging and Communications in Medicine (DICOM) format into the Invivo7 software (Anatomage, San Jose, Calif). The software included multiplanar sectional slices in axial, sagittal, and coronal views to aid in identifying the landmarks (Figure 1). CBCT images were traced independently by two calibrated human examiners (YJ and HS) and the ALI system. A total of 31 landmarks were used, comprising 17 skeletal, eight soft tissue, and six dental landmarks (Table 1). The resulting x, y, and z coordinates were exported to a Microsoft Excel (Microsoft Corp., Redmond, Wash) file. The ground truth was established for each landmark by calculating the mean of the x, y, and z coordinates recorded by the two examiners.


Citation: The Angle Orthodontist 95, 4; 10.2319/122324-1049.1

Cephalometric measurements were calculated using coordinate values reoriented to a standard anatomical frame of reference (AFOR). The axial plane was formed by right Porion, left Porion, and right Orbitale. The sagittal plane was constructed perpendicular to the axial plane and contained Nasion and Basion. The coronal plane was made perpendicular to the other two planes and passed through Nasion. Eighteen cephalometric measurements, including eight angular and 10 linear measurements, were computed. All angular and eight linear measurements were calculated by projecting the line segments onto the midsagittal plane. Transverse width measurements, intercondylar width and mandibular width, were projected onto the coronal plane. Distance and angle calculations used standard mathematical formulas.
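The reorientation and projection steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`build_afor`, `to_afor`, `sagittal`, `coronal`), the choice of Nasion as the origin, and all coordinates are assumptions made for illustration, and the input is assumed to follow the right-handed convention of Figure 1 (positive x left, y posterior, z upward).

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def build_afor(po_r, po_l, or_r, nasion, basion):
    """Return (origin, x_axis, y_axis, z_axis) of the reference frame."""
    # Axial plane: through right Porion, left Porion, and right Orbitale.
    # Its normal is used as the vertical (z, upward) axis.
    z_axis = unit(cross(sub(or_r, po_r), sub(po_l, po_r)))
    # Sagittal plane: perpendicular to the axial plane and containing
    # Nasion and Basion, so its normal (x, leftward) is perpendicular to
    # both the vertical axis and the Nasion-Basion line.
    x_axis = unit(cross(sub(basion, nasion), z_axis))
    # Coronal plane: perpendicular to the other two; normal is y (posterior).
    y_axis = cross(z_axis, x_axis)
    # Origin at Nasion is an assumption; distances and angles do not depend on it.
    return nasion, x_axis, y_axis, z_axis

def to_afor(p, frame):
    """Express point p in the reference frame."""
    origin, x_axis, y_axis, z_axis = frame
    d = sub(p, origin)
    return (dot(d, x_axis), dot(d, y_axis), dot(d, z_axis))

def sagittal(p):
    """Project onto the midsagittal plane by dropping the x component."""
    return (p[1], p[2])

def coronal(p):
    """Project onto the coronal plane by dropping the y component."""
    return (p[0], p[2])

def angle_deg(vertex, p, q):
    """Angle p-vertex-q in degrees for projected (2D) points."""
    v1, v2 = sub(p, vertex), sub(q, vertex)
    cos_t = dot(v1, v2) / math.sqrt(dot(v1, v1) * dot(v2, v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

# Hypothetical landmark coordinates in mm (illustrative only).
frame = build_afor(po_r=(-60.0, 80.0, 0.0), po_l=(60.0, 80.0, 0.0),
                   or_r=(-35.0, 10.0, 0.0), nasion=(0.0, 0.0, 25.0),
                   basion=(0.0, 85.0, -15.0))
co_r = to_afor((-50.0, 75.0, -5.0), frame)
co_l = to_afor((50.0, 78.0, -3.0), frame)
# A transverse width is measured on the coronal projection.
width = math.dist(coronal(co_r), coronal(co_l))
```

Sagittal and vertical measurements would follow the same pattern with `sagittal` in place of `coronal`, using `math.dist` for lengths and `angle_deg` for angles.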
To evaluate the accuracy of ALI, the mean absolute error (MAE, mm) in the x, y, and z coordinates between the ground truth and ALI was calculated, and the error distance was computed with the 3D Euclidean distance formula.3,10 A successful detection rate (SDR) was also calculated. The SDR represents the percentage of images in which the landmark was located within a given precision range.3,11 Clinically, an ALI system is considered accurate if the deviation from the ground truth is less than 2 mm, and acceptable if under 4 mm.5,14,20 To detect subtle changes during treatment or growth, this study implemented stricter criteria of ≤1 mm, 1.5 mm, 2 mm, and 3 mm. The successful measurement rate (SMR) was also calculated with the same criteria as the SDR. To evaluate intraexaminer reliability of the ALI system, all 76 images were subjected to two rounds of testing at an interval of 1 week. The interexaminer reliability between ALI and manual identification, as well as the reliability between the two calibrated human examiners, was evaluated.
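As a minimal sketch of these accuracy metrics for a single landmark, the following computes the ground truth as the mean of two examiners, the per-axis MAE, the 3D Euclidean error distance, and the SDR at the four thresholds. The function names and all coordinate values are hypothetical and serve only to illustrate the calculations.

```python
import math

def ground_truth(ex1, ex2):
    """Mean of the two examiners' (x, y, z) coordinates."""
    return tuple((a + b) / 2 for a, b in zip(ex1, ex2))

def axis_abs_errors(truth, ali):
    """Absolute error on each coordinate axis."""
    return tuple(abs(a - t) for t, a in zip(truth, ali))

def sdr(distances, threshold_mm):
    """Percentage of images with error distance within the threshold."""
    hits = sum(d <= threshold_mm for d in distances)
    return 100.0 * hits / len(distances)

# Hypothetical coordinates (mm) of one landmark on four images.
examiner1 = [(10.0, 20.0, 30.0), (11.0, 21.0, 29.0), (9.0, 19.0, 31.0), (10.0, 20.0, 30.0)]
examiner2 = [(10.0, 20.0, 30.0), (11.0, 21.0, 31.0), (9.0, 21.0, 31.0), (12.0, 20.0, 30.0)]
ali       = [(10.0, 20.0, 31.0), (11.0, 24.0, 34.0), (9.0, 20.0, 31.0), (13.0, 20.0, 30.0)]

truth = [ground_truth(a, b) for a, b in zip(examiner1, examiner2)]
errors = [axis_abs_errors(t, a) for t, a in zip(truth, ali)]

# MAE per axis: average the absolute errors over images.
mae_x, mae_y, mae_z = (sum(e[i] for e in errors) / len(errors) for i in range(3))

# Error distance: 3D Euclidean distance between ground truth and ALI.
distances = [math.dist(t, a) for t, a in zip(truth, ali)]
mean_distance = sum(distances) / len(distances)

# SDR at the four precision ranges used in the study.
sdr_by_threshold = {thr: sdr(distances, thr) for thr in (1.0, 1.5, 2.0, 3.0)}
```

The same success-rate function could be applied to absolute measurement errors to obtain an SMR, under the assumption that the thresholds are read in millimeters or degrees as appropriate.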
Statistical Analysis
Basic descriptive statistics including the mean, standard deviation (SD), and percentages were computed. Intraclass correlation coefficients (ICC) were used to evaluate reliability. To visualize and compare reliability, scattergrams with 95% confidence ellipses were generated, showing differences between ALI and human examiner 2, as well as differences between human examiners 1 and 2. Data were analyzed using the Statistical Package for the Social Sciences (IBM Corp, Armonk, NY) and R (R Foundation for Statistical Computing, Vienna, Austria).
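The ICC model used is not stated here, so the sketch below assumes one common choice: the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name `icc_2_1` and the data are hypothetical; the example only illustrates how an ICC of this type responds to identical and to systematically offset ratings.

```python
def icc_2_1(ratings):
    """ICC(2,1) from a table with one row per image and one column per rater.

    Assumes a two-way random-effects, absolute-agreement, single-rater model,
    which may differ from the model used in the study.
    """
    n = len(ratings)        # number of images (subjects)
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical x-coordinates (mm) of one landmark on five images.
run1 = [12.1, 15.4, 9.8, 20.3, 17.6]

# Identical repeated output, as from a deterministic system, gives an ICC of 1.
icc_repeat = icc_2_1([[a, a] for a in run1])

# A constant 1 mm offset between two raters lowers absolute agreement.
icc_offset = icc_2_1([[a, a + 1.0] for a in (1.0, 2.0, 3.0, 4.0, 5.0)])
```

Under this model a systematic offset between raters reduces the coefficient even when the two sets of values are perfectly correlated, which is why an absolute-agreement form is sensitive to consistent differences in landmark placement.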
RESULTS
The ALI system demonstrated perfect consistency, with an ICC of 1 when the same CBCT images were processed twice. The interexaminer reliability for landmark location between the two calibrated human examiners was excellent, with ICC ranging from 0.9 to 1 except for the x-coordinate of left Orbitale with 0.7. The interexaminer reliability between ALI and a human examiner (human examiner 2) ranged from 0.9 to 1 for the y and z coordinates, and from 0.5 to 1 for the x coordinates. In general, the landmarks which exhibited large differences between human examiners also showed considerable differences between the human and ALI (Figure 2). However, there were exceptions such as Porion, Condylion, and the maxillary first molar cusp: the x-coordinates for Porion demonstrated the lowest reliability, with ICC values ranging from 0.5 to 0.6, compared to an ICC of 0.9 between two calibrated human examiners. Both x-coordinates of Condylion and the maxillary first molar cusp exhibited lower reliability (ICC = 0.7) compared to the ICC >0.9 observed between human examiners.


The overall accuracy of the ALI system in comparison to the established ground truth is presented in Table 2. ALI achieved high accuracy, with a mean absolute error (MAE) <2 mm for all landmarks, except for Porion, Gonion, and Stomion Superius. The MAE for the 31 landmarks was 1.35 mm on the x-axis, 0.72 mm on the y-axis, and 0.74 mm on the z-axis, resulting in an overall MAE of 0.94 mm. The mean error distance between the ALI system and ground truth was 1.99 ± 1.26 mm, with 81% of the landmarks showing an SDR within a 3 mm error distance margin. Notably, Nasion and Sella landmarks showed exceptional accuracy, achieving SDRs of 91% and 92% within a 2 mm error distance range (Table 3). High accuracy was also observed for the upper and lower incisor edges, attaining 100% and 99% SDRs within 2 mm, whereas the lower incisor root apex showed a 70% SDR within 2 mm (Table 3). Right and left Porion exhibited the least accuracy, with mean errors of 3.76 ± 1.83 mm and 3.38 ± 1.36 mm, respectively (Table 2).


Cephalometric measurement errors are presented in Table 4. The MAEs for conventional linear and angular cephalometric measurements were 0.93 ± 0.92 mm and 0.89 ± 0.89 degrees, respectively. In the conventional linear measurement category, posterior facial height (PFH) had the greatest MAE of 1.52 ± 1.30 mm and an SMR of 65% within 2 mm, whereas all other measurements had an MAE of around 1 mm. The two transverse measurements evaluated in this study, intercondylar width and mandibular width, had MAEs of 3.32 ± 1.61 mm and 2.74 ± 1.03 mm, respectively. The measurements with higher reliability, which exhibited smaller differences between human examiners, also demonstrated higher reliability between a human and the ALI system (Figure 3).



DISCUSSION
The ALI system reduced landmark identification time to 1–2 minutes, compared to the 15 minutes required for manual location on CBCT by calibrated human operators. The ALI system nearly matched human accuracy, with 95% of landmarks within a 3 mm error. This surpassed the results reported by Shahidi et al.7 and was comparable to those of Ghowsi et al.14 Another distinction in the current study was the demonstrated high accuracy of the ALI system in locating dental landmarks. Notably, the maxillary and mandibular incisal edges achieved SDRs of 100% and 99%, respectively, within a 2-mm error distance (Table 3).
Errors in landmark identification can stem from differences in the operational landmark definition between the software and human examiners. For example, ALI places Condylion along the mandibular profile line, whereas human examiners typically mark it at the highest point of the condyle, usually medial to the ALI location. This discrepancy leads to consistent overestimation of intercondylar width by the ALI system (Figure 3). In 2D cephalometry, this is less critical since Condylion is used to measure mandibular and ramus lengths. In contrast, for 3D evaluations of transverse dimensions, such discrepancies become significant. Significant transverse errors were also found for Porion, which ALI placed more medially on the temporal bone than human experts. However, these errors did not affect construction of the FH plane since anterior-posterior and vertical errors were smaller.
Landmark ambiguity contributes to identification errors. The definitions and anatomy allow for a degree of interpretation, leading to individual variance in landmark identification.21,22 For instance, substantial errors across x, y, and z coordinates were observed for Gonion (Table 2). These errors may originate from the ALI training data from human experts, reflecting the challenge in defining Gonion on the broad curve of the mandible.21,23,24 The difficulty in locating Gonion is illustrated in Figure 2B, which shows large differences between human examiners. Such discrepancies underscore the need for a clear consensus on definitions among human experts.25
Angular measurement errors tend to be larger when a line segment intersects the broader aspect of the error ellipse.23,25 Despite substantial landmark location errors for Gonion, the mandibular plane angle (FMA) was accurate, as the larger error distribution dimension for Gonion aligned with the Gonion–Menton line (Figure 2B). The PFH measurement, a length measurement, exhibited greater error due to the larger error in Gonion landmark location. PFH showed larger differences both between human examiners and between the manual and ALI systems (Figure 3).
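A small worked example can make this geometric point concrete: a landmark error of the same size changes an angle very differently depending on whether it lies along or across the line being measured. The 2D coordinates below (anterior, vertical, in mm) are hypothetical, and a horizontal line stands in for the reference plane.

```python
import math

def line_angle_deg(p, q):
    """Angle in degrees between the line p->q and a horizontal reference line."""
    return abs(math.degrees(math.atan2(q[1] - p[1], q[0] - p[0])))

# Hypothetical midsagittal projections (anterior, vertical) in mm.
gonion = (0.0, 0.0)
menton = (70.0, -20.0)

dx, dy = menton[0] - gonion[0], menton[1] - gonion[1]
length = math.hypot(dx, dy)
along = (dx / length, dy / length)    # unit vector along the Gonion-Menton line
across = (-along[1], along[0])        # unit vector perpendicular to that line

error_mm = 2.0
gonion_along = (gonion[0] + error_mm * along[0], gonion[1] + error_mm * along[1])
gonion_across = (gonion[0] + error_mm * across[0], gonion[1] + error_mm * across[1])

base_angle = line_angle_deg(gonion, menton)
# Error along the line: the line direction, and so the angle, is unchanged.
change_along = abs(line_angle_deg(gonion_along, menton) - base_angle)
# Error across the line: the angle changes by atan(error / line length).
change_across = abs(line_angle_deg(gonion_across, menton) - base_angle)
```

In this sketch a 2 mm displacement along the line leaves the angle unchanged, whereas the same displacement across the line changes it by about 1.6 degrees, which is consistent with the observation that an error ellipse elongated along the line has little effect on the angle.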
There is no definitive gold standard for identifying landmark positions in live patients (ie, directly identifying a site on the bone).21 Although human examiners were calibrated, variations in landmark identification persist, leading to discrepancies in cephalometric measurements. Without a definitive ground truth, the intra- and interexaminer reliability patterns become essential for validating new methods.26,27 This study showed that interexaminer reliability between the manual and ALI method was comparable to that between calibrated human examiners. Most variance arose from ambiguous landmark definitions, as shown by confidence ellipses (Figure 2) and limits of agreement (Figure 3). Although location of x-coordinates was less reliable between the manual and ALI methods, it did not significantly impact sagittal and vertical measurement errors. However, interpreting the intercondylar width requires caution, as the measurement was biased.
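The bias and limits of agreement referred to here follow the usual Bland-Altman construction: the mean of the paired differences, plus and minus 1.96 times their standard deviation. The sketch below uses hypothetical width values and a hypothetical function name to show how a consistent overestimation appears as limits of agreement that exclude zero.

```python
import statistics

def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)      # mean difference
    sd = statistics.stdev(diffs)       # sample SD of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical transverse widths (mm) on five images.
ali_width = [121.0, 118.5, 125.0, 119.0, 123.5]
human_width = [118.0, 115.0, 122.0, 115.5, 120.5]

bias, lower, upper = bland_altman(ali_width, human_width)
# If both limits are above zero, one method reads consistently higher.
systematic_overestimate = lower > 0
```

Narrow limits centered away from zero indicate a reproducible offset rather than random scatter, which is the pattern a difference in operational landmark definition would be expected to produce.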
Improvements in ALI require refined 3D landmark definitions and expanded training datasets. Subjective interpretations by human experts lead to discrepancies, a challenge for ALI software development. Regarding measurements and analysis, if a landmark is to be used to evaluate a certain dimension, it should be shown to have relatively good consistency and precision.21 For example, a clearer definition of Condylion is necessary before intercondylar width can contribute meaningfully to transverse analysis.
The software used in this study detected mandibular midline landmarks on the midsagittal plane, which may result in large x-coordinate errors in patients with asymmetry (Figure 2D). These x-coordinate errors were observed in both the mandibular skeletal and soft tissue midline landmarks (Table 2). This indicated the need for caution when analyzing facial asymmetry and the importance of developing a more sophisticated detection algorithm for midline points. In addition, incorporating additional bilateral landmarks such as jugum, zygoma, first molars, and canines could enhance transverse analysis, which is a key advantage of 3D imaging.
This study provided an error range for each landmark in each dimension and the consequent impact on cephalometric measurement accuracy. Although ALI holds promise in orthodontic practice, it is imperative for clinicians to recognize the landmarks and measurements prone to substantial errors. A hybrid approach that combines ALI with human review for error-prone landmarks seems advisable for the time being. Additionally, developing an algorithm to highlight significant outliers for the clinician to review, and allowing manual adjustments, will be essential features for ALI software. The path forward requires more rigorous and standardized landmark definitions and the cautious use of less accurate landmarks and measurements.
CONCLUSIONS
ALI achieved an SDR of 87% for all 31 landmarks within a 2-mm margin of error.
Interexaminer reliability between a human and the ALI system was comparable to that between calibrated human examiners, with most variance arising from ambiguous definitions of the landmarks.
The MAE remained under 2 degrees or 2 mm for all commonly used cephalometric measurements. The intercondylar width and mandibular width were less accurate, with MAE >2 mm.
Clinicians must recognize potential inaccuracies in landmarks and measurements susceptible to significant errors, and more rigorous and standardized definitions are required to enhance ALI use in orthodontics.

Figure 1. Three-dimensional landmark identification on CBCT. Positive signs on the x, y, and z axes denote left, posterior, and upward directions, respectively, while negative signs indicate right, anterior, and downward.

Figure 2. Scatterplots of (ALI coordinates – human examiner 2 coordinates) (red), and (human examiner 1 coordinates – human examiner 2 coordinates) (blue), with 95% confidence ellipses in different planes of view. Larger ellipses indicate greater variability and, thus, lower reliability. For bilateral landmarks, the right-side landmark is presented. Positive values on the x, y, and z axes indicate that the ALI (red) or human examiner 1 (blue) placed landmarks medially, posteriorly, and superiorly compared to the locations identified by human examiner 2. (A), Nasion; (B), Gonion; (C), Lower incisal crown; (D), Lower lip.

Figure 3. Bland-Altman plots presenting smaller variance (A) and larger variance (B) in measurements. Red represents (ALI – human examiner 2); Blue represents (human examiner 1 – human examiner 2). The solid line indicates the mean difference (bias); the dotted lines show the upper and lower limits of agreement. Wider limits of agreement indicate greater variability and, thus, lower reliability.