Article Category: Research Article
Online Publication Date: 10 Apr 2025

Accuracy and reliability of automated landmark identification and cephalometric measurements on cone beam computed tomography using Invivo software

Page Range: 362 – 370
DOI: 10.2319/122324-1049.1

ABSTRACT

Objectives

To evaluate the accuracy and reliability of an automated landmark identification (ALI) system and the impact of ALI errors on cephalometric measurements on cone-beam computed tomography (CBCT) images.

Materials and Methods

Thirty-one landmarks were identified on 76 CBCT images using Invivo7 software (Anatomage, San Jose, Calif). Ground truth was established by averaging landmark coordinates from two calibrated human examiners. The accuracy of the ALI system was assessed by the mean absolute error (MAE, mm) across coordinate axes, the mean error distance (mm), and the successful detection rate (SDR) for each landmark. Interexaminer reliability between the ALI and manual landmark location was evaluated. Eighteen cephalometric measurements were computed from 25 landmarks. Accuracy of measurements from the ALI system was assessed with the MAE and successful measurement rates (SMR).

Results

The ALI system closely matched human examiners in landmark identification, with an average MAE of 0.94 ± 0.99 mm. Across all three coordinate axes, 87% of the landmarks had <2 mm MAE. The average ALI MAEs for conventional linear and angular cephalometric measurements were 0.93 ± 0.92 mm and 0.89 ± 0.89 degrees, respectively. Only one measurement, Intercondylar Width, showed MAE >3 mm.

Conclusions

The ALI system showed clinically acceptable accuracy and reliability for the majority of cephalometric landmarks and measurements. Clinicians are advised to critically evaluate ALI landmarks with substantial errors in order to use the capabilities of commercial software effectively.

INTRODUCTION

Cephalometric analysis provides a quantitative method to assess skeletal and dentoalveolar morphology, which is crucial for understanding dentofacial discrepancies and growth.1 Cone-beam computed tomography (CBCT) technology has enabled three-dimensional (3D) analysis, offering more precise anatomical representation.2 A key advantage of 3D analysis is the elimination of the bilateral structure superimposition and distortion present in 2D imaging, which improves accuracy.3

Despite the benefits offered by CBCT, challenges remain, especially in the time-consuming and less familiar process of landmark identification on 3D images. Computational advancements have propelled machine learning applications, including automated landmark identification (ALI), facilitating analysis in both two-dimensional (2D) and 3D imaging.4–16 Artificial intelligence (AI) integration aims to lessen clinician workload and expedite the process.17 ALI software reduces identification time to 1–2 minutes compared to the 15 minutes needed for manual CBCT identification by an experienced operator.1,2,4,7,8

Recent research has shown that deep learning methods accurately detect landmarks.1,8,9 Promising algorithms developed over the past two years have improved ALI accuracy on 3D images.14–16 Although previous studies demonstrated promising results, the impact on cephalometric analysis remains to be evaluated thoroughly.13,18,19 This study had two main objectives: to assess the accuracy and reliability of automated 3D CBCT landmark identification using the widely used commercial software, Invivo (Anatomage, San Jose, Calif), and to evaluate the impact of ALI errors on commonly used cephalometric measurements.

MATERIALS AND METHODS

This study was approved by the institutional review board of the University of the Pacific (#2021-95). The study sample was gathered retrospectively from the University of the Pacific Department of Orthodontics. The inclusion criteria were: (1) CBCT scans with a voxel size ≤0.3 mm³ and a field of view of at least 16 × 13 cm, and (2) the presence of all permanent teeth. Participants of all ages and genders, and with various skeletal conditions, were included. Exclusion criteria were: (1) restorative work causing significant scatter and (2) the presence of craniofacial abnormalities, syndromes, or cleft lip and palate.

The sample consisted of 76 CBCT volumes, acquired using an Imaging Sciences International CBCT scanner (Hatfield, Pa). These volumes were imported in Digital Imaging and Communications in Medicine (DICOM) format into the Invivo7 software (Anatomage, San Jose, Calif). The software included multiplanar sectional slices in axial, sagittal, and coronal views to aid in identifying the landmarks (Figure 1). CBCT images were traced independently by two calibrated human examiners (YJ and HS) and the ALI system. A total of 31 landmarks were utilized, comprising 17 skeletal, eight soft tissue, and six dental landmarks (Table 1). The resulting x, y, and z coordinates were exported to a Microsoft Excel (Microsoft Corp., Redmond, Wash) file. The ground truth was established by calculating, for each landmark, the mean of the x, y, and z coordinates recorded by the two examiners.
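The averaging step can be illustrated with a minimal Python sketch. The function name, the dictionary layout, and the coordinate values are hypothetical and are not taken from the study data or from the Invivo export format.

```python
Point = tuple[float, float, float]


def ground_truth(examiner_1: dict[str, Point],
                 examiner_2: dict[str, Point]) -> dict[str, Point]:
    """Average the two examiners' coordinates, landmark by landmark and axis by axis."""
    truth = {}
    for name, p1 in examiner_1.items():
        p2 = examiner_2[name]
        truth[name] = ((p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2, (p1[2] + p2[2]) / 2)
    return truth


# Hypothetical coordinates in mm for one image (illustration only).
examiner_1 = {"Nasion": (0.2, -80.0, 20.4), "Sella": (0.0, -12.0, 18.0)}
examiner_2 = {"Nasion": (0.0, -80.6, 20.0), "Sella": (0.4, -12.4, 18.2)}
truth = ground_truth(examiner_1, examiner_2)
```

Each ALI coordinate is then compared against the corresponding entry in `truth` rather than against either examiner alone.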

Figure 1. Three-dimensional landmark identification on CBCT. Positive signs on the x, y, and z axes denote left, posterior, and upward directions, respectively, while negative signs indicate right, anterior, and downward.

Citation: The Angle Orthodontist 95, 4; 10.2319/122324-1049.1

Table 1. Definitions Used by Human Examiners for 3D Landmark Identification

Cephalometric measurements were calculated using coordinate values reoriented to a standard anatomical frame of reference (AFOR). The axial plane was formed by right Porion, left Porion, and right Orbitale. The sagittal plane was constructed perpendicular to the axial plane and contained Nasion and Basion. The coronal plane was made perpendicular to the other two planes and passed through Nasion. Eighteen cephalometric measurements, including eight angular and 10 linear measurements, were computed. All angular and eight linear measurements were calculated by projecting the line segments onto the midsagittal plane. Transverse width measurements, intercondylar width and mandibular width, were projected onto the coronal plane. Distance and angle calculations used standard mathematical formulas.
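One way such a reorientation and projection could be implemented is sketched below in Python. This is an illustration under stated assumptions, not the authors' code: the choice of Nasion as the origin, all function names, and the sample coordinates are hypothetical, and the input coordinates are assumed to follow the Figure 1 sign convention (x left, y posterior, z upward).

```python
import math

Vec = tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a: Vec) -> Vec:
    length = math.sqrt(dot(a, a))
    return (a[0] / length, a[1] / length, a[2] / length)


def build_afor(po_r: Vec, po_l: Vec, or_r: Vec, nasion: Vec, basion: Vec):
    """Return (origin, x_axis, y_axis, z_axis); origin at Nasion is an assumption."""
    # Axial plane through both Porions and right Orbitale. With the Figure 1
    # sign convention, this cross product order gives an upward normal.
    z_axis = unit(cross(sub(or_r, po_r), sub(po_l, po_r)))
    # The sagittal plane is perpendicular to the axial plane and contains Nasion
    # and Basion, so the posterior axis is Nasion->Basion minus its vertical part.
    nb = sub(basion, nasion)
    h = dot(nb, z_axis)
    y_axis = unit((nb[0] - h * z_axis[0], nb[1] - h * z_axis[1], nb[2] - h * z_axis[2]))
    # The coronal plane is perpendicular to the other two; x completes the frame.
    x_axis = cross(y_axis, z_axis)
    return nasion, x_axis, y_axis, z_axis


def to_afor(p: Vec, frame) -> Vec:
    """Express a point in the reference frame."""
    origin, x_axis, y_axis, z_axis = frame
    d = sub(p, origin)
    return (dot(d, x_axis), dot(d, y_axis), dot(d, z_axis))


def sagittal_length(a: Vec, b: Vec) -> float:
    """Distance after projection onto the midsagittal plane (x is dropped)."""
    return math.hypot(a[1] - b[1], a[2] - b[2])


def coronal_width(a: Vec, b: Vec) -> float:
    """Distance after projection onto the coronal plane (y is dropped)."""
    return math.hypot(a[0] - b[0], a[2] - b[2])


def sagittal_angle_deg(a: Vec, vertex: Vec, b: Vec) -> float:
    """Angle a-vertex-b after projection onto the midsagittal plane."""
    u = (a[1] - vertex[1], a[2] - vertex[2])
    v = (b[1] - vertex[1], b[2] - vertex[2])
    cos = (u[0] * v[0] + u[1] * v[1]) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


# Hypothetical landmark coordinates in mm (not study data).
frame = build_afor(po_r=(-60.0, 5.0, 0.0), po_l=(60.0, 5.0, 0.0),
                   or_r=(-35.0, -75.0, 0.0), nasion=(0.0, -90.0, 25.0),
                   basion=(0.0, 0.0, -15.0))
co_r = to_afor((-50.0, 10.0, -2.0), frame)
co_l = to_afor((52.0, 12.0, -1.0), frame)
width = coronal_width(co_r, co_l)  # intercondylar width on the coronal plane
```

Because every landmark is expressed relative to anatomical planes, the projected measurements in this sketch do not depend on how the head was positioned in the scanner.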

To evaluate the accuracy of ALI, the mean absolute error (MAE, mm) in the x, y, and z coordinates between the ground truth and ALI was calculated, and the error distance was calculated with the 3D Euclidean distance formula.3,10 A successful detection rate (SDR) was also calculated. The SDR represents the percentage of images in which the landmark was located within a given precision range.3,11 Clinically, an ALI system is considered accurate if the deviation from the ground truth is less than 2 mm, and acceptable if under 4 mm.5,14,20 To detect subtle changes during treatment or growth, this study implemented stricter criteria of ≤1 mm, ≤1.5 mm, ≤2 mm, and ≤3 mm. The successful measurement rate (SMR) was also calculated with the same criteria as the SDR. To evaluate intraexaminer reliability of the ALI system, all 76 images were subjected to two rounds of testing at an interval of 1 week. The interexaminer reliability between ALI and manual identification, as well as the reliability between the two calibrated human examiners, was evaluated.
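The three accuracy metrics described above can be sketched in Python for a single landmark across several images. The function names and the sample coordinates are hypothetical; the thresholds follow the ≤1, ≤1.5, ≤2, and ≤3 mm criteria stated in the text.

```python
import math

Point = tuple[float, float, float]


def axis_mae(ali: list[Point], truth: list[Point]) -> tuple[float, float, float]:
    """Mean absolute error along x, y, and z across images."""
    n = len(ali)
    return tuple(sum(abs(a[i] - t[i]) for a, t in zip(ali, truth)) / n
                 for i in range(3))


def error_distances(ali: list[Point], truth: list[Point]) -> list[float]:
    """3D Euclidean distance between ALI and ground truth for each image."""
    return [math.dist(a, t) for a, t in zip(ali, truth)]


def sdr(distances: list[float],
        thresholds: tuple[float, ...] = (1.0, 1.5, 2.0, 3.0)) -> dict[float, float]:
    """Percentage of images whose error distance is within each threshold."""
    return {thr: 100.0 * sum(d <= thr for d in distances) / len(distances)
            for thr in thresholds}


# Hypothetical data: one landmark on four images (mm), ground truth at the origin.
truth_pts = [(0.0, 0.0, 0.0)] * 4
ali_pts = [(0.5, 0.0, 0.0), (1.0, 1.0, 1.0), (0.0, 2.0, 1.5), (3.0, 0.0, 4.0)]

mae_x, mae_y, mae_z = axis_mae(ali_pts, truth_pts)
distances = error_distances(ali_pts, truth_pts)
rates = sdr(distances)
```

The same `sdr` function applied to absolute measurement errors instead of error distances would give the SMR.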

Statistical Analysis

Basic descriptive statistics including the mean, standard deviation (SD), and percentages were computed. Intraclass correlation coefficients (ICC) were used to evaluate reliability. To visualize and compare reliability, scattergrams with 95% confidence ellipses were generated, showing differences between ALI and human examiner 2, as well as differences between human examiners 1 and 2. Data were analyzed using the Statistical Package for the Social Sciences (IBM Corp, Armonk, New York) and the R language (R Foundation for Statistical Computing, Vienna, Austria).

RESULTS

The ALI system demonstrated perfect consistency, with an ICC of 1 when the same CBCT images were processed twice. The interexaminer reliability for landmark location between the two calibrated human examiners was excellent, with ICCs ranging from 0.9 to 1, except for the x-coordinate of left Orbitale (ICC = 0.7). The interexaminer reliability between ALI and a human examiner (human examiner 2) ranged from 0.9 to 1 for the y and z coordinates, and from 0.5 to 1 for the x coordinates. In general, the landmarks that exhibited large differences between human examiners also showed considerable differences between the human examiner and ALI (Figure 2). However, there were exceptions such as Porion, Condylion, and the maxillary first molar cusp: the x-coordinates for Porion demonstrated the lowest reliability, with ICC values ranging from 0.5 to 0.6, compared to an ICC of 0.9 between the two calibrated human examiners. The x-coordinates of both Condylion and the maxillary first molar cusp exhibited lower reliability (ICC = 0.7) compared to the ICC >0.9 observed between human examiners.

Figure 2. Scatterplots of (ALI coordinates – human examiner 2 coordinates) (red), and (human examiner 1 coordinates – human examiner 2 coordinates) (blue), with 95% confidence ellipses in different planes of view. Larger ellipses indicate greater variability and, thus, lower reliability. For bilateral landmarks, the right-side landmark is presented. Positive values on the x, y, and z axes indicate that the ALI (red) or human examiner 1 (blue) placed landmarks medially, posteriorly, and superiorly compared to the locations identified by human examiner 2. (A), Nasion; (B), Gonion; (C), Lower incisal crown; (D), Lower lip.


The overall accuracy of the ALI system in comparison to the established ground truth is presented in Table 2. ALI achieved high accuracy, with a mean absolute error (MAE) <2 mm for all landmarks except Porion, Gonion, and Stomion Superius. The MAE for the 31 landmarks was 1.35 mm on the x-axis, 0.72 mm on the y-axis, and 0.74 mm on the z-axis, resulting in an overall MAE of 0.94 mm. The mean error distance between the ALI system and ground truth was 1.99 ± 1.26 mm, with 95% of the landmarks showing a SDR within a 3 mm error distance margin. Notably, Nasion and Sella landmarks showed exceptional accuracy, achieving SDRs of 91% and 92% within a 2 mm error distance range (Table 3). High accuracy was also observed for the upper and lower incisor edges, attaining 100% and 99% SDRs within 2 mm, whereas the lower incisor root apex showed a 70% SDR within 2 mm (Table 3). Right and left Porion exhibited the least accuracy, with mean errors of 3.76 ± 1.83 mm and 3.38 ± 1.36 mm, respectively (Table 2).

Table 2. Mean Absolute Error at x, y, and z Coordinates for Each Landmark and Mean Error Distance for Each Landmark. Errors Calculated Using the Average of Human Examiner Coordinates as the Reference Standard
Table 3. Successful Detection Rate (SDR) for ≤1, ≤1.5, ≤2, and ≤3 mm Range Criteria for Landmark Location

Cephalometric measurement errors are presented in Table 4. The MAEs for conventional linear and angular cephalometric measurements were 0.93 ± 0.92 mm and 0.89 ± 0.89 degrees, respectively. In the conventional linear measurement category, PFH had the greatest MAE of 1.52 ± 1.30 mm and 65% accuracy within 2 mm, whereas all other measurements showed MAEs of approximately 1 mm. The two transverse measurements evaluated in this study, intercondylar width and mandibular width, had MAEs of 3.32 ± 1.61 mm and 2.74 ± 1.03 mm, respectively. The measurements with higher reliability, which exhibited smaller differences between human examiners, also demonstrated higher reliability between a human examiner and the ALI system (Figure 3).

Figure 3. Bland-Altman plots presenting smaller variance (A) and larger variance (B) in measurements. Red represents (ALI – human examiner 2); Blue represents (human examiner 1 – human examiner 2). The solid line indicates the mean difference (bias); the dotted lines show the upper and lower limits of agreement. Wider limits of agreement indicate greater variability and, thus, lower reliability.
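The bias and limits of agreement plotted in a Bland-Altman analysis can be computed as in the short Python sketch below. The article does not state how its limits were calculated; the conventional bias ± 1.96 SD of the differences is assumed here, and the function name and width values are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(method_a: list[float],
                 method_b: list[float]) -> tuple[float, float, float]:
    """Return (bias, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the paired differences
    # Assumption: conventional 95% limits of agreement, bias +/- 1.96 SD.
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical intercondylar widths in mm (not study data).
ali_width = [103.0, 99.5, 101.2, 104.1]
human_width = [100.0, 96.0, 98.5, 100.4]
bias, lower, upper = bland_altman(ali_width, human_width)
```

In this made-up example both limits lie above zero, which is the pattern expected when one method consistently reads larger than the other.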


Table 4. Mean Absolute Error for Each Measurement and Successful Measurement Rate (SMR) Within ≤1, ≤1.5, ≤2, and ≤3 (Degrees or mm) Range Criteria. Errors Calculated Using the Average of Human Examiner Measurements as the Reference Standard

DISCUSSION

The ALI system reduced landmark identification time to 1–2 minutes, compared to the 15 minutes required for manual location on CBCT by calibrated human operators. The ALI system nearly matched human accuracy, with 95% of landmarks within a 3 mm error. This surpassed the results of previous research by Shahidi et al.7 and was comparable to those of Ghowsi et al.14 Another distinction in the current study was the demonstrated high accuracy of the ALI system in locating dental landmarks. Notably, the maxillary and mandibular incisal edges achieved SDRs of 100% and 99%, respectively, within a 2-mm error distance (Table 3).

Errors in landmark identification can stem from differences in the operational landmark definition between the software and human examiners. For example, ALI places Condylion along the mandibular profile line, whereas human examiners typically mark it at the highest point of the condyle, usually medial to the ALI location. This discrepancy leads to consistent overestimation of intercondylar width by the ALI system (Figure 3). In 2D cephalometry, this is less critical since Condylion is used to measure mandibular and ramus lengths. In contrast, for 3D evaluations of transverse dimensions, such discrepancies become significant. Significant transverse errors were also found for Porion, which ALI placed more medially on the temporal bone than human experts. However, these errors did not affect construction of the FH plane since anterior-posterior and vertical errors were smaller.

Landmark ambiguity contributes to identification errors. The definitions and anatomy allow for a degree of interpretation, leading to individual variance in landmark identification.21,22 For instance, substantial errors across x, y, and z coordinates were observed for Gonion (Table 2). These errors may originate from the ALI training data from human experts, reflecting the challenge in defining Gonion on the broad curve of the mandible.21,23,24 The difficulty in locating Gonion is illustrated in Figure 2B, which shows large differences between human examiners. Such discrepancies underscore the need for a clear consensus on definitions among human experts.25

Angular measurement errors tend to be larger when a line segment intersects the broader aspect of the error ellipse.23,25 Despite substantial landmark location errors for Gonion, the mandibular plane angle (FMA) was accurate, as the larger error distribution dimension for Gonion aligned with the Gonion–Menton line (Figure 2B). The PFH measurement, a length measurement, exhibited greater error due to the larger error in Gonion landmark location. PFH showed larger differences both between human examiners and between the manual and ALI systems (Figure 3).
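The geometric point about error direction can be illustrated with a small Python example in the midsagittal plane. The Gonion and Menton coordinates and the 2-mm displacement are hypothetical, and the FH plane is assumed to be horizontal in the reoriented frame.

```python
import math

Point2D = tuple[float, float]  # (y, z) in the midsagittal plane, mm


def plane_angle_deg(menton: Point2D, gonion: Point2D) -> float:
    """Angle between the Menton->Gonion line and the horizontal (FH) direction."""
    return math.degrees(math.atan2(gonion[1] - menton[1], gonion[0] - menton[0]))


# Hypothetical landmark positions (not study data).
menton = (-70.0, -110.0)
gonion = (-5.0, -80.0)

dy, dz = gonion[0] - menton[0], gonion[1] - menton[1]
length = math.hypot(dy, dz)
along = (dy / length, dz / length)     # unit vector along the Gonion-Menton line
across = (-dz / length, dy / length)   # unit vector perpendicular to that line

shift = 2.0  # mm of landmark error
gonion_along = (gonion[0] + shift * along[0], gonion[1] + shift * along[1])
gonion_across = (gonion[0] + shift * across[0], gonion[1] + shift * across[1])

base = plane_angle_deg(menton, gonion)
change_along = plane_angle_deg(menton, gonion_along) - base
change_across = plane_angle_deg(menton, gonion_across) - base
length_change_along = math.dist(menton, gonion_along) - length
```

In this sketch the same 2-mm error leaves the angle unchanged when it lies along the line but changes it by `atan(2 / length)` when it lies across the line, whereas the along-line error changes the Gonion-to-Menton length by the full 2 mm.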

There is no definitive gold standard for identifying landmark positions in live patients (ie, directly identifying a site on the bone).21 Although human examiners were calibrated, variations in landmark identification persist, leading to discrepancies in cephalometric measurements. Without a definitive ground truth, the intra- and interexaminer reliability patterns become essential for validating new methods.26,27 This study showed that interexaminer reliability between the manual and ALI method was comparable to that between calibrated human examiners. Most variance arose from ambiguous landmark definitions, as shown by confidence ellipses (Figure 2) and limits of agreement (Figure 3). Although location of x-coordinates was less reliable between the manual and ALI methods, it did not significantly impact sagittal and vertical measurement errors. However, interpreting the intercondylar width requires caution, as the measurement was biased.

Improvements in ALI require refined 3D landmark definitions and expanded training datasets. Subjective interpretations by human experts lead to discrepancies, a challenge for ALI software development. Regarding measurements and analysis, if a landmark is to be used to evaluate a certain dimension, it should be shown to have relatively good consistency and precision.21 For example, a clearer definition of Condylion is necessary before intercondylar width can significantly contribute to transverse analysis.

The software used in this study detected mandibular midline landmarks on the midsagittal plane, which may result in large x-coordinate errors in patients with asymmetry (Figure 2D). These x-coordinate errors were observed in both the mandibular skeletal and soft tissue midline landmarks (Table 2). This indicated the need for caution when analyzing facial asymmetry and the importance of developing a more sophisticated detection algorithm for midline points. In addition, incorporating additional bilateral landmarks such as jugum, zygoma, first molars, and canines could enhance transverse analysis, which is a key advantage of 3D imaging.

This study provided an error range for each landmark in each dimension and the consequent impact on cephalometric measurement accuracy. Although ALI holds promise in orthodontic practice, it is imperative for clinicians to recognize the landmarks and measurements prone to substantial errors. A hybrid approach that combines ALI with human review for error-prone landmarks seems advisable for the time being. Additionally, developing an algorithm to highlight significant outliers for the clinician to review, and allowing manual adjustments, will be essential features for ALI software. The path forward requires more rigorous and standardized landmark definitions and the cautious use of less accurate landmarks and measurements.

CONCLUSIONS

  • ALI achieved an SDR of 87% for all 31 landmarks within a 2-mm margin of error.

  • Interexaminer reliability between a human and the ALI system was comparable to that between calibrated human examiners, with most variance arising from ambiguous definitions of the landmarks.

  • The MAE of all measurements remained under 2 degrees or 2 mm for commonly used cephalometric measurements. The intercondylar width and mandibular width were not as accurate with MAE >2 mm.

  • Clinicians must recognize potential inaccuracies in landmarks and measurements susceptible to significant errors, and more rigorous and standardized definitions are required to enhance ALI use in orthodontics.

REFERENCES

  • 1. Bao H, Zhang K, Yu C, et al. Evaluating the accuracy of automated cephalometric analysis based on artificial intelligence. BMC Oral Health. 2023;23:191.
  • 2. Pittayapat P, Limchaichana-Bolstad N, Willems G, Jacobs R. Three-dimensional cephalometric analysis in orthodontics: a systematic review. Orthod Craniofac Res. 2014;17:69–91.
  • 3. Li C, Teixeira H, Tanna N, et al. The reliability of two- and three-dimensional cephalometric measurements: a CBCT study. Diagnostics (Basel). 2021;11.
  • 4. Hassan B, Nijkamp P, Verheij H, et al. Precision of identifying cephalometric landmarks with cone beam computed tomography in vivo. Eur J Orthod. 2013;35:38–44.
  • 5. Yue W, Yin D, Li C, Wang G, Xu T. Automated 2-D cephalometric analysis on X-ray images by a model-based approach. IEEE Trans Biomed Eng. 2006;53:1615–1623.
  • 6. Montufar J, Romero M, Scougall-Vilchis RJ. Hybrid approach for automatic cephalometric landmark annotation on cone-beam computed tomography volumes. Am J Orthod Dentofacial Orthop. 2018;154:140–150.
  • 7. Shahidi S, Bahrampour E, Soltanimehr E, et al. The accuracy of a designed software for automated localization of craniofacial landmarks on CBCT images. BMC Med Imaging. 2014;14:32.
  • 8. Gupta A, Kharbanda OP, Sardana V, Balachandran R, Sardana HK. A knowledge-based algorithm for automatic detection of cephalometric landmarks on CBCT images. Int J Comput Assist Radiol Surg. 2015;10:1737–1752.
  • 9. Mohammad-Rahimi H, Nadimi M, Rohban MH, Shamsoddin E, Lee VY, Motamedian SR. Machine learning and orthodontics, current trends and the future opportunities: a scoping review. Am J Orthod Dentofacial Orthop. 2021;160:170–192 e174.
  • 10. Hwang HW, Park JH, Moon JH, et al. Automated identification of cephalometric landmarks: Part 2-Might it be better than human? Angle Orthod. 2020;90:69–76.
  • 11. Moon JH, Hwang HW, Yu Y, Kim MG, Donatelli RE, Lee SJ. How much deep learning is enough for automatic identification to be reliable? Angle Orthod. 2020;90:823–830.
  • 12. Dot G, Rafflenbeul F, Arbotto M, Gajny L, Rouch P, Schouman T. Accuracy and reliability of automatic three-dimensional cephalometric landmarking. Int J Oral Maxillofac Surg. 2020;49:1367–1378.
  • 13. Gupta A, Kharbanda OP, Sardana V, Balachandran R, Sardana HK. Accuracy of 3D cephalometric measurements based on an automatic knowledge-based landmark detection algorithm. Int J Comput Assist Radiol Surg. 2016;11:1297–1309.
  • 14. Ghowsi A, Hatcher D, Suh H, et al. Automated landmark identification on cone-beam computed tomography: accuracy and reliability. Angle Orthod. 2022;92:642–654.
  • 15. Serafin M, Baldini B, Cabitza F, et al. Accuracy of automated 3D cephalometric landmarks by deep learning algorithms: systematic review and meta-analysis. Radiol Med. 2023;128:544–555.
  • 16. Gillot M, Miranda F, Baquero B, et al. Automatic landmark identification in cone-beam computed tomography. Orthod Craniofac Res. 2023;26:560–567.
  • 17. Kielczykowski M, Kaminski K, Perkowski K, Zadurska M, Czochrowska E. Application of artificial intelligence (AI) in a cephalometric analysis: a narrative review. Diagnostics (Basel). 2023;13.
  • 18. Ahn J, Nguyen TP, Kim YJ, Kim T, Yoon J. Automated analysis of three-dimensional CBCT images taken in natural head position that combines facial profile processing and multiple deep-learning models. Comput Methods Programs Biomed. 2022;226:107123.
  • 19. Hwang HW, Moon JH, Kim MG, Donatelli RE, Lee SJ. Evaluation of automated cephalometric analysis based on the latest deep learning method. Angle Orthod. 2021;91:329–335.
  • 20. Lindner C, Wang CW, Huang CT, Li CH, Chang SW, Cootes TF. Fully automatic system for accurate localisation and analysis of cephalometric landmarks in lateral cephalograms. Sci Rep. 2016;6:33581.
  • 21. Schlicher W, Nielsen I, Huang JC, Maki K, Hatcher DC, Miller AJ. Consistency and precision of landmark identification in three-dimensional cone beam computed tomography scans. Eur J Orthod. 2012;34:263–275.
  • 22. Lee H, Cho JM, Ryu S, et al. Automatic identification of posteroanterior cephalometric landmarks using a novel deep learning algorithm: a comparative study with human experts. Sci Rep. 2023;13:15506.
  • 23. Baumrind S, Frantz RC. The reliability of head film measurements. 2. Conventional angular and linear measures. Am J Orthod. 1971;60:505–517.
  • 24. Baumrind S, Frantz RC. The reliability of head film measurements. 1. Landmark identification. Am J Orthod. 1971;60:111–127.
  • 25. Park J, Baumrind S, Curry S, Carlson SK, Boyd RL, Oh H. Reliability of 3D dental and skeletal landmarks on CBCT images. Angle Orthod. 2019;89:758–767.
  • 26. Moon J-H, Lee J-M, Park J-A, Suh H, Lee S-J. Reliability statistics every orthodontist should know. Semin Orthod. 2024;30:45–49.
  • 27. Moon JH, Shin HK, Lee JM, et al. Comparison of individualized facial growth prediction models based on the partial least squares and artificial intelligence. Angle Orthod. 2024;94:207–215.
Copyright: © 2025 by The EH Angle Education and Research Foundation, Inc.
Contributor Notes

Corresponding author: Dr Heeyeon Suh, Department of Orthodontics, Arthur A. Dugoni School of Dentistry, University of the Pacific, San Francisco, California 94103, USA (e-mail: hsuh1@pacific.edu)
Received: 23 Dec 2024
Accepted: 10 Mar 2025