Accuracy and reliability of Keynote for tracing and analyzing cephalometric radiographs
ABSTRACT
Objectives
To evaluate the reliability and accuracy of Keynote for tracing and analyzing cephalograms in comparison to Quick Ceph Studio.
Materials and Methods
This was a cross-sectional study that utilized lateral cephalometric digital images (radiographs) from 49 patients. The study site was the Dental Radiology unit in the School of Dentistry of the Muhimbili University of Health and Allied Sciences (MUHAS), Dar es Salaam, Tanzania. Cephalograms were imported to Quick Ceph Studio and then to Keynote for analysis. Minimum, maximum, mean, standard deviation, and mean difference were used to describe the data. Agreement between the two techniques was assessed with Bland-Altman plots, linear regression, and interexaminer reliability tests. Statistical significance was set at P < .05, and 95% CIs were estimated for the outcomes in the study groups.
Results
The majority of the mean values obtained from Quick Ceph were greater (P < .05) than those obtained from Keynote. According to the Bland-Altman plots, all measurements were within the limits of agreement except for five linear variables. The interexaminer reliability test showed no agreement between the two instruments for any linear parameter except LAFH:TAFH, whereas all angular measurements showed good to excellent agreement (ICC: 0.75 to 0.97) between the methods.
Conclusions
The measurements obtained with the Keynote software were found to be clinically reliable since the limits did not exceed the maximum acceptable difference between the methods. The two software programs were considered to be in agreement and can be used interchangeably.
INTRODUCTION
Currently, the analysis of cephalometric radiographs is commonly performed by a computer-assisted method that may involve either manual or automatic identification of cephalometric landmark points on the monitor.1 Previous literature reported that, as long as the landmark points are identified manually, a computerized cephalometric analysis does not introduce more measurement error than the traditional tracing method.2,3 Additionally, the use of computers is expected to minimize errors caused by operator fatigue and to provide effective evaluation with a high rate of reproducibility.4 Commonly used preprogrammed cephalometric analysis software includes Quick Ceph, Dolphin Imaging, Nemoceph, and Vistadent.5 However, the availability and affordability of these commercial programs remain questionable,6 as some of them are too expensive for many clinicians.
Keynote, a program developed by Apple for presenting visual data, is freely available and has been reported to be a cost-effective alternative for performing cephalometric analysis.7 Therefore, the null hypothesis of this study was that there would be no difference between the analysis performed with Quick Ceph and that performed with Keynote. However, in clinical orthodontics, the effectiveness of any proposed cephalometric tracing software needs to be assessed for accuracy so that clinicians can select appropriate methods and tools for analysis.8 For these reasons, this study aimed to evaluate the accuracy and reliability of Keynote software for tracing and analyzing cephalometric radiographs compared with Quick Ceph Studio.
MATERIALS AND METHODS
This cross-sectional study was approved (MUHAS-REC-05-2023-1654) by the Research and Publications Committee of the Muhimbili University Senate. Digital lateral cephalograms of 49 patients (24 females and 25 males) were obtained from the Dental Radiology unit of the School of Dentistry (MUHAS). The images were taken with a cone beam computed tomography (CBCT) unit (X-VIEW 3D Pan Ceph, Trident S.r.l., Italy) according to the standard radiation regulations9 of Tanzania. The cephalograms were selected using a systematic randomization method and were then categorized by gender. To reduce random error, the following exclusion criteria were used: craniofacial abnormality, missing incisors, presence of impacted or unerupted teeth, and radiographs of poor quality. The inclusion criteria were: lateral cephalograms of patients with no crowding, presence of all teeth (third molars may or may not be present), and no history of orthodontic treatment. The cephalograms were imported to Quick Ceph Studio (version 5.2.6; Quick Ceph Systems, Inc., FL, USA) and then to Keynote for macOS (version 14.1; Apple Inc., USA) for analysis (Figure 1). Magnification correction for each program was first performed based on the known 10-mm distance between two fixed points on the cephalostat rod visible in the cephalogram. Thirteen anatomical landmarks were selected and manually identified using a cursor, followed by calculation of 10 angular and 6 linear measurements with both software applications. All measurements were taken by a single operator, with a maximum of 10 cephalograms per day (using Quick Ceph first). After an interval of at least 1 week, the same images were remeasured using the Keynote application.
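As a rough illustration of the calibration step described above, the sketch below (not the authors' code; the point names and pixel coordinates are hypothetical) converts on-screen distances to millimetres using the known 10-mm reference distance on the cephalostat rod.

```python
# Illustrative sketch only (not the authors' code): converting on-screen
# distances to millimetres with the 10-mm calibration distance between two
# fixed points on the cephalostat rod. All coordinates are hypothetical.
import math

def distance(p, q):
    """Euclidean distance between two (x, y) points in pixels."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Two fixed calibration points digitized on the cephalostat rod (pixels).
rod_a, rod_b = (512.0, 130.0), (512.0, 186.4)
KNOWN_ROD_MM = 10.0                      # known physical separation of the rod points

mm_per_pixel = KNOWN_ROD_MM / distance(rod_a, rod_b)

# Any linear measurement is then scaled from pixels to millimetres,
# e.g. a hypothetical sella-nasion (anterior cranial base) distance.
sella, nasion = (640.2, 310.5), (1045.8, 355.1)
anterior_cranial_base_mm = distance(sella, nasion) * mm_per_pixel
print(f"Anterior cranial base: {anterior_cranial_base_mm:.1f} mm")
```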


The Kolmogorov-Smirnov test was used to assess the normality of the data distribution. Intra- and interexaminer reliability for each measurement was assessed using the intraclass correlation coefficient (ICC). Minimum, maximum, mean, standard deviation, and mean difference were used to describe the data. Systematic bias between the Quick Ceph and Keynote software was assessed with a paired t-test. Bland-Altman plots,10 interexaminer reliability, and linear regression tests were applied to assess agreement between the two measurement techniques. Differences greater than 2° or 2 mm for the angular and linear measurements, respectively, were considered clinically relevant.11 Statistical significance was set at P < .05, and 95% confidence intervals were estimated for the outcomes in the study group. Data analysis was conducted using RStudio Desktop for macOS 12+ (Posit Software, Boston, Mass, USA).
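The analysis itself was run in RStudio; purely as a language-agnostic illustration of the agreement statistics described above, the Python sketch below computes the bias, 95% limits of agreement, and paired t-test for a single variable using synthetic stand-in data (all values are hypothetical, not study data).

```python
# Illustrative Bland-Altman / paired t-test sketch (synthetic data, not study data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
quick_ceph = rng.normal(82.0, 3.0, size=49)              # e.g. SNA in degrees (synthetic)
keynote = quick_ceph + rng.normal(0.2, 0.8, size=49)     # second method with a small offset

diff = quick_ceph - keynote                  # paired differences
mean_pair = (quick_ceph + keynote) / 2       # paired means (x-axis of a Bland-Altman plot)

bias = diff.mean()                           # systematic bias
sd = diff.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd   # 95% limits of agreement

t_stat, p_value = stats.ttest_rel(quick_ceph, keynote)   # paired t-test for systematic bias

# Clinical relevance threshold: 2 degrees (angular) or 2 mm (linear).
within_clinical_limit = max(abs(loa_low), abs(loa_high)) <= 2.0
print(f"bias={bias:.2f}, LoA=({loa_low:.2f}, {loa_high:.2f}), "
      f"paired t-test P={p_value:.3f}, within 2-unit limit: {within_clinical_limit}")
```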
RESULTS
Forty-nine cephalograms were analyzed, including 24 from males and 25 from females. The sample comprised 29 Class I, 12 Class II, and 8 Class III skeletal patterns. The ICC demonstrated that intra-examiner reliability was very good to excellent (0.86 to 0.99) for the Quick Ceph measurements and moderate to excellent (0.73 to 0.99) for the Keynote measurements (Table 1). The maximum differences were 2.56° (interincisal angle) and 15.52 mm (TAFH) for the angular and linear measurements, respectively (Table 2). The majority of the mean values obtained from Quick Ceph were greater than those obtained from Keynote, and a paired t-test showed a significant difference (P < .05) between the two methods for all parameters except five angular variables: the saddle angle, SNA, SNB, ANB, and FMIA (Table 3). Additionally, linear regression analysis revealed a significant (P < .05) proportional bias between the two methods.



According to the Bland-Altman plots (Figures 2 and 3), all measurements were within the limits of agreement, with bias close to zero, except for five linear variables: anterior cranial base, posterior cranial base, ramus height, LAFH, and TAFH. The mean differences of these parameters drifted away from zero, indicating a systematic bias between the two methods. Consistent with this, the interexaminer reliability test (Table 4) showed no agreement between the two instruments for any linear parameter except LAFH:TAFH, whereas all angular measurements revealed good to excellent agreement between the two approaches (ICC: 0.75 to 0.97). In addition, most data points were randomly distributed around the mean difference lines, suggesting good agreement (Figures 2 and 3).





Based on the clinically relevant difference, the error size for all measurements was within the acceptable range, and the limits for most variables did not exceed the maximum acceptable difference between the methods (2° and 2 mm for angular and linear measurements, respectively). All measurements that revealed a significant difference in the paired t-test were also within the limits of agreement.
DISCUSSION
The use of computerized cephalometric analysis techniques has been shown to minimize errors resulting from manually drawing lines and measuring with a ruler and protractor in the conventional method.2 The current study aimed to evaluate the reliability and accuracy of Keynote software for tracing and analyzing cephalograms in comparison to Quick Ceph software. Quick Ceph was used as the standard tool because previous studies12,13 had shown that it produces adequate angular and linear measurements. Keynote, on the other hand, comes pre-installed and freely available on all Apple computers and has recently been proposed as a cost-effective alternative for digital cephalometric analysis.7 This was the first study to analyze its performance and verify its accuracy for clinical use.
The study sample included several sagittal and vertical skeletal patterns, capturing the variation in vertical and anteroposterior jaw relationships that may be encountered when performing cephalometric analysis. Each software application achieved sufficient reliability when tested at different intervals. As intra-examiner errors are less frequent than interexaminer errors,3 all measurements were undertaken by the same investigator to avoid possible error between operators and to achieve the required standard.
To analyze agreement between methods, various studies have used the Pearson correlation coefficient, which measures association rather than agreement; this statistical technique can therefore be misleading and inappropriate.14 In the present study, the dataset was instead analyzed with a graphical technique and an appropriate use of regression to determine the 95% limits of agreement and confidence intervals and to quantify the disagreement between the two measurement techniques.15,16 According to the Bland-Altman plots, all measurements were in acceptable agreement. Even the variables that revealed a significant difference in the paired t-test (Table 3) were within acceptable limits (Figures 2 and 3), since the decision on what constituted acceptable agreement was a predetermined clinical judgment. Thus, although there was a significant difference in some parameters and wide limits of agreement in the Bland-Altman plots between the instruments, clinically, the analysis can be carried out with both software programs.
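To make the regression step concrete, the sketch below (synthetic, hypothetical values; not the study data or the authors' R code) regresses the paired differences on the paired means, a common way to test for proportional bias in a Bland-Altman analysis.

```python
# Illustrative proportional-bias check (synthetic data, not study data):
# regress paired differences on paired means; a slope significantly different
# from zero suggests the disagreement grows with the size of the measurement.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mean_pair = rng.normal(115.0, 6.0, size=49)                # e.g. TAFH in mm (synthetic)
diff = 0.15 * mean_pair + rng.normal(-17.0, 1.5, size=49)  # differences with a built-in slope

slope, intercept, r_value, p_slope, std_err = stats.linregress(mean_pair, diff)
verdict = "proportional bias" if p_slope < 0.05 else "no proportional bias detected"
print(f"slope={slope:.3f}, P={p_slope:.4f} -> {verdict}")
```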
The interchangeability of Keynote and Quick Ceph cannot be generalized across all parameters. Although angular measurements such as ANB fell within clinically acceptable ranges of agreement, some linear measurements, particularly those involving facial height or cranial base dimensions, demonstrated proportional bias that limits their direct comparability. Clinicians should exercise caution and validate critical measurements manually when using Keynote for precision-sensitive applications. The present findings concurred with previous literature, which concluded that differences between measurements derived from cephalometric radiographs with two different digitized methods were statistically significant but clinically acceptable.3 Additionally, previous literature reported that the findings obtained from the same patient may vary as a function of the cephalometric measurement approach used.17
Although the ICC (interexaminer reliability) showed that the major discrepancies in agreement between the Quick Ceph and Keynote software were confined to the linear measurements, especially the posterior cranial base, ramus height, anterior cranial base, LAFH, and TAFH, which showed a poor level of agreement, linear regression analysis showed a different scenario: it revealed a significant proportional bias between the two techniques for both angular and linear variables. The lower reliability of some measurements could have been due to difficulty in identifying some landmark points. For instance, some studies indicated that identification of the Porion, Orbitale, Articulare, and Gonion points on lateral cephalograms can be challenging,18,19 and any measurement based on the Frankfort horizontal plane might be erroneous,20 which may explain the lower reliability of the linear measurements in the present study. Similar findings for linear values were reported by Kumar et al.6 using Nemoceph and Foxit PDF Reader, and by Celik et al.8 using Vistadent software vs the Jiffy orthodontic evaluation program.
The accuracy of software-based cephalometric analysis has been extensively evaluated in previous studies aiming to improve efficiency and precision compared with traditional manual methods.4 Automated and semiautomated software tools, including the one assessed in this study, have demonstrated reliability comparable to manual methods for most parameters,21 particularly angular measurements such as ANB. However, discrepancies are often observed in linear measurements, likely due to variation in landmark identification.11 These findings were in agreement with those of the current study, indicating that, while Keynote is a cost-effective alternative, its accuracy must be carefully evaluated, particularly for parameters requiring high precision.
Modern cephalometric software tools, while showing significant proportional biases in some angular and linear parameters, typically provide measurements within clinically acceptable ranges. This makes them reliable alternatives to manual methods for routine orthodontic and surgical assessments. However, clinicians should be cautious of systematic bias in specific linear measurements, particularly in cases requiring high precision, such as craniofacial anomaly assessments or detailed growth monitoring.22 Literature suggests that such biases may arise from differences in how software tools and manual methods interpret landmarks, particularly in complex or ambiguous anatomical regions.23 For example, cranial base measurements are sensitive to errors in landmark identification due to overlapping structures, whereas facial height parameters rely on clear differentiation of anatomical boundaries, which software tools may not always accurately identify.24 Understanding the acceptable range of error for each parameter is vital, emphasizing the need for software tools to be calibrated and validated against manual methods with clinically defined tolerances for meaningful interpretation.25
Although manual cephalometric tracing is often considered the gold standard due to the ability of experienced clinicians to adjust for individual anatomical variations or radiographic artifacts,26 it remains time-consuming and susceptible to inter- and intra-operator variability.6,24 Studies have shown that most software tools provide reliable measurements for both linear and angular parameters, though their accuracy may vary depending on the clarity of landmarks and the precision of the algorithms.27 For instance, angular measurements such as SNA and SNB show high consistency due to reliance on well-defined craniofacial landmarks. Conversely, discrepancies in parameters like the Saddle angle and FMIA highlight the susceptibility of certain variables to software-dependent errors, likely arising from differences in landmark identification or scaling.28
Overall, these findings underscore the importance of understanding the limitations of cephalometric software tools and their alignment with clinical needs. Although these tools can significantly enhance efficiency and consistency, ensuring their accuracy and addressing systematic biases in critical measurements are essential for safe and effective clinical application.
Limitations
The reliability of the methods was evaluated by a single investigator. To mitigate this limitation, additional reliability assessments using multiple investigators are planned in future studies. This would help ensure that the results are not influenced by individual bias and provide a more robust evaluation of the methods.
CONCLUSIONS
The findings of the present study revealed some areas where the two software applications were inconsistent in the analysis, particularly in terms of linear measurements and systematic bias. Consequently, the null hypothesis stating that there is no statistically significant difference between the analyses carried out by Quick Ceph and Keynote was rejected.
Measurements obtained with Keynote software used in the current study were shown to be clinically reliable.
Since the limits did not exceed the maximum acceptable difference between methods, the two software programs are considered to be in agreement and can be used interchangeably.

Figure 1. Cephalometric analysis using Keynote software.

Figure 2. Bland-Altman plots presenting angular variables in each Quick Ceph and Keynote method.

Figure 3. Bland-Altman plots for the linear variables in each Quick Ceph and Keynote method.