Reliability assessment of craniofacial and airway measurements: a comparative study between multidetector computed tomography and cone-beam computed tomography
ABSTRACT
Objectives
To compare the intra- and inter-examiner reliability of multidetector computed tomography (MDCT) and cone-beam computed tomography (CBCT) for craniofacial/airway measurements performed by six examiners using Amira and Dolphin software analyses.
Materials and Methods
CBCT and MDCT scan files of five adults and one dry skull were duplicated and randomly numbered. Six orthodontic residents imported these files into two software programs, oriented the images, set thresholds, and performed 26 measurements. Intra- and inter-examiner reliabilities were determined using the intraclass correlation coefficient (ICC) and presented with scatterplots.
Results
Variables including anterior nasal width, posterior nasal width, frontomaxillary suture right-to-left, inner nasal contour point right-to-left, and minimum cross-sectional area in the oropharynx showed “moderate” to “substantial” intra- or inter-examiner agreement. Amira provided relatively reliable airway assessment, while Dolphin showed standard deviations 10 to 30 times larger for volumetric airway measurements. MDCT scans significantly reduced airway volume/area measurements compared to CBCT, except for intraoral airway volume.
Conclusions
Unreliable skeletal measurements and the low reliability of Dolphin for airway analysis discourage using CT to quantitatively correlate changes in craniofacial structures with airway dimensions.
INTRODUCTION
Multi-detector (multi-slice) computed tomography (MDCT; MSCT) and cone-beam CT (CBCT) are prevalent CT technologies for head and neck evaluation, albeit with distinct image-acquisition processes (including beam shape, x-ray generator, and detecting system).1 Popularity of CBCT in maxillofacial imaging stems from its lower radiation dose, cost-effectiveness, and ease of installation. Applications extend to various medical fields, including orthodontics,2 where it aids in assessing upper airway morphology, treatment effects, and disorders such as obstructive sleep apnea (OSA).3 Concerns about cumulative radiation doses from repeated CBCT scans underscore the need for justifying routine use.4
MDCT quantifies radiodensity on the Hounsfield unit (HU) scale and provides an absolute value for each tissue type. Due to inherent CBCT acquisition limitations,5 most studies question the ability to convert CBCT gray values (GV) to MDCT HU.6 Many CBCT manufacturers do not calibrate GV along a pseudo-HU scale and, even after recalibration, significant errors persist in quantitative use of CBCT GV.7 MDCT excels in contrast detectability, discerning contrast differences of about 1 HU, while CBCT typically discerns contrast only within a 10 HU range, limiting its ability to distinguish between different soft-tissue structures. Yamashina et al. found that CBCT produced CT values distinct from those of MDCT for air, water, and soft tissues.8 Nevertheless, accurate CBCT measurement of air spaces in a phantom8 suggests potential precision in oropharyngeal airway dimension assessment.
In craniofacial studies using human cadavers/skulls or phantoms/prototypes, most research found no significant difference between physical linear measurements and those obtained through MDCT and CBCT scans.9–11 However, Naser and Mehr reported differences in linear measurements on hemi-mandible specimens,12 and Chen et al. found variations in volumetric and cross-sectional area measurements across different MDCT and CBCT scanners.13 Since these studies used stationary objects, the reported accuracy may be higher than in clinical scenarios, in which the longer acquisition time of CBCT (20 to 40 seconds) increases the risk of motion artifact compared to MDCT scans acquired within a single breath-hold.
Given the inherent weaknesses of CBCT and ongoing interest in clinical research, determining its reliability as a quantitative airway assessment tool is crucial. A systematic review on upper pharyngeal airway assessment using CBCT identified only five high-quality studies out of 42,14 with concerns about overrated reliability owing to the limited number of studies evaluating inter-examiner reliability and the absence of manual orientation of scanned images. This previous evidence leaves the reliability of CBCT for quantitative measurements unclear, prompting the present study to compare the intra- and inter-examiner reliability of MDCT and CBCT using two software programs (Amira and Dolphin) for craniofacial/airway parameters.
MATERIALS AND METHODS
Samples and Exposure
This retrospective study, approved by the Research Ethics Committee of National Taiwan University Hospital (approval number: 202201101RINA), utilized data from patients who underwent two head and neck CT scans between 2016 and 2021, sourced from the National Taiwan University Hospital-integrative Medical Database (NTUH-iMD). Inclusion criteria were: adults with both CBCT and MDCT scans within a year, and no head and neck treatment between scans. All CBCT images were acquired using a 3D Accuitomo 170 (J. Morita MFG. Corp., Kyoto, Japan) with the following settings: 90 kVp, 5 mA, 17 × 12 cm field of view, 0.25 mm voxel size, and an exposure time of 35 seconds over two consecutive scans. MDCT images were acquired using a Somatom Definition AS (Siemens Medical Solutions, Malvern, Pennsylvania, USA): 120 kVp, 260 mA, 1.2 mm slice thickness, 512 × 512 matrix, and an exposure time of 0.5 seconds. Exclusion criteria included: age under 20 years, incomplete craniomaxillofacial skeleton coverage, non-occluded teeth in centric relation, or the use of different CT scanner models. Only five subjects fulfilled the inclusion criteria. The sampling flow chart is detailed in Figure 1. Additionally, a dry skull underwent both CBCT and MDCT scans (SKULL group) as the gold standard. All images were reconstructed and exported as digital imaging and communications in medicine (DICOM) files.



Citation: The Angle Orthodontist 95, 1; 10.2319/022124-131.1
Duplication and Randomization of the DICOM Files
The corresponding author duplicated the MDCT and CBCT DICOM files (five subjects and one dry skull), created 24 files, and randomly assigned code numbers to each. Six orthodontic residents imported these files into two software programs: Dolphin Imaging Version 11.9 Premium (Dolphin Imaging and Management Solutions, Chatsworth, CA) and Amira software (version 2022.1, Thermo Fisher Scientific, Merignac, France), unaware that the data were duplicates. They believed the files represented 20 patients and four dry skulls, not knowing that intra-examiner reliability was being evaluated. Before the study, the examiners were trained on importing DICOM files, orienting images, and measuring the 26 variables, and were given a step-by-step illustrated handout.
Image Processing and Dimensional Measurements
After importing the CT data into Dolphin or Amira, examiners reoriented head positions. Before measurements, HU calibration/correction was conducted using Amira but not Dolphin, as Dolphin lacked this function.
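Amira's exact calibration procedure is not described here; a common two-point approach rescales gray values linearly so that air reads −1000 HU and water reads 0 HU. A minimal sketch of that idea, with function name and example values hypothetical:

```python
import numpy as np

def calibrate_to_hu(gray, gv_air, gv_water):
    """Linearly rescale scanner gray values (GV) to a pseudo-HU scale,
    anchoring air at -1000 HU and water at 0 HU (two-point calibration)."""
    slope = 1000.0 / (gv_water - gv_air)   # HU per GV unit
    return slope * (np.asarray(gray, dtype=float) - gv_water)

# Hypothetical CBCT that reports air as -950 GV and water as 50 GV
hu = calibrate_to_hu([-950.0, 50.0, 550.0], gv_air=-950.0, gv_water=50.0)
# air maps to -1000 HU, water to 0 HU
```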
The study involved 26 common airway-related measurements categorized into eight linear orthogonal,15–17 six linear 3D surface-rendering,18–20 six lateral cephalometric,21 and six airway analysis parameters.22,23 Table 1 outlines the definition of each measurement. Semi-automatic segmentation was used for airway measurements, with manual selection of boundaries and threshold sensitivity settings. Airway parameters and spinal curvature were not measured in the SKULL group because the dry skull lacked soft tissue and the cervical spine.
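The semi-automatic segmentation described above amounts to selecting voxels below an air threshold inside manually chosen boundaries and converting the voxel count to a volume. A simplified illustration (the threshold and voxel size are arbitrary example values, not the study's settings):

```python
import numpy as np

def airway_volume_mm3(volume_hu, roi_mask, threshold_hu=-400.0, voxel_mm=0.25):
    """Volume (mm^3) of voxels darker than threshold_hu inside the ROI mask."""
    air_voxels = (volume_hu < threshold_hu) & roi_mask
    return float(air_voxels.sum()) * voxel_mm ** 3

# Toy 4x4x4 volume: top half air (-1000 HU), bottom half soft tissue (40 HU)
vol = np.full((4, 4, 4), 40.0)
vol[:2] = -1000.0
v = airway_volume_mm3(vol, np.ones_like(vol, dtype=bool))
# 32 air voxels x (0.25 mm)^3 = 0.5 mm^3
```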

Statistical Analysis
Raw data were entered into an Excel file, and R statistical software (version 4.2.3; https://www.r-project.org) was used for scatterplots and statistical calculations. Intra- and inter-examiner reliability were determined using the intraclass correlation coefficient (ICC)24 with 95% confidence intervals (95% CIs). Reliability was graded as poor (ICC: 0.00 ∼ 0.10), slight (ICC: 0.11 ∼ 0.20), fair (ICC: 0.21 ∼ 0.40), moderate (ICC: 0.41 ∼ 0.60), substantial (ICC: 0.61 ∼ 0.80), almost perfect (ICC: 0.81 ∼ 0.99), or perfect agreement (ICC: 1.00). Regression analysis created prediction models for each variable, considering imaging modality (MDCT or CBCT), measurement software (Amira or Dolphin), subject type (skull or patient), and specific examiner (1 to 6). The model was expressed mathematically as:

Yi = μ + τ1D1 + τ2D2 + τ3D3 + αi + τ1D1 × τ2D2 + τ1D1 × τ3D3 + τ2D2 × τ3D3 + εi

where Yi is the value measured by the i-th examiner; μ the average value across all examiners; τ1D1 the dummy term for imaging modality (D1 = 1 for MDCT; D1 = 0 for CBCT); τ2D2 the dummy term for software (D2 = 1 for Amira; D2 = 0 for Dolphin); τ3D3 the dummy term for subject type (D3 = 1 for skull; D3 = 0 for patient); αi the fixed effect of the i-th examiner; τ1D1 × τ2D2 the interaction of imaging modality and software; τ1D1 × τ3D3 the interaction of imaging modality and subject type; τ2D2 × τ3D3 the interaction of software and subject type; and εi the residual error. Measurement agreement was evaluated using the Wilcoxon signed-rank test, with the significance level set at 0.05.
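The specific ICC form is given in the cited reference24 rather than restated here; one common choice for this design (multiple examiners, absolute agreement) is ICC(2,1). A self-contained sketch of that computation, for illustration only:

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.
    x is an (n subjects) x (k raters/sessions) matrix."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_r = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)  # subjects
    ms_c = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)  # raters
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    ms_e = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Two identical measurement sessions give perfect agreement (ICC = 1)
icc = icc_2_1(np.array([[10.0, 10.0], [12.0, 12.0], [15.0, 15.0]]))
```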
RESULTS
Tables 2 through 6 provide detailed ICC values for intra- and inter-examiner reliability.

Intra-examiner Reliability
In the MDCT × Amira and CBCT × Amira groups, all examiners demonstrated “almost perfect” intra-examiner agreement, except for one MDCT examiner and two CBCT examiners with “substantial” agreement for the variable PNW (Tables 2 and 3). In the MDCT × Dolphin group, most examiners had “almost perfect” intra-examiner agreement, except for the variables ANW, PNW, FMr-l, and MCA, which showed only “substantial” agreement (Table 4). In the CBCT × Dolphin group, most examiners demonstrated “almost perfect” intra-examiner agreement, except for ANW, PNW, and FMr-l. All examiners had “moderate” to “substantial” agreement for ANW (ICC: 0.601–0.770). Three examiners showed “moderate” to “substantial” agreement for PNW and FMr-l (ICCs: 0.628–0.719) (Table 5).



Inter-examiner Reliability
In the MDCT × Amira and CBCT × Amira groups, inter-examiner agreement was “almost perfect” for all measurements, except for PNW, which showed ICCs of 0.769 (MDCT) and 0.793 (CBCT). In the MDCT × Dolphin group, inter-examiner agreement was “substantial” for ANW, PNW, FMr-l, and MCA, while other measurements indicated “almost perfect” agreement. In the CBCT × Dolphin group, inter-examiner agreement was “substantial” for ANW, PNW, FMr-l, and INCr-l, while other measurements indicated “almost perfect” agreement (Table 6).

Comparison of the Measurements Acquired using Amira and Dolphin
Average measurements using Amira and Dolphin on the same CT scans by all examiners are presented in Tables 7 through 9. Significant differences in most variables were observed between the two software programs regardless of the imaging modality (MDCT or CBCT). Volumetric airway measurements showed larger standard deviations with Dolphin, often 10 to 30 times larger (Tables 7 and 8). Approximately half of the variables in the SKULL group exhibited significant differences between Amira and Dolphin measurements (Table 9). In the SKULL group, all standard deviations were within one unit (1 mm or degree). For patient images, Dolphin had standard deviations exceeding one unit for all linear and angular variables, while Amira exceeded one unit only for PNW, ZTUr-l, ZMUr-l, ZMLr-l, L, and Go-Gn.



Regression Model for each Variable
Tables 10 through 13 summarize how variables were affected by imaging modality (MDCT or CBCT), measurement software (Amira or Dolphin), subject type (skull or patient), and specific examiner. Significant factors (P < 0.05) are interpreted through their estimate values. Imaging modality significantly influenced ANFW, FMr-l, INCr-l, ZMLr-l, SNGoGn, NAV, OAV, HAV, and MCA. Software affected ANW, PNW, PNFW, FMr-l, NAV, OAV, HAV, and MCA. Subject type impacted ANW, ANFW, PNW, PNFW, EMW, PW, ZTUr-l, FMr-l, FZr-l, INCr-l, ZMUr-l, ZMLr-l, SNA, Co-A, and Co-Gn, indicating that clinical measurements of these variables deviated from the SKULL gold standard. Only FMr-l and MCA were influenced by the interaction of imaging modality and software, while PNFW was influenced by the interaction of software and subject type. Examiner did not affect variable values. All airway volumes and areas were significantly reduced on MDCT scans except for IAV. For example, the combined estimates indicate that MCA is 57.784 mm² lower (−79.467 + 59.283 − 37.6 = −57.784) when measured with Amira on MDCT images.
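In a dummy-coded model, the predicted shift for a given modality/software combination is the sum of the relevant main-effect and interaction estimates. Verifying the arithmetic of the MCA example quoted above (which of the three estimates maps to which model term is not restated here):

```python
# Estimates quoted in the text for the MDCT-with-Amira MCA example
terms = [-79.467, 59.283, -37.6]
combined = sum(terms)  # net shift in mm^2 relative to the baseline group
```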




DISCUSSION
There has been growing interest in CBCT assessment in maxillofacial and otorhinolaryngological specialties, particularly regarding airway morphology and its relationship with sleep-disordered breathing. Many studies have explored the impact of various treatments on airway dimensions, often reporting statistically significant differences, yet overlooking large standard deviations. This raises concerns about the justification for repeated CBCT exposures for research purposes. Therefore, the current study compared intra- and inter-examiner reliability of MDCT and CBCT using Amira and Dolphin software applications for evaluating craniofacial/airway parameters. This study is unique, being the only one that has compared the reliability of both CT modalities with multiple examiners assessing clinical CT images from patients. Results suggested that the software program may have a more significant influence than the image modality on reliability.
Amira, with powerful 3D visualization and animation functions, is expensive and less user-friendly,25 while Dolphin, favored by orthodontists, lacks airway segmentation control in 2D slices and has noncompatible threshold interval units.26 Amira's HU calibration function, absent in Dolphin, may explain why Dolphin's airway volume measurements exhibited significantly larger standard deviations. The findings of this study were consistent with the conclusion of de Water et al. that Dolphin was not accurate or reliable for airway analysis.27
Zimmerman et al.14 systematically reviewed 42 studies to assess the reliability of CBCT for upper airway evaluation. Notably, previous studies lacked examiner orientation of scanned images and assignment of threshold sensitivity. The current study addressed these limitations by allowing examiners to manually orient images and set sensitivity thresholds. Importantly, examiners were unaware that intra-examiner reliability was being evaluated because duplicated images were randomly coded. The results revealed excellent intra-examiner reliability (ICC > 0.9 per Mattos et al.28) for all airway measurements in the MDCT × Amira (0.992–0.999) and CBCT × Amira (0.993–0.999) groups. Despite an intra-examiner ICC of 0.929 for NAV measured by one examiner (red circle) using Dolphin on CBCT, there was a 1,593 mm³ difference between measurements (9,335 − 7,742 = 1,593 mm³) (Figure 2A). This indicates that simply reporting ICC ratings of “excellent” or “almost perfect” agreement is insufficient to represent clinical reliability. Unexpectedly, some linear measurements of hard tissue on orthogonal planes and 3D surface-rendering images exhibited lower intra- and inter-examiner reliability than airway measurements, even with Amira. For instance, the intra-examiner error for ANW was as much as 4 mm on CBCT (34.5 − 30.5 = 4 mm) using Dolphin (Figure 2B). The inter-examiner differences for PNW were as large as 2.9 mm (34.1 − 31.2 mm) using Amira on CBCT (red circle) and 4.3 mm (37.3 − 33 mm) using Dolphin (green circle) (Figure 2C). Since these measurements are common in airway research, studies claiming treatment-induced craniofacial changes and airway improvements should be interpreted cautiously.
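The point that a high ICC can mask clinically large absolute differences is easy to demonstrate: when between-subject spread dominates, a repeat measurement off by roughly 1,600 mm³ barely lowers the ICC. A hypothetical illustration (only the 7,742/9,335 mm³ pair comes from the text; the other values and the ICC(2,1) form are for demonstration):

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rating."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ms_r = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)
    ms_c = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)
    resid = x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + grand
    ms_e = np.sum(resid ** 2) / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Five airway volumes (mm^3) measured twice; one repeat differs by 1,593 mm^3
first = np.array([7742.0, 15000.0, 22000.0, 30000.0, 38000.0])
second = np.array([9335.0, 15200.0, 21500.0, 30400.0, 37800.0])
icc = icc_2_1(np.column_stack([first, second]))
worst = float(np.max(np.abs(second - first)))
# icc still rates "almost perfect" despite the 1,593 mm^3 discrepancy
```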



Treatment strategies for adult OSA, such as mandibular advancement devices, have shown effectiveness.29 Conventional facial orthopedic treatments have proven effective for pediatric OSA.30 However, the current study highlights the postural effect on upper airway structures/dimensions, with measurements significantly smaller in MDCT scans compared to CBCT scans. Given the postural influence,31 assessing airway dimensions using CBCT in the upright position for predicting supine-position-related sleep apnea disorders requires further validation.
Limitations
Limitations of this study included a small sample size and a retrospective design relying on existing databases. A database search for adult patients who had undergone CBCT and MDCT head and neck scans within a year, excluding those who received treatment affecting craniofacial structures during that period, yielded only five subjects meeting the inclusion criteria. Nonetheless, this limited sample size may reflect adherence to the “as low as reasonably achievable” (ALARA) principle in patient treatment. Additionally, the comparison involved only one MDCT scanner, one CBCT scanner, and two imaging software packages, limiting generalizability to other scanner models and software; the results apply only to the present imaging protocols with the same scanner models. For example, the slice thickness of MDCT in this study was 1.2 mm, while the voxel size of CBCT was 0.25 mm. Because slice thickness was not equal between the two modalities, the accuracy and reliability of the results may have been affected, and the comparatively large MDCT slice thickness may have led to its imaging quality being underestimated.
CONCLUSIONS
Within the limitations of this study, the following conclusions can be drawn:
- Dolphin was deemed unreliable for airway analysis, whereas proper training with software supporting HU calibration, such as Amira, could yield relatively reliable airway measurements.
- Airway volumes were significantly reduced on MDCT scans compared to CBCT, except for IAV.
- The low reliability found for skeletal measurements precludes using CT to correlate craniofacial changes with airway dimensions.

Subject screening flow chart.

Intra-examiner reliability for nasopharyngeal airway volume (A) and anterior nasal width (B). Inter-examiner reliability for posterior nasal width (C). Letters A to F: six examiners. Numbers 1 to 5: individual patients. The letter ‘x’: dry skull. MSCT = MDCT. Black dashed line: regression line.
Contributor Notes
The first two authors contributed equally to this work.