Deep Learning-Assisted Quantitative Measurement of Thoracolumbar Fracture Features on Lateral Radiographs

Article information

Neurospine. 2024;21(1):30-43

Publication date (electronic): 2024 March 31

doi: https://doi.org/10.14245/ns.2347366.683

Woon Tak Yuh¹, Eun Kyung Khil²,³, Yu Sung Yoon⁴, Burnyoung Kim⁵, Hongjun Yoon⁵, Jihe Lim², Kyoung Yeon Lee², Yeong Seo Yoo², Kyeong Deuk An¹

¹Department of Neurosurgery, Hallym University Dongtan Sacred Heart Hospital, Hwaseong, Korea

²Department of Radiology, Hallym University Dongtan Sacred Heart Hospital, Hwaseong, Korea

³Department of Radiology, Fastbone Orthopedic Hospital, Hwaseong, Korea

⁴Department of Radiology, Kyungpook National University Hospital, School of Medicine, Kyungpook National University, Daegu, Korea

⁵DEEPNOID Inc., Seoul, Korea

Corresponding Author Eun Kyung Khil Department of Radiology, Hallym University Dongtan Sacred Heart Hospital, 7 Keunjaebong-gil, Hwaseong 18450, Korea Email: nizzinim@gmail.com

Received 2023 December 24; Revised 2024 January 24; Accepted 2024 February 2.

Abstract

Objective

This study aimed to develop and validate a deep learning (DL) algorithm for the quantitative measurement of thoracolumbar (TL) fracture features, and to evaluate its efficacy across varying levels of clinical expertise.

Methods

Using the pretrained Mask Region-Based Convolutional Neural Networks model, originally developed for vertebral body segmentation and fracture detection, we fine-tuned the model and added a new module for measuring fracture metrics—compression rate (CR), Cobb angle (CA), Gardner angle (GA), and sagittal index (SI)—from lumbar spine lateral radiographs. These metrics were derived from six-point labeling by 3 radiologists, forming the ground truth (GT). Training utilized 1,000 nonfractured and 318 fractured radiographs, while validations employed 213 internal and 200 external fractured radiographs. The accuracy of the DL algorithm in quantifying fracture features was evaluated against GT using the intraclass correlation coefficient. Additionally, 4 readers with varying expertise levels, including trainees and an attending spine surgeon, performed measurements with and without DL assistance, and their results were compared to GT and the DL model.

Results

The DL algorithm demonstrated good to excellent agreement with GT for CR, CA, GA, and SI in both internal (0.860, 0.944, 0.932, and 0.779, respectively) and external (0.836, 0.940, 0.916, and 0.815, respectively) validations. DL-assisted measurements significantly improved most measurement values, particularly for trainees.

Conclusion

The DL algorithm was validated as an accurate tool for quantifying TL fracture features using radiographs. DL-assisted measurement is expected to expedite the diagnostic process and enhance reliability, particularly benefiting less experienced clinicians.

Keywords: Artificial intelligence; Deep learning; Spinal fractures; Spinal injuries; Spinal curvatures; Radiography

INTRODUCTION

Thoracolumbar (TL) fractures often manifest with vertebral deformities as the fractures progress, resulting in decreased vertebral body (VB) height and the development of spinal kyphosis [1-3]. Quantitative measurements of vertebral height, compression rate (CR), and kyphotic angles, primarily conducted using plain radiographs, are essential not only for clinical evaluation but also for surgical reimbursement criteria. Many insurance providers base their coverage decisions on the presence of unstable spinal fractures, for example burst fractures with a kyphotic angle greater than 30°, a compression ratio above 40%, or spinal canal invasion exceeding 50%. Although these measurements are simple, they are repetitive and time-consuming because various measurement methods are in use, and they are subject to significant interobserver variance [4-6].

Deep learning (DL) algorithms, particularly those employing convolutional neural networks (CNNs) for computer vision, have gained significant interest in various medical fields due to their remarkable potential for improving diagnostic accuracy and workflow efficiency [7-12]. In spine research, many studies have addressed the detection of vertebral fractures [13,14] and the segmentation of the VB with sagittal parameter measurement [15-18]. These approaches have demonstrated excellent efficacy, leading to the introduction of several commercially available software solutions for preoperative planning in spinal deformity surgery [19,20].

However, there is still a lack of DL research focused on the quantitative measurement of TL fracture features, which requires precise segmentation of fractured VBs. Current DL models for semantic segmentation, designed for binary classification, perform well on intact VBs, but struggle with fractured VBs, which are often deformed and overlapped with adjacent VBs [14,21]. To address this challenge, it is crucial to employ an appropriate DL algorithm for instance segmentation. Hence, this study aimed to develop and validate a relevant DL algorithm specifically for the quantitative measurement of TL fracture features using plain radiographs.

MATERIALS AND METHODS

This retrospective study was approved by both the Institutional Review Board (No. 2022-08-008) and the Institutional Big Data Research Ethics Committee. Any patient information was anonymized prior to image acquisition, and the requirement for informed consent was waived due to the retrospective nature of the study.

1. Data Collection and Study Population

Two patient cohorts were included in this study. The nonfracture group comprised patients who underwent lumbar spine radiography at our hospital's emergency medicine, orthopedic surgery, and neurosurgery departments between November 2021 and January 2022. The fracture group consisted of patients diagnosed with TL fractures who underwent lumbar spine radiography from January 2015 to January 2022. The exclusion criteria for the nonfracture group included patients with severely impaired image quality, inappropriate field of view (either too wide or too small), presence of spinal surgical instrumentation, limitations in imaging evaluation (e.g., due to vertebroplasty or other surgical factors, presence of artifacts or contrast agents obstructing evaluation), pathological fractures (e.g., due to tumors, infections, inflammation), severe spinal deformities (e.g., severe scoliosis or kyphosis), and congenital spinal anomalies. In the fracture group, we additionally excluded patients with 3 or more consecutive vertebral fractures, or fractures outside the VB (e.g., spinous process or transverse process).

In total, 1,531 cases (1,000 nonfracture cases and 531 fracture cases) were collected. For training, 1,000 nonfracture cases and 318 fracture cases were used, and for internal validation, 213 fractured cases were used. Additionally, for external validation, we collected 200 fracture cases from another tertiary hospital using the same selection criteria. There was an imbalance in the training dataset between nonfractured cases and fractured cases, which reflects the actual prevalence of TL fracture, as fractures are relatively less common compared to nonfractured cases. However, with a large number of nonfractured cases, the model can be further fine-tuned for the 6-point labeling accuracy and segmentation performance.

2. Image Data Acquisition and 6-Point Labeling

Lateral lumbar spine plain radiographs, covering T10 to S5, were obtained in Digital Imaging and Communications in Medicine (DICOM) file format. No special preprocessing was applied to the radiographs before labeling. The original DICOM images were transferred directly to the labeling platform, which allows basic image adjustments such as brightness, contrast, saturation, and hue, along with a magnification feature. Three musculoskeletal radiologists, each with more than a decade of clinical experience, each independently labeled one-third of the dataset. Following their individual labeling work, a final review was conducted by one radiologist to ensure accuracy and consistency. Significant disagreements arising during the final review were discussed thoroughly until a consensus was reached and the decisions were finalized. The results of the final decision were considered to be the ground truth (GT). We adopted a 6-point labeling method for the quantitative measurement of TL fracture features [22,23]. For nonfractured vertebrae, 6 points were marked on each VB (Fig. 1): 4 at the corners and 2 additional points at the midportion of the superior and inferior endplates. These 6 points form 3 pairs: anterior, middle, and posterior. For the fractured vertebrae, the anterior and posterior points were marked in the same manner, while the middle points were placed at the location of smallest height. Lines were drawn between each pair, perpendicular to the endplates, enabling accurate assessment of the anterior, middle, and posterior heights of the VB. Each height was measured as the distance between the respective paired points. For fractured VBs, the smallest of these 3 heights was taken as the representative height. In cases where one of the adjacent VBs could not be measured, the height of the other adjacent VB was used in place of the average.

Fig. 1.

Measurement of the compression rate and kyphotic angles based on 6-point labeling. Automated detection of a fractured vertebral body (VB) with a bounding box is demonstrated. Lines connecting the superior and inferior corner dots define the superior and inferior endplate lines, respectively. (A) Compression rate is calculated from the ratio of the height of the fractured VB to the average height of the adjacent VBs (1–[height of fractured VB/mean height of adjacent VBs]). The blue numbers display the compression rates (%) of the anterior, middle, and posterior parts of the VB. The largest value among these, which corresponds to the smallest height, is selected as the representative compression rate. (B) Cobb angle is measured between the superior endplate of the upper adjacent VB and the inferior endplate of the lower adjacent VB. (C) Gardner angle is measured between the superior endplate of the upper adjacent VB and the inferior endplate of the fractured VB. (D) Sagittal index is calculated as the angle between the superior and inferior endplates of the fractured VB.

Various ambiguous situations in 6-point labeling and the corresponding rules are illustrated in Supplementary Fig. 1. Recognizing the variability in fracture types, particularly the more intricate C and D types, we established labeling rules for these situations to mitigate errors and improve inter- and intraobserver reliability in our measurements. This was critical to ensure reproducibility and consistency, especially since fractures can present as combinations (e.g., A+C or B+D). Supplementary Fig. 2 shows the rules for measuring fracture features in the presence of 2 consecutive VB fractures. When 2 VBs are involved, the CR is calculated using the height of one adjacent unfractured VB. The Cobb angle (CA), Gardner angle (GA), and sagittal index (SI) are measured in the same manner as with other types of fractures.

3. Quantitative Metrics of TL Fracture Features

Various metrics for TL fracture features have been introduced and studied [5,24-26]. Among them, we evaluated the CR, CA, GA, and SI. The CR is calculated by comparing the height of the fractured VB with the mean height of the adjacent upper and lower VBs. The formula was: CR= (1–height of the fractured VB/mean height of the upper and lower adjacent VBs)× 100%. The CA was determined as the angle between the lines of the upper endplate of the upper adjacent VB and the lower endplate of the lower adjacent VB. The GA measures the angle between the upper endplate of the upper adjacent VB and the lower endplate of the fractured VB. The SI was calculated as the angle between the upper and lower endplate lines of the fractured VB (Fig. 1).
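The definitions above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the dictionary layout (`"sup"` and `"inf"` lists ordered anterior, middle, posterior), the function names, and the choice to compute the CR per column and report the largest value are assumptions. Angles are returned unsigned, so no kyphosis or lordosis sign convention is implied.

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def column_heights(vb):
    """Anterior, middle, and posterior heights of one VB.

    `vb` is a hypothetical dict with keys "sup" and "inf", each holding
    three (x, y) points ordered anterior, middle, posterior.
    """
    return [distance(s, i) for s, i in zip(vb["sup"], vb["inf"])]

def compression_rates(fractured, upper=None, lower=None):
    """Per-column CR (%) = (1 - fractured height / mean adjacent height) * 100.

    If only one adjacent VB is measurable, its height replaces the mean.
    """
    adjacent = [column_heights(vb) for vb in (upper, lower) if vb is not None]
    if not adjacent:
        raise ValueError("at least one adjacent VB is required")
    rates = []
    for col, h_frac in enumerate(column_heights(fractured)):
        reference = sum(h[col] for h in adjacent) / len(adjacent)
        rates.append((1.0 - h_frac / reference) * 100.0)
    return rates

def endplate_direction(vb, side):
    """Vector from the posterior corner to the anterior corner of an endplate."""
    anterior, posterior = vb[side][0], vb[side][2]
    return anterior[0] - posterior[0], anterior[1] - posterior[1]

def endplate_angle(vb_a, side_a, vb_b, side_b):
    """Unsigned angle in degrees between two endplate lines."""
    ux, uy = endplate_direction(vb_a, side_a)
    vx, vy = endplate_direction(vb_b, side_b)
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    return math.degrees(math.atan2(abs(cross), dot))

def fracture_metrics(upper, fractured, lower):
    """CR, CA, GA, and SI for a single fractured VB with both neighbors."""
    return {
        "CR": max(compression_rates(fractured, upper, lower)),
        "CA": endplate_angle(upper, "sup", lower, "inf"),
        "GA": endplate_angle(upper, "sup", fractured, "inf"),
        "SI": endplate_angle(fractured, "sup", fractured, "inf"),
    }

# Synthetic example: a wedge-shaped VB between two rectangular neighbors.
upper = {"sup": [(0, 0), (15, 0), (30, 0)], "inf": [(0, 30), (15, 30), (30, 30)]}
fractured = {"sup": [(0, 52), (15, 46), (30, 40)], "inf": [(0, 70), (15, 70), (30, 70)]}
lower = {"sup": [(0, 80), (15, 80), (30, 80)], "inf": [(0, 110), (15, 110), (30, 110)]}
metrics = fracture_metrics(upper, fractured, lower)
```

In this synthetic case the anterior column is the most compressed, so it determines the representative CR, and the SI reflects the wedge between the tilted superior endplate and the flat inferior endplate.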

4. Development of a DL Algorithm

1) DEEP:SPINE

In our previous study, the automatic spinal disease analysis software DEEP:SPINE (DEEPNOID Inc., Seoul, Korea) was developed; this software was originally designed for the detection and segmentation of all visible VBs, including fractured VBs, on spine radiographs [14]. DEEP:SPINE used a Mask Region-based Convolutional Neural Network (Mask R-CNN) model, which is designed specifically for object instance segmentation, implemented with the Detectron2 library [27] and trained via transfer learning [16,28].

During the development of DEEP:SPINE, we compared U-Net-based and Mask R-CNN based models (Supplementary Fig. 3). The Dice similarity coefficient values with standard deviations for U-Net and Mask R-CNN were 0.895 ± 0.052 and 0.939 ± 0.029, respectively, demonstrating that Mask R-CNN consistently outperformed U-Net in VB segmentation. This can be attributed to the different segmentation approaches used: U-Net, which performs semantic segmentation, faces challenges in accurately delineating overlapping or deformed VBs. In contrast, Mask R-CNN, which specializes in instance segmentation, detects regions of interest (ROI) prior to segmentation, thereby offering improved precision. For these reasons, we selected the Mask R-CNN model for the DEEP:SPINE software due to its suitability for our specific segmentation tasks.
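For readers unfamiliar with the Dice similarity coefficient used in this comparison, a minimal sketch follows. It assumes binary masks stored as nested lists of 0 and 1; the function name and the convention of returning 1.0 when both masks are empty are illustrative choices, not details taken from the study.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for binary masks."""
    intersection = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            truth_area += 1 if t else 0
    if pred_area + truth_area == 0:
        return 1.0  # assumption: two empty masks count as perfect overlap
    return 2.0 * intersection / (pred_area + truth_area)

# Toy 2 x 4 masks: the prediction misses one foreground pixel.
predicted = [[0, 1, 1, 0],
             [0, 1, 1, 0]]
reference = [[0, 1, 1, 1],
             [0, 1, 1, 0]]
score = dice_coefficient(predicted, reference)
```

A value of 1 indicates identical masks and 0 indicates no overlap, so higher values correspond to better segmentation.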

In this study, we further fine-tuned the pretrained DEEP:SPINE model to improve its suitability for fractured VB segmentation [27]. Additionally, we developed new modules to quantitatively measure fracture features, thereby establishing a novel 2-step DL algorithm that integrates the fine-tuned DEEP:SPINE framework with the newly developed modules. The process of deriving quantitative TL fracture features consisted of 3 stages, as outlined below (Fig. 2).

Fig. 2.

Sequential pipeline for deep learning-based quantitative analysis of TL fracture features. The diagram outlines the step-by-step workflow of the developed deep learning-based algorithm for analyzing quantitative features related to TL fractures. The pipeline includes data preprocessing, Mask R-CNN based vertebral body segmentation, and subsequent stages focused on measuring CR and kyphotic angles. RPN, Region Proposal Network; TL, thoracolumbar; Mask R-CNN, Mask Region-Based Convolutional Neural Networks; CR, compression rate.

(1) Stage 1: Preprocessing

The initial stage involved the preprocessing of L-spine radiography images to enhance their quality and prepare them for subsequent analysis. This stage encompassed the following key steps.

- Normalization: Radiography images underwent a normalization process to ensure consistent intensity levels across all images.
- Contrast enhancement: We applied the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm to enhance image quality and improve the visibility of critical features. CLAHE is one of the most popular image processing techniques for enhancing the quality of radiographs, especially for spine or bone segmentation [29]. Although its computational cost is a concern, the accuracy gained from the clearer resulting image justifies this cost. A clip limit of 100 and a tile grid size of (8,8) were used as the parameters, as suggested by Pan et al. [30] to optimize performance.
- Resizing: Images were resized to a standardized dimension of 512 by 512 pixels, while maintaining the original image’s aspect ratio through zero-padding.
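The normalization and resizing steps can be sketched as follows. This is a geometry-only illustration under stated assumptions: min-max scaling is one possible normalization (the exact method is not specified here), the padding is assumed to be centered, and all function names are hypothetical. CLAHE is not reimplemented in this sketch.

```python
def min_max_normalize(pixels):
    """Scale a flat list of intensities to [0, 1] (assumed normalization)."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

def letterbox_params(width, height, target=512):
    """Scale factor and zero-padding needed to fit an image into a
    target x target square while preserving its aspect ratio."""
    scale = target / max(width, height)
    new_w = round(width * scale)
    new_h = round(height * scale)
    pad_x = (target - new_w) // 2  # assumption: padding is centered
    pad_y = (target - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y

def to_model_coords(point, scale, pad_x, pad_y):
    """Map an (x, y) label from the original image into the padded square."""
    return point[0] * scale + pad_x, point[1] * scale + pad_y

def to_image_coords(point, scale, pad_x, pad_y):
    """Map a predicted (x, y) point back to original image coordinates."""
    return (point[0] - pad_x) / scale, (point[1] - pad_y) / scale

# Example: a 2000 x 3000 pixel lateral radiograph (hypothetical size).
scale, new_w, new_h, pad_x, pad_y = letterbox_params(2000, 3000)
in_square = to_model_coords((1000, 1500), scale, pad_x, pad_y)
restored = to_image_coords(in_square, scale, pad_x, pad_y)
```

Keeping the scale and padding values makes it possible to map predicted points back to the original radiograph, which matters when heights and angles are reported in the original image geometry.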

(2) Stage 2: VB segmentation

In the second stage of our methodology, automated VB segmentation was performed on the preprocessed radiography images using Mask R-CNN. This CNN architecture comprises 3 crucial components: a backbone network responsible for feature extraction, a Region Proposal Network that generates candidate object regions, and prediction heads that handle object classification, bounding box regression, and instance mask prediction. To train this model, mask images representing hexagonal regions corresponding to the 6-point labels assigned to VBs were utilized. The model was trained with an Adam optimizer with an initial learning rate of 0.001, adjusted by the Warmup-Cosine learning rate scheduler. We employed a configuration with a maximum of 128 ROIs per image, conducted training for 100 epochs, and used a batch size of 32 images. Therefore, considering the total dataset size (1,318), batch size (32), and number of epochs (100), the training phase involved a total of approximately 4,200 iterations (42 iterations per epoch). Furthermore, we applied random augmentations: horizontal flips, rotation from -10° to 10°, brightness adjustment in the range of 0.5 to 2, and contrast adjustment in the range of 0.9 to 1.1. During the training phase, a multitask loss function was employed, encompassing object classification loss (log loss), bounding box regression loss (smooth L1 loss), and segmentation loss (binary cross-entropy loss). The equations for the total loss and the individual losses are as follows:

(1) $L_{Total} = L_{Classification} + L_{Bbox} + L_{Mask}$

where L represents loss function.

(2) $L_{Classification} = L_{Cross\text{-}entropy} = -\sum_{c=1}^{C} y_c \log p_c$

where C represents number of classes, y_c represents the binary indicator (0 or 1) of whether class label c is the correct classification for the prediction, and p_c is the predicted probability of the observation being of class c.

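(3) $L_{Bbox} = L_{Smooth\ L1} = \sum_{n=1}^{N} l_n$, where $l_n = \begin{cases} \frac{1}{2}(x_n - y_n)^2 / \beta, & \text{if } |x_n - y_n| < \beta \\ |x_n - y_n| - \frac{1}{2}\beta, & \text{otherwise} \end{cases}$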

where x_n represents the predicted value, y_n represents the GT value, and N represents the total number of elements in the output. β is the arbitrary threshold parameter that determines the point at which the loss transitions from a quadratic to linear function. We set 1.0 as the value for β which is the default and most generally used value for the term.

(4) $L_{Mask} = L_{Binary\ Cross\text{-}entropy} = -\left[ n \log m + (1 - n) \log (1 - m) \right]$

where m is the predicted probability of the positive class, and n is the binary indicator (0 or 1) of whether the GT is the positive class.
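The four equations can be checked numerically with a short, dependency-free sketch. The function names, the probability clamping constant `eps`, and the averaging of the mask loss over pixels are assumptions made for illustration; the training itself relied on the framework's built-in loss implementations.

```python
import math

def cross_entropy(y, p):
    """Equation (2): y is a one-hot list, p the predicted class probabilities."""
    return -sum(yc * math.log(pc) for yc, pc in zip(y, p) if yc)

def smooth_l1(x, y, beta=1.0):
    """Equation (3): quadratic below beta, linear above it, summed over elements."""
    total = 0.0
    for xn, yn in zip(x, y):
        diff = abs(xn - yn)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total

def binary_cross_entropy(n, m, eps=1e-12):
    """Equation (4) for one pixel: n is the GT label, m the predicted probability."""
    m = min(max(m, eps), 1.0 - eps)  # clamp to avoid log(0); an added safeguard
    return -(n * math.log(m) + (1 - n) * math.log(1.0 - m))

def mask_loss(targets, probs):
    """Mean per-pixel binary cross-entropy (averaging is an assumption)."""
    return sum(binary_cross_entropy(n, m) for n, m in zip(targets, probs)) / len(targets)

def total_loss(l_classification, l_bbox, l_mask):
    """Equation (1): unweighted sum of the three task losses."""
    return l_classification + l_bbox + l_mask

# Toy values for one region of interest.
l_cls = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
l_box = smooth_l1([0.5, 3.0], [0.0, 0.0])   # one small and one large error
l_msk = mask_loss([1, 0], [0.9, 0.1])
loss = total_loss(l_cls, l_box, l_msk)
```

The bounding box example shows the role of β: the error of 0.5 falls in the quadratic region, while the error of 3.0 is penalized linearly.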

The classification loss (L_{Classification}) penalizes incorrect class predictions, the bounding box loss (L_{Bbox}) penalizes errors in the box coordinates, and the mask loss (L_{Mask}) penalizes inaccurate predictions in the pixel-level segmentation of each object. This combined loss function guided the model toward accurate object classification, precise bounding box refinement, and high-fidelity instance masks. For model training, the Detectron2 framework (https://github.com/facebookresearch/detectron2) [27], built upon the PyTorch framework, was utilized. Moreover, we used the publicly available pretrained weights provided in the Detectron2 package, which have been trained on the popular COCO instance segmentation dataset. The training process was executed on a single dedicated GPU server equipped with 512 GB of RAM, an Intel Xeon E5-2640 v4 CPU, and 8 NVIDIA TITAN Xp GPUs. These hardware and software resources collectively provided the computational capacity required for training our Mask R-CNN model effectively.
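To make the training schedule in this stage concrete, the sketch below derives an iteration count from the dataset size, batch size, and number of epochs, and evaluates one common formulation of a warmup-cosine learning rate. The warmup length (`warmup_iters`) and starting factor (`warmup_factor`) are not reported in this study and are placeholder assumptions, as are the function names; counting a partial final batch as a full iteration is also an assumption.

```python
import math

def total_iterations(dataset_size, batch_size, epochs):
    """Iterations per epoch (a partial last batch counts) times epochs."""
    return math.ceil(dataset_size / batch_size) * epochs

def warmup_cosine_lr(iteration, max_iters, base_lr=0.001,
                     warmup_iters=100, warmup_factor=0.001):
    """Linear warmup multiplied by a cosine decay from base_lr toward 0.

    warmup_iters and warmup_factor are hypothetical values for illustration.
    """
    cosine = 0.5 * (1.0 + math.cos(math.pi * iteration / max_iters))
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        warmup = warmup_factor * (1.0 - alpha) + alpha
    else:
        warmup = 1.0
    return base_lr * warmup * cosine

# Settings stated in the text: 1,318 images, batch size 32, 100 epochs.
max_iters = total_iterations(1318, 32, 100)
lr_start = warmup_cosine_lr(0, max_iters)
lr_middle = warmup_cosine_lr(max_iters // 2, max_iters)
lr_end = warmup_cosine_lr(max_iters, max_iters)
```

The learning rate starts near zero, rises toward the base rate of 0.001 during warmup, and then decays smoothly to zero by the final iteration.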

(3) Stage 3: Quantitative measurement of TL fracture features

In the final stage, patches corresponding to each segmented VB were extracted from the predicted mask images generated by the segmentation model. Within each selected subject VB, the labeling data of 6 points were extracted, forming the basis for measuring CR, CA, GA, and SI. This comprehensive approach constituted the fundamental process of our measurement methodology. This automated process utilizes image preprocessing, DL techniques, and statistical algorithms for accurate measurements.

5. Evaluation of the DL Algorithm Performance

1) Agreement of the DL algorithm with the GT

Since the measurements do not have a specific correct answer, consistency with the GT, or the closest value to it, indicates high performance. Four key metrics (CR, CA, GA, and SI) were calculated for both the internal and external validation sets, and the agreement between the GT and the DL algorithm was then assessed for each metric (Fig. 3).

Fig. 3.

Comparison of compression rate with 6-point labeling on thoracolumbar spinal radiographs by ground truth (GT) and a deep learning (DL) algorithm. Every visible vertebral body (VB) is marked with 6 points indicating the anterior, middle, and posterior columns. (A) Simple compression fracture. GT and DL labels nearly identical. (B) Failure of DL to place the middle upper point at the lowest site of the VB. (C) Incorrect placement of the middle pair by DL, suggesting difficulty in interpreting the 3-dimensional structure of a fractured VB from a nontrue lateral view 2-dimensional radiograph. The blue numbers indicate the compression rates (%) for the anterior, middle, and posterior parts of the VB.

2) Performance of different levels of readers without DL algorithm assistance

Four readers with varying clinical experience from our institution participated in this evaluation: a second-year neurosurgery trainee, second- and fourth-year radiology trainees, and an attending spine neurosurgeon with 7 years of experience after neurosurgery board certification. Each reader performed 6-point labeling on both the internal and the external validation datasets. Four metrics were derived from these labels. Agreement with the GT for each metric was obtained for each reader.

3) Performance of readers with DL algorithm assistance

A month after the initial labeling session, the 4 readers revisited the labeling task, this time with the assistance of the DL algorithm, on the same internal and external datasets. In this second round, the readers adjusted 6 preidentified points suggested by the DL algorithm, refining their positions as necessary.

6. Statistical Analysis

The demographic characteristics of both the nonfracture and fracture groups were analyzed using descriptive statistics. Since this task involves quantitative measurements without a definitive correct answer, we considered the labeling results of the experienced radiologist as the reference (i.e., the GT), and compared the results of the DL algorithm with the GT. The agreement between the GT and the DL algorithm was analyzed using the intraclass correlation coefficient (ICC): values of less than 0.5, between 0.5 and 0.75, between 0.75 and 0.90, and greater than 0.90 indicate poor, moderate, good, and excellent reliability, respectively [31]. Paired t-tests were used to compare values between measurements without and with DL assistance for each reader, to investigate the impact of the DL assistance. All the statistical analyses were carried out using R software (ver. 4.0.5, R Foundation for Statistical Computing, Vienna, Austria).
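As an illustration of the agreement analysis, the sketch below computes an ICC from a table of paired measurements and maps it to the reliability categories listed above. The ICC form is not specified in this section, so the two-way random-effects, absolute-agreement, single-measure form, ICC(2,1), is assumed; the handling of values exactly on a category boundary and the function names are also assumptions. The study's own analysis was performed in R.

```python
def icc_2_1(ratings):
    """ICC(2,1) from an n-subjects x k-raters table (list of rows)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)               # between-subject mean square
    ms_cols = ss_cols / (k - 1)               # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1)) # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

def interpret_icc(value):
    """Reliability category; boundary handling is an assumption."""
    if value < 0.5:
        return "poor"
    if value < 0.75:
        return "moderate"
    if value <= 0.9:
        return "good"
    return "excellent"

# Classic 6-subject, 4-rater teaching example (not data from this study).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(example)
category = interpret_icc(icc)
```

In the setting described here, each row would hold the GT value and the DL (or reader) value of one metric for one radiograph, giving a table with two columns.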

RESULTS

1. Demographics and Baseline Characteristics

Table 1 summarizes the baseline characteristics and fracture distributions of a total of 1,731 patients from the internal and external datasets. Within the internal dataset, the nonfracture group consisted of 522 males (52.2%) and 478 females (47.8%) with a mean age of 49.4 years (range, 19–89 years; standard deviation [SD], 16.2), whereas the fracture group included 162 males (30.5%) and 369 females (69.5%) with a mean age of 68.3 years (range, 19–101 years; SD, 14.5). The external dataset included 200 fracture cases, for which the gender ratio was 111 males (55.5%) to 89 females (44.5%). There were statistically significant demographic differences between the internal and external validation datasets, in terms of mean age and gender distribution (both with p< 0.01). The internal set, reflecting the training dataset, included a higher proportion of older females, while the external set included a greater number of younger males, with a lower incidence of degenerative and osteoporotic changes. Across all datasets, the most frequent sites of fractures were observed at L1, T12, and L2 levels, accounting for approximately 70% of cases in each test set.

Table 1.

Patient demographics and baseline characteristics

2. Evaluation of the DL Algorithm Performance

1) Agreement between the GT and the DL algorithm

The DL algorithm demonstrated good to excellent agreement with the GT. In the internal validation, ICC values for CR, CA, GA, and SI were 0.860, 0.944, 0.932, and 0.779, respectively. In the external validation, the corresponding values were 0.836, 0.940, 0.916, and 0.815, indicating consistent and reliable results (Table 2, Fig. 4).

Table 2.

Performance of readers with or without DL assistance

Fig. 4.

Intraclass correlation coefficient plots between the deep learning algorithm and ground truth. ICC, intraclass correlation coefficient; CR, compression rate; CA, Cobb angle; GA, Gardner angle; SI, sagittal index; GT, ground truth; DL, deep learning.

2) Agreement between the GT and readers without DL algorithm assistance

The readers exhibited good to excellent agreement with the GT across all the metrics. ICC values for all the measurements ranked from highest to lowest in the order of the attending spine surgeon, fourth-year radiology trainee, second-year radiology trainee, and second-year neurosurgery trainee (Table 2). Compared with the DL algorithm, the average ICC values of the readers were superior for CR, comparable for CA and GA, and slightly lower for SI in both validation sets (Fig. 5).

Fig. 5.

Comparison of performance among the 4 readers for CR, CA, GA, and SI, without and with the DL assistance. An asterisk denotes a significant difference with p<0.05 by paired t-test. NS R2, neurosurgery second-year resident; Rad R2, radiology second-year resident; Rad R4, radiology fourth-year resident; DL, deep learning; ICC, intraclass correlation coefficient; CR, compression rate; CA, Cobb angle; GA, Gardner angle; SI, sagittal index; GT, ground truth.

3) Agreement between the GT and readers with DL algorithm assistance

The ICC values indicated good to excellent agreement with the GT. Paired t-test analysis revealed statistically significant improvements in most metrics when utilizing the DL algorithm compared to without its assistance (Fig. 5). Notably, the trainees showed more substantial improvements compared to the attending spine surgeon (Table 2).

DISCUSSION

The key findings of this study reveal that the DL algorithm, employing the Mask R-CNN model, successfully measured the various quantitative metrics of TL fracture. The results exhibited good to excellent agreement with the assessments of an experienced musculoskeletal radiologist. Furthermore, DL-assisted measurements showed an improvement for radiology and neurosurgery trainees compared to their assessments without DL assistance.

CNN models, key algorithms in computer vision, have been increasingly applied in spinal image analysis. Li et al. [13] utilized DL techniques to detect osteoporotic lumbar vertebral fractures in spinal radiographs. Similarly, Galbusera et al. [17] developed a fully automated method for measuring spinal parameters using a CNN model. Cho et al. [15] also demonstrated the effective automated measurement of lumbar lordosis through VB segmentation using the U-Net model with radiography images. The U-Net, a widely used CNN architecture in medical imaging analysis, is designed for efficient binary semantic segmentation with small biomedical image datasets [32]. However, the U-Net encounters limitations in tasks requiring the detection and instance segmentation of fractured and normal adjacent vertebrae, as binary segmentation may incorrectly classify overlapping vertebrae as a single entity, particularly in radiographs where such overlap is common. To address this issue, our study employed Mask R-CNN, which distinguishes individual objects within the same class [28].

R-CNN was originally developed for object detection [33]. This evolved into Fast R-CNN [34] and then Faster R-CNN [35,36], and finally Mask R-CNN was introduced in 2017 [28]. Mask R-CNN represents a significant advancement in computer vision and is specifically designed for object instance segmentation, that is, the identification and differentiation of individual instances of objects within an image. This is achieved through the simultaneous prediction of bounding boxes, class labels, and pixel-wise segmentation masks. Mask R-CNN has been widely utilized across various domains including medical image analysis. In spinal image analysis research, Kim et al. [16] reported the successful use of Mask R-CNN for measurement of sagittal radiographic parameters with automatic vertebral segmentation for nonfractured vertebrae without deformity. Additionally, Kónya et al. [21] compared the segmentation performance of various semantic and instance segmentation algorithms, including the Mask R-CNN model, demonstrating its effectiveness in segmenting spine radiography images, even with implants such as screws and cement. However, that study lacked validation to assess its clinical effectiveness. Our study emphasizes the need for precise detection of the vertebral contours, particularly the endplates, to accurately define the boundaries of the intervertebral spaces. Moreover, we applied transfer learning from a pretrained Mask R-CNN model, which had been trained for VB segmentation tasks on a diverse range of datasets, encompassing spinal radiographs of both nonfractured and fractured patients. This pretraining phase equipped the model with a comprehensive understanding of VB features, enabling it to generalize to our target segmentation task. This transfer learning strategy also improves VB segmentation accuracy with relatively small datasets, particularly in scenarios where labeled data are limited.

While the 6-point placement has been a well-established method for semiquantitative evaluation of VB fractures [22,23], our study stands out as the first to apply the 6-point labeling method to automated measurements through the DL algorithm. Previous research has often employed a 4-point labeling technique, which involves marking the 4 corners of the VB, to measure various coronal angles [18,37,38] and sagittal parameters [15,16,18]. However, the limitations of the 4-point labeling method become apparent in accurately measuring CR, as the lowest height of a fractured VB is often located at the midpoint. Kim et al. [14] employed an 8-point labeling method and successfully detected and segmented fractured lumbar vertebrae using radiographs; however, they did not extend their model to the quantitative measurement of TL fracture features. In contrast, our study applied the 6-point labeling method and has successfully developed a DL model that accurately measures both CR and various kyphotic angles.

Additionally, our findings revealed that clinicians with different experience levels improved their diagnostic performance by using the DL assistance, demonstrating its practical clinical applicability through a comprehensive evaluation. To date, this is the first study regarding automated measurement of TL fracture features using a DL algorithm. Our results showed several interesting findings. Firstly, the DL algorithm showed lower agreement with the GT than all readers for measuring CR. Despite training with GT annotations for labeling the middle pair of points, the DL algorithm still showed discrepancies compared to the GT. This limitation highlights a critical area where the algorithm falls short, as it struggles to accurately infer the actual 3-dimensional structure from 2-dimensional radiographs, a task in which human doctors excel. Conversely, regarding the various kyphotic angle measurements, the performance of the DL algorithm mostly surpassed that of the readers. This could be due to the relative straightforwardness of labeling anterior and posterior pairs for calculating kyphotic angles, where the DL algorithm was more precise and consistent. Secondly, the accuracy of the measurements mostly followed the order of the attending spine surgeon, the DL algorithm, the fourth-year radiology trainee, the second-year radiology trainee, and the second-year neurosurgery trainee. This ranking underscores the importance of individual clinical experience in the accuracy of measurement, positioning the accuracy of the DL algorithm between that of an attending surgeon and that of trainees. Lastly, while the accuracy of the trainees improved with the assistance of the DL algorithm, the attending surgeon did not exhibit increased accuracy in kyphotic angle measurements. These findings suggest that augmentation with DL may not always be beneficial, particularly for those who already outperform the DL algorithm.

Another aspect worth discussing is the demographic disparity between the internal and external validation cohorts, which differed significantly in mean age and sex distribution. The external validation set included a greater proportion of younger males, resulting in a lower frequency of degenerative changes or osteoporotic vertebral fractures (OVFs), along with more high-energy fractures. Such fractures typically show clearer margins due to higher bone density in younger patients, which might have contributed to marginally better algorithm performance in the external validation. Nevertheless, our model was designed to be applicable to various types of TL fractures and was not limited to OVFs. Its effective performance on the external dataset demonstrates the robustness and generalizability of our model.

This study has several limitations. Firstly, the dataset was small and imbalanced because of the difficulty of collecting fracture data in a single-center study. However, by implementing transfer learning and increasing the number of nonfractured samples, we demonstrated robust model performance. Future studies with larger and more balanced datasets could further enhance the performance of the DL model. Secondly, the reader study included only 4 clinicians, which may not represent the diversity of clinical expertise. Subsequent studies could benefit from a more uniform group of readers, particularly from a single department and with consistent experience levels, to provide more controlled validation conditions and potentially greater reliability of the results. Thirdly, since the fracture type and condition of the VBs can significantly affect the accuracy of the DL model, a subgroup analysis would be valuable. However, because the subgroups overlapped considerably, it was difficult to classify fractures into definite categories and perform such an analysis. Moreover, while the presence of a fracture has a definitive answer, a definitive reference standard for TL fracture features is lacking. Thus, we relied solely on the experienced radiologist’s GT for comparison, and our evaluation was limited to ICC values. Lastly, this study did not compare the time taken to perform measurements with and without DL assistance, an aspect that could provide insights into efficiency or cost-effectiveness.
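
For readers unfamiliar with the ICC, the following minimal Python sketch computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single measurement). The choice of this form and the example values are assumptions for illustration; this section does not state which ICC model was used in the study.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per subject and one column per rater.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical angle measurements (degrees): column 0 = GT, column 1 = a
# second rater with a constant +2 degree offset.
gt_vs_rater = [[10.0, 12.0], [20.0, 22.0], [30.0, 32.0]]
icc = icc_2_1(gt_vs_rater)
```

In this hypothetical example the second rater tracks the GT perfectly apart from a constant offset, yet the absolute-agreement ICC falls slightly below 1, which illustrates that the ICC summarizes agreement without describing the size or direction of individual measurement errors.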

CONCLUSION

We developed and validated a DL algorithm as an automated assisting tool for the quantitative measurement of TL fracture features. The reliability of the algorithm has been validated, and the results are comparable to those of various clinicians. Integrating this DL algorithm into the clinical workflow is anticipated to expedite the diagnostic process and enhance the reliability of diagnoses, offering substantial benefits, particularly to trainees in clinical practice.

Supplementary Material

Supplementary Figs. 1-3 can be found via https://doi.org/10.14245/ns.2347366.683.

Supplementary Fig. 1.

(A) Six-point labeling in nonfractured vertebral body (VB) with true lateral view radiograph. (B) For fractured vertebrae, while the corner points remain unchanged, the middle pair points are placed at the site of the most pronounced height reduction, considered in a 3-dimensional (3D) context. (C) In instances where the lateral radiograph is not a true lateral view, and the superior and inferior endplates are misaligned, we inferred the 3D anatomy of the VB and labeled the middle points for the appropriate VB height. The green dots represent the most appropriate labeling methods, but red or blue dots can serve as alternative methods. (D) For VBs with significant osteophytes or protruding fracture fragments, the anterior and posterior pairs of points are marked, conserving the original, undisturbed shape of the VB.

ns-2347366-683-Supplementary-Fig-1.pdf

Supplementary Fig. 2.

Measurement methods for 2 consecutive vertebral body fractures. CR, compression rate; CA, Cobb angle; GA, Gardner angle; SI, sagittal index; VH, vertebral height.

ns-2347366-683-Supplementary-Fig-2.pdf

Supplementary Fig. 3.

Comparison of the semantic segmentation model (U-Net) and the instance segmentation model (Mask R-CNN). (A) Original image of the lateral spine radiograph. (B) Ground truth (GT) segmentation by an experienced radiologist using 6-point labeling. (C) Semantic segmentation by U-Net, which failed to segment the overlapped T11 and T12 vertebral bodies (VBs) independently. The segmentation also conformed too closely to osteophytes and deformed VBs. (D) Instance segmentation by Mask R-CNN, with bounding boxes and contours in various colors, closely aligned with the GT and outperforming U-Net.

ns-2347366-683-Supplementary-Fig-3.pdf

Notes

Conflict of Interest

The authors have nothing to disclose.

Funding/Support

This study received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author Contribution

Conceptualization: EKK; Formal Analysis: BK, HY; Investigation: WTY, EKK, YSY, JL, KYL, YSY, KDA; Methodology: WTY, EKK, BK, HY; Project Administration: EKK; Writing – Original Draft: WTY, EKK; Writing – Review & Editing: WTY, EKK, BK, HY.

Acknowledgements

This manuscript was presented at the 5th Biospine Korea Annual Meeting on December 2, 2023. We would like to express our gratitude to researchers Jiyun Lee and DaHae Park (Hallym University Dongtan Sacred Heart Hospital) for their assistance in labeling imaging data. Additionally, we are thankful to Professor Il Choi, from the Department of Neurosurgery at Hallym University Dongtan Sacred Heart Hospital, for inspiring this research.

References

1. Mohamadi A, Googanian A, Ahmadi A, et al. Comparison of surgical or nonsurgical treatment outcomes in patients with thoracolumbar fracture with Score 4 of TLICS: a randomized, single-blind, and single-central clinical trial. Medicine (Baltimore) 2018;97:e9842.

2. Shen J, Xu L, Zhang B, et al. Risk factors for the failure of spinal burst fractures treated conservatively according to the thoracolumbar injury classification and severity score (TLICS): a retrospective cohort trial. PLoS One 2015;10:e0135735.

3. Alimohammadi E, Bagheri SR, Ahadi P, et al. Predictors of the failure of conservative treatment in patients with a thoracolumbar burst fracture. J Orthop Surg Res 2020;15:514.

4. Sadiqi S, Verlaan JJ, Lehr AM, et al. Measurement of kyphosis and vertebral body height loss in traumatic spine fractures: an international study. Eur Spine J 2017;26:1483–91.

5. Ruiz Santiago F, Tomas Munoz P, Moya Sanchez E, et al. Classifying thoracolumbar fractures: role of quantitative imaging. Quant Imaging Med Surg 2016;6:772–84.

6. Street J, Lenehan B, Albietz J, et al. Intraobserver and interobserver reliabilty of measures of kyphosis in thoracolumbar fractures. Spine J 2009;9:464–9.

7. Kim H, Yoon H, Thakur N, et al. Deep learning-based histopathological segmentation for whole slide images of colorectal cancer in a compressed domain. Sci Rep 2021;11:22520.

8. Mutasa S, Varada S, Goel A, et al. Advanced deep learning techniques applied to automated femoral neck fracture detection and classification. J Digit Imaging 2020;33:1209–17.

9. Jafari Z, Karami E. Breast cancer detection in mammography images: a CNN-based approach with feature selection. Information 2023;14:410.

10. Kim M, Yun J, Cho Y, et al. Deep learning in medical imaging. Neurospine 2019;16:657–68.

11. Chung SW, Han SS, Lee JW, et al. Automated detection and classification of the proximal humerus fracture by using deep learning algorithm. Acta Orthop 2018;89:468–73.

12. Jones RM, Sharma A, Hotchkiss R, et al. Assessment of a deep-learning system for fracture detection in musculoskeletal radiographs. NPJ Digit Med 2020;3:144.

13. Li YC, Chen HH, Horng-Shing Lu H, et al. Can a deep-learning model for the automated detection of vertebral fractures approach the performance level of human subspecialists? Clin Orthop Relat Res 2021;479:1598–612.

14. Kim KC, Cho HC, Jang TJ, et al. Automatic detection and segmentation of lumbar vertebrae from X-ray images for compression fracture evaluation. Comput Methods Programs Biomed 2021;200:105833.

15. Cho BH, Kaji D, Cheung ZB, et al. Automated measurement of lumbar lordosis on radiographs using machine learning and computer vision. Global Spine J 2020;10:611–8.

16. Kim YT, Jeong TS, Kim YJ, et al. Automatic spine segmentation and parameter measurement for radiological analysis of whole-spine lateral radiographs using deep learning and computer vision. J Digit Imaging 2023;36:1447–59.

17. Galbusera F, Niemeyer F, Wilke HJ, et al. Fully automated radiological analysis of spinal disorders and deformities: a deep learning approach. Eur Spine J 2019;28:951–60.

18. Cina A, Bassani T, Panico M, et al. 2-step deep learning model for landmarks localization in spine radiographs. Sci Rep 2021;11:9482.

19. Langella F, Villafañe JH, Damilano M, et al. Predictive accuracy of surgimap surgical planning for sagittal imbalance: a cohort study. Spine (Phila Pa 1976) 2017;42:E1297–304.

20. Ou-Yang D, Burger EL, Kleck CJ. Pre-operative planning in complex deformities and use of patient-specific UNiD(™) instrumentation. Global Spine J 2022;12(2_suppl):40S–44S.

21. Kónya S, Natarajan TS, Allouch H, et al. Convolutional neural network-based automated segmentation and labeling of the lumbar spine X-ray. J Craniovertebr Junction Spine 2021;12:136–43.

22. Genant HK, Wu CY, van Kuijk C, et al. Vertebral fracture assessment using a semiquantitative technique. J Bone Miner Res 1993;8:1137–48.

23. Guglielmi G, Diacinti D, van Kuijk C, et al. Vertebral morphometry: current methods and recent advances. Eur Radiol 2008;18:1484–96.

24. Steib JP, Aoui M, Mitulescu A, et al. Thoracolumbar fractures surgically treated by “in situ contouring”. Eur Spine J 2006;15:1823–32.

25. Kuklo TR, Polly DW, Owens BD, et al. Measurement of thoracic and lumbar fracture kyphosis: evaluation of intraobserver, interobserver, and technique variability. Spine (Phila Pa 1976) 2001;26:61–5. discussion 66.

26. Jiang SD, Wu QZ, Lan SH, et al. Reliability of the measurement of thoracolumbar burst fracture kyphosis with Cobb angle, Gardner angle, and sagittal index. Arch Orthop Trauma Surg 2012;132:221–5.

27. Wu Y, Kirillov A, Massa F, et al. Detectron2 [Software]. 2019 [cited 2023 Dec 24]. Available from https://github.com/facebookresearch/detectron2.

28. He K, Gkioxari G, Dollar P, et al. Mask R-CNN. IEEE Trans Pattern Anal Mach Intell 2020;42:386–97.

29. Kim DH, Jeong JG, Kim YJ, et al. Automated vertebral segmentation and measurement of vertebral compression ratio based on deep learning in x-ray images. J Digit Imaging 2021;34:853–61.

30. Pan Y, Chen Q, Chen T, et al. Evaluation of a computer-aided method for measuring the Cobb angle on chest X-rays. Eur Spine J 2019;28:3035–43.

31. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med 2016;15:155–63.

32. Ronneberger O, Fischer P, Brox T. U-net: convolutional networks for biomedical image segmentation. Paper presented at: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference; 2015 Oct 5-9; Munich, Germany. Proceedings, Part III.

33. Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation. Paper presented at: 2014 IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23-28; Columbus (OH), USA.

34. Girshick R. Fast r-cnn. Paper presented at: 2015 IEEE International Conference on Computer Vision (ICCV); 2015 Dec 7-13; Santiago, Chile.

35. Ren S, He K, Girshick R, Sun J. Faster r-cnn: towards real-time object detection with region proposal networks. arXiv 1506.01497. [Preprint]. 2015. Available from: https://doi.org/10.48550/arXiv.1506.01497.

36. Ruhan S, Owens W, Wiegand R, et al. Intervertebral disc detection in X-ray images using faster R-CNN. Annu Int Conf IEEE Eng Med Biol Soc 2017;2017:564–7.

37. Wu H, Bailey C, Rasoulinejad P, et al. Automated comprehensive adolescent idiopathic scoliosis assessment using MVC-Net. Med Image Anal 2018;48:1–11.

38. Yi J, Wu P, Huang Q, et al. Vertebra-focused landmark detection for scoliosis assessment. Paper presented at: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI); 2020 Apr 3-7; Iowa City (IA), USA.

This is an open access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Characteristic	Internal training (nonfracture group)	Internal training (fracture group)	Internal validation	External dataset	p-value†
No.	1,000	318	213	200
Age (yr)	49.4 ± 16.2	69.4 ± 14.1	66.5 ± 15.1	53.1 ± 10.9	< 0.001
Sex					< 0.001
Male	522 (52.2)	87 (27.4)	75 (35.2)	111 (55.5)
Female	478 (47.8)	231 (72.6)	138 (64.8)	89 (44.5)
Fracture level
Total	NA	376	261	250
T11	-	19 (5.1)	14 (5.4)	12 (4.8)
T12	-	102 (27.1)	55 (21.1)	55 (22.0)
L1	-	113 (30.1)	85 (32.6)	77 (30.8)
L2	-	58 (15.4)	57 (21.8)	48 (19.2)
L3	-	38 (10.1)	25 (9.6)	20 (8.0)
L4	-	31 (8.2)	18 (6.9)	20 (8.0)
L5	-	15 (4.0)	7 (2.7)	18 (7.2)

Internal validation	CR ICC	CR 95% CI	CA ICC	CA 95% CI	GA ICC	GA 95% CI	SI ICC	SI 95% CI
DL	0.860	0.669–0.925	0.944	0.927–0.956	0.932	0.908–0.950	0.779	0.542–0.876
Without DL
NS R2	0.882	0.845–0.916	0.914	0.824–0.940	0.925	0.827–0.962	0.762	0.664–0.829
Rad R2	0.902	0.669–0.925	0.920	0.896–0.966	0.929	0.885–0.980	0.764	0.734–0.936
Rad R4	0.914	0.819–0.961	0.933	0.856–0.960	0.931	0.882–0.958	0.794	0.766–0.866
Surgeon	0.923	0.897–0.980	0.952	0.915–0.973	0.941	0.851–0.977	0.815	0.801–0.878
With DL
NS R2	0.924	0.881–0.988	0.933	0.893–0.944	0.930	0.848–0.959	0.791	0.740–0.825
Rad R2	0.925	0.853–0.990	0.946	0.867–0.961	0.931	0.890–0.981	0.793	0.753–0.895
Rad R4	0.943	0.856–0.979	0.946	0.899–0.977	0.937	0.909–0.966	0.801	0.723–0.852
Surgeon	0.939	0.885–0.981	0.950	0.904–0.967	0.940	0.915–0.962	0.812	0.799–0.964

External validation	CR ICC	CR 95% CI	CA ICC	CA 95% CI	GA ICC	GA 95% CI	SI ICC	SI 95% CI
DL	0.836	0.700–0.899	0.940	0.844–0.970	0.916	0.668–0.965	0.815	0.640–0.929
Without DL
NS R2	0.880	0.860–0.971	0.911	0.887–0.989	0.911	0.866–0.945	0.774	0.718–0.819
Rad R2	0.899	0.853–0.979	0.924	0.914–0.983	0.925	0.878–0.980	0.784	0.761–0.916
Rad R4	0.910	0.892–0.963	0.932	0.858–0.979	0.928	0.896–0.972	0.803	0.730–0.892
Surgeon	0.930	0.893–0.986	0.950	0.933–0.975	0.944	0.924–0.958	0.836	0.785–0.877
With DL
NS R2	0.901	0.874–0.925	0.923	0.910–0.947	0.933	0.892–0.967	0.810	0.767–0.917
Rad R2	0.904	0.837–0.970	0.941	0.911–0.977	0.929	0.838–0.985	0.803	0.767–0.859
Rad R4	0.923	0.872–0.980	0.940	0.917–0.985	0.936	0.867–0.977	0.824	0.809–0.962
Surgeon	0.938	0.873–0.980	0.951	0.919–0.967	0.947	0.863–0.962	0.841	0.825–0.975
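
The reader-level change in ICC with DL assistance can be tabulated directly from the values above. The following minimal Python sketch uses the internal validation ICC point estimates transcribed from the table; the variable and function names are hypothetical, and the differences are descriptive only and carry no information about statistical significance.

```python
# Internal validation ICC point estimates transcribed from the table above.
METRICS = ("CR", "CA", "GA", "SI")

without_dl = {
    "NS R2": (0.882, 0.914, 0.925, 0.762),
    "Rad R2": (0.902, 0.920, 0.929, 0.764),
    "Rad R4": (0.914, 0.933, 0.931, 0.794),
    "Surgeon": (0.923, 0.952, 0.941, 0.815),
}
with_dl = {
    "NS R2": (0.924, 0.933, 0.930, 0.791),
    "Rad R2": (0.925, 0.946, 0.931, 0.793),
    "Rad R4": (0.943, 0.946, 0.937, 0.801),
    "Surgeon": (0.939, 0.950, 0.940, 0.812),
}

def icc_change(before, after):
    """Per-reader, per-metric difference in ICC (after minus before)."""
    return {
        reader: {
            metric: round(a - b, 3)
            for metric, b, a in zip(METRICS, before[reader], after[reader])
        }
        for reader in before
    }

delta = icc_change(without_dl, with_dl)
for reader, changes in delta.items():
    print(reader, changes)
```

In the internal validation set, every trainee shows a positive change for all 4 metrics, whereas the attending surgeon's ICC rises for CR but is marginally lower for CA, GA, and SI.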