Commentary on “Predicting Mechanical Complications After Adult Spinal Deformity Operation Using a Machine Learning Based on Modified Global Alignment and Proportion Scoring With Body Mass Index and Bone Mineral Density”

Article information

Neurospine. 2023;20(1):275-277

Publication date (electronic) : 2023 March 31

doi : https://doi.org/10.14245/ns.2346296.148

Fabio Galbusera¹, Andrea Cina¹,², Dino Samartzis³

¹Schulthess Clinic, Zürich, Switzerland

²Department of Health Sciences and Technologies, ETH Zürich, Zürich, Switzerland

³Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, IL, USA

Corresponding Author: Fabio Galbusera, Department of Teaching, Research and Development, Schulthess Clinic, Lengghalde 2, Zürich 8008, Switzerland. Email: fabio.galbusera@kws.ch

Predictive modeling has become a hot topic in many fields of medical research, including spine surgery. Several models able to predict the risk of mechanical complications based on patients’ demographic and clinical information have been described [1], some of which include data derived from radiological imaging. The paper “Predicting mechanical complications after adult spinal deformity operation using a machine learning based on modified global alignment and proportion scoring with body mass index and bone mineral density” [2] follows this line of research, introducing the global alignment and proportion score as well as other relevant radiological measurements such as relative pelvic version, relative lumbar lordosis and bone mineral density. On the dataset on which the model was tested, the recorded performance was promising, with accuracies in the range of 62%–73%. It is worth noting that a simple method, such as logistic regression, achieved accuracies very similar to those of the best-performing method, which was in this case a random forest. Even techniques usually considered state-of-the-art in the field, like XGBoost and deep neural networks, did not perform better than logistic regression in this dataset.

Such surprising results highlight the challenges hidden in the development and validation of predictive models. While the performance of a model may look favourable in absolute terms when examined on a relatively small test set, the model may prove inadequate when implemented in clinical practice or, more simply, when applied to a dataset collected in another center or country. In particular, complex models, such as deep neural networks, are well known for their high risk of overfitting when trained on small datasets and not properly validated. In fact, this also emerges in the present paper, in which the gap between the training accuracy (1.0) and the test accuracy (0.7/0.8) for the random forest might indicate some degree of overfitting. In the spirit of Ockham’s razor, simpler is usually better; if a logistic regression or decision tree performs similarly to a deep neural network, there is no reason to prefer the more complex solution, which inevitably involves a risk of overfitting and lower generalizability. Moreover, simpler models such as logistic regression and random forests are easier to interpret, which makes it possible to explain the model’s decisions, an important requirement in a clinical setting. For the present study, we commend the choice of comparing the performance of various machine learning techniques, even if we would have rather concluded that advanced machine learning methods did not provide a conspicuous advantage over logistic regression in this dataset.
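The train/test gap mentioned above can be illustrated with a minimal sketch. It is not the method of the commented paper: the `MemorizingClassifier`, the toy data and the function names are hypothetical, chosen only to show how a model that memorizes its training set reaches a training accuracy of 1.0 while performing worse on unseen cases.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def generalization_gap(train_acc, test_acc):
    """Difference between training and test accuracy; a large gap hints at overfitting."""
    return train_acc - test_acc


class MemorizingClassifier:
    """Hypothetical extreme case of overfitting: a lookup table of the training set."""

    def fit(self, X, y):
        # Store every training example verbatim.
        self.table = {tuple(x): label for x, label in zip(X, y)}
        # Unseen inputs fall back to the most frequent training label.
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table.get(tuple(x), self.default) for x in X]


# Toy data (invented for illustration only).
X_train = [(1, 0), (0, 1), (1, 1), (2, 1), (0, 0)]
y_train = [1, 0, 1, 1, 0]
X_test = [(3, 0), (0, 3), (0, 2), (1, 1)]
y_test = [1, 0, 0, 1]

model = MemorizingClassifier().fit(X_train, y_train)
train_acc = accuracy(y_train, model.predict(X_train))
test_acc = accuracy(y_test, model.predict(X_test))
gap = generalization_gap(train_acc, test_acc)
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

A gap alone does not prove overfitting, and a small test set makes the estimate noisy; the sketch only shows why a perfect training accuracy should be read with caution.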

A key limitation of the present work is the lack of external validation. Prior to contemplating the use of a predictive model in clinical practice to provide guidance in the choice of the most appropriate treatment, its capability to provide accurate predictions for patients different from those used for training the model must be proven [3]. This does not mean that the model would be expected to perform well using different inclusion and exclusion criteria, but rather in patients selected in the same way and treated by different clinicians, possibly in other centers or countries [4]. Cases of predictive models showing excellent performance in the population used for their development but poor generalization in other patient groups are well known and extensively reported [5], including in the field of spine surgery [6]. A recent systematic review by Lubelski et al. [1] reported that, of 31 papers describing prognostic models for degenerative spine surgery, only 5 described an external validation, thereby underscoring the heterogeneity and lack of robustness between studies and the need for more “quality control.”

As a matter of fact, most papers about predictive models in the medical field employ the so-called “random split” approach, also used in this study, in which the majority (70%–90%) of the available data (most commonly from a single dataset) is used for training the model, while the remaining 10%–30% is used for testing its performance. While this approach is meaningful for a first assessment of the model’s capabilities and, to an extent, for internal validation, it is not generally deemed sufficient for the aim of “external” validation. Temporal and geographical splits can be considered acceptable solutions; nevertheless, the gold standard for external validation is a replication of the test by independent researchers using novel data, which eliminates the possibility of adjusting or fine-tuning the model after a first attempt at external validation is conducted [4].
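The difference between the three splitting strategies can be sketched in a few lines. This is only an illustration under stated assumptions: the record fields `center` and `year`, the function names and the toy records are hypothetical and do not come from the commented study.

```python
import random


def random_split(records, test_fraction=0.2, seed=0):
    """Random split: training and test cases are drawn from the same pool."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(records, cutoff_year):
    """Temporal split: train on earlier cases, test on cases from cutoff_year onwards."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test


def geographic_split(records, held_out_center):
    """Geographical split: one center is never seen during training."""
    train = [r for r in records if r["center"] != held_out_center]
    test = [r for r in records if r["center"] == held_out_center]
    return train, test


# Toy records (invented): ten patients from three hypothetical centers.
records = [
    {"id": i, "center": center, "year": year}
    for i, (center, year) in enumerate(
        [("A", 2015), ("A", 2016), ("A", 2019), ("B", 2015), ("B", 2018),
         ("B", 2020), ("C", 2017), ("C", 2019), ("C", 2020), ("C", 2021)]
    )
]

rand_train, rand_test = random_split(records, test_fraction=0.2, seed=0)
time_train, time_test = temporal_split(records, cutoff_year=2019)
geo_train, geo_test = geographic_split(records, held_out_center="C")
```

With a random split, every center and period can contribute to both sets, so the test set resembles the training data by construction. The temporal and geographical variants keep the test cases apart along one axis, but they still rely on data held by the same researchers, which is why they fall short of an independent replication on novel data.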

Journals reporting the performance of novel predictive models should consider implementing a policy about the level of validation, similar to the requirement of declaring the level of evidence that several journals have introduced for clinical research papers, or in a similar spirit to various study checklists, such as the CONSORT (Consolidated Standards of Reporting Trials) checklist for reporting clinical trials. While highly innovative or methodologically oriented studies may be of interest to readers even if the validation of the model has not yet been comprehensively completed, for papers describing prognostic models and suggesting their clinical implementation in the near future, a proper external validation may be considered imperative and thus be strictly required.

Another issue about machine learning-based prognostic models that is frequently overlooked by the scientific community, even if it is not strictly related to the publications that describe their development and validation, is the “regulatory aspect” [7]. In many jurisdictions, including the USA and the European Union, prognostic models used as decision-support tools in the management of patients are considered “software as a medical device” (SaMD). Based on the definition of the International Medical Device Regulators Forum, SaMD is “software intended to be used for one or more medical purposes that perform these purposes without being part of a hardware medical device” [8]; such a definition would therefore include tools for predicting outcomes and complications of surgical treatments. While the exact requirements for SaMD depend on the specific local regulations, they generally include risk categorization, a quality management system, and a clinical evaluation to be conducted by means of a clinical trial. Machine learning tools used as SaMD generally have further requirements, such as transparency, explicability, minimization of bias, human oversight, repeatability, reliability, safety, and possibly others [9]. While this should not be considered a criticism of the present paper, since regulatory issues are evidently out of its scope, we believe it is worth noting that any predictive tool should undergo such an assessment before any conceivable clinical implementation.

Taking into account these open challenges in the development of valid prognostic models in spine surgery, it emerges that the major bottleneck is the availability of a large pool of high-quality data from multiple centers worldwide that can be accessed by interested researchers. National and international spine registries, such as SweSpine, SIRIS Spine and Spine Tango, naturally lend themselves to this aim, but also have drawbacks. First, for the sake of keeping their use as simple and as little time-consuming as possible, the quantity of collected data is minimized; outcome measures are scarcely represented, as are radiological imaging and parameters. Second, the documentation of all cases in international registries, such as Spine Tango, is not mandatory due to the complexity of the legal framework, leading to the risk of selection bias. Third, rights of data use and potential commercial exploitation would need to be discussed in advance among the institutes participating in the data collection, which may lead to disagreements and tension within the consortium. The latter issue is especially important because of the high costs associated with the regulatory assessment of SaMD, which, as mentioned above, is mandatory before any clinical implementation and would necessarily involve the participation of industry players. Nevertheless, large data collection projects involving multiple hospitals in different countries, possibly building on existing registries and databases, will undoubtedly play a major role in the future development of large-scale predictive models that can effectively impact the quality of healthcare in spine surgery. That said, quality control, validation and replication, and adherence to regulatory guidelines will no doubt be cornerstones if any platform fuelled by artificial intelligence is to be adopted by the community for clinical decision-making and patient care.

Notes

Conflict of Interest

The authors have nothing to disclose.

References

1. Lubelski D, Hersh A, Azad TD, et al. Prediction models in degenerative spine surgery: a systematic review. Global Spine J 2021;11(1_suppl):79S–88S.

2. Noh SH, Lee HS, Park GE, et al. Predicting mechanical complications after adult spinal deformity operation using a machine learning based on modified global alignment and proportion scoring with body mass index and bone mineral density. Neurospine 2023;20:265–74.

3. Collins GS, de Groot JA, Dutton S, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol 2014;14:40.

4. Ramspek CL, Jager KJ, Dekker FW, et al. External validation of prognostic models: what, why, how, when and where? Clin Kidney J 2020;14:49–58.

5. Siontis GC, Tzoulaki I, Castaldi PJ, et al. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol 2015;68:25–34.

6. Janssen DMC, van Kuijk SMJ, d'Aumerie BB, et al. External validation of a prediction model for surgical site infection after thoracolumbar spine surgery in a Western European cohort. J Orthop Surg Res 2018;13:114.

7. Hornung AL, Hornung CM, Mallow GM, et al. Artificial intelligence and spine imaging: limitations, regulatory issues and future direction. Eur Spine J 2022;31:2007–21.

8. Software as a medical device (SaMD): key definitions [Internet]. International Medical Device Regulators Forum; c2013 [cited 2023 Mar 7]. Available from: https://www.imdrf.org/sites/default/files/docs/imdrf/final/technical/imdrftech-131209-samd-key-definitions-140901.pdf.

9. Artificial intelligence in EU medical device legislation [Internet]. Brussels (Belgium): COCIR; 2023 [cited 2023 Mar 7]. Available from: https://futurium.ec.europa.eu/sites/default/files/2020-09/COCIR_Analysis_on_AI_in_medical_Device_Legislation_-_Sept._2020_-_Final_2.pdf.

This is an open access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.