Title | Clinical Machine Learning Modeling Studies: Methodology and Data Reporting |
Creator | Oana M. Dumitrascu; Yalin Wang; John J. Chen |
Affiliation | Departments of Neurology (OMD) and Ophthalmology (OMD), Mayo Clinic, Scottsdale, Arizona; Departments of Neurology (JC) and Ophthalmology (JC), Mayo Clinic, Rochester, Minnesota; and School of Computing and Augmented Intelligence (YW), Arizona State University, Phoenix, Arizona |
Subject | Machine Learning; Research Design |
OCR Text | Editorial. Clinical Machine Learning Modeling Studies: Methodology and Data Reporting. Oana M. Dumitrascu, MD, MSc, Yalin Wang, PhD, John J. Chen, MD, PhD. The authors report no conflicts of interest. Address correspondence to Oana M. Dumitrascu, MD, MSc, Mayo Clinic Arizona, 13400 E. Shea Boulevard, Scottsdale, AZ 85259; E-mail: dumitrascu.oana@mayo.edu.

Artificial intelligence (AI) is a branch of computer science dealing with independent intelligent behavior in computers (1), aiming to develop computer systems that can perform tasks normally requiring human intelligence (2). AI is a tool with the potential to automate medical data interpretation and thereby facilitate early diagnosis and timely management of disease, with image interpretation being of particular interest (3–5). Such approaches may increase efficiency and capacity in neuro-ophthalmology, which faces a current and prospective shortage of providers (6). Machine learning (ML) is the process through which this intelligence is created, and deep learning (DL) is a subtype of ML that uses convolutional neural networks (CNN). Driven by potential medical applications and by advances in computer science and computing power, ML and AI studies in ophthalmology have increased rapidly, employing multiple computational strategies, with DL being the most advanced methodology to date (7). Relevant to neuro-ophthalmology is the recent study by Liu et al (8), which developed a deep CNN to classify a spectrum of optic disc abnormalities in color fundus photographs. The ultimate aim of that study was to determine whether the CNN could automatically distinguish between normal and abnormal optic discs. Similar forthcoming studies and trials in ophthalmology and neuro-ophthalmology make critical appraisal of ML/AI-based studies of paramount importance. To highlight the strengths and limitations of the study by Liu et al (8), we review here basic ML methodology, with a focus on DL model implementation, and recent guidelines for AI clinical trial design and reporting.

CLINICAL ML OVERVIEW

ML aims to develop models and algorithms that can perform specific tasks by learning patterns from representative data (1,2). The implementation of an ML model involves a minimum of 2 phases, training and testing (9). To improve performance, and when the available samples are large enough, a validation phase is introduced between the training and testing phases (Fig. 1). Regardless of whether 2 or 3 phases are used, it is critically important that, in the final testing phase, the model be evaluated independently of its training and fine-tuning.

FIG. 1. Main phases of a machine learning algorithm development.

The training (learning) phase generates the classification model. Training data sets must be sufficiently large and representative of the “general” population to obtain a model with generalization ability (10). Training can be supervised, semisupervised, or unsupervised, depending on the input data provided (11,12). Supervised learning means the training data are labeled with the correct output (9). For example, a data set of fundus photographs (input) includes labels for normal optic disc, disc with papilledema, or disc with other abnormalities (output) (13). The learning system is then tasked with finding a relation that maps each input of the training set (the data) to an output (the label) (9). Semisupervised learning allows the use of partially labeled data (14). Conversely, in unsupervised learning, the training data are not labeled; the learning system is tasked with identifying patterns that separate the data into subsets. For example, unsupervised learning was used to identify quantifiable patterns of visual field loss in idiopathic intracranial hypertension that were similar to those designed by human intelligence (15).
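To make the training, validation, and testing phases concrete, the following is a minimal sketch of a supervised workflow in Python, assuming scikit-learn is available. The synthetic features, labels, and logistic-regression classifier are illustrative stand-ins for image-derived data; this is not the pipeline used by Liu et al.

# Minimal illustrative sketch; synthetic features stand in for image-derived data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))                      # stand-in for image-derived features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)       # stand-in labels: 0 = normal, 1 = abnormal

# Split off a held-out test set, then carve a validation set out of the training data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Training + validation phases: fit candidate models, keep the best validation performer.
best_model, best_val = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=c, max_iter=1000).fit(X_tr, y_tr)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val:
        best_model, best_val = model, val_acc

# Testing phase: the selected model sees the held-out test set exactly once.
test_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"validation accuracy {best_val:.2f}, test accuracy {test_acc:.2f}")

The key point illustrated is that the held-out test data are used only once, after model selection has been completed on the validation data.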
In the validation phase, the model parameters determined during the training phase are optimized to maximize a given metric. Such parameters may include the number of variables used or their relative weights. Data used in this phase are called validation data, and the performance of the model in correctly classifying these data is called validation performance (10).

The testing phase consists of evaluating the model learned during the training and validation phases on new samples. In supervised ML, the ability of the model to correctly classify new, unlabeled input data into one of the classes of labels defined during the training phase is assessed (16). In unsupervised ML, the ability of the model to classify new input data into one of the subsets implicitly defined during the training phase is assessed (16). The testing data set should exclude samples included in the training and validation data and should represent diverse subjects, to assess the generalizability of the model. The performance of the model in correctly classifying the testing data is called testing performance (10). Multiple data sets may be used in the testing phase. In the study by Liu et al (8), testing performance was assessed on an external data set that reflected real-world experience (e.g., retinal images with different image quality, various artifacts, and dissimilar fields of view), and there was a drop in testing performance driven by an increase in false positives.

ML model performance should be reported at 2 levels: 1) how the model itself performs on the test data set (e.g., F1 score (17), Dice coefficient (18)) and 2) how the model predictions translate into clinical performance metrics (sensitivity, specificity, positive predictive value, negative predictive value, numbers needed to treat, and area under the receiver operating characteristic [ROC] curve) (19). A minimal sketch of both reporting levels appears below.

One technique often used for image classification tasks is the heat map: a color-coded image that highlights the image regions that are pivotal in the algorithm's classification decision. For instance, a warmer color (red) indicates a region that plays a major role in the decision process, whereas a colder color (blue) indicates a lesser role. These heat maps helped Liu et al (8) gain insight into the parts of the color fundus photographs that contributed to differentiating normal from abnormal optic discs.
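For the 2 reporting levels described above, a minimal sketch of how model-level and clinical performance metrics can be computed with scikit-learn is shown here; the label and probability arrays are toy values, not data from Liu et al.

# Minimal illustrative sketch, assuming scikit-learn; toy values only.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])                       # reference labels
y_prob = np.array([0.1, 0.4, 0.2, 0.8, 0.7, 0.9, 0.3, 0.2, 0.6, 0.55])  # model probabilities
y_pred = (y_prob >= 0.5).astype(int)                                     # thresholded predictions

# Level 1: model-level metrics on the test set.
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)

# Level 2: clinical performance metrics derived from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)   # positive predictive value
npv = tn / (tn + fn)   # negative predictive value
print(f"F1 {f1:.2f}, AUC {auc:.2f}, Se {sensitivity:.2f}, Sp {specificity:.2f}, "
      f"PPV {ppv:.2f}, NPV {npv:.2f}")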
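One simple way to produce a classification heat map is occlusion sensitivity: systematically cover parts of the image and record how much the predicted probability of the target class drops. The sketch below assumes PyTorch with torchvision 0.13 or later, an untrained ResNet-18 standing in for a trained classifier, and a random tensor standing in for a preprocessed fundus photograph; it is a generic illustration, not the saliency method used by Liu et al, whose implementation may differ (gradient-based methods such as class activation mapping are also common).

# Minimal illustrative sketch of an occlusion-sensitivity heat map.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None)            # a trained model would be loaded here
model.eval()

image = torch.rand(1, 3, 224, 224)        # stand-in for a preprocessed fundus photograph
target_class = 1                          # hypothetical index of the "abnormal disc" class
patch, stride = 32, 32

with torch.no_grad():
    base_prob = F.softmax(model(image), dim=1)[0, target_class].item()
    heat = torch.zeros(224 // stride, 224 // stride)
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            occluded = image.clone()
            occluded[:, :, i*stride:i*stride + patch, j*stride:j*stride + patch] = 0.5  # gray patch
            prob = F.softmax(model(occluded), dim=1)[0, target_class].item()
            heat[i, j] = base_prob - prob  # large drop = region important for the prediction

print(heat)  # upsampled and overlaid on the image, this becomes a visual heat map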
A detailed review of the heat map testing of Liu et al with an external data set showed that one possible cause of false positives was the erroneous interpretation of image artifacts that were not present in the training data set.

CHALLENGES IN ML MODEL IMPLEMENTATION

Sample size is one of the main drivers of ML model performance, because small training and test sets are sources of bias and contribute to the variance of model performance. DL, in particular, requires a large amount of training data, such as a few thousand images for diabetic retinopathy detection (20) or a few hundred thousand images for predicting cardiovascular risk factors from retinal fundus images (21). This requirement could limit DL applicability in fields such as neuro-ophthalmology, where large image databases are typically not available. When trained on small data sets, DL models are prone to overfitting (18), meaning the model performs well on training data but poorly on testing (new) data. Conversely, the model can be underfitted if the training data do not allow sufficient learning. To overcome this limitation, techniques such as deep transfer learning (22) and data augmentation (23,24) have been developed. Transfer learning was adopted by Liu et al (8) to help compensate for a training set of only 944 color fundus images (a minimal illustrative sketch of transfer learning with data augmentation appears at the end of this section).

GUIDELINES FOR ML, DL, AND AI CLINICAL TRIALS DESIGN AND REPORTING

For ML studies to be implemented in patient care, evidence from randomized controlled trials (RCT) has recently been mandated by agencies guiding health research quality (25). SPIRIT-AI (Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence) is a reporting guideline for clinical trial protocols evaluating interventions with an AI component (2). SPIRIT-AI was designed to promote transparency and completeness in clinical trial protocols for AI interventions; its use helps editors, reviewers, and general readers understand, interpret, and critically appraise the design and risk of bias of a planned clinical trial. It was developed in parallel with its companion statement for reporting AI trials, CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence). Neither of these guidelines is pertinent to the report of Liu et al because their study was not an RCT. The Checklist for Artificial Intelligence in Medical Imaging (CLAIM) was proposed as a best-practice guide for authors and reviewers of AI manuscripts in medical imaging (26). The MI-CLAIM (minimum information about clinical artificial intelligence modeling) checklist was proposed to enable direct assessment of clinical impact and rapid replication of the technical design process of a clinical AI study (19). MI-CLAIM has 6 major parts: study design; separation of data into partitions for model training and model testing; optimization and final model selection; performance evaluation; model examination/explanation; and end-to-end pipeline replication (19). Liu et al (8) did not include end-to-end pipeline replication in their report, which is important to enable independent researchers to validate published results in their own independent cohorts.
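Returning to the transfer learning and data augmentation techniques noted under Challenges, the sketch below shows one common recipe: start from an ImageNet-pretrained torchvision ResNet-18, freeze the feature extractor, and retrain only a new 2-class head on augmented images. It assumes PyTorch with torchvision 0.13 or later; the class count, augmentations, and hyperparameters are illustrative and do not reproduce the configuration of Liu et al.

# Minimal illustrative sketch of transfer learning with data augmentation.
import torch
import torch.nn as nn
from torchvision import models, transforms

# Data augmentation: random perturbations enlarge the effective training set and
# reduce overfitting; a real pipeline would apply this inside a Dataset/DataLoader.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Transfer learning: start from ImageNet-pretrained weights, freeze the feature
# extractor, and retrain only a new 2-class head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)   # the new head stays trainable

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a random batch standing in for augmented
# fundus photographs (a real loop would iterate over a DataLoader).
images, labels = torch.rand(8, 3, 224, 224), torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()

Because only the small classification head is trained, far fewer labeled images are needed than for training a deep CNN from scratch, which is the motivation for using transfer learning on small data sets.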
Additional guidelines for AI clinical research quality assessment include QUADAS-AI, an AI-centered diagnostic test accuracy quality assessment tool (27); STARD-AI, which guides AI-centered diagnostic test accuracy studies (28); and DECIDE-AI, a reporting guideline to bridge the development-to-implementation gap in clinical AI (29). New quality metrics and guidelines will be developed as AI applications continue to expand.

Development of an automatic, effective, and accurate method to detect specific optic disc abnormalities in color fundus photographs has boundless promise and remains a current unmet need. The study by Liu et al (8) published in the Journal of Neuro-Ophthalmology demonstrates some of the potential of DL for identifying optic disc abnormalities. Larger future studies demonstrating high accuracy, validity, sensitivity, and specificity of automated retinal image analysis for the detection of optic neuropathies are warranted. The diagnostic performance of a DL method should be clinically acceptable and highly reproducible in external validation data sets, including real-world data (large databases with heterogeneous data), and reported in a transparent manner, before implementation in the clinical setting is attempted (30). Future combined work by humans and ML will likely produce augmented, super-intelligent methods for improved clinical efficiency and superior outcomes in neuro-ophthalmology.

REFERENCES
1. Abels E, Pantanowitz L, Aeffner F, Zarella MD, van der Laak J, Bui MM, Vemuri VN, Parwani AV, Gibbs J, Agosto-Arroyo E, Beck AH, Kozlowski C. Computational pathology definitions, best practices, and recommendations for regulatory guidance: a white paper from the Digital Pathology Association. J Pathol. 2019;249:286–294.
2. Cruz Rivera S, Liu X, Chan AW, Denniston AK, Calvert MJ; SPIRIT-AI and CONSORT-AI Working Group; SPIRIT-AI and CONSORT-AI Steering Group; SPIRIT-AI and CONSORT-AI Consensus Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med. 2020;26:1351–1363.
3. Balyen L, Peto T. Promising artificial intelligence-machine learning-deep learning algorithms in ophthalmology. Asia Pac J Ophthalmol (Phila). 2019;8:264–272.
4. Gunasekeran DV, Wong TY. Artificial intelligence in ophthalmology in 2020: a technology on the cusp for translation and implementation. Asia Pac J Ophthalmol (Phila). 2020;9:61–66.
5. Armstrong GW, Lorch AC. A(eye): a review of current applications of artificial intelligence and machine learning in ophthalmology. Int Ophthalmol Clin. 2020;60:57–71.
6. Frohman LP. How can we assure that neuro-ophthalmology will survive? Ophthalmology. 2005;112:741–743.
7. Lu W, Tong Y, Yu Y, Xing Y, Chen C, Shen Y. Applications of artificial intelligence in ophthalmology: general overview. J Ophthalmol. 2018;2018:5278196.
8. Liu TYA, Wei J, Zhu H, Subramanian PS, Myung D, Yi PH, Hui FK, Unberath M, Ting DSW, Miller NR. Detection of optic disc abnormalities in color fundus photographs using deep learning. J Neuroophthalmol. 2021;41:368–374.
9. Castiglioni I, Rundo L, Codari M, Di Leo G, Salvatore C, Interlenghi M, Gallivanone F, Cozzi A, D'Amico NC, Sardanelli F. AI applications to medical images: from machine learning to deep learning. Phys Med. 2021;83:9–24.
10. Ranschaert ER, Morozov S, Algra PR, editors. Artificial Intelligence in Medical Imaging. Cham: Springer International Publishing; 2019.
11. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw. 2015;61:85–117.
12. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–444.
13. Milea D, Najjar RP, Zhubo J, Ting D, Vasseneix C, Xu X, Aghsaei Fard M, Fonseca P, Vanikieti K, Lagrèze WA, La Morgia C, Cheung CY, Hamann S, Chiquet C, Sanda N, Yang H, Mejico LJ, Rougier M-B, Kho R, Thi Ha Chau T, Singhal S, Gohier P, Clermont-Vignal C, Cheng C-Y, Jonas JB, Yu-Wai-Man P, Fraser CL, Chen JJ, Ambika S, Miller NR, Liu Y, Newman NJ, Wong TY, Biousse V. Artificial intelligence to detect papilledema from ocular fundus photographs. N Engl J Med. 2020;382:1687–1695.
14. Zhou Z-H. A brief introduction to weakly supervised learning. Natl Sci Rev. 2018;5:44–53.
15. Doshi H, Solli E, Elze T, Pasquale LR, Wall M, Kupersmith MJ. Unsupervised machine learning identifies quantifiable patterns of visual field loss in idiopathic intracranial hypertension. Transl Vis Sci Technol. 2021;10:37.
16. Bishop C. Pattern Recognition and Machine Learning. New York: Springer-Verlag; 2006.
17. Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data. 2021;8:53.
18. Dice L. Measures of the amount of ecologic association between species. Ecology. 1945;26:297–302.
19. Norgeot B, Quer G, Beaulieu-Jones BK, Torkamani A, Dias R, Gianfrancesco M, Arnaout R, Kohane IS, Saria S, Topol E, Obermeyer Z, Yu B, Butte AJ. Minimum information about clinical artificial intelligence modeling: the MI-CLAIM checklist. Nat Med. 2020;26:1320–1324.
20. Gulshan V, Rajan RP, Widner K, Wu D, Wubbels P, Rhodes T, Whitehouse K, Coram M, Corrado G, Ramasamy K, Raman R, Peng L, Webster DR. Performance of a deep-learning algorithm vs manual grading for detecting diabetic retinopathy in India. JAMA Ophthalmol. 2019;137:987–993.
21. Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018;2:158–164.
22. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. p. 1–9.
23. Chlap P, Min H, Vandenberg N, Dowling J, Holloway L, Haworth A. A review of medical image data augmentation techniques for deep learning applications. J Med Imaging Radiat Oncol. 2021;65:545–563.
24. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6:60.
25. Topol EJ. Welcoming new guidelines for AI clinical research. Nat Med. 2020;26:1318–1320.
26. Mongan J, Moy L, Kahn CE Jr. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. 2020;2:e200029.
27. Sounderajah V, Ashrafian H, Golub RM, Shetty S, De Fauw J, Hooft L, Moons K, Collins G, Moher D, Bossuyt PM, Darzi A, Karthikesalingam A, Denniston AK, Mateen BA, Ting D, Treanor D, King D, Greaves F, Godwin J, Pearson-Stuttard J, Harling L, McInnes M, Rifai N, Tomasev N, Normahani P, Whiting P, Aggarwal R, Vollmer S, Markar SR, Panch T, Liu X. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ Open. 2021;11:e047709.
28. Sounderajah V, Ashrafian H, Rose S, Shah NH, Ghassemi M, Golub R, Kahn CE, Esteva A, Karthikesalingam A, Mateen B, Webster D, Milea D, Ting D, Treanor D, Cushnan D, King D, McPherson D, Glocker B, Greaves F, Harling L, Ordish J, Cohen JF, Deeks J, Leeflang M, Diamond M, McInnes MDF, McCradden M, Abràmoff MD, Normahani P, Markar SR, Chang S, Liu X, Mallett S, Shetty S, Denniston A, Collins GS, Moher D, Whiting P, Bossuyt PM, Darzi A. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat Med. 2021;27:1663–1665.
29. DECIDE-AI Steering Group. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med. 2021;27:186–187.
30. Tsopra R, Fernandez X, Luchinat C, Alberghina L, Lehrach H, Vanoni M, Dreher F, Sezerman OU, Cuggia M, de Tayrac M, Miklasevics E, Itu LM, Geanta M, Ogilvie L, Godey F, Boldisor CN, Campillo-Gimenez B, Cioroboiu C, Ciusdel CF, Coman S, Hijano Cubelos O, Itu A, Lange B, Le Gallo M, Lespagnol A, Mauri G, Soykam HO, Rance B, Turano P, Tenori L, Vignoli A, Wierling C, Benhabiles N, Burgun A. A framework for validating AI in precision medicine: considerations from the European ITFoC consortium. BMC Med Inform Decis Mak. 2021;21:274. |
Date | 2022-06 |
Language | eng |
Format | application/pdf |
Type | Text |
Publication Type | Journal Article |
Source | Journal of Neuro-Ophthalmology, June 2022, Volume 42, Issue 2
Collection | Neuro-Ophthalmology Virtual Education Library: Journal of Neuro-Ophthalmology Archives: https://novel.utah.edu/jno/ |
Publisher | Lippincott, Williams & Wilkins |
Holding Institution | Spencer S. Eccles Health Sciences Library, University of Utah |
Rights Management | © North American Neuro-Ophthalmology Society |
ARK | ark:/87278/s6h042w3 |
Setname | ehsl_novel_jno |
ID | 2307872 |
Reference URL | https://collections.lib.utah.edu/ark:/87278/s6h042w3 |