Utilizing light scattering spectroscopy and machine learning to differentiate heterogenous tissue samples

Publication Type	honors thesis
School or College	College of Engineering
Department	Biomedical Engineering
Faculty Mentor	Robert Hitchcock
Creator	Tiwari, Sarthak
Title	Utilizing light scattering spectroscopy and machine learning to differentiate heterogenous tissue samples
Date	2023
Description	Current biopsy approaches either lack depth penetration or real-time analysis. Light scattering spectroscopy (LSS) aims to address these issues, and our goal is to investigate the use of LSS for identifying cardiac conduction tissues. LSS data were collected from donor tissue samples, and a 3D heart model was created by staining the samples, segmenting them using a modified U-Net, and registering them. The model allowed determination of the tissue type and depth under each LSS point. Dimensionality reduction approaches were employed to visualize the data, and statistical methods were used to assess whether key regions differed significantly in tissue composition. Based on this information, labels were generated for the tissue that could be utilized by machine learning algorithms. Various deep learning and machine learning models were tested to determine how well LSS could differentiate the underlying tissue and whether it could identify cardiac conduction tissue. Significant differences were observed in connective tissue, muscle tissue, and nuclei density in nodal regions compared to other regions. The machine learning models identified muscle tissue as the strongest predictor of nodal tissue. LSS shows promise for identifying cardiac conduction tissue and supports previous research in spectroscopy. By continuing to develop the LSS-ML system, we could further enhance tissue classification from optical approaches, which is necessary to advance to clinical trials, ultimately improving patient quality of life and healthcare outcomes for critical procedures.
Type	Text
Publisher	University of Utah
Subject	LSS; tissue
Language	eng
Rights Management	© Sarthak Tiwari
Format Medium	application/pdf
Permissions Reference URL	https://collections.lib.utah.edu/ark:/87278/s6rp1amf
ARK	ark:/87278/s6h4j7nx
Setname	ir_htoa
ID	2363431
OCR Text	UTILIZING LIGHT SCATTERING SPECTROSCOPY AND MACHINE LEARNING TO DIFFERENTIATE HETEROGENOUS TISSUE SAMPLES
by Sarthak Tiwari
A Senior Honors Thesis Submitted to the Faculty of The University of Utah In Partial Fulfillment of the Requirements for the Honors Degree in Bachelor of Science In Biomedical Engineering
Approved: Robert Hitchcock, Thesis Faculty Supervisor; David W. Grainger, PhD, Chair, Department of Biomedical Engineering; Kelly W. Broadhead, PhD, Honors Faculty Advisor; Sylvia D. Torti, PhD, Dean, Honors College
May 2023
Copyright © 2023 All Rights Reserved
ABSTRACT
Current biopsy approaches either lack depth penetration or real-time analysis. Light scattering spectroscopy (LSS) aims to address these issues, and our goal is to investigate the use of LSS for identifying cardiac conduction tissues. LSS data were collected from donor tissue samples, and a 3D heart model was created by staining the samples, segmenting them using a modified U-Net, and registering them. The model allowed determination of the tissue type and depth under each LSS point. Dimensionality reduction approaches were employed to visualize the data, and statistical methods were used to assess whether key regions differed significantly in tissue composition. Based on this information, labels were generated for the tissue that could be utilized by machine learning algorithms. Various deep learning and machine learning models were tested to determine how well LSS could differentiate the underlying tissue and whether it could identify cardiac conduction tissue. Significant differences were observed in connective tissue, muscle tissue, and nuclei density in nodal regions compared to other regions. The machine learning models identified muscle tissue as the strongest predictor of nodal tissue.
LSS shows promise for identifying cardiac conduction tissue and supports previous research in spectroscopy. By continuing to develop the LSS-ML system, we could further enhance tissue classification from optical approaches, which is necessary to advance to clinical trials, ultimately improving patient quality of life and healthcare outcomes for critical procedures.
TABLE OF CONTENTS: ABSTRACT, INTRODUCTION, BACKGROUND, METHODS, RESULTS, DISCUSSION, ACKNOWLEDGEMENTS, REFERENCES
INTRODUCTION
Tissue biopsies from patients are crucial for clinical diagnosis and treatment of many diseases. Despite their clinical utility, extracting tissue samples from patients carries numerous drawbacks, such as tissue damage, pain, delays in diagnosis, and, for some procedures, up to a 20% chance of infection [1]. To address the drawbacks associated with acquiring physical tissue samples, “optical biopsies” are being explored as a new method of tissue analysis. Optical biopsies are a noninvasive method of quantifying tissue characteristics, without damaging the tissue, by measuring tissue-light interactions. Knowing how specific tissue types (e.g., cancer) interact with light compared to normal tissue can facilitate diagnosis without the negative effects of tissue extraction. Examples of these technologies used on cardiac tissue include optical coherence tomography (OCT) and fiber-optic confocal microscopy (FCM) [2]. Although these techniques are promising, they are currently limited by shallow imaging depths and a lack of reliable analytical algorithms.
A potential alternative to these optical imaging methods is light scattering spectroscopy (LSS). This technology relies on the unique scattering signatures of light from different tissues and molecules. As light travels through tissue, different wavelengths are scattered to different degrees depending on the material properties [3].
The properties of the material or tissue can be characterized by recording the amplitudes of the returned light, which are called spectra. This approach has the potential to characterize and quantify the properties of biological tissues without any need for extraction or tissue damage. LSS shows promise due to its significantly lower computational and overhead cost and its ability to probe at a greater depth than other optical methods [4], [5]. LSS is an established method of characterizing tissues in research labs, and it has been shown to detect signs of early cancer formation and metastasis [4], [6], [7]. Binary classification of diseased tissue using LSS is relatively straightforward, but fully interpreting the spectra remains difficult.
New approaches for LSS explore the use of machine learning to interpret the complex data. Our lab has created a custom LSS probe, and by combining it with machine learning (ML) approaches, we created what we call the LSS-ML system. The LSS-ML system has been able to determine key tissue properties such as nuclear density [8] and tissue composition at varying depths [5]. However, clinical applications of the LSS-ML system in the cardiac space remain limited, owing to the relative novelty of this application.
One potential application of the LSS-ML system is during interventional cardiac surgery, to aid in identifying the location of the cardiac conduction system. The cardiac conduction system includes the sinus node, atrioventricular node, bundle of His, and other structures, and it is necessary for the proper beating of the heart. Identification of these tissues is essential for proper His bundle lead placement when implanting a permanent pacemaker and for avoiding damage to the cardiac conduction system during surgical repair of congenital heart defects.
To further develop the LSS-ML system as a method of identifying cardiac conduction tissue, our lab has developed a tissue processing pipeline to create 3D models of cardiac tissue [9], [10]. This model allows tissue composition to be quantified and correlated with the LSS spectra. We hypothesize that by using unsupervised machine learning approaches, we will be able to better understand the effects of tissue on the resulting spectra of the LSS-ML system. Furthermore, neural networks will enable more precise classification of tissue and enable the localization of key cardiac regions. Deep learning approaches with spectroscopy are relatively unexplored [11], but we aim to determine whether deep learning is applicable to identifying nodal tissue using LSS.
This research focuses on determining the efficacy of LSS in differentiating tissue, using heterogeneous samples of human heart tissue and related tissue composition models to quantify the underlying tissue composition. Utilizing a combination of data mining approaches and machine learning models, we intend to show that LSS can be effective at identifying nodal tissue and to better understand how it can be used in specific clinical diagnostic applications.
BACKGROUND
Optical Biopsies:
Optical biopsies are a relatively new field of study and have not yet seen much clinical use as a replacement for traditional tissue biopsies. Despite that, they are growing in popularity in research settings as a promising alternative [7]. Optical biopsies exploit the structural changes in diseased tissues, using optical approaches to detect these differences. Cancer is one example: tumor cells have been shown to have a much higher cell density than surrounding tissue [12]. This difference was shown to be visible in MRI images, so by analogy, optical imaging approaches might also show promise for diagnosing these areas.
The key difference with optical biopsies is that they do not need to produce traditional images like MRI. Instead, these approaches can rely on fiber-optic and catheter-based designs. Optical coherence tomography (OCT) is widely used for eye scans [13]. OCT provides high magnification and has recently advanced considerably thanks to deep learning approaches [14]. It has shown promise and is moving toward clinical trials for the diagnosis of cardiac diseases. OCT has already been demonstrated for guiding procedures such as stent placement [15].
Another major optical approach is fiber optic confocal microscopy (FCM). Confocal microscopy is a type of magnification used in certain microscopes. It utilizes a pinhole to block out-of-focus light. The pinhole can be adjusted to capture light from various depths and construct a shallow 3D image. FCM utilizes a confocal microscopy system based on a catheterized fiber optic cable. FCM has been shown to be effective at locating nodal tissue regions [16], [17]. While clinical trials are underway for both optical approaches, they are severely limited by their shallow depth, which restricts the potential applications of the technology and could reduce its effectiveness for diagnosing various disease states. Due to these limitations, current clinical trials with FCM and OCT technologies may not succeed, since low depth resolution makes them considerably less accurate for many important applications.
A promising alternative approach is light scattering spectroscopy (LSS). One of the primary benefits of light scattering is that it can potentially detect differences in tissue at much greater depths [5]. The greater imaging depth allows it to pick up subtle tissue changes that are not directly on the surface, making it more promising for optical biopsies. The primary drawback, however, is data interpretation. FCM and OCT still produce images that can be visually understood.
Traditional 2D convolutional neural networks (CNNs) can be used to differentiate and classify the images from these approaches. LSS does not produce images. Rather, it produces intensity values based on the backscattered light as a function of the light wavelength. Using intensity values introduces complications such as changes based on the initial light source, the types of spectrometers that detect the backscattered light, and the interpretation of what these intensities mean.
A sample spectrum is shown in Figure 1. The spectrum shows the relative intensity of scattered light recovered (y-axis) at various wavelengths (x-axis). After incident light is scattered, some of it returns to the probe and is detected by the spectrometers. Spectra typically have a shape like that in Figure 1, which resembles the spectrum of the emitted light. To account for this, a calibration spectrum is also gathered, so that the main variations in the spectra reflect tissue properties rather than the emitted spectrum. These spectra can then be interpreted based on the relative intensity or amplitude at various wavelengths, since light scatters differently depending on both the tissue and the wavelength.
Figure 1: A sample spectrum taken from an in vivo study. The intensity was normalized to have an average value of 1.
The main benefit of LSS comes from the scattering differences based on wavelength and the material or tissue encountered. For example, blood primarily absorbs wavelengths around 500 nm and is very reflective at wavelengths around 700 nm (which is the wavelength of red light). However, with heterogenous tissue samples, it becomes very difficult to determine how certain wavelengths will be scattered based on the interacting media. These complex patterns make machine learning a promising approach for interpreting these spectra [5], [8].
Our LSS spectrometer utilizes visible and near-infrared light in the 500 to 1100 nm range.
The spectrometer is connected to a catheterized fiber optic probe that can be used in vivo and threaded through veins. Additionally, the probe interfaces with two spectrometers that record the intensity of the backscattered light. These spectrometers differ slightly but have been shown to provide more information than a single spectrometer [5].
Cardiac Conduction System and Heart Model:
The cardiac conduction system is a network of specialized cells that control the rhythm and rate of the heartbeat. The system initiates the electrical signals at the sinoatrial (SA) node and transmits a delayed signal to the atrioventricular (AV) node. The SA node is the natural pacemaker of the heart and is often the origin of cardiac diseases [18], [19]. Both nodal regions are crucial for proper beating of the heart, and damaged nodes often require an artificial pacemaker to restore a normal heartbeat [20]. However, a common problem in many procedures is identifying the exact location of the nodes. They are buried deep in the tissue and cannot be seen at the surface.
Cardiac tissue consists mainly of muscular, connective, neural, and blood vessel tissue [21]. The nodal region is not a separate type of tissue, but consists of different amounts of primarily muscle, connective, and neural tissue [22]. However, nuclear density also differs quite dramatically within the nodal region compared to its surroundings. The biggest issue with the nodal tissue is its depth, which fluctuates throughout the nodal region of the heart [23]. The complex but important structure of the nodal regions makes them extremely difficult to locate, particularly in surgical situations.
Our use of a 3D heart model addresses many of the biggest concerns with the complex structure of the nodal regions. This model allows us to locate nodal regions and understand their depth, which could be very useful for understanding how the spectra from LSS differ in nodal regions.
Machine Learning:
Machine learning is a tool that can enhance many technologies in biomedical applications and opens the door to more advanced technologies and interpretations of data. In cardiac imaging, machine learning approaches have been shown to provide greater insight into images and can be combined with imaging technologies to make diagnostic approaches highly effective [24].
Light scattering in tissues tends to follow a pattern known as Mie scattering [25]. This type of scattering can be predicted with Monte Carlo simulations. However, these simulations are difficult to use because of the extremely high computational cost of tracking the trajectory of the light. Machine learning has shown promise as an approach for predicting the complex nature of LSS spectra. In our research, two main types of machine learning approaches have been explored: unsupervised learning and neural networks.
Unsupervised learning approaches separate data based on differences that the algorithm itself discerns. This basic analysis could find key differences in the spectra that a human would easily miss. These methods essentially group data together by minimizing a measure of distance within each group. Because spectra are high-dimensional, machine learning approaches are well suited to finding these groupings. They have shown promise in previous spectroscopic applications and identified clear differences based on useful real-world properties [8], [26]. These approaches can therefore provide real-world information about the data.
Supervised ML approaches allow classification of data and can be trained to find specific differences in the data. Many factors can influence the spectra returned from LSS. This makes supervised learning approaches ideal for differentiating tissues, because they can be trained to identify only the differences between tissues and ignore many of the other effects.
Many kinds of supervised learning approaches exist, including artificial neural networks [27], convolutional neural networks [28], XGBoost [29], logistic regression [30], support vector machines [31], and random forest models [32]. Each of these approaches provides unique insight into the data and can perform better or worse depending on the specific task. These are well-known models that have been utilized in many applications, from image classification to disease prediction. Despite limited applications in spectroscopy, supervised learning approaches have been extremely successful at classifying tissues based on specific properties or values [5], [8]. Unsupervised approaches primarily find the largest differences in the data, but supervised approaches can be trained to find specific minute differences that are more clinically significant.
Deep Learning:
Deep learning is a branch of machine learning focused on developing complex but powerful neural networks. Neural networks are powerful because, given sufficient data and a suitable architecture, they can approximate a very wide range of functions [33]. Neural networks are a type of machine learning model inspired by the structure and function of the human brain. They consist of interconnected nodes, or "neurons," that process and transmit information, and they learn by adjusting the strengths of connections between neurons in response to input data. Neural networks are particularly well-suited to tasks involving pattern recognition, classification, and prediction, and have many potential applications in biomedical research, such as predicting disease outcomes, identifying biomarkers, and discovering new drugs [34]. By training on large datasets, neural networks can identify patterns and relationships that are difficult for humans to detect and can be used to model complex biological systems and simulate the effects of drugs and treatments.
Deep learning approaches using neural networks are also a highly flexible branch of machine learning that can be generalized to work in many areas. This flexibility is what makes them effective in very different applications. The architecture of a neural network, which is crucial to its success, refers to its structure: how the neurons are arranged and how they interact with each other. There are many different types of neural network architectures, each with its own strengths and limitations. For example, feed-forward neural networks consist of layers of neurons that process input data in a forward direction [27], while recurrent neural networks have feedback connections that allow them to process sequences of data [35]. Convolutional neural networks are specifically designed for image and video processing [28], while autoencoders are used for unsupervised learning tasks [36]. The flexibility of neural networks comes from the ability to customize their architectures and adapt them to different types of data and tasks. By choosing the appropriate architecture and adjusting its parameters, researchers can optimize the performance of a neural network for a specific application [27].
There has also been a large push for explainability (i.e., the ability of humans to understand model conclusions) in neural networks, which enables people to use the network as more than a black box model [37]. Approaches like SHAP analysis identify which inputs most strongly affect a network’s prediction [38]. This is particularly useful in biomedical research, where clinicians are often hesitant to accept diagnoses or suggested courses of action from a black box model, especially one that can be confidently incorrect.
For our project, we seek to harness this ability of neural networks to classify our spectra and to use interpretability approaches to understand the key wavelengths used for prediction. This will allow us to assess the capabilities of the system and also identify ways to improve it in the future.
METHODS
A customized catheterized LSS setup [8] was used in conjunction with a broad-spectrum light source ranging from 500 to 1100 nm (Berkshire Photonics, Washington Depot, USA) and two independent spectrometers (Thorlabs, Newton, USA). A single 200 μm illumination fiber was used for incident light application and was surrounded by two 100 μm diameter optical probes that collected and transmitted the light to the spectrometers.
The LSS system gathered spectral data from tissue regions excised from 20 healthy, pediatric human donor hearts. All tissue regions contained nodal tissue. All samples were from infants less than 20 days old, with an equal split of males and females. Each sample had a total volume of approximately 1 cm³. Each sample was placed and pinned onto black open-pore foam to mitigate light artifacts. The samples were then placed in a well plate and submerged under 10% phosphate-buffered solution to reduce reflections from the surface of the wet tissue sample. The LSS probe was then placed in a fixture and moved to predefined positions on the tissue samples using a 3-axis micromanipulator (MP-285, Sutter Instrument Company, Novato, USA). Prior to gathering the spectra from a sample, a standard reference white surface (Spectralon®, Labsphere, Inc., North Sutton, NH, USA) was used to gather calibration spectra.
A semi-automated approach was used to gather the LSS spectra from the surface of the tissue samples, guided by a technician who verified that the probe contacted the surface of the tissue. Fiducial markers were created in the tissue to ensure proper alignment of a later 3D reconstruction with the LSS data.
The LSS probe was placed at each of these fiducial markers, and the 3D location was recorded three times before and after data acquisition. Then the software instructed the micromanipulator to advance the probe in a 200 μm grid pattern across the tissue surface. The technician verified tissue contact with the probe and ensured the probe was normal to the tissue surface. After the pattern and probe orientation were verified, the software began automatic data acquisition. The probe gathered spectral data and location information at each point across the surface of the tissue. At each location, 300 spectral samples were gathered and averaged. Data were gathered across the entire tissue surface.
The tissue samples were then prepared for sectioning and staining by marking the key regions and indicating the direction in which to cut and stain the tissue, following a previous approach [9]. The sectioning and staining were performed by a third party (ARUP Laboratories, Salt Lake City, USA). Samples containing the sinus node were sectioned posterior to the crest of the right atriocaval junction and along the crista terminalis in the anterior direction. The atrioventricular nodal samples were sectioned proximal to the coronary sinus, towards the membranous septum. The tissue sections were 4 μm thick and an average of 25 μm apart, and were stained using an automated Masson’s trichrome staining device (Dako Corporation, Carpinteria, USA). The individual stained sections were imaged at 10x magnification with an automated slide scanner (Axioscan, ZEISS Group, Germany).
The images were segmented semi-automatically using custom software written with the ImageJ/FIJI packages [39]. To segment myocardium, connective tissues, and cell nuclei, the ilastik software package was used [40].
A subset of the images was used to train a fully convolutional network with a modified U-Net architecture [41], which was subsequently used to segment vasculature and neural tissues in the entire dataset. The segmentation results were inspected and verified by a third-party pathologist to ensure that they were biologically feasible. The individual pixels in each image were classified as either blood vessels, myocardium, connective tissue, neural tissue, or nodal tissue.
After segmentation, the tissue sections were manually registered and aligned using custom software written in MATLAB. With the complete model of the heart, tissue was quantified beneath every LSS point, and the locations on the tissue were linked to the LSS data. Underneath each LSS point, the number of pixels of each of the five tissue types was quantified. However, due to the scattering nature of light, not only the tissue directly below the probe was examined, but also the tissue to the sides of the probe, particularly as the light penetrated deeper into the tissue. The tissue region that LSS interacted with was approximated as a frustum with an angle of 10 degrees. This approximation better captured the tissue interacting with the incident light from the probe, and the optimal parameters were determined accordingly.
The data were standardized by calibrating with the calibration spectrum for each study, subtracting the mean of all spectra from the tissue sample from each spectrum, and dividing by the standard deviation of all spectra from that heart sample. This standardization approach for machine learning is common for spectra, especially across multiple samples or spectrometers [42], [43].
To calculate the volume fraction of nodal tissue, the number of pixels classified as nodal by the segmentation model was divided by the total number of pixels.
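The standardization and volume fraction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis code: the function names are hypothetical, calibration is assumed to be a per-wavelength division by the white-reference spectrum, and the mean and standard deviation are assumed to be taken per wavelength over all spectra from one tissue sample.

```python
import numpy as np


def standardize_sample_spectra(raw, calibration):
    """Calibrate and standardize all spectra from one tissue sample.

    raw:         (n_points, n_wavelengths) backscattered intensities
    calibration: (n_wavelengths,) white-reference spectrum for that study
    """
    raw = np.asarray(raw, dtype=float)
    calibration = np.asarray(calibration, dtype=float)
    # Assumption: calibration divides out the emitted-light spectrum.
    calibrated = raw / calibration
    # Assumption: the mean and standard deviation are computed per wavelength
    # across every spectrum gathered from this sample.
    mean_spectrum = calibrated.mean(axis=0)
    std_spectrum = calibrated.std(axis=0)
    return (calibrated - mean_spectrum) / std_spectrum


def volume_fraction(pixel_counts, tissue):
    """Fraction of pixels under one LSS point that belong to `tissue`."""
    total = sum(pixel_counts.values())
    return pixel_counts[tissue] / total if total else 0.0
```

Standardizing each sample separately keeps differences between hearts or spectrometers from dominating the variation that the models later see.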
The same approach was used to find the muscle and connective tissue volume fractions. Using these data, each LSS point was classified as nodal or non-nodal. The classification threshold was set at the 10th percentile of nodal volume fraction among the LSS points that contained a non-zero amount of nodal tissue.
Principal component analysis (PCA) was performed to reduce the dimensionality of the LSS data, and Uniform Manifold Approximation and Projection (UMAP) was used as an alternative approach for dimensionality reduction [44]. Both dimensionality reduction approaches were plotted with 2 components to visualize any existing differences in the data. Clustering approaches were used on the full LSS spectra, and the PCA and UMAP plots were colored according to the nodal class and the clusters.
Statistical tests were performed to assess whether the nodal regions differed from the surrounding regions in terms of muscle and connective tissue composition, as well as nuclear density. The 10th percentile classification was used to differentiate the two classes. The frustums under each LSS location were examined to determine whether the nodal volume fraction was above or below the threshold. For each of these points, the muscle tissue volume fraction, connective tissue volume fraction, and nuclear density were calculated, following the same method used for the nodal volume fractions. The volume fractions of the two groups were then compared using a t-test with a significance level of 0.05. A t-test was deemed appropriate based on a Shapiro-Wilk test for normality [45]. The t-test utilized all 22,049 LSS data points, with 5,081 points in the nodal group and the remainder in the non-nodal group. This test was performed in SciPy without assuming equal variances, which corresponds to Welch’s t-test [46].
Another statistical test utilized three new groups. These groups consisted of primarily muscle regions, primarily connective regions, or a mix of the two.
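The percentile-based labeling and the two-group comparison described in this section can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and treating points above the threshold as nodal is an assumption, since the text states only that points were compared against the threshold. The call `scipy.stats.ttest_ind(..., equal_var=False)` is SciPy's form of Welch's t-test.

```python
import numpy as np
from scipy import stats


def nodal_labels(nodal_fraction, percentile=10):
    """Return (is_nodal, threshold) for an array of nodal volume fractions.

    The threshold is the given percentile computed only over points that
    contain a non-zero amount of nodal tissue.
    """
    nodal_fraction = np.asarray(nodal_fraction, dtype=float)
    nonzero = nodal_fraction[nodal_fraction > 0]
    threshold = np.percentile(nonzero, percentile)
    # Assumption: points above the threshold form the nodal class.
    return nodal_fraction > threshold, threshold


def compare_groups(values, is_nodal, alpha=0.05):
    """Welch's t-test between nodal and non-nodal values of one property."""
    values = np.asarray(values, dtype=float)
    result = stats.ttest_ind(values[is_nodal], values[~is_nodal],
                             equal_var=False)
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)
```

The same `compare_groups` call would be repeated for muscle volume fraction, connective volume fraction, and nuclear density.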
A region was considered primarily muscle or connective if its respective volume fraction was above 0.7. This threshold allowed determination of differences due to the major tissue types. A Tukey-Kramer multiple comparison test was used to determine whether the groups differed statistically. The same assumptions were made about the data as for the t-test, and the significance level was set to 0.05.
Machine learning approaches were employed to classify or differentiate nodal regions, using volume fractions that were calculated identically to the method used in the t-test. The volume fractions were used as labels, while the full spectra, the first 5 principal components, and the UMAP indices of the LSS spectra were used as the data for the models. A regression ensemble neural network was created using the TensorFlow library. In this ensemble approach, the full spectra were input to one network and the five principal components were input to a separate network. Batch normalization, dropout, and ReLU activation functions were used, and the outputs of these networks were concatenated and input into a final network, again using batch normalization and dropout layers at the end. The hyperparameter tuning library Optuna [26] was utilized to optimize the learning rate, dropout, and number of epochs, and 23,901 samples were used with leave-one-out cross-validation (LOOCV) for training. Separate networks were trained on the muscle volume fraction, connective volume fraction, and nuclear density. The model was then trained, and the R² and mean absolute error (MAE) were calculated for each of the folds.
To classify nodal tissues, the 10th percentile was again used to generate the labels. Support vector machines (SVM) and a similar neural network model were used, but for classification. The SVM models were created in scikit-learn and again used Optuna for hyperparameter optimization.
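The cross-validation folds mentioned above could be generated along the following lines. The text does not state whether each fold holds out a single spectrum or a single heart; because R² and MAE are reported per fold, this sketch assumes one donor heart is held out per fold. The function name and the `heart_ids` array are hypothetical.

```python
import numpy as np


def leave_one_heart_out(heart_ids):
    """Yield (held_out_heart, train_idx, test_idx), one fold per heart.

    heart_ids: array with one entry per spectrum, naming its donor heart.
    Assumption: a fold holds out every spectrum from one heart.
    """
    heart_ids = np.asarray(heart_ids)
    for heart in np.unique(heart_ids):
        test_mask = heart_ids == heart
        yield heart, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)
```

Holding out whole hearts keeps spectra from the same donor out of both the training and validation sets of a fold, which avoids an overly optimistic validation score.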
The optimized hyperparameters were the kernel, the regularization parameter C, and gamma, which controls model complexity. To judge the effectiveness of these models, recall scores were calculated, since misclassifying a nodal sample as non-nodal is significantly more dangerous than the reverse.
For the classification network, the same architecture was used, but the output was the prediction of the nodal class, defined as the nodal volume fraction being above or below the 10th percentile. The same training approach was used, but in addition to the same three hyperparameters, a class weight parameter was optimized. This class weight placed more emphasis on the nodal class, penalizing false negatives more heavily than false positives. For reproducibility, all machine learning approaches used set random states and a set random seed.
RESULTS
By utilizing the network for segmentation, together with manual registration, a successful heart model was created that quantified the volume fraction of muscle and connective tissue every 25 μm. A third-party pathologist verified the results, saying they seemed plausible and expected.
Utilizing the heart model data, it was found that nodal regions differed statistically from their non-nodal counterparts in terms of muscle tissue (p = 4.39 × 10^-118), connective tissue (p = 6.44 × 10^-87), and nuclei density (p = 8.52 × 10^-241). In addition, these tests showed that muscle and connective volume fractions are elevated outside the node. Figure 2 shows a boxplot of the data and summarizes these results.
Figure 2: Results of the statistical analysis. Tissue composition data is unitless and is essentially the percentage of the total volume. Each tissue type was compared between nodal and non-nodal regions, and each difference was statistically significant.
The first 5 principal components accounted for 99.89% of the variance in the data, which was deemed sufficient for use in the machine learning models. The labels were created using the volume fraction information as described in the methods. The principal components and UMAP indices were used to visualize the data and were colored based on the 10th percentile of nodal volume fractions to see whether any differences were apparent, as shown in Figure 3. These results do not show large amounts of separation, and the classes are very difficult to differentiate.
The clustering approaches separated the data into two groups, which are shown on the UMAP and PCA plots in Figure 4. Using a similar t-test with the volume fractions in the two groups as before, the differences were also statistically significant for muscle tissue (p = 2.88 × 10^-300), connective tissue (p = 2.37 × 10^-45), and nuclear density (p = 5.64 × 10^-270).
For the final comparison using the three classes of primarily muscle, primarily connective, or a mix of the two, the Tukey-Kramer multiple comparison test found that all groups differed statistically from each other. Figure 5 shows the groups with their adjusted 95% confidence intervals.
Figure 3: The first two dimensions of the UMAP and PCA plots created from the LSS spectra, colored by the region the points come from. No clear separations are shown, indicating that nodal information is not simply contained within the first 2 dimensions. This emphasizes the need for more data in the analysis.
Figure 4: The first two dimensions of the UMAP and PCA plots created from the LSS spectra, colored by the clusters derived from k-means clustering. These plots reveal clear separations between the clusters, indicating that the computer can predict volume fraction information accurately.
Figure 5: The results of the Tukey-Kramer multiple comparison test. Each group’s mean is shown along with the adjusted 95% confidence interval.
Muscle varied significantly more in terms of the nodal composition than the connective or mixed tissue regions.
The large statistical significance in both the classes and the clustering showed promise for the machine learning models. The results of the regression models are summarized in Table 1, and an example regression plot is shown in Figure 6. The first model focused on predicting the muscle volume fraction. After hyperparameter tuning, the best parameters were a dropout of 0.3, a learning rate of 0.16, 30 epochs, and a class weight of 2.97. Using LOOCV, the average R² was 0.73 with a maximum of 0.81. This corresponded with a mean MAE of 0.181 and a best of 0.170. The MAE is dimensionless because the predictions are volume fractions. The next two models focused on the connective tissue volume fraction and nuclear density. For the connective tissue model, the optimal hyperparameters were a dropout of 0.5, a learning rate of 0.0168, 70 epochs, and a class weight of 4.34. This model had an average R² of 0.23 with a maximum of 0.51. This corresponded with a mean MAE of 0.582 and a best of 0.235. For the nuclear density network, the best hyperparameters were a dropout of 0.3, a learning rate of 0.0230, 50 epochs, and a class weight of 4.55. This model had an average R² of 0.56 with a maximum of 0.77. This corresponded with a mean MAE of 0.181 and an average of 0.170.
Table 1: Summary of the regression model results, showing the validation R² and MAE (average and best values) and the R² and MAE on the test set.
Network type	Validation R²	Test R²	Validation MAE	Test MAE
Connective tissue	0.73 (best 0.81)	0.66	0.181 (best 0.170)	0.216
Muscle tissue	0.23 (best 0.51)	0.17	0.582 (best 0.235)	0.513
Nuclear density	0.56 (best 0.77)	0.49	0.170 (best 0.181)	0.167
Figure 6: An example of the regression model results for predicting muscle tissue.
Other plots looked similar with a large R² value, indicating a strong correlation. The line in red is the ground truth, while the scattered points are the predictions.
The final models focused on predicting LSS points as nodal or not. The SVM models found optimal hyperparameters of a radial basis function kernel, a regularization parameter C of 15,000, and a gamma of 20. This model had an average AUC of 0.54 with a maximum of 0.74. However, the recall score had an average of 0.63 with a maximum of 0.78. An example confusion matrix is shown in Figure 7. For the neural network, the optimal parameters were a dropout of 0.3, a learning rate of 0.000144, 80 epochs, and a class weight of 4.4. The network had an average AUC of 0.57 with a best of 0.74. The mean recall was 0.72 with a best of 0.98. The results of both classifiers are summarized in Table 2.
Table 2: Summary of the nodal classifier models, with the recall and AUC values for the validation and test sets.
Model	Validation recall	Test recall	Validation AUC	Test AUC
SVM	0.63 (best 0.78)	0.68	0.548 (best 0.741)	0.51
TensorFlow model	0.72 (best 0.98)	0.94	0.571 (best 0.744)	0.46
Figure 7: An example confusion matrix for the SVM classification model classifying LSS spectra as nodal. Other confusion matrices looked similar. Values on the diagonal are correct predictions, while off-diagonal values are incorrect.
DISCUSSION
This study aimed to determine if light scattering spectroscopy (LSS) could be used to identify tissues based on their underlying tissue composition and potentially identify cardiac conduction regions. We developed a 3D heart model to correlate LSS spectra with underlying heart tissue composition. Using the model, we assessed whether nodal regions statistically varied in tissue composition compared to non-nodal ones and employed various machine learning approaches to predict tissue properties from the spectra.
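The recall and AUC values in Table 2 can be related to confusion-matrix counts such as those in Figure 7. The sketch below, written in Python with hypothetical labels and scores (1 = nodal), shows one way these two metrics could be computed. It is an illustration under those assumptions, not the code used in the study.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive (nodal) samples that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive samples in y_true")
    return tp / (tp + fn)


def auc(y_true, scores, positive=1):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = nodal), classifier scores, and thresholded predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

nodal_recall = recall(y_true, y_pred)
nodal_auc = auc(y_true, scores)
print(nodal_recall, nodal_auc)
```

In this toy example two of the three nodal samples are recovered, so recall is 2/3, while the AUC is 0.8. The two metrics answer different questions: recall depends on the chosen decision threshold, whereas AUC summarizes ranking quality over all thresholds, so a classifier tuned for recall can still have an AUC near 0.5.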
Overall, the findings of the study help clarify how LSS spectra change with underlying tissue composition and suggest LSS’s potential for identifying nodal tissue regions.
The statistical approaches allowed us to ascertain whether nodal regions differed significantly in nuclear density and in the other tissue types. Figure 2 highlights those differences. The very small p-values underscore the differences between the nodal and non-nodal regions, although they are in part due to the large sample size of the experiment. These results are in line with other literature, as previous research indicates that the tissue around the node and cardiac conduction system differs from other regions of the heart [47]. Moreover, the observed differences in clusters shown in Figure 4 suggest that the spectra contain relevant information about underlying composition. This aligns with prior studies using LSS, which highlighted its ability to distinguish different tissues at various depths [2], [5], [8]. These results are promising, as the nodal tissue regions primarily consist of connective and muscle tissue. Verifying the LSS-ML system’s ability to differentiate regions based on the presence of nodal tissue therefore allowed further analysis, and the significance of the statistical analysis in Figures 2–5 suggested that there is adequate information in the spectra.
Principal component analysis (PCA) and UMAP, shown in Figures 3 and 4, help us both visualize the separation and understand where most of the separation lies. That the first principal component contains most of the variance is expected, even after calibration. This is simply due to the general shape and nature of the spectra and has also been seen in previous literature [8] as well as in previous studies conducted by the lab. Classifying areas as primarily muscle, connective, or mixed tissue revealed less significant results compared to previous analyses.
This suggests that knowing a region's nodal volume fraction provides little information about the rest of the tissue. Additionally, merely knowing muscle and connective volume fractions does not easily indicate nodal tissue presence. This makes sense in the context of the heart, which is heterogeneous and contains significantly varying regions. The node's relatively minor presence implies that more information could potentially be gained by considering tissue depth.
Several machine learning models were used to predict tissue composition and nuclear density and to classify nodal regions. The regression networks, with R² often above 0.5, demonstrate the LSS-ML system’s potential for differentiating heterogenous tissues. The muscle regression network yielded the best results, which is logical since muscle tissues are potent light scatterers [48]. This is also particularly important because Figure 5 indicated that muscle tissue is the strongest predictor of nodal regions. The connective tissue network's moderate performance was expected. However, the network's poor performance in predicting nuclei density was surprising and may be due to the network architecture or other hyperparameters.
The final machine learning models were used for nodal classification. The SVM performed surprisingly well, with a high recall and an AUC above 0.5. The neural network, however, did not perform as well. This could again be due to an architecture issue, or it could be that neural networks require much more data than other machine learning models to perform well [49]. The use of ensemble networks combining both the original data and principal components was not expected to work particularly well. However, these results were promising and are consistent with literature in other fields [50]. These new complex models can be more challenging to tune but allow for stronger interpretations by the model.
The success of all three approaches underscores the importance of machine learning in differentiating the spectra produced by LSS. They also show promise in LSS’ ability to identify cardiac conduction tissue if it can differentiate muscle and connective regions. In previous literature, LSS was shown to be able to differentiate between homogeneous tissue regions of varying nuclear density or tissue type [5], [8]. Deep learning is still not well explored in spectroscopy [11], and this study helps shed light on potential architectures and applications of it. Overall, this study indicates the promise of LSS to non-destructively characterize heterogeneous tissues and locate key regions in the tissue.
Despite the success of the models, this study has many limitations that warrant further investigation before the LSS-ML system can succeed. The creation of labels was one of the biggest questions regarding the machine learning models. The data are unique in that they were derived from a novel 3D heart model, but there were significant questions about how to properly use the model to classify data. The use of a frustum was logical, but many questions remain about how we could incorporate the depth of the different tissues. There is significant literature on light scattering at depth [51]–[53], but very little that concerns complex heterogenous tissues. A potential approach would be the use of Monte Carlo simulations. The heart model provides useful and complex information on the underlying tissues, so Monte Carlo simulations could help explain the nature of the scattering, especially as it pertains to depth. For example, a sample that is primarily connective on top and muscle at the bottom will likely experience more scattering from the connective tissue, but this information is lost when only volume fractions are used. This is further complicated by the fact that the effect of depth depends on the wavelength.
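To make the point about lost depth information concrete, the following Python sketch compares two hypothetical tissue columns that have identical volume fractions but opposite layer order. It is a toy illustration, not a Monte Carlo simulation and not the labeling method used in this study: the exponential depth weighting, the `decay_um` value, and the function names are all assumptions chosen only for illustration. The 25 μm layer thickness mirrors the resolution of the heart model.

```python
import math

LAYER_THICKNESS_UM = 25.0  # the heart model reports composition every 25 μm


def volume_fraction(layers, tissue):
    """Unweighted fraction of layers labeled `tissue` (index 0 = surface)."""
    return sum(1 for t in layers if t == tissue) / len(layers)


def depth_weighted_fraction(layers, tissue, decay_um):
    """Fraction of `tissue` with each layer weighted by exp(-depth / decay_um).

    The exponential weight and `decay_um` are illustrative assumptions,
    not measured optical properties.
    """
    weights = [
        math.exp(-(i * LAYER_THICKNESS_UM) / decay_um) for i in range(len(layers))
    ]
    hit = sum(w for w, t in zip(weights, layers) if t == tissue)
    return hit / sum(weights)


# Two hypothetical 8-layer columns: same composition, opposite ordering.
connective_on_top = ["connective"] * 4 + ["muscle"] * 4
muscle_on_top = ["muscle"] * 4 + ["connective"] * 4

plain_a = volume_fraction(connective_on_top, "connective")
plain_b = volume_fraction(muscle_on_top, "connective")
weighted_a = depth_weighted_fraction(connective_on_top, "connective", decay_um=100.0)
weighted_b = depth_weighted_fraction(muscle_on_top, "connective", decay_um=100.0)

print(plain_a, plain_b)        # identical plain volume fractions
print(weighted_a, weighted_b)  # differ once depth is taken into account
```

Both columns have a plain connective volume fraction of 0.5, yet the depth-weighted values differ because the near-surface layers dominate. Under the same assumptions, using a different `decay_um` for each wavelength would be one simple way to represent the wavelength dependence mentioned above.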
Of course, this was not the main purpose of our study, so a study focused on identifying these optimal depths or on using Monte Carlo simulations could be more helpful in identifying the nature of light scattering. Our purpose was simply to identify useful depths for relating tissue composition to LSS spectra. Perhaps even more important than the classifications is the understanding of LSS based on the underlying tissue. Further research into explainable deep learning models with LSS could allow for a deeper understanding of the model predictions so that results could be improved. In addition, clinical translation of machine learning is often challenging, especially without strong interpretable models [54]–[56]. Creating strong and interpretable ML models may be essential for the future success of the LSS-ML system.
The biggest limitation of our study is the relatively low accuracy of the classification models. We focused on creating models with a high recall score, but accuracy often suffered as a result. This study aimed to develop the LSS-ML system to determine how it could be used in vivo for identifying the cardiac conduction system. However, the results for now are relatively underwhelming and sometimes suggest the ML models are not much better than chance. While the results are still promising for this pilot study given the novel models and data used, significant improvements must be made before LSS can properly replace other biopsy approaches. The limited performance is likely due to our still limited sample size, even with cross-validation. Additionally, this study does not address the effects of age or disease, which could be important factors that our model did not account for. Such findings would be crucial for clinical applications of LSS. Future larger-scale studies could glean significantly more insight into LSS and its role in clinical applications for node identification.
Additionally, identifying novel deep learning architectures could allow those studies to better distinguish tissues and properly classify the cardiac conduction system.
Despite the limitations, this study opens the door for continued research in LSS. Demonstrating that LSS can be used in heterogeneous tissue samples advances this technology toward use as an optical biopsy approach. Applications such as reducing the risk of infection and helping locate nodal tissue for cardiac pacemakers require LSS to differentiate heterogenous cardiac tissue, and our study shows great promise in its ability to do so. However, further work enhancing the accuracy of the system is necessary before clinical trials can occur. In the future, development of this technology could allow for greater characterization of disease and the elimination of harmful invasive biopsy procedures. In addition, LSS shows promise for providing real-time clinical insights to aid in many surgical applications and for improving overall patient quality of life.
ACKNOWLEDGEMENTS
This work was performed in Dr. Robert Hitchcock’s lab under the guidance of Brian Cottle at the University of Utah. Dr. Frank Sachse also contributed to the creation of this project. This work was supported by the grant NIH-R01 HL135077.
REFERENCES
[1] S. Loeb et al., “Systematic Review of Complications of Prostate Biopsy,” Eur. Urol., vol. 64, no. 6, pp. 876–892, Dec. 2013, doi: 10.1016/j.eururo.2013.05.049.
[2] C. Huang et al., “Towards Automated Quantification of Atrial Fibrosis in Images from Catheterized Fiber-Optics Confocal Microscopy Using Convolutional Neural Networks,” in Functional Imaging and Modeling of the Heart, Y. Coudière, V. Ozenne, E. Vigmond, and N. Zemzemi, Eds., Cham: Springer International Publishing, 2019, pp. 168–176.
[3] V. Tuchin, Tissue Optics. 1000 20th Street, Bellingham, WA 98227-0010 USA: SPIE, 2007. doi: 10.1117/3.684093.
[4] L.
Zhang et al., “Light scattering spectroscopy identifies the malignant potential of pancreatic cysts during endoscopy,” Nat. Biomed. Eng., vol. 1, no. 4, p. 0040, Mar. 2017, doi: 10.1038/s41551-017-0040.
[5] N. J. Knighton, B. K. Cottle, B. E. B. Kelson, R. W. Hitchcock, and F. B. Sachse, “Towards Intraoperative Quantification of Atrial Fibrosis Using Light-Scattering Spectroscopy and Convolutional Neural Networks,” Sensors, vol. 21, no. 18, Art. no. 18, Jan. 2021, doi: 10.3390/s21186033.
[6] T. D. Wang and J. Van Dam, “Optical Biopsy: A New Frontier in Endoscopic Detection and Diagnosis,” Clin. Gastroenterol. Hepatol. Off. Clin. Pract. J. Am. Gastroenterol. Assoc., vol. 2, no. 9, pp. 744–753, Sep. 2004, doi: 10.1053/S15423565(04)00345-3.
[7] H. K. Roy et al., “Four-dimensional elastic light-scattering fingerprints as preneoplastic markers in the rat model of colon carcinogenesis,” Gastroenterology, vol. 126, no. 4, pp. 1071–1081, Apr. 2004, doi: 10.1053/j.gastro.2004.01.009.
[8] N. J. Knighton et al., “Toward cardiac tissue characterization using machine learning and light-scattering spectroscopy,” J. Biomed. Opt., vol. 26, no. 11, pp. 1–15, Nov. 2021, doi: 10.1117/1.JBO.26.11.116001.
[9] J. K. Johnson, B. K. Cottle, A. Mondal, R. Hitchcock, A. K. Kaza, and F. B. Sachse, “Localization of the sinoatrial and atrioventricular nodal region in neonatal and juvenile ovine hearts,” PLOS ONE, vol. 15, no. 5, p. e0232618, May 2020, doi: 10.1371/journal.pone.0232618.
[10] N. Chandler et al., “Computer Three-Dimensional Anatomical Reconstruction of the Human Sinus Node and a Novel Paranodal Area,” Anat. Rec., vol. 294, no. 6, pp. 970–979, 2011, doi: 10.1002/ar.21379.
[11] M. Kinoshita et al., “Fractional anisotropy and tumor cell density of the tumor core show positive correlation in diffusion tensor magnetic resonance imaging of malignant brain tumors,” NeuroImage, vol. 43, no. 1, pp. 29–35, Oct. 2008, doi: 10.1016/j.neuroimage.2008.06.041.
[12] “Optical coherence tomography - PODOLEANU - 2012 - Journal of Microscopy - Wiley Online Library.” https://onlinelibrary.wiley.com/doi/full/10.1111/j.13652818.2012.03619.x (accessed Nov. 28, 2022).
[13] G. Lazaridis, M. Lorenzi, S. Ourselin, and D. Garway-Heath, “Enhancing OCT Signal by Fusion of GANs: Improving Statistical Power of Glaucoma Clinical Trials,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, and A. Khan, Eds., in Lecture Notes in Computer Science. Cham: Springer International Publishing, 2019, pp. 3–11. doi: 10.1007/978-3-030-32239-7_1.
[14] E. McGovern et al., “Optical Coherence Tomography for the Early Detection of Coronary Vascular Changes in Children and Adolescents After Cardiac Transplantation: Findings From the International Pediatric OCT Registry,” JACC Cardiovasc. Imaging, vol. 12, no. 12, pp. 2492–2501, Dec. 2019, doi: 10.1016/j.jcmg.2018.04.025.
[15] A. K. Kaza, A. Mondal, B. Piekarski, F. B. Sachse, and R. Hitchcock, “Intraoperative localization of cardiac conduction tissue regions using real-time fibreoptic confocal microscopy: first in human trial,” Eur. J. Cardiothorac. Surg., vol. 58, no. 2, pp. 261–268, Aug. 2020, doi: 10.1093/ejcts/ezaa040.
[16] A. Mondal et al., “An Imaging Protocol to Discriminate Specialized Conduction Tissue During Congenital Heart Surgery,” Semin. Thorac. Cardiovasc. Surg., vol. 31, no. 3, pp. 537–546, Sep. 2019, doi: 10.1053/j.semtcvs.2019.02.006.
[17] K. Hu, Y. Qu, Y. Yue, and M. Boutjdir, “Functional Basis of Sinus Bradycardia in Congenital Heart Block,” Circ. Res., vol. 94, no. 4, pp. e32–e38, Mar. 2004, doi: 10.1161/01.RES.0000121566.01778.06.
[18] H. F. Brown, “Electrophysiology of the sinoatrial node,” Physiol. Rev., vol. 62, no. 2, pp. 505–530, Apr. 1982, doi: 10.1152/physrev.1982.62.2.505.
[19] J. C. Callaghan and W. G. Bigelow, “An Electrical Artificial Pacemaker for Standstill of the Heart,” Ann. Surg., vol. 134, no. 1, pp. 8–17, Jul. 1951.
[20] G. Vunjak-Novakovic et al., “Challenges in Cardiac Tissue Engineering,” Tissue Eng. Part B Rev., vol. 16, no. 2, pp. 169–187, Apr. 2010, doi: 10.1089/ten.teb.2009.0352.
[21] M. R. Boyett, H. Honjo, and I. Kodama, “The sinoatrial node, a heterogeneous pacemaker structure,” Cardiovasc. Res., vol. 47, no. 4, pp. 658–687, Sep. 2000, doi: 10.1016/S0008-6363(00)00135-8.
[22] “Innervation and Neuronal Control of the Mammalian Sinoatrial Node a Comprehensive Atlas | Circulation Research.” https://www.ahajournals.org/doi/full/10.1161/CIRCRESAHA.120.318458 (accessed Apr. 23, 2023).
[23] M. Henglin, G. Stein, P. V. Hushcha, J. Snoek, A. B. Wiltschko, and S. Cheng, “Machine Learning Approaches in Cardiovascular Imaging,” Circ. Cardiovasc. Imaging, vol. 10, no. 10, p. e005614, Oct. 2017, doi: 10.1161/CIRCIMAGING.117.005614.
[24] V. V. Tuchin, “Light scattering study of tissues,” Phys.-Uspekhi, vol. 40, no. 5, p. 495, May 1997, doi: 10.1070/PU1997v040n05ABEH000236.
[25] S. Guan, H. Asfour, N. Sarvazyan, and M. Loew, “Application of unsupervised learning to hyperspectral imaging of cardiac ablation lesions,” J. Med. Imaging, vol. 5, no. 4, p. 046003, Dec. 2018, doi: 10.1117/1.JMI.5.4.046003.
[26] D. Perekrestenko, P. Grohs, D. Elbrächter, and H. Bölcskei, “The universal approximation power of finite-width deep ReLU networks.” arXiv, Jun. 05, 2018. doi: 10.48550/arXiv.1806.01528.
[27] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[28] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, “Explaining Explanations: An Overview of Interpretability of Machine Learning.” arXiv, Feb. 03, 2019. doi: 10.48550/arXiv.1806.00069.
[29] “Fiji,” ImageJ Wiki. https://imagej.github.io/software/fiji/index (accessed Dec. 15, 2022).
[30] S. Berg et al., “ilastik: interactive machine learning for (bio)image analysis,” Nat. Methods, vol. 16, no. 12, Art. no. 12, Dec. 2019, doi: 10.1038/s41592-019-0582-9.
[31] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation.” arXiv, May 18, 2015. doi: 10.48550/arXiv.1505.04597.
[32] R. H. Anderson, J. Yanni, M. R. Boyett, N. J. Chandler, and H. Dobrzynski, “The anatomy of the cardiac conduction system,” Clin. Anat., vol. 22, no. 1, pp. 99–113, 2009, doi: 10.1002/ca.20700.
[33] A. H. Gandjbakhche, R. F. Bonner, A. E. Arai, and R. S. Balaban, “Visible-light photon migration through myocardium in vivo,” Am. J. Physiol.-Heart Circ. Physiol., vol. 277, no. 2, pp. H698–H704, Aug. 1999, doi: 10.1152/ajpheart.1999.277.2.H698.
[34] X. Ying, “An Overview of Overfitting and its Solutions,” J. Phys. Conf. Ser., vol. 1168, no. 2, p. 022022, Feb. 2019, doi: 10.1088/1742-6596/1168/2/022022.
[35] X. Yan, S. Hu, Y. Mao, Y. Ye, and H. Yu, “Deep multi-view learning methods: A review,” Neurocomputing, vol. 448, pp. 106–129, Aug. 2021, doi: 10.1016/j.neucom.2021.03.090.
[36] O. Bohdal, Y. Yang, and T. Hospedales, “Flexible Dataset Distillation: Learn Labels Instead of Images.” arXiv, Dec. 12, 2020. doi: 10.48550/arXiv.2006.08572.
[37] V. Grossmann, L. Schmarje, and R. Koch, “Beyond Hard Labels: Investigating data label distributions.” arXiv, Oct. 06, 2022. doi: 10.48550/arXiv.2207.06224.
[38] F. Martelli, T. Binzoni, A. Pifferi, L. Spinelli, A. Farina, and A. Torricelli, “There’s plenty of light at the bottom: statistics of photon penetration depth in random media,” Sci. Rep., vol. 6, no. 1, Art. no. 1, Jun. 2016, doi: 10.1038/srep27057.
[39] S. Wongvibulsin, K. C. Wu, and S. L. Zeger, “Improving Clinical Translation of Machine Learning Approaches Through Clinician-Tailored Visual Displays of Black Box Algorithms: Development and Validation,” JMIR Med. Inform., vol. 8, no. 6, p. e15791, Jun. 2020, doi: 10.2196/15791.
[40] N. K. Dinsdale, E. Bluemke, V. Sundaresan, M. Jenkinson, S. M. Smith, and A. I. L. Namburete, “Challenges for machine learning in clinical translation of big data imaging studies,” Neuron, vol. 110, no. 23, pp. 3866–3881, Dec. 2022, doi: 10.1016/j.neuron.2022.09.012.
[41] A. Mechelli and S. Vieira, “From models to tools: clinical translation of machine learning studies in psychosis,” Npj Schizophr., vol. 6, no. 1, Art. no. 1, Feb. 2020, doi: 10.1038/s41537-020-0094-8.
Name of Candidate: Sarthak Tiwari
Date of Submission: May 15, 2023
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6h4j7nx