Comparative spectral decompositions for predicting the clinical outcome of astrocytoma | Institutional Repository | J. Willard Marriott Digital Library

Title	Comparative spectral decompositions for predicting the clinical outcome of astrocytoma
Publication Type	dissertation
School or College	College of Engineering
Department	Biomedical Engineering
Author	Aiello, Katherine Ann
Date	2019
Description	As personalized medicine is integrated into clinical practice for the treatment of cancer, patient care will be centered around new methods of tumor diagnosis that are predictive of an individual patient's outcome based on a tumor's biology. Rather than prognosticating a tumor based solely on its observable anatomic features, the clinical and research communities recognize the importance of also considering the molecular features of a tumor that impact a patient's outcome. This paradigm shift toward personalized diagnosis and treatment of tumors requires the identification of robust molecular signatures that have high analytic and clinical validity. However, these fundamental patterns of biological variation that characterize a tumor's progression, as well as a patient's outcome, are hidden in large, high-dimensional genomic datasets. Comparative spectral decompositions are a set of universal mathematical frameworks that separate a signal into its underlying sources of variation, the same way a prism separates white light into its component colors. Rather than simplifying the data, as is commonly done, the decompositions leverage the complexity of the datasets in order to tease out the patterns within them. We recently demonstrated the effectiveness of these frameworks for modeling DNA copy-number profiles from glioblastoma (GBM) brain cancer patients, which revealed a genome-wide pattern of DNA copy-number aberrations (CNAs) that is predictive of patient survival and response to chemotherapy. Recurring DNA CNAs had been observed in GBM tumors' genomes for decades; however, copy-number subtypes that are predictive of a patient's outcome had not been conclusively established, illustrating the ability of comparative spectral decompositions to find what other methods have missed.
In this research, we build on those results by using comparative spectral decompositions to study lower-grade astrocytoma (LGA) patients' copy-number profiles, enabling prognostication of the LGA tumors and comparison of genomic aberrations that characterize the lower- and high-grade tumors. Additionally, we demonstrate the analytic and clinical validity of the GBM pattern as a platform- and technology-independent prognostic predictor in the combined astrocytoma population, by classifying astrocytoma tumors based on genomic profiles measured by both microarray and next-generation sequencing technologies. The results reported here bring the GBM pattern a step closer to the clinic, where it can be implemented as a laboratory test and used to improve patient care.
Type	Text
Publisher	University of Utah
Subject	comparative spectral decomposition; astrocytoma; cancer; personalized diagnosis
Dissertation Name	Doctor of Philosophy
Language	eng
Rights Management	© Katherine Ann Aiello
Format	application/pdf
Format Medium	application/pdf
ARK	ark:/87278/s6ms9p15
Setname	ir_etd
ID	1671475
OCR Text	COMPARATIVE SPECTRAL DECOMPOSITIONS FOR PREDICTING THE CLINICAL OUTCOME OF ASTROCYTOMA

by
Katherine Ann Aiello

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Department of Bioengineering
The University of Utah
May 2019

Copyright © Katherine Ann Aiello 2019
All Rights Reserved

The University of Utah Graduate School
STATEMENT OF DISSERTATION APPROVAL

The dissertation of Katherine Ann Aiello has been approved by the following supervisory committee members:

Orly Alter, Chair
Andrea H. Bild, Member
Tara L. Deans, Member
Robert S. MacLeod, Member
Chris J. Myers, Member

Date Approved: August 25, 2017

and by David W. Grainger, Chair/Dean of the Department/College/School of Bioengineering, and by David B. Kieda, Dean of The Graduate School.

ABSTRACT

As personalized medicine is integrated into clinical practice for the treatment of cancer, patient care will be centered around new methods of tumor diagnosis that are predictive of an individual patient’s outcome based on a tumor’s biology. Rather than prognosticating a tumor based solely on its observable anatomic features, the clinical and research communities recognize the importance of also considering the molecular features of a tumor that impact a patient’s outcome. This paradigm shift toward personalized diagnosis and treatment of tumors requires the identification of robust molecular signatures that have high analytic and clinical validity.
However, these fundamental patterns of biological variation that characterize a tumor’s progression, as well as a patient’s outcome, are hidden in large, high-dimensional genomic datasets. Comparative spectral decompositions are a set of universal mathematical frameworks that separate a signal into its underlying sources of variation, the same way a prism separates white light into its component colors. Rather than simplifying the data, as is commonly done, the decompositions leverage the complexity of the datasets in order to tease out the patterns within them. We recently demonstrated the effectiveness of these frameworks for modeling DNA copy-number profiles from glioblastoma (GBM) brain cancer patients, which revealed a genome-wide pattern of DNA copy-number aberrations (CNAs) that is predictive of patient survival and response to chemotherapy. Recurring DNA CNAs had been observed in GBM tumors’ genomes for decades; however, copy-number subtypes that are predictive of a patient’s outcome had not been conclusively established, illustrating the ability of comparative spectral decompositions to find what other methods have missed.

In this research, we build on those results by using comparative spectral decompositions to study lower-grade astrocytoma (LGA) patients’ copy-number profiles, enabling prognostication of the LGA tumors and comparison of genomic aberrations that characterize the lower- and high-grade tumors. Additionally, we demonstrate the analytic and clinical validity of the GBM pattern as a platform- and technology-independent prognostic predictor in the combined astrocytoma population, by classifying astrocytoma tumors based on genomic profiles measured by both microarray and next-generation sequencing technologies. The results reported here bring the GBM pattern a step closer to the clinic, where it can be implemented as a laboratory test and used to improve patient care.

To my family.

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS

CHAPTERS

1. INTRODUCTION
   Glioblastoma and Lower-Grade Astrocytoma
   Prognostic Staging: A New Era of Cancer Treatment
   Research Aims and Organization of Dissertation
   References

2. COMPARATIVE SPECTRAL DECOMPOSITIONS
   Mathematical Frameworks
   Comparative Spectral Decompositions for Modeling Biological Data
   Applications in Personalized Medicine
   References

3. DNA COPY-NUMBER ALTERATIONS PREDICTING LOWER-GRADE ASTROCYTOMA OUTCOME
   Introduction
   Mathematical Framework: The GSVD
   Biological Results
   Discussion
   Methods
   References

4. CROSS-PLATFORM VALIDATION OF PATTERN OF DNA COPY NUMBER ABERRATIONS
   Introduction
   Results
   Discussion
   Methods
   References

5. ASTROCYTOMA GENOTYPE ENCODES FOR TRANSFORMATION AND PREDICTS SURVIVAL PHENOTYPE
   Abstract
   Introduction
   The GSVD as a Comparative Spectral Decomposition
   Astrocytoma Tumor-Exclusive Genotype and Phenotype
   Blind Separation from Normal and Experimental Sources of Copy-Number Variation
   The Tumor-Exclusive Genotype Predicts the Survival Phenotype Statistically Better than Any Other Indicator
   Discussion
   Acknowledgements
   Supplementary Information
   References

6. CONCLUSIONS AND FUTURE WORK
   Prognosis of Lower-Grade Astrocytoma
   Microarray Platform-Independence of GBM Pattern
   Measurement Technology-Independence of GBM Pattern
   Future Directions with Higher-Order and Higher-Dimensional Datasets
   References

LIST OF FIGURES

2.1 The GSVD factorizes two datasets
2.2 High-throughput measurements are a superposition of signals
3.1 GSVD of LGA tumor and normal DNA copy-number profiles
3.2 The significance of individual probelets in the LGA tumor and normal datasets is given by the generalized normalized Shannon entropy
3.3 A tumor-exclusive batch effect is revealed by the GSVD
3.4 Experimental and biological variation is captured by the GSVD
3.5 Significant patterns are revealed by the GSVD of the LGA datasets
3.6 The male-specific X-chromosome deletion is identified in the LGA tumor dataset
3.7 Survival analyses of the LGA patients classified by GSVD
3.8 LGA genome-wide pattern of co-occurring CNAs
3.9 The GBM genome-wide pattern consists of LGA-shared and GBM-specific co-occurring CNAs
3.10 Schematic mapping of the GBM and LGA CNAs onto the Hh pathway
3.11 Schematic mapping of the GBM and LGA CNAs onto the Ras pathway
4.1 Survival analyses of astrocytoma patients classified by the GBM pattern
4.2 Survival analyses of astrocytoma patients classified by treatment
4.3 Survival analyses of astrocytoma patients classified by existing indicators
4.4 Survival analyses of astrocytoma patients classified by laboratory tests
5.1 The GSVD of the WGS read-count profiles of patient-matched astrocytoma tumor and normal DNA
5.2 Astrocytoma tumor-exclusive genotype and phenotype
5.3 The astrocytoma tumor-exclusive genotype encodes for signalling via the canonical Notch, Ras, Shh, and hominin-specific Notch pathways
5.4 Survival analyses of the WGS astrocytoma patients
5.5 Workflow for computation and interpretation of the GSVD
5.6 The most significant row basis vectors uncovered by the GSVD of the WGS astrocytoma tumor and normal datasets
5.7 Survival analyses of the WGS astrocytoma patients based upon the GSVD of the WGS datasets
5.8 Differential mRNA expression in the Ras pathway is consistent with the corresponding DNA CNAs
5.9 Differential mRNA expression in the Shh pathway is consistent with the corresponding DNA CNAs
5.10 Differential mRNA expression in the Notch pathway is consistent with the corresponding DNA CNAs
5.11 Differential mRNA expression outside the Ras, Shh, and Notch pathways is consistent with the corresponding DNA CNAs
5.12 The first, most tumor-exclusive row basis vector and corresponding tumor column basis vector
5.13 The 85th, most normal-exclusive row basis vector and corresponding tumor column basis vector
5.14 The first tumor and 85th normal column basis vectors are correlated with the fractional GC content
5.15 The first and 85th row basis vectors are correlated with experimental batches
5.16 The 82nd row basis vector is correlated with gender
5.17 The 82nd row basis vector and corresponding tumor column basis vector
5.18 The 82nd row basis vector and corresponding normal column basis vector
5.19 The 82nd tumor and normal column basis vectors are correlated with a deletion of the X chromosome relative to the autosome across the tumor and normal genomes
5.20 Survival analyses of the chemotherapy- and radiation-treated WGS astrocytoma patients with MGMT promoter methylation and IDH1 mutation test results
5.21 Survival analyses of the Affymetrix astrocytoma patients
5.22 Survival analyses of the Agilent GBM patients
6.1 Tensor GSVD of patient- and platform-matched genomic profiles

LIST OF TABLES

4.1 Cox proportional hazard models for GBM pattern and existing indicators
5.1 Cox proportional hazards models of the WGS astrocytoma patients
5.2 Cox proportional hazards models of the Affymetrix astrocytoma patients
5.3 Cox proportional hazards models of the Agilent GBM patients

ACKNOWLEDGMENTS

First, to my family: my parents Sue and Neil, and my brother Greg. They have been my most ardent supporters and a steady source of encouragement from the very beginning, and this would not be possible without them.

My advisor, Dr. Orly Alter, thank you for your guidance through the many phases of this project and my graduate studies. I would also like to thank my Ph.D. committee members, Dr. Andrea Bild, Dr. Tara Deans, Dr. Rob MacLeod, and Dr. Chris Myers, for their support and advice along the way.

I would like to thank the faculty members and clinicians who contributed insightful discussions on their areas of expertise that helped to shape this interdisciplinary work: Dr. Roger Horn for thoughtful discussions of matrix analysis, Dr. Matthew P. Scott and Dr. Robert A. Weinberg for helpful comments on the Hedgehog (Hh) and rat sarcoma virus (Ras) signaling pathways, respectively, and Dr. Cheryl A. Palmer and Dr. Randy L. Jensen for their discussions of astrocytoma pathology and intratumor heterogeneity.

I would like to thank my co-authors and lab mates, who have been exceptional colleagues and friends throughout the years: Ted Schomay, Cody Maughan, Kaitlin McLean, and lab alumni Dr. Preethi Sankaranarayanan, Dr. Sri Priya Ponnapalli, Ben Alpert, and Nic Bertagnolli. I would also like to thank the faculty and staff of the Scientific Computing and Imaging Institute and the Department of Bioengineering for their dedication to making this an exceptional place to grow and learn over the past 5 years.

Finally, I thank my friends, who have encouraged me throughout this process: Zack, Sidh, Jeff, Bronwyn, Alicia, and so many more.

This research was funded by the National Cancer Institute (NCI) U01 Grant CA-202144 and the Utah Science, Technology, and Research (USTAR) Initiative.
CHAPTER 1
INTRODUCTION

Cancer is characterized by instability of the genome that manifests in many forms. Mechanisms of instability include point mutations in oncogenes and tumor suppressors, which inhibit the ability of key proteins to function normally, as well as genomic rearrangements. One common type of genomic rearrangement found in tumor cells is the copy number aberration (CNA). Normal human cells have two copies of each gene. However, neoplastic tumor cells often exhibit gains and losses across the genome, by which certain genes may be amplified to many more than two copies, while others may be lost entirely. Such aberrations are characteristic of the loss of integrity of the genome found in many cancers [1, 2]. Recurring DNA copy-number alterations (CNAs) have been recognized as a hallmark of cancer for over a century [3], yet what these alterations imply about a solid tumor’s development and progression, and a patient’s diagnosis, prognosis, and treatment, remains poorly understood.

Over the past century, a variety of experimental methods have been used to observe and study copy number variation in disease with increasing resolution. Early methods were based on identifying large abnormalities of chromosomes using cytogenetic methodologies, such as karyotyping with fluorescence in situ hybridization (FISH) and Giemsa banding. However, the resolution of these methods was limited to the features that are visible under a microscope, making variations under 5 Mb impossible to detect [4]. Twenty-five years ago, the advent of comparative genomic hybridization provided a method for identifying differential copy number in tumor samples relative to a normal control [5]. The new technology allowed for the measurement of relative copy numbers in genomic regions as small as 100 kilobases [4, 6].
DNA copy number analysis was ushered into the modern genomic era with the invention of array CGH (aCGH), which utilizes microarray technology to measure the copy number variation at designated regions across the entire human genome using a predetermined set of short reference fragments, 20–100 nucleotides long, corresponding to particular regions of the genome [6]. The microarray technology enabled the simultaneous measurement of DNA copy number at thousands, or even millions, of loci. By streamlining experimental and analysis protocols, aCGH became a mainstay in copy number analysis for the next two decades.

More recently, next-generation sequencing (NGS) emerged as a preferred technology for high-throughput measurement of DNA [7], and early computational methods for studying copy number variation in NGS data were developed [8, 9]. In NGS experiments, the DNA in a sample is sequenced in 100–200 nucleotide fragments, called “reads,” which are then mapped back to a reference genome. NGS enables nearly continuous measurement of DNA copy number in a sample, which can be determined from the relative number of reads that map to a particular location in the genome, known as the “read depth” in that region.

Along with the increased resolution brought about by each new DNA technology come technology- and protocol-specific sources of noise that obscure the underlying biological signal. These effects are particularly prevalent in high-throughput technologies, where batch effects and experimental artifacts can be so severe that they sometimes outweigh the actual signal being measured [10, 11]. Many standard bioinformatics methods are dependent on a priori knowledge of the existence and structure of such artifacts. Computational methods that are designed specifically for the analysis of data from a particular measurement platform or technology struggle to keep up with the rapidly advancing field, and quickly become outdated.
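As an illustration of how relative copy number can be estimated from read depth, the following is a minimal sketch and not a method taken from this dissertation. It assumes reads have already been counted in fixed genomic bins for a tumor sample and a matched normal sample; the function name `log2_copy_ratio`, the `pseudocount` parameter, and the median-centering step (which presumes that most bins are copy-neutral) are illustrative assumptions. Corrections that real pipelines typically apply, such as for GC content, are omitted.

```python
import numpy as np


def log2_copy_ratio(tumor_counts, normal_counts, pseudocount=0.5):
    """Per-bin log2 ratio of tumor to normal read depth (hypothetical helper).

    Both inputs are read counts over the same genomic bins. The result is
    median-centered, which assumes most bins are copy-neutral, so that a
    difference in overall sequencing depth between the samples cancels out.
    """
    tumor = np.asarray(tumor_counts, dtype=float) + pseudocount
    normal = np.asarray(normal_counts, dtype=float) + pseudocount
    # Raw log2 ratio of read depth in each bin.
    ratio = np.log2(tumor / normal)
    # Center on the median so the typical bin sits at zero.
    return ratio - np.median(ratio)
```

Under these assumptions, a bin with a relative gain yields a positive value and a bin with a relative loss yields a negative value, regardless of how deeply each sample was sequenced.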
Despite these rapid changes in the technological landscape of DNA copy number measurement and analysis over the past several decades, the underlying biology that drives tumor formation and growth remains the same. In their seminal paper on the hallmarks of cancer, Hanahan and Weinberg predicted that cancer research would become “a logical science, where the complexities of the disease, described in the laboratory and clinic, will become understandable in terms of a small number of underlying principles.” Discovering and studying these underlying principles requires the use of mathematical models that are capable of uncovering fundamental patterns in the datasets, independent of the measurement technology, and without a priori information about the sources of technical or experimental variation in the data.

Comparative spectral decompositions are a family of mathematical frameworks that create a single, coherent model from multiple datasets by enabling mathematical comparison of the fundamental patterns that make up each dataset, whether those patterns are common to both datasets or exclusive to one of them. In the same way that a prism separates white light to reveal its component colors, comparative spectral decompositions uncover fundamental patterns in complex, noisy datasets.

Glioblastoma and Lower-Grade Astrocytoma

Astrocytomas are notoriously aggressive brain tumors that are characterized by poor prognosis and limited response to treatment. The World Health Organization classifies astrocytoma into four grades of increasing malignancy based on the degree of cellular atypia, mitosis, endothelial proliferation, and necrosis observed in the tumor sample viewed under a microscope. Despite over two decades of cancer research, advances in the standard of care for astrocytoma patients remain limited [12].
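To make the idea of a comparative spectral decomposition introduced earlier in this chapter more concrete, the following is a minimal numerical sketch of one such framework, the generalized singular value decomposition (GSVD). It is an illustration under stated assumptions and not the implementation used in this work. It assumes two real matrices with the same number of columns (for example, tumor and normal profiles measured across the same patients), each with at least as many rows as columns, and a stacked matrix of full column rank. The function names `gsvd` and `angular_distance`, the QR-then-SVD construction, and the angular-distance convention arctan(σ1/σ2) − π/4 are choices made for this sketch.

```python
import numpy as np


def gsvd(d1, d2):
    """Sketch of a GSVD: d1 = u1 @ diag(c) @ vt and d2 = u2 @ diag(s) @ vt.

    Assumes d1 and d2 share their column count, each has at least as many
    rows as columns, and the stacked matrix has full column rank.
    The generalized singular values satisfy c**2 + s**2 == 1.
    """
    m1 = d1.shape[0]
    # Reduced QR of the stacked matrix gives a shared right factor r.
    q, r = np.linalg.qr(np.vstack([d1, d2]))
    q1, q2 = q[:m1], q[m1:]
    # SVD of the top block; the same rotation w diagonalizes the bottom block.
    u1, c, wt = np.linalg.svd(q1, full_matrices=False)
    q2w = q2 @ wt.T
    s = np.linalg.norm(q2w, axis=0)
    u2 = q2w / np.where(s > 0, s, 1.0)  # guard against division by zero
    vt = wt @ r  # shared row basis vectors, one per row of vt
    return u1, u2, c, s, vt


def angular_distance(c, s):
    """Relative significance of each shared pattern in d1 versus d2.

    Ranges from +pi/4 (pattern exclusive to d1) through 0 (equally
    significant in both) to -pi/4 (pattern exclusive to d2).
    """
    return np.arctan2(c, s) - np.pi / 4
```

In this sketch, each row of `vt` is a pattern shared by both datasets, and the angular distance ranks the patterns from most exclusive to the first dataset to most exclusive to the second, which is one way to express whether a pattern is common to both datasets or exclusive to one.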
Incidence and Clinical Management of Astrocytoma

Glioblastoma (GBM) is the most frequent primary malignant brain tumor in adults, accounting for nearly half of all malignant tumors of the central nervous system. An estimated 12,390 new cases of GBM are projected to be diagnosed in the United States in 2017, with an additional 1,330 cases of grade III anaplastic astrocytoma, and 1,180 cases of grade II diffuse astrocytoma [13]. The current standard of care for GBM is maximal surgical resection of the tumor, followed by treatment with radiotherapy and concomitant chemotherapy. Systemic temozolomide is the first-line chemotherapeutic agent. Even with the most aggressive clinical treatment, the median survival time for GBM is 14.6 months [14], with a 5-year survival rate of only 5.5%. There have not been any major changes to the first-line treatment of primary GBM since the FDA approval of temozolomide in 2005.

Lower-grade astrocytomas (LGAs), classified here as grades II and III astrocytomas, are a clinically heterogeneous group of tumors associated with variable response to treatment [15]. The 5-year survival rates in the lower-grade tumors are significantly higher than that of GBM, at 49.7% and 29.7% for grade II and III tumors, respectively [13]. Clinical management of LGA is variable, depending on the severity of the tumor at the time of diagnosis, and consists of a combination of surgical resection, radiotherapy, and chemotherapy. A subset of highly aggressive grade III tumors, which behave similarly to the higher-grade GBM, has been observed in the clinic [16]. However, these tumors cannot currently be identified at diagnosis.

Other than tumor grade [17], the best predictor of an astrocytoma patient’s survival for over 50 years has been the patient’s age at diagnosis [18, 19]. GBM has a median age at diagnosis of 64 years, while LGAs have a median age at diagnosis of 53 years for grade III, and 48 years for grade II.
Overall, incidence increases with advancing age [13].

Glioblastoma: The Cancer Genome Atlas Pilot Project

In 2005, on the heels of the completion of the Human Genome Project, the National Cancer Institute and the National Human Genome Research Institute launched a $100 million collaboration known as The Cancer Genome Atlas (TCGA) Pilot Project. The project brought together preeminent cancer research institutions from across the United States to establish a database of large-scale genomic data to characterize human cancers. Due to its overwhelmingly poor prognosis and the lack of recent improvement in the standard of care, GBM was selected as one of three tumors included in the pilot project. The goal of the pilot project was to identify alterations in genes capable of differentiating cancer subtypes, and to identify genomic aberrations, such as copy number changes, that enable the development of targeted diagnostics and therapies and foster personalized treatment of cancer.

The first comprehensive analysis of the GBM data from the pilot project was published in 2008 [12]. The study identified novel CNAs, including frequent deletions of the genes NF1 and PARK2, and amplification of AKT3, as well as infrequent deletion of PTPRD and amplification of FGFR2 and IRS2. However, the CNAs identified in the study were not found to be associated with a patient’s survival [12]. Data for LGA was later added to TCGA as part of a study on low grade glioma, a broader class of diffuse, low grade brain tumors. The initial research findings were published in 2015 [15], but the study did not present any novel copy number-related prognostic predictors for LGA.

Intratumor Heterogeneity

Astrocytomas, and GBMs in particular, are characterized by extensive histologic and genetic intratumor heterogeneity. Tumor cells are interspersed with normal cells, immune cells, blood vessels, and necrotic regions.
So marked is the variation that the original name, “multiforme,” was derived from early observations of the tumor’s heterogeneous histology [20]. However, this spatial variation leads to challenges in obtaining a representative tumor sample, with many biopsies suffering from insufficient tumor content and suboptimal nucleic acid quality or quantity. Of the 587 biospecimens screened in the original TCGA GBM study, only 35% met the predefined biospecimen quality control requirements [12, 21]. Astrocytomas also exhibit a high degree of genetic intratumor heterogeneity, which contributes to the observed variable response to treatment [20]. It has been shown that multiple histologically similar tumor fragments biopsied from a single tumor have distinct copy-number and gene expression signatures [22].

Known Copy Number Alterations in Astrocytoma

Early cytogenetic studies of astrocytomas showed considerable variation, ranging from karyotypically normal to highly aberrated tumors. In general, as malignancy and grade increase, genomic aberrations become increasingly complex in both structure and chromosome number [23]. Because of this observed genomic heterogeneity, there are inconsistent reports of frequently aberrated regions in the literature [24, 25, 26, 27]. Among the most frequently observed aberrations in LGA are the loss of 9p and 10q, and gain of 19. Among the most frequently observed CNAs in GBM are loss of one or both arms of chromosomes 9, 10, and 22, and gains of one or both arms of chromosomes 7, 19, and 20 [26]. Additional aberrations have been reported, including the loss of Xp and 5p and gain of 8q in grade II tumors, while the loss of 4q, 11p, and 13q, and gains of 1q and 6p were found in grade III and IV tumors. The most frequent focal amplification site in all tumors was located on 12q11-q21 [24]. Oncogenes associated with amplified loci include MDM4 (1q32.1), MYCC (8q23), EGFR (7p12.3), CDK4 (12q14), and MDM2 (12q13).
Tumor suppressors associated with frequently deleted loci include CDKN2A and CDKN2B (9p21), PTEN (10q23.31), RB (13q14.1), TOP3B (22q), and TAF (22q) [27].

Associations between specific CNAs and patient survival have been largely inconclusive, with the amplification of EGFR among the best studied and most controversial alterations in GBM. Some studies report that patients whose tumors have an amplification of EGFR exhibit significantly shorter survival than those without the amplification [26, 28, 29, 30], while other studies find no significant association [31, 32, 33, 34, 35]. More recent reports have shown that relationships between individual CNAs and survival are more complex, and that differential survival is dependent on additional biological and clinical variables [12, 15, 36].

Known Biomarkers in Astrocytoma

Mutations in the genes IDH1 and IDH2 are associated with improved prognosis in glioma, a broader class of brain tumors to which astrocytoma belongs [37]. These mutations are common in LGA, with over half (54%) of LGAs harboring a mutation in IDH1. However, the mutation is found in only 3% of primary GBMs [38]. These mutations were incorporated into the 2016 revision of the WHO classification of astrocytoma [39] based on their demonstrated ability to improve the stratification of LGA patients by prognosis in data from The Cancer Genome Atlas [15].

The methylation of the promoter region of the gene MGMT, which encodes a DNA repair protein, is associated with improved clinical outcome in astrocytoma [40]. Methylation of the promoter region prevents transcription of the gene, thus reducing a cell’s ability to repair its damaged DNA, which in turn makes the tumor more sensitive to alkylating agents such as the first-line chemotherapy, temozolomide [41]. Tumors with abnormally high levels of methylation across the entire genome, known as the glioma-CpG island methylator phenotype (G-CIMP), are also associated with improved survival in astrocytoma.
However, only 8.8% of primary GBMs are characterized by this phenotype, which is primarily found among young adults [42]. Another recent study identified several GBM molecular subtypes based on mRNA gene expression signatures that are associated with specific neural lineages. The expression signatures are correlated with differential response to aggressive treatment regimens [43]. Even though GBM tumors exhibit a range of DNA CNAs, many of which are thought to play roles in the cancer’s pathogenesis, DNA copy-number subtypes for GBM that predict a patient’s survival have not been established conclusively [44, 45].

Prognostic Staging: A New Era of Cancer Treatment

For nearly 75 years, the primary way clinicians and researchers have recorded cancer presentation and progression has been through “anatomic” staging [46]. The tumor’s stage is represented through the “TNM” system, which describes the progression of the disease by the involvement of the tumor (T), lymph nodes (N), and distant metastases (M). While the anatomic stage is still the primary prognostic predictor used in the clinic for nearly all cancers (with the exceptions of leukemia and brain tumors), advances in genomic research and biomarker discovery over the past two decades have painted a much more nuanced picture of cancer prognosis. In recognition of this new view of cancer characterization, the American Joint Committee on Cancer (AJCC) recently released its 8th Edition Cancer Staging System [47], which incorporates a groundbreaking shift from anatomic staging to a new paradigm of prognostic staging. Beginning in January 2018, tumors diagnosed in the United States will be recorded by their prognostic stage, which incorporates evidence-based, validated biomarkers that are predictive of a patient’s prognosis and clinical outcome, in addition to the tumor’s TNM classification. 
The AJCC proposed this system as a bridge from the current population-based methods toward the implementation of a more personalized means of cancer diagnosis and treatment. This new system represents a marked shift within the cancer community towards the recognition, and now incorporation, of biomarkers as defining and driving features of a tumor’s progression and a patient’s clinical outcome. Unlike most tumors, which can metastasize throughout the body, GBM and LGA rarely spread beyond the brain. Because of this, the TNM system does not apply to astrocytoma, and a tumor stage is not assigned. Instead, brain tumors are classified by their tumor grade according to the WHO Classification of Tumors of the Central Nervous System. Following the announcement of the aforementioned updates to the AJCC cancer staging system, the WHO followed suit and incorporated validated biomarkers into the definitions of brain tumors. The changes are reflected in the 2016 update of the WHO Classification of Tumors of the Central Nervous System [13, 39]. With this new system of prognostic staging and diagnosis comes a need for the identification and validation of biomarkers that capture the fundamental tumor biology that drives a patient’s clinical outcome. For a biomarker test to be integrated into clinical care, it must uphold rigorous standards of analytic validity, demonstrating that it can be measured in a platform- and protocol-independent manner. This is particularly important with the rapidly changing technological landscape of DNA measurement and analysis. In the same way that the state of the art has advanced from cytogenetic methods to next-generation sequencing over the past two decades, measurement technologies and protocols will continue to evolve. A clinical test must be able to maintain its prognostic integrity across the new measurement technologies. 
A biomarker must also demonstrate clinical utility in its ability to accurately stratify patients and inform clinical decision making [48, 49].

Research Aims and Organization of Dissertation

Large genomic datasets such as TCGA provide a valuable resource for biomarker discovery. However, the fundamental patterns of biological variation that characterize a tumor’s progression, and a patient’s clinical outcome, are hidden within the high-dimensional datasets. Comparative spectral decompositions are a set of universal mathematical frameworks that separate a signal into its underlying sources of variation, the same way a prism separates white light into its component colors. Rather than simplifying the data, as is commonly done, the decompositions make use of the complex structure of the datasets in order to tease out the patterns within them. We recently demonstrated the effectiveness of these frameworks for modeling microarray-measured DNA copy-number profiles from GBM patients. The model revealed a genome-wide pattern of co-occurring DNA CNAs that is predictive of a patient’s survival and response to chemotherapy [50]. Recurring DNA CNAs had been observed in GBM tumors’ genomes for decades; however, copy-number subtypes that are predictive of patients’ outcomes had not been established conclusively, illustrating the ability of comparative spectral decompositions to find what other methods have missed. This research builds on those results according to three research aims.

Research Aims

In the first research aim, we used the previously described mathematical framework, the generalized singular value decomposition (GSVD), to comparatively model patient-matched LGA brain tumor and normal DNA copy-number profiles. To date, the statistically best indicators of LGA outcome in clinical use remain the patient’s age at diagnosis and the tumor’s histologically determined grade. 
To identify CNAs that might predict LGA patients’ outcomes, we aimed to use the GSVD to model the genomic profiles and identify genome-wide patterns of CNAs that are predictive of patient outcome and provide new insights into the molecular biology of astrocytomas. In the second research aim, we sought to demonstrate the analytic and clinical validity of the previously identified GBM pattern as a microarray platform-independent prognostic predictor, separately for LGA and GBM patients, as well as in the combined astrocytoma population, by classifying astrocytoma patients based on genomic profiles measured by a different microarray platform. In the third research aim, we demonstrated the technology-independence of the GBM pattern by classifying patients based on genomic profiles measured using next-generation sequencing technologies. By demonstrating that the mathematically universal GBM pattern describes the consistent biology of an astrocytoma tumor’s genome, and can be reliably measured and tested on any DNA-measurement technology, we re-validate that the pattern captures the underlying biology that characterizes the clinically aggressive astrocytoma subtype.

Organization of Dissertation

These research aims, and the results reported in this dissertation, bring the GBM pattern a step closer to the clinic, where it can be implemented as a laboratory test and used to improve patient care. The organization of this dissertation is as follows. Chapter 2 is a background chapter that introduces the mathematics of comparative spectral decompositions and recent applications in biology and personalized medicine. Chapter 3 describes the application of the GSVD to genomic profiles from lower-grade astrocytomas, revealing a novel genome-wide pattern of CNAs that enables improved prognostication of LGA patients and gives insights into signaling pathways involved in brain tumor development. 
Chapter 4 demonstrates the platform-independence of a previously identified pattern of copy number alterations for predicting prognosis and clinical outcome in the general astrocytoma population. Chapter 5 presents the findings of the application of the GSVD to astrocytoma copy number profiles measured by whole genome sequencing, revealing new tumor biology from the higher resolution measurements. Chapter 6 summarizes the contributions of this work and proposes future directions for computational and experimental research supported by the conclusions of this dissertation. The results of the work presented in Chapters 3 and 4 were previously published together as a journal article in PLoS One [51], and are reprinted here with minor revisions in accordance with the Creative Commons Attribution License. Chapter 5 was published as an invited journal article in a special issue of Applied Physics Letters (APL) Bioengineering on the topic of “Bioengineering of Cancer” [52].

References

[1] D. Hanahan and R. A. Weinberg, “The hallmarks of cancer,” Cell, vol. 100, no. 1, pp. 57–70, 2000. [2] D. Hanahan and R. A. Weinberg, “Hallmarks of cancer: the next generation,” Cell, vol. 144, no. 5, pp. 646–674, 2011. [3] T. Boveri, “Concerning the origin of malignant tumours by Theodor Boveri. Translated and annotated by Henry Harris,” Journal of Cell Science, vol. 121, no. Supplement 1, pp. 1–84, 2008. [4] T. J. De Ravel, et al., “What’s new in karyotyping? The move towards array comparative genomic hybridisation (CGH),” European Journal of Pediatrics, vol. 166, no. 7, pp. 637–643, 2007. [5] A. Kallioniemi, et al., “Comparative genomic hybridization for molecular cytogenetic analysis of solid tumors,” Science, vol. 258, no. 5083, pp. 818–822, 1992. [6] S. Solinas-Toldo, et al., “Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances,” Genes, Chromosomes and Cancer, vol. 20, no. 4, pp. 399–407, 1997. [7] J. Shendure and H. 
Ji, “Next-generation DNA sequencing,” Nature Biotechnology, vol. 26, no. 10, p. 1135, 2008. [8] M. A. DePristo, et al., “A framework for variation discovery and genotyping using next-generation DNA sequencing data,” Nature Genetics, vol. 43, no. 5, pp. 491–498, 2011. [9] P. Medvedev, et al., “Computational methods for discovering structural variation with next-generation sequencing,” Nature Methods, vol. 6, pp. S13–S20, 2009. [10] M. K. Kerr, et al., “Analysis of variance for gene expression microarray data,” Journal of Computational Biology, vol. 7, no. 6, pp. 819–837, 2000. [11] O. Alter, et al., “Singular value decomposition for genome-wide expression data processing and modeling,” Proceedings of the National Academy of Sciences, vol. 97, no. 18, pp. 10101–10106, 2000. [12] Cancer Genome Atlas Research Network, “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, no. 7216, pp. 1061–1068, 2008. [13] Q. T. Ostrom, et al., “CBTRUS statistical report: primary brain and other central nervous system tumors diagnosed in the United States in 2009–2013,” Neuro-Oncology, vol. 18, no. suppl 5, pp. v1–v75, 2016. [14] R. Stupp, et al., “Radiotherapy plus concomitant and adjuvant temozolomide for glioblastoma,” New England Journal of Medicine, vol. 352, no. 10, pp. 987–996, 2005. [15] Cancer Genome Atlas Research Network, et al., “Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas,” New England Journal of Medicine, vol. 372, no. 26, pp. 2481–2498, 2015. [16] J. S. Smith, et al., “PTEN mutation, EGFR amplification, and outcome in patients with anaplastic astrocytoma and glioblastoma multiforme,” Journal of the National Cancer Institute, vol. 93, no. 16, pp. 1246–1256, 2001. [17] C. Daumas-Duport, et al., “Grading of astrocytomas: a simple and reproducible method,” Cancer, vol. 62, no. 10, pp. 2152–2165, 1988. [18] M. G. 
Netsky, et al., “The longevity of patients with glioblastoma multiforme,” Journal of Neurosurgery, vol. 7, no. 3, pp. 261–269, 1950. [19] W. J. Curran, et al., “Recursive partitioning analysis of prognostic factors in three Radiation Therapy Oncology Group malignant glioma trials,” Journal of the National Cancer Institute, vol. 85, no. 9, pp. 704–710, 1993. [20] R. Bonavia, et al., “Heterogeneity maintenance in glioblastoma: a social network,” Cancer Research, vol. 71, no. 12, pp. 4055–4060, 2011. [21] L. S. Hu, et al., “Multi-parametric MRI and texture analysis to visualize spatial histologic heterogeneity and tumor extent in glioblastoma,” PLoS One, vol. 10, no. 11, p. e0141506, 2015. [22] A. Sottoriva, et al., “Intratumor heterogeneity in human glioblastoma reflects cancer evolutionary dynamics,” Proceedings of the National Academy of Sciences, vol. 110, no. 10, pp. 4009–4014, 2013. [23] S. Bigner, et al., “DNA content and chromosomal composition of malignant human gliomas,” Neurologic Clinics, vol. 3, no. 4, pp. 769–784, 1985. [24] R. G. Weber, et al., “Characterization of genomic alterations associated with glioma progression by comparative genomic hybridization,” Oncogene, vol. 13, no. 5, pp. 983–994, 1996. [25] R. Koschny, et al., “Comparative genomic hybridization in glioma: a meta-analysis of 509 cases,” Cancer Genetics and Cytogenetics, vol. 135, no. 2, pp. 147–159, 2002. [26] R. N. Wiltshire, et al., “Comparative genomic hybridization analysis of astrocytomas: prognostic and diagnostic implications,” The Journal of Molecular Diagnostics, vol. 6, no. 3, pp. 166–179, 2004. [27] J. Bayani, et al., “Molecular cytogenetic analysis in the study of brain tumors: findings and applications,” Neurosurgical Focus, vol. 19, no. 5, pp. 1–36, 2005. [28] N. G. Rainov, et al., “Prognostic factors in malignant glioma: influence of the overexpression of oncogene and tumor-suppressor gene products on survival,” Journal of Neuro-Oncology, vol. 35, no. 1, pp. 13–28, 1997. 
[29] A. Waha, et al., “Lack of prognostic relevance of alterations in the epidermal growth factor receptor-transforming growth factor-α pathway in human astrocytic gliomas,” Journal of Neurosurgery, vol. 85, no. 4, pp. 634–641, 1996. [30] E. W. Newcomb, et al., “Survival of patients with glioblastoma multiforme is not influenced by altered expression of P16, P53, EGFR, MDM2 or Bcl-2 genes,” Brain Pathology, vol. 8, no. 4, pp. 655–667, 1998. [31] E. Jaros, et al., “Prognostic implications of p53 protein, epidermal growth factor receptor, and Ki-67 labelling in brain tumours,” British Journal of Cancer, vol. 66, no. 2, p. 373, 1992. [32] P. Korkolopoulou, et al., “MDM2 and p53 expression in gliomas: a multivariate survival analysis including proliferation markers and epidermal growth factor receptor,” British Journal of Cancer, vol. 75, no. 9, p. 1269, 1997. [33] A. Zhu, et al., “Epidermal growth factor receptor: an independent predictor of survival in astrocytic tumors given definitive irradiation,” International Journal of Radiation Oncology* Biology* Physics, vol. 34, no. 4, pp. 809–815, 1996. [34] M.-C. Etienne, et al., “Epidermal growth factor receptor and labeling index are independent prognostic factors in glial tumor outcome,” Clinical Cancer Research, vol. 4, no. 10, pp. 2383–2390, 1998. [35] D. Krex, et al., “Long-term survival with glioblastoma multiforme,” Brain, vol. 130, no. 10, pp. 2596–2606, 2007. [36] M. L. Simmons, et al., “Analysis of complex relationships between age, p53, epidermal growth factor receptor, and survival in glioblastoma patients,” Cancer Research, vol. 61, no. 3, pp. 1122–1128, 2001. [37] H. Yan, et al., “IDH1 and IDH2 mutations in gliomas,” New England Journal of Medicine, vol. 360, no. 8, pp. 765–773, 2009. [38] K. Ichimura, et al., “IDH1 mutations are present in the majority of common adult gliomas but rare in primary glioblastomas,” Neuro-Oncology, vol. 11, no. 4, pp. 341–347, 2009. [39] D. N. 
Louis, et al., “The 2016 World Health Organization classification of tumors of the central nervous system: a summary,” Acta Neuropathologica, vol. 131, no. 6, pp. 803–820, 2016. [40] M. E. Hegi, et al., “Correlation of O6-methylguanine methyltransferase (MGMT) promoter methylation with clinical outcomes in glioblastoma and clinical strategies to modulate MGMT activity,” Journal of Clinical Oncology, vol. 26, no. 25, pp. 4189–4199, 2008. [41] M. E. Hegi, et al., “MGMT gene silencing and benefit from temozolomide in glioblastoma,” New England Journal of Medicine, vol. 352, no. 10, pp. 997–1003, 2005. [42] H. Noushmehr, et al., “Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma,” Cancer Cell, vol. 17, no. 5, pp. 510–522, 2010. [43] R. G. W. Verhaak, et al., “Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1,” Cancer Cell, vol. 17, no. 1, pp. 98–110, 2010. [44] R. N. Wiltshire, et al., “Comparative genetic patterns of glioblastoma multiforme: potential diagnostic tool for tumor classification,” Neuro-Oncology, vol. 2, no. 3, pp. 164–173, 2000. [45] A. Misra, et al., “Array comparative genomic hybridization identifies genetic subgroups in grade 4 human astrocytoma,” Clinical Cancer Research, vol. 11, no. 8, pp. 2907–2918, 2005. [46] M. Ménoret, “The genesis of the notion of stages in oncology: The French Permanent Cancer Survey (1943–1952),” Social History of Medicine, vol. 15, no. 2, pp. 291–302, 2002. [47] M. B. Amin, et al., AJCC Cancer Staging Manual, Eighth Edition. Springer International Publishing, 2016. [48] W. Burke, et al., “Genetic test evaluation: information needs of clinicians, policy makers, and the public,” American Journal of Epidemiology, vol. 156, no. 4, pp. 311–318, 2002. [49] W. Burke, “Genetic tests: clinical validity and clinical utility,” Current Protocols in Human Genetics, pp. 9–15, 2014. [50] C. H. 
Lee, et al., “GSVD comparison of patient-matched normal and tumor aCGH profiles reveals global copy-number alterations predicting glioblastoma multiforme survival,” PLoS One, vol. 7, no. 1, p. e30098, 2012. [51] K. A. Aiello and O. Alter, “Platform-independent genome-wide pattern of DNA copy-number alterations predicting astrocytoma survival and response to treatment revealed by the GSVD formulated as a comparative spectral decomposition,” PLoS One, vol. 11, no. 10, p. e0164546, 2016. [52] K. A. Aiello, et al., “Mathematically universal and biologically consistent astrocytoma genotype encodes for transformation and predicts survival phenotype,” APL Bioengineering, vol. 2, no. 3, p. 031909, 2018.

CHAPTER 2

COMPARATIVE SPECTRAL DECOMPOSITIONS

Mathematical Frameworks

The singular value decomposition (SVD) and its higher-order generalizations are a family of matrix factorizations that generalize the eigendecomposition to real or complex rectangular matrices. The decompositions are widely applied in fields including signal processing, pattern finding, and statistics. Although the SVD exists for both real and complex matrices, all equations in this dissertation are given for the special case of real matrices, as they apply to the analysis of measured data.

The Singular Value Decomposition

The SVD factorizes an $m \times n$ matrix $D$ into the product of three matrices,

$$D = U \Sigma V^T, \tag{2.1}$$

where $U$ is a column-wise orthonormal matrix, $\Sigma$ is a non-negative diagonal matrix containing the singular values, and $V^T$ is a row-wise orthonormal matrix [1, 2]. Thus, the matrices $U$ and $V^T$ comprise orthonormal, uncorrelated column basis vectors (i.e., left basis vectors) and row basis vectors (i.e., right basis vectors), respectively, of the original matrix $D$. The SVD can also be written as a weighted sum of outer products of the left and right basis vectors,

$$D = \sum_{n=1}^{N} \sigma_n \, u_n \otimes v_n^T. \tag{2.2}$$

This is the formulation of the SVD that provides insight into its interpretation as a spectral decomposition; the original data $D$ can be written exactly as a weighted sum of the fundamental patterns captured across the basis vectors of the column and row dimensions. The weights are captured in the singular values, $\sigma_n$, which describe the magnitude of the contribution of each outer product to the data matrix, and the left and right basis vectors capture the fundamental patterns of variation across the two dimensions of the dataset. The basis vectors are the building blocks of the dataset, which can be uniquely combined to exactly reconstruct the original dataset. Intuitively, the SVD is the mathematical description of the way a prism separates white light into its fundamental wavelengths to reveal individual colors. For any input signal that is represented as a matrix, the SVD separates the superimposed patterns that comprise a data matrix into the orthogonal bases that capture the fundamental patterns of the dataset.

The Generalized Singular Value Decomposition

The generalized singular value decomposition (GSVD) generalizes this powerful framework for the simultaneous decomposition of two matrices, an $m_1 \times n$ matrix $D_1$ and an $m_2 \times n$ matrix $D_2$, such that both matrices have the same number of columns $n$, but may differ in the number of rows [3, 4]. The structure of the column-matched and row-independent datasets is of an order higher than that of a single matrix. The column dimension, the two independent row dimensions, and both matrices each represent a degree of freedom. Unfolded into a single matrix (i.e., concatenated along the matched column dimension), some of the degrees of freedom are lost and much of the information in the datasets might also be lost. Rather than unfolding the datasets, the GSVD simultaneously factorizes $D_1$ and $D_2$ into five matrices,

$$D_i = U_i \Sigma_i V^T, \quad i = 1, 2, \tag{2.3}$$

where $U_1$ and $U_2$ are column-wise orthonormal matrices corresponding to $D_1$ and $D_2$, respectively, $\Sigma_1$ and $\Sigma_2$ are non-negative diagonal matrices containing the generalized singular values corresponding to $D_1$ and $D_2$, respectively, and $V^T$ is a single matrix comprising row basis vectors that correspond to both $D_1$ and $D_2$ (Figure 2.1). Like the SVD, the GSVD can also be written as a weighted sum of outer products,

$$D_i = \sum_{n=1}^{N} \sigma_{i,n} \, u_{i,n} \otimes v_n^T, \quad i = 1, 2, \tag{2.4}$$

where both of the original data matrices can be exactly written as a weighted sum of the outer product of pairs of the basis vectors. It is important to recognize that the GSVD is a simultaneous decomposition of the two matrices; the decomposition yields a unique set of left basis vectors for each dataset, but generates a single, shared set of right basis vectors that is common to both datasets. This shared row basis is critical for the formulation of the GSVD as a comparative spectral decomposition [5]. For each common pattern of variation across the columns of the datasets, the GSVD reveals the corresponding coordinated patterns of variation across the rows of each of the datasets, as well as the generalized singular values that capture the weight, or significance, of the variation in each dataset. This formulation creates a single coherent model from the two datasets by using the mathematical variables to identify the similarities and differences between the datasets. The significance of a right basis vector $v_n^T$ in either $D_1$ or $D_2$, in terms of the “generalized fraction” of the overall information that it captures in the dataset, is defined in terms of the corresponding nonnegative generalized singular value $\sigma_{1,n}$ or $\sigma_{2,n}$, respectively,

$$p_{i,n} = \sigma_{i,n}^2 \Big/ \sum_{n=1}^{N} \sigma_{i,n}^2, \quad i = 1, 2. \tag{2.5}$$

The “generalized normalized Shannon entropy” is defined to measure the complexity of each dataset in terms of the distribution of the overall information in the dataset among the shared right basis vectors,

$$0 \le d_i = -(\log N)^{-1} \sum_{n=1}^{N} p_{i,n} \log p_{i,n} \le 1, \quad i = 1, 2. \tag{2.6}$$

An entropy of zero corresponds to an ordered and redundant dataset, in which all the information is captured by a single right basis vector. An entropy of one corresponds to a disordered and random dataset, in which all right basis vectors are of equal significance. Following the relation of the GSVD to the cosine-sine (CS) decomposition [6], the significance of a right basis vector $v_n^T$ in $D_1$ relative to its significance in $D_2$ is defined by the “angular distance” $\theta_n$, as described [5],

$$-\pi/4 \le \theta_n = \arctan(\sigma_{1,n} / \sigma_{2,n}) - \pi/4 \le \pi/4. \tag{2.7}$$

Shared right basis vectors for which $\theta_n \approx \pm\pi/4$ are exclusive to either $D_1$ or $D_2$, respectively, whereas basis vectors for which $|\theta_n| \approx 0$ are common to both. The basis vectors are arranged in decreasing order of their angular distances, i.e., their significance in $D_1$ relative to $D_2$. The GSVD is unique, except in degenerate subspaces, defined by subsets of equal pairs of generalized singular values $\sigma_{1,n}$ and $\sigma_{2,n}$, and up to phase factors of $\pm 1$ of each shared right basis vector $v_n^T$ and the corresponding left basis vectors $u_{1,n}$ and $u_{2,n}$.

Higher-Order Decompositions

The GSVD has also been generalized for the decomposition of more than two matrices. The higher-order generalized singular value decomposition (HOGSVD) is the only matrix decomposition that allows for the simultaneous decomposition of more than two column-matched and row-independent matrices and identifies coordinated patterns among the datasets [7, 8]. In the age of big data, many fields generate complex datasets that describe the measurement of high-dimensional systems that cannot be fully captured in two-dimensional matrices. 
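Returning briefly to the two-matrix case, the quantities in Equations (2.3) through (2.7) can be sketched numerically. The following is a minimal NumPy illustration, not the implementation used in this dissertation: it assumes that both matrices have at least as many rows as columns, that the stacked matrix has full column rank, and that no generalized singular value is exactly zero. The function names (`gsvd`, `significance_fractions`, `normalized_entropy`, `angular_distance`) and the random toy data are hypothetical, and the QR-plus-SVD construction is one standard route to a GSVD rather than the only one.

```python
import numpy as np


def gsvd(D1, D2):
    """Sketch of a GSVD of two column-matched matrices, D_i = U_i diag(sigma_i) Vt.

    Assumes m1 >= n, m2 >= n, full column rank of the stacked matrix,
    and strictly positive sine values (no vector exclusive to D1).
    """
    m1 = D1.shape[0]
    # Reduced QR of the stacked matrix: Q has orthonormal columns.
    Q, R = np.linalg.qr(np.vstack([D1, D2]))
    Q1, Q2 = Q[:m1], Q[m1:]
    # SVD of the top block gives the cosines c (returned in descending order).
    U1, c, Wt = np.linalg.svd(Q1, full_matrices=False)
    # Columns of Q2 @ W are mutually orthogonal with norms s = sqrt(1 - c**2).
    Q2W = Q2 @ Wt.T
    s = np.linalg.norm(Q2W, axis=0)
    U2 = Q2W / s
    # Shared right basis; normalize its rows and absorb the norms into sigma.
    Vt = Wt @ R
    r = np.linalg.norm(Vt, axis=1)
    Vt = Vt / r[:, None]
    return U1, c * r, U2, s * r, Vt


def significance_fractions(sigma):
    """Eq. (2.5): generalized fraction of information per right basis vector."""
    return sigma**2 / np.sum(sigma**2)


def normalized_entropy(p):
    """Eq. (2.6): normalized Shannon entropy, treating 0 * log(0) as 0."""
    N = p.size
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)) / np.log(N))


def angular_distance(sigma1, sigma2):
    """Eq. (2.7): arctan2 equals arctan(sigma1 / sigma2) for nonnegative inputs."""
    return np.arctan2(sigma1, sigma2) - np.pi / 4


# Hypothetical toy data: two row-independent datasets sharing n = 8 columns.
rng = np.random.default_rng(1)
n = 8
D1 = rng.standard_normal((60, n))
D2 = rng.standard_normal((40, n))

U1, sigma1, U2, sigma2, Vt = gsvd(D1, D2)
theta = angular_distance(sigma1, sigma2)
d1 = normalized_entropy(significance_fractions(sigma1))
d2 = normalized_entropy(significance_fractions(sigma2))
```

Because the cosines are returned in descending order, the shared right basis vectors in this sketch come out already arranged by decreasing angular distance, matching the ordering convention stated above.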
Generalizations of the SVD also exist for the decomposition of a higher-order matrix, or tensor, which has more than two dimensions [9, 10]. Single tensor decompositions have provided novel biological insights [11]. The recent formulation of the tensor GSVD enables the simultaneous decomposition of two higher-order tensors that are column-matched in all column dimensions, but are row-independent [12]. The framework is formulated as a comparative spectral decomposition for the higher-order data. Most recently, the higher-order tensor GSVD was formalized for comparative modeling of more than two column-matched and row-independent tensors.

Comparative Spectral Decompositions for Modeling Biological Data

Multimatrix methods have been used successfully for integrative analyses and comparative modeling of biomedical data [13, 14, 15, 16, 17, 18, 19, 20, 21]. Matrix-based methods such as the SVD and its generalizations are particularly well suited to the analysis of genomic datasets, which are intuitively structured as matrices, or two-dimensional tables [22]. Genomic studies often measure hundreds of thousands, or even millions, of observations, such as measurements from individual microarray probes, which can be arranged in rows. The same measurements are made for a set of samples or patients, which can be arranged across the columns. Thus, the dataset has a natural representation as a matrix. In a dataset with this structure, the SVD separates the dataset into patterns of variation across the genome, captured by the left basis vectors, and the coordinated patterns of variation across the samples or patients, captured by the right basis vectors. A high-throughput measurement of a genomic signal, such as that measured from a microarray, captures the superposition of many independent sources of physiological, pathological, and experimental variation (Figure 2.2). 
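The superposition described above can be illustrated with a small numerical sketch. The example below is only a toy under stated assumptions: the matrix size, the two made-up sources of variation, and the noise level are all hypothetical and are not taken from any dataset in this dissertation. It builds a probes-by-samples matrix as a sum of two outer products plus noise, computes the SVD with NumPy, and rebuilds the matrix as the weighted sum of outer products in Equation (2.2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dataset: 200 probes (rows) by 12 samples (columns).
n_probes, n_samples = 200, 12

# Two made-up sources of variation, each a probe pattern times a sample pattern.
probes_a = np.sin(np.linspace(0.0, 4.0 * np.pi, n_probes))  # smooth pattern along the genome
samples_a = np.r_[np.ones(6), -np.ones(6)]                   # two groups of samples
probes_b = np.r_[np.ones(100), np.zeros(100)]                # step pattern along the genome
samples_b = np.tile([1.0, -1.0], 6)                          # alternating, batch-like pattern

# The measured signal is a superposition of both sources plus a little noise.
D = (5.0 * np.outer(probes_a, samples_a)
     + 2.0 * np.outer(probes_b, samples_b)
     + 0.1 * rng.standard_normal((n_probes, n_samples)))

# Thin SVD: columns of U are left basis vectors, rows of Vt are right basis vectors.
U, sigma, Vt = np.linalg.svd(D, full_matrices=False)

# Eq. (2.2): the data as a weighted sum of outer products of the basis vectors.
D_rebuilt = sum(sigma[k] * np.outer(U[:, k], Vt[k]) for k in range(sigma.size))

# Fraction of the overall sum of squares carried by each pair of basis vectors.
fractions = sigma**2 / np.sum(sigma**2)
print(np.round(fractions[:4], 4))
```

In this toy setting the two constructed sources should dominate the leading fractions, with the remaining components carrying only the added noise; real genomic data would not separate this cleanly.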
Because the SVD is a spectral decomposition, it acts as a mathematical prism that separates independent sources of variation into distinct genomic signatures. For a single matrix of genomic data, the SVD provides a robust mathematical framework to describe the data, in which the mathematical variables and operations represent biological reality. The fundamental patterns revealed by the SVD describe the cellular processes and cellular states captured in a single dataset. This enables the separation of biological patterns of interest, such as regulatory programs and biological processes, from unwanted sources of biological or experimental noise, such as batch effects, by mathematically separating them into orthogonal basis vectors [23, 24]. Such matrix decompositions have been effective in modeling biological data and revealing novel insights about the system [25, 26, 27, 28, 29, 30, 31]. Many genomic datasets have a natural representation as two column-matched matrices. Datasets of this structure arise from experiments in which two different sets of measurements are made from the same, or corresponding, samples. This is often the case in genomic studies in which tumor and normal tissue samples from the same patient are measured. The effectiveness of the GSVD for modeling two genomic datasets was first demonstrated by Alter et al. in the comparative modeling of cell cycle phase-matched gene expression data of synchronized cells from humans and yeast. The GSVD, formulated as a comparative spectral decomposition as previously described, revealed the phenomena that are common to both datasets, or exclusive to either one. Common phenomena included shared regulatory programs and biological processes, and exclusive phenomena included experimental artifacts [5]. The model predicted a genome-wide causal coordination between DNA replication and mRNA expression, which was then experimentally verified. 
This study demonstrated that the GSVD can be used to correctly predict previously unknown cellular mechanisms. Recent consortium-style genomic profiling efforts, such as the Cancer Genome Atlas (TCGA), have demonstrated the need for analysis methods capable of separating batch effects from underlying biological signal. These large, high-dimensional datasets are a valuable resource for the cancer research community. However, making robust biological and clinical discoveries from these resources has proven difficult, as evidenced by the relatively few findings that have been successfully translated into clinical tests. This is, in part, due to the challenges associated with combining datasets with different sources of clinical and experimental variation, such as the hospital from which the tumor samples were obtained, the institution at which the molecular profiling was carried out, or even the optical scanner on which a particular microarray was measured. While batch effects can affect even the most highly controlled experiments run in a single laboratory, the effects are greatly compounded when experiments are run across multiple institutions. Existing bioinformatics methods often fail to separate the underlying signal that characterizes the biological system from other sources of clinical and experimental variation. Comparative spectral decompositions effectively address this by mathematically separating these sources of experimental variation from other sources of physiological variation or pathological variation that is of interest to the researchers.

Applications in Personalized Medicine

A primary goal of personalized medicine is to identify patterns of genomic variation that stratify patients into biologically and clinically relevant groups that, for example, correspond to a patient’s survival or response to a particular therapy. 
Then, these “genomic signatures” can be used to stratify new patients who come into the clinic, and subsequently inform their diagnosis and treatment. Comparative spectral decompositions are ideally suited for this task because the model uncovers these coordinated fundamental patterns. In a dataset represented as matrices where genomic regions are arranged across the rows, and a set of patients is arranged across the columns, the GSVD reveals the patterns of variation across the genome (i.e., genomic signatures) that correspond to the pattern of variation observed across the patients. If the mathematical pattern across the patients is correlated with clinically informative variation among the patients, such as differential survival or response to a particular drug, then the associated genomic signature can be studied further, and possibly be used to classify new patients. Recent applications of the GSVD and the tensor GSVD to human cancer data from TCGA demonstrated that comparative spectral decompositions of patient-matched tumor and normal datasets reveal robust genomic signatures that capture complex biological phenomena and stratify patients into clinically and biologically relevant groups that predict patient outcome and treatment [12, 32, 33].

GSVD Modeling of Glioblastoma Copy-Number Profiles

In a recent study of GBM copy-number profiles, we showed that the GSVD is an effective framework for the comparative modeling of DNA copy-number profiles measured from patient-matched glioblastoma (GBM) brain tumor and normal tissue samples. The GSVD effectively separates noise, such as batch effects associated with a specific research institution or optical scanner, as well as other sources of biological variation such as female-specific X chromosome amplification. This noise is removed from the signal of interest, revealing a pattern of tumor-exclusive co-occurring copy-number aberrations (CNAs). 
We find that patients whose tumor profiles are very similar to this pattern have a significantly shorter survival time than patients whose tumor profiles are dissimilar. The pattern is a better prognostic indicator than those currently used in the clinic, such as a patient’s age at diagnosis. The analysis also revealed the focal amplification of putative drug targets TLK2 and METTL2A, which had not been previously identified. In preliminary experiments, RNA silencing of each of these targets led to GBM-specific reduced viability in NCI60 cell lines. This study demonstrates that comparative spectral decompositions are effective for modeling human cancer data, and reveal clinically relevant patterns that can be used to predict patient outcome and identify putative drug targets [32].

Tensor GSVD Modeling of Ovarian Cancer Copy-Number Profiles

In a study of ovarian cystadenocarcinoma (OV) patient profiles, we used the tensor GSVD framework for the decomposition of two column-matched and row-independent tensors to identify patterns of copy-number variation. This tensor GSVD uncovered three separate chromosome arm-wide patterns of CNAs that predict OV patient survival and response to platinum-based chemotherapy. This work also demonstrated that the patterns identified by matrix and tensor decompositions capture complex biological mechanisms including human cell transformation, DNA stability, and cellular immune response [12, 34].

Figure 2.1: The GSVD factorizes two datasets, $D_1$ and $D_2$, into two sets of column-wise orthonormal left basis vectors, $U_1$ and $U_2$, two sets of generalized singular values, $\Sigma_1$ and $\Sigma_2$, and one shared set of normalized right basis vectors, $V^T$.

Figure 2.2: High-throughput measurements are a superposition of signals. These measurements, such as those made by a microarray, capture a superposition of many independent sources of variation.
This includes physiological variation, such as copy number variation in the sex chromosomes or genomic variation that dictates cell phenotype, pathological variation associated with a particular disease, and experimental variation due to batch effects or a particular measurement instrument. The superimposed signals can be separated using spectral decompositions.

References

[1] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
[2] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3. JHU Press, 2012.
[3] C. F. Van Loan, “Generalizing the singular value decomposition,” SIAM Journal on Numerical Analysis, vol. 13, no. 1, pp. 76–83, 1976.
[4] C. C. Paige and M. A. Saunders, “Towards a generalized singular value decomposition,” SIAM Journal on Numerical Analysis, vol. 18, no. 3, pp. 398–405, 1981.
[5] O. Alter, et al., “Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms,” Proceedings of the National Academy of Sciences, vol. 100, no. 6, pp. 3351–3356, 2003.
[6] C. Van Loan, “Computing the CS and the generalized singular value decompositions,” Numerische Mathematik, vol. 46, no. 4, pp. 479–491, 1985.
[7] S. P. Ponnapalli, et al., “A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms,” PLoS One, vol. 6, no. 12, p. e28072, 2011.
[8] L. Omberg, et al., “Global effects of DNA replication and DNA replication origin activity on eukaryotic gene expression,” Molecular Systems Biology, vol. 5, no. 1, p. 312, 2009.
[9] L. Omberg, et al., “A tensor higher-order singular value decomposition for integrative analysis of DNA microarray data from different studies,” Proceedings of the National Academy of Sciences, vol. 104, no. 47, pp. 18371–18376, 2007.
[10] L. De Lathauwer, et al., “A multilinear singular value decomposition,” SIAM Journal on Matrix Analysis and Applications, vol. 21, no. 4, pp. 1253–1278, 2000.
[11] C. Muralidhara, et al., “Tensor decomposition reveals concurrent evolutionary convergences and divergences and correlations with structural motifs in ribosomal RNA,” PLoS One, vol. 6, no. 4, p. e18768, 2011.
[12] P. Sankaranarayanan, et al., “Tensor GSVD of patient- and platform-matched tumor and normal DNA copy-number profiles uncovers chromosome arm-wide patterns of tumor-exclusive platform-consistent alterations encoding for cell transformation and predicting ovarian cancer survival,” PLoS One, vol. 10, no. 4, p. e0121396, 2015.
[13] M. J. Brauer, et al., “Conservation of the metabolomic response to starvation across two divergent microbes,” Proceedings of the National Academy of Sciences, vol. 103, no. 51, pp. 19302–19307, 2006.
[14] W. De Clercq, et al., “Canonical correlation analysis applied to remove muscle artifacts from the electroencephalogram,” IEEE Transactions on Biomedical Engineering, vol. 53, no. 12, pp. 2583–2587, 2006.
[15] A. W. Schreiber, et al., “Combining transcriptional datasets using the generalized singular value decomposition,” BMC Bioinformatics, vol. 9, no. 1, p. 335, 2008.
[16] Y. Sun, et al., “Evolutionarily conserved transcriptional co-expression guiding embryonic stem cell differentiation,” PLoS One, vol. 3, no. 10, p. e3406, 2008.
[17] X. Xiao, et al., “Exploring metabolic pathway disruption in the subchronic phencyclidine model of schizophrenia with the generalized singular value decomposition,” BMC Systems Biology, vol. 5, no. 1, p. 72, 2011.
[18] E. Acar, et al., “Coupled matrix factorization with sparse factors to identify potential biomarkers in metabolomics,” in Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on, pp. 1–8, IEEE, 2012.
[19] X. Xiao, et al., “Multi-tissue analysis of co-expression networks by higher-order generalized singular value decomposition identifies functionally coherent transcriptional modules,” PLoS Genetics, vol. 10, no. 1, p. e1004006, 2014.
[20] O. A. Tomescu, et al., “Integrative omics analysis. A study based on Plasmodium falciparum mRNA and protein data,” BMC Systems Biology, vol. 8, no. 2, p. S4, 2014.
[21] E. Acar, et al., “Data fusion in metabolomics using coupled matrix and tensor factorizations,” Proceedings of the IEEE, vol. 103, no. 9, pp. 1602–1620, 2015.
[22] O. Alter, “Genomic signal processing: from matrix algebra to genetic networks,” in Microarray Data Analysis: Methods and Applications (M. J. Korenberg, ed.), vol. 377, pp. 17–59, Humana Press, 2007.
[23] O. Alter, et al., “Singular value decomposition for genome-wide expression data processing and modeling,” Proceedings of the National Academy of Sciences, vol. 97, no. 18, pp. 10101–10106, 2000.
[24] T. O. Nielsen, et al., “Molecular characterisation of soft tissue tumours: a gene expression study,” The Lancet, vol. 359, no. 9314, pp. 1301–1307, 2002.
[25] O. Alter, et al., “Processing and modeling genome-wide expression data using singular value decomposition,” Microarrays: Optical Technologies and Informatics, vol. 4266, pp. 171–186, 2001.
[26] O. Alter and G. H. Golub, “Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 47, pp. 16577–16582, 2004.
[27] O. Alter, et al., “Novel genome-scale correlation between DNA replication and RNA transcription during the cell cycle in yeast is predicted by data-driven models,” in Miami Nature Biotechnology Winter Symposium: Cell Cycle, Chromosomes and Cancer, vol. 15, University of Miami School of Medicine, 2004.
[28] O. Alter and G. H. Golub, “Reconstructing the pathways of a cellular system from genome-scale signals by using matrix and tensor computations,” Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 49, pp. 17559–17564, 2005.
[29] O. Alter and G. H. Golub, “Singular value decomposition of genome-scale mRNA lengths distribution reveals asymmetry in RNA gel electrophoresis band broadening,” Proceedings of the National Academy of Sciences, vol. 103, no. 32, pp. 11828–11833, 2006.
[30] O. Alter, “Discovery of principles of nature from mathematical modeling of DNA microarray data,” Proceedings of the National Academy of Sciences, vol. 103, no. 44, pp. 16063–16064, 2006.
[31] N. M. Bertagnolli, et al., “SVD identifies transcript length distribution functions from DNA microarray data and reveals evolutionary forces globally affecting GBM metabolism,” PLoS One, vol. 8, no. 11, p. e78913, 2013.
[32] C. H. Lee, et al., “GSVD comparison of patient-matched normal and tumor aCGH profiles reveals global copy-number alterations predicting glioblastoma multiforme survival,” PLoS One, vol. 7, no. 1, p. e30098, 2012.
[33] K. A. Aiello and O. Alter, “Platform-independent genome-wide pattern of DNA copy-number alterations predicting astrocytoma survival and response to treatment revealed by the GSVD formulated as a comparative spectral decomposition,” PLoS One, vol. 11, no. 10, p. e0164546, 2016.
[34] O. Alter, “DNA copy-number alterations in primary ovarian serous cystadenocarcinoma encoding for cell transformation and predicting survival and response to platinum therapy throughout the course of the disease,” in Clinical Cancer Research, vol. 22, American Association for Cancer Research, 2016.

CHAPTER 3

DNA COPY-NUMBER ALTERATIONS PREDICTING LOWER-GRADE ASTROCYTOMA OUTCOME

This chapter is part of a published journal article in PLoS One Vol. 11 No. 10, article e0164546 (October 2016). Reprinted with minor revisions in accordance with the Creative Commons Attribution (CC BY) license. Authors of this chapter are K. A. Aiello and O. Alter.
Introduction

Recurring DNA copy-number alterations (CNAs) have been recognized as a hallmark of cancer for >100 years [1, 2, 3], yet what these alterations imply about a solid tumor’s development and progression, and a patient’s diagnosis, prognosis, and treatment remains poorly understood. This is despite the growing number of high-dimensional datasets, recording different aspects of a single disease, such as DNA copy-number profiles of two or more cell types from the same set of patients, possibly measured more than once by different platforms. This is due to an existing fundamental need for mathematical frameworks that can create a single coherent model from such datasets.

A recent comparison of DNA copy-number profiles of just two cell types, primary tumor and normal, from the same set of ovarian serous cystadenocarcinoma (OV) patients, measured by the same set of platforms, uncovered several tumor-exclusive platform-consistent chromosome arm-wide patterns of DNA CNAs that are correlated with a patient’s survival and response to platinum-based chemotherapy [4]. The datasets had been publicly available in the Cancer Genome Atlas (TCGA) since 2011, and analyzed by using several methods, e.g., hierarchical clustering [5]. These patterns of CNAs, however, remained unknown until the datasets were modeled in 2015 by using a novel comparative spectral decomposition: the tensor generalized singular value decomposition (GSVD). For >30 years prior to this discovery, statistically the best indicator of ovarian cancer survival was the tumor’s stage at diagnosis [6]. About 25% of primary ovarian tumors are resistant to platinum therapy, the first-line treatment, yet no diagnostic existed to distinguish resistant from sensitive tumors before the treatment [7].
A previous comparison of copy-number profiles of primary tumor and normal cells from the same set of glioblastoma (GBM) brain cancer patients uncovered a tumor-exclusive genome-wide pattern of CNAs that is correlated with a patient’s survival and response to chemotherapy [8]. The GSVD separated this pattern of CNAs, which occurs only in the tumor genomes, from patterns of copy-number variations (CNVs) that occur in the genomes of normal cells (e.g., female-specific X chromosome amplification) and from variations caused by experimental inconsistencies (e.g., in tissue batch, genomic center, hybridization date, and scanner) that were exclusive to the profiling of either the tumor or the normal samples, without a priori knowledge of these variations. The datasets had been publicly available in TCGA since 2008, and analyzed by using several methods [9], but this pattern of CNAs remained unknown until the datasets were modeled in 2012 by using the GSVD [10, 11] formulated as a comparative spectral decomposition [12]. For >50 years prior to this discovery, statistically the best indicator of GBM outcome was the patient’s age at diagnosis [13, 14, 15]. Copy-number subtypes of GBM, i.e., grade IV astrocytoma, which are predictive of survival and response to treatment were not conclusively identified [16, 17].

Here, we build upon those results by using the GSVD to study lower-grade astrocytoma (LGA), i.e., grades II and III, patients’ copy-number profiles, enabling genomic determination of the prognosis of the LGA patients. To date, statistically the best indicators of LGA outcome in clinical use remain the patient’s age at diagnosis and the tumor’s grade, with older age and higher tumor grade being associated with worse prognosis [18, 19].

Mathematical Framework: The GSVD

To identify CNAs that might predict LGA outcome, we modeled TCGA patient-matched LGA tumor and normal DNA copy-number profiles by using the GSVD formulated as a comparative spectral decomposition.
We selected patient-matched Affymetrix-measured DNA copy-number profiles of primary LGA tumor and normal tissue samples from a discovery set of 59 patients (Methods). The structure of these tumor and normal datasets is that of two full column-rank matrices $D_1 \in \mathbb{R}^{M_1 \times N}$ and $D_2 \in \mathbb{R}^{M_2 \times N}$ of $N = 59$ matched columns (i.e., patients), but independent, i.e., not necessarily matched, $M_1, M_2 =$ 933,827 rows (i.e., tumor and normal genomic regions, or Affymetrix probes), where $M_1, M_2 \gg N$ (Figure 3.1).

The GSVD simultaneously separates the two matrices, or tumor- and normal-specific datasets, into paired weighted sums of outer products of each normalized right basis vector, or “probelet” $v_n^T$ (i.e., a pattern of variation across the patients), which is identical for both datasets, combined with one of the two corresponding orthonormal left basis vectors, or “tumor arraylet” $u_{1,n}$ and “normal arraylet” $u_{2,n}$ (i.e., the tumor- and normal-specific patterns of variation across the genome), as given by Equation 2.4. The significance of a probelet $v_n^T$ in either the tumor dataset $D_1$ or the normal dataset $D_2$, in terms of the “generalized fraction” of the overall information that it captures in the dataset, is proportional to the corresponding nonnegative generalized singular value $\sigma_{1,n}$ or $\sigma_{2,n}$, respectively, as defined by Equation 2.5. The “generalized normalized Shannon entropy,” defined by Equation 2.6, is a measure of the complexity of each dataset in terms of the distribution of the overall information in the dataset among the probelets. An entropy of zero corresponds to an ordered and redundant dataset, in which all the information is captured by a single probelet. An entropy of one corresponds to a disordered and random dataset, in which all probelets are of equal significance.
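
The two summary quantities described above can be sketched in a few lines of Python. This is only an illustration: Equations 2.5 and 2.6 are not reproduced in this chapter, so the squared-value form of the fraction and the division by log N are assumptions based on the usual convention for SVD-type decompositions, and the function names are hypothetical.

```python
import math


def generalized_fractions(sigmas):
    """Fraction of the overall information captured by each probelet.

    Assumption: p_n = sigma_n**2 / sum_m sigma_m**2 (check against Equation 2.5).
    """
    squares = [s * s for s in sigmas]
    total = sum(squares)
    if total <= 0:
        raise ValueError("at least one generalized singular value must be nonzero")
    return [sq / total for sq in squares]


def normalized_shannon_entropy(fractions):
    """Entropy of the fractions, scaled by log(N) so that it lies in [0, 1].

    Assumption: d = -(1 / log N) * sum_n p_n * log(p_n) (check against Equation 2.6).
    """
    n = len(fractions)
    if n < 2:
        return 0.0
    # Terms with p == 0 contribute nothing, following the limit p*log(p) -> 0.
    entropy = sum(-p * math.log(p) for p in fractions if p > 0)
    return entropy / math.log(n)


# Hypothetical generalized singular values, for illustration only.
fractions = generalized_fractions([3.0, 1.0, 1.0, 1.0])
print(fractions[0])                           # 0.75: one dominant probelet
print(normalized_shannon_entropy(fractions))  # strictly between 0 and 1
```

Under these assumptions, a dataset dominated by one probelet gives an entropy near zero, and a dataset whose probelets all carry equal weight gives an entropy of one, matching the two limiting cases stated in the text.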
Following the relation of the GSVD to the cosine-sine (CS) decomposition [20], the significance of a probelet $v_n^T$ in the tumor dataset $D_1$ relative to its significance in the normal dataset $D_2$ is defined by the “angular distance” $\theta_n$, according to Equation 2.7 as described previously [12]. Probelets for which $\theta_n \approx \pm\pi/4$ are exclusive to either the tumor dataset $D_1$ or the normal dataset $D_2$, respectively, whereas probelets for which $|\theta_n| \approx 0$ are common to both. The probelets are arranged in decreasing order of their angular distances, i.e., their significance in the tumor dataset relative to the normal dataset. The GSVD is unique, except in degenerate subspaces, defined by subsets of equal pairs of generalized singular values $\sigma_{1,n}$ and $\sigma_{2,n}$, and up to phase factors of $\pm 1$ of each probelet $v_n^T$ and the corresponding tumor and normal arraylets $u_{1,n}$ and $u_{2,n}$.

We find that the two most tumor-exclusive patterns of variation across the patients, i.e., the first and second probelets, with angular distances $\theta_1, \theta_2 > 2\pi/15$, are also the first and third most significant probelets in the tumor dataset, with >8% and 5% of the information in this dataset, respectively (Figure 3.2(a)). The 53rd probelet, which with ∼10% of the information is the most significant probelet in the normal dataset (Figure 3.2(b)), is approximately common to both datasets with $|\theta_{53}| < \pi/16$. The GSVD, therefore, creates a single coherent model of the two datasets by simultaneously identifying unique probelets that are significant in, and common to the two datasets, as well as those that are significant in, and exclusive to either one of the datasets. We interpret the model accordingly, in terms of the biological and experimental phenomena that are common to the LGA tumor and normal profiles, as well as those that are exclusive to the LGA tumor or the normal profiles.
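
The angular distance and the ordering of the probelets described above can be sketched as follows. Equation 2.7 is not reproduced in this chapter; the arctangent form used here is an assumption, chosen because it reproduces the stated range of −π/4 (normal-exclusive) through 0 (common) to +π/4 (tumor-exclusive). The function names and the input values are hypothetical.

```python
import math


def angular_distance(sigma_tumor, sigma_normal):
    """Relative significance of one probelet in the tumor vs. the normal dataset.

    Assumption: theta_n = arctan(sigma_1n / sigma_2n) - pi/4.
    atan2 is used so that sigma_normal == 0 gives +pi/4 instead of a division error.
    """
    return math.atan2(sigma_tumor, sigma_normal) - math.pi / 4


def order_by_tumor_exclusivity(sigmas_tumor, sigmas_normal):
    """Return (index, theta) pairs sorted from most tumor-exclusive to most normal-exclusive."""
    thetas = [angular_distance(s1, s2) for s1, s2 in zip(sigmas_tumor, sigmas_normal)]
    order = sorted(range(len(thetas)), key=lambda n: thetas[n], reverse=True)
    return [(n, thetas[n]) for n in order]


# Hypothetical generalized singular values for three probelets.
for index, theta in order_by_tumor_exclusivity([0.2, 1.0, 0.9], [1.0, 1.0, 0.1]):
    print(index, round(theta, 3))
```

In this toy input the third probelet (index 2) is listed first because its tumor weight far exceeds its normal weight, the second has equal weights and an angular distance of zero, and the first is mostly normal-exclusive.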
Biological Results

The GSVD separates the LGA pattern from CNVs common to the normal human and LGA tumor genomes and tumor-exclusive experimental batch effects. This is because the second tumor arraylet, which describes the LGA pattern, is mathematically orthogonal to the other tumor arraylets, which describe other sources of biological and experimental variation that compose the tumor dataset. For example, the first tumor arraylet, which is mathematically the most significant one in the tumor dataset, describes mostly unsegmented chromosomes [21, 22], each with a copy-number distribution that is approximately centered at the autosomal genome with a relatively large, chromosome-invariant width (Figure 3.3(a)). The first probelet, which is mathematically the most tumor-exclusive probelet, is correlated with a tumor-exclusive experimental variation in the hybridization plate of the LGA tumor samples, with both hypergeometric [23] and Mann-Whitney-Wilcoxon $P$-values $<10^{-2}$ (Figure 3.3(b) and Figure 3.4(a)). Together, the first probelet and tumor arraylet describe a tumor-exclusive experimental batch effect.

The 53rd normal arraylet (Figure 3.5(d)), which is mathematically the most significant one in the normal dataset, and the 53rd LGA tumor arraylet (Figure 3.6(a)), both describe a deletion of the X chromosome relative to the autosomal genome. Consistently, the 53rd probelet, which is mathematically approximately common to the tumor and normal datasets, classifies the patients by gender, with both hypergeometric and Mann-Whitney-Wilcoxon $P$-values $<10^{-9}$ (Figure 3.5(e) and Figure 3.4(b)). Together, the 53rd probelet and arraylets describe a male-specific X chromosome deletion, a CNV across the normal genomes (Figure 3.5(f)) that is conserved in the patient-matched LGA tumor genomes.
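
The hypergeometric enrichment $P$-values quoted above follow the cumulative form given under Probelet Interpretation in Methods, $P(k; n, N, K) = \binom{N}{n}^{-1} \sum_{i=k}^{n} \binom{K}{i} \binom{N-K}{n-i}$. A minimal Python sketch of that sum is shown below; the function name and the example counts are hypothetical and are not taken from the study.

```python
from math import comb


def hypergeometric_p_value(k, n, N, K):
    """P(k; n, N, K): probability of drawing at least k annotated patients.

    N patients in total, K of them carry the annotation, and n patients are
    in the subset with high (or low) relative copy numbers in the probelet.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= n):
        raise ValueError("expected 0 <= K <= N, 0 <= n <= N and 0 <= k <= n")
    # Exact integer sum of the tail terms. comb() returns 0 when i > K or
    # n - i > N - K, so impossible terms drop out on their own.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, n + 1))
    return tail / comb(N, n)


# Hypothetical counts: 10 patients, 4 annotated, and 2 of the 3 patients in
# the high-copy-number subset carry the annotation.
print(hypergeometric_p_value(2, 3, 10, 4))  # 40/120, i.e. one third
```

Summing in exact integers before the final division avoids the rounding that accumulates when many small floating-point terms are added.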
Note that although the male-specific X chromosome deletion is conserved in the tumor genomes (Figure 3.5(b)), the LGA pattern, which is described by the second tumor arraylet, exhibits an unsegmented X chromosome copy-number distribution that is approximately centered at the autosomal genome with a relatively small, invariant width (Figure 3.5(a)). This illustrates the separation of the LGA tumor-exclusive pattern from the male-specific X chromosome deletion that is common to the tumor and normal profiles.

This GSVD separation of the LGA tumor and normal datasets into probelets, and tumor and normal arraylets, is blind, that is, without a priori knowledge of the sources of variation that compose the datasets. The TCGA annotations that describe the patients (e.g., gender), and the corresponding tumor and normal samples (e.g., the hybridization plate of the tumor vs. the normal samples) are used only to interpret the patterns of variation across the patients, and the tumor and normal genomes, which were uncovered by the GSVD.

The LGA Pattern is Correlated with LGA Outcome

To examine the correlation of the LGA pattern with an LGA patient’s survival, we classified the discovery set of patients based upon the weight of the pattern, that is, the superposition coefficient of the second LGA tumor arraylet, in each patient’s tumor profile. These coefficients are linearly proportional to the relative copy numbers listed in the second LGA probelet. For the cutoff to be consistent with that previously established for the GBM pattern [8], we scaled the second GBM arraylet correlation cutoff of 0.15 by the Euclidean norm, i.e., the 2-norm, of the Pearson correlations of the discovery tumor profiles with the second LGA tumor arraylet. The second probelet classifies the discovery set of patients into two groups of statistically significantly different prognoses (Figure 3.7(a)).
The Kaplan-Meier (KM) [25] median survival time of 63 months of the group of patients with low coefficients is more than three times greater than that of the group of patients with high coefficients, with the corresponding log-rank test $P$-value $<10^{-4}$. The univariate Cox [24] proportional hazard ratio is >9. This means that a high weight of the LGA pattern in an LGA tumor’s profile confers >9 times the hazard of a low weight.

To examine the correlation of the pattern with response to treatment, we classified the discovery set of patients by the GSVD and, in addition, by chemotherapy or radiation. Among the patients who were treated by either chemotherapy or radiation, the KM median survival time of the groups of patients with low coefficients is ∼3.5 times, and ∼4 years greater than the median survival time of the groups of patients with high coefficients. A low weight of the LGA pattern in an LGA tumor’s profile is, therefore, correlated with a significantly longer survival time, also in response to chemotherapy or radiation.

To computationally validate that the LGA pattern is correlated with LGA outcome, we classified the Affymetrix-measured primary LGA tumor profiles of a validation set of 74 TCGA patients, mutually exclusive of the discovery set. The classification is based upon the correlation of the second LGA tumor arraylet with each patient’s tumor profile across the 933,827 Affymetrix probes, at a cutoff of 0.15 which is consistent with the previously published cutoff [8]. We find that the results of the survival analyses of the LGA validation set are consistent with those of the LGA discovery set. Note also that in classifying the tumor profiles, the 8,102 Agilent-matched Affymetrix probes and, separately, the 4,697 consistently-aberrated probes among them, give qualitatively the same and quantitatively similar results as the 933,827 Affymetrix probes.
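
The validation-set classification described above, a Pearson correlation of each tumor profile with the second LGA tumor arraylet compared against the 0.15 cutoff, can be sketched as follows. The representation of missing probe values as `None`, the strict `>` comparison at the cutoff, and all names are assumptions made for illustration. The exclusion of probes with missing data follows the description in Methods.

```python
import math


def pearson(profile, pattern):
    """Pearson correlation over the probes that have data in both vectors."""
    pairs = [(x, y) for x, y in zip(profile, pattern) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two probes with data")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def classify_profile(profile, pattern, cutoff=0.15):
    """Label a tumor profile by the weight of the pattern it carries.

    Assumption: a correlation strictly above the cutoff counts as 'high'.
    """
    return "high" if pearson(profile, pattern) > cutoff else "low"


# Toy five-probe pattern and profile; the third probe is missing in the profile.
pattern = [0.4, -0.3, 0.1, 0.5, -0.6]
profile = [0.3, -0.2, None, 0.6, -0.4]
print(classify_profile(profile, pattern))  # 'high': the profile tracks the pattern
```

The two resulting groups would then be compared by a survival analysis, which is not sketched here.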
The GSVD Reveals a Genome-Wide LGA Tumor-Exclusive Pattern of CNAs Encompassed in the GBM Pattern

In a previous GSVD comparison of patient-matched Agilent-measured DNA copy-number profiles of primary GBM tumor and normal samples, we found that the second most GBM tumor-exclusive tumor arraylet describes a genome-wide pattern of co-occurring CNAs that is correlated with a GBM patient’s outcome [8]. Now, we find that the second LGA tumor arraylet describes a genome-wide pattern of co-occurring CNAs across the Affymetrix probes (Figure 3.5(a), and Figure 3.8(a)), which is similar to the GBM pattern (Figure 3.9(a)).

To compare the LGA to the GBM pattern, we assigned to the LGA pattern CNAs in the chromosomes and chromosome arms as well as the genomic segments that were identified in the GBM pattern (Methods). We find that the LGA pattern is encompassed in the GBM pattern. Chromosomes, chromosome arms, and segments that are amplified or deleted in the LGA pattern are also amplified or deleted in the GBM pattern, respectively, and at a greater magnitude; some of those that show no copy-number change in the LGA pattern are amplified or deleted in the GBM pattern.

Dominant in the LGA pattern, but at a lesser magnitude than in the GBM pattern, are the known, GBM-associated gain of chromosome 7 and loss of chromosome 10 [16, 17]. Also dominant in the LGA pattern, also at a lesser magnitude than in the GBM pattern, are GBM-associated focal CNAs ([8] and see also [9]).

Among the LGA-shared GBM-associated focal CNAs, we find amplifications and deletions that contribute to a decreased activity of the tumor suppressor protein p53. These include gains of segments containing the p53-inactivating protein-encoding MDM4 (1q32.1) and the p53-degrading protein-encoding MDM2 (12q15), and losses of segments containing CDKN2A (9p21.3) and PTEN (10q23.31). The tumor suppressor protein encoded by PTEN negatively regulates the Mdm2 protein via the Akt pathway.
Of the three known transcript variants of CDKN2A, one encodes a p53-stabilizing, Mdm2-sequestering protein. The other two variants encode isoforms of the tumor suppressor protein p16. Together with the retinoblastoma (Rb) protein tumor suppressor, and in parallel to p53 [26], p16 acts as a checkpoint of human normal to tumor cell transformation, by promoting cell cycle arrest, apoptosis, and senescence in response to rat sarcoma virus (Ras)-mediated hyperactive growth factor signaling [27, 28].

Amplifications and deletions that are involved in increased growth factor signaling are also among the LGA-shared GBM-associated CNAs. These include gains of segments containing the epidermal growth factor receptor EGFR (7p11.2), the hepatocyte growth factor receptor MET (7q31.2), and the fibroblast growth factor receptor (FGFR) substrate FRS2 (12q15) [29], and a loss of a segment containing the transforming growth factor-β (TGF-β)-induced growth inhibitor CDKN2B (9p21.3) [30].

Additional LGA- and GBM-shared focal amplifications and deletions contribute to a suppression of the tumor suppressor protein Ptc1 by the Hedgehog (Hh) signaling pathway (Figure 3.10). These include gains of segments containing the Hh ligand-encoding SHH (7q36.3) and the Hh signal-transducing protein-encoding SMO (7q32.1), and a loss of a segment containing the Hh negative regulator protein-encoding SUFU (10q24.32) [31]. Note that reduced Ptc1 activity is also shared by the brain cancer medulloblastoma, where it was shown to contribute to the development of the tumor [32].

The GBM pattern consists of additional CNAs that are missing from the LGA pattern, including the GBM-associated loss of the short arm of chromosome 9 (9p). Among the focal GBM-specific CNAs we find amplifications that contribute to decreased Rb activity.
These include gains of segments containing the Rb-binding protein-encoding KDM5A (12p13.33), the Rb-phosphorylating protein-encoding CDK4 (12q14.1), and cyclin E1 CCNE1 (19q12), whose repression by Rb is necessary to prevent replication of senescent cells [33, 34]. Additional GBM-specific gains are of segments containing the oncogenes AKT3 (1q44) [35] and Harvey Ras-encoding HRAS (11p15.5) [36]. We find, therefore, that the GBM-specific amplifications, of AKT3 and HRAS, and those that are involved in decreased Rb activity, together with the LGA- and GBM-shared deletion of the p16-encoding CDKN2A, and the other deletions and amplifications that are involved in decreased activity of p53, enhance the opportunity for human normal to tumor cell transformation in response to growth factor signaling in GBM relative to LGA (Figure 3.11).

Gains of segments containing putative drug targets are also among the GBM-specific CNAs, including the methyltransferase-encoding METTL2B (7q32.1) and METTL2A (17q23.2), and the serine/threonine kinase-encoding TLK2 (17q23.2) [8, 37].

To additionally compare the LGA and GBM patterns, we identified 8,102 pairs of one-to-one overlapping Affymetrix and Agilent probes among the 933,827 Affymetrix probes of the LGA pattern and the 212,696 Agilent probes of the GBM pattern (Methods). The GBM-associated CNAs in chromosomes, chromosome arms, and segments are visible across the 8,102 pairs of probes, even though these are <1% and 4% of the probes that constitute the LGA and GBM patterns, respectively. The LGA-shared CNAs are visible in both the LGA and GBM patterns, whereas the GBM-specific CNAs are visible only in the GBM pattern (Figure 3.9(b) and Figure 3.8(b)). By assigning to the LGA and GBM patterns CNAs in the 8,102 Affymetrix and Agilent probes, respectively, we additionally identified 4,697 pairs of one-to-one overlapping probes that are consistently aberrated in the LGA and GBM patterns.
We find that the LGA-shared CNAs are visible across these 4,697 pairs of probes, in both the LGA and GBM patterns (Figure 3.9(c) and Figure 3.8(c)). The GBM-specific CNAs are visible only across the remaining 3,405 Agilent probes in the GBM pattern (Figure 3.9(d) and Figure 3.8(d)).

Discussion

A GSVD comparison of patient-matched profiles of LGA tumor and normal samples revealed a tumor-exclusive genome-wide pattern of CNAs. We showed, and computationally validated, that this LGA pattern is correlated with an LGA patient’s outcome. The GSVD separated this pattern from other sources of experimental and biological variation, common to the tumor and normal profiles, or exclusive to the tumor or the normal profiles, without a priori knowledge of these variations. We also showed that the LGA pattern is encompassed in a previously-identified genome-wide pattern of CNAs identified in GBM [8], where GBM-specific CNAs encode for enhanced opportunities for transformation and proliferation via growth and developmental signaling pathways in GBM relative to LGA.

The LGA datasets had been publicly available in TCGA since 2015, and analyzed by using several methods. The pattern, however, remained unknown until the datasets were modeled by using the GSVD. This illustrates the ability of comparative spectral decompositions in general, and the GSVD in particular, to find what other methods miss.

Note that in a GSVD comparison between two datasets, the only assumption is that the structure of the datasets is that of two full column-rank matrices of matched columns. It is, therefore, not limited to profiles of human cells, DNA copy-number profiles, or profiles measured by DNA microarray platforms, nor is it limited to molecular biological datasets. The GSVD was first formulated as a comparative spectral decomposition to model cell cycle phase-matched mRNA expression profiles of synchronized cells from human and yeast [12].
The model predicted a genome-wide causal coordination between DNA replication and mRNA expression, which was then experimentally verified. This demonstrated that the GSVD can be used to correctly predict previously unknown cellular mechanisms. Since that first formulation [12], the GSVD has been used to separate the similar from the dissimilar between different species, as well as between different types of molecular biological profiles, mostly large-scale (e.g., mRNA and protein expression in addition to DNA copy-number profiles), and different profiling technologies (e.g., NGS and quantitative real-time PCR in addition to DNA microarray platforms) [38, 39, 40, 41] (see also [42, 43]).

Methods

LGA Discovery Datasets Construction

We selected an LGA discovery set of 59 TCGA patients. The 59 patients were diagnosed with World Health Organization (WHO) grades III or II astrocytoma. The patient-matched primary LGA tumor and normal tissue samples were obtained from US tissue source sites. The TCGA survival annotations available for these patients were consistent. Each tumor or normal profile lists median-centered $\log_2$ TCGA raw level 2 of the Affymetrix Genome-Wide Human SNP Array 6.0-measured DNA copy numbers. The profiles are organized in one tumor and one normal dataset, of $M_1, M_2 =$ 933,827 autosomal and X chromosome nonpolymorphic copy-number probes, with valid data in all $N = 59$ patient-matched pairs of tumor and normal profiles, respectively.

Arraylet Visualization

To visualize the first tumor arraylet and 53rd normal arraylet, we segmented each arraylet and assigned each segment a $P$-value by using circular binary segmentation (CBS), as described [22].
Probelet Interpretation

To biologically or experimentally interpret the first and 53rd probelets, which are the most significant probelets in the tumor and normal datasets, respectively, we assessed the subsets of patients that are of high or low relative copy numbers in each probelet for enrichment in any one of the multiple TCGA annotations that describe the patients, e.g., gender, and the corresponding tumor and normal tissue samples, e.g., the hybridization plate of the tumor samples. Note that the copy numbers that are listed in the first probelet, which is also the most tumor-exclusive probelet, are linearly proportional to the weights or superposition coefficients of the first tumor arraylet in the tumor profiles of the patients. The copy numbers listed in the 53rd probelet, which is also approximately common to the tumor and normal datasets, are linearly proportional to the coefficients of the 53rd tumor and normal arraylets in the tumor and normal profiles of the patients, respectively. The P-value of each enrichment was calculated assuming a hypergeometric probability distribution of the K annotations among the N patients of the discovery set, and of the subset of k ⊆ K observed annotations among the subset of n patients that are of high or low copy numbers in each probelet, as described [23],

$$P(k; n, N, K) = \binom{N}{n}^{-1} \sum_{i=k}^{n} \binom{K}{i} \binom{N-K}{n-i}.$$

In each probelet, we also assessed the distribution of the copy numbers among the different groups of each TCGA annotation by using boxplots, and calculating the corresponding Mann-Whitney-Wilcoxon P-values.

LGA Validation Dataset Construction

We selected an LGA validation set of 74 TCGA patients, which is mutually exclusive of the discovery set. The 74 patients were diagnosed with WHO grades III or II astrocytoma. The primary LGA tumor samples of the patients were obtained from US tissue source sites. The TCGA survival annotations available for these patients were consistent.
Each tumor profile lists median-centered log2 TCGA raw level 2 of the Affymetrix Genome-Wide Human SNP Array 6.0-measured DNA copy numbers across the same M1 = 933,827 probes of the LGA pattern. Missing data were not estimated. Probes among the 933,827 Affymetrix probes of the LGA pattern, the 8,102 Agilent-matched probes, or the 4,697 Agilent-matched consistently-aberrated probes, which are missing data in any one profile, were excluded from the calculations of this profile’s median copy number as well as the profile’s Pearson correlations with the LGA and GBM patterns.

CNAs in the LGA Pattern

To compare the Affymetrix-derived LGA pattern to the Agilent-derived GBM pattern, we mapped the 933,827 Affymetrix probes that constitute the LGA pattern onto the National Center for Biotechnology Information (NCBI) human genome sequence build 36 at the University of California at Santa Cruz (UCSC) human genome browser [21]. Previously, we also mapped the 212,696 probes of the Agilent Human Genome CGH 244A microarray platform that constitute the GBM pattern onto the human genome sequence build 36. We then assigned to the LGA pattern CNAs in the chromosomes and chromosome arms as well as the 111 of the 130 genomic segments that were previously identified in the GBM pattern by using the circular binary segmentation (CBS) [22], which are at least five Agilent probes in length. The LGA pattern was assigned a gain or a loss in a chromosome or a chromosome arm if the deviation of the chromosome or the chromosome arm mean copy number from the genomic mean is greater than twice the genomic standard deviation, where the genomic mean and standard deviation are calculated for the autosomal genome, excluding the outlying chromosomes 7 and 10 and chromosome arm 9p, as described in [8].
A gain or a loss in a segment was assigned if the deviation of the segment mean copy number from the genomic mean is greater than twice the genomic standard deviation, or if the deviation from the chromosomal mean is greater than the chromosomal standard deviation, when this deviation is consistent with the deviation from the genomic mean.

Cross-Platform Probe Matching

To map genomic signals between microarray platforms, we created a one-to-one mapping between the microarray probes on the Agilent Human Genome CGH Microarray 244A and the Affymetrix Genome-Wide Human SNP 6.0 microarray platform. The mapping was used to enable all cross-platform comparisons and analyses of discrete genomic profiles measured across different sets of microarray probes. We matched pairs of one Agilent and one Affymetrix probe that overlap by at least one nucleotide when mapped onto the human reference genome sequence build 36. The Agilent probes are 45–60 nucleotides long, and the Affymetrix probes are 25 nucleotides long. When multiple Affymetrix probes overlapped a single Agilent probe, the Affymetrix probe closest to the genomic end coordinate of the Agilent probe was selected, to maintain a one-to-one matching between the DNA microarray platforms. Similarly, when a single Affymetrix probe overlapped more than one Agilent probe, the Agilent probe closest to the genomic start coordinate of the Affymetrix probe was selected. This cross-platform probe matching identified 8,102 pairs of one-to-one overlapping Affymetrix and Agilent probes. To identify the 4,697 pairs of one-to-one overlapping probes that are consistently aberrated in the LGA and GBM patterns, we assigned to the patterns CNAs in the 8,102 Affymetrix and Agilent probes, respectively.
A gain or a loss in a probe was assigned if the deviation of the probe copy number from the genomic mean is greater than twice the genomic standard deviation, or if the deviation from the chromosomal mean is greater than the chromosomal standard deviation, when this deviation is consistent with the deviation from the genomic mean.

GBM Dataset Construction

We selected a GBM set of 364 patients, which are among the patients in the previous GBM discovery and validation sets [8]. The 364 patients were diagnosed with WHO grade IV astrocytoma, i.e., GBM. The TCGA survival annotations available for these patients were consistent. Each tumor profile lists median-centered log2 TCGA raw level 2 of the Affymetrix Genome-Wide Human SNP Array 6.0-measured DNA copy numbers across the same M1 = 933,827 probes of the LGA pattern. Missing data were not estimated. Probes among the 8,102 Agilent-matched probes, or the 4,697 Agilent-matched consistently-aberrated probes, which are missing data in any one profile, were excluded from the calculations of this profile’s median copy number as well as the profile’s Pearson correlations with the LGA and GBM patterns.

Figure 3.1: GSVD of LGA tumor and normal DNA copy-number profiles. The structure of the LGA discovery, tumor and normal datasets D_i is that of two matrices of 59 matched patients, i.e., columns, and 933,827 not necessarily matched tumor and normal genomic regions, or Affymetrix probes, i.e., rows. The GSVD of Equation 2.4 simultaneously separates the datasets into a set of patterns of variation across the patients, i.e., probelets V^T, which are identical for both datasets but have different weights, i.e., generalized singular values Σ_i, in the tumor than the normal dataset, combined with tumor- and normal-specific sets of patterns of variation across the genome, i.e., arraylets U_i.
The GSVD is depicted in a raster display, with relative DNA copy-number gain (red), no change (black), and loss (green), explicitly showing only the first through the 10th, and the 50th through the 59th probelets and corresponding tumor and normal arraylets, and tumor and normal generalized singular values. The significance of each probelet in the tumor dataset relative to that in the normal dataset is defined by the angular distance of Equation 2.7, which is proportional to the ratio of the corresponding tumor to normal generalized singular values. The angular distances are depicted in the inset bar chart display, showing that the largest angular distances, also in magnitude, are >2π/15, and correspond to the first and second probelets, which are, therefore, the two most tumor-exclusive probelets. The angular distance corresponding to the 53rd probelet is in magnitude <π/16, and, therefore, the 53rd probelet is approximately common to both the tumor and normal datasets.

Figure 3.2: The significance of individual probelets in the LGA tumor and normal datasets is given by the generalized normalized Shannon entropy. (a) Bar chart of the 10 largest generalized fractions of Equation 2.5 in the LGA tumor dataset shows that the two most tumor-exclusive probelets, i.e., the first and second probelets (Figure 3.1), are also the first and third most significant probelets in the tumor dataset, with >8% and 5% of the information in this dataset, respectively. The corresponding generalized normalized Shannon entropy of Equation 2.6 is 0.94. (b) Bar chart of the 10 largest generalized fractions in the normal dataset shows that the 53rd probelet, which is approximately common to the tumor and normal datasets, is the most significant probelet in the normal dataset with 10% of the information.

Figure 3.3: A tumor-exclusive batch effect is revealed by the GSVD.
(a) Plot of the first, most LGA tumor-exclusive tumor arraylet describes mostly unsegmented chromosomes (black lines), each with a copy-number distribution that is approximately centered at the autosomal genome with a relatively large, chromosome-invariant width. The probes are ordered, and their copy numbers are colored, according to each probe’s chromosomal location. (b) Plot of the first LGA probelet describes the variation of the weight or superposition coefficient of the first tumor arraylet in the tumor profiles of the 59 patients. The subset of patients that are of high copy numbers is enriched in tumors whose hybridization plate was 2391 (red), rather than other plates (blue). The corresponding hypergeometric P-value is <10^-2. (c) Raster display of the tumor dataset, with relative gain (red), no change (black), and loss (green) of DNA copy numbers, shows the correspondence between the LGA tumor profiles and the first probelet and tumor arraylet, which together describe a tumor-exclusive experimental batch effect.

Figure 3.4: Experimental and biological variation is captured by the GSVD. (a) Boxplots of the distribution of the copy numbers that are listed in the first probelet between two groups of the tumor hybridization plate. The group of tumor hybridization plate 2391 (red) has significantly greater copy numbers than the group of the other plates (blue). The corresponding Mann-Whitney-Wilcoxon P-value is <10^-2. (b) Boxplots of the distribution of the copy numbers that are listed in the 53rd probelet between the two gender groups. The males (blue) have significantly greater copy numbers in the probelet than the females (red), whose copy numbers are approximately centered at the autosomal genome. The corresponding Mann-Whitney-Wilcoxon P-value is <10^-9 (Figure 3.5).

Figure 3.5: Significant patterns are revealed by the GSVD of the LGA datasets.
(a) Plot of the second most LGA tumor-exclusive tumor arraylet describes a genome-wide pattern of co-occurring CNAs across the tumor genome. Segments (black lines) that were identified in the previously identified GBM pattern [8], and are amplified or deleted in the LGA pattern, are also amplified or deleted in the GBM pattern, respectively, and at a greater magnitude. (b) Plot of the second LGA probelet describes the variation of the weight, or superposition coefficient, of the LGA pattern in the tumor profiles of the 59 patients. The second probelet classifies the patients into two groups of low (red) and high (blue) weights, which are of statistically significantly different prognoses. (c) Raster display of the tumor dataset shows the correspondence between the tumor profiles and the second LGA probelet and tumor arraylet. (d) Plot of the 53rd LGA normal arraylet, which is the most significant in the normal dataset, describes an X chromosome-exclusive deletion across the normal genome. (e) Plot of the 53rd LGA probelet, which is approximately common to the tumor and normal datasets, describes a classification of the patients by gender into females (red) and males (blue), with both hypergeometric and Mann-Whitney-Wilcoxon P-values <10^-9. (f) Raster display of the normal dataset shows the male-specific X chromosome deletion across the normal genomes, which is conserved in the patient-matched LGA tumor genomes, but is separated from the second LGA tumor arraylet.

Figure 3.6: The male-specific X-chromosome deletion is identified in the LGA tumor dataset. (a) Plot of the 53rd LGA tumor arraylet describes a deletion of the X chromosome. (b) Plot of the 53rd LGA probelet, which is approximately common to the tumor and normal datasets, describes a classification of the patients by gender into females (red) and males (blue) (Figure 3.5). (c) Raster display of the tumor dataset shows the male-specific X chromosome deletion across the tumor genomes.
This biological variation originated in the patient-matched LGA normal genomes. The GSVD separates this variation from the second LGA tumor arraylet.

Figure 3.7: Survival analyses of the LGA patients classified by GSVD. (a) KM curves of the discovery set of 59 patients classified by the weights, or superposition coefficients, of the LGA pattern in their tumor profiles, as listed in the second probelet (Figure 3.5(b)). The KM median survival time of 63 months of the group of patients with low coefficients is more than three times greater than that of the group of patients with high coefficients. (b) Among the 29 patients in the discovery set treated by chemotherapy, the median survival time of the patients with low coefficients is approximately 3.5 times greater than that of the patients with high coefficients. This means that the LGA pattern is correlated with a patient’s response to chemotherapy. (c) Among the patients treated by radiation, the median survival times of patients with low and high coefficients are the same as among the chemotherapy-treated patients. This means that the pattern is also correlated with response to radiation. (d) KM curves of the validation set of 74 patients classified by the Pearson correlation of the LGA pattern with their tumor profiles. (e) Among the 46 patients treated by chemotherapy, the median survival times of the patients with low and high correlations are the same as in the validation set in general, and consistent with the chemotherapy-treated patients in the discovery set. (f) The median survival times of the radiation-treated patients with low and high correlations are the same as those of the chemotherapy-treated patients, and the validation set in general, and are consistent with the radiation-treated patients in the discovery set.

Figure 3.8: LGA genome-wide pattern of co-occurring CNAs is encompassed in the GBM pattern.
(a) Plot of the second most LGA tumor-exclusive tumor arraylet, which was revealed by the GSVD, describes a genome-wide pattern of co-occurring CNAs across 933,827 Affymetrix probes. This LGA pattern is encompassed in the GBM pattern (Figure 3.9), and consists of GBM-associated LGA-shared CNAs (black), including, e.g., gain of a segment on chromosome 1 containing MDM4. (b) The GBM-associated LGA-shared CNAs, e.g., in MDM4, are visible across the 8,102 Agilent-matched Affymetrix probes, even though these are <1% of the probes that constitute the LGA pattern. (c) The LGA-shared CNAs are also visible across the 4,697 Agilent-matched consistently-aberrated Affymetrix probes. (d) These CNAs are not visible across the 3,405 remaining probes.

Figure 3.9: The GBM genome-wide pattern consists of LGA-shared and GBM-specific co-occurring CNAs. (a) Plot of the previously identified GBM pattern [8] describes a genome-wide pattern of co-occurring CNAs across 212,696 Agilent probes. This GBM pattern consists of LGA-shared and GBM-specific global CNAs, as well as LGA-shared (black) and GBM-specific (blue) focal CNAs. (b) Both LGA-shared and GBM-specific global and focal CNAs are visible across the 8,102 Affymetrix-matched Agilent probes, even though these are <4% of the 212,696 Agilent probes that constitute the GBM pattern. (c) The LGA-shared global and focal CNAs are visible across the 4,697 Affymetrix-matched consistently-aberrated Agilent probes, but not the GBM-specific CNAs. (d) The GBM-specific global and focal CNAs are visible across the 3,405 remaining Agilent probes, but not the LGA-shared CNAs.

Figure 3.10: Schematic mapping of the GBM and LGA CNAs onto the Hh pathway. GBM-specific CNAs encode for enhanced opportunities for proliferation via developmental signaling pathways in GBM relative to LGA.
The schematic mapping of the CNAs onto the Hh pathway describes gains (red) and losses (green) of genes (rectangles), which are LGA- and GBM-shared (black) or GBM-specific (blue), and relationships, which directly or indirectly lead to increased (lines with arrows) or decreased (lines with bars) activities of the genes, and the tumor suppressor protein Ptch1 (circle).

Figure 3.11: Schematic mapping of the GBM and LGA CNAs onto the Ras pathway. GBM-specific CNAs encode for enhanced opportunities for transformation via growth signaling pathways in GBM relative to LGA. The schematic mapping of the GBM and LGA CNAs onto the Ras pathway describes gains (red) and losses (green) of genes and gene transcript variants (rectangles), which are LGA- and GBM-shared (black) or GBM-specific (blue), and relationships, which directly or indirectly lead to increased (lines with arrows) or decreased (lines with bars) activities of the genes and transcripts, and the tumor suppressor proteins p53 and Rb (circles).

References

[1] T. Boveri, “Concerning the origin of malignant tumours by Theodor Boveri. Translated and annotated by Henry Harris,” Journal of Cell Science, vol. 121, no. Supplement 1, pp. 1–84, 2008. [2] S. Heim, “Boveri at 100: Boveri, chromosomes and cancer,” The Journal of Pathology, vol. 234, no. 2, pp. 138–141, 2014. [3] D. Hanahan and R. A. Weinberg, “Hallmarks of cancer: the next generation,” Cell, vol. 144, no. 5, pp. 646–674, 2011. [4] P. Sankaranarayanan, et al., “Tensor GSVD of patient- and platform-matched tumor and normal DNA copy-number profiles uncovers chromosome arm-wide patterns of tumor-exclusive platform-consistent alterations encoding for cell transformation and predicting ovarian cancer survival,” PLoS One, vol. 10, no. 4, p. e0121396, 2015. [5] Cancer Genome Atlas Research Network, et al., “Integrated genomic analyses of ovarian carcinoma,” Nature, vol. 474, no. 7353, pp. 609–615, 2011. [6] M. G.
Prisco, et al., “Prognostic role of metastasis tumor antigen 1 in patients with ovarian cancer: a clinical study,” Human Pathology, vol. 43, no. 2, pp. 282–288, 2012. [7] M. Harries and M. Gore, “Part I: chemotherapy for epithelial ovarian cancer–treatment at first diagnosis,” The Lancet Oncology, vol. 3, no. 9, pp. 529–536, 2002. [8] C. H. Lee, et al., “GSVD comparison of patient-matched normal and tumor aCGH profiles reveals global copy-number alterations predicting glioblastoma multiforme survival,” PLoS One, vol. 7, no. 1, p. e30098, 2012. [9] Cancer Genome Atlas Research Network, “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, no. 7216, pp. 1061–1068, 2008. [10] C. F. Van Loan, “Generalizing the singular value decomposition,” SIAM Journal on Numerical Analysis, vol. 13, no. 1, pp. 76–83, 1976. [11] C. C. Paige and M. A. Saunders, “Towards a generalized singular value decomposition,” SIAM Journal on Numerical Analysis, vol. 18, no. 3, pp. 398–405, 1981. [12] O. Alter, et al., “Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms,” Proceedings of the National Academy of Sciences, vol. 100, no. 6, pp. 3351–3356, 2003. [13] M. G. Netsky, et al., “The longevity of patients with glioblastoma multiforme,” Journal of Neurosurgery, vol. 7, no. 3, pp. 261–269, 1950. [14] W. J. Curran, et al., “Recursive partitioning analysis of prognostic factors in three Radiation Therapy Oncology Group malignant glioma trials,” Journal of the National Cancer Institute, vol. 85, no. 9, pp. 704–710, 1993. [15] T. Gorlia, et al., “Nomograms for predicting survival of patients with newly diagnosed glioblastoma: prognostic factor analysis of EORTC and NCIC trial 26981-22981/ce. 3,” The Lancet Oncology, vol. 9, no. 1, pp. 29–38, 2008. [16] R. N.
Wiltshire, et al., “Comparative genetic patterns of glioblastoma multiforme: potential diagnostic tool for tumor classification,” Neuro-Oncology, vol. 2, no. 3, pp. 164–173, 2000. [17] A. Misra, et al., “Array comparative genomic hybridization identifies genetic subgroups in grade 4 human astrocytoma,” Clinical Cancer Research, vol. 11, no. 8, pp. 2907–2918, 2005. [18] C. Daumas-Duport, et al., “Grading of astrocytomas: a simple and reproducible method,” Cancer, vol. 62, no. 10, pp. 2152–2165, 1988. [19] M. L. C. Van Veelen, et al., “Supratentorial low grade astrocytoma: prognostic factors, dedifferentiation, and the issue of early versus late surgery,” Journal of Neurology, Neurosurgery & Psychiatry, vol. 64, no. 5, pp. 581–587, 1998. [20] C. Van Loan, “Computing the CS and the generalized singular value decompositions,” Numerische Mathematik, vol. 46, no. 4, pp. 479–491, 1985. [21] W. J. Kent, et al., “The human genome browser at UCSC,” Genome Research, vol. 12, no. 6, pp. 996–1006, 2002. [22] A. B. Olshen, et al., “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics, vol. 5, no. 4, pp. 557–572, 2004. [23] E. Eden, et al., “GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists,” BMC Bioinformatics, vol. 10, no. 1, p. 48, 2009. [24] D. R. Cox, “Regression models and life-tables,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 187–220, 1972. [25] E. L. Kaplan and P. Meier, “Nonparametric estimation from incomplete observations,” Journal of the American Statistical Association, vol. 53, no. 282, pp. 457–481, 1958. [26] C. J. Sherr and F. McCormick, “The RB and p53 pathways in cancer,” Cancer cell, vol. 2, no. 2, pp. 103–112, 2002. [27] A. E. Karnoub and R. A. Weinberg, “Ras oncogenes: split personalities,” Nature Reviews Molecular Cell Biology, vol. 9, no. 7, pp. 517–531, 2008. [28] W. C. 
Hahn, et al., “Creation of human tumour cells with defined genetic elements,” Nature, vol. 400, no. 6743, pp. 464–468, 1999. [29] U. Fischer, et al., “A different view on DNA amplifications indicates frequent, highly complex, and stable amplicons on 12q13-21 in glioma,” Molecular Cancer Research, vol. 6, no. 4, pp. 576–584, 2008. [30] G. J. Hannon and D. Beach, “p15INK4B is a potential effector of TGF-β-induced cell cycle arrest,” Nature, vol. 371, no. 6494, pp. 257–261, 1994. [31] R. Rohatgi and M. P. Scott, “Patching the gaps in Hedgehog signalling,” Nature Cell Biology, vol. 9, no. 9, pp. 1005–1009, 2007. [32] R. Wechsler-Reya and M. P. Scott, “The developmental biology of brain tumors,” Annual Review of Neuroscience, vol. 24, no. 1, pp. 385–428, 2001. [33] A. Chicas, et al., “Dissecting the unique role of the retinoblastoma tumor suppressor during cellular senescence,” Cancer Cell, vol. 17, no. 4, pp. 376–387, 2010. [34] D. Etemadmoghadam, et al., “Amplicon-dependent CCNE1 expression is critical for clonogenic survival after cisplatin treatment and is correlated with 20q11 gain in ovarian cancer,” PLoS One, vol. 5, no. 11, p. e15498, 2010. [35] K. M. Turner, et al., “Genomically amplified Akt3 activates DNA repair pathway and promotes glioma progression,” Proceedings of the National Academy of Sciences, vol. 112, no. 11, pp. 3421–3426, 2015. [36] K. M. Reilly, et al., “Nf1; trp53 mutant mice develop glioblastoma with evidence of strain-specific effects,” Nature Genetics, vol. 26, no. 1, pp. 109–113, 2000. [37] A. L. Hopkins and C. R. Groom, “The druggable genome,” Nature Reviews Drug Discovery, vol. 1, no. 9, pp. 727–730, 2002. [38] J. A. Berger, et al., “Jointly analyzing gene expression and copy number data in breast cancer using data reduction models,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 3, no. 1, p. 2, 2006. [39] A. W.
Schreiber, et al., “Combining transcriptional datasets using the generalized singular value decomposition,” BMC Bioinformatics, vol. 9, no. 1, p. 335, 2008. [40] X. Xiao, et al., “Exploring metabolic pathway disruption in the subchronic phencyclidine model of schizophrenia with the generalized singular value decomposition,” BMC Systems Biology, vol. 5, no. 1, p. 72, 2011. [41] O. A. Tomescu, et al., “Integrative omics analysis. A study based on Plasmodium falciparum mRNA and protein data,” BMC Systems Biology, vol. 8, no. 2, p. S4, 2014. [42] S. P. Ponnapalli, et al., “A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms,” PLoS One, vol. 6, no. 12, p. e28072, 2011. [43] X. Xiao, et al., “Multi-tissue analysis of co-expression networks by higher-order generalized singular value decomposition identifies functionally coherent transcriptional modules,” PLoS Genetics, vol. 10, no. 1, p. e1004006, 2014.

CHAPTER 4

CROSS-PLATFORM VALIDATION OF PATTERN OF DNA COPY NUMBER ABERRATIONS

This chapter is part of a published journal article in PLoS One Vol. 11 No. 10, article e30098 (October 2016). Reprinted with minor revisions in accordance with the Creative Commons Attribution (CC BY) license. Authors of this chapter are K. A. Aiello and O. Alter.

Introduction

In order for a genomic test to be translated into a clinical laboratory test, it must meet rigorous standards for analytic validity. The analytic validity of a test refers to the accuracy with which the genetic characteristic of interest, such as a genomic signature, is identified in a given laboratory test. Such tests may be implemented using a variety of protocols, and consideration must be given to potential issues that arise due to the assay and experimental protocols chosen, the test’s reliability, the degree to which reliability varies between laboratories, and the complexity of test interpretation [1, 2].
To advance the previously identified GBM pattern [3] toward the clinic where it can be implemented as a laboratory test, we will first evaluate its analytic validity by demonstrating cross-platform and cross-tumor validation of the pattern. High-throughput measurements, and microarray data in particular, are known to be noisy and susceptible to experimental artifacts [4, 5]. Here, we test the prognostic utility of the GBM pattern on replicate GBM copy-number profiles measured by a separate, Affymetrix-manufactured microarray platform to validate that the Agilent-derived GBM pattern can be reliably measured on multiple platforms of different design and probe architecture. This will further validate that the pattern describes the fundamental tumor biology of the aggressive GBM genotype, rather than an artifact of the measurement technology. Due to the observed clinical and molecular similarities between GBM and some LGAs, we will also test the GBM pattern for prognostic utility in the LGA population. Notably, patients diagnosed with anaplastic, i.e., grade III, astrocytoma are known to have highly variable survival times. If the more aggressive subset of anaplastic astrocytomas can be identified at the time of diagnosis, they can be managed in the same way as GBMs, with a combination of maximal surgical resection, radiation, and adjuvant chemotherapy [6]. By extending our pattern to the general astrocytoma patient population, we can identify this subset of patients with the most aggressive tumors and improve their standard of care. Finally, we will evaluate the clinical utility of the GBM pattern in the general astrocytoma population, comprising both GBM and LGA patients, and compare the prognostic ability of the pattern to existing indicators, including age at diagnosis, tumor grade, and laboratory tests for MGMT promoter methylation and IDH1 mutation.
Results

The GBM Pattern Is a Platform-Independent Predictor of GBM Survival

We used the Agilent-derived GBM pattern to classify the primary GBM tumor profiles of a set of 364 TCGA patients whose genome-wide copy-number profiles were measured separately on two microarray platforms: the Agilent Human aCGH 244A microarray and the Affymetrix Genome-Wide Human SNP 6.0. The two platforms have significantly different designs, including the total number of probes measured, the length of each probe, and the distribution of coverage across the genome. For each GBM sample, the Agilent-measured profile and the Affymetrix-measured profile are derived from DNA aliquots extracted from the same lysed and homogenized portion of the tumor sample. Therefore, the Agilent- and Affymetrix-measured profiles for a given sample are biological replicates with the same underlying copy-number profile. Any differences between the replicate profiles are interpreted as experimental variation or microarray platform-specific bias.

We find that the GBM pattern is a platform-independent predictor of GBM survival. Classifying the GBM patients based upon the Affymetrix-measured tumor profiles, and across just the 4,697 matched probes, gives qualitatively the same and quantitatively similar results as the previous classification based upon the Agilent-measured profiles, across the 212,696 Agilent probes [3]. As in the original Agilent-based classification, the KM median survival time of the group of patients with low correlations is >2.5 times, and >1.5 years greater than the approximately 1-year median survival time of the group of patients with high correlations (Figure 4.1(a)).
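The classification described above can be sketched as follows. This is a minimal illustration, assuming that each tumor profile and the pattern are given as arrays over the same matched probes, with missing values encoded as NaN. The function names and the `threshold` parameter are hypothetical, since the correlation cutoff used for the classification is not restated in this section. Probes with missing data are excluded from the median and correlation calculations of a profile, following the dataset construction described in Chapter 3.

```python
import numpy as np


def pattern_correlation(profile, pattern):
    """Pearson correlation of one tumor profile with a pattern.

    Probes that are missing (NaN) in the profile or the pattern are
    excluded from both the median centering and the correlation.
    """
    profile = np.asarray(profile, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    valid = ~(np.isnan(profile) | np.isnan(pattern))
    # Median-center the profile over its valid probes. This does not change
    # the Pearson correlation, but mirrors the construction of the profiles.
    centered = profile[valid] - np.median(profile[valid])
    return float(np.corrcoef(centered, pattern[valid])[0, 1])


def classify_by_pattern(profiles, pattern, threshold):
    # HYPOTHETICAL cutoff: profiles whose correlation with the pattern
    # exceeds `threshold` are labeled "high", all others "low".
    return [
        "high" if pattern_correlation(p, pattern) > threshold else "low"
        for p in profiles
    ]


# Toy values over six matched probes (not TCGA data).
pattern = np.array([0.8, 0.7, -0.6, -0.9, 0.1, 0.0])
profiles = [
    np.array([0.9, 0.6, -0.5, np.nan, 0.2, 0.1]),  # resembles the pattern
    np.array([-0.1, 0.0, 0.1, 0.2, -0.1, 0.0]),    # does not
]
labels = classify_by_pattern(profiles, pattern, threshold=0.5)
```

The resulting "high" and "low" groups are the ones that would then be compared by survival analysis, e.g., by KM median survival times.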
The GBM Pattern Identifies Among the LGA Patients a Subtype, Similar to That Among the GBM Patients, Where the CNA Genotype Is Correlated with an Approximately 1-Year Survival Phenotype

Since the GBM pattern encompasses the LGA pattern, we hypothesized that the GBM pattern can be generalized to the larger astrocytoma population, including lower-grade tumors. We used the GBM pattern to classify the Affymetrix-measured tumor profiles of the 133 TCGA patients in the LGA discovery and validation sets described in Chapter 3. The survival analysis results are consistent with those based upon the correlation with the Affymetrix-derived LGA pattern (Figure 4.1(b)). Because a high weight of the GBM pattern in either an LGA or a GBM tumor’s profile confers a greater hazard and a shorter survival time, we compared the survival of the groups of LGA and GBM patients that are identified by the GBM pattern. We find that the KM curves for these two groups overlap, with the corresponding log-rank test P-value >0.05, which means that the two groups are statistically indistinguishable based upon survival. Classifying the 133 LGA and 364 GBM, i.e., 497 astrocytoma patients, based upon the weight of the GBM pattern in each patient’s tumor profile, we find that the GBM pattern is a predictor of survival among the general primary astrocytoma population, independent of grade, where the CNA genotype that the GBM pattern describes is correlated with an approximately 1-year survival phenotype (Figure 4.1(c)).

We also assessed the distribution of several TCGA annotations of intratumor heterogeneity in each astrocytoma grade, including the tumor sample’s volume, the slide’s percents of tumor cells and nuclei, the portion’s weight, and the analyte’s and aliquot’s native, unamplified DNA quantities. We find that at the TCGA ranges for these annotations, the GBM pattern is independent of intratumor heterogeneity.
Preliminary studies in formalin-fixed paraffin-embedded (FFPE) tissue samples showed that the microdissected FFPE tumor samples, which are only ∼1/100–1/1000 the tissue volume of the fresh frozen TCGA tumor samples, do not meet the same requirements for intratumor heterogeneity, and may be less suitable for genomic classification.

The GBM Pattern Is a Platform-Independent Predictor of Astrocytoma Outcome

To examine the correlation of the GBM pattern with an astrocytoma patient’s response to treatment, we classified the 497 patients by chemotherapy or radiation and, in addition, by the GBM pattern (Figure 4.2). These classifications give bivariate Cox hazard ratios which are close to, and within the 95% confidence intervals of, the corresponding univariate ratios (Table 4.1). This means that the GBM pattern is a predictor of a patient’s survival independent of treatment, and, therefore, also a predictor of the patient’s response to treatment. Next, we examined the correlation of the GBM pattern with a patient’s age and a tumor’s grade (Figure 4.3) [7, 8, 9, 10]. We find that the log-rank test P-value, which corresponds to the classification by the GBM pattern, is less than the P-values which correspond to the classifications by age and grade. The univariate hazard ratio and the concordance index, which correspond to the GBM pattern, are greater than those that correspond to age and grade. Together, these results mean that the GBM pattern is statistically a better predictor of astrocytoma outcome than age or grade. Classifying the patients by the GBM pattern in addition to age or grade, we find that the GBM pattern is also statistically independent of age and grade. Combined with either age or grade, therefore, the GBM pattern is statistically an even better predictor of astrocytoma outcome (Figure 4.3).
For example, the >4-year survival difference among the patients classified by both the GBM pattern and age is >3 times, and >2.5 years, greater than the difference between the patients classified by age alone. The >3.5-year difference among the grades III and IV astrocytoma patients classified by the GBM pattern and grade is >1.5 times, and 1.5 years, greater than the difference between these patients classified by grade alone.

We also compared the GBM pattern to the existing pathology laboratory tests for astrocytoma (Figure 4.4). Silencing of a tumor’s MGMT gene by hypermethylation of its DNA promoter region was associated with a GBM and, recently, also an LGA patient’s longer survival in response to temozolomide chemotherapy treatment [11, 12]. Mutation of the gene IDH1 was associated with a patient’s longer survival [13], and linked with patterns of mRNA expression and DNA methylation across several hundred genes and genomic regions, respectively, in the tumor’s genome [14, 15, 16]. We find that the genome-wide GBM pattern of CNAs is statistically a better predictor of astrocytoma outcome, corresponding to greater median survival time difference, proportional hazard ratio, and concordance index, than MGMT promoter methylation and IDH1 mutation (Table 4.1). The GBM pattern additionally classifies the patients with either a methylated or an unmethylated MGMT promoter, or a mutated or an unmutated IDH1, into two groups each, with an approximately 1- to 4-year survival difference, which means that it is independent of both. Combined with either existing pathology laboratory test, therefore, the GBM pattern is an even better predictor of astrocytoma outcome.

Discussion

To date, statistically the best indicators of astrocytoma outcome in clinical use remain the patient’s age at diagnosis and the tumor’s grade [7, 8, 9, 10, 17, 18].
High-throughput molecular profiling efforts identified two indicative genetic loci that were translated into pathology laboratory tests, one locus of DNA hypermethylation, and the other of DNA mutation linked with mRNA expression and DNA methylation subtypes of astrocytoma [11, 12, 13, 14, 15, 16, 19]. Recurring DNA CNAs have been observed in astrocytoma tumors’ genomes for decades. However, copy-number subtypes that are predictive of astrocytoma patients’ outcomes were not identified [20, 21]. Here, we showed that the genome-wide pattern of CNAs in a primary astrocytoma tumor’s DNA copy-number profile is a predictor of the patient’s survival and response to chemotherapy and radiation, statistically better than, and independent of, the patient’s age, the tumor’s grade, and the existing laboratory tests as individual predictors. When the GBM pattern is combined with each of the existing indicators, it makes a better predictor than the indicator alone. Thus, the pattern provides additional prognostic information not currently available to astrocytoma patients, which can be used to help guide a patient’s treatment and care.

We showed that the pattern is correlated with an approximately 1-year survival phenotype among the astrocytoma patients, regardless of the tumor grade. In particular, the GBM pattern identifies the subset of grade III astrocytoma patients whose outcome is indistinguishable from that of the GBM patients. This subset of shorter-surviving lower-grade astrocytoma was previously observed in the clinic, but the patients were not identifiable at the time of diagnosis [6]. Now, given this information at the time of diagnosis, the aggressive lower-grade astrocytoma tumors, characterized by a high weight of the GBM pattern in the tumor’s genome, can be aggressively treated in the same manner as GBMs.
The pattern is a platform-independent predictor of astrocytoma outcome, and therefore, it can be translated into a laboratory test by using non-disease-specific FDA-approved platforms, such as next-generation sequencing (NGS) [22]. Molecular testing of tissue specimens for clinical evaluation is carried out at pathology test laboratories. When a pathology laboratory adds a new prognostic test to its menu, it may implement the test using the measurement technology of its choice, according to its existing protocols and practices, after internal revalidation. Since different measurement technologies, such as microarray platforms from different manufacturers, are used at different laboratories, it is important to demonstrate that the prognostic contribution of the pattern is platform-independent, and the test can be reliably implemented on a variety of platforms.

Additionally, the platform independence of the genome-wide pattern further demonstrates that it captures the fundamental tumor biology that characterizes the aggressive clinical genotype, and not an artifact of the measurement technology. This is because the GSVD, from which the GBM pattern was revealed, mathematically separates the pattern from other sources of experimental and biological variation. Rather than simplifying the data, as is commonly done, the GSVD makes use of the complexity of the data in order to tease out the fundamental patterns within them.

Methods

Cross-Platform Probe Matching

To map genomic signals between microarray platforms, we created a one-to-one mapping between the microarray probes on the Agilent Human Genome CGH Microarray 244A and the Affymetrix Genome-Wide Human SNP 6.0 microarray platform. The mapping was used to enable all cross-platform comparison and analyses of discrete genomic profiles measured across different sets of microarray probes.
We matched pairs of one Agilent and one Affymetrix probe that overlap by at least one nucleotide when mapped onto the human reference genome sequence build 36. The Agilent probes are 45–60 nucleotides long, and the Affymetrix probes are 25 nucleotides long. When multiple Affymetrix probes overlapped a single Agilent probe, the Affymetrix probe closest to the genomic end coordinate of the Agilent probe was selected, to maintain a one-to-one matching between the DNA microarray platforms. Similarly, when a single Affymetrix probe overlapped more than one Agilent probe, the Agilent probe closest to the genomic start coordinate of the Affymetrix probe was selected. This cross-platform probe matching identified 8,102 pairs of one-to-one overlapping Affymetrix and Agilent probes. A subset of 4,697 pairs of one-to-one overlapping probes that are consistently aberrated, with the same call of gain, loss, or no CNA in both the LGA and GBM patterns, was also identified, as discussed in Chapter 3.

GBM Dataset Construction

We selected a GBM set of 364 patients, who are among the patients in the previous GBM discovery and validation sets [3]. The 364 patients were diagnosed with WHO grade IV astrocytoma, i.e., GBM. The TCGA survival annotations available for these patients were consistent. Each tumor profile lists median-centered log2 TCGA raw level 2 of the Affymetrix Genome-Wide Human SNP Array 6.0-measured DNA copy numbers across the same M1 = 933,827 probes of the LGA pattern. Missing data were not estimated. Probes among the 8,102 Agilent-matched probes, or the 4,697 Agilent-matched consistently aberrated probes, which are missing data in any one profile, were excluded from the calculations of this profile’s median copy number as well as the profile’s Pearson correlations with the LGA and GBM patterns.
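The one-to-one matching that produces the Agilent-matched probes referred to above can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the code used in this work: the `Probe` tuple, the function names, the inclusive coordinates, the order of the two selection steps, and the reading of "closest" as the distance between the two probes' end (or start) coordinates are all hypothetical choices made for the example.

```python
from collections import namedtuple

# Hypothetical probe record; start and end are inclusive genomic coordinates.
Probe = namedtuple("Probe", "name chrom start end")


def overlaps(a, b):
    """True if the two probes share at least one nucleotide."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def match_probes(agilent, affymetrix):
    """Return sorted (Agilent name, Affymetrix name) one-to-one pairs."""
    # Step 1 (assumed order): per Agilent probe, keep the overlapping
    # Affymetrix probe whose end is closest to the Agilent probe's end.
    candidate = {}
    for ag in agilent:
        hits = [af for af in affymetrix if overlaps(ag, af)]
        if hits:
            candidate[ag] = min(hits, key=lambda af: abs(af.end - ag.end))
    # Step 2: per Affymetrix probe, keep the Agilent probe whose start is
    # closest to the Affymetrix probe's start.
    best = {}
    for ag, af in candidate.items():
        current = best.get(af)
        if current is None or abs(ag.start - af.start) < abs(current.start - af.start):
            best[af] = ag
    return sorted((ag.name, af.name) for af, ag in best.items())


# Toy coordinates (invented): two Agilent probes compete for one Affymetrix probe.
agilent = [Probe("A1", "1", 100, 159), Probe("A2", "1", 120, 179)]
affymetrix = [Probe("S1", "1", 150, 174)]
pairs = match_probes(agilent, affymetrix)
```

The brute-force overlap search is quadratic and is chosen only for readability; at the scale of the actual platforms, sorted coordinates or an interval index would be the natural choice. In this sketch an Agilent probe that loses its candidate in the second step is left unmatched.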
MGMT Promoter Methylation and IDH1 Mutation Annotations

To estimate the MGMT promoter methylation status of a tumor, we used the TCGA raw level 1 of the Illumina Infinium Human Methylation 27 or 450 BeadChip-measured DNA methylation levels, as described [12]. The IDH1 mutation status of the LGA and GBM tumors is from TCGA [16, 17].

Figure 4.1: Survival analyses of astrocytoma patients classified by the GBM pattern show a significant difference in overall survival. KM curves, log-rank test P-values, and Cox proportional hazard ratios of (a) the GBM set of 364 patients, (b) the LGA discovery and validation sets of 133 patients, (c) the LGA and GBM sets of 497 patients, classified by the GBM pattern.

Figure 4.2: Survival analyses of astrocytoma patients classified by treatment demonstrate predictive value of GBM pattern for response. KM curves, log-rank test P-values, and Cox proportional hazard ratios of the 497 astrocytoma patients classified by (a) chemotherapy, (b) radiation, (c) the GBM pattern combined with chemotherapy, and (d) the GBM pattern combined with radiation.

Figure 4.3: Survival analyses of astrocytoma patients classified by existing indicators demonstrate independence of the GBM pattern from indicators in use. KM curves, log-rank test P-values, and Cox proportional hazard ratios of the 497 astrocytoma patients classified by (a) the patient’s age at diagnosis, (b) the tumor’s grade, (c) the GBM pattern combined with age, and (d) the GBM pattern combined with grade.

Figure 4.4: Survival analyses of astrocytoma patients classified by laboratory tests demonstrate independence of the GBM pattern from the common prognostic laboratory tests. KM curves, log-rank test P-values, and Cox proportional hazard ratios of the 497 astrocytoma patients classified by (a) MGMT promoter methylation, (b) IDH1 mutation, (c) the GBM pattern combined with MGMT, and (d) the GBM pattern combined with IDH1.
Table 4.1: Cox proportional hazard models for GBM pattern and existing indicators

Model         Predictor      Hazard Ratio  95% Confidence Interval  P-value       Concordance Index
Univariate    GBM Arraylet   4.1           3.0–5.8                  2.9 × 10^−17  0.85
Univariate    Chemotherapy   1.9           1.5–2.4                  4.7 × 10^−7   0.72
Univariate    Radiation      2.6           2.0–3.4                  3.4 × 10^−13  0.80
Univariate    Age            2.8           2.2–3.6                  1.4 × 10^−15  0.80
Univariate    Grade          2.8           2.0–3.7                  1.4 × 10^−10  0.82
Univariate    MGMT           2.1           1.6–2.7                  8.6 × 10^−8   0.66
Univariate    IDH1           3.0           1.8–4.9                  1.0 × 10^−5   0.83
Multivariate  GBM Arraylet   4.5           3.2–6.3                  6.1 × 10^−19  0.81
              Chemotherapy   2.2           1.7–2.8                  6.3 × 10^−10
Multivariate  GBM Arraylet   4.5           3.2–6.2                  7.3 × 10^−19  0.84
              Radiation      3.0           2.3–3.9                  3.1 × 10^−16
Multivariate  GBM Arraylet   3.1           2.2–4.4                  5.7 × 10^−10  0.79
              Age            1.9           1.4–2.5                  7.5 × 10^−6
Multivariate  GBM Arraylet   3.0           2.1–4.3                  4.2 × 10^−9   0.80
              Grade          1.7           1.2–2.4                  1.5 × 10^−3
Multivariate  GBM Arraylet   4.3           2.9–6.3                  2.4 × 10^−13  0.73
              MGMT           1.4           1.1–1.9                  1.0 × 10^−2

References

[1] W. Burke, et al., “Genetic test evaluation: information needs of clinicians, policy makers, and the public,” American Journal of Epidemiology, vol. 156, no. 4, pp. 311–318, 2002.
[2] W. Burke, “Genetic tests: clinical validity and clinical utility,” Current Protocols in Human Genetics, pp. 9–15, 2014.
[3] C. H. Lee, et al., “GSVD comparison of patient-matched normal and tumor aCGH profiles reveals global copy-number alterations predicting glioblastoma multiforme survival,” PLoS One, vol. 7, no. 1, p. e30098, 2012.
[4] M. K. Kerr, et al., “Analysis of variance for gene expression microarray data,” Journal of Computational Biology, vol. 7, no. 6, pp. 819–837, 2000.
[5] O. Alter, et al., “Singular value decomposition for genome-wide expression data processing and modeling,” Proceedings of the National Academy of Sciences, vol. 97, no. 18, pp. 10101–10106, 2000.
[6] J. S. Smith, et al., “PTEN mutation, EGFR amplification, and outcome in patients with anaplastic astrocytoma and glioblastoma multiforme,” Journal of the National Cancer Institute, vol. 93, no. 16, pp. 1246–1256, 2001.
[7] M. G. Netsky, et al., “The longevity of patients with glioblastoma multiforme,” Journal of Neurosurgery, vol. 7, no. 3, pp. 261–269, 1950.
[8] W. J. Curran, et al., “Recursive partitioning analysis of prognostic factors in three Radiation Therapy Oncology Group malignant glioma trials,” Journal of the National Cancer Institute, vol. 85, no. 9, pp. 704–710, 1993.
[9] T. Gorlia, et al., “Nomograms for predicting survival of patients with newly diagnosed glioblastoma: prognostic factor analysis of EORTC and NCIC trial 26981-22981/CE.3,” The Lancet Oncology, vol. 9, no. 1, pp. 29–38, 2008.
[10] C. Daumas-Duport, et al., “Grading of astrocytomas: a simple and reproducible method,” Cancer, vol. 62, no. 10, pp. 2152–2165, 1988.
[11] M. E. Hegi, et al., “MGMT gene silencing and benefit from temozolomide in glioblastoma,” New England Journal of Medicine, vol. 352, no. 10, pp. 997–1003, 2005.
[12] P. Bady, et al., “MGMT methylation analysis of glioblastoma on the Infinium methylation BeadChip identifies two distinct CpG regions associated with gene silencing and outcome, yielding a prediction model for comparisons across datasets, tumor grades, and CIMP-status,” Acta Neuropathologica, vol. 124, no. 4, pp. 547–560, 2012.
[13] B. Purow and D. Schiff, “Advances in the genetics of glioblastoma: are we reaching critical mass?,” Nature Reviews Neurology, vol. 5, no. 8, pp. 419–426, 2009.
[14] R. G. W. Verhaak, et al., “Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1,” Cancer Cell, vol. 17, no. 1, pp. 98–110, 2010.
[15] H. Noushmehr, et al., “Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma,” Cancer Cell, vol. 17, no. 5, pp. 510–522, 2010.
[16] C. W. Brennan, et al., “The somatic genomic landscape of glioblastoma,” Cell, vol. 155, no. 2, pp. 462–477, 2013.
[17] Cancer Genome Atlas Research Network, et al., “Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas,” New England Journal of Medicine, vol. 372, no. 26, pp. 2481–2498, 2015.
[18] M. L. C. Van Veelen, et al., “Supratentorial low grade astrocytoma: prognostic factors, dedifferentiation, and the issue of early versus late surgery,” Journal of Neurology, Neurosurgery & Psychiatry, vol. 64, no. 5, pp. 581–587, 1998.
[19] P. S. Mischel, et al., “DNA-microarray analysis of brain cancer: molecular classification for therapy,” Nature Reviews Neuroscience, vol. 5, no. 10, pp. 782–792, 2004.
[20] R. N. Wiltshire, et al., “Comparative genetic patterns of glioblastoma multiforme: potential diagnostic tool for tumor classification,” Neuro-Oncology, vol. 2, no. 3, pp. 164–173, 2000.
[21] A. Misra, et al., “Array comparative genomic hybridization identifies genetic subgroups in grade 4 human astrocytoma,” Clinical Cancer Research, vol. 11, no. 8, pp. 2907–2918, 2005.
[22] F. S. Collins and M. A. Hamburg, “First FDA authorization for next-generation sequencer,” New England Journal of Medicine, vol. 369, no. 25, pp. 2369–2371, 2013.

CHAPTER 5

ASTROCYTOMA GENOTYPE ENCODES FOR TRANSFORMATION AND PREDICTS SURVIVAL PHENOTYPE

This chapter is an invited journal article published in a special issue of Applied Physics Letters (APL) Bioengineering on the topic of “Bioengineering of Cancer.” APL Bioengineering, Issue 2, Article 031909 (2018). Reprinted with permission from the authors, ©2018 K. A. Aiello, S. P. Ponnapalli, and O. Alter, under the terms of a Creative Commons license.

Abstract

DNA alterations have been observed in astrocytoma for decades. A copy-number genotype predictive of a survival phenotype was only discovered by using the generalized singular value decomposition (GSVD) formulated as a comparative spectral decomposition. Here we use the GSVD to compare whole-genome sequencing (WGS) profiles of patient-matched astrocytoma and normal DNA.
First, the GSVD uncovers a genome-wide pattern of copy-number alterations, which is bounded by patterns recently uncovered by the GSVDs of microarray-profiled patient-matched glioblastoma (GBM) and, separately, lower-grade astrocytoma and normal genomes. Like the microarray patterns, the WGS pattern is correlated with an approximately 1-year median survival time. By filling in gaps in the microarray patterns, the WGS pattern reveals that this biologically consistent genotype encodes for transformation via the Notch together with the Ras and Shh pathways. Second, like the GSVDs of the microarray profiles, the GSVD of the WGS profiles separates the tumor-exclusive pattern from normal copy-number variations and experimental inconsistencies. These include the WGS technology-specific effects of guanine-cytosine content variations across the genomes that are correlated with experimental batches. Third, by identifying the biologically consistent phenotype among the WGS-profiled tumors, the GBM pattern proves to be a technology-independent predictor of survival and response to chemotherapy and radiation, statistically better than the patient’s age and tumor’s grade, the best other indicators, and MGMT promoter methylation and IDH1 mutation. We conclude that by using the complex structure of the data, comparative spectral decompositions underlie a mathematically universal description of the genotype-phenotype relations in cancer that other methods miss.

Introduction

Recurring DNA alterations have been recognized as a hallmark of cancer for over a century [1], and observed in astrocytoma brain cancer for decades, without being translated into clinical use [2]. Meanwhile, the prognosis, diagnosis, and treatment of astrocytoma have remained largely unchanged. Temozolomide, the one drug that progressed from trials to standard of care, modestly improves the 1-year median survival time of grade IV astrocytoma, i.e., GBM, by less than three months [3].
This is despite advances in genomic profiling technologies and the growing number of publicly available genomic data [4, 5]. Only recently a copy-number genotype predictive of an astrocytoma survival phenotype was discovered, and only by using the GSVD to compare patient-matched primary adult GBM and, separately, grades III and II, i.e., lower-grade astrocytoma (LGA) tumor and normal genomes, profiled by Agilent comparative genomic hybridization (CGH) and Affymetrix single nucleotide polymorphism (SNP) microarray platforms, respectively [6, 7]. Note that primary GBM and LGA are different types of cancer. Their histopathologies overlap, and GBM is distinguished from LGA by the presence of necrosis or microvascular proliferation in the tumor. Their epidemiologies, however, differ, including the distributions of the results of existing tests, i.e., for MGMT promoter methylation and IDH1 mutation, and, therefore, also the distributions of treatments, i.e., chemotherapy and radiation [8].

To test the mathematical universality and biological consistency of the tumor-exclusive genotype and phenotype, here we use the GSVD to additionally compare WGS read-count profiles of astrocytoma tumor and patient-matched normal DNA [9] from the Cancer Genome Atlas (TCGA). We used the same computational workflow to construct the WGS astrocytoma set of patients as we previously used to construct the Agilent GBM and Affymetrix LGA discovery and validation sets (Methods and Figure 5.5). The resulting tumor and normal datasets have the structure of two matrices of N = 85 matched columns, i.e., patients, and M1 = 2,827,037 and M2 = 2,828,152 rows, i.e., tumor and normal 1K-nucleotide bins [10, 11]. The WGS technology complements the CGH and SNP microarray platforms to represent the main genomic profiling technologies.
Note that each technology relies on a specific experimental design and a specialized computational protocol, which is sensitive to perturbations to the data, e.g., due to changes in the experimental batch or the computational preprocessing [12, 13, 14]. This has contributed to a low reproducibility, <70%, between technical replicates of the same sample and <50% between computational assessments of the same raw data, in assigning copy-number variations (CNVs) in normal DNA [15] or copy-number alterations (CNAs) in tumor DNA.

The WGS set of bins, while different from the Agilent CGH and Affymetrix SNP sets of probes, provides a high-resolution representation of the human genome, like the CGH and SNP sets. The ≈2.8M bins, across the autosome and the X chromosome, include almost all of the 213K CGH and 934K SNP probes. In addition, the bins fill in gaps in the genome that are not covered by either set of probes, mostly in genomic regions of constitutive heterochromatin domains, e.g., the centromeres and telomeres.

The WGS astrocytoma set of patients, while different from the mutually exclusive Agilent GBM and Affymetrix LGA discovery and validation sets of patients, statistically represents the astrocytoma patient population at large, like the GBM and LGA sets represent the GBM and LGA populations, respectively. The representation is in terms of both disease and normal phenotypes, e.g., gender and ethnicity, while reflecting biases against surgical resections in patients >75 years old or of diffuse tumors, which affect mostly GBM or LGA patients, respectively. The 85 WGS astrocytoma patients include ≈61%, 28%, and 11% primary GBM and grades III and II astrocytoma patients, diagnosed at the median ages of 60, 50, and 31 years, and with median survival times of 15, 58, and 63 months, respectively. IDH1 mutation was detected in 15%, 48%, and 86% of the tested GBM and grades III and II astrocytoma patients, respectively.
Treatment by chemotherapy was noted for 77% GBM and 55% LGA patients. There are 62% male and 38% female patients. Of the 85 WGS astrocytoma patients, 24, i.e., ≈28%, complement the discovery sets of 251 GBM and 59 LGA patients. Of these 24 patients, 14 complement the validation sets of 184 GBM and 74 LGA patients, and include GBM as well as grades III and II astrocytoma patients. The WGS astrocytoma tumor and patient-matched normal datasets, while different from the Agilent GBM and Affymetrix LGA datasets, represent a range of approaches to tissue collection from 1993 to 2012 as well as DNA extraction and genomic characterization, like the Agilent GBM and Affymetrix LGA datasets. Participating in generating the data were 18 TCGA tissue source sites (TSSs), two biospecimen core resources (BCRs), and three genomic characterization centers (GCCs), employing two different types of DNA sequencing instruments. Even while controlling for intratumor heterogeneity, TCGA parameters, e.g., the tumor sample’s volume, can span approximately two orders of magnitude. We find that, first, the GSVD identifies the same genotype-phenotype relation as significant in, and exclusive to, the WGS astrocytoma tumor relative to the patient-matched normal profiles, here as in the previous GSVDs of Agilent GBM and, separately, Affymetrix LGA tumor and normal profiles. The identification is invariably blind to, i.e., without a priori information about, the clinical labels of the patients, the experimental labels of the samples, or the genomic coordinates of the bins or probes. This identified relation is invariably robust to perturbations to the minimally preprocessed data and independent of intratumor heterogeneity as it is reflected in the TCGA parameters. Second, independent of the profiling technology, the GSVD blindly separates the tumor-exclusive genotype-phenotype relation from experimental batch effects. 
Affecting the WGS data, here we find guanine-cytosine (GC) content variations across the genomes that vary in magnitude between TCGA GCC and TSS batches. Affecting the microarray data, previously we found batches of, e.g., hybridization dates, scanners, and plates. Additional separation is from normal relations that are conserved in the tumor, e.g., the X chromosome genotype and the gender phenotype. Note that depending on the technology, this relation is represented in the data as a male-specific deletion or a female-specific amplification of the X chromosome relative to the autosome or the normal male genome, respectively. Third, the tumor-exclusive genotype invariably predicts the phenotype of astrocytoma survival and response to chemotherapy and radiation statistically better than and independent of any other indicator, test, and treatment, here, for the WGS astrocytoma set of patients, as it did previously for the mutually exclusive Agilent GBM and Affymetrix LGA discovery and validation sets of patients.

We, therefore, conclude that the tumor-exclusive genotype-phenotype relation is appropriate for the adult astrocytoma population at large, and suitable for all genomic profiling technologies. That is, that the GSVD formulated as a comparative spectral decomposition underlies a mathematically universal description of the genotype-phenotype relations in astrocytoma.

The GSVD as a Comparative Spectral Decomposition

Given two column-matched but row-independent real matrices $D_i \in \mathbb{R}^{M_i \times N}$, each with full column rank $N \leq M_i$, the GSVD is an exact simultaneous factorization [16, 17, 18, 19],

\[
D_i = U_i \Sigma_i V^T = \sum_{n=1}^{N} \sigma_{i,n} \, u_{i,n} \otimes v_n^T, \quad i = 1, 2, \tag{5.1}
\]

where $U_i \in \mathbb{R}^{M_i \times N}$ are real and column-wise orthonormal and $V^T \in \mathbb{R}^{N \times N}$ is real, invertible, and with normalized rows. The $2N$ positive generalized singular values are arranged in $\Sigma_i = \mathrm{diag}(\sigma_{i,n}) \in \mathbb{R}^{N \times N}$ in a decreasing order of the ratio $\sigma_{1,n}/\sigma_{2,n}$.
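The factorization of Equation (5.1) can be checked numerically on small matrices. The following Python sketch is an illustration only, and not the implementation used in this work: the function name `gsvd`, the construction from a QR factorization of the stacked matrices followed by an SVD of the upper block, and the random toy matrices are assumptions made for the example, and NumPy is assumed to be available. The angle computed at the end is the GSVD angular distance that the text defines next in Equation (5.2).

```python
import numpy as np


def gsvd(D1, D2):
    """GSVD of two column-matched matrices, each with full column rank.

    Returns U1, U2, sigma1, sigma2, Vt with D_i = U_i @ diag(sigma_i) @ Vt,
    orthonormal columns in U_i, and unit-norm rows in Vt.
    """
    M1 = D1.shape[0]
    # QR of the stacked matrix: [D1; D2] = [Q1; Q2] R, with R invertible.
    Q, R = np.linalg.qr(np.vstack([D1, D2]))
    Q1, Q2 = Q[:M1], Q[M1:]
    # Cosine-sine decomposition of (Q1, Q2) obtained from the SVD of Q1.
    U1, c, Wt = np.linalg.svd(Q1, full_matrices=False)
    Q2W = Q2 @ Wt.T                     # columns are mutually orthogonal
    s = np.linalg.norm(Q2W, axis=0)     # satisfies c**2 + s**2 == 1
    U2 = Q2W / s
    # Shared right basis, rescaled so that every row has unit norm.
    Vt = Wt @ R
    row_norms = np.linalg.norm(Vt, axis=1)
    Vt = Vt / row_norms[:, None]
    return U1, U2, c * row_norms, s * row_norms, Vt


# Toy example: 8 and 6 rows (e.g., probes or bins), 3 matched columns (patients).
rng = np.random.default_rng(0)
D1 = rng.normal(size=(8, 3))
D2 = rng.normal(size=(6, 3))
U1, U2, s1, s2, Vt = gsvd(D1, D2)

# Angular distance per component, from the ratio of generalized singular values.
theta = np.arctan(s1 / s2) - np.pi / 4
```

Because the SVD returns the cosines in decreasing order, the ratios of the generalized singular values come out in decreasing order as well, matching the ordering stated above. The sketch divides by the sines, so it relies on the second matrix having full column rank.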
The GSVD is unique up to phase factors of $\pm 1$ of each triplet of corresponding column and row basis vectors, i.e., $u_{i,n}$ and $v_n$, except in degenerate subspaces defined by subsets of pairs of generalized singular values of equal ratios, i.e., $\sigma_{1,n}/\sigma_{2,n}$. The GSVD generalizes the SVD from one to two matrices. Like the SVD, the GSVD is a mathematical building block of algorithms, e.g., for solving the problem of constrained least squares in algebra [20], and theories, e.g., for describing oscillations near equilibrium in classical mechanics [21].

We formulated the GSVD as a comparative spectral decomposition that can simultaneously identify the similar and dissimilar between two column-matched but row-independent matrices, and, therefore, create a single coherent model from two datasets recording different aspects of interrelated phenomena [22, 23]. This formulation [24, 25, 26, 27] is possible because the GSVD is exact, exists, and has uniqueness properties that directly generalize those of the SVD [28, 29] (Theorem 1). The only assumption is that there exists a one-to-one mapping between the columns of the matrices but not necessarily between their rows.

We defined the significance of the row basis vector $v_n$ and the corresponding column basis vector $u_{i,n}$ in the corresponding matrix $D_i$, i.e., the “generalized fraction” $p_{i,n}$, to be proportional to the corresponding generalized singular value $\sigma_{i,n}$, and the “generalized normalized Shannon entropy” of $D_i$ to be proportional to the arithmetic mean of $p_{i,n} \log p_{i,n}$ (Figure 5.6). We defined the significance of $v_n$ and $u_{1,n}$ in $D_1$ relative to that of $v_n$ and $u_{2,n}$ in $D_2$, i.e., the “GSVD angular distance,” to be a function of the ratio $\sigma_{1,n}/\sigma_{2,n}$ that, from the cosine-sine decomposition, is related to an angle (Figure 5.1),

\[
-\pi/4 < \theta_n = \arctan(\sigma_{1,n}/\sigma_{2,n}) - \pi/4 < \pi/4. \tag{5.2}
\]

Note that the angular distances $\theta_n$ are different from the principal angles corresponding to canonical correlations, like the GSVD is different from canonical correlations analysis (CCA) [30].

A unique row basis vector $v_n$ that is significant in either $D_1$ or $D_2$, and with an angular distance of $\theta_n \approx \pm\pi/4$, which corresponds to a ratio of $\sigma_{1,n}/\sigma_{2,n} \gg 1$ or $\ll 1$, respectively, is mathematically approximately exclusive to either $D_1$ or $D_2$, and for consistency should be interpreted with the corresponding column basis vector $u_{1,n}$ or $u_{2,n}$ to represent phenomena exclusive to either the first or the second dataset. A unique row basis vector $v_n$ that is significant in both $D_1$ and $D_2$, and with an angular distance of $\theta_n \approx 0$, which corresponds to $\sigma_{1,n}/\sigma_{2,n} \approx 1$, is mathematically common to $D_1$ and $D_2$, and should be interpreted with both $u_{1,n}$ and $u_{2,n}$ to represent phenomena common to both datasets.

Mathematically invariant under the exchange of the two matrices or the reordering of the pairs of matched columns or the rows, the GSVD is also blind to the labels of the matrices, the columns, and the rows. These labels are only used to interpret the row and column basis vectors in terms of the phenomena recorded in the datasets.

Astrocytoma Tumor-Exclusive Genotype and Phenotype

The second most tumor-exclusive row basis vectors uncovered by the previous GSVDs of patient-matched Agilent GBM and, separately, Affymetrix LGA tumor and normal profiles are also the first and third most significant in the GBM and LGA tumor genomes, respectively. By using the clinical labels of the previous discovery sets of patients in survival analyses, these second row basis vectors were shown to separate subsets of patients of an approximately 1-year median survival time from the complement subsets of median survival times of three years in GBM and five years in LGA.
The corresponding second GBM and LGA tumor column basis vectors, i.e., patterns, were shown to similarly separate subsets of patients of an approximately 1-year median survival time from the previous validation sets of patients. By using the genomic coordinates of the microarray probes in segmentation analyses, the GBM and LGA patterns were shown to describe similar genome-wide patterns of co-occurring DNA CNAs that encode for opportunities for transformation via the Ras and Shh pathways. The GBM pattern, which encompasses the LGA pattern, such that these opportunities are enhanced in GBM relative to LGA, includes most CNAs that were known as well as several that were unrecognized in GBM prior to its discovery. We found that the GBM pattern predicts GBM survival statistically better than any one CNA that it identifies, and that none of the previously known CNAs was correlated with GBM survival. We, therefore, suggested that the astrocytoma survival phenotype is an outcome of its global genotype. Here we find that the second tumor column basis vector uncovered by the GSVD of the WGS profiles is the second most significant in as well as exclusive to the astrocytoma tumor relative to the normal genomes and describes the same genotype (Figure 5.2). To compare the corresponding WGS astrocytoma pattern to the Agilent GBM and Affymetrix LGA patterns, we used the genomic coordinates of the WGS bins, and classified the 111 genomic segments of at least five Agilent probes in length, previously identified in the Agilent GBM pattern, as amplified, unaltered, or deleted in the WGS astrocytoma pattern in addition to the Affymetrix LGA pattern. The classification is based upon the differences, in standard deviations, between the relative copy-number means of the segments and the autosome or the chromosomes. 
We find that the WGS astrocytoma pattern is approximately bounded above by the Agilent GBM and below by the Affymetrix LGA pattern; ≳83% of the segments that are amplified or deleted in the WGS astrocytoma pattern are a subset and a superset of, and of a lesser or greater magnitude than, those that are amplified or deleted in the Agilent GBM and Affymetrix LGA patterns, respectively.

An Approximately 1-Year Median Survival Time Phenotype

By using the clinical labels of the patients, we find that the WGS astrocytoma pattern is correlated with the same survival phenotype as the Agilent GBM and Affymetrix LGA patterns (Figure 5.7). Of the 85 patients, 52 are classified as having high weights of the astrocytoma pattern in their tumor profiles based upon the superposition coefficients of the second tumor column basis vector in the column vectors of the tumor dataset. The vector that lists these coefficients is linearly proportional to the second row basis vector. Of the same 85 patients, 54, including 51, i.e., ≈98% of the 52, have high Pearson correlations of their tumor profiles with the pattern. We use the correlation cutoff of 0.15, and compute the coefficient cutoff by scaling 0.15 by the Frobenius norm of the vector that lists the correlations, as was previously established for the Agilent GBM discovery set of patients and validated for the Agilent GBM validation, and Affymetrix LGA discovery and validation sets of patients. In Kaplan-Meier (KM) survival analyses, the subsets of patients with high superposition coefficients and, separately, Pearson correlations are of an approximately 1-year median survival time, statistically significantly shorter than the median survival time of five years of the complement subsets of patients. In Cox proportional hazards models, a high coefficient or, separately, correlation confers ≈8 times the hazard of a low coefficient or correlation, respectively.
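The classification by Pearson correlation and the subsequent KM comparison can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the analysis code of this work: the function names, the four-element toy pattern, and the ten invented patients with their follow-up times are hypothetical; only the correlation cutoff of 0.15 is taken from the text. The sketch computes a KM median survival time and a two-group log-rank P-value from the usual chi-square approximation with one degree of freedom; Cox models are not shown.

```python
import math

CUTOFF = 0.15  # correlation cutoff stated in the text


def pearson(x, y):
    """Pearson correlation over positions where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def km_median(times, events):
    """KM median: first time at which the survival estimate is <= 0.5."""
    surv = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
        if surv <= 0.5:
            return t
    return None  # median not reached


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank P-value (chi-square, one degree of freedom)."""
    observed = expected = variance = 0.0
    everyone = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    for t in sorted({u for u, e in everyone if e}):
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed += d_a
        expected += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance
    return math.erfc(math.sqrt(chi2 / 2.0))


# Toy data (invented): each profile mixes the pattern with an unrelated vector.
pattern = [1.0, -1.0, 0.5, -0.5]
other = [1.0, 1.0, -1.0, -1.0]
# (weight of pattern, weight of other, months of follow-up, death observed)
cohort = [(1.0, 0.2, 6, 1), (0.9, 0.3, 9, 1), (1.1, 0.1, 12, 1),
          (0.8, 0.2, 15, 1), (1.0, 0.4, 20, 0), (0.05, 1.0, 30, 1),
          (0.0, 0.9, 48, 0), (-0.1, 1.0, 60, 1), (0.02, 1.1, 72, 1),
          (0.0, 1.0, 80, 0)]
high, low = [], []
for a, b, months, died in cohort:
    profile = [a * p + b * o for p, o in zip(pattern, other)]
    group = high if pearson(profile, pattern) > CUTOFF else low
    group.append((months, died))

high_t, high_e = zip(*high)
low_t, low_e = zip(*low)
median_high = km_median(high_t, high_e)
median_low = km_median(low_t, low_e)
p_value = logrank_p(high_t, high_e, low_t, low_e)
```

The quadratic loops are kept for readability. For a cohort of realistic size, a dedicated survival-analysis package would be the more appropriate tool, and it would also provide the Cox proportional hazards models reported here.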
A Genotype Encoding for Transformation via the Notch Together with the Ras and Shh Pathways

By filling in gaps in the genome that are not covered by either the Agilent or the Affymetrix probes, the WGS astrocytoma pattern adds to the description of the genotype that corresponds to the 1-year survival phenotype. We find amplifications previously unrecognized in astrocytoma that encode for increased cell communication via the canonical Notch pathway in support of transformation via the Ras and Shh as well as the hominin-specific Notch pathway (Figure 5.3). The largest of the 111 segments, which spans ≈79M nucleotides on chromosome 1 across the bands 1p31.1-q23.3, is classified as unaltered in the WGS pattern, the same as in the microarray patterns. The segment contains the two largest gaps between the microarray probes on chromosome 1. The largest, a 23M-nucleotide gap (1p11.2-q21), includes the centromere. Circular binary segmentation (CBS) [31] of the WGS pattern identified a 21M-nucleotide segment (1p11.2-q12) within the gap, which is classified as amplified. At 739K nucleotides from the 5’ end of the gene NOTCH2 (1p12-p11.2), the amplification is within its promoter region [32]. Similarly, a 140K-nucleotide gap (9q34.3), which includes the 9q telomere, overlaps 79K of a 104K-nucleotide amplified segment in the promoter region of NOTCH1 at 1.6M nucleotides from its 5’ end. These amplifications within the promoter regions, rather than of the genes, encode for overexpression of wild-type NOTCH1/2 [33, 34]. Three genes in the core Notch pathway are on two of the 111 segments, which are approximately coextensive with 19q and 20p and are amplified in the GBM but not the LGA or astrocytoma patterns. The ligand-encoding JAG1 and DLL3 are involved in sending, and PSENEN in receiving, the Notch signals. These amplifications encode for overactivation of Notch in GBM.
Note that the co-deletion of 1p and 19q, which can underactivate Notch, is associated with an oligodendroglioma brain cancer patient’s longer survival. Segmentation of the WGS pattern also identifies a 76K-nucleotide segment within the second largest gap on chromosome 1 (1q21.2). The segment, which is classified as amplified, maps to the neuroblastoma breakpoint family gene NBPF14, so-called because NBPF1 (1p36.13) was discovered in a screen for genes disrupted by a translocation in a neuroblastoma brain cancer patient’s normal genome [35]. The segment includes 38 repeats of a 1.5K-nucleotide sequence that encodes for a copy of the protein domain of unknown function 1220 (DUF1220) [36]. At 2.3M nucleotides from the 5’ end of the hominin-specific NOTCH2NL (1q21.1), the amplification is within its promoter region and encodes for its overexpression.

Overactivation of the canonical Notch pathway supports human normal to tumor cell transformation via the Ras and Shh as well as the hominin-specific Notch pathway. In response to Ras-mediated growth signals, wild-type NOTCH1/2 upregulate the cell cycle-promoting cyclin-dependent kinase (CDK) encoded by CDK4 and block the cell cycle arrest, apoptosis, and senescence-promoting CDK inhibitors p16INK4A and p15INK4B encoded by CDKN2A/B [37, 38, 39, 40]. Note that in the absence of CDK inhibitors, DNA-damaged cells acquire deformed polyploid nuclei [41, 42]. In response to Shh-mediated developmental signals, NOTCH1/2 facilitate the clearance of the tumor suppressor Ptch1, the concurrent accumulation of the Shh signal-transducing protein encoded by SMO, and the increased downstream conversion of the proteins encoded by the oncogenes GLI1/3 into cell cycle transcriptional activators [43, 44]. Note that Notch is critical for an Shh-induced medulloblastoma brain cancer tumor’s development [45]. In the hominin-specific Notch pathway, NOTCH2NL can act as a ligand-independent NOTCH1/2 [46].
Note that overexpression of NOTCH2NL and gain of DUF1220 are associated with an increased brain size, both developmentally within the human and evolutionarily within the primate population [47]. We also find consistency between the DNA CNAs and mRNA expression, which additionally supports the astrocytoma tumor-exclusive genotype-phenotype relation [48]. Of the 29 genes highlighted, 19 are overexpressed or underexpressed in the subset of tumors that have high weights of the WGS astrocytoma pattern in their profiles, with the corresponding Mann-Whitney-Wilcoxon (MWW) P-values <0.05. This subset of tumors corresponds to the subset of patients who have the approximately 1-year survival phenotype. Of these 19 genes, 16, i.e., ≈84%, consistently map to amplifications or deletions in the tumor-exclusive genotype (Figures 5.8–5.11).

Blind Separation from Normal and Experimental Sources of Copy-Number Variation

By using the experimental labels of the DNA samples, we find that the GSVD blindly, i.e., without a priori information, separates the astrocytoma tumor-exclusive genotype and phenotype from CNVs common to the normal and tumor genomes and from experimental variations specific to the minimally preprocessed WGS profiles. These include the effects of the GC content variations across the tumor and normal genomes that vary in magnitude between experimental batches. The first tumor and 85th normal column basis vectors are the most significant in and exclusive to, and are correlated with the fractional GC content across the tumor and normal genomes, respectively, with both correlations ≳0.78 and both MWW P-values <$10^{-10}$ (Figures 5.12–5.14). Both vectors roughly describe frequent spikes of reduced copy numbers superimposed on an invariant baseline in agreement with the PCR amplification-dependent WGS technology underestimating the abundance of GC-poor sequences.
The corresponding first and 85th row basis vectors are correlated with experimental variations in the GCC of the tumor and TSS of the normal DNA with both hypergeometric and both MWW P-values <$10^{-2}$ (Figure 5.15). The 82nd row basis vector is the second and fifth most significant in the normal and tumor genomes, respectively, and approximately common to both. The vector classifies the patients by gender with both hypergeometric and MWW P-values <$10^{-13}$ (Figure 5.16). Both normal and tumor 82nd column basis vectors describe a deletion of the X chromosome with both MWW P-values <$10^{-10}$ (Figures 5.17–5.19). While the deletion is dominant in the normal and tumor genomes of the 53 male patients, it is missing from the astrocytoma pattern, where the X chromosome is classified as unaltered, the same as in the GBM and LGA patterns.

The Tumor-Exclusive Genotype Predicts the Survival Phenotype Statistically Better than Any Other Indicator

Because the Agilent GBM pattern encompasses the WGS astrocytoma and Affymetrix LGA patterns in the number and magnitude of CNAs, and because it was derived from the largest discovery set, i.e., of 251 patients, we additionally classified the 85 WGS astrocytoma patients based upon the correlations of the Agilent GBM pattern with their WGS astrocytoma tumor profiles. We find that the Agilent GBM pattern predicts survival statistically better than and independent of the best other indicators, i.e., the patient’s age and tumor’s grade, [49] and survival and response to treatments, i.e., chemotherapy and radiation, better than the existing tests, i.e., for MGMT promoter methylation and IDH1 mutation [50, 51].
In KM analyses and Cox models of the patients, the pattern identifies the biologically consistent survival phenotype with greater median survival time differences, hazard ratios, and concordance indices, i.e., accuracies, and lesser log-rank P-values than either indicator or test (Figure 5.4), and, in KM analyses and Cox models of the treated patients, better than either test (Figure 5.20). The bivariate hazard ratios of the pattern and either indicator are within the 95% confidence intervals of the corresponding univariate ratios (Table 5.1). The pattern is also independent of intratumor heterogeneity as it is reflected in the TCGA parameters of the tumor sample’s volume, the slide’s percent tumor cells and percent tumor nuclei, the portion’s weight, and the analyte’s and aliquot’s DNA concentrations. This is consistent with the classifications based upon the WGS astrocytoma pattern, where the median survival time differences are the same as and the hazard ratios are within the 95% confidence intervals of those based upon the Agilent GBM pattern. This is also consistent with the classifications of an Affymetrix set of 497 astrocytoma patients and, separately, an Agilent set of 364 GBM patients, from the previous discovery and validation sets of GBM and LGA patients, based upon their Affymetrix and Agilent tumor profiles, respectively, where the pattern is independent of each treatment, indicator, and test (Figures 5.21 and 5.22, Tables 5.2 and 5.3). That the tumor-exclusive genotype-phenotype relation is statistically independent of the current indicators, tests, and treatments of astrocytoma implies that the information contained in the relation is not currently being used in clinical practice. This information includes, e.g., biochemically putative drug targets and combinations of drug targets that are predicted to be correlated with outcome.
Using this information in clinical practice can, therefore, be expected to improve the prognostics, diagnostics, and therapeutics of the disease.

Discussion

That the astrocytoma tumor-exclusive genotype-phenotype relation is invariably uncovered by, and only by, the GSVD, independent of the profiling technology and the astrocytoma grade, highlights the role of mathematics in genomic data science and machine learning. Unlike most other analyses, the GSVD uses minimally preprocessed genomic data without feature engineering. This accounts for the robustness of the GSVD to perturbations to the data, and is possible because of its scalability to petabyte-sized data. Other analyses often standardize the data based upon assumptions that may confound the data and contribute to the low reproducibility noted in genomic profiling. Unlike most other analyses, the GSVD uses the patient-matched normal data to analyze the tumor data, including tumor genomic regions of normal CNVs, e.g., the X chromosome. This makes the GSVD sensitive to robust genotype-phenotype relations in small discovery sets of only, e.g., 251, 59, and 85 patients, and possibly imbalanced validation sets of, e.g., 184 and 74 patients, with large genomic profiles of, e.g., 213K, 934K, and 2.8M probes or bins each. This is possible because the GSVD uses the structure of the tumor and normal datasets, of two column-matched but row-independent matrices, in the blind source separation (BSS) [52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64] of the tumor-exclusive from the normal genotype-phenotype relations and from experimental batch effects. Patient-matched normal CNVs are often missing from other analyses of tumor CNAs, even though CNVs overlap ≈12% of the normal human genome, [65] where they are $10^2$–$10^4$ times more frequent than point mutations, [66] and are associated with both tumor and normal development [67, 68, 69].
When other analyses use patient-matched normal data, it is to standardize the tumor data. This reduces the structure of the data to that of one matrix, and some of the information regarding the similar and dissimilar between the tumor and normal genomes may be lost. The GSVD as a comparative spectral decomposition [22, 23] has been extended from two to multiple matrices and, separately, two tensors [24, 25, 26]. A recent tensor GSVD comparison of ovarian cystadenocarcinoma tumor and patient- and microarray platform-matched normal copy-number profiles uncovered chromosome arm-wide patterns of tumor-exclusive platform-consistent CNAs that predict survival and response to chemotherapy. We conclude that comparative spectral decompositions, such as the GSVD, underlie a mathematically universal description of the genotype-phenotype relations in cancer that other methods miss.

Acknowledgements

We thank R. A. Horn for insightful discussions of matrix analysis, M. P. Scott and R. A. Weinberg for thoughtful comments on the Shh and Ras signaling pathways, H. A. Hanson for careful reviews of survival analysis, R. L. Jensen and C. A. Palmer for helpful notes on astrocytoma intratumor heterogeneity, C. T. Wittwer for helpful notes on PCR technology, and J. S. Barnholtz-Sloan, J. Bowen, K. Devine, J. M. Gastier-Foster, K. M. Leraas, K. R. Mills Shaw, and J. C. Zenklusen for useful exchanges on TCGA. This work was funded by National Cancer Institute (NCI) U01 grant CA-202144 and by Utah Science, Technology, and Research (USTAR) Initiative support, both to O. A. S. P. P. and O. A. are co-founders of and equity holders in Eigengene, Inc. This does not alter our adherence to the policies of APL Bioengineering on sharing data and materials.
Supplementary Information

Construction of the WGS Astrocytoma Tumor and Patient-Matched Normal Datasets

We obtained WGS binary alignment map (BAM) files of primary adult astrocytoma tumor and patient-matched normal DNA from a set of 52 GBM and 33 LGA patients from TCGA at the Genomic Data Commons (GDC) together with the clinical labels of the patients and experimental labels of the corresponding DNA samples [4, 5]. The total size of the TCGA raw level 1 BAM files is ≈23 terabytes or 0.02 petabytes. In each BAM file, we counted the number of Illumina HiSeq 2000- or Genome Analyzer II-measured sequence reads that map to each nonoverlapping 1K-nucleotide bin across the autosome and the X chromosome of the reference human genome hg19 by using copy-number estimation by a mixture of Poissons (CN.MOPS) [10, 11]. Each profile lists the $\log_2$ of the positive read counts, centered at the median of the bins that map to the autosome and are with positive counts in the corresponding BAM file, across the tumor, and, separately, normal bins with positive counts in all tumor or normal BAM files, respectively. We used the same computational workflow to construct the WGS astrocytoma set of patients as we previously used to construct the Agilent GBM and Affymetrix LGA discovery and validation sets [6, 7] (Figure 5.5). The resulting tumor and normal datasets have the structure of two matrices of $N = 85$ matched columns, i.e., patients, and $M_1 = 2,827,037$ and $M_2 = 2,828,152$ rows, i.e., tumor and normal bins. Of the 85 patients, 24, i.e., ≈28%, complement the previous discovery sets of 251 GBM and 59 LGA patients. Of the 24 patients, 14 complement the previous validation sets of 184 GBM and 74 LGA patients. The ≈2.8M bins, across the autosome and the X chromosome, include almost all of the 213K CGH and 934K SNP probes. In addition, the bins fill in gaps in the genome that are not covered by either set of probes.
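The construction of the profiles from binned read counts can be sketched as follows. This is an illustration under stated assumptions, not the original workflow: the read counting itself (done with CN.MOPS in the text) is taken as given, the centering is assumed to subtract the median of the $\log_2$ counts over the autosomal bins with positive counts, and the function and argument names are hypothetical.

```python
import math
from statistics import median

def log2_centered_profile(counts, is_autosomal):
    """log2 of the positive read counts of one sample, centered at the
    median over the autosomal bins with positive counts.
    Bins with nonpositive counts are returned as None."""
    logs = [math.log2(c) if c > 0 else None for c in counts]
    center = median(v for v, auto in zip(logs, is_autosomal)
                    if auto and v is not None)
    return [None if v is None else v - center for v in logs]

def build_dataset(count_columns, is_autosomal):
    """Keep only the bins with positive counts in every sample.

    count_columns: one list of per-bin read counts per patient.
    Returns (kept_bin_indices, profiles), with one profile per patient,
    i.e., one column of the dataset matrix."""
    profiles = [log2_centered_profile(c, is_autosomal) for c in count_columns]
    keep = [i for i in range(len(is_autosomal))
            if all(p[i] is not None for p in profiles)]
    return keep, [[p[i] for i in keep] for p in profiles]
```

Applying `build_dataset` separately to the tumor and to the normal read counts would give two datasets with matched columns but, in general, different sets of kept bins, in line with the unequal $M_1$ and $M_2$ reported above.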
Note that bin sizes below 1K nucleotides would have increased the sparsity of the 30X- to 60X-coverage WGS profiles, whereas bin sizes above 1K nucleotides would have reduced the resolution of the WGS profiles relative to the previous Agilent and Affymetrix microarray profiles as well as the ≈25K-nucleotide median size of a human protein-encoding gene. We find that, like the GSVDs of the microarray profiles, the GSVD of the WGS profiles is robust to, i.e., quantitatively similar and qualitatively the same under, perturbations to the datasets, e.g., due to changes in the preprocessing of the BAM files, including changing the bin size in the range of 100–2.5K nucleotides. Note that the robustness of the GSVD, the unique and significant row and column basis vectors, and the interpretations of the vectors, implies robustness of the single coherent model that the GSVD creates from the WGS profiles.

Formulation of the GSVD as a Comparative Spectral Decomposition

That the GSVD (Figure 5.1) can simultaneously identify the similar and dissimilar between two column-matched but row-independent matrices, and, therefore, create a single coherent model from two datasets recording different aspects of interrelated phenomena, [22, 23] is possible because the GSVD is exact, exists, and has uniqueness properties that directly generalize those of the SVD [28, 29].

Theorem 1 (Uniqueness properties of the GSVD) The GSVD of two column-matched but row-independent real matrices $D_i \in \mathbb{R}^{M_i \times N}$, each with full column rank $N \le M_i$, is unique up to phase factors of ±1 of each triplet of corresponding column and row basis vectors, i.e., $u_{i,n}$ and $v_n$, except in degenerate subspaces defined by subsets of pairs of generalized singular values of equal ratios, i.e., $\sigma_{1,n}/\sigma_{2,n}$.
Proof (first proof). Consider the GSVD as it is computed by using the eigenvalue decomposition of the balanced arithmetic mean of all pairwise quotients of $A_1 = D_1^T D_1$ and $A_2 = D_2^T D_2$, i.e., $S = \frac{1}{2}[A_1 A_2^{-1} + (A_1 A_2^{-1})^{-1}]$. We proved that the eigenvectors of $S$ can be used to compute the normalized row basis vectors $v_n$, such that the eigenvalue decomposition gives $SV = V\Lambda$, where $\Lambda = \mathrm{diag}(\lambda_n)$ [24, 25]. We also proved that for normalized $v_n$, the eigenvalues satisfy $\lambda_n = \frac{1}{2}[(\sigma_{1,n}/\sigma_{2,n})^2 + (\sigma_{1,n}/\sigma_{2,n})^{-2}] \ge 1$, where $(\sigma_{1,n}/\sigma_{2,n})^2$ and $(\sigma_{1,n}/\sigma_{2,n})^{-2}$ are the eigenvalues of $A_1 A_2^{-1}$ and $(A_1 A_2^{-1})^{-1}$, respectively. The corresponding column basis vectors $u_{1,n}$ and $u_{2,n}$ and generalized singular values $\sigma_{1,n}$ and $\sigma_{2,n}$ can, therefore, be computed by normalizing the columns of $D_1 V^{-T}$ and $D_2 V^{-T}$, and arranged, together with $v_n$, in $U_1$ and $U_2$, $\Sigma_1$ and $\Sigma_2$, and $V^T$, respectively, in a decreasing order of $\sigma_{1,n}/\sigma_{2,n}$. The uniqueness properties of the GSVD follow from the uniqueness properties of the eigenvalue decomposition of $S$.

Proof (second proof). Consider the GSVD as it is computed by using the QR decomposition of the appended $D_1$ and $D_2$, followed by the SVD of the block of the column-wise orthonormal $Q$ that corresponds to $D_1$, i.e., $Q_1$,

$$
\begin{bmatrix} D_1 \\ D_2 \end{bmatrix} = QR = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix} R
= \begin{bmatrix} U_{Q_1} \Sigma_{Q_1} V_{Q_1}^T R \\ Q_2 R \end{bmatrix}
= \begin{bmatrix} U_{Q_1} \Sigma_{Q_1} V_{Q_1}^T R \\ U_{Q_2} \Sigma_{Q_2} V_{Q_1}^T R \end{bmatrix},
\tag{5.3}
$$

where $R$ is upper triangular [19, 20]. Since $D_1$ and $D_2$ are with full column rank, then $Q_1$ and $Q_2$ are also with full column rank, $V_{Q_1}^T$ is orthonormal, and $\Sigma_{Q_1}$ is positive diagonal. It follows from Equation 5.3 that the diagonal $\Sigma_{Q_2} = (I - \Sigma_{Q_1}^2)^{1/2}$ is also positive, and that $U_{Q_2} = Q_2 V_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}$ is column-wise orthonormal,

$$
\begin{aligned}
I &= Q^T Q = Q_1^T Q_1 + Q_2^T Q_2 = V_{Q_1} \Sigma_{Q_1}^2 V_{Q_1}^T + Q_2^T Q_2, \\
\Sigma_{Q_2}^2 &= I - \Sigma_{Q_1}^2 = (Q_2 V_{Q_1})^T (Q_2 V_{Q_1}) > 0, \\
U_{Q_2}^T U_{Q_2} &= I = [Q_2 V_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}]^T [Q_2 V_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}].
\end{aligned}
\tag{5.4}
$$

That is, the SVD of $Q_1$ also defines an SVD of $Q_2$, where the singular values are arranged in $\Sigma_{Q_2}$ in an increasing order, because the singular values of $Q_1$ are arranged in $\Sigma_{Q_1}$ in a decreasing order. It follows from Equation 5.4 then that the SVD of $Q_1$ factorizes $D_1$ and $D_2$ into the GSVD,

$$
\begin{aligned}
U_1 &= U_{Q_1}, \qquad U_2 = U_{Q_2}, \\
\Sigma_1 &= \Sigma_{Q_1} \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{1/2}, \\
\Sigma_2 &= (I - \Sigma_{Q_1}^2)^{1/2} \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{1/2}, \\
V^T &= \{\mathrm{diag}[(V_{Q_1}^T R)(V_{Q_1}^T R)^T]\}^{-1/2} V_{Q_1}^T R,
\end{aligned}
\tag{5.5}
$$

where $U_1$ and $U_2$ are column-wise orthonormal, $\Sigma_1$ and $\Sigma_2$ are positive diagonal, and $V^T$, identical in both factorizations, has normalized rows. The positive generalized singular values are arranged in $\Sigma_1 \Sigma_2^{-1} = \Sigma_{Q_1} (I - \Sigma_{Q_1}^2)^{-1/2}$ in a decreasing order. The QR decomposition is unique and, from Equation 5.5, the uniqueness properties of the GSVD follow from the uniqueness properties of the SVDs of $Q_1$ and $Q_2$.

We defined the significance of the row and corresponding column basis vector $v_n$ and $u_{i,n}$ in the corresponding matrix $D_i$ to be the generalized fraction, which is interpreted to represent the fraction of information captured by $v_n$ and $u_{i,n}$ in the corresponding dataset (Figure 5.6). The generalized fraction, of the Frobenius norm of the outer product $\sigma_{i,n} u_{i,n} \otimes v_n^T$ in the norm of $D_i$, is proportional to the corresponding generalized singular value $\sigma_{i,n}$,

$$
p_{i,n} = \frac{\|\sigma_{i,n} u_{i,n} \otimes v_n^T\|^2}{\|D_i\|^2} = \frac{\sigma_{i,n}^2}{\sum_{n=1}^{N} \sigma_{i,n}^2} > 0.
\tag{5.6}
$$

We defined the complexity of $D_i$, i.e., the generalized normalized Shannon entropy, which is interpreted to represent the distribution of information among the row and column basis vectors $v_n^T$ and $u_{i,n}$, to be proportional to the arithmetic mean of $p_{i,n} \log p_{i,n}$,

$$
0 < d_i = -(\log N)^{-1} \sum_{n=1}^{N} p_{i,n} \log p_{i,n} \le 1.
\tag{5.7}
$$

At its lower bound, an entropy of $d_i \to 0$ corresponds to an ordered and redundant dataset, where, as in Equation 5.8, all the information is captured by one row basis vector and the corresponding column basis vector, i.e., $v_1^T$ and $u_{i,1}$,

$$
p_{i,n} \to \begin{cases} 1, & n = 1, \\ 0, & n \ne 1. \end{cases}
\tag{5.8}
$$

An entropy of $d_i = 1$ corresponds to a disordered and random dataset, in which all $v_n^T$ and $u_{i,n}$ are of equal significance and capture equal fractions of the information, i.e., $p_{i,n} = 1/N$ for all $n$.

Here we find that the two most tumor-exclusive, i.e., the first and second row basis vectors, with the angular distances $\theta_1, \theta_2 > \pi/6$, are also the first and second most significant in the tumor dataset, with the generalized fractions $p_{1,1}, p_{1,2} \gtrsim 0.08$. The most normal-exclusive, i.e., the 85th row basis vector, with $\theta_{85} < -\pi/6$, is also the most significant in the normal dataset, with $p_{2,85} > 0.23$. The second most significant row basis vector in the normal dataset, i.e., the 82nd row basis vector, is also the fifth most significant in the tumor dataset, with $p_{1,82}, p_{2,82} > 0.02$. With $|\theta_{82}| < \pi/10$, the 82nd row basis vector is common to both the normal and tumor datasets.

Segmentation of the WGS Astrocytoma Pattern

To compare the WGS astrocytoma pattern to the Agilent GBM and Affymetrix LGA patterns, we mapped the hg18 genomic start and end coordinates of the 130 segments previously identified in the Agilent GBM pattern to the reference human genome hg19, and classified the 111 genomic segments of at least five Agilent probes in length as amplified, unaltered, or deleted in the WGS astrocytoma pattern. We then compared the classifications to the previously computed classifications of the same segments in the Affymetrix LGA and Agilent GBM patterns. To expand upon the description of the tumor-exclusive genotype by the microarray patterns, we segmented the WGS pattern by using CBS, [32] and classified the segments as amplified, unaltered, or deleted in the WGS pattern (Figure 5.6).
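Returning to the formulation of the GSVD, the QR-then-SVD construction of Equations 5.3–5.5 and the generalized fractions and entropy of Equations 5.6 and 5.7 can be sketched numerically. The sketch below uses NumPy and small dense matrices; it assumes both matrices have full column rank, and the function names are hypothetical. The angular distance is not restated in this section, so the form used here, $\theta_n = \arctan(\sigma_{1,n}/\sigma_{2,n}) - \pi/4$, is an assumption.

```python
import numpy as np

def gsvd_qr(D1, D2):
    """GSVD of two column-matched matrices via QR followed by an SVD.

    Returns U1, s1, U2, s2, Vt with D_i = U_i @ diag(s_i) @ Vt.
    Assumes D1 and D2 both have full column rank (otherwise s2 may be zero).
    """
    M1 = D1.shape[0]
    # QR of the appended matrices: Q is column-wise orthonormal, R upper triangular.
    Q, R = np.linalg.qr(np.vstack([D1, D2]))
    Q1, Q2 = Q[:M1], Q[M1:]
    # SVD of the block of Q that corresponds to D1 (singular values decreasing).
    UQ1, sQ1, VQ1t = np.linalg.svd(Q1, full_matrices=False)
    # Q1^T Q1 + Q2^T Q2 = I fixes the singular values and left vectors of Q2.
    sQ2 = np.sqrt(np.clip(1.0 - sQ1**2, 0.0, None))
    UQ2 = (Q2 @ VQ1t.T) / sQ2
    # Normalize the rows of VQ1^T R and absorb the row norms into the sigmas.
    X = VQ1t @ R
    row_norms = np.linalg.norm(X, axis=1)
    Vt = X / row_norms[:, None]
    return UQ1, sQ1 * row_norms, UQ2, sQ2 * row_norms, Vt

def generalized_fractions(s):
    """Generalized fractions: squared generalized singular values, normalized."""
    return s**2 / np.sum(s**2)

def normalized_entropy(p):
    """Generalized normalized Shannon entropy of the fractions p."""
    return float(-np.sum(p * np.log(p)) / np.log(len(p)))

def angular_distances(s1, s2):
    """Assumed form of the angular distance: arctan(s1/s2) - pi/4."""
    return np.arctan(s1 / s2) - np.pi / 4
```

For datasets of millions of rows, as in the text, a dense QR of this kind would need an out-of-core or distributed implementation; the sketch only illustrates the algebra.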
We then mapped the segments in relation to gaps in the genome that are not covered by either the Agilent or the Affymetrix probes, where the DNA copy number was not measured by the microarrays. A segment is classified as amplified or deleted if the difference between the relative copy-number means of the segment and the autosome is greater than twice the standard deviation of the autosome, or if the difference between the means of the segment and the chromosome it maps to is greater than the standard deviation of the chromosome, and is consistent with the difference between the segment and the autosome. The mean and standard deviation of the autosome are computed excluding the astrocytoma outlying chromosomes 7 and 10 and chromosome arm 9p.

Estimation of the Consistency Between the DNA CNAs and mRNA Expression

We obtained TCGA level 3 Illumina HiSeq 2000 RNA sequencing profiles, which were available for the primary tumors of 62 of the 85 WGS astrocytoma patients. There are 29 genes highlighted in the tumor-exclusive genotype that corresponds to the 1-year survival phenotype. We assessed the mRNA expression of each of these genes in the subset of patients that have high weights of the WGS astrocytoma pattern in their primary tumor DNA copy-number profiles, relative to the complementary subset of patients that have low weights, by using boxplots and computing the corresponding MWW P-values (Figures 5.8–5.11).

Visualization and Interpretation of the Significant Row and Column Basis Vectors

To visualize the first tumor and 85th normal column basis vectors, we segmented the vectors by using CBS (Figures 5.12 and 5.13). To interpret the vectors, we computed the correlations of the first tumor and 85th normal column basis vectors with the vectors that list the $\log_2$ of the fractional GC content across the tumor and normal bins, respectively.
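The rule for classifying a segment as amplified, unaltered, or deleted, stated earlier in this section, can be written as a short sketch. It is one reading of that rule and makes two assumptions: "consistent with" is taken to mean that the segment deviates from the chromosome mean and from the autosome mean in the same direction, and the direction of the call follows the deviation from the autosome. The function name and argument names are hypothetical.

```python
def classify_segment(seg_mean, auto_mean, auto_sd, chrom_mean, chrom_sd):
    """Classify a segment from relative copy-number means and standard deviations.

    auto_mean and auto_sd are assumed to have been computed excluding
    chromosomes 7 and 10 and chromosome arm 9p, as described in the text.
    """
    d_auto = seg_mean - auto_mean
    d_chrom = seg_mean - chrom_mean
    # Criterion 1: more than two autosome standard deviations from the autosome mean.
    by_autosome = abs(d_auto) > 2 * auto_sd
    # Criterion 2: more than one chromosome standard deviation from the chromosome
    # mean, in the same direction as the deviation from the autosome mean.
    by_chromosome = abs(d_chrom) > chrom_sd and d_auto * d_chrom > 0
    if by_autosome or by_chromosome:
        return "amplified" if d_auto > 0 else "deleted"
    return "unaltered"
```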
The fractional GC content was computed for the tumor and normal bins by counting the numbers of the A, C, G, and T nucleotides in the nonoverlapping 1K-nucleotide sequences in the reference human genome hg19 that correspond to the bins. We also assessed the distribution of the relative copy numbers listed in each vector between bins of ≤50% and >50% GC content by using boxplots and computing the corresponding MWW P-values (Figure 5.14). To interpret the corresponding first and 85th row basis vectors, we assessed the subsets of patients that are of either high or low copy numbers in each vector for enrichment in any one of the experimental labels of the tumor and normal DNA samples, e.g., the TCGA GCCs or TSSs. The P-value of each enrichment was computed assuming a hypergeometric probability distribution of the $K$ labels among the $N$ patients, and of the $k \subseteq K$ labels among the $n$ patients of either high or low copy numbers, i.e.,

$$
P(k; n, N, K) = \binom{N}{n}^{-1} \sum_{i=k}^{n} \binom{K}{i} \binom{N-K}{n-i}.
\tag{5.9}
$$

In each row basis vector, we also assessed the distribution of the copy numbers between the subset of patients corresponding to each of these labels and the complementary subset by using boxplots and computing the corresponding MWW P-values (Figure 5.15). Similarly, to interpret the 82nd row basis vector, we assessed the subsets of patients that are of high or low copy numbers in the vector for enrichment in gender, i.e., females or males. We also assessed the distribution of the copy numbers between the female and male patients (Figure 5.16). To interpret the 82nd tumor and normal column basis vectors, we assessed the distribution of copy numbers between bins that map to the autosome and the X chromosome (Figures 5.17–5.19).
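The hypergeometric enrichment P-value of Equation 5.9 can be evaluated exactly with integer binomial coefficients. The sketch below is a direct transcription of the formula using the Python standard library; the function name is hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, n, N, K):
    """Probability of at least k labeled patients among n selected patients,
    when K of the N patients carry the label (upper tail of the
    hypergeometric distribution)."""
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so terms with i > K or n - i > N - K vanish automatically.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, n + 1))
    return tail / comb(N, n)
```

For example, `hypergeom_enrichment_p(2, 2, 4, 2)` is the chance that both of two selected patients carry a label held by two of four patients, i.e., 1/6.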
Classification of the WGS Astrocytoma, Affymetrix Astrocytoma, and Agilent GBM Tumor Profiles by Correlation with the Agilent GBM Pattern

Of the 212,696 probes of the Agilent Human Genome CGH 244A microarray platform that constitute the GBM pattern, 212,619 were mapped by Agilent onto the reference human genome hg19. To classify the WGS astrocytoma tumor profiles by correlation with the Agilent GBM pattern, we mapped each CGH probe to the WGS bin that contains the hg19 genomic start coordinate of the probe. When more than one probe mapped to one bin, the probe closest to the hg19 genomic start coordinate of the bin was selected, resulting in a one-to-one mapping of 211,096 probes of unique hg19 coordinates onto 211,096 nonoverlapping bins. To compare the KM analyses and Cox models of the WGS astrocytoma patients (Figures 5.8 and 5.7) to those of the Affymetrix astrocytoma, i.e., GBM and LGA, patients, we used the previously computed correlations of the Agilent GBM pattern with the minimally preprocessed TCGA raw level 2 tumor profiles of the discovery and validation sets of 59 and 74 LGA patients as well as 364 of the discovery and validation sets of 251 and 184 GBM patients, measured by the Affymetrix Genome-Wide Human SNP Array 6.0 microarray platform (Figure 5.20 and Table 5.2). To compare to the KM analyses and Cox models of the Agilent GBM patients, we used the previously computed correlations of the Agilent GBM pattern with the minimally preprocessed TCGA raw level 1 tumor profiles of 364 of the discovery and validation sets of 251 and 184 GBM patients, measured by the Agilent Human Genome CGH 244A microarray platform (Figure 5.21 and Table 5.3). We used the correlation cutoff of 0.15 as was previously established for the Agilent GBM discovery set of patients and validated for the Agilent GBM validation, and Affymetrix LGA discovery and validation sets of patients.
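The one-to-one mapping of probes onto bins described above can be sketched as follows. The sketch assumes that bins are nonoverlapping, start at multiples of the bin size on each chromosome, and that probe start coordinates use the same origin as the bins; the function name and the tuple layout of the input are hypothetical.

```python
def map_probes_to_bins(probes, bin_size=1000):
    """Map each probe to the bin containing its start coordinate, keeping,
    per bin, the probe closest to the start of the bin.

    probes: iterable of (probe_id, chromosome, start_coordinate).
    Returns a dict {(chromosome, bin_index): probe_id}.
    """
    best = {}
    for probe_id, chrom, start in probes:
        bin_index = start // bin_size
        offset = start - bin_index * bin_size  # distance from the bin start
        key = (chrom, bin_index)
        # Keep the first probe seen at the smallest offset in each bin.
        if key not in best or offset < best[key][0]:
            best[key] = (offset, probe_id)
    return {key: probe_id for key, (offset, probe_id) in best.items()}
```

The pattern values of the selected probes and the profile values of the matched bins can then be correlated and compared against the 0.15 cutoff.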
To estimate the MGMT promoter methylation status of a tumor, we used the TCGA raw level 1 of the Illumina Infinium Human Methylation 27 or 450 BeadChip-measured DNA methylation levels [51]. The IDH1 mutation status of the LGA and GBM tumors is from TCGA [5].

Figure 5.1: The GSVD of the WGS read-count profiles of patient-matched astrocytoma tumor and normal DNA. The GSVD is depicted in a raster display with relative WGS read-count, i.e., DNA copy-number amplification (red), no change (black), and deletion (green). This GSVD depiction is denoted as approximate, even though the GSVD of Equation 5.1 is exact, because only the first through the 5th and the 81st through the 85th row and corresponding tumor and normal column basis vectors and generalized singular values are explicitly shown. The angular distances of Equation 5.1 are depicted in a bar chart. The red and green contrasts for the datasets $D_i$, the dataset-specific column basis vectors $U_i$ and generalized singular values $\Sigma_i$, and the dataset-shared row basis vectors $V^T$, are c = 1, 750 and 0.0005, and 5, respectively.

Figure 5.2: Astrocytoma tumor-exclusive genotype and phenotype. The astrocytoma tumor-exclusive genotype and phenotype is captured by the second tumor column basis vectors. The similar genome-wide patterns of CNAs described by the second (a) Agilent GBM, (b) Affymetrix LGA, and (c) WGS astrocytoma tumor column basis vectors are depicted in plots of relative copy numbers, ordered and colored based upon genomic coordinates and segmented by CBS (black lines), including GBM-specific (blue), GBM- and LGA-shared (black), or WGS technology-filled in (red) CNAs. (d) The second WGS astrocytoma row basis vector is depicted in a plot showing the classification of the 85 patients into low (red) or high (blue) superposition coefficients. (e) The WGS astrocytoma tumor dataset is depicted in a raster showing the tumor-exclusive genotype-phenotype relation.
Figure 5.3: The astrocytoma tumor-exclusive genotype encodes for signaling via the canonical Notch, Ras, Shh, and hominin-specific Notch pathways. The astrocytoma tumor-exclusive genotype encodes for increased cell communication via the canonical Notch pathway in support of transformation via the Ras, Shh, and hominin-specific Notch pathways. The astrocytoma genotype is depicted in a diagram of the WGS technology-filled in Notch pathway (yellow) in addition to the microarray-described Ras and Shh pathways, which include CNAs unrecognized in GBM prior to the discovery of the GBM pattern (violet). Explicitly shown are amplifications (red) and deletions (green) of genes and transcript variants (rectangles), either GBM- and LGA-shared (black) or GBM-specific (blue), and relationships that directly or indirectly lead to increased (arrows) or decreased (bars) activities of the genes and transcripts, the tumor suppressor proteins p53, Rb, and Ptch1, and the oncoproteins Notch1, Notch2, and Notch2nl (circles).

Figure 5.4: Survival analyses of the WGS astrocytoma patients show superior performance of the Agilent GBM pattern when compared to existing indicators. The classifications of the 85 patients based upon (a) the Agilent GBM pattern and, in addition, (b) age or (c) grade, or (d) MGMT promoter methylation or (e) IDH1 mutation are depicted in KM curves highlighting median survival time differences (yellow) with the corresponding log-rank P-values and Cox hazard ratios.

Figure 5.5: Workflow for computation and interpretation of the GSVD. The GSVD invariably identifies the same genotype and phenotype as significant in and exclusive to the WGS astrocytoma tumor relative to the patient-matched normal profiles, here like in the previous GSVDs of Agilent GBM and, separately, Affymetrix LGA tumor and normal profiles. (a) Construction of the WGS astrocytoma tumor and patient-matched normal datasets.
(b) Identification of the WGS astrocytoma tumor-exclusive genotype and phenotype. (c) Blind separation from normal and experimental sources of copy-number variation. (d) Technology- and grade-independent prediction of astrocytoma survival.

Figure 5.6: The most significant row basis vectors uncovered by the GSVD of the WGS astrocytoma tumor and normal datasets. (a) The 10 largest generalized fractions of Equation 5.6 in the WGS astrocytoma tumor dataset are depicted in a bar chart, showing that the two most tumor-exclusive row basis vectors, i.e., the first and second, are also the first and second most significant in the tumor dataset and capture ≈29% and 8% of the information, respectively. The corresponding generalized normalized Shannon entropy of Equation 5.7 is 0.78. (b) The 10 largest generalized fractions in the normal dataset are depicted in a bar chart, showing that the most normal-exclusive row basis vector, i.e., the 85th, is also the most significant in the normal dataset and captures ≈23% of the information. The 82nd row basis vector, which is approximately common to both datasets, is the second and fifth most significant and captures ≈14% and 2% of the information in the normal and tumor datasets, respectively.

Figure 5.7: Survival analyses of the WGS astrocytoma patients based upon the GSVD of the WGS datasets show that the GSVD is predictive of a patient’s overall survival. (a) The classification of the 85 patients into low (red) or high (blue) superposition coefficients based upon the second most tumor-exclusive row basis vector is depicted in KM curves showing a 50-month median survival time difference (yellow). The corresponding log-rank test P-value is <$10^{-6}$. The univariate Cox proportional hazard ratio is ≈8. (b) The classification of the 85 patients based upon the correlations of their tumor profiles with the second tumor column basis vector.
Figure 5.8: Differential mRNA expression in the Ras pathway is consistent with the corresponding DNA CNAs. The differential mRNA expression of genes highlighted in the Ras pathway in the subset of patients who have high weights of the WGS astrocytoma pattern in their primary tumor DNA copy-number profiles, i.e., the patients who have the approximately 1-year survival phenotype, is depicted in boxplots with the corresponding MWW P-values. These genes consistently map to amplifications or deletions in the tumor-exclusive genotype (Figure 5.3).

Figure 5.9: Differential mRNA expression in the Shh pathway is consistent with the corresponding DNA CNAs. The differential mRNA expression of genes highlighted in the Shh pathway in the subset of patients who have high weights of the WGS astrocytoma pattern in their primary tumor DNA copy-number profiles, i.e., the patients who have the approximately 1-year survival phenotype, is depicted in boxplots with the corresponding MWW P-values. These genes consistently map to amplifications or deletions in the tumor-exclusive genotype (Figure 5.3).

Figure 5.10: Differential mRNA expression in the Notch pathway is consistent with the corresponding DNA CNAs. The differential mRNA expression of JAG1, a gene highlighted in the Notch pathway in the subset of patients who have high weights of the WGS astrocytoma pattern in their primary tumor DNA copy-number profiles, i.e., the patients who have the approximately 1-year survival phenotype, is depicted in a boxplot with the corresponding MWW P-value. This gene consistently maps to a DNA amplification in the tumor-exclusive genotype (Figure 5.3).

Figure 5.11: Differential mRNA expression outside the Ras, Shh, and Notch pathways is consistent with the corresponding DNA CNAs.
The differential mRNA expression of genes highlighted outside the signaling pathways of interest in the subset of patients who have high weights of the WGS astrocytoma pattern in their primary tumor DNA copy-number profiles, i.e., the patients who have the approximately 1-year survival phenotype, is depicted in boxplots with the corresponding MWW P-values. These genes consistently map to DNA amplifications in the tumor-exclusive genotype (Figure 5.3).

Figure 5.12: The first, most tumor-exclusive row basis vector and corresponding tumor column basis vector capture an experimental artifact. (a) The first tumor column basis vector is depicted in a plot of relative copy numbers, ordered and colored by their genomic coordinates and segmented by CBS (black lines), roughly describing frequent spikes of reduced copy numbers superimposed on an invariant baseline. The correlation of the vector with the fractional GC content across the tumor bins is 0.78. (b) The corresponding first, most tumor-exclusive row basis vector is depicted in a plot showing an enrichment of the BI (08) GCC (red) relative to the other centers (blue) among the 35 patients with high superposition coefficients of the first tumor column basis vector in their tumor profiles. The corresponding hypergeometric P-value of Equation 5.9 is <10⁻⁷. (c) The WGS astrocytoma tumor dataset is depicted in a raster, with relative WGS read-count, i.e., DNA copy-number amplification (red), no change (black), and deletion (green), showing the GC content variation and its correlation with the experimental batches.

Figure 5.13: The 85th, most normal-exclusive row basis vector and corresponding normal column basis vector capture an experimental artifact. (a) The 85th normal column basis vector is depicted in a plot of copy numbers, ordered and colored by their genomic coordinates and segmented by CBS (black lines), roughly describing frequent spikes of reduced copy numbers superimposed on an invariant baseline.
The correlation of the vector with the fractional GC content across the normal bins is 0.91. (b) The corresponding 85th, most normal-exclusive row basis vector is depicted in a plot showing an enrichment of the TJU (CS) TSS (red) relative to the other sites (blue) among the 13 patients with high superposition coefficients of the 85th column basis vector in their normal profiles. The corresponding hypergeometric P-value is <10⁻². (c) The WGS normal dataset is depicted in a raster showing the GC content variation and its correlation with the experimental batches.

Figure 5.14: The first tumor and 85th normal column basis vectors are correlated with the fractional GC content across the tumor and normal genomes. The distributions of the copy numbers listed in the (a) first tumor and (b) 85th normal column basis vectors between tumor and normal bins, respectively, of >50% and ≤50% GC content are depicted in boxplots with the corresponding MWW P-values.

Figure 5.15: The first and 85th row basis vectors are correlated with experimental batches. The distributions of the copy numbers listed in the (a) first and (b) 85th row basis vectors between GCCs and TSSs, respectively, are depicted in boxplots with the corresponding MWW P-values.

Figure 5.16: The 82nd row basis vector is correlated with gender. The distribution of the copy numbers listed in the 82nd row basis vector between females and males is depicted in a boxplot with the corresponding MWW P-value.

Figure 5.17: The 82nd row basis vector and corresponding tumor column basis vector capture gender-associated X chromosome variation in the tumor dataset. (a) The 82nd tumor column basis vector is depicted in a plot of copy numbers describing a deletion of the X chromosome relative to the autosome across the tumor bins.
(b) The corresponding 82nd row basis vector is depicted in a plot showing an enrichment of the males (blue) relative to the females (red) among the 50 patients with high superposition coefficients of the 82nd tumor column basis vector in their tumor profiles. The corresponding hypergeometric P-value is <10⁻¹⁹. (c) The WGS astrocytoma tumor dataset is depicted in a raster showing the normal male-specific X chromosome deletion conserved in the tumors.

Figure 5.18: The 82nd row basis vector and corresponding normal column basis vector capture gender-associated X chromosome variation in the normal dataset. (a) The 82nd normal column basis vector is depicted in a plot of copy numbers describing a deletion of the X chromosome across the normal bins. (b) The corresponding 82nd row basis vector is depicted in a plot. (c) The WGS normal dataset is depicted in a raster showing the normal male-specific X chromosome deletion.

Figure 5.19: The 82nd tumor and normal column basis vectors are correlated with a deletion of the X chromosome relative to the autosome across the tumor and normal genomes. The distributions of the copy numbers listed in the 82nd (a) tumor and (b) normal column basis vectors between tumor and normal bins, respectively, which map to the autosome and the X chromosome, are depicted in boxplots with the corresponding MWW P-values.

Figure 5.20: Survival analyses of the chemotherapy- and radiation-treated WGS astrocytoma patients with MGMT promoter methylation and IDH1 mutation test results show that the GBM pattern is a better predictor of overall survival in the patient-matched cohorts. The classifications of the chemotherapy-treated patients based upon (a) the Agilent GBM pattern, (b) MGMT promoter methylation, or (c) IDH1 mutation, and the classifications of the radiation-treated patients (d)–(f), are depicted in KM curves highlighting median survival time differences (yellow) with the corresponding log-rank P-values and Cox hazard ratios.
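The hypergeometric P-values quoted in the captions of Figures 5.12, 5.13, and 5.17 measure the enrichment of one category of patients (a sequencing center, a tissue source site, or males) among the patients with high superposition coefficients. Equation 5.9 is not reproduced in this section, so the sketch below assumes the standard upper-tail form of the hypergeometric test. The function name and the counts in the usage lines are hypothetical and are not taken from the study.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, category, drawn, observed):
    """Upper-tail hypergeometric probability P(X >= observed).

    population: total number of patients
    category:   patients in the population that belong to the category
    drawn:      patients in the selected subset (e.g., high coefficients)
    observed:   patients in the subset that belong to the category
    """
    upper = min(category, drawn)
    total = comb(population, drawn)
    # math.comb returns 0 when the lower index exceeds the upper index,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(
        comb(category, i) * comb(population - category, drawn - i)
        for i in range(observed, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20 patients, 8 in the category, 6 selected, 5 of them
# in the category.
toy_pvalue = hypergeom_enrichment_pvalue(20, 8, 6, 5)
```

A small P-value indicates that the category is overrepresented in the selected subset relative to what random selection would produce.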
Figure 5.21: Survival analyses of the Affymetrix astrocytoma patients show that the GBM pattern is independent of existing indicators in the Affymetrix patient cohort. The classifications of the 497 patients based upon (a) the Agilent GBM pattern and, in addition, (b) age, (c) grade, (d) MGMT promoter methylation, or (e) IDH1 mutation are depicted in KM curves highlighting median survival time differences (yellow) with the corresponding log-rank P-values and Cox hazard ratios.

Figure 5.22: Survival analyses of the Agilent GBM patients show that the GBM pattern is independent of existing indicators in the Agilent patient cohort. The classifications of the 364 patients based upon (a) the Agilent GBM pattern and, in addition, (b) age, (c) MGMT promoter methylation, or (d) IDH1 mutation are depicted in KM curves highlighting median survival time differences (yellow) with the corresponding log-rank P-values and Cox hazard ratios.

Table 5.1: Cox proportional hazards models of the WGS astrocytoma patients

Model       Patients  Predictor                 Hazard Ratio  95% Confidence Interval  P-value    Concordance Index
Univariate  85        Agilent GBM (Corr.)       5.3           2.4–11.9                 4.3×10⁻⁵   0.95
                      WGS Astrocytoma (Coeff.)  8.1           3.1–21.3                 2.0×10⁻⁵   0.87
                      WGS Astrocytoma (Corr.)   8.1           3.1–21.3                 2.1×10⁻⁵   0.89
                      Age                       4.2           1.9–9.2                  3.6×10⁻⁴   0.83
                      Grade                     3.5           1.8–6.8                  2.1×10⁻⁴   0.75
            57        Agilent GBM (Corr.)       14.9          3.4–64.3                 3.1×10⁻⁴   0.96
                      MGMT Methylation          3.6           1.6–8.0                  2.2×10⁻³   0.79
            75        Agilent GBM (Corr.)       10.8          3.2–36.1                 1.2×10⁻⁴   0.95
                      IDH1 Mutation             8.9           3.0–26.1                 6.7×10⁻⁵   0.95
Bivariate   85        Agilent GBM (Corr.)       3.9           1.6–9.4                  2.3×10⁻³   0.86
                      Age                       2.7           1.1–6.4                  3.1×10⁻²
                      Agilent GBM (Corr.)       3.1           1.3–7.6                  1.3×10⁻²   0.80
                      Grade                     2.3           1.1–4.8                  2.8×10⁻²

Table 5.2: Cox proportional hazards models of the Affymetrix astrocytoma patients

Model       Patients  Predictor                 Hazard Ratio  95% Confidence Interval  P-value    Concordance Index
Univariate  497       Agilent GBM (Corr.)       4.1           3.0–5.8                  2.9×10⁻¹⁷  0.85
                      Age                       2.8           2.2–3.6                  1.4×10⁻¹⁵  0.80
                      Grade                     2.8           2.0–3.7                  1.4×10⁻¹⁰  0.82
            388       Agilent GBM (Corr.)       4.8           3.3–7.0                  3.5×10⁻¹⁶  0.86
                      MGMT Methylation          2.1           1.6–2.7                  8.6×10⁻⁸   0.66
            403       Agilent GBM (Corr.)       3.6           2.5–5.1                  1.9×10⁻¹²  0.84
                      IDH1 Mutation             3.9           2.6–5.8                  5.2×10⁻¹¹  0.87
Bivariate   497       Agilent GBM (Corr.)       3.1           2.2–4.4                  5.7×10⁻¹⁰  0.79
                      Age                       1.9           1.4–2.5                  7.5×10⁻⁶
                      Agilent GBM (Corr.)       3.0           2.1–4.3                  4.2×10⁻⁹   0.80
                      Grade                     1.7           1.2–2.4                  1.5×10⁻³
            388       Agilent GBM (Corr.)       4.3           2.9–6.3                  2.4×10⁻¹³  0.73
                      MGMT Methylation          1.4           1.1–2.0                  1.0×10⁻²
            403       Agilent GBM (Corr.)       2.2           1.3–3.5                  2.0×10⁻³   0.83
                      IDH1 Mutation             2.1           1.2–3.7                  8.0×10⁻³

Table 5.3: Cox proportional hazards models of the Agilent GBM patients

Model       Patients  Predictor                 Hazard Ratio  95% Confidence Interval  P-value    Concordance Index
Univariate  364       Agilent GBM (Corr.)       2.5           1.6–4.0                  4.1×10⁻⁵   0.76
                      Age                       2.1           1.6–2.7                  3.4×10⁻⁷   0.72
            255       Agilent GBM (Corr.)       2.5           1.5–4.2                  5.0×10⁻⁴   0.73
                      MGMT Methylation          1.5           1.1–2.0                  5.0×10⁻³   0.58
            313       Agilent GBM (Corr.)       2.4           1.5–3.8                  1.7×10⁻⁴   0.77
                      IDH1 Mutation             2.4           1.5–4.1                  7.3×10⁻⁴   0.77
Bivariate   364       Agilent GBM (Corr.)       2.0           1.3–3.2                  3.2×10⁻³   0.71
                      Age                       1.8           1.3–2.4                  6.5×10⁻⁵
            255       Agilent GBM (Corr.)       2.4           1.4–4.0                  1.1×10⁻³   0.61
                      MGMT Methylation          1.4           1.1–1.9                  1.7×10⁻²
            313       Agilent GBM (Corr.)       1.9           1.2–3.1                  1.1×10⁻²   0.76
                      IDH1 Mutation             1.8           1.0–3.1                  4.8×10⁻²

References

[1] T. Boveri, “Concerning the origin of malignant tumours by Theodor Boveri. Translated and annotated by Henry Harris,” Journal of Cell Science, vol. 121, no. Supplement 1, pp. 1–84, 2008. [2] R. G. Weber, et al., “Characterization of genomic alterations associated with glioma progression by comparative genomic hybridization,” Oncogene, vol. 13, no. 5, pp. 983–994, 1996. [3] S. Ellsworth, et al., “Clinical, radiographic, and pathologic findings in patients undergoing reoperation following radiation therapy and temozolomide for newly diagnosed glioblastoma,” American Journal of Clinical Oncology, vol.
40, no. 3, p. 219, 2017. [4] Cancer Genome Atlas Research Network, “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, no. 7216, pp. 1061–1068, 2008. [5] Cancer Genome Atlas Research Network, et al., “Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas,” New England Journal of Medicine, vol. 372, no. 26, pp. 2481–2498, 2015. [6] C. H. Lee, et al., “GSVD comparison of patient-matched normal and tumor aCGH profiles reveals global copy-number alterations predicting glioblastoma multiforme survival,” PLoS One, vol. 7, no. 1, p. e30098, 2012. [7] K. A. Aiello and O. Alter, “Platform-independent genome-wide pattern of DNA copy-number alterations predicting astrocytoma survival and response to treatment revealed by the GSVD formulated as a comparative spectral decomposition,” PLoS One, vol. 11, no. 10, p. e0164546, 2016. [8] Q. T. Ostrom, et al., “CBTRUS statistical report: primary brain and other central nervous system tumors diagnosed in the United States in 2010–2014,” Neuro-Oncology, vol. 19, no. suppl 5, pp. v1–v88, 2017. [9] K. A. Aiello, et al., “Mathematically universal and biologically consistent astrocytoma genotype encodes for transformation and predicts survival phenotype,” in 2018 AACR Annual Meeting, American Association for Cancer Research, 2018. [10] G. Klambauer, et al., “cn.MOPS: Mixture of poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate,” Nucleic Acids Research, vol. 40, no. 9, pp. e69–e69, 2012. [11] D. Karolchik, et al., “The UCSC genome browser database: 2014 update,” Nucleic Acids Research, vol. 42, no. D1, pp. D764–D770, 2014. [12] R. R. Haraksingh, et al., “Comprehensive performance comparison of high-resolution array platforms for genome-wide copy number variation (CNV) analysis in humans,” BMC Genomics, vol. 18, no. 1, p. 321, 2017. [13] R. Shen and V. E.
Seshan, “FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing,” Nucleic Acids Research, vol. 44, no. 16, pp. e131–e131, 2016. [14] R. J. Roberts, et al., “The advantages of SMRT sequencing,” Genome Biology, vol. 14, no. 6, p. 405, 2013. [15] D. Pinto, et al., “Comprehensive assessment of array-based platforms and calling algorithms for detection of copy number variants,” Nature Biotechnology, vol. 29, no. 6, p. 512, 2011. [16] C. F. Van Loan, “Generalizing the singular value decomposition,” SIAM Journal on Numerical Analysis, vol. 13, no. 1, pp. 76–83, 1976. [17] C. C. Paige and M. A. Saunders, “Towards a generalized singular value decomposition,” SIAM Journal on Numerical Analysis, vol. 18, no. 3, pp. 398–405, 1981. [18] S. Friedland, “A new approach to generalized singular value decomposition,” SIAM Journal on Matrix Analysis and Applications, vol. 27, no. 2, pp. 434–444, 2005. [19] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 2012. [20] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3. JHU Press, 2012. [21] H. Goldstein, Classical Mechanics, vol. 2. Addison-Wesley, 1980. [22] O. Alter, et al., “Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms,” Proceedings of the National Academy of Sciences, vol. 100, no. 6, pp. 3351–3356, 2003. [23] O. Alter and G. H. Golub, “Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 47, pp. 16577–16582, 2004. [24] S. P. Ponnapalli, et al., “A novel higher-order generalized singular value decomposition for comparative analysis of multiple genome-scale datasets,” in Workshop on Algorithms for Modern Massive Datasets (MMDS), Stanford University and Yahoo! 
Research, 2006. [25] S. P. Ponnapalli, et al., “A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms,” PLoS One, vol. 6, no. 12, p. e28072, 2011. [26] P. Sankaranarayanan, et al., “Tensor GSVD of patient- and platform-matched tumor and normal DNA copy-number profiles uncovers chromosome arm-wide patterns of tumor-exclusive platform-consistent alterations encoding for cell transformation and predicting ovarian cancer survival,” PLoS One, vol. 10, no. 4, p. e0121396, 2015. [27] K. A. Aiello, et al., “Patterns of DNA copy-number alterations revealed by the GSVD and tensor GSVD encode for cell transformation and predict survival and response to platinum in adenocarcinomas,” in 2018 AACR Annual Meeting, American Association for Cancer Research, 2018. [28] L. N. Trefethen and D. Bau, Numerical Linear Algebra. Society for Industrial and Applied Mathematics, 1997. [29] A. Edelman, et al., “The geometry of algorithms with orthogonality constraints,” SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998. [30] L. M. Ewerbring and F. T. Luk, “Canonical correlations and generalized SVD: applications and new algorithms,” Journal of Computational and Applied Mathematics, vol. 27, no. 1-2, pp. 37–52, 1989. [31] A. B. Olshen, et al., “Circular binary segmentation for the analysis of array-based DNA copy number data,” Biostatistics, vol. 5, no. 4, pp. 557–572, 2004. [32] L. A. Lettice, et al., “Disruption of a long-range cis-acting regulator for Shh causes preaxial polydactyly,” Proceedings of the National Academy of Sciences, vol. 99, no. 11, pp. 7548–7553, 2002. [33] B. W. Purow, et al., “Expression of Notch-1 and its ligands, Delta-like-1 and Jagged-1, is critical for glioma cell survival and proliferation,” Cancer Research, vol. 65, no. 6, pp. 2353–2363, 2005. [34] W. Sun, et al., “Activation of the NOTCH pathway in head and neck cancer,” Cancer Research, vol. 74, no. 4, pp.
1091–1104, 2014. [35] G. Laureys, et al., “Constitutional translocation t(1;17)(p36.31-p36.13;q11.2-q12.1) in a neuroblastoma patient. Establishment of somatic cell hybrids and identification of PND/A12M2 on chromosome 1 and NF1/SCYA7 on chromosome 17 breakpoint flanking single copy markers,” Oncogene, vol. 10, no. 6, pp. 1087–1094, 1995. [36] M. O’Bleness, et al., “Finished sequence and assembly of the DUF1220-rich 1q21 region using a haploid human genome,” BMC Genomics, vol. 15, no. 1, p. 387, 2014. [37] S. Weijzen, et al., “Activation of Notch-1 signaling maintains the neoplastic phenotype in human Ras-transformed cells,” Nature Medicine, vol. 8, no. 9, p. 979, 2002. [38] W. C. Hahn, et al., “Creation of human tumour cells with defined genetic elements,” Nature, vol. 400, no. 6743, pp. 464–468, 1999. [39] H. Kiaris, et al., “Modulation of Notch signaling elicits signature tumors and inhibits HRAS1-induced oncogenesis in the mouse mammary epithelium,” The American Journal of Pathology, vol. 165, no. 2, pp. 695–705, 2004. [40] M. E. Carlson, et al., “Imbalance between pSmad3 and Notch induces CDK inhibitors in old muscle stem cells,” Nature, vol. 454, no. 7203, p. 528, 2008. [41] T. Waldman, et al., “Uncoupling of S phase and mitosis induced by anticancer agents in cells lacking p21,” Nature, vol. 381, no. 6584, p. 713, 1996. [42] J. Irianto, et al., “DNA damage follows repair factor depletion and portends genome variation in cancer cells after pore migration,” Current Biology, vol. 27, no. 2, pp. 210–223, 2017. [43] J. H. Kong, et al., “Notch activity modulates the responsiveness of neural progenitors to sonic hedgehog signaling,” Developmental Cell, vol. 33, no. 4, pp. 373–387, 2015. [44] R. Rohatgi and M. P. Scott, “Patching the gaps in Hedgehog signalling,” Nature Cell Biology, vol. 9, no. 9, pp. 1005–1009, 2007. [45] E. Y.
Lee, et al., “Hedgehog pathway-regulated gene networks in cerebellum development and tumorigenesis,” Proceedings of the National Academy of Sciences, 2010. [46] I. T. Fiddes, et al., “Human-specific NOTCH2NL genes affect Notch signaling and cortical neurogenesis,” Cell, vol. 173, no. 6, pp. 1356–1369, 2018. [47] M. C. Popesco, et al., “Human lineage–specific amplification, selection, and neuronal expression of DUF1220 domains,” Science, vol. 313, no. 5791, pp. 1304–1307, 2006. [48] U. Fischer, et al., “Twelve amplified and expressed genes localized in a single domain in glioma,” Human Genetics, vol. 98, no. 5, pp. 625–628, 1996. [49] M. G. Netsky, et al., “The longevity of patients with glioblastoma multiforme,” Journal of Neurosurgery, vol. 7, no. 3, pp. 261–269, 1950. [50] P. Bady, et al., “MGMT methylation analysis of glioblastoma on the Infinium methylation BeadChip identifies two distinct CpG regions associated with gene silencing and outcome, yielding a prediction model for comparisons across datasets, tumor grades, and CIMP-status,” Acta Neuropathologica, vol. 124, no. 4, pp. 547–560, 2012. [51] C. W. Brennan, et al., “The somatic genomic landscape of glioblastoma,” Cell, vol. 155, no. 2, pp. 462–477, 2013. [52] P. Howland and H. Park, “Generalizing discriminant analysis using the generalized singular value decomposition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 8, pp. 995–1006, 2004. [53] J. A. Berger, et al., “Jointly analyzing gene expression and copy number data in breast cancer using data reduction models,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 3, no. 1, p. 2, 2006. [54] W. De Clercq, et al., “Canonical correlation analysis applied to remove muscle artifacts from the electroencephalogram,” IEEE Transactions on Biomedical Engineering, vol. 53, no. 12, pp. 2583–2587, 2006. [55] A. E.
Teschendorff, et al., “Elucidating the altered transcriptional programs in breast cancer using independent component analysis,” PLoS Computational Biology, vol. 3, no. 8, p. e161, 2007. [56] A. W. Schreiber, et al., “Combining transcriptional datasets using the generalized singular value decomposition,” BMC Bioinformatics, vol. 9, no. 1, p. 335, 2008. [57] I. Rustandi, et al., “Integrating multiple-study multiple-subject fMRI datasets using canonical correlation analysis,” in Proceedings of the MICCAI 2009 Workshop: Statistical modeling and detection issues in intra-and inter-subject functional MRI data analysis, vol. 1, p. 4, 2009. [58] X. Xiao, et al., “Exploring metabolic pathway disruption in the subchronic phencyclidine model of schizophrenia with the generalized singular value decomposition,” BMC Systems Biology, vol. 5, no. 1, p. 72, 2011. [59] O. A. Tomescu, et al., “Integrative omics analysis. a study based on Plasmodium falciparum mRNA and protein data,” BMC Systems Biology, vol. 8, no. 2, p. S4, 2014. [60] X. Xiao, et al., “Multi-tissue analysis of co-expression networks by higher-order generalized singular value decomposition identifies functionally coherent transcriptional modules,” PLoS Genetics, vol. 10, no. 1, p. e1004006, 2014. [61] Y. Levin-Schwartz, et al., “Data-driven fusion of EEG, functional and structural MRI: a comparison of two models,” in 2014 48th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6, IEEE, 2014. [62] X. Chen, et al., “Joint blind source separation for neurophysiological data analysis: multiset and multimodal methods,” IEEE Signal Processing Magazine, vol. 33, no. 3, pp. 86–107, 2016. [63] Y. Wang, et al., “Matrix factorization reveals aging-specific co-expression gene modules in the fat and muscle tissues in nonhuman primates,” Scientific Reports, vol. 6, p. 34335, 2016. [64] Z. 
Chitforoushzadeh, et al., “TNF-insulin crosstalk at the transcription factor GATA6 is revealed by a model that links signaling and transcriptomic data tensors,” Science Signaling, vol. 9, no. 431, pp. ra59–ra59, 2016. [65] R. Redon, et al., “Global variation in copy number in the human genome,” Nature, vol. 444, no. 7118, p. 444, 2006. [66] J. R. Lupski, “Genomic rearrangements and sporadic disease,” Nature Genetics, vol. 39, p. S43, 2007. [67] S. J. Diskin, et al., “Copy number variation at 1q21.1 associated with neuroblastoma,” Nature, vol. 459, no. 7249, p. 987, 2009. [68] E. Vanneste, et al., “Chromosome instability is common in human cleavage-stage embryos,” Nature Medicine, vol. 15, no. 5, p. 577, 2009. [69] U. Fischer, et al., “Genome-wide gene amplification during differentiation of neural progenitor cells in vitro,” PLoS One, vol. 7, no. 5, p. e37422, 2012.

CHAPTER 6

CONCLUSIONS AND FUTURE WORK

As new genomic profiling technologies become available, computational methods that are based on protocol-specific assumptions become outdated. Mathematical frameworks that are universal, however, can simultaneously identify consistent underlying biology and changing sources of experimental and biological variation, without a priori knowledge, and can therefore create a single coherent model from genomic profiles.

Prior to this work, a genome-wide pattern of DNA copy-number aberrations (CNAs) characterizing an aggressive subtype of glioblastoma (GBM) tumors was revealed by the generalized singular value decomposition (GSVD) of microarray-measured copy number profiles [1]. The work presented in this dissertation builds upon those results by using the GSVD to study lower-grade astrocytoma (LGA) patients’ copy number profiles, enabling prognosis of the LGA patients and genomic comparison of the lower- and high-grade tumors.
This work also revalidates that the GBM pattern captures fundamental tumor biology that can be used to classify astrocytoma patients in a technologically robust, platform-independent manner for tumor genomes measured on other microarray platforms and next-generation sequencing technologies. This work makes a significant contribution to the state of astrocytoma prognostication by demonstrating that the pattern is a platform- and technology-independent prognostic predictor of patient survival and response to treatment in the general astrocytoma population, better than and independent of existing indicators.

Prognosis of Lower-Grade Astrocytoma

In Chapter 3, the generalized singular value decomposition (GSVD) [2] formulated as a comparative spectral decomposition [3] was shown to be an effective method for modeling patient-matched tumor and normal DNA copy number profiles from lower-grade astrocytoma (LGA) patients. We showed, and computationally validated, that the tumor-exclusive LGA pattern uncovered by the GSVD is correlated with an LGA patient’s outcome. The GSVD separated this pattern from other sources of experimental and biological variation, common to the tumor and normal profiles, or exclusive to the tumor or the normal profiles, without a priori knowledge of these variations. These variations included the male-specific X chromosome deletion that is common to both datasets, and a tumor-exclusive experimental batch effect that is associated with the tumor samples hybridized on a particular 96-well plate.

The study revealed a genome-wide pattern of tumor-exclusive DNA copy number alterations (CNAs) that captures features of the genomic dysregulation that characterizes the tumorigenesis and progression of this diffuse, low-grade brain tumor. We found that the pattern revealed by the GSVD was encompassed in the previously identified GBM pattern, meaning that all of the global and focal CNAs in the new LGA pattern were also found in GBM.
Additional GBM-specific CNAs encode for enhanced opportunities for transformation and proliferation via growth and developmental signaling pathways in GBM relative to LGA. Dysregulation of the same developmental signaling pathway, the Hedgehog pathway, is also shared by the brain cancer medulloblastoma, where it was shown to contribute to tumor development [4, 5], but it had not been conclusively connected to GBM.

The LGA datasets had been publicly available in TCGA since 2015 and had been analyzed by using several methods. The pattern, however, remained unknown until the datasets were modeled by using the GSVD. This illustrates the ability of comparative spectral decompositions in general, and the GSVD in particular, to find what other methods miss [6].

A natural direction for future research is toward further characterization of the cell phenotypes described by the LGA and GBM patterns. It was previously reported that among the TCGA GBM samples, ∼76% of the genes within recurrent CNAs have expression patterns that reflect the copy number changes [7]. The mRNA and protein expression levels of individual genes aberrated in the LGA and GBM patterns could be quantified to confirm that the involvement of the Ras and Shh signaling pathways captured by the copy number patterns carries through to the mRNA and protein levels.

Microarray Platform-Independence of GBM Pattern

In Chapter 4, the GBM pattern was used to identify among the LGA patients a copy number subtype, statistically indistinguishable from that among the GBM patients, where the CNA genotype is correlated with an approximately 1-year survival phenotype. We also found that cross-platform classification of Affymetrix-measured LGA and GBM profiles by using the Agilent-derived GBM pattern demonstrates that the GBM pattern is a platform-independent predictor of astrocytoma outcome.
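The classification step can be illustrated with a minimal sketch. It assumes, as the "(Corr.)" labels in the tables of Chapter 5 suggest, that a tumor profile is classified by its Pearson correlation with the pattern, and that the profiles have already been mapped to the same genomic bins as the pattern. The function name, the toy data, and the cutoff value are hypothetical.

```python
import numpy as np

def classify_by_pattern(profiles, pattern, cutoff):
    """Correlate each tumor profile with a copy-number pattern.

    profiles: array of shape (bins, patients), one column per tumor profile
    pattern:  array of shape (bins,), defined on the same genomic bins
    cutoff:   hypothetical correlation cutoff separating the two classes
    Returns the Pearson correlations and a boolean high-correlation mask.
    """
    profiles = np.asarray(profiles, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    centered = profiles - profiles.mean(axis=0)   # center each profile
    p = pattern - pattern.mean()                  # center the pattern
    norms = np.linalg.norm(centered, axis=0) * np.linalg.norm(p)
    correlations = centered.T @ p / norms
    return correlations, correlations > cutoff

# Toy example: one profile follows the pattern, one is its mirror image.
toy_pattern = np.array([1.0, -1.0, 2.0, 0.0, -2.0])
toy_profiles = np.column_stack([2.0 * toy_pattern + 3.0, -toy_pattern])
toy_corr, toy_high = classify_by_pattern(toy_profiles, toy_pattern, cutoff=0.15)
```

Because the Pearson correlation is invariant to the scale and offset of a profile, a classifier of this kind does not depend on platform-specific units, which is one reason a correlation-based rule is a natural candidate for cross-platform use.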
Statistically, the pattern is a better predictor, corresponding to greater median survival time difference, proportional hazard ratio, and concordance index, than the patient’s age and the tumor’s grade, which are the best indicators of astrocytoma currently in clinical use [8, 9, 10]. The pattern is also statistically better than existing laboratory tests for MGMT promoter methylation [11] and IDH1 mutation [12]. It is statistically independent of these indicators and, combined with either one, is an even better predictor of astrocytoma outcome.

We also showed that the classification of a sample using the GBM pattern is independent of the histological and genomic intratumor heterogeneity on the scale of the TCGA fresh frozen tissue specimens. The classification is consistent across both biological and technical replicates, making it a robust indicator. However, smaller formalin-fixed paraffin-embedded (FFPE) tissue samples obtained from microdissected slides are more susceptible to the effects of intratumor heterogeneity and do not meet this requirement for tumor classification.

Additional work is needed to advance the GBM pattern to the clinic, where it can be implemented as a laboratory test to improve the standard of care for astrocytoma patients. Based on the conclusions of this study, further experimental revalidation of the GBM pattern should be pursued in fresh frozen tissue samples of size and quality similar to the TCGA biospecimens, rather than FFPE slides. This will enable repeatable classification of a tumor sample, leading to more reliable prediction of a patient’s outcome. If tumor blocks of sufficient size are available, limitations due to intratumor heterogeneity associated with small FFPE samples may be overcome by extracting DNA from larger FFPE ribbons, which are more likely to capture a histologically and genomically representative portion of the tumor than microdissected tumor slides [13].
If tumors can be reliably classified by the GBM pattern measured from samples of this size, FFPE ribbons may be deemed suitable for further experimental validation.

Measurement Technology-Independence of GBM Pattern

In Chapter 5, the GSVD was used to reveal a genome-wide pattern of CNAs across the nearly continuous genomic regions of Illumina-measured whole genome sequencing (WGS) read depth profiles in a combined cohort of GBM and LGA patients. The pattern itself is a statistically significant predictor of an astrocytoma patient’s survival and response to treatment. The GSVD separates the pattern from other sources of variation in the datasets, including experimental artifacts associated with GC content bias that are mathematically dominant in both the tumor and normal datasets, without a priori knowledge of these variations. This study demonstrated that the GSVD is a powerful mathematical framework that is capable of accurately uncovering fundamental biological patterns from genomic datasets, regardless of the measurement technology, without a priori knowledge of the experiment- or measurement-specific variations that affect the given technology.

The Agilent microarray-derived GBM pattern was also used to successfully classify WGS-measured genomic profiles. The pattern is a cross-technology prognostic indicator among the GBM patients, and separately among the LGA patients, as well as in the combined astrocytoma population. In the combined cohort, the GBM pattern is a statistically better prognostic predictor than age, grade, MGMT promoter methylation, and IDH1 mutation. It is independent of a patient’s age at diagnosis and a tumor’s grade, and combined with either indicator makes a better prognostic predictor than the indicator alone.
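The way the GSVD sorts patterns into tumor-exclusive, normal-exclusive, and common can be illustrated with a small numerical sketch. This is not the implementation used in this work: it builds the GSVD from a QR factorization followed by an SVD, and it assumes that each matrix has at least as many rows as columns, that the stacked matrix has full column rank, and that no generalized singular value is exactly zero. The angular distance arctan(σ1/σ2) − π/4 follows a convention from the comparative GSVD literature and is an assumption here, as are the function names and the random toy matrices.

```python
import numpy as np

def gsvd(d1, d2):
    """Sketch of the GSVD of two column-matched matrices.

    Returns u1, s1, u2, s2, vt such that
        d1 = u1 @ diag(s1) @ vt   and   d2 = u2 @ diag(s2) @ vt,
    with orthonormal columns in u1 and u2 and s1**2 + s2**2 = 1.
    """
    n1 = d1.shape[0]
    q, r = np.linalg.qr(np.vstack([d1, d2]))     # reduced QR of the stack
    q1, q2 = q[:n1], q[n1:]
    u1, s1, wt = np.linalg.svd(q1, full_matrices=False)
    b = q2 @ wt.T                                # columns are orthogonal
    s2 = np.linalg.norm(b, axis=0)               # equals sqrt(1 - s1**2)
    u2 = b / s2
    vt = wt @ r                                  # shared row basis vectors
    return u1, s1, u2, s2, vt

def angular_distances(s1, s2):
    """Near +pi/4: exclusive to d1; near -pi/4: exclusive to d2; near 0: common."""
    return np.arctan2(s1, s2) - np.pi / 4

# Toy matrices standing in for tumor and normal datasets (bins x patients);
# the two datasets may have different numbers of rows.
rng = np.random.default_rng(0)
toy_tumor = rng.standard_normal((200, 8))
toy_normal = rng.standard_normal((150, 8))
u1, s1, u2, s2, vt = gsvd(toy_tumor, toy_normal)
theta = angular_distances(s1, s2)
```

With this ordering the first row basis vector is the most exclusive to the first matrix and the last is the most exclusive to the second, which mirrors the numbering in the captions of Chapter 5, where the first vector is the most tumor-exclusive and the 85th the most normal-exclusive.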
This study demonstrated that the GBM pattern is a technically robust genomic signature that captures the underlying features of genomic dysregulation that characterize the clinically aggressive GBM subtype.

A primary limitation of this study was the relatively small number of patient-matched samples that were sequenced in the TCGA cohort. With only 85 patients in a cohort comprising both GBM and LGA tumors, the GSVD has limited power to separate the patterns of variation into their orthogonal bases. This is a particularly significant challenge in NGS-measured datasets, which have very large row dimensions corresponding to the nearly continuous measurement of the genome across almost three million genomic regions. While the high resolution of this measurement technology has the potential to reveal intricate details of the genomic features that characterize astrocytoma, it also yields an extremely large, heterogeneous dataset. Increasing the number of WGS profiles available from future studies will enable the computation of a larger GSVD, leading to better spectral separation of artifacts and cleaner patterns that capture the tumor biology.

Future Directions with Higher-Order and Higher-Dimensional Datasets

There are several opportunities to utilize the higher-order and higher-dimensional comparative spectral decompositions. By comparing and integrating data from multiple measurement platforms or varying data types, these frameworks can build a more consistent and complete model of the underlying system. Discoveries from such models would have a far-reaching impact on both the clinical and cancer biology research communities.

Tensor GSVD for Platform-Matched Datasets

In Chapter 4, we demonstrated that the GBM pattern is a platform-independent prognostic predictor by classifying microarray profiles measured on a different platform from the one from which the pattern was derived.
However, it is possible to leverage higher-order mathematical frameworks to incorporate platform consistency into the pattern-finding method. Extending the structure of the datasets previously described, if data from two or more platforms are available, they can be mapped to comparable genomic coordinates such that the profiles on each platform are matched in the row dimension. The patient- and probe-matched profiles are then naturally structured as a tensor, and the tensor GSVD can be used to decompose the tumor and normal tensors, which may be row-independent (Figure 6.1) [14]. Given data of this structure, the platform-consistent patterns are captured mathematically in the patterns with consistent values in the basis vector across the platform dimension. In this framework, the variation that is specific to a single microarray platform, such as platform-specific bias or batch effects, is spectrally separated into patterns that have a high value for one platform in the basis vector, and low value(s) for the other platform(s). Patterns that are mathematically selected for consistency across multiple microarray platforms are more likely to be platform-independent when tested experimentally on a different microarray platform or measurement technology.

Coordinated Multi-Omic Signatures

Higher-order comparative spectral decompositions can also be used to compare and integrate different types of patient-matched molecular biological datasets, such as DNA copy-number, DNA methylation, mRNA expression, and protein expression. By integrating these different measurements of the same complex biological system, comparative spectral decompositions can be used to identify mathematically coordinated patterns between the data types. For example, the epigenetic changes that lead to differences in tumor cell phenotype have recently emerged as an active area of research.
Higher-order comparative spectral decompositions such as the higher-order GSVD (HOGSVD) [15, 16], which enable the comparison of more than two column-matched matrices, are ideally suited to solve this problem. The epigenetic changes, such as DNA methylation, are made to the structure of the DNA and help regulate mRNA and protein expression, resulting in the emergence of a particular cell phenotype. This open research question could be investigated by using the HOGSVD to compare patient-matched profiles across many different types of data, including DNA copy-number, DNA methylation, mRNA expression, and protein expression profiles. A simultaneous decomposition of this structure would uncover coordinated, multi-omic patterns, where each probelet, or pattern of variation across the patients, would have a corresponding signature across each of the genomic data types. These coordinated patterns may reveal high-dimensional relationships within the data that cannot be captured by any single data type. Furthermore, the coordinated patterns may provide novel insights into causal relationships or mechanisms of genomic dysregulation and the genotype-phenotype relationship in astrocytoma.

Figure 6.1: Tensor GSVD of patient- and platform-matched genomic profiles. The structure of the datasets D1 and D2 is that of two third-order tensors with one-to-one mappings between the column dimensions but different row dimensions. By leveraging the information captured in the structure of the data, the tensor GSVD can be used to identify platform-consistent patterns. Reprinted from Sankaranarayanan & Schomay et al. (2015) under the CC BY license [14].

References

[1] C. H. Lee, et al., “GSVD comparison of patient-matched normal and tumor aCGH profiles reveals global copy-number alterations predicting glioblastoma multiforme survival,” PLoS One, vol. 7, no. 1, p. e30098, 2012.
[2] C. F. Van Loan, “Generalizing the singular value decomposition,” SIAM Journal on Numerical Analysis, vol. 13, no. 1, pp. 76–83, 1976.
[3] O. Alter, et al., “Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms,” Proceedings of the National Academy of Sciences, vol. 100, no. 6, pp. 3351–3356, 2003.
[4] R. Wechsler-Reya and M. P. Scott, “The developmental biology of brain tumors,” Annual Review of Neuroscience, vol. 24, no. 1, pp. 385–428, 2001.
[5] M. Kool, et al., “Genome sequencing of SHH medulloblastoma predicts genotype-related response to smoothened inhibition,” Cancer Cell, vol. 25, no. 3, pp. 393–405, 2014.
[6] K. A. Aiello and O. Alter, “Platform-independent genome-wide pattern of DNA copy-number alterations predicting astrocytoma survival and response to treatment revealed by the GSVD formulated as a comparative spectral decomposition,” PLoS One, vol. 11, no. 10, p. e0164546, 2016.
[7] Cancer Genome Atlas Research Network, “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, vol. 455, no. 7216, pp. 1061–1068, 2008.
[8] C. Daumas-Duport, et al., “Grading of astrocytomas: a simple and reproducible method,” Cancer, vol. 62, no. 10, pp. 2152–2165, 1988.
[9] M. G. Netsky, et al., “The longevity of patients with glioblastoma multiforme,” Journal of Neurosurgery, vol. 7, no. 3, pp. 261–269, 1950.
[10] W. J. Curran, et al., “Recursive partitioning analysis of prognostic factors in three Radiation Therapy Oncology Group malignant glioma trials,” Journal of the National Cancer Institute, vol. 85, no. 9, pp. 704–710, 1993.
[11] M. E. Hegi, et al., “MGMT gene silencing and benefit from temozolomide in glioblastoma,” New England Journal of Medicine, vol. 352, no. 10, pp. 997–1003, 2005.
[12] H. Yan, et al., “IDH1 and IDH2 mutations in gliomas,” New England Journal of Medicine, vol. 360, no. 8, pp. 765–773, 2009.
[13] K. B. Geiersbach, et al., “FOXL2 mutation and large-scale genomic imbalances in adult granulosa cell tumors of the ovary,” Cancer Genetics, vol. 204, no. 11, pp. 596–602, 2011.
[14] P. Sankaranarayanan, et al., “Tensor GSVD of patient- and platform-matched tumor and normal DNA copy-number profiles uncovers chromosome arm-wide patterns of tumor-exclusive platform-consistent alterations encoding for cell transformation and predicting ovarian cancer survival,” PLoS One, vol. 10, no. 4, p. e0121396, 2015.
[15] S. P. Ponnapalli, et al., “A novel higher-order generalized singular value decomposition for comparative analysis of multiple genome-scale datasets,” in Workshop on Algorithms for Modern Massive Datasets (MMDS), Stanford University and Yahoo! Research, 2006.
[16] S. P. Ponnapalli, et al., “A higher-order generalized singular value decomposition for comparison of global mRNA expression from multiple organisms,” PLoS One, vol. 6, no. 12, p. e28072, 2011.
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6ms9p15