| Title | Classification and regression methods for functional brain network analysis |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Kahlert School of Computing |
| Author | Wong, Eleanor |
| Date | 2021 |
| Description | Problems involving network inference and predictions based on network data come up in many domains, such as in biology, ecology, economics, sociology, and neuroscience. These problems are not trivial to solve using existing methods. First, when inferring weighted undirected networks from data, the estimations must satisfy a matrix constraint that ensures the resulting networks are valid. Second, traditional machine learning methods are not designed to naturally work with network objects as input data. There is a need for the development of machine learning methods that can accommodate network-valued input beyond using vectorized correlation matrices, which do not fully convey information about network topology or constraints. The methods described in this work address these concerns with the aim to 1) learn network structure from multivariate data that enforces positive-definiteness, a necessary condition for valid networks; 2) make classification and regression predictions directly on the manifold of valid networks and also from their derived topological features; and 3) properly control for confounding covariates in prediction models. While these methods are applicable to general network problems, they are especially pertinent to brain network analysis from resting-state functional magnetic resonance imaging (rsfMRI) data. The first method presented is for estimating a sparse functional brain network from a subject's image data, as a positive-definite Gaussian graphical model under the challenging setting of high dimensionality and low sample size typical of neuroimaging datasets. Subsequently, knowing the network structures for all subjects in a population dataset, this dissertation explores whether topological features of networks carry information for prediction of phenotype. 
This dissertation then introduces methods for making classification and regression predictions from network-valued inputs as objects on the Riemannian manifold of symmetric positive-definite matrices. The results on an rsfMRI dataset of autism show that these network representations bring state-of-the-art improvement in prediction performance. Lastly, this work looks at how to train and interpret prediction models in the presence of confounding information. The methods discussed here are used to learn the components of brain connectivity that drive the prediction of behavior and disease. |
| Type | Text |
| Publisher | University of Utah |
| Subject | network analysis; rsfMRI; network inference; prediction of phenotype |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | © Eleanor Wong |
| Format | application/pdf |
| Format Medium | application/pdf |
| ARK | ark:/87278/s6hp8x2n |
| Setname | ir_etd |
| ID | 2100247 |
| OCR Text | CLASSIFICATION AND REGRESSION METHODS FOR FUNCTIONAL BRAIN NETWORK ANALYSIS

by Eleanor Wong

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computing.

School of Computing
The University of Utah
December 2021

Copyright © Eleanor Wong 2021. All Rights Reserved.

STATEMENT OF DISSERTATION APPROVAL

The dissertation of Eleanor Wong has been approved by the following supervisory committee members: Thomas H. Fletcher, Chair (approved 9/28/2021); Jeffrey Phillips, Member (approved 9/21/2021); Vivek Srikumar, Member (approved 9/21/2021); Guido Gerig, Member (approved 9/22/2021); and Brandon Anthony Zielinski, Member (approved 9/29/2021); by Mary Hall, Chair/Dean of the School of Computing; and by David B. Kieda, Dean of The Graduate School.

ABSTRACT

Problems involving network inference and predictions based on network data come up in many domains, such as in biology, ecology, economics, sociology, and neuroscience. These problems are not trivial to solve using existing methods. First, when inferring weighted undirected networks from data, the estimations must satisfy a matrix constraint that ensures the resulting networks are valid. Second, traditional machine learning methods are not designed to naturally work with network objects as input data. There is a need for the development of machine learning methods that can accommodate network-valued input beyond using vectorized correlation matrices, which do not fully convey information about network topology or constraints.
The methods described in this work address these concerns with the aim to 1) learn network structure from multivariate data that enforces positive-definiteness, a necessary condition for valid networks; 2) make classification and regression predictions directly on the manifold of valid networks and also from their derived topological features; and 3) properly control for confounding covariates in prediction models. While these methods are applicable to general network problems, they are especially pertinent to brain network analysis from resting-state functional magnetic resonance imaging (rsfMRI) data. The first method presented is for estimating a sparse functional brain network from a subject's image data, as a positive-definite Gaussian graphical model under the challenging setting of high dimensionality and low sample size typical of neuroimaging datasets. Subsequently, knowing the network structures for all subjects in a population dataset, this dissertation explores whether topological features of networks carry information for prediction of phenotype. This dissertation then introduces methods for making classification and regression predictions from network-valued inputs as objects on the Riemannian manifold of symmetric positive-definite matrices. The results on an rsfMRI dataset of autism show that these network representations bring state-of-the-art improvement in prediction performance. Lastly, this work looks at how to train and interpret prediction models in the presence of confounding information. The methods discussed here are used to learn the components of brain connectivity that drive the prediction of behavior and disease.

For Mom and Dad, to whom I owe everything. I can never thank you enough for your unconditional love and support.

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS

CHAPTERS

1. INTRODUCTION
   1.1 Motivation
       1.1.1 Network structural learning
       1.1.2 Machine learning with network data
       1.1.3 Prediction of neurological disorders from imaging
       1.1.4 Regression prediction of cognitive, behavioral, and personality traits from functional networks
   1.2 Contributions
   1.3 Thesis statement
   1.4 Overview
2. FUNCTIONAL CONNECTOMICS
   2.1 The human brain
   2.2 Functional MRI
       2.2.1 Task-based fMRI
       2.2.2 Resting-state fMRI
           2.2.2.1 rsfMRI datasets studied
   2.3 The connectomics pipeline
       2.3.1 Preprocessing
       2.3.2 Atlasing and parcellation
       2.3.3 Connectivity feature analysis
3. ADAPTIVE SPARSITY IN GAUSSIAN GRAPHICAL MODELS
   3.1 Introduction
   3.2 Related work
       3.2.1 Sparse network learning
       3.2.2 Sparsity in resting-state networks
   3.3 Contribution
   3.4 Methods
       3.4.1 Bayesian inference
       3.4.2 Adaptively sparse precision matrix estimation
       3.4.3 Gaussian likelihood parameterized by the Cholesky decomposition
       3.4.4 Adaptive sparsity prior
       3.4.5 Expectation maximization algorithm
   3.5 Results
       3.5.1 Simulated data
       3.5.2 Cell signaling data
   3.6 Application to resting-state functional network estimation
   3.7 Discussion
4. KERNEL PARTIAL LEAST SQUARES REGRESSION OF NETWORKS
   4.1 Introduction
   4.2 Related work
   4.3 Contribution
   4.4 Topological data analysis background
   4.5 Methods
       4.5.1 Applying persistent homology to brain networks
       4.5.2 Leveraging topological features for statistical analysis
       4.5.3 Partial least squares
       4.5.4 Kernel partial least squares regression
   4.6 Results
       4.6.1 Data
       4.6.2 Preprocessing
       4.6.3 Relating fMRI correlations to ADOS total scores
       4.6.4 Topological kernels
   4.7 Discussion
5. RIEMANNIAN REGRESSION AND CLASSIFICATION OF NETWORKS
   5.1 Related work
   5.2 Contribution
   5.3 Riemannian background
       5.3.1 Affine invariant metric
       5.3.2 Log-Euclidean metric
   5.4 Methods
       5.4.1 Implementation
           5.4.1.1 Backpropagation for Llog
           5.4.1.2 Backpropagation for Laff
   5.5 Results
       5.5.1 Dataset
       5.5.2 Classification
       5.5.3 Regression
   5.6 Discussion
6. CONFOUND ADJUSTMENT IN NETWORK DECODING PROBLEMS
   6.1 Introduction
   6.2 Background
       6.2.1 Confound control
       6.2.2 Model interpretation
       6.2.3 Fluid and crystallized intelligence
       6.2.4 Brain networks and intelligence
       6.2.5 Brain networks and sex
   6.3 Methods
       6.3.1 Confound selection
       6.3.2 Confound regression (CV2CR)
       6.3.3 HCP data
   6.4 Results
       6.4.1 Synthetic experiment
       6.4.2 HCP experiment
           6.4.2.1 Regression method
           6.4.2.2 Intelligence and confounds: difference between male and female groups
           6.4.2.3 Prediction accuracy of confound control methods
           6.4.2.4 "Knocking out" individual subnetworks
       6.4.3 rsfMRI network visualization
           6.4.3.1 Sex weights
           6.4.3.2 Fluid intelligence
           6.4.3.3 Crystallized intelligence
           6.4.3.4 Difference between male and female in crystallized intelligence
   6.5 Discussion
7. DISCUSSION
   7.1 Summary of work
   7.2 Discussion of methods
   7.3 Future directions

REFERENCES

LIST OF FIGURES

3.1 Correlation vs. partial correlation. Left: hub-and-spokes model; bold lines are positive partial correlation connections, thin lines are edges from full correlation. Center: full correlation matrix. Right: the lower half of the partial correlation matrix.
3.2 Sparsity by hierarchical prior. Comparison of estimates corresponding to Gaussian likelihood (dotted line), Laplace prior (dashed line), and proposed hierarchical prior (solid line) for the 2 × 2 matrix example. Estimates of both the diagonal entry (left) and off-diagonal entry (right) are shown.
3.3 ROC for GLASSO and the proposed method on 20 random synthetic datasets generated from a 40 × 40 matrix. Correctly detected sparse entries are considered true positives. In all 20 datasets, our method adaptively determined results (red dots) more optimal than cross-validating GLASSO over a range of ρ values (curves).
4.1 Persistent homology computation and barcode.
4.2 Mapping a brain network to the metric space.
5.1 Box plots of the classification accuracy over ten-fold cross-validation.
5.2 Plot of the connections with highest weights in the classification.
5.3 Classification weights grouped by subnetwork.
5.4 Predicted vs. true ADOS scores for regression under AIM.
5.5 Regression weights grouped by subnetwork.
6.1 Confound dependency. Confounds C affect both rsfMRI features X and target variables Y. Confound control is to learn the relationship between X and Y conditional on C.
6.2 Histograms of target and confound variables in the HCP dataset. Two-tailed t-tests show that male and female groups in the HCP dataset significantly differ in distributions for age, fluid, and crystallized intelligence variables.
6.3 Generated data experiment. Synthetic data generated from the model in Section 6.4.1 and predicted using 20-fold cross-validation shows how different methods of confound control affect prediction accuracy and weight estimation under various parameters. In all cases a)-f), WDCR and CVCR predict worse than CV2CR. When the confound is a major source of signal in X, i.e., e) and f), CV2CR outperforms Y ∼ X and Y′ ∼ X in both prediction COD and estimation of weights.
6.4 HCP dataset projected onto the top PCA dimensions. First and second PCA dimensions of the HCP rsfMRI dataset. Male subjects in blue and female subjects in red.
6.5 Predicted vs. true graphs. Comparison of true vs. predicted plots of no confound control and the best confound control models from Table 6.6. Blue points are male and red points are female subjects.
6.6 PCA dimensions. How the predictions would be across PCA dimensions if there were no inner-loop cross-validation to determine dimensionality. Black lines are (M+F) overall accuracy, while blue and red lines are accuracy for male and female groups, respectively. The dashed and dotted lines signify different confound regression schemes. We observed a large gap in accuracy between male and female groups for crystallized intelligence.
6.7 Male and female rsfMRI network differences.
6.8 Fluid intelligence network features.
6.9 Crystallized intelligence network features.
6.10 Crystallized intelligence models trained per sex group.

LIST OF TABLES

2.1 Graph-theoretic properties.
3.1 Results on simulated data.
3.2 Results on zero and nonzero entry prediction of cell signaling data.
3.3 Likelihood test from cell signaling data.
4.1 Improvement of including topological data analysis to ADOS regression.
5.1 Accuracy performance of Riemannian and various state-of-the-art classification methods.
5.2 Comparison of Riemannian features against baselines in PLS regression.
6.1 AIC selection of confounds.
6.2 Significant confound variables to fluid and crystallized intelligence.
6.3 Prediction accuracy of target variables Y by confound inputs only and by rsfMRI connectivity.
6.4 20-fold prediction accuracy of fluid intelligence.
6.5 20-fold prediction accuracy of crystallized intelligence.
6.6 20-fold prediction accuracy of CogCrystalComp unadj for subjects with scores < 135.
6.7 Knockout.

ACKNOWLEDGEMENTS

I would like to express my deep gratitude to my advisor, Professor Tom Fletcher, whose guidance and support have made this work possible.
I would like to thank my dissertation committee members for all of their thoughtful comments and feedback for improving my work. Moreover, I would like to especially thank my collaborators: Brandon Zielinski, Molly Prigge, Jeff Anderson, and Jace King, for providing the datasets for my analyses; and Bei Wang, Suyash Awate, and Sourabh Palande for their contributions that have helped my work come to fruition. In addition, a big thanks to all the wonderful students, faculty, and staff who have made Utah a great environment for intellectual exploration and made the graduate school experience worthwhile. Finally, I am forever indebted to my family: Mom, Dad, my brother, and my grandparents, who have always been patient, loving, and supportive of all of my endeavors. I would like to acknowledge my partner for the emotional support and for never ceasing to believe in me.

CHAPTER 1

INTRODUCTION

When many elements in a complex system interact with each other, it is useful to depict such a system as a network with nodes as elements and edges as interactions. Sometimes the interactions are given, such as whether two users are directly connected on a social network, but often they are not explicitly known. In such cases, they can only be inferred indirectly from observations. Some examples of inquiries across disciplines that deal with inferred networks are:

• Systems biology - Based on gene expression data, how do gene products interact with each other?
• Epidemiology - How does an infectious disease spread throughout a population? Can we determine who is infecting whom through unobserved contact?
• Atmospheric science - What are the climate dynamics in the ecosystem, based on sensor data?
• Finance - How connected are financial institutions? What are the spillover channels upon which economic shocks propagate?
• Urban planning - Aside from fixed road networks, what are the latent conditions controlling the dynamics of traffic flow between areas that can only be inferred via observation of traffic data?
• Neuroscience - Functional networks are regions in the brain that activate together with associated neural activity. How do differences in behavior reflect in the organization of functional networks?

In all of these cases, to construct the complex system as a network, we have to determine the presence of interaction between components and the strength of such interactions. If the structure of the network is known, the next question is whether we can predict what happens if any component of this system gets modified, and subsequently, predict the effects of a hypothetical system. Unfortunately, network analysis is a difficult problem because the observed data is often high dimensional, noisy, and affected by many confounding factors, leading to an underdetermined system. Depending on the domain, the noise can come from misreporting in the data or from the conditions under which the data is obtained. There is often no ground truth to compare to, and poor network inference would add compounding noise to subsequent prediction analysis. How networks are represented and used as input in prediction analysis can change the outcome of the analysis. Networks are very high dimensional objects and cannot be directly input into prediction algorithms that are intended to accept tabular data. Some information is lost when distilling network objects into a workable input format. This dissertation introduces methods for general network analysis, with application to functional brain network analysis from resting-state functional magnetic resonance imaging (rsfMRI).
The two questions that this dissertation focuses on are: 1) how to learn sparse networks from data in low-sample-size/high-dimensional settings, in this case, rsfMRI; and 2) how to use these learned networks to make predictions, i.e., of behavioral and neurological disorder phenotypes, and what this tells us about the brain connectivity of certain phenotypes. The first problem involves studying the dependence relationships between regions of interest (ROIs) in the brain based on their recorded time series in the rsfMRI. The second problem is to explore the different ways to represent network data, study whether the representations are distinct between different groups, and determine how to use these representations for the prediction of phenotypes. Ideally, we want to be able to identify the network variations in a population that serve as functional connectivity biomarkers linked to phenotypes. Confounding information, high levels of noise, the large dimensionality of rsfMRI data, and the relatively small sample sizes of available neuroimaging datasets present the main challenges to tackling these problems. This introductory chapter gives an overview of the motivations and contributions made and then describes the organization of this dissertation.

1.1 Motivation

Scientists currently do not have a very deep understanding of the brain. In the functional tissue of the brain, there are approximately 86 billion neurons on average, which are cells that pass messages to each other by producing electrical impulses. The communication between these neurons forms the functional network of the brain. With so many neurons, even if we were able to monitor their activity individually, there would simply be too much data to make sense of. Instead, we can observe the brain at a coarser level by dividing the brain into multiple regions and looking at how these regions communicate.
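As a rough illustration of this coarsening step, regional time series can be obtained by averaging voxel signals within each parcel of an atlas. The data and the three-region parcellation below are entirely synthetic, and this is only a sketch of the idea, not the preprocessing pipeline used in this dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, V = 120, 9                       # 120 time points, 9 voxels
bold = rng.standard_normal((T, V))  # synthetic BOLD signals, one column per voxel

# A toy parcellation: label each voxel with one of 3 regions of interest (ROIs).
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Region-level signal = mean over that region's voxels at each time point.
roi_ts = np.stack([bold[:, labels == r].mean(axis=1) for r in range(3)], axis=1)
print(roi_ts.shape)  # (120, 3): one averaged time series per region
```

In practice the parcellation comes from an anatomical or functional atlas (see the atlasing and parcellation discussion in Chapter 2), but the averaging step itself is this simple.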
Few treatments are available for neurological connectivity disorders such as autism, because the pathways through which these diseases develop and propagate are not known. The effect of such diseases on the brain is not uniform; only certain areas of the brain are vulnerable. To begin to decode the brain, we can start by finding neural correlates of observable phenomena of human experience. Functional magnetic resonance imaging (fMRI) records the time series of blood oxygenation in the brain, which is a proxy for neural activity over time. Because the brain remains active even when subjects are at wakeful rest, rsfMRI, imaging collected in the resting state, is also of meaningful use to research. These signals show patterns of temporal dependency between regions, from which we can indirectly infer functional connectivity and map the brain network. Hopefully, interpreting the results narrows down the brain to regions of interest that should be further experimentally investigated. More recently, researchers have found that links between behavior and resting-state network (RSN) patterns extracted from rsfMRI are shared across the general population, which makes rsfMRI an appealing modality for studying brain organization. Moreover, the imaging is simple (no tasks are involved) and noninvasive (no contrast agent is needed), which is ideal for imaging large groups. Some of the topics encountered are as follows:

1.1.1 Network structural learning

Resting-state fMRI analysis begins with multivariate time series data, where each dimension is a blood oxygen level dependent (BOLD) signal recorded at a location of the brain. Typically, a participant's functional network can be derived by calculating the temporal association between regions. However, a well-known limitation of rsfMRI is the tradeoff between spatial and temporal resolution.
Acquiring images with finer voxel resolution requires a longer time to cover the brain; hence a smaller sample of time slices can be recorded within a fixed time period. High spatial resolution is desirable for researchers to accurately pinpoint locations of anomalous connectivity, but this can lead to instability in the computation of certain association measures when the dimensionality outsizes the number of time frames. There is no ground truth for the appropriate definition of regions to assign as nodes in the network. It is also unknown whether edges between network nodes would be better modeled as directed or undirected, and weighted or binary. Another challenge in the network estimation problem is that some measures of association require choosing parameters in the computation, which could lead to very different estimations of the network. For example, some studies have used Granger causality to infer directed edges in the graph, which has a parameter for time lag that needs to be fixed. For binary networks, a threshold needs to be set for what level of association passes for an edge in the network. A poor estimation of a network is not helpful to interpretation. Modeling the RSN as a weighted undirected network is showing promise for rsfMRI analysis. It can be denoted by a symmetric matrix of size d × d, with the d nodes being brain regions and each entry of the matrix being the weight of the pairwise association. Input features can subsequently be extracted from the estimated networks for further analysis, e.g., by supervised and unsupervised machine learning algorithms.

1.1.2 Machine learning with network data

Once the structures of brain networks for individuals are inferred, the next task is to figure out the types of features that can be derived from networks as input into machine learning algorithms.
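To make the d × d representation concrete, here is a minimal sketch on synthetic regional time series, using pairwise Pearson correlation as the association measure and then vectorizing the strict upper triangle into the tabular form machine learning algorithms expect. This is an illustration only, not the network estimation method proposed in this dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 150, 5                      # T time points, d brain regions
ts = rng.standard_normal((T, d))   # synthetic regional BOLD time series

# d x d symmetric association matrix: pairwise Pearson correlation.
C = np.corrcoef(ts, rowvar=False)
assert np.allclose(C, C.T) and np.allclose(np.diag(C), 1.0)

# Vectorize the strict upper triangle: d*(d-1)/2 features per subject.
iu = np.triu_indices(d, k=1)
features = C[iu]
print(features.shape)  # (10,) for d = 5
```

This vectorization is exactly the lossy flattening discussed next: the resulting feature vector no longer carries the matrix's symmetric positive-definite structure or its topology.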
The most straightforward input is the vectorized triangular part of each subject's symmetric network association matrix (for example, computed by pairwise correlation or partial correlation between brain regions). However, a network is a type of object that comes with intrinsic properties, and some information is lost when simply vectorizing a network association matrix into multivariate tabular data. Using vectorized networks as tabular input into machine learning algorithms takes into account neither the topological structures characteristic of each network, nor their symmetric positive-definite nature. By representing networks differently, either as objects that lie on a non-Euclidean manifold or by characterizing them with a tally of their topological features, different properties can be preserved for use in machine learning analysis.

1.1.3 Prediction of neurological disorders from imaging
Autism spectrum disorder (ASD) is an umbrella term for a range of neurodevelopmental disorders characterized by impairment in social interactions. ASD is still poorly understood, estimated to affect 1% of the population, with large variation in symptoms and severity of the condition among affected individuals. The causes and mechanism of how autism occurs are uncertain, but a popular theory is that patients with autism are characterized by anomalous development of too many and too few neurons in different areas of the brain. Because the BOLD signal in rsfMRI is linked to the neuronal activity of the brain, this hypothesis can be tested by studying whether functional connectivity features from rsfMRI have any predictive power in the classification and regression of autism severity measures. Researchers are interested in methods that not only predict well, but can also be used to identify the driving factors behind the prediction.
With an interpretable machine learning model, we can locate where functional connectivity differs between neurotypical and neurodivergent groups.

1.1.4 Regression prediction of cognitive, behavioral, and personality traits from functional networks
Some phenotypic metrics are on a continuous scale, such as measurements of intelligence or disease severity. Studies suggest that the distribution of intelligence levels is associated with brain network organization in a healthy population controlled for age. Identifying the connections that are predictive of intelligence helps us understand cognition in the brain. Similarly, some diseases such as autism and schizophrenia are spectrum disorders that range in severity from high to low functioning. Learning where and how connections between regions vary in ways linked to disease severity narrows down the neurons that need to be studied for possible therapies.

1.2 Contributions
This dissertation presents methods to draw insight about brain networks from rsfMRI data that address the above topics. The contributions are:
1. Sparse parameter-free learning of Gaussian graphical models - Gaussian graphical models have been shown to be effective in modeling rsfMRI networks. This contribution is a probabilistic method for learning sparse Gaussian graphical models that enforces a symmetric, positive-definite precision matrix, which is a necessary condition for non-degeneracy. Through sparsity, network estimation methods can be made robust in the high-dimensional, low-sample-size setting. Moreover, sparsity succinctly represents network features for further downstream analysis. The proposed method of sparsity does not suffer from the soft-thresholding bias of L1-penalized approaches. The parameter-free nature of the model makes it an attractive candidate for incorporation into more complex analyses where it is not practical to cross-validate over too many parameters.
The model can also be applied to other domains where studying the relationship between variables is a problem of interest, such as finance or biochemistry.
2. Classification and regression of network data under Riemannian metrics - When rsfMRI data for a group of participants are represented as symmetric positive-definite connectivity matrices, each matrix is a point on the smooth Riemannian manifold of symmetric positive-definite matrices. Euclidean distance is not a suitable metric on this manifold, so Euclidean methods may be inappropriate for machine learning. Here the dissertation demonstrates that using methods designed for the manifold is preferable to standard Euclidean methods for rsfMRI classification and regression problems.
3. Regression of network data using topological data analysis - Another way to describe connectivity networks is by mapping brain regions to a point cloud in a metric space and characterizing the topological features using persistent homology. This study proposes a method that incorporates topological data analysis into regression, and shows that the persistent homology features of brain networks are helpful in predicting autism severity.
4. Confound control - To correctly interpret the learned models predictive of behavior and identify the relevant network features, the effects of any potential confounding information should be disentangled from the analysis. The existing literature has not reached agreement on how to perform confound control for regression problems. This study examines how different methods of confound control impact prediction and interpretation, and suggests one way to avoid accuracy bias when dealing with confounds.

1.3 Thesis statement
Networks are objects that have properties such as positive-definiteness and characteristic topological features.
When using network data for predictions, taking these properties into account, along with properly controlling for confounds, improves performance in classification and regression tasks.

1.4 Overview
This dissertation is organized as follows:
• Ch. 2 covers the background to resting-state networks. The chapter discusses what functional imaging is, the types of neuroscience questions it tackles, and what results have been found using rsfMRI. It then describes the steps in the analysis pipeline, from preprocessing raw images to analysis of derived features.
• Ch. 3 proposes a method for sparse functional network estimation from rsfMRI data. The method estimates a network from time series data as a Gaussian graphical model, with parsimony in network edges achieved by sparsifying entries in the association (precision) matrix of the network. Imposing the sparsity constraint on the Cholesky decomposition of the association matrix ensures that the final estimated precision matrix remains positive-definite, and is thus non-degenerate. The method has been used to estimate a protein signaling network from gene expression data. In an autism rsfMRI dataset, connectivity is estimated for individuals and used to test its suitability for diagnosis on population data.
• Ch. 4 describes topological data analysis of rsfMRI data for prediction of autism severity. Association matrices are mapped to a point cloud in a metric space, and persistent homology features are computed. The resulting features are used as input into regression against an autism severity measure.
• Ch. 5 describes an approach for classification and regression of connectivity matrices on the Riemannian manifold of symmetric positive-definite matrices. Two metrics, the log-Euclidean and the affine-invariant metrics, are employed for the classification of autism and regression to autism severity.
For the first metric, manifold-valued connectivity data are projected to the origin prior to analysis. With the second metric, the optimal point of projection is learned from the data. The optimization for both metrics can be flexibly adapted into multilayer neural network architectures.
• Ch. 6 discusses how controlling for confounds can affect prediction accuracy and interpretation of the learned model. We can understand the brain better if we can interpret the features that drive the predictions in learned classification and regression models. For the simplest models, this involves finding the highest absolute-value weights of the models. This chapter shows that confounds can cause poor predictions and an erroneous interpretation of the weights under poor confound control. The most appropriate method of confound control, of the methods compared on synthetic examples, is then used to discover the neural underpinnings of intelligence.
The dissertation concludes in Chapter 7, tying together the presented contributions with a discussion of implications for future research.

CHAPTER 2
FUNCTIONAL CONNECTOMICS

As of now, there is still much that we do not understand about the human brain. While we know that the brain comprises billions of neurons that transmit electrochemical signals in a network of synapses, it remains a mystery how these signals encode thoughts, memory, and action. Moreover, the encoding may differ depending on the brain region. What we do know is that the cerebral cortex is responsible for cognitive function and controls voluntary actions. The cerebral cortex is the layer of gray matter of neuronal cell bodies that covers the white matter of myelinated axons. Neuroimaging has shown that individual tasks are associated with increased activity in certain areas of the cerebral cortex.
Before we can answer questions like why a specific group of neurons is responsible for language, we first have to identify where these language neurons are located and how they are connected.

2.1 The human brain
The map of the neural connections in an organism is called a connectome, and the study of such maps is called connectomics. The terms were coined independently in the mid-2000s by Olaf Sporns [160] and Patric Hagmann [69], in an era when the technology to collect biological data at a massive scale had introduced new fields of -omics sciences endeavoring to map various aspects of biological systems computationally, à la genomics, proteomics, etc. A recent revolution in noninvasive neuroimaging technologies has made it feasible to acquire multimodal imaging data from many subjects at high spatial and temporal resolutions. Although such technology is not currently capable of reading minds as in science fiction, neuroimaging is becoming an indispensable tool for researchers in the disciplines of medicine, neuroscience, and psychology to visualize structure and activity in the brain. In particular, we can determine that neurons are functionally connected by their correlated activity between regions. The hope is that we can derive generalizable connectivity maps that explain the multifaceted diversity of cognitive traits and neurological phenotypes. The hypotheses made from surveying the brain using neuroimaging can then be verified by more invasive methods, e.g., electrical stimulation, to see how changes in signals between these functional clusters of neurons affect behavior from the baseline. In turn, we can begin to understand brain pathology, which has traditionally been difficult to study because many neurological diseases do not exhibit pathognomonic signs.

2.2 Functional MRI
Magnetic resonance imaging (MRI) is a noninvasive, in vivo technique for imaging anatomy.
An MRI scanner produces a magnetic field that aligns the spin of hydrogen protons in the human body at a certain resonance frequency. When the resonance-frequency pulse is switched off, the hydrogen atoms relax and emit a nuclear magnetic resonance (NMR) signal. The receiver coils in the machine pick up the NMR signal to generate an image of the physiological structures. Functional magnetic resonance imaging (fMRI) is a type of MRI that measures the relaxation of protons indirectly associated with neuronal activity by acquiring a series of images over a duration of time. During a 2-6 second period after stimulus onset, oxygenated blood is transported to the associated brain regions to facilitate the passing of action potential signals between neurons, displacing deoxygenated blood [111]. This blood oxygen level dependent (BOLD) response recovers to original levels 6-12 seconds later. Since only deoxygenated hemoglobin in the blood is paramagnetic (prone to magnetism), the variation in its interference with the magnetic resonance signal maps the activated neurons over time. This BOLD response is best detected by fMRI scanning at a temporal resolution of 1- to 2-second intervals. The spatial resolution of fMRI ranges from voxels of size 5 mm to 1 mm, which is detailed enough to distinguish Brodmann regions [21] in the cerebral cortex, subcortical nuclei, and hippocampal subfields. The noninvasive nature of fMRI makes it an attractive modality in neuroscience research and clinical applications to map regions of brain activity. It is simple to perform, reproducible, and does not require the use of contrast agents, as opposed to other functional imaging methods such as positron emission tomography (PET). Neuroscience research applications range from identifying regions associated with various stimuli to gaining a better understanding of neurodevelopmental and neurodegenerative disorders. In a clinical setting, fMRI is used for presurgical planning for tumor resection.
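The 2-6 second rise and 6-12 second recovery of the BOLD response described above are often approximated by a canonical double-gamma hemodynamic response function. The following is a hedged sketch of that standard approximation; the parameter values are commonly used defaults, not taken from this dissertation:

```python
import math

def gamma_pdf(t, shape):
    """Gamma density with unit scale, the building block of the HRF model."""
    if t <= 0:
        return 0.0
    return t ** (shape - 1) * math.exp(-t) / math.gamma(shape)

def canonical_hrf(t, peak_shape=6.0, undershoot_shape=16.0, ratio=1 / 6):
    """Double-gamma HRF: a positive peak near 5 s minus a late undershoot."""
    return gamma_pdf(t, peak_shape) - ratio * gamma_pdf(t, undershoot_shape)

# Sample the response over 30 s and locate the peak of the BOLD rise
ts = [i / 10 for i in range(301)]
vals = [canonical_hrf(t) for t in ts]
t_peak = ts[vals.index(max(vals))]  # falls in the 2-6 s window after onset
```

The late negative lobe of this function mirrors the post-stimulus undershoot before the signal recovers to baseline.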
Functional MRI studies generally fall under two categories: task-based (tfMRI) and resting-state (rsfMRI). In a task-based fMRI experiment, the experimenter assigns participants tasks or a series of stimuli and analyzes their neural responses. In rsfMRI, subjects are scanned while explicitly asked not to perform any tasks. The brain activity occurring while participants are at wakeful rest reveals information about the functional organization between regions of the brain.

2.2.1 Task-based fMRI
Task-based functional MRI experiments are designed to see which areas of the brain are activated by different categories of processing and action. Spatially separated regions responding in sync are described as being functionally connected, forming functional subnetworks. For example, tfMRI has led to the key finding that the fusiform face area in the temporal lobe is responsible for facial perception and recognition in healthy adults, by imaging subjects' brain activity while showing them faces versus inanimate objects. Using this imaging modality, researchers have been able to locate the regions responsible for visual and somatosensory information, language function, attention and working memory, episodic memory, emotion processing, and decision making [16]. All of these results bring us closer toward understanding the organization of the brain. Some well-known functional subnetworks are the default mode network (DMN), attention (dorsal and ventral), auditory, sensorimotor, cingulo-opercular, and fronto-parietal networks [188]. Careful planning, with stimuli and tasks presented in repeated blocks and/or events over a period of time, is necessary for a good signal-to-noise ratio in the data. Moreover, factors such as demographics, e.g., age and sex, can have a confounding influence on the experimental outcome. Depending on the project aim, the stimulus may need to be designed in such a way as to address these complicating factors.
2.2.2 Resting-state fMRI
The brain continues to be active while subjects are at passive rest, notably regions in the DMN, located in the medial prefrontal cortex, posterior cingulate cortex/precuneus, and angular gyrus. The DMN is mainly associated with introspective activities such as daydreaming and mind-wandering, but can also be activated by social working memory tasks [161]. Other regions can also spontaneously coactivate, with some studies proposing that resting-state functional connectivity patterns may be intrinsically unique to the individual [56]. However, there are also general trends in rsfMRI patterns across a population. Population-wide studies of functional connectivity profiles suggest that rsfMRI captures a wide range of information about the individual and has informed our understanding of: neurological disorders (Alzheimer's and other dementias [4, 150], schizophrenia [95]), cognitive ability (intelligence [183]), demographics (sex [144, 183], age [54, 183]), and emotional states (life satisfaction [155]). It is known that the brain, adjusted for volume, is structurally different between age groups. Differences have been observed in functional studies as well; see [146] for a review of rsfMRI aging studies. The same can be said for male and female sex groups: sex classification from rsfMRI can be achieved with high accuracy [71]. Resting-state functional MRI has also been used to classify individuals with autism [84], schizophrenia [133], and Alzheimer's disease [20, 172] with varying degrees of success. In one population of healthy young adults, exploratory analysis identified a single axis along which rsfMRI data covary with psychometric measures and lifestyle behaviors such as tobacco and cannabis usage [155]. Resting-state fMRI analysis offers many promises for decoding the workings of the brain. The simple administration of rsfMRI experiments is a significant advantage over tfMRI, which relies heavily on task design and subject compliance.
A task stimulus may be impractical for studies of neurological disorders. Neuroimaging data acquisition is an expensive process: there need to be enough volunteers for the analysis or the study will be underpowered, and the scan duration needs to be sufficiently long to capture meaningful signals. The lower barriers to conducting an rsfMRI experiment compared to tfMRI make it easier to collect the sample size necessary for sound, generalizable inference. Moreover, the paradigm in a tfMRI experiment is designed around a particular scientific question, so the data collected can only be used to answer that specific question [8]. RsfMRI data can be reused for studies of different behavioral measures and offer better flexibility for exploratory work [155]. That said, analyzing the rich data provided by rsfMRI is not without challenges. On the target-variable side, there is no guarantee that the measurements of the behavior of interest are reliable or immutable. On the imaging-input side, there are many possible sources of noise in rsfMRI data, leading to a low signal-to-noise ratio compared to other types of imaging data. The brain operates quickly, at the millisecond scale, yet rsfMRI frames are only recorded at second-long intervals. Many imaging artifacts produce readings of synchrony between brain regions that have nothing to do with neural function, such as the thermal and magnetic properties of the scanner drifting over time. Next, heart rate, respiration, blinking, swallowing, and other involuntary movements of participants are captured in the recordings and can obscure the signal of interest. Despite best efforts to prevent head motion, the tiniest movement can still produce large artifacts. Collecting a sufficient number of time slices in a brain scan is therefore requisite for a stable estimate of the functional connectivity of each subject.
Participants need to stay still in the resting state for at least 5-7 minutes inside the MRI machine [167], which may be difficult for those with certain health conditions, e.g., children with autism or developmental delay. Since no stimulus is presented in rsfMRI, the functional connectivity signals are not in sync between subjects. In addition, even the recorded signals for the same subject can differ from one session to the next [113], due to causes including the rumination of unwanted thoughts [102] and fluctuating wakefulness levels [164] during the resting state. It is difficult to normalize the data for these sources of variance. Without a doubt, one encounters the curse of dimensionality (p >> n) when examining population rsfMRI data. As is typical of neuroimaging studies, the low sample size n of study participants does not grow in proportion to the higher resolution p of rsfMRI features as scanning technology advances. It is possible to collect larger sample sizes by aggregating data from different institutions, such as in ABIDE [42], but site differences confound the analysis and lead to questions of how one should account for such differences. First, different hospitals may vary in imaging equipment, and scans are acquired under different protocols. Subjects may have been told to rest with eyes closed instead of open. Any preprocessing done to the images could also be different. Second, versions of behavioral or cognitive evaluations may be inconsistent between acquisition sites, e.g., different IQ tests, or disease severity recorded on different scales. These inconsistencies mean that there are high levels of data heterogeneity and possibly large subsets of participants with missing data. Lastly, the data distributions may simply differ across locations because of the demographics they serve.
Consequently, the amount of noise prevalent in both input and target variables presents unique challenges to building rsfMRI models of behavioral phenomena.

2.2.2.1 rsfMRI datasets studied
This dissertation will develop methods for rsfMRI analysis that address some of the properties of datasets from the Autism Brain Imaging Data Exchange (ABIDE) [42] and the Human Connectome Project (HCP) [168]. ABIDE is an aggregate of rsfMRI and structural MRI scans of healthy subjects and subjects with autism spectrum disorder (ASD) from 20 different sites, along with phenotypic information such as intelligence, social, and communication evaluation metrics. Conditions falling under the ASD umbrella affect approximately 1% of people in the world [173], yet there are still many gaps in our knowledge about ASD. People with ASD exhibit a wide range of symptoms and behavior, from major difficulties in social interaction and learning, to cognitively high functioning but with deficits in communication and emotion recognition. From this dataset, the hope is to be able to extract useful knowledge about the neural basis of autism. The HCP dataset of healthy young adult subjects is collected using state-of-the-art instrumentation and acquisition technology. It includes behavioral and genetics data with the aim of establishing a baseline for functional connectivity in relation to what is typical in human personality, temperament, and response to stimuli.

2.3 The connectomics pipeline
We can construct connectomes from resting-state fMRI, which identifies how distinct regions are functionally connected. To use the components of the connectome to predict certain conditions or behavior in a population, the general machine learning pipeline is to:
1. Preprocess raw images
2. Atlas / parcellate into regions of interest (ROIs)
3. Derive the connectivity features for each subject
4.
Use machine learning on connectivity features
The following sections of this chapter discuss each step in more detail, covering the background, associated challenges, and where the contributions of this dissertation fit in.

2.3.1 Preprocessing
There are several popular pipelines for fMRI preprocessing [37, 185, 186] that share a general set of steps. Conventionally, the first step is structural preprocessing typical of MRI images, which includes removing non-cerebral tissues from the images and normalizing images to a template so that comparison across subjects is possible. For functional preprocessing specific to fMRI images, the basic steps are:
1. Slice timing correction - Because a single 3D fMRI volume of the brain is imaged as a sequential series of 2D slices, slice timing correction adjusts for the slight latency that occurs between slices. One slice is selected as the reference and all other slices are interpolated according to the reference slice.
2. Motion realignment - To account for subject movement in the scanner over the duration of the scan, images in the time series are aligned together by rigid-body transformation, minimizing some cost function, e.g., least squares or normalized correlation ratio. A rigid-body transformation is defined by translation and rotation in the X, Y, and Z axes.
In addition to the basic preprocessing steps above, nuisance variables are regressed out of the signal to remove the confounding effects of low-frequency scanner drift and physiological processes, i.e., breathing and heartbeat.

2.3.2 Atlasing and parcellation
Because the acquisition of the images tends to be at a higher resolution than the brain structures of interest, prior to analysis and construction of brain networks, the signals from the brain are grouped into nonoverlapping regions.
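This grouping step reduces many voxel-level signals to one representative time series per region, commonly by averaging within each parcel. A toy sketch of that reduction, with hypothetical label assignments (pure Python, not a real parcellation):

```python
def parcel_average(voxel_ts, labels):
    """Reduce voxel-level time series to one mean time series per parcel.

    voxel_ts: list of per-voxel time series (equal lengths)
    labels:   per-voxel parcel id, same order as voxel_ts
    returns:  dict mapping parcel id -> averaged time series
    """
    parcels = {}
    for ts, lab in zip(voxel_ts, labels):
        parcels.setdefault(lab, []).append(ts)
    return {lab: [sum(vals) / len(vals) for vals in zip(*group)]
            for lab, group in parcels.items()}

# Three voxels, two parcels (labels are made up for illustration)
voxel_ts = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
labels = [0, 0, 1]
roi_ts = parcel_average(voxel_ts, labels)
# parcel 0 averages its two voxels; parcel 1 keeps its lone voxel's series
```

In practice the labels come from one of the atlasing or parcellation schemes described next, rather than being assigned by hand.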
This can be done in one of three ways: 1) according to specific regions of interest (ROIs) known from previous experiments; 2) according to a general atlas; or 3) by data-driven parcellation. The first two types of methods rely on prior information, such as meta-analyses. The Harvard-Oxford parcellation [41] is determined by manually identifying cortical ROIs. The ROIs derived in the Power parcellation [129] are spheres of fixed radii centered at voxels that are functionally modulated during certain tasks. A method such as the parcellation from Glasser et al. [65] combines both manual and automated delineation of areal regions based on myelin, cortical thickness maps, and functional connectivity. On the other hand, the data-driven parcellation approach looks at group data, reducing the high-dimensional (from image resolution) time series rsfMRI data to an average signal from each parcel of similarly behaving image voxels. Data-driven parcellation may be more suitable depending on the problem of interest. Parcellation models fall under the categories of clustering methods [38, 66, 67, 188], component analysis by independent component analysis (ICA) or principal component analysis (PCA) [2, 171], and Markov random fields [110, 143].

2.3.3 Connectivity feature analysis
After preprocessing the rsfMRI data of all subjects, the whole dataset is of dimensions d × t × n, where d is the number of ROIs, t is the number of frames over the duration of the scan, and n is the number of participants in the dataset. The type of feature derivation depends on the scientific question being asked. Ultimately, the goal of connectomics research is to improve our understanding of the human brain. Many methods have been proposed for representing rsfMRI networks in population-level studies, such as for learning connectivity differences between groups, or for the task of making classification and regression predictions.
Although a few methods work directly with the multivariate time series signals obtained from the preprocessed images (e.g., [51, 165]), the most common approach is to explicitly describe functional connectivity for each subject as a network graph [154]. Using the time series signals obtained from a set of ROIs in an rsfMRI scan for each individual, a graph can be constructed with ROIs as nodes and edges indicating the pairwise temporal dependence between regions. Some works also consider the dynamics of such graphs changing over time [7, 40, 86, 133, 193]. Pairwise Pearson correlation between ROIs is the simplest example of an undirected measure of temporal dependence, which leads to a representation of a subject's network graph as a correlation matrix. Other measures include partial correlation, coherence, mutual information, and causal modeling [154]. In general, there is a tradeoff between the robustness and the complexity of the model. The next chapter will discuss this in detail. With d ROIs, the full functional network features are of p = O(d²) dimensionality. The dimensionality of the data becomes even more challenging to work with given the low sample size n of existing datasets. At the typical parcellation of 100+ ROIs in an fMRI image, such as the popularly used Harvard-Oxford parcellation [41], there would be at least 5,000 variables. In contrast, the number of subjects in an rsfMRI study is usually on the order of hundreds or fewer. This setting can be problematic, leading to low statistical power, unstable computations, and overfit models. Dimensionality reduction and sparse techniques are possible approaches to tractable analysis. Graph-theoretical measures are popular in connectomics because they are simple to compute and reduce the network to only a handful of features. Using summary features of the network gives a qualitative understanding of the variation in connectivity.
Measures are derived from network properties and require an initial thresholding of the network weights to binary edges. Some of these metrics are described in Table 2.1 (see [23] for a full explanation). It has been found that graph-theoretical measures capture network efficiency and modularity differences between the healthy control population and people with major depressive disorder [187], autism spectrum disorder [90, 97], and Alzheimer's disease [99], among other neurological disorders. However, there are two issues: 1) While these measures usually reveal some idea about whether the topology of the brain networks is characterized by many hubs, or whether there are many long- or short-distance connections, they do not indicate where these features are located in the brain. 2) Another drawback is the need to threshold weighted networks, because typical graph-theoretical metrics are derived from networks with binary edge weights. There are two approaches to thresholding: absolute and proportional. In absolute thresholding, edges with weights above a certain value (such as 0.3) are labeled as 1s and all others are set to 0. In proportional thresholding, the top n% of strongest edges are set to 1s and the rest to 0s. This percentage value is called the density of the graph [3, 17, 64, 91]. In the study by [76], neither thresholding method is recommended for comparing networks between subject groups. For networks that systematically exhibit weaker functional connectivity, thresholding biases the network characterization toward that of a random graph (one in which vertices and edges are determined randomly). The graph-theoretic metrics derived from these thresholded networks are sensitive to this undesirable bias, which is not helpful for understanding the underlying differences between subject groups.
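The two thresholding schemes can be made concrete with a short sketch (pure Python, toy weights; a generic illustration, not code from the cited studies):

```python
def absolute_threshold(weights, tau):
    """Binarize: an off-diagonal edge is present iff its weight exceeds tau."""
    d = len(weights)
    return [[1 if i != j and weights[i][j] > tau else 0
             for j in range(d)] for i in range(d)]

def proportional_threshold(weights, density):
    """Keep the top `density` fraction of strongest off-diagonal edges."""
    d = len(weights)
    vals = sorted((weights[i][j] for i in range(d) for j in range(d) if i != j),
                  reverse=True)
    k = max(1, int(round(density * len(vals))))
    cutoff = vals[k - 1]
    return [[1 if i != j and weights[i][j] >= cutoff else 0
             for j in range(d)] for i in range(d)]

W = [[1.0, 0.6, 0.1],
     [0.6, 1.0, 0.4],
     [0.1, 0.4, 1.0]]
A_abs = absolute_threshold(W, 0.3)        # edges 0-1 and 1-2 survive
A_prop = proportional_threshold(W, 1 / 3)  # only the strongest edge (0-1) survives
```

The example also shows the bias mechanism: had every weight in W been scaled down, the absolute rule would erase edges while the proportional rule would keep the same graph.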
Alternatively, to address this second issue of hard thresholding, topological data analysis (TDA) offers a qualitative way of describing network topology in terms of voids and connected components over all scales [25]. As topological features appear and disappear while the connectivity level at which vertices merge is adjusted, they can be summarized as persistent homology features. The use of TDA applied to brain networks will be explored in Chapter 4. Neural network methods have become increasingly popular in many computational domains for their ability to automatically extract features and learn nonlinear relationships in the data. Yet, at this point, it is unclear whether they apply well to rsfMRI datasets because of the limitations of high noise levels and low sample sizes. A study by Schulz et al. [148] on the UK Biobank dataset of structural and functional brain imaging data [24] found that linear models perform on par with more complex models even at sample sizes approaching 10,000, unlike on computer vision datasets, for which deep models substantially outperform simpler methods at similar scales. He et al. [71, 72] have compared the prediction performance of deep architectures proposed for rsfMRI analysis, e.g., graph convolutional networks [125], against linear kernel regression on the HCP data and note that the latter generally predicts as well. Simple methods, such as linear regression or logistic regression, are easiest to interpret by looking at the weights of the model. When applied to rsfMRI analysis, this would show which connection features in the brain drive predictions. Given that it is more challenging to interpret the learned models of deeper architectures [118], without clear benefits in accuracy, this dissertation focuses on shallow methods that leverage the properties of network data to most straightforwardly explain connectivity phenomena in the brain.

Table 2.1. Graph-theoretic properties.
1.
1. Degrees - The number of edge connections extending from a single node. While the distribution of degrees in a random network is Gaussian, the degree distribution of a nonrandom network should exhibit a heavy tail toward higher degrees that follows a power law.
2. Centrality - The measure of how many shortest paths go through a node. Nodes with high centrality are hubs that are crucial to the efficiency of the network.
3. Modularity - A network is modular if it is characterized by clusters of highly intraconnected nodes, with sparse connections between modules. The clustering coefficient is the ratio of connections among the nearest neighbors of a node over the total number of possible connections.
4. Graph-efficiency - A measure inversely related to the minimum number of edges needed to get from one node to another.
5. Small-worldness - The small-worldness of a network whose degrees exhibit a power law distribution is measured by the ratio of clustering coefficient to path length, normalized by comparison to that of a random network.

CHAPTER 3
ADAPTIVE SPARSITY IN GAUSSIAN GRAPHICAL MODELS

A network graph, G = (V, E), is a set of nodes, V, and edges, E. It is a way to represent complex relational data in which object elements are vertices and the interactions between them are edges. In a directed graph, edges indicate causal relationships from one object to another. An undirected graph models mutual associations with no directionality. There are many types of network structure estimation techniques under the categories of directed and undirected networks, with either binary or weighted edges. This chapter focuses on the study of weighted undirected graphs. The chapter is based on collaborative work done with Suyash Awate and Tom Fletcher, published in the article "Adaptive sparsity in Gaussian graphical models" at the International Conference on Machine Learning (2013) [182].
3.1 Introduction
In some problems, the edges in the network are explicitly known, but often in biologically motivated problems, the edges need to be inferred. To infer a weighted undirected graph, each edge has a weight value to be estimated that explains the strength of the association between nodes. Such a graph can be described as a symmetric matrix, Ω, of size d × d, where d is the number of nodes. Each entry of the matrix is the pairwise weight of the edge between the two nodes. A 0 entry indicates no edge. This weighted graph inference problem is particularly common in neuroscience and genomics. In functional brain network analysis, many approaches have been proposed for characterizing network graphs based on graph theory. Some interesting network properties include the location of highly connected hub nodes, clustering hierarchies, and metrics for small-worldness and efficiency of graphs, covered in Chapter 2. In order to derive these high-level features, it is necessary to first obtain the low-level structure of the graph, namely the network matrix, Ω. The choice of what kind of values are used as weights for Ω has significant impact on conclusions drawn from high-level analysis. It is typical to use a thresholded Pearson's correlation matrix as input into graph theoretical analysis. A couple of arguments urge caution when following these steps [156]. First, a high-level attribute summarizes a network down to a single value, e.g., network efficiency, which results in a loss of relevant information. Second, a full correlation matrix could be so dense that graph theoretic measures like communication path length may not make much sense (every node is connected to every other). To address these concerns in rsfMRI analysis, Smith et al. [154] have compared several types of network models to see how well they capture the properties of synthetically generated brain signals under a variety of scenarios.
Ranging from data-driven and undirected to complex a priori directed models (in order of increasing complexity: full correlation, partial correlation, Bayes nets, structural equation modelling, dynamic Bayes nets, dynamic causal modelling), they find that the data-driven models are overall more robust and accommodating to a large number of nodes. While there is no single model that wins in every scenario (such as varying fMRI session length, inaccurate ROIs, addition of confounds, cyclic connections, etc.), partial correlation, which will be explained in detail, is a reasonable compromise that estimates graph structure well and can parsimoniously handle large dimensionality. Partial correlation is the degree of conditional dependence between two variables, with the effects from all other variables in the model removed. Consider a hub-and-spokes model (Figure 3.1), where A is an independent variable and all other nodes are positively generated from A with some random Gaussian noise, e.g., B = A + e. This correlates variables B, C, D, E to each other, indirectly via A. Estimating the full correlation matrix of this network yields matrix values all very close to 1, which does not highlight how A functions as the hub of this network. In contrast, B, C, D, E are pairwise conditionally independent of each other, which shows up as zero entries in the partial correlation matrix. The result is a sparse partial correlation matrix whose nonzero entries reveal that those variables are conditionally dependent on A. Formally, B and C are defined as conditionally independent given A if and only if P(B, C | A) = P(B | A) P(C | A). A Markov random field (MRF) is an undirected graphical model with conditional dependencies as edge weights, satisfying the Markov property, which says that any node X in the graph is conditionally independent of the rest of the nodes given its immediate neighbors.
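The hub-and-spokes example can be simulated to see the contrast between full and partial correlation; the sketch below is illustrative (the noise level and sample size are arbitrary choices), computing partial correlations by inverting the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000                        # large sample so the estimates are stable
A = rng.standard_normal(n)
noise = 0.1 * rng.standard_normal((n, 4))
# B, C, D, E are each generated from hub A plus small Gaussian noise
spokes = A[:, None] + noise
X = np.column_stack([A, spokes])   # columns: A, B, C, D, E

full_corr = np.corrcoef(X, rowvar=False)

P = np.linalg.inv(np.cov(X, rowvar=False))   # precision matrix
d = np.sqrt(np.diag(P))
partial_corr = -P / np.outer(d, d)           # rho_ij = -p_ij / sqrt(p_ii p_jj)
np.fill_diagonal(partial_corr, 1.0)

# Full correlation between spokes B and C is near 1; partial is near 0
print(full_corr[1, 2].round(2), partial_corr[1, 2].round(2))
```

The full correlation between spokes B and C is near 1 even though they interact only through A, while their partial correlation is near 0; only the hub edges survive in the partial correlation matrix.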
The structure learning problem for MRFs is to infer these edge weights from observed data. Under the assumption that the random variables come from a multivariate Gaussian distribution, the graph is called a Gaussian graphical model (GGM). The computation of partial correlation proceeds as follows: let X and Y be the variables of interest to partially correlate, and let Z be the remaining variables. Then X and Y are each regressed on Z, with coefficients

w*_X = arg min_w ∑_{i=1}^{N} (x_i − ⟨w, z_i⟩)²,
w*_Y = arg min_w ∑_{i=1}^{N} (y_i − ⟨w, z_i⟩)²,

and residuals

r_{X,i} = x_i − ⟨w*_X, z_i⟩,  r_{Y,i} = y_i − ⟨w*_Y, z_i⟩.

The partial correlation of X, Y is therefore

ρ̂_{XY·Z} = ( N ∑_{i=1}^{N} r_{X,i} r_{Y,i} − ∑_{i=1}^{N} r_{X,i} ∑_{i=1}^{N} r_{Y,i} ) / ( √(N ∑_{i=1}^{N} r_{X,i}² − (∑_{i=1}^{N} r_{X,i})²) √(N ∑_{i=1}^{N} r_{Y,i}² − (∑_{i=1}^{N} r_{Y,i})²) ).

A simpler way to compute pairwise partial correlations between all variables V in the model is to invert the sample covariance matrix to obtain the precision (a.k.a. inverse covariance) matrix P = S⁻¹ and normalize by the diagonal with a negative sign, such that

ρ_{X_i X_j · V∖{X_i, X_j}} = −p_ij / √(p_ii p_jj).

The above steps for partial correlation computation are simple, with a major caveat: the sample covariance matrix needs to be inverted. Sample covariance matrices are positive semidefinite. When the covariance matrix is close to singular, i.e., has eigenvalues close to zero, the inversion is numerically unstable. This happens when the number of data samples, n, is less than the number of nodes, d. To put this in the context of rsfMRI network estimation, suppose that the parcellated fMRI data, X, is multivariate Gaussian from d regions with n time points. If n is too small relative to d, the d × d sample covariance matrix would be difficult to invert.

3.2 Related work
3.2.1 Sparse network learning
In this chapter, a method to estimate a precision matrix, Ω, and subsequently partial correlations, is proposed.
The matrix Ω encodes the graph structure of a Gaussian graphical model [141], with the presence of an edge between two variables indicated by a nonzero entry in Ω. For many applications, especially those involving high-dimensional data, it is desirable to prevent overfitting by utilizing priors that favor parsimonious models. Such models should exhibit as few interactions between variables as possible, which in the GGM case corresponds to a sparse precision matrix. A recent approach to simultaneously learning the structure and estimating the parameters in GGMs is to augment the Gaussian likelihood with a penalty on the L1 norm of the precision matrix [14, 48, 61, 116, 138, 194]. This can also be thought of as a maximum a posteriori (MAP) estimation problem with a Laplace prior on the entries of the precision matrix. One challenge with these methods is that they require selection of a parameter that controls the amount of sparsity by weighting the penalty term. Friedman et al. [61] selected this parameter through cross-validation, by either minimizing prediction error or maximizing the likelihood of the left-out data. Foygel et al. [60] proposed an extended Bayesian information criterion (EBIC) for choosing the regularization parameter. Fan and Li [52] discussed the general setting of coefficient selection and estimation in statistical models using penalized likelihood functions. They pointed out that one issue with all Lp penalties is that the resulting estimates of large coefficients are shrunk, i.e., biased toward zero. They introduced a penalty function called the smoothly clipped absolute deviation (SCAD), which mitigates this bias for large estimates but still requires tuning of the weight on the penalty. Scaled LASSO from Sun and Zhang [163] estimated the noise level and regression coefficients to find an appropriate penalty level, but this again faced issues with the L1 penalty's bias.
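The shrinkage behavior of the L1 penalty and the SCAD alternative can be seen directly in their scalar thresholding rules; the sketch below is illustrative, using the a = 3.7 value suggested by Fan and Li (function names are hypothetical):

```python
import numpy as np

def soft_threshold(z, lam):
    """L1 (LASSO) rule: shrinks every coefficient by lam, biasing large ones."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD rule of Fan and Li: same shrinkage near zero, none for large |z|."""
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) <= 2 * lam,
                    np.sign(z) * np.maximum(np.abs(z) - lam, 0.0),
                    np.where(np.abs(z) <= a * lam,
                             ((a - 1) * z - np.sign(z) * a * lam) / (a - 2),
                             z))

lam = 1.0
print(soft_threshold(5.0, lam))   # large coefficient still shrunk by lam
print(scad_threshold(5.0, lam))   # large coefficient left unbiased
print(soft_threshold(0.5, lam), scad_threshold(0.5, lam))  # both zero out small ones
```

Both rules set small coefficients exactly to zero, but only the L1 rule continues to subtract lam from arbitrarily large coefficients, which is the bias discussed above.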
3.2.2 Sparsity in resting-state networks
In the synthetic rsfMRI data experiments from Smith et al. [154], using L1-regularized inverse covariance estimation recovered network structure better than unregularized partial correlation. Moreover, in practice, regularization is a necessity. Long rsfMRI sessions would be uncomfortable for participants, and shorter durations yield more "fluke" associations because of the fewer time points captured. Regularization makes the model less sensitive to noise. A number of works have proposed sparse regularization for the estimation of Gaussian graphical models applied to learning resting-state networks. Ng et al. [120] used a GGM to model the connectivity profiles for a whole dataset, with an L1 penalty for sparsity on the population-level model Ω from which individuals deviate. This idea of sparsity at a population level was revisited in a Bayesian setting by Colclough [33], who tested it with both a strongly-sparse spike-and-slab prior and a weakly-sparse normal-Cauchy prior on the off-diagonal elements. Wang et al. proposed CLIME [175], which estimated the network for the individual by placing an L1 penalty on Ω and an additional constraint to encourage positive-definiteness, such that ‖ΩS − I‖₂ < λ for some parameter λ. The drawbacks to this approach are that there are two parameters that need to be tuned, and that symmetry is not enforced in practice. In all of these methods, parameter selection is involved. Rather than using a penalty to induce sparsity, Nie et al. [121] proposed a method that avoids the selection of parameters by defining the association between nodes X1 and X2 to be the minimum of all partial correlations computed over subsets of nodes V∖{X1, X2}. This is helpful in identifying the presence or absence of an edge between nodes, but the drawback is that the weights of the association matrix are no longer meaningful.
3.3 Contribution
The network input for the proposed methods in previous chapters are correlation matrices computed from the time series of a set of ROIs. Weighted correlation networks are very dense representations. This chapter discusses an alternative via sparse precision matrices. The hierarchical Bayesian model for sparse precision matrix estimation in GGMs proposed in this chapter has several attractive features. First, it does not require any parameter tuning: the sparsity of the model is adapted automatically to the data. Sparsity is achieved by modeling the precision matrix with a hierarchical scale mixture of Gaussians prior with a parameter-free Jeffreys' hyperprior. Second, experimentation shows this hierarchical prior does not suffer from the bias problem inherent to L1 penalties. Third, the symmetry and positive-definiteness of the precision matrix are naturally enforced by utilizing a Cholesky decomposition. Most importantly, the model empirically shows superior performance to the widely used GLASSO [61] method, even with the optimal sparsity parameter chosen by cross-validation. Through a series of experiments, the performance of different estimators is tested on simulated data by measuring error to the correct solution, and on real data from a cell-signaling experiment by measuring the ability of the trained model to explain a left-out test set.

3.4 Methods
3.4.1 Bayesian inference
Before diving into the details of the method, a brief introduction to Bayesian statistical inference is presented. Bayes' theorem says

P(model | data) = P(data | model) P(model) / P(data),

where P(model | data) is called the posterior, P(data | model) is the likelihood of the observed data, P(model) is the prior of the model, and P(data) is the marginal likelihood. Bayesian methods aim to optimize and infer based on the posterior probability. In maximum a posteriori (MAP) estimation, the model that maximizes this posterior probability is selected.
This differs from frequentist methods, which are driven by the data likelihood alone. The restaurant rating problem is a classic illustration of the difference between Bayesian and frequentist statistics: suppose that a new restaurant has just opened and there is only a single 5-star review on Yelp. Based on this single review, the frequentist would immediately conclude that the restaurant is 5-star quality, whereas the Bayesian takes into account a prior notion that all restaurants are of middling quality (2.5 stars) unless more reviews show otherwise. As more reviews flow in, the Bayesian and frequentist opinions begin to converge. Therefore, the Bayesian perspective is attractive in the small sample size setting, which is common in medical imaging.

3.4.2 Adaptively sparse precision matrix estimation
Let x denote the n × d matrix of observed data, thought of as realizations of a multivariate Gaussian random variable X = (X_1, . . . , X_d) ∼ N(0, Ω⁻¹). The goal is to estimate the precision matrix Ω while controlling the complexity of the model. The typical approach for favoring sparse estimates is to apply an L1 penalty to Ω. This gives the estimation problem

Ω̂ = arg max_Ω  ln |Ω| − tr(ΩS) − ρ‖Ω‖₁,   (3.1)

where S is the sample covariance matrix of x, and ρ is a parameter that controls the amount of sparsity. Due to its computational efficiency, the most popular algorithm for solving this estimation problem is the graphical least absolute shrinkage and selection operator (GLASSO) method [61]. This is equivalent to a MAP estimate with a Laplace prior on the entries of Ω, that is, Ω_ij ∼ Laplace(0, λ) with λ = 2ρ/n, which gives the prior density

p(Ω | λ) ∝ exp(−λ‖Ω‖₁).

This section formulates a Bayesian hierarchical model that is designed to induce sparsity on Ω while removing the need for parameter tuning. The section begins with a discussion of a natural parameterization of the Gaussian likelihood using the Cholesky decomposition.
Next, a hierarchical prior applied to the entries of Ω is introduced, which extends the adaptive sparsity prior developed in [55] for sparse regression problems. Finally, a description of the MAP estimation procedure using the expectation maximization (EM) algorithm is presented, which has closed-form coordinate ascents for the maximization step.

3.4.3 Gaussian likelihood parameterized by the Cholesky decomposition
Denote the Cholesky decomposition of Ω as Ω = LLᵀ, where L is lower triangular. Using this decomposition, the multivariate Gaussian model can be naturally formulated in the form of a regression problem [141]. Define the coefficients β_ij = −L_ij/L_jj and the precision of X_j as ω_j = 1/σ_j² = L_jj². Then the lower triangular entries of Ω are given by

Ω_ij = ∑_{k=1}^{j} L_ik L_jk = ∑_{k=1}^{j} β_ik β_jk ω_k,  for i ≥ j.   (3.2)

Now the multivariate Gaussian model N(0, Ω⁻¹) is equivalent to the set of regression problems

X_j = μ_j + ∑_{i>j} β_ij (X_i − μ_i) + ε_j,  ε_j ∼ N(0, ω_j⁻¹),  j = 1, . . . , d.   (3.3)

The β_ij and ω_j coefficients parameterize the Gaussian precision matrix in the proposed model. This formulation naturally enforces the symmetry and positive-definiteness of the precision matrix. The matrix Ω = LLᵀ is clearly symmetric for any choice of coefficients. Also, since the ω_j are the squared diagonal entries of L, Ω being positive-definite is equivalent to the ω_j being strictly positive. This is satisfied for any reasonable prior that does not shrink the ω_j to zero, as demonstrated in a following example. A more practical form of the multivariate Gaussian likelihood utilizes the sufficient statistic, the sample covariance matrix S. Computations involving the sample covariance matrix are more efficient than those using the full data matrix (which is typically larger). The multivariate Gaussian log-likelihood is given by

ℓ(Ω | x) ∝ n ln |Ω| − n tr(ΩS).
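The Cholesky parameterization in (3.2) can be verified numerically; a small sketch (the random positive-definite matrix and seed are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
# A random positive-definite precision matrix
M = rng.standard_normal((d, d))
Omega = M @ M.T + d * np.eye(d)

L = np.linalg.cholesky(Omega)       # Omega = L L^T, L lower triangular
omega = np.diag(L) ** 2             # omega_j = L_jj^2
beta = -L / np.diag(L)              # beta_ij = -L_ij / L_jj (so beta_jj = -1)

# Reconstruct Omega_ij = sum_k beta_ik beta_jk omega_k
recon = (beta * omega) @ beta.T
assert np.allclose(recon, Omega)

# omega_j > 0 for any valid Cholesky factor, so Omega stays positive-definite
assert (omega > 0).all()
```

Note that flipping the sign convention of β leaves the reconstruction unchanged, since the β coefficients always enter in pairs.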
Note that the Gaussian likelihood written this way technically involves the biased sample covariance matrix, i.e., with a factor of (1/n). This form of the Gaussian log-likelihood rewrites in terms of the β and ω coefficients as

ℓ(β, ω | x) ∝ n ∑_{j=1}^{d} ln ω_j − n ∑_{j=1}^{d} ∑_{i=1}^{d} Ω_ij S_ij   (3.4)
= n ∑_{j=1}^{d} ln ω_j − n ∑_{i=1}^{d} ( 2 ∑_{j=1}^{i−1} S_ij ∑_{k=1}^{j} β_ik β_jk ω_k + S_ii ∑_{k=1}^{i} β_ik² ω_k ).   (3.5)

The two terms inside the parentheses of (3.5) correspond to the off-diagonal terms of Ω, which arise in pairs, and the diagonal terms, which appear just once.

3.4.4 Adaptive sparsity prior
As first described by Andrews and Mallows [10], the Laplace distribution is equivalent to the marginal distribution of a scale mixture of Gaussians with an exponential scale distribution. More specifically, let θ be a random variable with the hierarchical distribution

p(θ | τ) ∼ N(0, τ),  p(τ | γ) ∼ Exp(γ/2).

Then the marginal distribution of θ with respect to γ integrates to

p(θ | γ) = ∫₀^∞ p(θ | τ) p(τ | γ) dτ = (√γ / 2) exp(−√γ |θ|),

yielding the Laplace distribution with ρ = √γ/2. The level of sparsity in LASSO regression depends upon the parameter on the L1 penalty; in this hierarchical-Bayes context, sparsity is controlled by γ. Figueiredo [55] proposed to remove γ by replacing the exponential hyperprior on τ with a Jeffreys' noninformative hyperprior, i.e.,

p(τ) ∝ 1/τ.

This is equivalent to a log penalty on θ, which yields sparse solutions and is nearly unbiased for large coefficients. Although this hyperprior is improper, it has the advantages that it is scale-invariant and parameter-free. This modified hierarchical model is no longer equivalent to the Laplace prior, but has been shown in [55] to be effective in regression and classification problems. The same hierarchical prior has also been used by Park and Casella [127] in a Bayesian formulation of the LASSO model for regression.
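The scale-mixture identity can be checked by Monte Carlo; the sketch below (the values of γ and θ are arbitrary) compares a sampled estimate of the marginal against the closed-form Laplace density:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma = 4.0
theta = 1.3   # point at which to evaluate the marginal density

# Monte Carlo estimate of p(theta | gamma) = E_tau[ N(theta | 0, tau) ],
# where tau ~ Exp(gamma / 2), i.e., rate gamma/2 and mean 2/gamma
tau = rng.exponential(scale=2.0 / gamma, size=2_000_000)
marginal = np.mean(np.exp(-theta**2 / (2 * tau)) / np.sqrt(2 * np.pi * tau))

# Closed form from the text: (sqrt(gamma) / 2) * exp(-sqrt(gamma) * |theta|)
laplace = np.sqrt(gamma) / 2 * np.exp(-np.sqrt(gamma) * abs(theta))
print(round(marginal, 3), round(laplace, 3))   # agree to Monte Carlo error
```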
Adopting this prior for the Gaussian precision matrix estimation problem results in the following hierarchical model for an adaptively sparse estimation of Ω:

X ∼ N(0, Ω⁻¹),
Ω_ij ∼ N(0, τ_ij),  for i > j,   (3.6)
p(τ_ij) ∝ 1/τ_ij.

Notice that the prior is only on the off-diagonal entries of Ω. In other words, shrinkage is not applied to the diagonal elements, although the model could easily be changed to include this. In what follows, the τ_ij are treated as latent variables that are integrated out. This leaves a posterior p(Ω | x) that does not have any parameters that need to be tuned, which is a main advantage over the Laplace prior. Also, the following simple example demonstrates that the hierarchical model in (3.6) does not suffer from the bias problem for large coefficients that plagues the Laplace prior. Note that this approximate unbiasedness of the MAP estimate comes at the cost of a nonconvex posterior function. As Fan and Li [52] show, convex penalty functions cannot simultaneously satisfy the properties of sparsity, continuity, and approximate unbiasedness. Consider a 2 × 2 sample covariance matrix, S, whose inverse is

S⁻¹ = ( 1  s
        s  1 ),

where s is a scalar parameter satisfying |s| < 1 to ensure positive-definiteness. Then the maximum-likelihood estimate (MLE) of Ω is just given by the inverse of S. Now consider the L1-penalized Gaussian likelihood model in (3.1), but with the L1 penalty only on the off-diagonal element, Ω₁₂ = Ω₂₁. Because the diagonal entries of Ω will remain equal, Ω can be parameterized with two unknown coefficients a = Ω₁₁ = Ω₂₂ and b = Ω₁₂ = Ω₂₁. Then the estimate of Ω under the L1-penalized likelihood is

Ω̂ = ( â  b̂
      b̂  â ) = arg max_{(a,b)}  n ln(a² − b²) − 2n(a S₁₁ + b S₁₂) − ρ|b|.   (3.7)

The solution to this problem for the two unique entries a, b of Ω is shown in Figure 3.2 for varying values of the parameter s in the sample covariance and for a sample size of n = 10.
This is compared to the MLE solution, which is â = 1 and b̂ = s, and to the MAP estimate of the posterior under the proposed model (3.6). The proposed model estimate was computed using the EM algorithm described in the next subsection. There are two important features to notice in these plots. First, the L1 solution for the off-diagonal b entry gives the familiar "soft-threshold" rule, which forces the estimate to zero below some threshold but is biased toward zero by an additive constant for larger coefficient values. Second, the diagonal a entry is nearly perfect (error within 10⁻⁴) for the proposed MAP estimate, while the a estimate for the L1 penalty is biased downward as s moves away from 0. This is the case even though there is no L1 penalty directly on the a entry; it arises from the interaction between a and b in the log-determinant term in (3.7).

3.4.5 Expectation maximization algorithm
The optimization uses an EM algorithm to estimate a sparse Ω as the maximizer of the posterior defined in (3.6). Although the prior on the precision matrix in (3.6) is written in terms of the entries Ω_ij, it is convenient to rewrite it in terms of the β_ij and ω_j coefficients discussed in Section 3.4.3. Relationship (3.2) switches between the two parameterizations of Ω.
• E step: EM iteratively maximizes the Q function, which is a lower bound on the log-posterior, ln p(β, ω | x). The Q function is defined as the expectation over the latent variables τ_ij of the complete log-posterior, given the current estimate of the parameters at iteration t, denoted by β̂⁽ᵗ⁾, ω̂⁽ᵗ⁾. Hence,

Q(β, ω | β̂⁽ᵗ⁾, ω̂⁽ᵗ⁾) = ∑_{i=1}^{d} ∑_{j=1}^{i−1} ∫ ln p(β, ω | x, τ_ij) p(τ_ij | x, β̂⁽ᵗ⁾, ω̂⁽ᵗ⁾) dτ_ij.

The complete log-posterior inside the integral above splits into the sum of the log-likelihood, which does not depend on τ, and the log-prior, which does.
Then the Q function breaks into two terms,

Q(β, ω | β̂⁽ᵗ⁾, ω̂⁽ᵗ⁾) = ℓ(β, ω | x) + E(β, ω),   (3.8)

where ℓ(β, ω | x) is the log-likelihood given by (3.4), and E(β, ω) corresponds to the integral of the log-prior term. It is given by

E(β, ω) = ∑_{i=1}^{d} ∑_{j=1}^{i−1} ∫ ln p(β, ω | τ_ij) p(τ_ij | x, β̂⁽ᵗ⁾, ω̂⁽ᵗ⁾) dτ_ij
= ∑_{i=1}^{d} ∑_{j=1}^{i−1} ∫ ln p(Ω_ij | τ_ij) p(τ_ij | x, Ω̂_ij⁽ᵗ⁾) dτ_ij
= −∑_{i=1}^{d} ∑_{j=1}^{i−1} Ω̂_ij⁽ᵗ⁾⁻² Ω_ij²
= −∑_{i=1}^{d} ∑_{j=1}^{i−1} Ω̂_ij⁽ᵗ⁾⁻² ( ∑_{k=1}^{j} β_ik β_jk ω_k )².

Here, (3.2) is used in the last line to write the final expression in terms of the β and ω coefficients. The evaluation of the integral above follows the same derivation as in [55]: the posterior expectation of 1/τ_ij is 1/Ω̂_ij⁽ᵗ⁾².
• M step: The maximization step cannot be done in closed form for the entire set of β and ω coefficients at once. However, the maximization of Q along a single coordinate at a time, i.e., a single β_ij or ω_j, while keeping the other coordinates fixed, has a closed form. Taking the derivative of the Q function (3.8) with respect to β_ab, the derivative of the log-likelihood term is

dℓ(β, ω | x)/dβ_ab = −2n ( ∑_{j≠a} S_aj β_jb ω_b + S_aa β_ab ω_b ).   (3.9)

Next, the derivative of the E term in the Q function is

dE(β, ω)/dβ_ab = −2 ∑_{j=b}^{a−1} (β_jb ω_b / |Ω̂_aj⁽ᵗ⁾|²) ∑_{k=1}^{j} β_ak β_jk ω_k − 2 ∑_{i=a+1}^{d} (β_ib ω_b / |Ω̂_ia⁽ᵗ⁾|²) ∑_{k=1}^{a} β_ik β_ak ω_k.   (3.10)

Both equations (3.9) and (3.10) are linear in β_ab. So, maximization of the Q function with respect to β_ab is equivalent to solving a linear equation k₁ β_ab + k₀ = 0, that is, setting β_ab = −k₀/k₁, with

k₀ = −2n ∑_{j≠a} S_aj β_jb ω_b − 2 ∑_{j=b}^{a−1} (β_jb ω_b / |Ω̂_aj⁽ᵗ⁾|²) ∑_{k=1, k≠b}^{j} β_ak β_jk ω_k − 2 ∑_{i=a+1}^{d} (β_ib ω_b / |Ω̂_ia⁽ᵗ⁾|²) ∑_{k=1, k≠b}^{a} β_ik β_ak ω_k,   (3.11)

k₁ = −2n S_aa ω_b − 2ω_b² ( ∑_{j=b}^{a−1} β_jb² / |Ω̂_aj⁽ᵗ⁾|² + ∑_{i=a+1}^{d} β_ib² / |Ω̂_ia⁽ᵗ⁾|² ).   (3.12)
Next, the derivative of Q with respect to ω_m is

dQ/dω_m = n/ω_m − n ∑_{i=1}^{d} ( 2 ∑_{j=1}^{i−1} S_ij β_im β_jm + S_ii β_im² ) − ∑_{j=m}^{d−1} ∑_{i=j+1}^{d} (2β_im β_jm / |Ω̂_ij⁽ᵗ⁾|²) ∑_{k=1}^{j} β_ik β_jk ω_k.

Setting this to zero and multiplying through by ω_m, it becomes a quadratic in ω_m. So, the maximum of Q with respect to ω_m is a solution to the equation c₂ω_m² + c₁ω_m + c₀ = 0, with

c₀ = n,   (3.13)

c₁ = −n ∑_{i=1}^{d} ( 2 ∑_{j=1}^{i−1} S_ij β_im β_jm + S_ii β_im² ) − ∑_{j=m}^{d−1} ∑_{i=j+1}^{d} (2β_im β_jm / |Ω̂_ij⁽ᵗ⁾|²) ∑_{k=1, k≠m}^{j} β_ik β_jk ω_k,   (3.14)

c₂ = −∑_{j=m}^{d−1} ∑_{i=j+1}^{d} 2β_im² β_jm² / |Ω̂_ij⁽ᵗ⁾|².   (3.15)

Algorithm 1 summarizes all steps. Notice that c₂ is always negative, and thus the discriminant of the quadratic formula, c₁² − 4c₀c₂, is always positive, and both solutions are real. Also, the fact that c₀ is always nonzero ensures that ω_m = 0 is never a solution, and therefore the estimate Ω̂ is guaranteed to remain strictly positive-definite. An important implementation detail is that many of the entries of Ω̂⁽ᵗ⁾ will be driven to zero, but the above calculations divide by them. A similar issue arises in quadratic approximations of penalized likelihoods. Following the approach in [85], a small epsilon is added to the denominator to ensure numerical stability. This gives the following fast coordinate ascent algorithm, which iterates over each coordinate and updates them one at a time using the above equations (linear in β_ab and quadratic in ω_m).

Algorithm 1: Expectation Maximization for Sparse Gaussian Estimation
Input: Sample covariance matrix S
1. Initialize β̂, ω̂ to the MLE, given by the solutions to the d regression problems in (3.3)
2. Use (3.2) to initialize Ω̂⁽⁰⁾
3. Until convergence, i.e., until the gradient is zero:
(a) Set each β̂_ab = −k₀/k₁, using (3.11), (3.12)
(b) Set each ω̂_m = (−c₁ − √(c₁² − 4c₂c₀)) / (2c₂), using (3.13), (3.14), (3.15)
(c) Update Ω̂⁽ᵗ⁾, using (3.2)

3.5 Results
3.5.1 Simulated data
For synthetic experiments, data are generated from a range of precision matrices of varying size and sparsity levels, and results from the proposed method are compared with those of ordinary least squares (OLS), i.e., the Gaussian MLE, and GLASSO. The sizes of the precision matrices are d² = 10², 20², and 40². Precision matrices composed from lower triangular matrices with 5, 10, and 20% nonzero entries in the off-diagonals are tested. The diagonal entries are Gaussian distributed with μ_diag = 1 and σ_diag = 0.1, and the nonzero off-diagonal entries are Gaussian distributed with μ_offdiag = 0 and σ_offdiag = 1. This construction ensures that the precision matrices are positive definite. For each of the nine scenarios, twenty different sets of data are generated, each having 1.5d² sample points. A first pass of the algorithm is initialized from the OLS solution with an epsilon value of 1e-3 and a stopping criterion of when the maximum change in the estimated entries is less than 1e-7. The algorithm is then repeated, initializing from the output of the previous step with epsilon values of 1e-5 and 1e-7. This three-step refinement toward lower epsilons while maintaining numerical stability serves to avoid local minima and confirms that the choice of epsilon does not affect the solution. The MAP estimate should correctly set sparse entries exactly to zero, as shown in Figure 3.2 (right plot). However, because the MAP is optimized with a gradient-based method, some zero entries may not be numerically exactly zero.
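The synthetic precision matrices described above can be generated, for example, by sampling a sparse lower-triangular factor; this sketch is one plausible realization of the stated construction (the helper name and seed are illustrative):

```python
import numpy as np

def make_sparse_precision(d, frac_nonzero, rng):
    """Build Omega = L L^T from a sparse lower-triangular L, matching the
    stated setup: diagonal ~ N(1, 0.1), nonzero off-diagonals ~ N(0, 1)."""
    L = np.zeros((d, d))
    np.fill_diagonal(L, rng.normal(1.0, 0.1, size=d))
    iu = np.tril_indices(d, k=-1)
    mask = rng.random(len(iu[0])) < frac_nonzero         # sparse support
    L[iu[0][mask], iu[1][mask]] = rng.normal(0.0, 1.0, size=mask.sum())
    return L @ L.T

rng = np.random.default_rng(4)
d = 10
Omega = make_sparse_precision(d, 0.05, rng)

# Positive-definite by construction, so Gaussian data can be sampled from it
eigvals = np.linalg.eigvalsh(Omega)
X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Omega), size=int(1.5 * d * d))
print(eigvals.min() > 0, X.shape)
```

Positive-definiteness follows because the triangular factor has a (numerically almost surely) nonzero diagonal, mirroring the ω_j > 0 argument from Section 3.4.3.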
To differentiate between truly nonzero entries and zero entries that are numerically off, the posterior is maximized with respect to a numerical zero threshold value (i.e., a value below which all entries are set to zero). Note that this is simply part of the optimization process, and that the MAP estimates do truly produce sparse solutions. In all experiments initializing from the OLS solution, the EM algorithm converged to a local maximum that was better than the GLASSO and OLS solutions. Although not a guarantee of a global maximum, the optimization does achieve reasonable answers locally. All computations are done in R. The GLASSO models we compare against are selected by tenfold cross-validation for the regularization parameter ρ that maximizes likelihood. Friedman et al. [61] used the list of ten ρ values suggested by the glassopath() function in the GLASSO package for their cross-validation. Because empirically the optimal ρ sometimes lies below this range, the range of glassopath() values is expanded with ten more equally spaced ρ values, starting from 1e-4 up to the minimum glassopath() value, to make sure that the optimal ρ is within range. The first metric used to rate the performance of our proposed method is the Frobenius norm of the difference of the estimates from the true Ω, i.e., the root mean squared error, ‖Ω − Ω̂‖_F. This error is also split to see how much of it results from entries that should be zero versus nonzero in the true Ω. With the threshold that maximizes the MAP estimate, we then calculate the percentage of zero entries in the true solution that our method identifies correctly. The results are averaged over the twenty different trials for each scenario and are summarized in Table 3.1. From the results, we can see that our method has the best performance in terms of errors and zero prediction in all scenarios, whereas GLASSO tended to fail, especially on denser and larger matrices.
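The evaluation metrics just described can be sketched as follows (a toy example; the helper name and matrices are illustrative):

```python
import numpy as np

def evaluation_metrics(Omega_true, Omega_hat, tol=1e-7):
    """Frobenius error, split over true-zero and true-nonzero entries,
    plus the fraction of true zeros recovered as zeros."""
    zero_mask = Omega_true == 0
    diff = Omega_true - Omega_hat
    return {
        "frobenius": np.linalg.norm(diff),
        "zero_error": np.linalg.norm(diff[zero_mask]),
        "nonzero_error": np.linalg.norm(diff[~zero_mask]),
        "zero_accuracy": np.mean(np.abs(Omega_hat[zero_mask]) < tol),
    }

Omega_true = np.array([[2.0, 0.5, 0.0],
                       [0.5, 2.0, 0.0],
                       [0.0, 0.0, 2.0]])
Omega_hat = np.array([[2.1, 0.4, 0.0],
                      [0.4, 1.9, 0.1],
                      [0.0, 0.1, 2.0]])
m = evaluation_metrics(Omega_true, Omega_hat)
print(m["zero_accuracy"])   # 2 of 4 true zeros recovered: 0.5
```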
The bias from the Laplace prior in GLASSO shrinks entries that should be nonzero, as described in Section 3.2.1, whereas our method mitigates this bias for large entries. This can be seen in the comparison of the nonzero errors, where in some cases of the experiment GLASSO has a higher nonzero error than the OLS estimate. For the error and prediction accuracy of zero entries, our method performs the best by far. It is known that cross-validation parameter selection for ρ in GLASSO is overly conservative, i.e., it does not strongly induce sparsity, which can be seen in GLASSO's poor performance on zero prediction and zero-entry error. Another reason for GLASSO's poor performance is that, for even moderately dense matrices, cross-validation yields ρ values very close to 0 and thus gives results near those of OLS, as can be observed in the similar zero error rates between OLS and GLASSO for the denser matrices. Because of our hierarchical prior that adaptively induces sparsity, our method is able to consistently predict the zero entries in Ω with mid-90% accuracy across the various sparsity scenarios. In the nonzero accuracy numbers, GLASSO is consistently close to 100%, but this is because its overly conservative sparsity finds far too many entries to be nonzero. We note that a large majority of the entries in the matrices are supposed to be zero, so the zero accuracy contributes more to the overall accuracy. Figure 3.3 shows that our method yields better results than cross-validating GLASSO.

3.5.2 Cell signaling data
To show how our method works on real-world data, we run our algorithm on the protein signaling dataset from [145]. This is the same dataset analyzed in [61]. The d = 11 protein dataset has n = 7466 samples. In both GLASSO and our proposed method, the likelihood term dominates the prior because of the large sample size.
The result is that the optimal ρ for running GLASSO on all of the data is 0, which is consistent with what Friedman et al. [61] have reported. Therefore, OLS, GLASSO, and our proposed method produce almost identical results. The estimated graph presented in [145] is conventionally accepted by biology experts. A more interesting experiment is to train on smaller subsamples of the data and test whether GLASSO and our proposed method would yield estimates similar to what biologists expect, while keeping in mind that Sachs' directed graph with only binary values may not be an exact ground truth for direct comparison. Sachs' graph contains 43 edges and 78 pairs of nodes without edges. Both GLASSO and the proposed method are trained on samples of 200 points. Averaging over 100 runs, we calculate the percentage of zero and nonzero entries in agreement between Sachs' model and both GLASSO and our proposed model. For our method, we repeated the same three-step epsilon refinement procedure that we used for the synthetic data experiments, with epsilon down to 1e-9, because the entries in the precision matrix of the cell signaling data are much smaller, by a factor of 1/1000. For GLASSO, we used leave-one-out cross-validation (LOOCV) on the subsample over the range of ρ from glassopath, augmented with ten evenly spaced values from 1e-4 to the minimum glassopath value, to find the optimal ρ. Also, as we did with the simulated data experiment, to account for numerical error we applied a threshold of delta according to the maximum posterior. We also tried this for GLASSO, but GLASSO's optimization attained its maximum posterior with no thresholding necessary. We note that the results from GLASSO are not symmetric positive definite. The average relative difference between corresponding off-diagonal entries Ω̂_ij and Ω̂_ji is 19%, calculated by ‖Ω̂ − Ω̂ᵀ‖_F / ‖Ω̂ + Ω̂ᵀ‖_F.
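The asymmetry measure at the end of the paragraph can be sketched directly; this is our reading of the (garbled in extraction) formula, namely the ratio of Frobenius norms of the antisymmetric and symmetric parts:

```python
import numpy as np

def asymmetry(omega_hat):
    """Relative difference between an estimate and its transpose:
    ||Omega_hat - Omega_hat^T||_F / ||Omega_hat + Omega_hat^T||_F."""
    return np.linalg.norm(omega_hat - omega_hat.T) / np.linalg.norm(omega_hat + omega_hat.T)
```

A symmetric estimate scores exactly 0, so the 19% figure quantifies how far GLASSO's output is from symmetry.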
For calculating the prediction accuracies relative to Sachs' model, we convert the MAP estimates from GLASSO and our method into binary matrices where all nonzero entries are coded as ones. The prediction accuracies of the zero and nonzero entries for both methods are displayed in Table 3.2, along with the combined prediction accuracies for all entries of the precision matrix. The results are consistent with the accuracy rates from the synthetic experiments. Our proposed method performs much better than GLASSO on the zero prediction but slightly worse on the nonzeros. We observe that for GLASSO the cross-validation can at times choose a model with a ρ very close to zero, resulting in a precision matrix that is not sparse. Overall, our proposed method gives estimates with higher similarity to the conventionally accepted graph from [145]. Because of the large sample size in the full set of data, the OLS precision matrix based on the entire dataset is a good approximation to the true precision matrix. The fit of the models to the left-out test data is measured by the Gaussian log-likelihood, ℓ(Ω̂ | x_test). Another experiment is to calculate, as a measure of the error rate, the Frobenius norms of the differences between the OLS precision matrix based on the entire dataset and the estimated models trained from subsampling via GLASSO and our method. Again, we used the three-step epsilons down to 1e-9 for our method and LOOCV for GLASSO over the same range of ρ as in the previous cell data experiment. Results are averaged over twenty trials. Table 3.3 shows these results. Compared to OLS and GLASSO, our method has a higher likelihood and a lower norm difference from the OLS precision matrix computed from all of the data.

3.6 Application to resting-state functional network estimation

We apply the adaptive sparse precision matrix estimation method to the resting-state fMRI data from the ABIDE autism dataset [42] to estimate partial correlation matrices.
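The held-out Gaussian log-likelihood ℓ(Ω̂ | x_test) used above can be written down explicitly. A minimal sketch, assuming zero-mean data and a positive-definite estimate (the function name is ours):

```python
import numpy as np

def gaussian_loglik(omega_hat, x_test):
    """Log-likelihood of held-out data under a zero-mean Gaussian with
    precision matrix omega_hat:
    l = (n/2) * [log det(Omega) - tr(S Omega) - d*log(2*pi)],
    where S is the empirical covariance of x_test."""
    n, d = x_test.shape
    s = x_test.T @ x_test / n
    sign, logdet = np.linalg.slogdet(omega_hat)      # assumes omega_hat is PD
    return 0.5 * n * (logdet - np.trace(s @ omega_hat) - d * np.log(2 * np.pi))
```

Higher values indicate a better fit to the left-out test data, which is how the models in Table 3.3 are compared.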
While it is useful as a method to model an individual's sparse brain network, unfortunately the estimated brain network features do not improve classification or regression compared to Pearson correlation matrices. We acquired the data from the Preprocessed Connectomes Project, preprocessed under the CPAC pipeline and parcellated according to the Harvard-Oxford atlas (d = 111). For the regression task, the goal is to linearly regress the estimated partial correlation matrices to predict the Autism Diagnostic Observation Schedule (ADOS) severity scores. Because many of the sites are missing ADOS information for some of their subjects, the analysis is limited to the Utah site, which has the largest number of subjects with ADOS. In this data there are 62 subjects with scores ranging from 0 to 21. Scores below 10 are considered typically developing. The prediction model is a partial least squares regression (PLS) [180] trained in a leave-one-out scheme to predict the ADOS of the left-out subject, and then we compute an RMSE after iterating over all subjects in the Utah site. In comparison to the baseline of estimating from raw correlation matrices, which has an RMSE of 6.17, the RMSE using partial correlation is worse at 6.33. We also look at the PLS regression of log-Euclidean transformed correlation and partial correlation matrices (see Chapter 5 and [181]). The RMSE of the latter partial correlation matrices is 6.01, worse than its correlation matrix counterpart with an RMSE of 5.42 [181]. Similarly, in a classification task of autism versus control applied to the full Preprocessed ABIDE dataset with all sites (468 controls and 403 ASD subjects), using the raw partial correlation input gets an accuracy rate of 0.55. This is close to chance, and lower than the 0.65 accuracy of the baseline with raw correlation matrices. Inverting the partial correlation matrices back to correlation matrices also gives the same low classification rate of 0.55.
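The leave-one-out evaluation loop used throughout this section can be sketched generically. This is a minimal skeleton with a pluggable predictor, not the actual PLS model; as a stand-in we show the mean-of-training-scores baseline (any fitted model can be substituted for `predict`):

```python
import numpy as np

def loocv_rmse(X, y, predict):
    """Leave-one-out cross-validation: for each subject, train on the other
    n-1 subjects and predict the held-out subject's score; return the RMSE."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i
        preds[i] = predict(X[train], y[train], X[i])
    return float(np.sqrt(np.mean((preds - y) ** 2)))

# Baseline predictor: the mean score of the training subjects.
def mean_baseline(X_tr, y_tr, x_new):
    return y_tr.mean()
```

Plugging a trained PLS regressor into `predict` reproduces the evaluation scheme described above.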
This suggests that imposing sparsity on individual connectivity matrices takes away statistical power to detect group differences. Later works [101, 136] agree that correlation has higher power than sparse partial correlation features for distinguishing groups. The adaptive sparse precision matrix estimation identifies a sparse set of edges for a single connectome. But when comparing many networks together, there is too much variability in which edges are sparse between subjects, which negatively impacts performance.

3.7 Discussion

In conclusion, this chapter has introduced a method for estimating sparse precision and partial correlation matrices from multivariate normal data. These matrices describe the network structure of Gaussian graphical models, where each edge in the graph represents the conditional dependence between variables. Useful for estimating a single sparse graph, this model has been successfully applied to synthetic data and to mapping cell-signaling interactions. However, in connectomics, where each individual subject has a network and we are trying to compare multiple networks together, the model does not enforce sparsity consistently across subjects, which loses predictive power in a group analysis. Despite partial correlation estimating networks well overall in the synthetic experiments from [154], partial correlation is more sensitive than correlation to the effects of poor ROI definition. It may be that another parcellation would be better suited for group analysis using partial correlation. Therefore, in the following chapters, the typical dense Pearson correlation matrix will be used as the input in the first step.

Figure 3.1. Correlation vs. partial correlation. Left: Hub and spokes model. Bold lines are positive partial correlation connections, thin lines are edges from full correlation. Center: Full correlation matrix. Right: The lower half of the partial correlation matrix.
Figure 3.2. Sparsity by hierarchical prior. Comparison of estimates corresponding to Gaussian likelihood (dotted line), Laplace prior (dashed line), and proposed hierarchical prior (solid line) for the 2 × 2 matrix example. Estimates of both the diagonal entry (left) and off-diagonal entry (right) are shown.

Figure 3.3. ROC for GLASSO and proposed method on 20 random synthetic datasets generated from a 40 × 40 matrix. Correctly detected sparse entries are considered true positives. In all 20 datasets, our method adaptively determined results (red dots) more optimal than cross-validating GLASSO over a range of ρ values (curves).

Table 3.1. Results on simulated data.

FROBENIUS NORM OF ERRORS
Matrix size         10x10                  20x20                  40x40
% L nonzero     5%     10%    20%      5%     10%    20%      5%     10%    20%
OLS           1.008  1.502  2.569    1.387  2.031  2.788    1.335  3.063  4.367
GLASSO        0.689  2.176  2.501    1.370  1.978  2.702    1.308  2.901  4.084
Proposed      0.420  0.942  1.614    0.529  1.010  1.628    0.457  1.538  3.395

NONZERO ERROR
Matrix size         10x10                  20x20                  40x40
% L nonzero     5%     10%    20%      5%     10%    20%      5%     10%    20%
OLS           0.142  0.211  0.341    0.109  0.173  0.178    0.054  0.097  0.129
GLASSO        0.163  0.418  0.330    0.106  0.168  0.172    0.052  0.091  0.121
Proposed      0.112  0.184  0.242    0.084  0.116  0.121    0.040  0.071  0.117

ZERO ERROR
Matrix size         10x10                  20x20                  40x40
% L nonzero     5%     10%    20%      5%     10%    20%      5%     10%    20%
OLS           0.092  0.117  0.158    0.063  0.076  0.096    0.031  0.067  0.082
GLASSO        0.031  0.044  0.156    0.063  0.075  0.094    0.030  0.063  0.076
Proposed      0.010  0.010  0.010    0.002  0.012  0.015    0.000  0.005  0.014

% NONZERO PREDICTION ACCURACY
Matrix size         10x10                  20x20                  40x40
% L nonzero     5%     10%    20%      5%     10%    20%      5%     10%    20%
OLS           100.0  100.0  100.0    100.0  100.0  100.0    100.0  100.0  99.99
GLASSO        100.0  99.62  100.0    100.0  100.0  99.94    100.0  99.78  99.98
Proposed      100.0  79.23  83.18    89.74  91.94  78.20    93.69  85.46  87.41
% ZERO PREDICTION ACCURACY
Matrix size         10x10                  20x20                  40x40
% L nonzero     5%     10%    20%      5%     10%    20%      5%     10%    20%
OLS            0.12   0.14   0.00     0.25   0.18   0.14     0.35   0.12   0.17
GLASSO        57.56  44.86   0.27     0.46   0.43   0.43     0.85   1.18   1.33
Proposed      100.0  99.59  98.75    99.59  96.77  95.27    99.99  96.82  94.29

Table 3.2. Results on zero and nonzero entry prediction of cell signaling data.

            % Zero        % Nonzero     % Overall
            Prediction    Prediction    Prediction
            Accuracy      Accuracy      Accuracy
GLASSO      34.576%       77.418%       49.800%
Proposed    71.512%       68.744%       70.527%

Table 3.3. Likelihood test from cell signaling data.

                            OLS           GLASSO        Proposed
Likelihood                  -4.47319e5    -4.48852e5    -4.45147e5
Frobenius Norm of Error     5.60782e-3    5.12098e-3    5.06741e-3

CHAPTER 4
KERNEL PARTIAL LEAST SQUARES REGRESSION OF NETWORKS

This chapter introduces regression analysis for network-valued data using topological features. There is a growing demand for methods to analyze multiple networks, driven by large-scale data collection. Although the application presented in this chapter is on brain connectivity in autism spectrum disorder, the methods developed here can also be used with other types of biological networks, e.g., for understanding the relationship between gene networks from microarray data and Alzheimer's disease pathogenesis. In the field of connectomics, understanding the communication network between functional regions of the brain is a vital goal towards uncovering the biological mechanisms behind several diseases and neuropsychiatric disorders. This is particularly important in autism spectrum disorder (ASD), which converging evidence has shown to be characterized by abnormal functional and structural connectivity. The development of imaging biomarkers that correlate with specific behavioral symptoms in a population would be beneficial for early diagnosis or for tracking treatment efficacy.
Achieving this requires new methods for extracting relevant network representations from neuroimaging, and statistical models for regressing this high-dimensional and non-Euclidean information against behavioral measures. A primary goal of this chapter is to show that higher-order non-Euclidean connectivity features are valuable in rsfMRI analysis. The work in this chapter was published in Proceedings of the IEEE International Symposium on Biomedical Imaging (2016).¹

¹© 2016 IEEE. Reprinted, with permission, from E. Wong, S. Palande, B. Wang, B. Zielinski, J. Anderson, and P. T. Fletcher, “Kernel partial least squares regression for relating functional brain network topology to clinical measures of behavior,” in Biomedical Imaging (ISBI), April 2016.

4.1 Introduction

Analysis of brain networks begins with a matrix of pairwise associations between brain regions, e.g., correlations between time series in resting-state functional MRI (rs-fMRI). Such an association matrix is often directly used elementwise as the representation of the network in a regression analysis, hypothesis test, or classification. However, the success of such analyses depends on the choice of distance function between networks. Graph-theoretic methods [23] have emerged as a powerful means to describe the organization of brain networks. Graph-theoretic measures (e.g., small-worldness, modularity; refer to Chapter 2) have shown promising ability to explain impairments in brain networks characteristic of neuropsychiatric disorders such as ASD. Graph-theoretic measures require that the association between nodes be thresholded to a binary value (connection or no connection), which loses information about the strength or certainty of the association. On the other hand, while the raw association matrix retains all information about the association strength between nodes, an element-wise representation does not encode higher-order information about the topology of the network, e.g., relationships between columns and rows.
To address these points, and because ASD is an umbrella term for a spectrum of disorders with varying degrees of severity, this chapter presents a topological data analysis (TDA) regression strategy to find out whether observed continuous behavioral information is captured by network topology. Specifically, the proposed method uses persistent homology features, explained in the following sections, because they express the association network at all levels of thresholding. These topological features are recorded in a persistence barcode, which is not a Euclidean vector, making direct use of linear statistical models difficult. However, it is possible to define an inner product between two persistence barcodes, and to use kernel partial least squares (kPLS) for regressing brain network topological features against behavioral scores.

4.2 Related work

Traditionally, graph-theoretical metrics are used to describe network topology. A number of works have shown supporting evidence that neurological diseases are characterized by abnormality in functional connectivity in terms of graph metrics. By looking at modularity and clustering, Rudie et al. [139] found that individuals with ASD exhibit weaker than normal connectivity within functional subnetworks and stronger connectivity between networks. Many works [19, 35, 62, 139] agree that within a system, long-range (e.g., anterior-posterior) connectivity is reduced, but there are differing conclusions on whether short-range connectivity is increased or not. Since these graph-theoretical metrics require an initial thresholding to derive an unweighted sparse graph, the choice of thresholding may be responsible for the discrepancy between these results. Persistent homology, which works at all scales without having to select a thresholding parameter, is becoming an emerging tool in studying complex networks [78] and, in particular, brain networks [26, 98, 103, 104, 123].
In [103, 104], networks are derived from FDG-PET images from the phenotypic groups of ADHD, ASD, and control subjects. The authors show that the disease groups significantly differ from the control subjects based on the associated persistent homology features represented as single-linkage dendrograms. They conclude that the ASD group exhibits local over-connectivity and global under-connectivity in comparison to control. This method is repeated in Khalid et al. [98] to identify functional connectivity differences between mouse model groups imaged with EEG. The ASD and control groups have also shown statistically significant structural differences using persistent homology features from structural covariance MRI [123].

4.3 Contribution

In the previous works, persistent homology features have not been used for diagnosis via learning tasks such as classification or regression. Moreover, rsfMRI signals are much noisier compared to the modalities in the aforementioned works. Cassidy et al. [26] have proposed using rsfMRI persistent homology features, although not for the purpose of analyzing pathology. This chapter applies TDA to resting-state fMRI analysis for regression against autism severity by linearly combining 0- and 1-dimensional topological feature kernels. To our knowledge, this work is the first to use persistent homology and kernel regression on brain networks. Using a combination of kernels breaks down how each type of persistent homology feature contributes to predictive power, and the results show that the higher-order features do capture relevant phenotypic information. This work is the result of collaboration with Sourabh Palande, Bei Wang, Brandon Zielinski, and Jeffrey Anderson. My main contributions to this work include the application and cross-validation of kernel partial least squares.
4.4 Topological data analysis background

Topology, in mathematics, is the study of geometric properties that are preserved under continuous deformations, e.g., stretching and bending. In biological studies, where data is often high-dimensional, noisy, and incomplete, it is useful to first get a qualitative understanding of the data that is agnostic to the choice of metrics, parameters, and coordinate systems. Whereas quantitative methods focus on features that are sensitive to the chosen distance function, e.g., curvature, TDA extends concepts from topology to qualitatively describe the intrinsic properties of geometric objects formed by data in its point cloud representations; see, e.g., [50] for seminal work on the topic and [25, 63] for excellent surveys. TDA allows a comparison of networks that have differing numbers of nodes and edges, which is not possible with standard graph analysis methods. An important concept in topology is homology, the underlying idea that spaces can be deemed similar or different under continuous warping by studying their holes and connectivity. A space can be decomposed into its path-connected components (0-dimensional features), tunnels (1-dimensional), and voids (2-dimensional), and so on at higher dimensions. The count of kth-dimensional features present is the kth Betti number. Two spaces are considered equivalent when their Betti numbers at all dimensions are the same. Assuming that the observed data comes from a subspace of Rⁿ, we can attempt to infer the subspace (its Betti numbers) by covering it with balls centered at each data point in the point cloud. To avoid manually fixing the radius ε of the balls, persistent homology [50], a main ingredient in TDA, automatically detects and systematically characterizes the topological features of the subspace at all scales. To visually convey the key ideas behind persistent homology, see Figure 4.1 for an illustrative example.
In Figure 4.1(a), imagine five tiny infectious cells are born at time t = 0 in the culture and start to grow linearly in time. These cells differ by their degrees of infectiousness, where blue > red > purple > yellow > green. When two cells grow large enough to intersect each other, the more infectious cell kills the less infectious one, and both cells merge to form a highly infectious cluster. Using persistent homology, we investigate the topological changes within the growing sequence of cells indexed by time (a filtration). In particular, we focus on important events when cells merge with one another to form clusters (i.e., components) or tunnels. We begin by tracking the birth and death times of each cell (or cluster of cells) as well as its lifetime in the filtration. At t = 0.23, the green cell gets infected by the blue cell and dies; the two cells merge into one blue cluster; therefore the green cell has a lifetime (i.e., persistence) of 0.23. At t = 0.3, the yellow cell gets infected by the red cell and turns red; therefore it dies at t = 0.3. Similarly, the purple cell dies at t = 0.4, while the red cluster of cells dies at t = 0.45. At time t = 0.53, something interesting happens as the collection of cells forms a tunnel; the tunnel disappears at t = 0.62. One way to record and visualize the appearance (birth), the disappearance (death), and the persistence of topological features is via persistence diagrams [32], scatterplots in which each point p = (a, b) is a topological feature with birth time a and death time b. Equivalently, another way is via persistence barcodes [63], as in Figure 4.1(b), where such a feature is summarized by a horizontal bar that begins at a and ends at b. Other visualizations are also available, e.g., persistence landscapes [22].
On the other hand, from a computational point of view, the above nested sequence of spaces formed by cells can also be represented by a nested sequence of simplicial complexes with a much smaller footprint, as illustrated in Figure 4.1(c). Suppose we represent each cell by its nucleus, a black point at its center. At a fixed parameter t, if two cells intersect each other, we construct an edge (i.e., 1-simplex) connecting their nuclei. Similarly, we construct a triangle (i.e., 2-simplex) for every three nuclei and a tetrahedron (i.e., 3-simplex) for every four nuclei whose corresponding cells have pairwise intersections. Such a complex is referred to as the Rips complex, denoted R(t). For example, at t = 2.5, an edge is formed connecting a pair of nuclei in the red cluster; and at t = 5, a triangle is constructed among a set of pairwise intersecting cells. It is clear that for parameters t₁ ≤ t₂, R(t₁) ⊆ R(t₂), forming a nested sequence. Persistent homology then captures the topological changes of these Rips complexes, producing a barcode similar to that in Figure 4.1(b).

4.5 Methods

4.5.1 Applying persistent homology to brain networks

The key is to first map a given brain network to a point cloud in a metric space (intuitively this corresponds to the set of cells in Figure 4.1), where network nodes map to points, and the measures of association between pairs of nodes map to distances between pairs of points. While it is highly advantageous to be able to characterize networks over a range of scales, using persistent homology for statistical inference of brain networks in medical studies is not trivial. With higher-dimensional topological features, methods rely on jackknife and permutation tests to show statistical significance.
Some methods only compare the 0-dimensional persistent homological features, e.g., the number of connected components or the size of the largest cluster, because they are monotonic over time [29, 30, 105] and can be compared through the Kolmogorov-Smirnov test by treating the barcode features as cumulative density functions. In this chapter, the distance d between two points u, v in the metric space is computed from their correlation or partial correlation coefficients ρ(u, v) in the brain network, that is, d(u, v) = √(1 − ρ(u, v)). Since the values of ρ are capped between -1 and 1, network nodes that are highly correlated are close to each other in the point cloud, and those that are fully anti-correlated are furthest away. Subsequently, a nested sequence of simplicial (e.g., Rips) complexes can be constructed in the metric space for persistent homology computation. See Figure 4.2 for an illustration. To interpret the extracted topological features from the simplicial complexes with respect to the brain network, dimension-0 features capture how nodes in the brain networks are grouped into clusters based on their correlations, while dimension-1 and dimension-2 features encode how these nodes are glued together, forming tunnels and voids. Low-persistence features capture high correlation and possibly microscopic interactions among the network nodes, while high-persistence features reveal low correlation and potentially mesoscopic and macroscopic interactions. Some initial works [26, 103] have shown that the distributions of the topological features with high or low persistence can be indicative of differences among network organizations, although few systematic investigations have been carried out so far.

4.5.2 Leveraging topological features for statistical analysis

To interface topological features with statistical algorithms such as regression or classification, we employ a recent technique proposed by Reininghaus et al.
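The mapping from a correlation matrix to distances and the resulting dimension-0 barcode can be sketched with a Kruskal-style union-find: components are all born at scale 0, and each merge event in increasing distance order kills one component. This is a minimal sketch covering only dimension 0 (higher dimensions need a full persistent homology library); the function name is ours.

```python
import numpy as np

def dim0_barcode(corr):
    """Dimension-0 persistence bars of the Rips filtration from a correlation
    matrix, using d(u, v) = sqrt(1 - rho(u, v))."""
    d = np.sqrt(np.clip(1.0 - corr, 0.0, None))
    n = d.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path compression
            i = parent[i]
        return i

    edges = sorted((d[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(float(dist))           # one component dies here
    # Each bar is (birth=0, death); the final component persists forever.
    return [(0.0, dth) for dth in deaths]
```

Highly correlated nodes merge early (short bars), while weakly correlated clusters persist to larger scales (long bars), matching the interpretation given above.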
[134], which imposes a stable, multiscale topological kernel for persistence barcodes, connecting the topological features encoded in the barcodes with popular kernel-based learning techniques such as kernel SVM and kernel PCA. In a nutshell, the topological kernel K_σ^TDA(A, B) (parametrized by a scale parameter σ) measures similarity between a pair of barcodes A and B obtained from two different functional networks. It is defined as

K_σ^TDA(A, B) = 1/(8πσ) Σ_{p∈A, q∈B} [ exp(−‖p − q‖²/(8σ)) − exp(−‖p − q̄‖²/(8σ)) ],

where for every q = (a, b) ∈ B, q̄ := (b, a).

4.5.3 Partial least squares

Partial least squares regression (PLS) [115] is a dimensionality reduction technique that finds two sets of latent dimensions from datasets X and Y such that their projections onto the latent dimensions are maximally covarying. In comparison to principal component regression, which separately reduces the dimension of the regressors, PLS finds relevant latent variables that facilitate a better regression fit between datasets X (the predictor/regressor) and Y (the predicted response variables). Similar to Singh et al. [151], we consider here the case where the X predictors are features from brain imaging, and the Y responses are clinical measures of behavior. X is an n × N matrix of zero-mean variables and Y is an n × M matrix of zero-mean variables. PLS decomposes X and Y into X = TPᵀ + E and Y = UQᵀ + F, where T and U are n × p matrices of the p latent variables, the N × p matrix P and the M × p matrix Q are orthonormal matrices of loadings, and the n × N matrix E and the n × M matrix F are residuals. The first latent dimension in the PLS regression can be computed using the iterative NIPALS algorithm [180], which finds loading vectors w and c such that the data projected onto these vectors, t = Xw and u = Yc, has maximal covariance. Subsequent latent dimensions are then found by deflating the previous latent dimension from X and Y, then repeating the NIPALS procedure.
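The kernel of [134] defined above can be computed directly from two barcodes. A minimal vectorized sketch (barcodes as arrays of (birth, death) pairs; the function name is ours):

```python
import numpy as np

def tda_kernel(A, B, sigma):
    """Persistence scale-space kernel K_sigma(A, B) between two barcodes:
    sum over p in A, q in B of
    [exp(-||p - q||^2 / (8*sigma)) - exp(-||p - q_bar||^2 / (8*sigma))] / (8*pi*sigma),
    where q_bar mirrors q = (a, b) to (b, a)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    B_bar = B[:, ::-1]                               # mirrored points (b, a)
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    sq_bar = ((A[:, None, :] - B_bar[None, :, :]) ** 2).sum(-1)
    return (np.exp(-sq / (8 * sigma)) - np.exp(-sq_bar / (8 * sigma))).sum() / (8 * np.pi * sigma)
```

The mirrored term makes points on the diagonal (zero-persistence features) contribute nothing, which is what gives the kernel its stability.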
4.5.4 Kernel partial least squares regression

Rosipal and Trejo [137] derived the kernel partial least squares (kPLS) algorithm, which assumes that the regressor data X is mapped by some mapping Φ to a higher-dimensional inner product space F. Let K be the Gram matrix of the data X, such that each entry of the kernel, k(x, x′), between two vectors in F equals the inner product ⟨Φ(x), Φ(x′)⟩_F. The kernel form of the NIPALS algorithm scales the vectors t and u to unit norm instead of the vectors w and c. It initializes a random vector u and repeats the following steps until convergence:

t = ΦΦᵀu = Ku,    t ← t/‖t‖,
c = Yᵀt,
u = Yc,    u ← u/‖u‖.

At convergence, K is then deflated by K ← (I − ttᵀ)K(I − ttᵀ) to compute additional latent dimensions. Similar to PLS, the regression equation is Ŷ = ΦB = KU(TᵀKU)⁻¹TᵀY.

4.6 Results

We apply the proposed kPLS regression between brain network topological features and behavior in an rs-fMRI study of autism. The goal is to test the ability to predict autism severity, as measured by the Autism Diagnostic Observation Schedule (ADOS), from topological features of functional networks. We compare the proposed kPLS with a TDA kernel to kPLS with raw correlation features. We also compare combining both TDA and raw correlation features into a single kernel in kPLS. The results below show that the combination of the raw correlation kernel with the proposed topological features gives the best predictive performance.

4.6.1 Data

The proposed method is applied to the Autism Brain Imaging Data Exchange (ABIDE), a joint effort across multiple international sites aggregating subjects' rs-fMRI scans and behavioral information such as ADOS. To avoid data heterogeneity from site differences, such as different scanner models, protocols, etc., the analysis is limited to the Utah site. There was a total of n = 87 subjects with both rs-fMRI and ADOS information (30 typically-developing control subjects and 57 ASD subjects).
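The kPLS iteration and regression equation of Rosipal and Trejo can be sketched as follows. This is a minimal NumPy sketch (fixed iteration count rather than a convergence test; the function names are ours):

```python
import numpy as np

def kpls_fit(K, Y, n_components, n_iter=50):
    """Kernel PLS via the NIPALS-style iteration: t = Ku (normalized),
    c = Y^T t, u = Yc (normalized), then deflation K <- (I - tt^T) K (I - tt^T)."""
    K = K.copy()
    T, U = [], []
    for _ in range(n_components):
        u = np.random.default_rng(0).normal(size=(Y.shape[0], 1))
        for _ in range(n_iter):
            t = K @ u
            t /= np.linalg.norm(t)
            c = Y.T @ t
            u = Y @ c
            u /= np.linalg.norm(u)
        T.append(t)
        U.append(u)
        P = np.eye(K.shape[0]) - t @ t.T          # deflation projector
        K = P @ K @ P
    return np.hstack(T), np.hstack(U)

def kpls_predict(K, Y, T, U):
    """Regression equation Y_hat = K U (T^T K U)^{-1} T^T Y,
    with K the original (undeflated) Gram matrix."""
    return K @ U @ np.linalg.inv(T.T @ K @ U) @ T.T @ Y
```

With a linear kernel this reduces to ordinary PLS, which is a quick sanity check on the implementation.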
The ADOS is an evaluation for autism based on social and communication behaviors. Subscores are assigned for both criteria, and at the clinician's discretion, subjects scoring a total > 8 are diagnosed with ASD.

4.6.2 Preprocessing

All fMRI data are preprocessed using the Functional Connectomes-1000 scripts, which include skull stripping, motion correction, registration, segmentation, and spatial smoothing. Next, the time series for each of 264 regions are extracted based on Power's regions of interest [129]. The Pearson correlation coefficient is then computed between each pair of regions, resulting in a 34,716-dimensional feature space for each subject (by vectorizing the strictly upper triangular part of the 264 × 264 correlation matrix). These pairwise correlations are used to compute the persistence barcodes for the topological features in Section 4.4.

4.6.3 Relating fMRI correlations to ADOS total scores

Using the rs-fMRI correlation values, a linear kernel Kcor is defined by taking the Euclidean dot products of the features. Kernel PLS is used to regress the ADOS score Y against the rs-fMRI correlations. The predictive power of the model is tested using leave-one-out cross-validation (LOOCV), i.e., for each subject, kPLS regression is trained on the other n − 1 subjects to predict the left-out subject's ADOS score. The prediction Ŷ of the true ADOS scores Y is evaluated by the root mean squared error (RMSE).

4.6.4 Topological kernels

From each subject's fMRI correlation matrix, the persistence barcodes of the dimension-0 and dimension-1 topological features are computed using the procedure detailed in Section 4.4. The kernels KTDA0 and KTDA1 obtained from these barcodes are normalized by the median of the absolute values of their entries.
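The vectorization and linear kernel Kcor described above can be sketched in a few lines (the function name is ours; the upper-triangle count d(d−1)/2 gives 34,716 for d = 264):

```python
import numpy as np

def linear_corr_kernel(corr_mats):
    """Vectorize the strictly upper triangle of each subject's correlation
    matrix, then form the linear kernel of Euclidean dot products."""
    d = corr_mats.shape[1]
    iu = np.triu_indices(d, k=1)                   # d*(d-1)/2 features per subject
    X = np.array([m[iu] for m in corr_mats])       # shape (n_subjects, d*(d-1)/2)
    return X @ X.T                                 # Kcor, shape (n_subjects, n_subjects)
```

This Gram matrix is what gets fed to kPLS in place of (or combined with) the topological kernels.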
There are four free parameters in the kPLS regression: the kernel parameters σ₀ and σ₁ for the dimension-0 and dimension-1 features, and linear weights w₀, w₁ for combining the kernels to give KTDA+cor = w₀KTDA0 + w₁KTDA1 + (1 − w₀ − w₁)Kcor. Leave-one-out cross-validation is performed over all combinations of parameters, with weights w₀ and w₁ in the range from 0 to 1 by 0.05, and log kernel sizes log10(σ₀) and log10(σ₁) from -8 to 6 by 0.2. There is also an evaluation of KTDA, with the constraint (1 − w₀ − w₁) = 0, such that the combined kernel only uses topological features. The above methods Kcor, KTDA+cor, and KTDA are compared against a baseline of using the mean ADOS value of the other n − 1 subjects for prediction. The RMSE results show that correlation matrices and topological features have promising predictive power over the mean prediction baseline (see Table 4.1). To ensure that these regressions are also better than using random signals, the process is also tested on n random correlation matrices generated from i.i.d. N(0, 1) time series of the same size as the real fMRI data, computing their linear kernel. Random signals perform worse than the ADOS mean prediction baseline, with an RMSE of 6.47359. The p-value significance of the RMSE results is determined using permutation tests. Looking at the test statistic RMSE_method2 − RMSE_method1 for all pairwise comparisons of the three kernel methods plus baseline, in each of 100,000 permutations, random pairwise swaps of the method2 and method1 predictions for subjects are performed. The p-value is the percentage of permuted difference statistics that are greater than the unpermuted statistic. From the results, both KTDA (parameters: log10(σ₀) = −6.6, log10(σ₁) = 1.8, w₀ = 0.05, w₁ = 0.95) and Kcor show evidence of improvement over baseline and noise. Augmenting Kcor with topological features, KTDA+cor has the best predictive power and is the only method that is statistically significantly better than baseline.
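The permutation test described above can be sketched as follows. A minimal sketch under our reading of the procedure: per-subject prediction errors of the two methods are randomly swapped to build the null distribution of the RMSE difference (the function name and arguments are ours).

```python
import numpy as np

def perm_test_rmse(err1, err2, n_perm=100_000, seed=0):
    """One-sided permutation p-value for RMSE improvement of method 1 over
    method 2, given each method's per-subject prediction errors."""
    rng = np.random.default_rng(seed)
    err1 = np.asarray(err1, dtype=float)
    err2 = np.asarray(err2, dtype=float)
    rmse = lambda e: np.sqrt(np.mean(e ** 2))
    observed = rmse(err2) - rmse(err1)
    count = 0
    for _ in range(n_perm):
        swap = rng.random(len(err1)) < 0.5          # random per-subject swap
        a = np.where(swap, err2, err1)
        b = np.where(swap, err1, err2)
        count += (rmse(b) - rmse(a)) > observed
    return count / n_perm
```

Small p-values indicate that method 1's RMSE advantage is unlikely to arise from label-swapping alone.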
This best result is with the parameters σ0 = −7.8, σ1 = 2.8, w0 = 0.10, w1 = 0.40. These results show that topological features derived from correlations of rs-fMRI have the potential to explain the connection between resting-state functional brain networks and autism severity.

4.7 Discussion

Although TDA persistent homology features alone are not sufficient for diagnosis, KTDA being more predictive than baseline demonstrates that TDA features contain information relevant to studying disease. When brain networks are simply input into a regression algorithm as vectorized correlation matrices, some useful information about the topology of the networks is lost. This is shown by Kcor and KTDA+cor having different regression results, with KTDA+cor being more predictive. The scale values in the kernels KTDA0, KTDA1, and Kcor are normalized by the medians of their absolute values, so w0 and w1 are comparable. The higher w1 weight indicates that the dimension 1 topological features play a larger role than dimension 0 features in the regression to autism severity.

Figure 4.1. Persistent homology computation and barcode.

Figure 4.2. Mapping a brain network to the metric space.

Table 4.1. Improvement of including topological data analysis to ADOS regression.

  Method      RMSE     vs ADOS mean   vs KTDA   vs Kcor
  ADOS mean   6.4302
  KTDA        6.3553   0.316
  Kcor        6.0371   0.055          0.095
  KTDA+cor    6.0156   0.048          0.075    0.288

ADOS prediction results. The last three columns are p-values for the permutation test of improvement of the row method over the column method.

CHAPTER 5

RIEMANNIAN REGRESSION AND CLASSIFICATION OF NETWORKS

In the previous chapter, topological data analysis was proposed as a method for doing machine learning on brain networks in terms of their topological qualities, but it relies on kernelization of the data, which has drawbacks for interpretation and visualization of the results and for extension to other types of models.
This chapter takes a more quantitative approach by employing metrics that are designed for the space of symmetric positive definite (SPD) matrices, which is a Riemannian manifold. The method is to project the manifold data onto a flat Euclidean space, and to optimize the point at which the projection is taken so that it best fits the task. Although the method can easily be used as part of a deep neural network, here it is limited to the simplest regression/logistic regression model to show that the simple projection has merit for analyzing brain networks. This work was published as "Riemannian regression and classification models of brain networks applied to autism" in Connectomics in Neuroimaging (2018), with coauthors Jeffrey S. Anderson, Brandon A. Zielinski, and P. Thomas Fletcher [181].

5.1 Related work

Because the space of SPD matrices is a Riemannian manifold, geodesic distance is a natural way to measure the difference between SPD matrices. Two such Riemannian metrics are the affine-invariant metric (AIM) and the log-Euclidean metric (LEM). In medical image analysis, the affine-invariant metric for SPD matrices was originally applied to the analysis of diffusion tensor imaging [58, 59, 128]. Since the affine-invariant metric is computationally intensive, the log-Euclidean metric [11] has been offered as a faster alternative. Other measures have also been proposed, e.g., Cholesky, Power-Euclidean, and Root Stein Divergence (see a review in [47]), but these are not true geodesic distances. SPD matrices occur in many areas of computer vision, and these metrics have found applications in pedestrian detection [166], image set classification [83, 174], etc. Recently, Riemannian properties of SPD matrices have been utilized in deep learning models [28, 45, 82, 192]. In particular, some works showed evidence that pooling by second-order statistics, i.e., using the SPD covariance matrix [28, 192], outperforms max-pooling in computer vision tasks.
Works that use the AIM and LEM to analyze SPD resting-state fMRI connectivity matrices have shown promising results [15, 44, 119, 130, 131, 149, 170, 197]. Varoquaux et al. [170] introduced a probabilistic model on the tangent space for comparing single-subject correlation matrices against a group model to identify outlier stroke patients from a group of healthy controls. Ng et al. [119] use the AIM for transport on the SPD manifold to remove nonlinear commonalities between scans in longitudinal studies. The works [130, 131] construct models on the tangent space for variation in a population. In these works, the tangent space is either generically set at the identity or located at the Fréchet mean of the manifold-valued data. Yger and Sugiyama [190] show that the choice of location for the tangent space matters and suggest optimizing the base point at which the tangent space is located to do kernel target alignment on the SPD manifold. The ABIDE dataset is now ubiquitous for testing performance of classification algorithms on resting-state fMRI network data. In particular, the already preprocessed data from the Preprocessed Connectome Project is very accessible, making it easy to compare results with other methods [49, 80, 125]. Some of the recently proposed methods employ complex deep models that take days to train and require careful tuning of a large number of parameters, or kernel methods whose results are difficult to visualize. We demonstrate that Riemannian methods have performance comparable to the state of the art while maintaining the interpretability of the model. Being able to identify specific connections is useful to researchers for understanding which functional communications in the brain are implicated in autism, and to clinicians developing potential treatments, e.g., brain stimulation, that target specific regions.

5.2 Contribution

The contributions of this work are:

1.
It is the first attempt to show that the simple strategy of projecting SPD matrices to the tangent space performs well for rsfMRI data from the ABIDE dataset, in comparison to other modern methods. In fact, a subsequent work [39] that applied this to multiple rsfMRI datasets with various types of preprocessing and classifiers has corroborated the results presented here and recommends the tangent projection approach.

2. The results are interpretable when projecting from the identity.

3. Rather than placing the tangent space at the identity or the Fréchet mean, this work proposes to simultaneously learn logistic regression and regression models while optimizing the base point of the tangent space.

5.3 Riemannian background

A d × d matrix M is symmetric positive definite (SPD) if $z^T M z > 0$ for all $z \neq 0 \in \mathbb{R}^d$. Let $S^d_{++}$ be the space of all symmetric positive definite matrices. $S^d_{++}$ is the interior of a convex cone; the boundary of the cone is the set of semidefinite matrices with null eigenvalues. The central axis of this cone is the set of isotropic diagonal matrices starting from the origin and increasing in trace.

The space $S^d_{++}$ is a Riemannian manifold, which is a smooth manifold with a metric tensor, a family of inner products defined on the tangent space at every point $M \in S^d_{++}$. Several metrics have been proposed for $S^d_{++}$ that utilize its Riemannian properties. The manifold is not a vector space, so using Euclidean methods on this space may be problematic. A few undesirable properties of treating it as a vector space have been noted, e.g., the arithmetic mean is not invariant to inversion [12].

5.3.1 Affine-invariant metric

The set of SPD matrices $S^d_{++}$ is a differential manifold. Given a d × d matrix M, the matrix exponential of M is defined as the series

$$\exp(M) = \sum_{k=0}^{\infty} \frac{M^k}{k!}, \quad \text{where } M^0 = I_d.$$
A tangent space of a differentiable manifold is the real vector space that contains all vectors tangentially passing through a point x of the manifold. The space of positive definite matrices, $S^d_{++}$, is an open set of $\mathrm{Sym}^d$, the space of symmetric matrices. As such, the tangent space at any point of $S^d_{++}$ can be identified with $\mathrm{Sym}^d$. The matrix exponential provides a method for taking an infinitesimal step in the direction of a vector in the tangent space at the identity.

The matrix exponential series is absolutely convergent to a well-defined matrix. It can be shown that for all symmetric matrices $W \in \mathrm{Sym}^d$, $\exp(W)$ is an SPD matrix, and for all $M \in S^d_{++}$ there exists a symmetric matrix $V \in \mathrm{Sym}^d$ such that $\exp(V) = M$, so this mapping is a bijection onto $S^d_{++}$. Therefore, the tangent space of $S^d_{++}$ is the set of all symmetric matrices $\mathrm{Sym}^d$.

Another property of the matrix exponential is that, if S and U are square matrices and U is invertible, then $\exp(U S U^{-1}) = U \exp(S) U^{-1}$. Using this property, it is simple to compute the matrix exponential for symmetric matrices via eigendecomposition. The inverse operation of the matrix exponential is the matrix logarithm, given by the power series

$$\log(W) = \sum_{k=1}^{\infty} (-1)^{k+1} \frac{(W - I)^k}{k}.$$

Once again, this can be computed for symmetric matrices simply by eigendecomposition. The input is an SPD matrix W and the output is the tangent vector (a symmetric matrix).

The affine-invariant metric (AIM), first used in medical image analysis by [58, 128], starts by defining the inner product of tangent vectors $W_1, W_2$ at the identity as

$$\langle W_1 \mid W_2 \rangle_{I_d} = \mathrm{tr}\!\left( W_1^T W_2 \right)$$

and at other points M as

$$\langle W_1 \mid W_2 \rangle_M = \left\langle M^{-\frac{1}{2}} W_1 M^{-\frac{1}{2}} \;\middle|\; M^{-\frac{1}{2}} W_2 M^{-\frac{1}{2}} \right\rangle_{I_d}.$$
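The eigendecomposition route to the matrix exponential and logarithm mentioned above can be sketched as follows (a minimal numpy version; function names are ours):

```python
import numpy as np

def spd_log(M):
    """Matrix logarithm of an SPD matrix via eigendecomposition:
    M = U diag(s) U^T  =>  log(M) = U diag(log s) U^T."""
    s, U = np.linalg.eigh(M)
    return (U * np.log(s)) @ U.T          # column-scales U by log(s)

def sym_exp(W):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    s, U = np.linalg.eigh(W)
    return (U * np.exp(s)) @ U.T

# Round trip: exp(log(M)) recovers M for an SPD matrix built by construction.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
M = A @ A.T + 5 * np.eye(5)               # SPD: symmetric with positive spectrum
M_back = sym_exp(spd_log(M))
```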
Given the above, two Riemannian operations can be introduced that generalize the mapping between points and tangent vectors on the SPD manifold:

$$\mathrm{Exp}_{M_1}(X) = M_1^{\frac{1}{2}} \exp\!\left( M_1^{-\frac{1}{2}} X M_1^{-\frac{1}{2}} \right) M_1^{\frac{1}{2}} \quad (5.1)$$

$$\mathrm{Log}_{M_1}(M_2) = M_1^{\frac{1}{2}} \log\!\left( M_1^{-\frac{1}{2}} M_2 M_1^{-\frac{1}{2}} \right) M_1^{\frac{1}{2}} \quad (5.2)$$

where Exp and Log denote the Riemannian operations, and exp and log denote the matrix exponential and logarithm described previously. $\mathrm{Exp}_{M_1}(X)$ returns the point at time one along the geodesic starting at $M_1 \in S^d_{++}$ with initial tangent vector X. $\mathrm{Log}_{M_1}(M_2)$ is the inverse operation, which yields the tangent vector that Exp maps from $M_1$ to $M_2$. A geodesic distance between points $M_1$ and $M_2$ can therefore be defined under this framework by taking the Frobenius norm of the tangent vector that Log maps one to the other:

$$\mathrm{dist}(M_1, M_2) = \left\| M_1^{\frac{1}{2}} \log\!\left( M_1^{-\frac{1}{2}} M_2 M_1^{-\frac{1}{2}} \right) M_1^{\frac{1}{2}} \right\|_F. \quad (5.3)$$

5.3.2 Log-Euclidean metric

Another popular alternative is the log-Euclidean metric (LEM) proposed by [11], which places a Lie group structure on $S^d_{++}$. Defining the group product

$$M_1 * M_2 = \exp\!\left( \log(M_1) + \log(M_2) \right)$$

gives $S^d_{++}$ an abelian Lie group structure. Please refer to [12] for details. The LEM leads to a simple form for the distance:

$$\mathrm{dist}(M_1, M_2) = \left\| \log(M_1) - \log(M_2) \right\|_F. \quad (5.4)$$

Comparing the metrics (5.3) and (5.4), notice that distances under the LEM are equivalent to those under the AIM when one of the two matrices, $M_1$ or $M_2$, is equal to the identity matrix.

5.4 Methods

The proposed work analyzes rsfMRI networks by first mapping SPD connectivity matrices from the Riemannian manifold to a Euclidean space on which Euclidean methods are effective. Connectivity matrices derived using covariance-based measurements of association (e.g., covariance, correlation, partial correlation) are symmetric positive semidefinite. Matrices with null eigenvalues can be regularized to SPD by setting zero eigenvalues to some small epsilon.
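The two distances (5.3) and (5.4) can be sketched directly from their formulas; this is an illustrative numpy implementation, and the agreement at the identity noted above gives a simple check:

```python
import numpy as np

def _logm(M):
    """Matrix log of an SPD matrix via eigendecomposition."""
    s, U = np.linalg.eigh(M)
    return (U * np.log(s)) @ U.T

def _powm(M, p):
    """Matrix power M^p of an SPD matrix via eigendecomposition."""
    s, U = np.linalg.eigh(M)
    return (U * s**p) @ U.T

def dist_aim(M1, M2):
    """Distance of Eq. (5.3): Frobenius norm of the Log_{M1}(M2) tangent vector."""
    Mi, Mh = _powm(M1, -0.5), _powm(M1, 0.5)
    return np.linalg.norm(Mh @ _logm(Mi @ M2 @ Mi) @ Mh, 'fro')

def dist_lem(M1, M2):
    """Log-Euclidean distance of Eq. (5.4)."""
    return np.linalg.norm(_logm(M1) - _logm(M2), 'fro')
```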
The space of symmetric matrices $\mathrm{Sym}^d$, which is the tangent space to $S^d_{++}$, is a flat vector space. Therefore, we can map the connectivity matrices to the tangent spaces specified by the AIM and LEM, and proceed with standard Euclidean analysis of regression, or logistic regression for classification. The LEM offers a way to map SPD matrices to the Euclidean tangent space at the identity [82], i.e., $f : S^d_{++} \to \mathbb{R}^{d \times d}$ with

$$f_{\mathrm{LEM}}(M) = \log(M). \quad (5.5)$$

In the AIM case, the mapping is given by

$$f_{\mathrm{AIM}}(M_2) = \log\!\left( M_1^{-\frac{1}{2}} M_2 M_1^{-\frac{1}{2}} \right), \quad (5.6)$$

where $M_1$ is the base point [75]. The proposed method requires the selection of a base point on the manifold at which the input connectivity matrices are projected to the tangent space. For data analysis, we can think of $M_1$ in equation (5.3) as a base point that all data points are compared to. The Fréchet mean, which is a generalization of the centroid to metric spaces, has been proposed before as a reasonable choice for $M_1$ [57, 119] when working with the AIM. It is analogous to whitening and centering data to mean zero in standard analysis. If $(x_1, \dots, x_n)$ are n points in a metric space M, then the Fréchet mean is defined by

$$\mu_f = \arg\min_{\mu} \sum_{i=1}^{n} \mathrm{dist}(\mu, x_i)^2.$$

The reader may refer to [58] for the computation of the Fréchet mean of SPD data under the AIM. In addition to using the Fréchet mean as the base point, we also propose to optimize for a base point $M_1$ that best fits the task.

5.4.1 Implementation

We apply this framework for the purpose of classifying and regressing network data. For classification, we propose to use logistic regression and learn $M_1$ simultaneously using the backpropagation algorithm for matrix operations, i.e., the matrix logarithm, introduced by [88]. Implementation-wise, the AIM has to be defined in two layers.

Affine-transform layer:

$$L_{\mathrm{aff}}(M_2) = C M_2 C, \quad (5.7)$$

where $C = M_1^{-\frac{1}{2}}$ is a substitution for the base point and must also be an SPD matrix, and $M_2$ is the SPD-valued input data.
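The Fréchet mean under the AIM has no closed form, but the standard fixed-point iteration (average the data in the tangent space at the current estimate, then Exp map back) can be sketched as below; this is an illustrative numpy version, not the reference algorithm of [58]:

```python
import numpy as np

def _eig_fun(M, f):
    """Apply scalar function f to the spectrum of a symmetric matrix."""
    s, U = np.linalg.eigh(M)
    return (U * f(s)) @ U.T

def frechet_mean_aim(mats, n_iter=50):
    """Fréchet mean of SPD matrices under the affine-invariant metric via
    fixed-point iteration: mu <- Exp_mu(mean_i Log_mu(x_i))."""
    mu = np.mean(mats, axis=0)                        # Euclidean mean warm start
    for _ in range(n_iter):
        mu_h = _eig_fun(mu, np.sqrt)                  # mu^{1/2}
        mu_ih = _eig_fun(mu, lambda s: 1.0 / np.sqrt(s))  # mu^{-1/2}
        # Mean of the Log maps, expressed at the identity.
        T = np.mean([_eig_fun(mu_ih @ X @ mu_ih, np.log) for X in mats], axis=0)
        mu = mu_h @ _eig_fun(T, np.exp) @ mu_h        # Exp map back to the manifold
    return mu
```

For commuting matrices the AIM mean reduces to the elementwise geometric mean, which gives a quick sanity check.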
Log layer:

$$L_{\log}(X) = \log(X), \quad (5.8)$$

where $X = C M_2 C$ must also be SPD.

As a concrete example, in the 2-class logistic regression case, the probability of a data point $M_2$ being in class Y is given by

$$P(Y = 1 \mid M_2) = \frac{1}{1 + \exp\!\left( -\left( c + \beta\, \mathrm{vec}\!\left( \log\!\left( M_1^{-1/2} M_2 M_1^{-1/2} \right) \right) \right) \right)},$$

where $M_1$ is the base point to optimize over and $\mathrm{vec}(\cdot)$ is the vectorization of a matrix. The energy function is the standard cross-entropy for logistic regression. By the chain rule, minimizing the cross-entropy with respect to $M_1$ involves computing the matrix logarithm backgradient Z.

5.4.1.1 Backpropagation for $L_{\log}$

Following the steps in [82, 88], the backpropagation for $L_{\log}$ is as follows. Let the $(l-1)$th layer have variable $X_{l-1}$, and let the $l$th layer be the matrix logarithm operation

$$g(X_{l-1}) = \log(X_{l-1}) = U \log(\Sigma) U^T = X_l.$$

The function $L_l$ is the backpropagating loss function at the $l$th layer. Define $A : B = \mathrm{Tr}(A^T B)$. Let

$$\frac{\delta L_l}{\delta X_l} : dX_l = \frac{\delta L_{l-1}}{\delta X_{l-1}} : dX_{l-1}$$

and let $\mathcal{L}$ be the functional relating the variations of the upper-layer variables to the variations of the lower-layer variables, such that $dX_l = dg(X_{l-1}; dX_{l-1}) = \mathcal{L}(dX_{l-1})$. Then

$$\frac{\delta L_l}{\delta X_l} : dX_l = \frac{\delta L_l}{\delta X_l} : \mathcal{L}(dX_{l-1}) = \mathcal{L}^*\!\left( \frac{\delta L_l}{\delta X_l} \right) : dX_{l-1} \;\Rightarrow\; \mathcal{L}^*\!\left( \frac{\delta L_l}{\delta X_l} \right) = \frac{\delta L_{l-1}}{\delta X_{l-1}},$$

where $\mathcal{L}^*$ is the adjoint operator of $\mathcal{L}$. Writing X in terms of U and Σ from the eigendecomposition,

$$\frac{\delta L_{l-1}}{\delta X_{l-1}} : dX_{l-1} = \frac{\delta L_{l-1}}{\delta U} : dU + \frac{\delta L_{l-1}}{\delta \Sigma} : d\Sigma.$$

With the constraints that $U^T U = I$ and Σ has a diagonal structure,

$$d\Sigma = \left( U^T\, dX_{l-1}\, U \right)_{\mathrm{diag}}, \qquad dU = 2U \left( R^T \circ \left( U^T\, dX_{l-1}\, U \right) \right),$$

where

$$R(i, j) = \begin{cases} \dfrac{1}{\sigma_i - \sigma_j} & i \neq j \\ 0 & i = j \end{cases}, \qquad \sigma_i = \Sigma_{ii}.$$

Therefore,

$$\frac{\delta L_{l-1}}{\delta X_{l-1}} : dX_{l-1} = U \left( 2 \left( R^T \circ \left( U^T \frac{\delta L_{l-1}}{\delta U} \right) \right) + \left( \frac{\delta L_{l-1}}{\delta \Sigma} \right)_{\mathrm{diag}} \right) U^T : dX_{l-1}.$$

For the matrix logarithm, $dX_l = 2U \log(\Sigma)\, dU^T + U \Sigma^{-1}\, d\Sigma\, U^T$. The partial derivatives are

$$\frac{\delta L_{l-1}}{\delta U} = 2\, \frac{\delta L_l}{\delta X_l}\, U \log(\Sigma), \qquad \frac{\delta L_{l-1}}{\delta \Sigma} = \Sigma^{-1} U^T\, \frac{\delta L_l}{\delta X_l}\, U.$$

We can then solve for $Z = \frac{\delta L_{l-1}}{\delta X_{l-1}}$.

5.4.1.2 Backpropagation for $L_{\mathrm{aff}}$

After obtaining Z, the gradient with respect to $C = M_1^{-\frac{1}{2}}$ is

$$\frac{\delta L}{\delta C} = M_2 C Z + Z C M_2.$$
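The forward pass of this two-layer model (affine transform, matrix log, then a sigmoid over the vectorized tangent matrix) can be sketched as below; it is a minimal illustration in which β has one weight per matrix entry rather than per upper-triangle entry, and all names are ours:

```python
import numpy as np

def _eig_fun(M, f):
    """Apply scalar function f to the spectrum of a symmetric matrix."""
    s, U = np.linalg.eigh(M)
    return (U * f(s)) @ U.T

def predict_proba(M2, C, beta, c):
    """Tangent-space logistic regression forward pass:
    P(Y=1|M2) = sigmoid(c + beta . vec(log(C M2 C))), with C = M1^{-1/2}.
    Composes the affine-transform layer L_aff and the log layer L_log."""
    X = _eig_fun(C @ M2 @ C, np.log)      # L_log(L_aff(M2))
    z = c + beta @ X.ravel()              # vec(.) then linear + intercept
    return 1.0 / (1.0 + np.exp(-z))
```

A quick check: when the data point equals the base point $M_1$, the affine transform yields the identity, the log layer yields zero, and the probability reduces to sigmoid(c).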
During optimization, there is no need to compute $M_1$ directly. To keep C positive definite, update C by taking the Riemannian Exp map from (5.1), such that $C_{\mathrm{new}} = \mathrm{Exp}_{C_{\mathrm{old}}}\!\left( \frac{\delta L}{\delta C_{\mathrm{old}}} \right)$.

5.5 Results

5.5.1 Dataset

Once again, the dataset that this model is applied to is the collection of rsfMRI and phenotypic data for typically developing controls and ASD subjects from ABIDE I, acquired from multiple sites. The ABIDE dataset is frequently used due to its open accessibility for researchers. However, learned connectomes are hard to compare because they are sensitive to the preprocessing steps used. The Preprocessed Connectome Project [36] provides ABIDE data already preprocessed with popular pipelines, such that performance from multiple proposed rsfMRI methods is directly comparable. Whereas in the previous chapter the ABIDE data were preprocessed using Power regions [129], the data used in this chapter are acquired directly from the Preprocessed Connectome Project, preprocessed with the CPAC pipeline and parcellated according to the Harvard-Oxford atlas to be consistent with [125]. There are 468 controls and 403 ASD subjects. The resulting time series at each of the d = 111 ROIs are normalized to mean 0 and standard deviation 1.

5.5.2 Classification

The first comparison is between raw and Fisher-transformed Pearson correlation matrices, as well as eigenvalue-regularized and log-Euclidean transformed matrices, as input for each subject into logistic regression for classification. Raw correlation matrices are decomposed, their eigenvalues lower-bounded at 0.5, and re-composed into regularized correlation matrices to ensure that the matrices are SPD. Log-Euclidean matrices are obtained by taking the matrix logarithm of the regularized correlation matrices. All matrices are then reduced to upper triangles and vectorized into feature vectors.
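The eigenvalue regularization step described above (decompose, lower-bound the eigenvalues at 0.5, re-compose) can be sketched as:

```python
import numpy as np

def regularize_spd(M, floor=0.5):
    """Regularize a symmetric positive semidefinite matrix to SPD by
    lower-bounding its eigenvalues at `floor` and re-composing, as done
    for the raw correlation matrices in the text."""
    s, U = np.linalg.eigh(M)
    return (U * np.maximum(s, floor)) @ U.T

# A rank-deficient correlation-like matrix (two perfectly correlated variables),
# with eigenvalues 0 and 2; regularization lifts the null eigenvalue to 0.5.
M = np.array([[1.0, 1.0],
              [1.0, 1.0]])
R = regularize_spd(M)
```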
Matrix features involving the log-Euclidean transform are of 6205 dimensions because diagonal entries are included in the upper triangle, whereas all other features are of 6105 dimensions. Using the Scikit-Learn implementation of logistic regression with L2 penalty as the classifier, the classification performance is evaluated through a nested ten-fold cross-validation scheme (folds selected at random). At each fold, 10% of the data is set aside for testing, and the other 90% is ten-fold cross-validated to select the best L2 penalty parameter. The range of parameters we cross-validate over is [0.01, 0.05, 0.075, 0.1, 0.2, 0.5, 0.75, 1.0, 3.0, 5.0]. The results of using affine-invariant transformed matrices with an optimization for the base point in TensorFlow are also compared using the same ten-fold cross-validation scheme. At the first layer, square correlation matrices are affine-invariant transformed about the variable base point, then linearized to a 6205-dimensional vector and fed into a sigmoid function for classification. The cost function to optimize is the sum of the logistic regression cross-entropy plus an L2 penalty with parameter λ. In TensorFlow, the range of parameters cross-validated over is [5, 10, 15, 20, 100, 200]. The matrix backpropagation is modified to the method described in the previous section. The optimization is run until convergence, within 50 iterations. Table 5.1 shows the results. The baseline of using just vectorized correlation matrix features has an accuracy score of 65.7%, comparable to baseline scores reported in [1, 125]. A t-test showed that both the log-Euclidean and the affine-invariant transformed features have a statistically significant improvement in performance over the raw correlation baseline (p = 0.02 and p = 0.002, respectively). The regularized correlation matrix shows similar accuracy to the raw correlation features, indicating that the increase in performance is solely due to the Riemannian mappings.
Figure 5.1 shows the range of classification accuracy from ten-fold cross-validation for the baseline compared to log-Euclidean and affine-invariant mapped features. Using the model learned from the log-Euclidean features, we visualize the highest weights in the classification, thresholded at |w| > 0.25, in Figure 5.2. Red connections indicate positive weights that push classification toward the ASD group (label = 1) and blue connections are negative weights toward the control group. Figure 5.3 is a diagram of the learned weights on the ROIs grouped by subnetworks from [153]: visual, default mode, sensorimotor, auditory, executive control, and frontoparietal networks. The colormap runs from negative values in blue (driving classification toward control) to positive values in red (toward autism). For visualization, very small weights have been filtered. There is evidence of patterns within and between subnetworks. The weights within a subnetwork are simplified to their means within a block. The results show that control subjects tend to have higher intranetwork connectivity, especially within the sensorimotor, executive control, and default mode networks, whereas subjects with ASD have stronger internetwork connectivity, e.g., between the default mode and the sensorimotor networks. This is in agreement with existing literature that the default mode network is not well segregated from other subnetworks in the ASD population [13, 178, 189].

5.5.3 Regression

To show that the Riemannian features also have predictive power in regression, the performance of log-Euclidean and affine-invariant transformed matrices is compared against raw connectivity matrices and TDA of connectivity matrices (from the previous chapter) in the prediction of autism severity as measured by the ADOS Total score. Since the time series have been normalized to unit standard deviation in the preprocessing, the matrices are both covariance and correlation matrices.
Though there has been work on regression on Riemannian manifolds [34, 100], it has not been applied to ASD analysis. Because regression is more challenging than classification, and some sites lack ADOS scores for control subjects, the analysis is limited to the largest site (Utah), which has a roughly even split of ASD and control subjects with ADOS Total scores. In the previous chapter on TDA (Chapter 4), the data were preprocessed with Power regions; since the data in this chapter come from the Preprocessed Connectome Project, the TDA results are rerun for an apples-to-apples comparison (using the same kernel partial least squares procedure as previously described). The Utah site has 62 subjects with scores ranging from 0 to 21. Scores below 10 are considered typically developing. For the raw correlation, Fisher-transformed correlation, and log-Euclidean input features, we use partial least squares (PLS) regression optimized with the NIPALS algorithm [180] to make the prediction. The optimization for the base point under the affine-invariant metric is not trivial with NIPALS. To use partial least squares for the AIM, we set the base point to the Fréchet mean. We also consider base point optimization using a linear regression layer with L2 regularization, cross-validating for the regularization parameter. Table 5.2 shows the root mean squared error (RMSE) and the Q2 coefficient of determination for the Riemannian and baseline correlation matrix features, to show the predictive power of the various models. The Q2 value is calculated through a leave-one-out cross-validation (LOOCV) scheme. The plot of true versus predicted ADOS using the Fréchet mean base point is shown in Figure 5.4. To show the statistical significance of improvement, we perform a permutation test. We sum the absolute values of the residuals of the LOOCV predictions and take the difference between the proposed method and the baseline correlation as the test statistic.
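The LOOCV evaluation of RMSE and Q2 can be sketched generically as below; ordinary least squares stands in for the kernel PLS regressor used in the text, and the Q2 definition here (1 minus PRESS over total sum of squares about the overall mean) is one common convention:

```python
import numpy as np

def loocv_q2(X, y, fit, predict):
    """Leave-one-out cross-validation returning (RMSE, Q2), where
    Q2 = 1 - sum((y - yhat)^2) / sum((y - mean(y))^2) and yhat are the
    LOOCV predictions. `fit`/`predict` stand in for any regressor."""
    n = len(y)
    yhat = np.empty(n)
    for i in range(n):
        idx = np.arange(n) != i                   # hold out subject i
        model = fit(X[idx], y[idx])
        yhat[i] = predict(model, X[i:i+1])[0]
    press = np.sum((y - yhat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return np.sqrt(press / n), 1.0 - press / tss

# Stand-in regressor: ordinary least squares with an intercept.
def ols_fit(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def ols_predict(w, X):
    return np.column_stack([np.ones(len(X)), X]) @ w
```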
Then we perform 10,000 permutations, swapping the predictions between the two methods, and count the number of times that the permuted differences are greater than our nonpermuted test statistic value. Both the log-Euclidean and affine-invariant metrics significantly improve over the raw and Fisher correlation baselines (also similarly significant by a t-test on the RMSE). The regression weights in Figure 5.5 show similar patterns to the classification results, though not the same. This is expected because the regression data is only a single-site subset. The classification and the regression weights share a correlation of 0.31, reasonably consistent for such high-dimensional data. Summarizing the weights into means of each block, we can see the pattern that intraconnectivity in the default mode and sensorimotor networks drives the regression toward low ADOS scores (control), while interconnectivity between the two networks pushes the regression toward high ADOS scores.

5.6 Discussion

Because the space of connectivity matrices is not complete under the Euclidean distance, using Euclidean distance on this space does not enforce the symmetric positive definite property of connectivity matrices and is therefore not ideal for analysis. This chapter demonstrates that the Riemannian representation of SPD matrices is beneficial for autism classification and regression tasks and comparable in performance to other modern methods. In particular, the results are interpretable under the log-Euclidean metric, whereas the affine-invariant metric leads to higher learning performance. For future work, the effect of the choice of ROIs on predictions will be compared. We will also develop the affine-invariant base point update for other analyses. The gradient descent method presented here can be adapted to update simultaneously with multilayer neural networks.
To do so, SPD data would first be mapped onto the Euclidean tangent space using either $f_{\mathrm{LEM}}$ or $f_{\mathrm{AIM}}$ in the first layer, followed by a standard architecture such as a feedforward network.

Figure 5.1. Box plots of the classification accuracy over ten-fold cross-validation.

Figure 5.2. Plot of the connections with highest weights in the classification.

Figure 5.3. Classification weights grouped by subnetwork.

Figure 5.4. Predicted vs. true ADOS scores for regression under AIM.

Figure 5.5. Regression weights grouped by subnetwork.

Table 5.1. Accuracy performance of Riemannian and various state-of-the-art classification methods.

  Method                         Validation   Acc (stdev)    Sensitivity   Specificity
  Abraham et al. [1]             CV10         0.668
  Dvornek et al. [49]            CV10         0.685 (0.06)
  Parisot et al. [125]           CV10         0.695
  Heinsfeld et al. [74]          CV10         0.70           0.74          0.63
  Raw Correlation                CV10         0.657 (0.06)   0.728         0.573
  Fisher Correlation             CV10         0.672 (0.05)   0.737         0.594
  Regularized Correlation        CV10         0.660 (0.06)   0.741         0.565
  Log-Euclidean                  CV10         0.700 (0.05)   0.809         0.575
  Aff-Inv w/ Fréchet Mean        CV10         0.700 (0.05)   0.801         0.577
  Aff-Inv w/ Base point Update   CV10         0.711 (0.05)   0.838         0.585

Table 5.2. Comparison of Riemannian features against baselines in PLS regression.

  Method                         RMSE   Q2       Better than Raw   Better than Fisher
  Raw Correlation                6.17   -0.050
  Fisher Correlation             6.18   -0.062
  Topological Data Analysis      5.73   0.125
  Log-Euclidean                  5.42   0.182    p = 0.0112        p = 0.0127
  Aff-Inv w/ Fréchet Mean        5.36   0.202    p = 0.0064        p = 0.0069
  Aff-Inv w/ Base point Update   5.33   0.209    p = 0.0002        p = 0.0001

CHAPTER 6

CONFOUND ADJUSTMENT IN NETWORK DECODING PROBLEMS

As statistical machine learning algorithms are increasingly developed and applied to neuroimaging problems, a question emerges: How can we utilize these models to draw links between brain function and behavior?
Supervised learning algorithms have shown success in predicting demographic and behavioral attributes from neuroimaging data with high accuracy, yet not all algorithms are easily interpretable, and uninterpretable ones do little to yield insight about the associated brain phenomena. For the goal of medical image understanding, the analysis should be designed such that the criteria driving predictions can be interpreted by a human being [81]. Moreover, the effect of any confounding factor that drives prediction needs to be decodable from the analysis. This chapter focuses on how to handle confounding factors in interpretable models for the study of resting-state functional magnetic resonance imaging (rsfMRI).

6.1 Introduction

For the problem of identifying functional biomarkers predictive of cognitive ability and behavior from rsfMRI, what counts as a confounding variable is context dependent. In a prediction model, how one accounts for the effects of such confounding variables on the target variable can lead to 1) markedly different predictions and 2) incorrect interpretation of the model. For instance, certain neurological conditions can be highly correlated with head motion (HM), which also affects the acquired rsfMRI signals. Without controlling for HM as a confounding factor, models can predict neurological conditions with boosted accuracy simply by picking up on the signal associated with HM rather than the underlying signal of interest. However, such models would not generalize to input data that has been corrected for HM, nor would they give much insight about the neural underpinnings of the conditions aside from head motion itself. Recently, Snoek et al. [157] raised the issue of the lack of consistency in confound control methodology throughout the neuroimaging literature. Differences in confound control, or its absence, lead to inflation or deflation of reported prediction accuracy, so prediction performance is not comparable across studies.
In the most typical method of confound control for regression problems, a set of variables deemed to be potential confounds is first linearly regressed out of the inputs, but Snoek et al. [157] propose to do this confound adjustment within cross-validation instead. This chapter demonstrates why it is crucial to do confound control on both input and target variables within cross-validation in regression models. Doing confound control this way does not bias the prediction accuracy, and it maintains the interpretability of the model. This chapter puts forward the recommendation that confound variable candidates be chosen for their relevance to the target variable, including any non-additive interactions between confounds that would also be pertinent to the prediction of the target. Multiple confound variables give rise to a combinatorial number of interactions among confounds. The suggestion is to automatically select the final set of confounds and their interactions using information criteria. Assessing whether variables are truly confounding the target prediction, or whether their nonlinear interactions should be included in the analysis, is rarely addressed in the rsfMRI literature. The following sections justify the suggested method of confound control with examples using synthetic data. They also demonstrate the effectiveness of the proposed confound control strategy in predicting fluid and crystallized intelligence using rsfMRI data from the Human Connectome Project (HCP) [169]. In the analysis of this dataset across different methods of confound control, a substantial difference is observed between male and female groups in the predictability of crystallized intelligence that is not present in fluid intelligence.

6.2 Background

Resting-state functional connectivity estimates the processes happening in the brain while it is at rest.
By parcellating the brain into predefined regions of interest (ROIs), these functional connections can be estimated as the pairwise association between ROIs. Increasing evidence demonstrates that rsfMRI connectivity uniquely fingerprints an individual's behavioral and demographic traits [56]. At a broad population level, rsfMRI features capture a spectrum of phenotypes. Several neuroimaging initiatives have been created for the purpose of mapping the human brain. The most substantial data collections, the Human Connectome Project (HCP) and UK Biobank [24], have compiled rsfMRI data from thousands of volunteers in conjunction with their lifestyle, demographic, and psychometric information. The benefit of such large-scale data collection is that it allows for building models generalizable across the human population. While many machine learning methods have been developed in recent years that focus on using rsfMRI to predict specific behavioral measures, in order to map the brain we need to study how behavioral phenotypes trace back to functional connectivity features in the brain. Being able to decode the complex wiring of the brain is a necessary step toward understanding the underlying biology of neurological disorders. Researchers have found evidence that rsfMRI carries information for predicting age [46, 71, 107] and sex [177, 184, 196], as well as phenotypes of intelligence [71] and various neurological disorders [126]. However, to interpret an rsfMRI model, it is important that the effects of confounding variables be disentangled from the analysis. Confounding variables are additional sources of information, beyond the signal of interest, that contribute to the prediction. For example, it is possible to get a very high accuracy prediction of phenotype for sex-biased neurological diseases such as schizophrenia [114] simply by predicting the sex of the subjects.
This may not necessarily be a drawback if an accurate prediction is the end goal, but if the research goal is to interpret the learned model and understand the neurological basis of schizophrenia common to both sexes, then sex should be treated as a confound variable in order to first isolate the effects of the general differences between male and female brain connectivity.

6.2.1 Confound control

For the correct translation of machine learning models into clinical practice, it is important to consider how the impact of confounding information on the target variable is accounted for. Since many neuroimaging works have focused on the goal of classification/regression accuracy rather than interpretation of the learned model, there has been no consistency across works in how confound variables are handled. While it is most typical to regress confounds out of the predictor inputs X, some works [53, 71] adjust for the confounds (age, sex, and head motion) out of the target variable Y (fluid intelligence) prior to cross-validation. Others, like Parisot et al. [126], deliberately include confounds (age and sex) in the analysis for the purpose of boosting prediction performance. For the goal of interpretation, Snoek et al. [157] note that the confounding information needs to be removed from the analysis in decoding studies. Instead of an initial removal of confounds from the predictor variables using whole-dataset confound regression (WDCR) prior to model training, they propose cross-validation confound regression (CVCR), which adjusts for confounds in the predictor variables during the cross-validation steps.

WDCR and CVCR

Let the dataset be {yᵢ, xᵢ, cᵢ}, i = 1, …, n, of n samples: the observations of the random target variable Y, multivariate input variables X of dimension k, and multivariate confound variable C of dimension p, respectively. To estimate the linear effect of C on X, express C as a design matrix, with 1s along the first column for the intercept.
Then C is a matrix of size n × (p + 1), X is an n × k data matrix, and Y is an n × 1 data vector. The weights that linearly predict the effect of C on each input dimension of X can be estimated by ŵX = (CᵀC)⁻¹CᵀX.

(WDCR) The adjusted data X̃ = X − CŵX are split into folds for cross-validation analysis, and Y is trained on X̃.

(CVCR) Confound adjustment happens within each training fold: ŵXtrain = (Cᵀtrain Ctrain)⁻¹ Cᵀtrain Xtrain and X̃train = Xtrain − Ctrain ŵXtrain, and the test set is adjusted using the parameters learned from the training set, X̃test = Xtest − Ctest ŵXtrain.

Snoek et al. find that using WDCR, that is, partitioning the train/test data after confound control, leads to problematic below-chance predictions of the target variable. This is prone to happen when there is a high correlation between C and Y, which implies a low feature-target correlation between X and Y. After an initial confound adjustment of C out of X, if the distribution of feature-target correlations is low with a narrow standard deviation, then it becomes likely that during cross-validation an algorithm is trained on features in Xtrain that are positively or negatively related to Y in the training set but have their signs flipped in the test set Xtest. In fact, in a special case of their synthetic binary classification experiment where Y is linearly dependent only on the confounds, i.e., corr(C, Y) = 1, WDCR has a prediction accuracy of 0% (much worse than the 50% accuracy expected by chance). By moving confound control into the cross-validation splits, they show that CVCR also cross-validates the confound regression, resulting in minor accuracy deflation but an overall improvement in prediction performance over chance and over WDCR. Nonetheless, Dinga et al. [43] caution that CVCR is not appropriate for nonlinear machine learning methods, which can still pick up signal from residual confounding information and bias the predictability of Y.
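As a concrete illustration, the CVCR adjustment described above can be sketched in a few lines of NumPy. This is a minimal sketch with hypothetical toy data, not the pipeline used in this chapter; the key point is that the confound weights are estimated on the training fold only and then applied to both folds.

```python
import numpy as np

def fit_confound_weights(C, X):
    # OLS solution of w = (C^T C)^{-1} C^T X, computed stably with lstsq.
    return np.linalg.lstsq(C, X, rcond=None)[0]

def cvcr_adjust(C_train, X_train, C_test, X_test):
    # CVCR: learn confound weights on the training fold only,
    # then remove the estimated confound effect from both folds.
    w = fit_confound_weights(C_train, X_train)
    return X_train - C_train @ w, X_test - C_test @ w

# Hypothetical toy data: intercept column plus one confound covariate.
rng = np.random.default_rng(0)
n, k = 100, 5
C = np.column_stack([np.ones(n), rng.normal(size=n)])
X = rng.normal(size=(n, k)) + 2.0 * C[:, [1]]  # inputs contaminated by the confound
X_tr_adj, X_te_adj = cvcr_adjust(C[:80], X[:80], C[80:], X[80:])
# On the training fold, the residuals are exactly orthogonal to the confound.
r = np.corrcoef(C[:80, 1], X_tr_adj[:, 0])[0, 1]
```

WDCR would instead call `fit_confound_weights` on the full C and X before splitting, which is what leads to the below-chance predictions discussed above.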
For this reason, this chapter adheres to linear regression models for prediction, to ensure that the effect of confounds can be properly isolated. Alternatively, rather than regressing confounds out of X, confound control by counterbalancing ensures that confounds are not correlated with the target variable Y. Post-hoc counterbalancing [132] involves analysis on a subsample of the data chosen such that the confound and target become decorrelated. Snoek's experiments show that the subsampling in post-hoc counterbalancing discards samples that are harder to predict, and therefore introduces an accuracy inflation bias. Chyzhyk et al. [31] propose to counterbalance the test dataset within cross-validation instead. However, counterbalancing in general is only feasible with a single confound variable. With a higher number of confounds, and especially when considering multiple confound variables jointly, it becomes difficult to subsample such that the target becomes decorrelated with every single confound variable.

6.2.2 Model interpretation

To clarify what is meant by model interpretation, the focus here is on simple linear methods such as linear and logistic regression for predicting continuous and binary variables (see Molnar [117] for a review of other types of interpretable models). Analysis from Schulz et al. [148] finds that traditional linear models perform just as well as complex neural network models at sample sizes of less than 10,000, which is larger than most current neuroimaging datasets. Rudin [140] offers the perspective that there is no tradeoff between the accuracy and interpretability of a model, with a Rashomon set argument that, in the set of accurate predictive models for a dataset, at least one model would be both accurate and interpretable. Therefore, if interpretation is key, she calls for using interpretable models in the first place. For a linear regression that uses k-dimensional rsfMRI features X to predict the target phenotype Y by ŷᵢ = β̂₀ + β̂₁xᵢ₁ + β̂₂xᵢ₂ + … + β̂ₖxᵢₖ, the trained model can be interpreted by selecting the top coefficients by absolute value, β̂ⱼ for j = 1, …, k, to see how a unit change in xᵢⱼ affects ŷ, or by sorting the t-statistics t(β̂ⱼ) = β̂ⱼ / SE(β̂ⱼ) to identify the connectivity features that are important in the prediction. The variables age, sex, and head motion are confounding factors for both the rsfMRI features X and the intelligence attributes Y (see Figure 6.1). If the problem is framed as a graphical model, what is of interest is the relationship between X and Y conditional on the set of confounds C [43]. As shown later, how confounds are controlled for influences the estimation of the coefficients β̂ⱼ. In real-world neuroimaging analysis, target prediction performance can be evaluated, but there may be no ground truth for the rsfMRI brain networks being inferred. Therefore, it is important that the chosen method of confound control both predict accurately and estimate β̂ correctly for interpretability.

6.2.3 Fluid and crystallized intelligence

In this work, the aim is to decode the functional neural basis of intelligence using rsfMRI data from the Human Connectome Project (HCP). According to the Cattell-Horn g_f-g_c theory [27, 79], human general intelligence has two subcategories: fluid intelligence, g_f, and crystallized intelligence, g_c. Here both types are analyzed. Fluid intelligence is the ability to process novel, unfamiliar concepts, whereas crystallized intelligence is the ability to reason using knowledge gained from experience. Cattell theorizes that new concepts digested using fluid intelligence are transferred to long-term memory as part of crystallized intelligence. Psychometric assessments of general intelligence measure fluid intelligence using pictorial patterns and crystallized intelligence by language-related tasks.
In the HCP, fluid intelligence is measured by the number correct on the Penn Matrix Task (PMAT24_A_CR), which asks participants to identify the final image that completes the pattern in a series of pictures. Crystallized intelligence (CogCrystalComp_unadj) in the HCP is the average of raw scores on the Picture Vocabulary Test, where participants select the pictures that best match an audio-delivered vocabulary word, and a Reading Test in which participants read words out loud. Several rsfMRI analyses [53, 71, 108] regard sex, age, and head motion (HM) as confound variables in predicting intelligence. In psychology research, whether there are sex differences in general intelligence is highly controversial. Some works claim that tests of intelligence favor men over women (fluid: [89, 112, 122]; crystallized: [152, 176]), though others report no difference [92, 96]. Steinmayr [162] finds that the perceived inconsistencies depend on the methods chosen for assessment and analysis. Since sex plays a large role in connectivity, a model that predicts intelligence well for both sexes should control for these sex differences.

6.2.4 Brain networks and intelligence

Structurally, brain size, brain volume, and gray matter volume are positively associated with higher intelligence [9, 179]. Combining results from 37 neuroimaging studies, the parieto-frontal integration theory (P-FIT) presented by Jung and Haier [94] proposes that the basis of human intelligence is described by the brain networks connecting regions in the frontal, parietal, temporal, and cingulate cortices. Among the structural studies, Jung and Haier demonstrate that intelligence as measured by the Wechsler Adult Intelligence Scale (WAIS) shows correlation with the left and right frontal regions, but not with the temporal or occipital lobes.
However, amongst the task-based functional studies, bilateral activations in the frontal and occipital cortices are associated with intelligence, with higher activations in the left hemisphere than the right. In reasoning tasks, activations in regions of the parietal lobe have been observed in a majority of the studies. In some cognitive task-based fMRI studies, there is evidence for a negative correlation between higher intelligence and the deactivation of the default mode network (self-referential processing) during task [18, 109], and a positive correlation between higher intelligence and activation of the fronto-parietal (executive function) and cingulo-opercular (task set maintenance) networks [18, 68, 106]. RsfMRI differs from task-based fMRI because of the theory that a task activates regions that would otherwise not be active during rest. Assessing whether similar structural observations hold in resting-state fMRI networks, Song et al. [159] find in a small rsfMRI dataset of 59 subjects that functional connectivity strength in the frontal, parietal, occipital, and limbic lobes is correlated with intelligence. Moreover, spontaneous communication between frontal and posterior regions plays an important role. Hearne et al. [73] map results from multiple rsfMRI intelligence studies [124, 147, 158, 159] to the topological characterization from Gordon [67]. Aside from regions in the P-FIT system (fronto-parietal, cingulo-opercular-parietal, and dorsal-ventral regions under the Gordon parcellation), they report that these studies identify common intelligence-related connectivity that extends to features in the dorsal attention, default mode, fronto-parietal, and visual networks. In their own experiments to learn the relationship between intelligence and networks on the larger HCP dataset, Hearne et al. consider PMAT and Picture Vocabulary Test scores as measures of intelligence.
RsfMRI networks are characterized by network-based statistics [195], and inter-network connections between the default mode and fronto-parietal regions, as well as regions in the cortex not associated with any other network, have been found to be significantly intelligence-related (bilateral superior medial frontal cortex, superior orbital gyrus and temporal cortex, middle cingulate cortex, and right middle frontal and supramarginal gyri). They find that higher connectivity in these regions correlates positively with intelligence; no other networks show statistical significance, nor are there associations with decreased rsfMRI connectivity. This result is at odds with the conclusions of Basten [18] but agrees with graph theory studies finding that global efficiency is characteristic of higher intelligence [77].

6.2.5 Brain networks and sex

Many works have found supporting evidence that male and female brains are functionally and structurally different. Males have larger brains on average [142], with more gray matter volume in the amygdalae and hippocampi. However, females have larger volume in the right frontal pole, the inferior and middle frontal gyri, and the lateral occipital cortex. In structural connectivity of a same-aged population, males are found to have higher intrahemispheric communication and less interhemispheric connectivity compared to females [87]. This observation is also seen in functional imaging [93, 191]. Because previous experiments had smaller sample sizes, recent studies [71, 135] turn their focus to the UKBiobank dataset, which comprises 5,000+ individuals. At such a large scale, He et al. [71] have shown that the classification of sex using rsfMRI networks over 20-fold cross-validation reaches 91.6% accuracy, regardless of deep or shallow methodology. Ritchie et al.
[135] note that males have higher sensorimotor, visual, and rostral lateral prefrontal connectivity in strength and weighted degree, whereas females have higher connectivity strength in the default mode network, along with higher weighted degree in the temporo-parietal junction, the anterior and medial temporal lobe, and the bilateral posterior and dorsal anterior cingulate cortices.

6.3 Methods

6.3.1 Confound selection

Before diving into how confounds should be controlled for in decoding analysis, we must first consider how to decide whether a demographic or behavioral variable should be treated as a confound. In [53, 71], the variables age, sex, and head motion are taken for granted as confounds for fluid intelligence, without further testing for cross-product interactions between confounds, such as age×sex. By visualizing how confound variables explain each other's variances, Alfaro-Almagro et al. [6] find that there are non-additive and nonlinear effects between confounds. Whether first-order or higher-order (cross-product) confound terms should be controlled for is a combinatorial problem. Here, the philosophy is that variables should only be included as confounds if they are actually predictive of the target Y. The Akaike information criterion (AIC) [5] is used to select the set of first-order confound candidates and cross-terms that best regress to Y: letting L̂ be the maximum likelihood of the model, AIC = 2k − 2 log L̂.

6.3.2 Confound regression (CV2CR)

The suggestion here is to perform confound control on both X and Y, modifying CVCR such that C is regressed out of both sides, X and Y. Regressing C out of both X and Y conditions them both on the given set of confounds. This allows the contributions of the confounds and of the relevant underlying signal to the prediction to be isolated.

CV2CR

As with CVCR, C is a matrix of size n × (p + 1), X is an n × k data matrix, and Y is an n × 1 data vector.
Weights ŵX and ŵY are learned by regressing the confounds C out of X and Y:

ŵXtrain = (Cᵀtrain Ctrain)⁻¹ Cᵀtrain Xtrain,
ŵYtrain = (Cᵀtrain Ctrain)⁻¹ Cᵀtrain Ytrain.

Adjusting the data,

X̃train = Xtrain − Ctrain ŵXtrain,
Ỹtrain = Ytrain − Ctrain ŵYtrain,
Ỹtest = Ytest − Ctest ŵYtrain.

Next, learn the regression of Ỹ on X̃ by first writing X̃ as a design matrix with 1s in the first column, so that X̃ is now an n × (k + 1) matrix. The regression weights are then estimated by

β̂ = (X̃ᵀtrain X̃train)⁻¹ X̃ᵀtrain Ỹtrain.

The prediction for test data is given by predicting from X̃test and adding back the effects of the confounds Ctest:

Ŷ = X̃test β̂ + Ctest ŵYtrain.

6.3.3 HCP data

The rsfMRI features are derived from Human Connectome Project data (HCP S1200). The sample contains subjects with no missing raw fluid intelligence (PMAT24_A_CR) or raw crystallized intelligence (CogCrystalComp_unadj) information, sex-balanced such that there are an equal number of subjects of both sexes (465 males, mean age 27.87 ± 3.65 (SD); 465 females, mean age 29.42 ± 3.62 (SD)). See Van Essen et al. [168] for recruitment information. Resting-state fMRI data is acquired with a 3.0 Tesla Siemens Skyra scanner using a 32-channel head coil. Subjects in the sample participate in four 15-minute image acquisition sessions over 2 days and are asked to lie awake with eyes open. The resting-state fMRI data is parcellated according to the Gordon brain parcellation [67], consisting of 333 regions in the cerebral cortex, the majority of which belong to the 12 Power communities [129]: Auditory, CinguloOpercular, CinguloParietal, Default mode (DMN), Dorsal attention, Frontoparietal, RetrosplenialTemporal, Salience, SMhand, SMmouth, Ventral attention, and Visual. The time series of each region in the parcellation is extracted by averaging across volumes, excluding the first 20 from each scan session.
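Returning to the CV2CR procedure of Section 6.3.2, its fold-level computation can be sketched in NumPy as below. This is a minimal illustration on hypothetical noiseless toy data, not the dissertation's implementation; note how the confound effect ŵY learned on the training fold is added back into the test prediction.

```python
import numpy as np

def cv2cr_fold(C_tr, X_tr, Y_tr, C_te, X_te):
    # Regress confounds out of both X and Y on the training fold.
    wX = np.linalg.lstsq(C_tr, X_tr, rcond=None)[0]
    wY = np.linalg.lstsq(C_tr, Y_tr, rcond=None)[0]
    Xt_tr = X_tr - C_tr @ wX
    Yt_tr = Y_tr - C_tr @ wY
    # Fit OLS of adjusted Y on adjusted X (with an intercept column).
    Xd_tr = np.column_stack([np.ones(len(Xt_tr)), Xt_tr])
    beta = np.linalg.lstsq(Xd_tr, Yt_tr, rcond=None)[0]
    # Predict on adjusted test inputs, then add the confound effect back in:
    # Yhat = X~_test @ beta + C_test @ wY_train.
    Xt_te = X_te - C_te @ wX
    Xd_te = np.column_stack([np.ones(len(Xt_te)), Xt_te])
    return Xd_te @ beta + C_te @ wY

# Hypothetical noiseless toy data: X and Y share a signal Z and a confound c.
rng = np.random.default_rng(0)
n = 100
Z = rng.normal(size=n)
c = rng.normal(size=n)
C = np.column_stack([np.ones(n), c])  # confound design matrix with intercept
X = (Z + c).reshape(-1, 1)
Y = Z + c
pred = cv2cr_fold(C[:80], X[:80], Y[:80], C[80:], X[80:])
```

In this noiseless case the test predictions recover Y exactly, since the same confound weights are removed from X and added back through ŵY.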
A FIX ICA cleaning procedure and a general linear regression process that detrends 12 head motion signals from the image data are applied. Volumes before and after head motion displacement of more than 0.2 mm are removed from the data after linear detrending of residuals. The resting-state functional connectivity (FC) for each subject is described by a 333 × 333 matrix of Fisher-transformed pairwise Pearson correlation coefficients of the time series resulting from the Gordon parcellation and preprocessing [67]. The matrices are symmetric, yielding d = 55,278 features. FC is averaged across all four sessions per subject. The HCP contains 280 non-imaging subject measures, including demographic and behavioral phenotypes. Of those, the target variables of interest to this study are PMAT24_A_CR (mean = 17, stdev = 4.69) and CogCrystalComp_unadj (mean = 118.15, stdev = 9.76). The confounds considered are age (in the range 22 to 37), sex (male = 0, female = 1), and head motion (mean = 0.84, stdev = 0.39), as used in [53, 71, 108]. Figure 6.2 shows histograms of the target and confound variables in each sex group. Note that there are small, but statistically significant, differences in the intelligence distributions between male and female groups.

6.4 Results

6.4.1 Synthetic experiment

To show that regressing confounds out of both X and Y is ideal, the proposed CV2CR was compared to alternatives on a synthetic dataset. Both X and Y were assumed to be dependent on a confound C. The synthetic dataset was designed such that X₁, the first dimension of X, and Y were linear combinations of the unobserved true signal Z and the observed confound C, which was specified as univariate in this experiment. The rest of the dimensions of X were confound and noise:

X₁ = a₁Z + a₂C + e_X,
X₂…ₖ ∼ N(0, e_X I) + f(C),
Y = b₁Z + b₂C + e_Y.

With this model, the correct set of weights for the regression Y ∼ X with the confound controlled for is (b₁/a₁, 0, …, 0).
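The generative model above can be sketched as a small NumPy routine. This is a hedged reconstruction of the synthetic setup (the function and argument names are mine); f(C) is drawn as a random linear map of the confound, as described in the comparison that follows.

```python
import numpy as np

def make_confounded_dataset(n=100, k=25, a1=1.0, a2=1.0, b1=1.0, b2=1.0,
                            noise=0.5, seed=0):
    # X1 = a1*Z + a2*C + e_X; remaining dims are noise plus f(C);
    # Y = b1*Z + b2*C + e_Y.
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=n)            # unobserved true signal
    C = rng.normal(size=n)            # observed univariate confound
    f_w = rng.normal(size=k - 1)      # f(C): a random linear map of C
    X = np.empty((n, k))
    X[:, 0] = a1 * Z + a2 * C + rng.normal(scale=noise, size=n)
    X[:, 1:] = rng.normal(scale=noise, size=(n, k - 1)) + np.outer(C, f_w)
    Y = b1 * Z + b2 * C + rng.normal(scale=noise, size=n)
    true_beta = np.zeros(k)
    true_beta[0] = b1 / a1            # confound-controlled weights (b1/a1, 0, ..., 0)
    return X, Y, C, Z, true_beta

X, Y, C, Z, true_beta = make_confounded_dataset()
```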
The different methods of confound control were compared by 1) regressing C out of X only (indicated by Y ∼ X′), 2) out of Y only (Y′ ∼ X), and 3) out of both X and Y (Y′ ∼ X′). The metrics used for comparison are 1) regression accuracy, reported as the coefficient of determination (COD) computed from predictions across cross-validation folds, and 2) the correlation of the average estimated coefficients (from all cross-validation folds) to the true set of weights (b₁/a₁, 0, …, 0). Confound control performance was compared over 20-fold cross-validation by generating 100 random datasets of sample size n = 100, with k = 25. Both the true signal Z and the confound C were normally generated, Z, C ∼ N(0, 1), and f(C) was a linear combination of C with random normal weights in N(0, 1). Keeping the noise to a minimum, e_X, e_Y ∼ N(0, 0.5I), the parameters a₁, a₂, b₁, b₂ were varied individually, one at a time with the others fixed to 1, to see how these parameters affect prediction. The Y ∼ X′ regressions were carried out by WDCR and CVCR. For Y′ ∼ X and Y′ ∼ X′, confound regression out of Y was trained on the training set within the 20 folds. At test time, the effect of the confounds was predicted and added back in (see Section 6.3.2). The estimated regression weights were averaged across the 20 folds for comparison against (b₁/a₁, 0, …, 0).

Figure 6.3 shows boxplots of the COD from the 100 random datasets. First, in all cases WDCR predicted worse than CVCR, as reported by Snoek et al. [157], but CVCR still showed accuracy deflation across all combinations of parameters. Y ∼ X, Y′ ∼ X, and the suggested method of confound control, CV2CR, predicted much better than CVCR and WDCR across all tested combinations of parameters. When a₂ was large, CV2CR predicted even better than Y′ ∼ X and Y ∼ X. Looking at the correlation of the estimated coefficients to the true weights (b₁/a₁, 0, …, 0), CVCR and CV2CR were best at estimating the weights overall. When b₂ was very high relative to all other parameters, doing no confound control failed to recover the true weights; hence, confound control is necessary for interpretation. When a₂ was very high with the other parameters fixed, Y′ ∼ X was not capable of recovering the true weights. The reason why Y′ ∼ X estimated the weights so poorly can be demonstrated with a simple example in which a₂ is large relative to a₁ and f(C). Suppose k = 2, X₁ = 1Z + a₂C in the first dimension, X₂ = f(C) = −C in the second dimension, and Y = 1Z + 0C. Y can be perfectly predicted from X with the regression weights β̂ = (1, a₂), but this does not recover the confound-controlled true weights of (1, 0). When a₂ is larger than 1, by the metric of checking weights correlation, this results in a negative weights correlation. To see why Y′ ∼ X′ is best at prediction accuracy, consider the univariate model X = a₁Z + a₂C, Y = b₁Z + b₂C, with a₁ ≠ a₂ ≠ b₁ ≠ b₂ ≠ 0. The only case in which Y can be perfectly predicted with a COD of 1 is Y′ ∼ X′; all other methods introduce error into the prediction. In this synthetic model, the only confound control method capable of both accurate predictions and correct weights estimation is CV2CR.

6.4.2 HCP experiment

6.4.2.1 Regression method

Resting-state fMRI analysis suffers from high dimensionality coupled with relatively low population sample size. Although many methods have been proposed for rsfMRI regression and classification problems, the analysis of Schulz et al. [148] argues that the benefits of nonlinear methods are not realized for neuroimaging datasets at the scale of fewer than 10,000 sample points. The current state of the art in using rsfMRI for prediction of behavioral phenotypes is kernel ridge regression (KRR) using a linear kernel (the Pearson correlation between n samples [71]).
However, a kernel regression method makes the model hard to interpret, especially when confounds must also be dealt with. Principal component regression (PCAR), a related regression method, was used instead. With a linear kernel, KRR is equivalent to non-kernel ridge regression (RR): the estimated KRR weights, β̂ = Xᵀ(XXᵀ + αI)⁻¹Y, are equal to the RR weights β̂ = (XᵀX + αI)⁻¹XᵀY, where α is a tunable parameter. There is also a relationship between (K)RR and PCAR [70]. In ridge regression, the predicted target variable is Ŷ = Xβ̂ = X(XᵀX + αI)⁻¹XᵀY. Let X have the SVD X = UΣVᵀ, with singular values σᵢ. Then Ŷ = U diag(σᵢ²/(σᵢ² + α)) UᵀY, with increasing penalty on smaller singular values. PCAR, on the other hand, zeroes out singular values below a threshold, such that Ŷ = U diag(1, …, 1, 0, …, 0) UᵀY. This is the same as ridge regression with α = 0 (no shrinkage) on the retained dimensions and α = ∞ on the dropped dimensions. The rsfMRI data from the HCP were used to predict raw fluid and crystallized intelligence scores, and the performance of PCAR and KRR was compared over 20-fold cross-validation. Within each training fold, the α parameter in KRR and the number of dimensions k used in PCAR were cross-validated. The purpose here was to establish that the two regression methods perform similarly; hence there was no confound control in this experiment. For fluid intelligence, KRR predicted with an accuracy of 0.342, similar to what is reported in [71]; PCAR predicted slightly lower at 0.285. For crystallized intelligence, KRR had a prediction correlation of 0.367, and PCAR a prediction correlation of 0.380. While PCAR underperformed on predicting fluid intelligence, it compared favorably in predicting crystallized intelligence, and the estimated model is simple to interpret. PCAR can also be used for classification, via logistic regression in a generalized linear model (PCALR).
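The ridge/PCAR relationship stated above can be checked numerically. The sketch below is illustrative only (it skips the column-centering a practical PCA regression would apply): with all components retained, the PCAR fit coincides with ridge regression in the limit α → 0, and larger α shrinks the fitted values.

```python
import numpy as np

def ridge_fitted(X, Y, alpha):
    # Ridge fitted values via the SVD: Yhat = U diag(s^2 / (s^2 + alpha)) U^T Y.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U @ ((s**2 / (s**2 + alpha)) * (U.T @ Y))

def pcar_fitted(X, Y, n_components):
    # PCAR fitted values: keep the top singular directions, zero out the rest,
    # i.e. diag(1, ..., 1, 0, ..., 0) in place of the ridge shrinkage factors.
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    keep = np.zeros_like(s)
    keep[:n_components] = 1.0
    return U @ (keep * (U.T @ Y))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
Y = rng.normal(size=50)
full_pcar = pcar_fitted(X, Y, 8)        # all components retained
near_ols = ridge_fitted(X, Y, 1e-10)    # ridge with alpha -> 0
```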
Cross-validating over 20 folds, with a search within each fold for the dimension k, PCALR predicts sex from HCP resting-state functional connectivity with 88.3% accuracy. He et al. [71] have reported a 91.6% sex classification accuracy on the larger UKBiobank dataset with 10,000 subjects. Figure 6.4 shows the dataset projected onto the first and second PCA dimensions, with blue for male and red for female subjects.

6.4.2.2 Intelligence and confounds: difference between male and female groups

For the rest of the experiments using real data, confounding information was considered and PCAR was used for the regression analysis. The variables commonly considered as confounds when predicting intelligence are age, sex, and head motion (HM) [53, 71, 108]. In many of the works aiming to predict intelligence, whether these variables actually have an effect on intelligence prediction had not been confirmed. First, it was investigated whether these variables alone had any predictivity of the fluid and crystallized intelligence variables. Using the whole sample dataset, PMAT24_A_CR and CogCrystalComp_unadj were regressed against age, sex, and HM. Sex was coded as a binary variable, with 1 for females and 0 for males. Separately, the cross-term interactions age×sex and HM×sex were considered. With a binary variable, the presence of the interaction term indicates that the male and female groups have different regression models of age and head motion toward the target; this tests whether a single model or separate models are preferable. Combining all these terms together leads to the combinatorial problem of selecting the set of first-order and interaction terms that best models the relationship between confounds and target variable. AIC was used to select the top predicting combination of variables; see Table 6.1. The significance of each of these confound variables was tested, with the null hypothesis that the confound variable has no correlation to the dependent variable.
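The AIC-based confound selection described above can be sketched as an exhaustive search over subsets of candidate terms. This is a minimal illustration with a hypothetical target (the variable names are placeholders, not the HCP measures); the Gaussian OLS log-likelihood is plugged into AIC = 2k − 2 log L̂.

```python
import numpy as np
from itertools import combinations

def aic_linear(Xd, y):
    # AIC for a Gaussian OLS model with design matrix Xd (intercept included).
    n, k = Xd.shape
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    rss = np.sum((y - Xd @ beta) ** 2)
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)  # Gaussian MLE log-likelihood
    return 2 * (k + 1) - 2 * loglik  # +1 parameter for the noise variance

def select_confounds(terms, y):
    # Score every non-empty subset of candidate terms; return the lowest-AIC subset.
    n = len(y)
    best = (np.inf, ())
    for r in range(1, len(terms) + 1):
        for subset in combinations(terms, r):
            Xd = np.column_stack([np.ones(n)] + [terms[t] for t in subset])
            best = min(best, (aic_linear(Xd, y), subset))
    return best

# Hypothetical demo: y depends on age and the age*sex cross-term, but not on hm.
rng = np.random.default_rng(3)
n = 300
age = rng.normal(size=n)
sex = rng.integers(0, 2, size=n).astype(float)
hm = rng.normal(size=n)
y = age + age * sex + 0.3 * rng.normal(size=n)
terms = {"age": age, "sex": sex, "hm": hm, "age*sex": age * sex}
best_aic, best_set = select_confounds(terms, y)
```

With p candidate terms there are 2^p − 1 subsets, so an exhaustive search is only practical for a handful of confounds, which matches the setting here (age, sex, HM, and their cross-terms).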
The results are shown in Table 6.2. Age, sex, and head motion (HM) were all significant variables predictive of PMAT24_A_CR and CogCrystalComp_unadj. Interestingly, age and head motion were more significant in the crystallized intelligence regression than in the fluid intelligence regression. When the interaction term age×sex was introduced, all of the confounds were still significant in the prediction of crystallized intelligence, but only head motion remained significant for fluid intelligence prediction. To test that functional connectivity was predictive of the target variables beyond the effects of the confounds, the prediction accuracy of PMAT24_A_CR and CogCrystalComp_unadj using functional connectivity input was compared against the prediction accuracy using only confounds as inputs. In Table 6.3, prediction from rsfMRI connectivity input had much higher accuracy than using age/sex/HM or age/sex/HM/age×sex as input. Also consistent with what was determined from the earlier confound selection by AIC, the inputs age/sex/HM/age×sex predicted crystallized intelligence better than age/sex/HM alone.

6.4.2.3 Prediction accuracy of confound control methods

To see whether CV2CR could actually increase accuracy as in the synthetic experiment, comparisons were made across all permutations of whether confounds should be regressed out of X or Y, with and without the age×sex interaction term, in the prediction of PMAT24_A_CR and CogCrystalComp_unadj using PCAR. The interaction between sex and brain features (brain×sex) in the prediction of intelligence was also considered: the prediction performance of a single model (M+F) on the combined male and female groups was compared with prediction by separate models trained individually on the male and female groups (M, F). The latter is equivalent to the brain×sex interaction. Confound adjustments on X and Y are indicated by X′ and Y′. The confounds were regressed out prior to the PCA regression to ensure interpretability of the results.
For the Y′ adjustment, after regression prediction, the contribution of the confounds to the prediction was added back in for accuracy evaluation. Because the confounds alone contributed significantly to the prediction of Y, it was expected that this should increase prediction accuracy over using just the confound-adjusted X suggested by Snoek et al. This was tested over 20-fold cross-validation, with nested 3-fold cross-validation to determine the best dimensionality. See Tables 6.4 and 6.5 (columns under M+F and M,F) for the PMAT24_A_CR and CogCrystalComp_unadj prediction results. Figure 6.5 shows the visualization for the best predicting method. For crystallized intelligence, which was earlier determined to be highly dependent on confounds including age×sex (Tables 6.1 and 6.2), the best prediction using confound control was with the proposed procedure (X′, Y′), with (X, Y′) a very close second. Regardless of the method of confound control (or none), the female group was much more predictable than the male group for crystallized intelligence. For fluid intelligence, the proposed (X′, Y′) confound control also improved the prediction over (X′, Y) and (X, Y′), though including an interaction term did not help, because it was not determined to be significant for fluid intelligence (Table 6.2). Only (M+F) reliably gave predictions with a positive coefficient of determination in both cases. Across the 20 folds, the best dimensionality was approximately 40 for fluid intelligence and approximately 90 for crystallized intelligence. Having individual models for the male and female groups (M, F) led to poorer results. To test whether the (M, F) model suffered from the reduced number of training samples, the dataset was reduced by half as input to the best (M+F) model (see the (M+F½) column in Tables 6.4 and 6.5). Reducing the dataset decreased the accuracy of the predictions, but not by much.
The (M+F½) model still outperformed the (M, F) model in predicting fluid intelligence. Figure 6.5 shows a few (predominantly male) subjects that were outliers in their true crystallized intelligence scores. The prediction was repeated with only subjects having crystallized intelligence scores below 135, to see whether the outliers were responsible for the discrepancy in predictability between sexes. Prediction accuracy actually decreased across the board, but (X′, Y′) with the age×sex interaction still predicted the best of the methods tested (Table 6.6). The experiment was also run across all PCA dimensions to verify that the results were not due to a fixed dimensionality. In Figure 6.6, the black lines indicate the COD on the M+F dataset. Individual CODs for the male and female subjects from the M+F model were also computed, plotted in blue and red respectively. The difference between male and female accuracies was larger for crystallized intelligence than for fluid intelligence. Confound adjustment consistently reduced the gap in predictability of raw crystallized intelligence between males and females across dimensions. A permutation test was used to verify the significance of this difference in the predictability of CogCrystalComp_unadj between sexes. The PCAR dimensionality was set to 100, which was best for predicting CogCrystalComp_unadj. The prediction difference (COD_female − COD_male) from 20-fold cross-validation was computed. For each of 1000 permutations, the sex labels were randomly scrambled and the new difference COD′_female − COD′_male was recomputed. The p-value was 0.050 for X′Y′ confound control with the age×sex term (YYY in accordance with the column labeling from Table 6.5), and 0.042 for no confound control (NNN). Hence, the difference in predictability of crystallized intelligence between sexes was significant.
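The permutation test above can be sketched as follows. This is a schematic reimplementation on hypothetical synthetic predictions (not the HCP results): the sex labels are shuffled and the COD gap recomputed to form the null distribution.

```python
import numpy as np

def cod(y_true, y_pred):
    # Coefficient of determination (R^2) of a set of predictions.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def sex_gap_pvalue(y_true, y_pred, is_female, n_perm=1000, seed=0):
    # Permutation test for COD_female - COD_male: shuffle the sex labels and
    # count how often the permuted gap is at least as large as the observed one.
    rng = np.random.default_rng(seed)
    def gap(labels):
        return cod(y_true[labels], y_pred[labels]) - cod(y_true[~labels], y_pred[~labels])
    observed = gap(is_female)
    hits = sum(gap(rng.permutation(is_female)) >= observed for _ in range(n_perm))
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data where predictions are accurate for one group only.
rng = np.random.default_rng(7)
n = 200
is_female = np.arange(n) < 100
y_true = rng.normal(size=n)
noise_sd = np.where(is_female, 0.1, 2.0)       # noisy predictions for the male group
y_pred = y_true + noise_sd * rng.normal(size=n)
obs_gap, p_value = sex_gap_pvalue(y_true, y_pred, is_female, n_perm=199)
```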
Repeating this for PMAT24_A_CR, the p-values were 0.732 and 0.767 for the NYY and NNN settings, which were not significant.

6.4.2.4 “Knocking out” individual subnetworks

To see whether individual subnetworks in the connectivity data were responsible for the discrepancy in crystallized intelligence prediction accuracy, we repeated the 20-fold cross-validation prediction experiment using the (X′, Y′) setting, with subnetwork features knocked out one by one. However, the discrepancy between sexes remained, as seen across the board in Table 6.7. This indicates that the brain network differences between male and female groups were distributed globally throughout the brain network topology.

6.4.3 rsfMRI network visualization

6.4.3.1 Sex weights

PCA logistic regression classified sex from HCP rsfMRI functional connectivity features with 88.3% accuracy over 20-fold cross-validation (Section 6.4.2.1). The 100 logistic regression features with the largest absolute weights driving the sex classification are visualized in Figure 6.7. Blue connections classify toward male and red connections toward female. These observations agreed with those of Ritchie et al. [135], with higher connectivity strength within the default mode network characteristic of the female group, and higher intranetwork connectivity in the visual and sensorimotor mouth networks in the male group. There was also strong cinguloopercular-DMN and sensorimotor hand network connectivity that drove female classification, which corresponded with Ritchie’s observation of high female connectivity in the temporo-parietal junction and bilateral posterior cortices.

6.4.3.2 Fluid intelligence

Figure 6.8 shows the 100 connectivity features with the highest PCA regression coefficients for predicting PMAT24_A_CR, using CV2CR for confound control. There were few differences overall between using age×sex in the confound control and not.
The main subnetworks involved were the visual, default mode, dorsal attention, and frontoparietal subnetworks. In these subnetworks, higher connectivity in the top features was associated with higher fluid intelligence. Strong internetwork connections between the DMN, cinguloopercular, and frontoparietal subnetworks were positively correlated with fluid intelligence. These subnetworks are regions under the P-FIT model. Interestingly, intranetwork connectivity was the main driver of fluid intelligence prediction in the visual subnetwork. The results supported the findings from [124, 147, 159]. Between using the age×sex interaction and not using it during confound control, the biggest difference was that DMN-frontoparietal connectivity was emphasized more in the age×sex interaction model, and DorsalAttn-Visual connectivity less.

6.4.3.3 Crystallized intelligence

From Figure 6.9, internetwork and intranetwork connectivity in the frontoparietal, DMN, and cinguloopercular subnetworks, which are important for fluid intelligence, were also positively associated with crystallized intelligence. The dorsal and ventral attention networks played a bigger role in crystallized intelligence prediction. Within-network connections in these two subnetworks were positively correlated with crystallized intelligence, whereas an internetwork connection between the two subnetworks was prominently negatively associated. Unlike in the regression to fluid intelligence, for the regression to crystallized intelligence the visual subnetwork had a mix of positively- and negatively-correlated within-network connections. There were also connections from ventral attention to Gordon regions not assigned to any Power communities [129] that were found to be predictive of crystallized intelligence. Hearne et al. [73] also note such connectivity in their analysis of intelligence (measured by PMAT and picture vocabulary) in the HCP dataset.
Here, the crystallized intelligence score was the average of the raw Picture Vocabulary and Reading tests. Between including and excluding the age×sex interaction in confound control, the biggest difference was that including the interaction term understated the importance of connectivity within the sensorimotor mouth network (which was not highly weighted in either model) and emphasized DMN-cinguloopercular and within-DMN connectivity.

6.4.3.4 Difference between male and female in crystallized intelligence

Figure 6.10 shows models trained individually on the male and female groups (M,F). Qualitatively, the model trained on the male group was more similar to the combined model (M+F). In the female model, within-network cinguloopercular connectivity played a bigger role, and the frontoparietal-to-cinguloopercular connection was weaker. Frontoparietal-to-DMN had a relatively prominent negatively associated connection that did not show up in the male model, which was the biggest difference between the two models.

6.5 Discussion

This chapter phrases the problem of confound control as a graphical model of input X and target variable Y with dependency on confounds C, to demonstrate that confounds should be controlled for out of both X and Y. Here confound control was applied to linear regression for prediction; the plan is to examine its effects on nonlinear methods in the future. Through synthetic experiments, this chapter has shown that how much of X and Y is composed of confound signal versus signal of interest affects the prediction accuracy and interpretability of the various methods of confound control. Consistently across different parameter settings, CV2CR did not suffer from the prediction accuracy deflation seen in WDCR and CVCR. Even though Y ∼ X and Y′ ∼ X predicted accurately when confounds were a small part of X, prediction accuracy for these methods degraded when confounds carried a large weight in X. Moreover, their estimated model weight coefficients were unreliable for decoding studies.
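To make the parameter dependence concrete, the following toy generator (an illustrative assumption, not the exact model of Section 6.4.1) mixes a latent signal S and a confound C into X and Y with an adjustable confound weight:

```python
import numpy as np

def generate_confounded_data(n=500, p=20, confound_weight=0.5, seed=0):
    """Toy generator: X mixes a latent signal S and a confound C in
    proportion `confound_weight`; Y depends on both as well. All
    parameter choices here are illustrative."""
    rng = np.random.default_rng(seed)
    C = rng.normal(size=(n, 1))          # confound (e.g., age)
    S = rng.normal(size=(n, p))          # signal of interest
    wx = rng.normal(size=(p, p))         # signal -> feature map
    X = (1 - confound_weight) * (S @ wx) + confound_weight * C \
        + 0.1 * rng.normal(size=(n, p))
    wy = rng.normal(size=p)
    y = S @ wy + C[:, 0] + 0.1 * rng.normal(size=n)
    return X, y, C
```

Sweeping `confound_weight` toward 1 reproduces the regime where the confound dominates X, which is where the synthetic experiments showed the largest differences between confound-control schemes.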
Of the methods tested, CV2CR was preferable for its accuracy in both prediction and weight estimation uniformly across parameters. In the real data experiment on the HCP, CV2CR predicted well on both fluid intelligence and crystallized intelligence, with higher prediction accuracy than Y′ ∼ X and Y ∼ X. If the synthetic model presented here applies to the HCP data, then this suggests that the confounds had a higher magnitude of signal relative to the signal of interest (intelligence) in the rsfMRI input. The high sex classification accuracy of 88.3% supported this idea; determining the relative weights between true signal and confound is left to future work. As shown, non-additive interactions between confounds can be significantly predictive of the target variable and should be considered in decoding studies when controlling for confounds. The number of possible linear combinations of confound variables and their interactions grows combinatorially, so it is necessary to have a way to filter for the most predictive candidates. From the experiments with fluid and crystallized intelligence prediction in the HCP, the Akaike information criterion is an effective way to select a predictive set of confounds with interactions, and it works well with CV2CR to predict accurately while controlling for confounds.

Figure 6.1. Confound dependency. Confounds C affect both rsfMRI features X and target variables Y. Confound control is to learn the relationship between X and Y conditional on C.
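The AIC-driven selection over confound subsets can be sketched with ordinary least squares (a hedged numpy sketch; the Gaussian AIC here is defined up to an additive constant, and the helper names are illustrative):

```python
import numpy as np
from itertools import combinations

def ols_aic(X, y):
    """Gaussian AIC (up to an additive constant) for an OLS fit."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (k + 1)   # +1 for noise variance

def select_confounds(columns, y):
    """columns: dict name -> 1-D array of candidate confound terms
    (including interaction terms such as age*sex). Returns all subsets
    ranked by AIC, best first."""
    names = list(columns)
    n = len(y)
    results = []
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            X = np.column_stack([np.ones(n)] +
                                [columns[c] for c in subset])
            results.append((ols_aic(X, y), subset))
    return sorted(results)
```

Exhaustive enumeration is feasible here because only a handful of confounds and interactions are considered; with many candidates, a stepwise search would be needed instead.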
Figure 6.2. Histograms of target and confound variables in the HCP dataset. Two-tailed t-tests show that the male and female groups in the HCP dataset significantly differ in distribution for the age, fluid intelligence, and crystallized intelligence variables.

| Variable | Male Average | Female Average | t-statistic | p-value |
| Age | 27.85 | 29.47 | -6.765 | <0.001 |
| Head Motion (HM) | 0.832 | 0.838 | -0.239 | 0.811 |
| Fluid intelligence | 17.748 | 16.464 | 4.178 | <0.001 |
| Crystallized intelligence | 119.467 | 117.132 | 3.615 | <0.001 |

Figure 6.3. Generated data experiment. Synthetic data generated from the model in Section 6.4.1 and predicted using 20-fold cross-validation is used to show how different methods of confound control affect prediction accuracy and weight estimation under various parameters. In all cases a)-f), WDCR and CVCR predict worse than CV2CR. When the confound is a major source of signal in X, i.e., e) and f), CV2CR outperforms Y ∼ X and Y′ ∼ X in both prediction COD and estimation of weights.

Figure 6.4. HCP dataset projected onto the top PCA dimensions. First and second PCA dimensions of the HCP rsfMRI dataset. Male subjects in blue and female subjects in red.

Figure 6.5. Predicted vs. true graphs. Comparison of true vs. predicted plots of no confound control and the best confound control models from Table 6.6. Blue points are male and red points are female subjects.
Figure 6.6. PCA dimensions. How the predictions would be across PCA dimensions if there were no inner-loop cross-validation to determine dimensionality, for fluid and crystallized intelligence prediction with and without the age×sex interaction (confound-control schemes X′Y′, XY, X′Y, XY′). Black lines are overall (M+F) accuracy, while blue and red lines are accuracy for the male and female groups respectively. The dashed and dotted lines signify different confound regression schemes. We observed a large gap in accuracy between male and female groups for crystallized intelligence.

Figure 6.7. Male and female rsfMRI network differences. [Connectogram over the subnetworks Auditory, FrontoParietal, Default, DorsalAttn, CinguloParietal, SMhand, Visual, VentralAttn, SMmouth, CinguloOperc, and None.]

Figure 6.8. Fluid intelligence network features, (X′,Y′): (A) without age×sex interaction; (B) with age×sex interaction.

Figure 6.9. Crystallized intelligence network features, (X′,Y′): (A) without age×sex interaction; (B) with age×sex interaction.

Figure 6.10. Crystallized intelligence models trained per sex group.

Table 6.1. AIC selection of confounds.
Top regressions of “confound variables” to PMAT24_A_CR:

| Model | AIC |
| age + HM + sex | 5403.936 |
| age + HM + age×sex | 5404.348 |
| HM + age×sex | 5404.904 |
| age + HM + sex + HM×sex | 5405.209 |
| age + HM + age×sex + age×sex | 5405.889 |

Top regressions of “confound variables” to CogCrystalComp_unadj:

| Model | AIC |
| age + HM + sex + age×sex | 6724.334 |
| age + HM + sex + age×sex + HM×sex | 6726.102 |
| age + sex + age×sex + HM×sex | 6727.879 |
| age + HM + age×sex | 6730.382 |
| age + sex + age×sex | 6732.045 |

Predicting fluid and crystallized intelligence target variables from combinations of potential confound variables and their cross-term interactions. The top model for fluid intelligence selected by the Akaike information criterion was the linear combination of age, head motion, and sex, but the top model for crystallized intelligence involved the addition of the cross-term interaction age×sex.

Table 6.2. Significant confound variables to fluid and crystallized intelligence.

| Y | Input | Variable | P-value |
| PMAT24_A_CR | age + HM + sex | age | * |
| | | sex | *** |
| | | HM | * |
| | age + HM + sex + age×sex | age | . |
| | | sex | >0.1 |
| | | age×sex | >0.1 |
| | | HM | * |
| CogCrystalComp_unadj | age + HM + sex | age | ** |
| | | sex | *** |
| | | HM | *** |
| | age + HM + sex + age×sex | age | *** |
| | | sex | * |
| | | age×sex | ** |
| | | HM | ** |

HM indicates head motion. P-value significance codes are: (***): <0.001, (**): <0.01, (*): <0.05, (.): <0.1.

Table 6.3. Prediction accuracy of target variables Y by confound inputs only and by rsfMRI connectivity.

| Y | Input | COD | Corr | RMSE |
| PMAT24_A_CR | age/sex/HM | 0.021 | 0.149 | 4.647 |
| PMAT24_A_CR | age/sex/HM/age×sex | 0.019 | 0.142 | 4.654 |
| PMAT24_A_CR | rsfMRI connectivity | 0.075 | 0.285 | 4.515 |
| CogCrystalComp_unadj | age/sex/HM | 0.026 | 0.164 | 9.745 |
| CogCrystalComp_unadj | age/sex/HM/age×sex | 0.033 | 0.184 | 9.712 |
| CogCrystalComp_unadj | rsfMRI connectivity | 0.135 | 0.380 | 9.072 |

Table 6.4. 20-fold prediction accuracy of fluid intelligence.
(M+F)

| age×sex | Y | Y | Y | N | N | N | N |
| X′ | Y | Y | N | Y | Y | N | N |
| Y′ | Y | N | Y | Y | N | Y | N |
| COD | 0.067 | 0.050 | 0.061 | 0.077 | 0.051 | 0.062 | 0.075 |
| COD Male | 0.043 | 0.016 | 0.035 | 0.048 | 0.015 | 0.038 | 0.043 |
| COD Female | 0.056 | 0.047 | 0.049 | 0.070 | 0.049 | 0.049 | 0.071 |
| Cor | 0.272 | 0.237 | 0.269 | 0.290 | 0.238 | 0.271 | 0.285 |
| Cor Male | 0.236 | 0.225 | 0.226 | 0.245 | 0.222 | 0.231 | 0.238 |
| Cor Female | 0.247 | 0.262 | 0.241 | 0.272 | 0.267 | 0.242 | 0.275 |
| RMSE | 4.533 | 4.575 | 4.550 | 4.509 | 4.572 | 4.546 | 4.515 |
| RMSE Male | 4.462 | 4.523 | 4.480 | 4.449 | 4.525 | 4.473 | 4.462 |
| RMSE Female | 4.604 | 4.625 | 4.618 | 4.568 | 4.619 | 4.619 | 4.567 |

(M,F and M+F½)

| | M,F | M,F | M,F | M+F½ |
| age×sex | Y | N | N | N |
| X′ | Y | Y | N | Y |
| Y′ | Y | Y | N | Y |
| COD | -0.010 | -0.026 | -0.120 | 0.061 |
| COD Male | -0.033 | -0.098 | -0.196 | 0.041 |
| COD Female | -0.026 | -0.023 | -0.168 | 0.039 |
| Cor | 0.231 | 0.224 | 0.173 | 0.272 |
| Cor Male | 0.188 | 0.136 | 0.091 | 0.239 |
| Cor Female | 0.215 | 0.228 | 0.136 | 0.234 |
| RMSE | 4.717 | 4.981 | 4.699 | 4.644 |
| RMSE Male | 4.635 | 4.923 | 4.665 | 4.416 |
| RMSE Female | 4.798 | 5.035 | 4.732 | 4.854 |

Accuracy of predictions of PMAT24_A_CR using various combinations of confound regression out of X and Y, indicated by X′ and Y′. (M+F) means the model was trained on the whole dataset with male and female subjects together. (M,F) means the regression was trained individually on male and female subjects, which is the same as considering the interaction between brain×sex. Because the sample size was effectively reduced by half when training separately by sex, a comparison was made against (M+F) with subjects reduced by half, labelled M+F½. For fluid intelligence, X′ and Y′ confound control trained on (M+F) led to the highest accuracy (coefficient of determination) on the whole dataset. The confound set of (age, sex, HM) was most predictive.

Table 6.5. 20-fold prediction accuracy of crystallized intelligence.
(M+F)

| age×sex | Y | Y | Y | N | N | N | N |
| X′ | Y | Y | N | Y | Y | N | N |
| Y′ | Y | N | Y | Y | N | Y | N |
| COD | 0.157 | 0.117 | 0.155 | 0.148 | 0.109 | 0.137 | 0.135 |
| COD Male | 0.092 | 0.041 | 0.095 | 0.095 | 0.029 | 0.073 | 0.054 |
| COD Female | 0.201 | 0.173 | 0.193 | 0.177 | 0.169 | 0.180 | 0.198 |
| Cor | 0.409 | 0.351 | 0.409 | 0.399 | 0.341 | 0.387 | 0.380 |
| Cor Male | 0.343 | 0.283 | 0.346 | 0.338 | 0.271 | 0.315 | 0.291 |
| Cor Female | 0.450 | 0.443 | 0.445 | 0.428 | 0.438 | 0.429 | 0.446 |
| RMSE | 8.958 | 9.165 | 8.967 | 9.006 | 9.211 | 9.061 | 9.072 |
| RMSE Male | 9.320 | 9.886 | 9.603 | 9.604 | 9.950 | 9.722 | 9.820 |
| RMSE Female | 8.243 | 8.383 | 8.282 | 8.365 | 8.407 | 8.348 | 8.257 |

(M,F and M+F½)

| | M,F | M,F | M,F | M+F½ |
| age×sex | Y | N | N | N |
| X′ | Y | Y | N | Y |
| Y′ | Y | Y | N | Y |
| COD | 0.097 | 0.087 | 0.075 | 0.109 |
| COD Male | 0.056 | 0.029 | -0.016 | 0.030 |
| COD Female | 0.111 | 0.119 | 0.146 | 0.161 |
| Cor | 0.356 | 0.344 | 0.333 | 0.371 |
| Cor Male | 0.298 | 0.262 | 0.218 | 0.292 |
| Cor Female | 0.381 | 0.387 | 0.410 | 0.413 |
| RMSE | 9.268 | 9.320 | 9.384 | 9.243 |
| RMSE Male | 9.810 | 9.944 | 10.177 | 9.987 |
| RMSE Female | 8.694 | 8.651 | 8.517 | 8.459 |

For prediction of CogCrystalComp_unadj, the best set of confounds was the one considering the interaction: (age, sex, HM, age×sex). X′ and Y′ confound control trained on (M+F) led to the highest accuracy on the whole dataset.

Table 6.6. 20-fold prediction accuracy of CogCrystalComp_unadj for subjects with scores < 135.

(M+F)

| age×sex | Y | Y | Y | N | N | N | N |
| X′ | Y | Y | N | Y | Y | N | N |
| Y′ | Y | N | Y | Y | N | Y | N |
| COD | 0.126 | 0.095 | 0.118 | 0.117 | 0.086 | 0.105 | 0.105 |
| COD Male | 0.072 | 0.026 | 0.059 | 0.052 | 0.017 | 0.066 | 0.022 |
| COD Female | 0.156 | 0.141 | 0.154 | 0.158 | 0.131 | 0.120 | 0.165 |
| Cor | 0.368 | 0.322 | 0.360 | 0.354 | 0.309 | 0.345 | 0.341 |
| Cor Male | 0.320 | 0.258 | 0.304 | 0.282 | 0.254 | 0.302 | 0.251 |
| Cor Female | 0.396 | 0.397 | 0.395 | 0.399 | 0.378 | 0.357 | 0.407 |
| RMSE | 8.348 | 8.491 | 8.385 | 8.390 | 8.536 | 8.447 | 8.466 |
| RMSE Male | 8.570 | 8.780 | 8.632 | 8.661 | 8.821 | 8.600 | 8.800 |
| RMSE Female | 8.119 | 8.192 | 8.130 | 8.109 | 8.242 | 8.292 | 8.077 |

Several subjects were outliers with very high crystallized intelligence scores. The experiment was repeated for subjects with CogCrystalComp_unadj < 135.

Table 6.7. Knockout.
(M+F)

| Network KO | COD | COD Male | COD Female |
| No KO | 0.157 | 0.092 | 0.201 |
| DMN | 0.145 | 0.078 | 0.190 |
| SMhand | 0.138 | 0.071 | 0.184 |
| SMmouth | 0.148 | 0.054 | 0.227 |
| Visual | 0.160 | 0.099 | 0.199 |
| [None] | 0.135 | 0.063 | 0.186 |
| CingPariet | 0.151 | 0.073 | 0.210 |
| RetrospTmp | 0.135 | 0.051 | 0.201 |
| CingOprc | 0.141 | 0.096 | 0.159 |
| VentAttn | 0.115 | 0.043 | 0.166 |
| FrntPariet | 0.135 | 0.075 | 0.171 |
| Auditory | 0.148 | 0.062 | 0.216 |
| Salience | 0.137 | 0.061 | 0.192 |
| DorsAttn | 0.160 | 0.105 | 0.191 |

Individual subnetworks were removed from the input rsfMRI data to see how accuracy would change, with confound control using (X′, Y′). Regions under the Gordon parcellation that were not identified as part of any Power community are labeled [None].

CHAPTER 7

DISCUSSION

This dissertation has covered multiple aspects of network analysis, from the problem of network inference to how networks can be used for making predictions. The methods have a neuroimaging focus, but they can be applied to problems from other domains as well. In the neuroimaging analysis pipeline, these methods are used to extract functional brain networks from rsfMRI data, derive features for the diagnosis of neurological disorders, make predictions of behavior and condition severity, and identify the neurological bases of phenotypes. The domain-specific challenges of working with rsfMRI data are low sample size, high dimensionality, and high levels of noise. The datasets analyzed in this dissertation have sample sizes of approximately 1000 subjects on the high end (Human Connectome Project) and fewer than 100 subjects on the low end (Utah autism data). Compared to data from other domains, resting-state connectivity data is particularly prone to noise because of the indirect way that rsfMRI measures brain activity via the seconds-long BOLD signal and the many unrelated external factors that contribute to the recorded signals. The methods presented have been developed with these considerations in mind.
This chapter discusses the pros and cons of the methods, and lessons learned.

7.1 Summary of work

To recap, a method for extracting a network from multivariate normal data as a Gaussian graphical model was proposed in Chapter 3. A GGM is computed as the precision matrix (the inverse of the sample covariance matrix), from which partial correlations can be derived as measures of direct pairwise association between nodes in the network. Regularization is a necessity for accurately estimating sparse GGMs when there are few data points, but the typical L1-based approaches require tuning parameters for the level of sparsity. The proposed method needs no parameter tuning and enforces the symmetric positive-definite property of valid networks. Applied to real data, the proposed method performed well at estimating the structure of a sparse gene expression network. However, the rsfMRI networks learned with this method did not do well in either classification of autism diagnosis or regression prediction of autism severity score. In Chapters 4 and 5, two different approaches to network representation for prediction were proposed as alternatives to the vectorized correlation matrix. The vectorized correlation matrix is a convenient input format for representing network-object data in existing machine learning methods, but methods trained only on vectorized correlation matrices cannot make use of the networks’ topological and symmetric positive-definite properties. It could also be problematic for interpretation that a machine learning model trained on such data can make predictions on inputs that are not valid networks. Chapter 4 has shown that the topological information has some predictive power in regression of behavioral condition, especially in conjunction with vectorized correlation matrix values through the summation of their kernels.
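The precision-to-partial-correlation step can be sketched as follows, assuming a well-conditioned sample covariance; the dissertation's estimator additionally imposes sparsity and positive-definiteness, which this sketch omits:

```python
import numpy as np

def partial_correlation(data):
    """data: (n_samples, n_nodes). Returns the partial-correlation
    matrix derived from the inverse sample covariance (precision)."""
    cov = np.cov(data, rowvar=False)
    theta = np.linalg.inv(cov)              # precision matrix
    d = np.sqrt(np.diag(theta))
    pcorr = -theta / np.outer(d, d)         # normalize and flip sign
    np.fill_diagonal(pcorr, 1.0)
    return pcorr
```

A zero entry in the precision matrix corresponds to conditional independence of the two nodes given all others, which is what makes sparse GGMs attractive as direct-connectivity networks.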
Chapter 5 has demonstrated that requiring inputs to be symmetric positive-definite matrices on a Riemannian manifold improves prediction over Euclidean methods. One thing to note is that the symmetric positive-definite manifold is the space of covariance matrices, and correlation matrices are a subset of this space with signals standardized to unit variance. With the log-Euclidean metric, the Riemannian data are first projected onto the Euclidean tangent space at the identity, upon which Euclidean machine learning operations occur. Under the Affine-Invariant metric, the base point is optimized. Chapter 6 has proposed a strategy for confound control that is interpretable and does not bias prediction accuracy like other existing methods. Interpretability involves identifying the features with the top weights that contribute to the prediction, an important step toward decoding the neural basis of observed behavior. The synthetic experiments demonstrated that the common approach of regressing confounds out of the dependent variable makes the weights uninterpretable and can bring down the accuracy. Regressing confounds out of the independent variables keeps the weights interpretable but significantly diminishes accuracy in most scenarios. The proposed method is to regress confounds out of both the independent and dependent variables during training, and to add the effect of confounds back in during testing. Another contribution of this work is how to identify the set of potential confound variables, including any non-additive interactions, which has not been discussed in other brain decoding works. The method was used on real data for prediction of fluid and crystallized intelligence, with sex, age, and head motion as confound variables. The proposed method yielded the highest prediction accuracy over cross-validation in comparison to other schemes for confound control.
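A minimal sketch of the log-Euclidean projection: each SPD connectivity matrix is mapped to the tangent space at the identity via the matrix logarithm and then vectorized for ordinary Euclidean learning (function names are illustrative):

```python
import numpy as np

def spd_logm(S):
    """Matrix logarithm of a symmetric positive-definite matrix via
    its eigendecomposition."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_features(mats):
    """mats: (n_subjects, p, p) SPD matrices -> features of shape
    (n_subjects, p*(p+1)/2), the upper triangle of each log-matrix."""
    iu = np.triu_indices(mats.shape[1])
    return np.stack([spd_logm(S)[iu] for S in mats])
```

The resulting feature vectors live in a flat space, so standard L2-regularized linear or logistic regression applies directly, which is the one-parameter simplicity noted below.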
7.2 Discussion of methods

There are a couple of possible reasons why sparse GGMs did not do well as brain network representations for disease prediction. The first is that the edges were not consistently sparse across subjects, which could instead be addressed by a hierarchical group model using a shared prior to control sparse connections consistently for all subjects. The second is that sparse GGM networks may simply not be good input for disease prediction. Kim et al. [101] ran synthetic experiments on testing for group differences in data generated from sparse and nonsparse precision and covariance matrices. They found that partial correlation features perform better when the true covariance matrix is sparse; on the other hand, when the data come from GGM networks that are sparse, correlation matrices give higher power and stability as features. The work of Bullmore and Sporns [23] suggests that brain networks are sparse, exhibiting small-world topology with efficient communication. The proposed method may be accurately estimating rsfMRI networks as GGMs that are genuinely sparse, in which case correlation matrices are the better features for diagnosis prediction. Between the two approaches for regression in Chapters 4 and 5, criteria for comparing the algorithms are accuracy, speed, and ease of use. The Riemannian features appear to be slightly better at autism severity regression than the topological features. Regression with the log-Euclidean transformation also has the benefit of being interpretable, whereas interpretation is much more difficult with TDA using kernel PLS. The fastest method to converge is the Riemannian projection method under the log-Euclidean metric. TDA kPLS becomes more computationally involved with higher-dimensional topological features, and requires parameter selection.
The slowest method is the one using the Affine-Invariant metric, because of the metric's repeated computation of SVDs, an expensive process, so updates take much longer with high-dimensional matrices. The TDA kPLS method is flexible in accepting other types of data by summing kernels, but it has parameters to tune for the contribution of the various kernels toward prediction, whereas the Riemannian method has only one parameter, the level of L2 regularization used in linear and logistic regression. Both approaches are able to handle the amount of noise typical of rsfMRI data: kPLS by projecting the data onto a lower-dimensional subspace, and the Riemannian method through L2 regularization, which compared favorably to other state-of-the-art methods in autism classification.

7.3 Future directions

• Combining kernels under the Riemannian metrics with TDA kernels for classification and regression problems. This can be done by computing a kernel of inner products between all pairs of data on the tangent space and adding this kernel to the TDA kernels. Also, only dimension-0 and dimension-1 topological features have been used in the TDA experiments, but higher-dimensional features could be used in future experiments.

• Investigating confound control behavior beyond linear methods. Only linear prediction methods were used in Chapter 6, both for predictions and for regressing confounds out of the data. A future project is to apply this method of confound control to the Riemannian methods introduced, or to other nonlinear neural network methods, and to compare differences in model interpretation.

• Understanding the gap in predictability between male and female groups in crystallized intelligence. The experiments in Chapter 6 have found that crystallized intelligence is more predictable for the female group than for the male group across dimensionality.
Since crystallized intelligence in the HCP is measured by averaging the scores from several individual tests, the plan is to investigate each test individually.

REFERENCES

[1] Abraham, A., Milham, M.P., Di Martino, A., Craddock, R.C., Samaras, D., Thirion, B., Varoquaux, G.: Deriving Reproducible Biomarkers from Multi-Site Resting-State Data: An Autism-Based Example. NeuroImage 147, 736–745 (2017)

[2] Abraham, A., Dohmatob, E., Thirion, B., Samaras, D., Varoquaux, G.: Extracting Brain Regions from Rest fMRI with Total-Variation Constrained Dictionary Learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 607–615. Springer (2013)

[3] Achard, S., Bullmore, E.: Efficiency and Cost of Economical Brain Functional Networks. PLoS Comput. Biol. 3(2), e17 (2007)

[4] Agosta, F., Pievani, M., Geroldi, C., Copetti, M., Frisoni, G.B., Filippi, M.: Resting State FMRI in Alzheimer’s Disease: Beyond the Default Mode Network. Neurobiol. Aging 33(8), 1564–1578 (2012)

[5] Akaike, H.: Information Theory and an Extension of the Maximum Likelihood Principle. In: Selected Papers of Hirotugu Akaike, pp. 199–213. Springer (1998)

[6] Alfaro-Almagro, F., McCarthy, P., Afyouni, S., Andersson, J.L., Bastiani, M., Miller, K.L., Nichols, T.E., Smith, S.M.: Confound Modelling in UK Biobank Brain Imaging. NeuroImage p. 117002 (2020)

[7] Allen, E.A., Damaraju, E., Plis, S.M., Erhardt, E.B., Eichele, T., Calhoun, V.D.: Tracking Whole-Brain Connectivity Dynamics in the Resting State. Cereb. Cortex p. bhs352 (2012)

[8] Amaro Jr, E., Barker, G.J.: Study Design in FMRI: Basic Principles. Brain Cogn. 60(3), 220–232 (2006)

[9] Andreasen, N.C., Flaum, M., Swayze, V., O’Leary, D.S., Alliger, R., Cohen, G., Ehrhardt, J., Yuh, W.T., et al.: Intelligence and Brain Structure in Normal Individuals. Am. J. Psychiatry 150, 130–130 (1993)

[10] Andrews, D.F., Mallows, C.L.: Scale Mixtures of Normal Distributions. J. R. Stat. Soc.
36(1), 99–102 (1974)

[11] Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Log-Euclidean Metrics for Fast and Simple Calculus on Diffusion Tensors. Magn. Reson. Med. 56(2), 411–421 (2006)

[12] Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric Means in a Novel Vector Space Structure on Symmetric Positive-Definite Matrices. SIAM J. Matrix Anal. Appl. 29(1), 328–347 (2007)

[13] Assaf, M., Jagannathan, K., Calhoun, V.D., Miller, L., Stevens, M.C., Sahl, R., O’Boyle, J.G., Schultz, R.T., Pearlson, G.D.: Abnormal Functional Connectivity of Default Mode Sub-Networks in Autism Spectrum Disorder Patients. NeuroImage 53(1), 247–256 (2010)

[14] Banerjee, O., Ghaoui, L.E., d'Aspremont, A.: Model Selection through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. J. Mach. Learn. Res. 9(Mar), 485–516 (2008)

[15] Barachant, A., Bonnet, S., Congedo, M., Jutten, C.: Classification of Covariance Matrices Using a Riemannian-Based Kernel for BCI Applications. Neurocomputing 112, 172–178 (2013)

[16] Barch, D.M., Burgess, G.C., Harms, M.P., Petersen, S.E., Schlaggar, B.L., Corbetta, M., Glasser, M.F., Curtiss, S., Dixit, S., Feldt, C., et al.: Function in the Human Connectome: Task-FMRI and Individual Differences in Behavior. NeuroImage 80, 169–189 (2013)

[17] Bassett, D.S., Bullmore, E., Verchinski, B.A., Mattay, V.S., Weinberger, D.R., Meyer-Lindenberg, A.: Hierarchical Organization of Human Cortical Networks in Health and Schizophrenia. J. Neurosci. 28(37), 9239–9248 (2008)

[18] Basten, U., Stelzel, C., Fiebach, C.J.: Intelligence is Differentially Related to Neural Effort in the Task-Positive and the Task-Negative Brain Network. Intelligence 41(5), 517–528 (2013)

[19] Belmonte, M.K., Allen, G., Beckel-Mitchener, A., Boulanger, L.M., Carper, R.A., Webb, S.J.: Autism and Abnormal Development of Brain Connectivity. J. Neurosci.
24(42), 9228–9231 (2004)

[20] Binnewijzend, M.A., Schoonheim, M.M., Sanz-Arigita, E., Wink, A.M., van der Flier, W.M., Tolboom, N., Adriaanse, S.M., Damoiseaux, J.S., Scheltens, P., van Berckel, B.N., et al.: Resting-state FMRI Changes in Alzheimer’s Disease and Mild Cognitive Impairment. Neurobiol. Aging 33(9), 2018–2028 (2012)

[21] Brodmann, K.: Vergleichende Lokalisationslehre der Grosshirnrinde in ihren Prinzipien dargestellt auf Grund des Zellenbaues. Barth (1909)

[22] Bubenik, P.: Statistical Topological Data Analysis Using Persistence Landscapes. J. Mach. Learn. Res. 16(1), 77–102 (2015)

[23] Bullmore, E., Sporns, O.: Complex Brain Networks: Graph Theoretical Analysis of Structural and Functional Systems. Nat. Rev. Neurosci. 10(3), 186–198 (2009)

[24] Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L.T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., O'Connell, J., et al.: The UK Biobank Resource with Deep Phenotyping and Genomic Data. Nature 562(7726), 203–209 (2018)

[25] Carlsson, G.: Topology and Data. Bull. Am. Math. Soc. 46(2), 255–308 (2009)

[26] Cassidy, B., Rae, C., Solo, V.: Brain Activity: Conditional Dissimilarity and Persistent Homology. International Symposium on Biomedical Imaging (ISBI) pp. 1356–1359 (2015)

[27] Cattell, R.B.: Theory of Fluid and Crystallized Intelligence: A Critical Experiment. J. Educ. Psychol. 54(1), 1 (1963)

[28] Cherian, A., Gould, S.: Second-Order Temporal Pooling for Action Recognition. Int. J. Comput. Vis. 127(4), 340–362 (2019)

[29] Chung, M.K., Hanson, J.L., Lee, H., Adluru, N., Alexander, A.L., Davidson, R.J., Pollak, S.D.: Persistent Homological Sparse Network Approach to Detecting White Matter Abnormality in Maltreated Children: MRI and DTI Multimodal Study. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 300–307.
Springer (2013)

[30] Chung, M.K., Hanson, J.L., Ye, J., Davidson, R.J., Pollak, S.D.: Persistent Homology in Sparse Regression and Its Application to Brain Morphometry. IEEE Trans. Med. Imaging 34(9), 1928–1939 (2015)

[31] Chyzhyk, D., Varoquaux, G., Thirion, B., Milham, M.: Controlling a Confound in Predictive Models with a Test Set Minimizing its Effect. In: 2018 International Workshop on Pattern Recognition in Neuroimaging (PRNI). pp. 1–4. Eur. J. Educ. (2018)

[32] Cohen-Steiner, D., Edelsbrunner, H., Harer, J.: Stability of Persistence Diagrams. Discrete Comput. Geom. 37(1), 103–120 (2007)

[33] Colclough, G.L., Woolrich, M.W., Harrison, S.J., López, P.A.R., Valdes-Sosa, P.A., Smith, S.M.: Multi-Subject Hierarchical Inverse Covariance Modelling Improves Estimation of Functional Brain Networks. NeuroImage 178, 370–384 (2018)

[34] Cornea, E., Zhu, H., Kim, P., Ibrahim, J.G., Alzheimer’s Disease Neuroimaging Initiative: Regression Models on Riemannian Symmetric Spaces. J. R. Stat. Soc. Series B Stat. Methodol. 79(2), 463–482 (2017)

[35] Courchesne, E., Pierce, K.: Why the Frontal Cortex in Autism Might be Talking Only to Itself: Local Over-Connectivity but Long-Distance Disconnection. Curr. Opin. Neurobiol. 15(2), 225–230 (2005)

[36] Craddock, C., Benhajali, Y., Chu, C., Chouinard, F., Evans, A., Jakab, A., Khundrakpam, B.S., Lewis, J.D., Li, Q., Milham, M., et al.: The Neuro Bureau Preprocessing Initiative: Open Sharing of Preprocessed Neuroimaging Data and Derivatives. Front. Neuroinform. 7 (2013)

[37] Craddock, C., Sikka, S., Cheung, B., Khanuja, R., Ghosh, S.S., Yan, C., Li, Q., Lurie, D., Vogelstein, J., Burns, R., et al.: Towards Automated Analysis of Connectomes: The Configurable Pipeline for the Analysis of Connectomes (C-PAC). Front. Neuroinform. 42, 10–3389 (2013)

[38] Craddock, R.C., James, G.A., Holtzheimer, P.E., Hu, X.P., Mayberg, H.S.: A Whole Brain FMRI Atlas Generated via Spatially Constrained Spectral Clustering. Hum. Brain Mapp.
33(8), 1914–1928 (2012) [39] Dadi, K., Rahim, M., Abraham, A., Chyzhyk, D., Milham, M., Thirion, B., Varoquaux, G., Alzheimer’s Disease Neuroimaging Initiative, et al.: Benchmarking Functional Connectome-Based Predictive Models for Resting-State FMRI. NeuroImage 192, 115–134 (2019) [40] Demirtaş, M., Tornador, C., Falcón, C., López-Solà, M., Hernández-Ribas, R., Pujol, J., Menchon, J.M., Ritter, P., Cardoner, N., Soriano-Mas, C., et al.: Dynamic Functional Connectivity Reveals Altered Variability in Functional Connectivity Among Patients with Major Depressive Disorder. Hum. Brain Mapp. 37(8), 2918–2930 (2016) [41] Desikan, R.S., Ségonne, F., Fischl, B., Quinn, B.T., Dickerson, B.C., Blacker, D., Buckner, R.L., Dale, A.M., Maguire, R.P., Hyman, B.T., et al.: An Automated Labeling System for Subdividing the Human Cerebral Cortex on MRI Scans into Gyral Based Regions of Interest. NeuroImage 31(3), 968–980 (2006) [42] Di Martino, A., Yan, C.G., Li, Q., Denio, E., Castellanos, F.X., Alaerts, K., Anderson, J.S., Assaf, M., Bookheimer, S.Y., Dapretto, M., Deen, B., Delmonte, S., Dinstein, I., Ertl-Wagner, B., Fair, D.A., Gallagher, L., Kennedy, D.P., Keown, C.L., Keysers, C., Lainhart, J.E., Lord, C., Luna, B., Menon, V., Minshew, N.J., Monk, C.S., Mueller, S., Müller, R.A., Nebel, M.B., Nigg, J.T., O’Hearn, K., Pelphrey, K.A., Peltier, S.J., Rudie, J.D., Sunaert, S., Thioux, M., Tyszka, M., Uddin, L.Q., Verhoeven, J.S., Wenderoth, N., Wiggins, J.L., Mostofsky, S.H., Milham, M.P.: The Autism Brain Imaging Data Exchange: Towards a Large-Scale Evaluation of the Intrinsic Brain Architecture in Autism. Mol. Psychiatry 19(6), 659–667 (2014) [43] Dinga, R., Schmaal, L., Penninx, B.W., Veltman, D.J., Marquand, A.F.: Controlling for Effects of Confounding Variables on Machine Learning Predictions.
BioRxiv (2020) [44] Dodero, L., Minh, H.Q., San Biagio, M., Murino, V., Sona, D.: Kernel-Based Classification for Brain Connectivity Graphs on the Riemannian Manifold of Positive Definite Matrices. In: International Symposium on Biomedical Imaging (ISBI). pp. 42–45. IEEE (2015) [45] Dong, Z., Jia, S., Zhang, C., Pei, M., Wu, Y.: Deep Manifold Learning of Symmetric Positive Definite Matrices with Application to Face Recognition. In: Thirty-First AAAI Conference on Artificial Intelligence (2017) [46] Dosenbach, N.U., Nardos, B., Cohen, A.L., Fair, D.A., Power, J.D., Church, J.A., Nelson, S.M., Wig, G.S., Vogel, A.C., Lessov-Schlaggar, C.N., et al.: Prediction of Individual Brain Maturity Using FMRI. Science 329(5997), 1358–1361 (2010) [47] Dryden, I.L., Koloydenko, A., Zhou, D., et al.: Non-Euclidean Statistics for Covariance Matrices, with Applications to Diffusion Tensor Imaging. Ann. Appl. Stat. 3(3), 1102–1123 (2009) [48] Duchi, J., Gould, S., Koller, D.: Projected Subgradient Methods for Learning Sparse Gaussians. In: Uncertainty in Artificial Intelligence (UAI) (2008) [49] Dvornek, N.C., Ventola, P., Pelphrey, K.A., Duncan, J.S.: Identifying Autism from Resting-State FMRI Using Long Short-Term Memory Networks. In: International Workshop on Machine Learning in Medical Imaging. pp. 362–370. Springer (2017) [50] Edelsbrunner, H., Letscher, D., Zomorodian, A.J.: Topological Persistence and Simplification. Discrete Comput. Geom. 28, 511–533 (2002) [51] El Gazzar, A., Cerliani, L., van Wingen, G., Thomas, R.M.: Simple 1-D Convolutional Networks for Resting-State FMRI Based Classification in Autism. In: International Joint Conference on Neural Networks (IJCNN). pp. 1–6. IEEE (2019) [52] Fan, J., Li, R.: Variable Selection Via Nonconcave Penalized Likelihood and its Oracle Properties. J. Am. Stat. Assoc.
96(456), 1348–1360 (2001) [53] Fan, L., Su, J., Qin, J., Hu, D., Shen, H.: A Deep Network Model on Dynamic Functional Connectivity With Applications to Gender Classification and Intelligence Prediction. Front. Neurosci. 14, 881 (2020) [54] Ferreira, L.K., Busatto, G.F.: Resting-State Functional Connectivity in Normal Brain Aging. Neurosci. Biobehav. Rev. 37(3), 384–400 (2013) [55] Figueiredo, M.A.: Adaptive Sparseness for Supervised Learning. IEEE Trans. Pattern Anal. Mach. Intell. 25(9), 1150–1159 (2003) [56] Finn, E.S., Shen, X., Scheinost, D., Rosenberg, M.D., Huang, J., Chun, M.M., Papademetris, X., Constable, R.T.: Functional Connectome Fingerprinting: Identifying Individuals Using Patterns of Brain Connectivity. Nat. Neurosci. 18(11), 1664–1671 (2015) [57] Fletcher, P.T.: Geodesic Regression and the Theory of Least Squares on Riemannian Manifolds. Int. J. Comput. Vis. 105(2), 171–185 (2013) [58] Fletcher, P.T., Joshi, S.: Principal Geodesic Analysis on Symmetric Spaces: Statistics of Diffusion Tensors. In: Computer Vision and Mathematical Methods in Medical and Biomedical Image Analysis, pp. 87–98. Springer (2004) [59] Förstner, W., Moonen, B.: A Metric for Covariance Matrices. In: Geodesy-the Challenge of the 3rd Millennium, pp. 299–309. Springer (2003) [60] Foygel, R., Drton, M.: Extended Bayesian Information Criteria for Gaussian Graphical Models. In: Advances in Neural Information Processing Systems (2010) [61] Friedman, J., Hastie, T., Tibshirani, R.: Sparse Inverse Covariance Estimation with the Graphical LASSO. Biostatistics 9(3), 432–441 (2008) [62] Geschwind, D.H., Levitt, P.: Autism Spectrum Disorders: Developmental Disconnection Syndromes. Curr. Opin. Neurobiol. 17(1), 103–111 (2007) [63] Ghrist, R.: Barcodes: The Persistent Topology of Data. Bull. Am. Math. Soc. 45, 61–75 (2008) [64] Ginestet, C.E., Nichols, T.E., Bullmore, E.T., Simmons, A.: Brain Network Analysis: Separating Cost from Topology Using Cost-Integration. 
PloS One 6(7), e21570 (2011) [65] Glasser, M.F., Coalson, T.S., Robinson, E.C., Hacker, C.D., Harwell, J., Yacoub, E., Ugurbil, K., Andersson, J., Beckmann, C.F., Jenkinson, M., et al.: A Multi-Modal Parcellation of Human Cerebral Cortex. Nature 536(7615), 171–178 (2016) [66] Golland, P., Golland, Y., Malach, R.: Detection of Spatial Activation Patterns as Unsupervised Segmentation of FMRI Data. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 110–118. Springer (2007) [67] Gordon, E.M., Laumann, T.O., Adeyemo, B., Huckins, J.F., Kelley, W.M., Petersen, S.E.: Generation and Evaluation of a Cortical Area Parcellation from Resting-State Correlations. Cereb. Cortex 26(1), 288–303 (2016) [68] Gray, J.R., Chabris, C.F., Braver, T.S.: Neural Mechanisms of General Fluid Intelligence. Nat. Neurosci. 6(3), 316–322 (2003) [69] Hagmann, P.: From Diffusion MRI to Brain Connectomics (2005) [70] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media (2009) [71] He, T., Kong, R., Holmes, A.J., Nguyen, M., Sabuncu, M.R., Eickhoff, S.B., Bzdok, D., Feng, J., Yeo, B.T.: Deep Neural Networks and Kernel Regression Achieve Comparable Accuracies for Functional Connectivity Prediction of Behavior and Demographics. NeuroImage 206, 116276 (2020) [72] He, T., Kong, R., Holmes, A.J., Sabuncu, M.R., Eickhoff, S.B., Bzdok, D., Feng, J., Yeo, B.T.: Is Deep Learning Better than Kernel Regression for Functional Connectivity Prediction of Fluid Intelligence? In: International Workshop on Pattern Recognition in Neuroimaging (PRNI). pp. 1–4. IEEE (2018) [73] Hearne, L.J., Mattingley, J.B., Cocchi, L.: Functional Brain Networks Related to Individual Differences in Human Intelligence at Rest. Sci. Rep.
6(1), 1–8 (2016) [74] Heinsfeld, A.S., Franco, A.R., Craddock, R.C., Buchweitz, A., Meneguzzi, F.: Identification of Autism Spectrum Disorder Using Deep Learning and the ABIDE Dataset. NeuroImage: Clinical 17, 16–23 (2018) [75] Helgason, S.: Differential Geometry, Lie Groups and Symmetric Spaces. Math. Surveys Monogr 83 (1978) [76] van den Heuvel, M.P., de Lange, S.C., Zalesky, A., Seguin, C., Yeo, B.T., Schmidt, R.: Proportional Thresholding in Resting-State FMRI Functional Connectivity Networks and Consequences for Patient-Control Connectome Studies: Issues and Recommendations. NeuroImage 152, 437–449 (2017) [77] van den Heuvel, M.P., Stam, C.J., Kahn, R.S., Pol, H.E.H.: Efficiency of Functional Brain Networks and Intellectual Performance. J. Neurosci. 29(23), 7619–7624 (2009) [78] Horak, D., Maletić, S., Rajković, M.: Persistent Homology of Complex Networks. J. Stat. Mech. Theory Exp. p. P03034 (2009) [79] Horn, J.L., Cattell, R.B.: Age Differences in Fluid and Crystallized Intelligence. Acta Psychol. (Amst.) 26, 107–129 (1967) [80] Hromatka, M., Liu, W., Anderson, J.S., Zielinski, B.A., DuBray, M.B., Fletcher, P.T.: Accounting for Heterogeneity Across Multiple Imaging Sites Using Multi-Task Learning. In: MICCAI 2015 Workshop on Bayesian and Graphical Models for Biomedical Imaging (2015) [81] Hu, J., Cao, L., Li, T., Liao, B., Dong, S., Li, P.: Interpretable Learning Approaches in Resting-State Functional Connectivity Analysis: The Case of Autism Spectrum Disorder. Comput. Math. Methods Med. 2020 (2020) [82] Huang, Z., Van Gool, L.: A Riemannian Network for SPD Matrix Learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017) [83] Huang, Z., Wang, R., Shan, S., Li, X., Chen, X.: Log-Euclidean Metric Learning on Symmetric Positive Definite Manifold with Application to Image Set Classification. In: International Conference on Machine Learning (ICML). pp.
720–729 (2015) [84] Hull, J.V., Dokovna, L.B., Jacokes, Z.J., Torgerson, C.M., Irimia, A., Van Horn, J.D.: Resting-State Functional Connectivity in Autism Spectrum Disorders: A Review. Front. Psychiatry 7, 205 (2017) [85] Hunter, D.R., Li, R.: Variable Selection Using MM Algorithms. Ann. Stat. 33(4) (2005) [86] Hutchison, R.M., Womelsdorf, T., Gati, J.S., Everling, S., Menon, R.S.: Resting-State Networks Show Dynamic Functional Connectivity in Awake Humans and Anesthetized Macaques. Hum. Brain Mapp. 34(9), 2154–2177 (2013) [87] Ingalhalikar, M., Smith, A., Parker, D., Satterthwaite, T.D., Elliott, M.A., Ruparel, K., Hakonarson, H., Gur, R.E., Gur, R.C., Verma, R.: Sex Differences in the Structural Connectome of the Human Brain. Proc. Natl. Acad. Sci. U.S.A. 111(2), 823–828 (2014) [88] Ionescu, C., Vantzos, O., Sminchisescu, C.: Matrix Backpropagation for Deep Networks with Structured Layers. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2965–2973 (2015) [89] Irwing, P., Lynn, R.: Sex Differences in Means and Variability on the Progressive Matrices in University Students: A Meta-Analysis. Br. J. Psychol. 96(4), 505–524 (2005) [90] Itahashi, T., Yamada, T., Watanabe, H., Nakamura, M., Jimbo, D., Shioda, S., Toriizuka, K., Kato, N., Hashimoto, R.: Altered Network Topologies and Hub Organization in Adults with Autism: A Resting-State FMRI Study. PloS One 9(4), e94115 (2014) [91] Jalili, M.: Functional Brain Networks: Does the Choice of Dependency Estimator and Binarization Method Matter? Sci. Rep. 6, 29780 (2016) [92] Johnson, W., Bouchard Jr, T.J.: Sex Differences in Mental Abilities: g Masks the Dimensions on which They Lie. Intelligence 35(1), 23–39 (2007) [93] Jung, M., Mody, M., Saito, D.N., Tomoda, A., Okazawa, H., Wada, Y., Kosaka, H.: Sex Differences in the Default Mode Network with Regard to Autism Spectrum Traits: A Resting State FMRI Study. 
PloS One 10(11), e0143126 (2015) [94] Jung, R.E., Haier, R.J.: The Parieto-Frontal Integration Theory (P-FIT) of Intelligence: Converging Neuroimaging Evidence. Behav. Brain Sci. 30(2), 135 (2007) [95] Kambeitz, J., Kambeitz-Ilankovic, L., Leucht, S., Wood, S., Davatzikos, C., Malchow, B., Falkai, P., Koutsouleris, N.: Detecting Neuroimaging Biomarkers for Schizophrenia: A Meta-Analysis of Multivariate Pattern Recognition Studies. Neuropsychopharmacology 40(7), 1742–1751 (2015) [96] Kaufman, J.C., Chen, T.H., Kaufman, A.S.: Ethnic Group, Education, and Gender Differences on Six Horn Abilities for Adolescents and Adults. J. Psychoeduc. Assess. 13(1), 49–65 (1995) [97] Kazeminejad, A., Sotero, R.C.: Topological Properties of Resting-State FMRI Functional Networks Improve Machine Learning-Based Autism Classification. Front. Neurosci. 12, 1018 (2019) [98] Khalid, A., Kim, B.S., Chung, M.K., Ye, J.C., Jeon, D.: Tracing the Evolution of Multi-Scale Functional Networks in a Mouse Model of Depression Using Persistent Brain Network Homology. NeuroImage 101, 351–363 (2014) [99] Khazaee, A., Ebrahimzadeh, A., Babajani-Feremi, A.: Application of Advanced Machine Learning Methods on Resting-State FMRI Network for Identification of Mild Cognitive Impairment and Alzheimer’s Disease. Brain Imaging Behav. 10(3), 799–817 (2016) [100] Kim, H.J., Adluru, N., Collins, M.D., Chung, M.K., Bendlin, B.B., Johnson, S.C., Davidson, R.J., Singh, V.: Multivariate General Linear Models (MGLM) on Riemannian Manifolds with Applications to Statistical Analysis of Diffusion Weighted Images. In: Computer Vision and Pattern Recognition (CVPR). pp. 2705–2712 (2014) [101] Kim, J., Wozniak, J.R., Mueller, B.A., Pan, W.: Testing Group Differences in Brain Functional Connectivity: Using Correlations or Partial Correlations? Brain connectivity 5(4), 214–231 (2015) [102] Kühn, S., Vanderhasselt, M.A., De Raedt, R., Gallinat, J.: The Neural Basis of Unwanted Thoughts during Resting State. Soc. Cogn.
Affect. Neurosci. 9(9), 1320–1324 (2013) [103] Lee, H., Kang, H., Chung, M.K., Kim, B.N., Lee, D.S.: Persistent Brain Network Homology from the Perspective of Dendrogram. IEEE Trans. Med. Imaging 31(12), 2267–2277 (2012) [104] Lee, H., Kang, H., Chung, M.K., Kim, B.N., Lee, D.S.: Weighted Functional Brain Network Modeling via Network Filtration. NIPS Workshop on Algebraic Topology and Machine Learning (2012) [105] Lee, H., Kang, H., Chung, M.K., Lim, S., Kim, B.N., Lee, D.S.: Integrated Multimodal Network Approach to PET and MRI Based on Multidimensional Persistent Homology. Hum. Brain Mapp. 38(3), 1387–1402 (2017) [106] Lee, K.H., Choi, Y.Y., Gray, J.R., Cho, S.H., Chae, J.H., Lee, S., Kim, K.: Neural Correlates of Superior Intelligence: Stronger Recruitment of Posterior Parietal Cortex. NeuroImage 29(2), 578–586 (2006) [107] Li, H., Satterthwaite, T.D., Fan, Y.: Brain Age Prediction Based on Resting-State Functional Connectivity Patterns Using Convolutional Neural Networks. In: International Symposium on Biomedical Imaging (ISBI). pp. 101–104. IEEE (2018) [108] Li, J., Biswal, B.B., Meng, Y., Yang, S., Duan, X., Cui, Q., Chen, H., Liao, W.: A Neuromarker of Individual General Fluid Intelligence from the White-Matter Functional Connectome. Transl. Psychiatry 10(1), 1–12 (2020) [109] Lipp, I., Benedek, M., Fink, A., Koschutnig, K., Reishofer, G., Bergner, S., Ischebeck, A., Ebner, F., Neubauer, A.: Investigating Neural Efficiency in the Visuo-Spatial Domain: An FMRI Study. PloS One 7(12), e51316 (2012) [110] Liu, W., Awate, S.P., Anderson, J.S., Fletcher, P.T.: A Functional Network Estimation Method of Resting-State FMRI Using a Hierarchical Markov Random Field. NeuroImage 100, 520–534 (2014) [111] Logothetis, N.K.: The Underpinnings of the BOLD Functional Magnetic Resonance Imaging Signal. J. Neurosci. 23(10), 3963–3971 (2003) [112] Lynn, R., Irwing, P.: Sex Differences in Mental Arithmetic, Digit Span, and g Defined as Working Memory Capacity.
Intelligence 36(3), 226–235 (2008) [113] McGonigle, D.J., Howseman, A.M., Athwal, B.S., Friston, K.J., Frackowiak, R., Holmes, A.P.: Variability in FMRI: An Examination of Intersession Differences. NeuroImage 11(6), 708–734 (2000) [114] McGrath, J., Saha, S., Chant, D., Welham, J.: Schizophrenia: A Concise Overview of Incidence, Prevalence, and Mortality. Epidemiol. Rev. 30(1), 67–76 (2008) [115] McIntosh, A., Bookstein, F., Haxby, J., Grady, C.: Spatial Pattern Analysis of Functional Brain Images Using Partial Least Squares. NeuroImage 3(3), 143–157 (1996) [116] Meinshausen, N., Bühlmann, P.: High-Dimensional Graphs and Variable Selection with the LASSO. Ann. Stat. pp. 1436–1462 (2006) [117] Molnar, C.: Interpretable Machine Learning. https://christophm.github.io/interpretable-ml-book/ (2019) [118] Montavon, G., Samek, W., Müller, K.R.: Methods for Interpreting and Understanding Deep Neural Networks. Digit. Signal Process. 73, 1–15 (2018) [119] Ng, B., Dressler, M., Varoquaux, G., Poline, J.B., Greicius, M., Thirion, B.: Transport on Riemannian Manifold for Functional Connectivity-Based Classification. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 405–412. Springer (2014) [120] Ng, B., Varoquaux, G., Poline, J.B., Thirion, B.: A Novel Sparse Group Gaussian Graphical Model for Functional Connectivity Estimation. In: International Conference on Information Processing in Medical Imaging. pp. 256–267. Springer (2013) [121] Nie, L., Yang, X., Matthews, P.M., Xu, Z.W., Guo, Y.K.: Inferring Functional Connectivity in FMRI Using Minimum Partial Correlation. Int. J. Autom. Comput. 14(4), 371–385 (2017) [122] Nyborg, H.: Sex-Related Differences in General Intelligence g, Brain Size, and Social Status. Pers. Individ. Dif.
39(3), 497–509 (2005) [123] Palande, S., Jose, V., Zielinski, B., Anderson, J., Fletcher, P.T., Wang, B.: Revisiting Abnormalities in Brain Network Architecture Underlying Autism Using Topology-Inspired Statistical Inference. In: International Workshop on Connectomics in Neuroimaging. pp. 98–107. Springer (2017) [124] Pamplona, G.S., Santos Neto, G.S., Rosset, S.R., Rogers, B.P., Salmon, C.E.: Analyzing the Association between Functional Connectivity of the Brain and Intellectual Performance. Front. Hum. Neurosci. 9, 61 (2015) [125] Parisot, S., Ktena, S.I., Ferrante, E., Lee, M., Moreno, R.G., Glocker, B., Rueckert, D.: Spectral Graph Convolutions for Population-Based Disease Prediction. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 177–185. Springer (2017) [126] Parisot, S., Ktena, S.I., Ferrante, E., Lee, M., Guerrero, R., Glocker, B., Rueckert, D.: Disease Prediction Using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer’s Disease. Med. Image Anal. 48, 117–130 (2018) [127] Park, T., Casella, G.: The Bayesian LASSO. J. Am. Stat. Assoc. 103(482), 681–686 (2008) [128] Pennec, X., Fillard, P., Ayache, N.: A Riemannian Framework for Tensor Computing. Int. J. Comput. Vis. 66(1), 41–66 (2006) [129] Power, J.D., Cohen, A.L., Nelson, S.M., Wig, G.S., Barnes, K.A., Church, J.A., Vogel, A.C., Laumann, T.O., Miezin, F.M., Schlaggar, B.L., et al.: Functional Network Organization of the Human Brain. Neuron 72(4), 665–678 (2011) [130] Qiu, A., Lee, A., Tan, M., Chung, M.K.: Manifold Learning on Brain Functional Networks in Aging. Med. Image Anal. 20(1), 52–60 (2015) [131] Rahim, M., Thirion, B., Varoquaux, G.: Population-Shrinkage of Covariance to Estimate Better Brain Functional Connectivity. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 460–468.
Springer (2017) [132] Rao, A., Monteiro, J.M., Mourao-Miranda, J., Alzheimer’s Disease Neuroimaging Initiative, et al.: Predictive Modelling Using Neuroimaging Data in the Presence of Confounds. NeuroImage 150, 23–49 (2017) [133] Rashid, B., Arbabshirani, M.R., Damaraju, E., Cetin, M.S., Miller, R., Pearlson, G.D., Calhoun, V.D.: Classification of Schizophrenia and Bipolar Patients Using Static and Dynamic Resting-State FMRI Brain Connectivity. NeuroImage 134, 645–657 (2016) [134] Reininghaus, J., Huber, S., Bauer, U., Kwitt, R.: A Stable Multi-Scale Kernel for Topological Machine Learning. Computer Vision and Pattern Recognition (CVPR) (2015) [135] Ritchie, S.J., Cox, S.R., Shen, X., Lombardo, M.V., Reus, L.M., Alloza, C., Harris, M.A., Alderson, H.L., Hunter, S., Neilson, E., et al.: Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants. Cereb. Cortex 28(8), 2959–2975 (2018) [136] Ronicko, J.F.A., Thomas, J., Thangavel, P., Koneru, V., Langs, G., Dauwels, J.: Diagnostic Classification of Autism Using Resting-State FMRI Data Improves with Full Correlation Functional Brain Connectivity Compared to Partial Correlation. J. Neurosci. Methods 345, 108884 (2020) [137] Rosipal, R., Trejo, L.: Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. J. Mach. Learn. Res. 2, 97–123 (2002) [138] Rothman, A.J., Bickel, P.J., Levina, E., Zhu, J., et al.: Sparse Permutation Invariant Covariance Estimation. Electron. J. Stat. 2, 494–515 (2008) [139] Rudie, J.D., Brown, J., Beck-Pancer, D., Hernandez, L., Dennis, E., Thompson, P., Bookheimer, S., Dapretto, M.: Altered Functional and Structural Brain Network Organization in Autism. NeuroImage: Clinical 2, 79–94 (2013) [140] Rudin, C.: Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat. Mach. Intell. 1(5), 206–215 (2019) [141] Rue, H., Held, L.: Gaussian Markov Random Fields: Theory and Applications.
Chapman and Hall (2005) [142] Ruigrok, A.N., Salimi-Khorshidi, G., Lai, M.C., Baron-Cohen, S., Lombardo, M.V., Tait, R.J., Suckling, J.: A Meta-Analysis of Sex Differences in Human Brain Structure. Neurosci. Biobehav. Rev. 39, 34–50 (2014) [143] Ryali, S., Chen, T., Supekar, K., Menon, V.: A Parcellation Scheme Based on von Mises-Fisher Distributions and Markov Random Fields for Segmenting Brain Regions Using Resting-State FMRI. NeuroImage 65, 83–96 (2013) [144] Sacher, J., Neumann, J., Okon-Singer, H., Gotowiec, S., Villringer, A.: Sexual Dimorphism in the Human Brain: Evidence from Neuroimaging. Magn. Reson. Imaging 31(3), 366–375 (2013) [145] Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D.A., Nolan, G.P.: Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308(5721), 523–529 (2005) [146] Sala-Llonch, R., Bartrés-Faz, D., Junqué, C.: Reorganization of Brain Networks in Aging: A Review of Functional Connectivity Studies. Front. Psychol. 6, 663 (2015) [147] Santarnecchi, E., Tatti, E., Rossi, S., Serino, V., Rossi, A.: Intelligence-Related Differences in the Asymmetry of Spontaneous Cerebral Activity. Hum. Brain Mapp. 36(9), 3586–3602 (2015) [148] Schulz, M.A., Yeo, B.T., Vogelstein, J.T., Mourao-Miranda, J., Kather, J.N., Kording, K., Richards, B., Bzdok, D.: Different Scaling of Linear Models and Deep Learning in UK Biobank Brain Images Versus Machine-Learning Datasets. Nat. Commun. 11(1), 1–15 (2020) [149] Schwartz, E., Nenning, K.H., Kasprian, G., Schuller, A.L., Bartha-Doering, L., Langs, G.: Multivariate Manifold Modelling of Functional Connectivity in Developing Language Networks. In: International Conference on Information Processing in Medical Imaging (IPMI). pp. 311–322. Springer (2017) [150] Sheline, Y.I., Raichle, M.E.: Resting State Functional Connectivity in Preclinical Alzheimer’s Disease. Biol.
Psychiatry 74(5), 340–347 (2013) [151] Singh, N., Fletcher, P.T., Preston, J.S., King, R.D., Marron, J., Weiner, M.W., Joshi, S., Alzheimer’s Disease Neuroimaging Initiative (ADNI), et al.: Quantifying Anatomical Shape Variations in Neurological Disorders. Med. Image Anal. 18(3), 616–633 (2014) [152] Van der Sluis, S., Posthuma, D., Dolan, C.V., de Geus, E.J., Colom, R., Boomsma, D.I.: Sex Differences on the Dutch WAIS-III. Intelligence 34(3), 273–289 (2006) [153] Smith, S.M., Fox, P.T., Miller, K.L., Glahn, D.C., Fox, P.M., Mackay, C.E., Filippini, N., Watkins, K.E., Toro, R., Laird, A.R., et al.: Correspondence of the Brain’s Functional Architecture During Activation and Rest. Proc. Natl. Acad. Sci. U.S.A. 106(31), 13040–13045 (2009) [154] Smith, S.M., Miller, K.L., Salimi-Khorshidi, G., Webster, M., Beckmann, C.F., Nichols, T.E., Ramsey, J.D., Woolrich, M.W.: Network Modelling Methods for FMRI. NeuroImage 54(2), 875–891 (2011) [155] Smith, S.M., Nichols, T.E., Vidaurre, D., Winkler, A.M., Behrens, T.E., Glasser, M.F., Ugurbil, K., Barch, D.M., Van Essen, D.C., Miller, K.L.: A Positive-Negative Mode of Population Covariation Links Brain Connectivity, Demographics and Behavior. Nat. Neurosci. 18(11), 1565–1567 (2015) [156] Smith, S.M., Vidaurre, D., Beckmann, C.F., Glasser, M.F., Jenkinson, M., Miller, K.L., Nichols, T.E., Robinson, E.C., Salimi-Khorshidi, G., Woolrich, M.W., et al.: Functional Connectomics from Resting-State FMRI. Trends Cogn. Sci. 17(12), 666–682 (2013) [157] Snoek, L., Miletić, S., Scholte, H.S.: How to Control for Confounds in Decoding Analyses of Neuroimaging Data. NeuroImage 184, 741–760 (2019) [158] Song, M., Liu, Y., Zhou, Y., Wang, K., Yu, C., Jiang, T.: Default Network and Intelligence Difference. IEEE Trans. Auton. Mental Develop. 1(2), 101–109 (2009) [159] Song, M., Zhou, Y., Li, J., Liu, Y., Tian, L., Yu, C., Jiang, T.: Brain Spontaneous Functional Connectivity and Intelligence.
NeuroImage 41(3), 1168–1176 (2008) [160] Sporns, O., Tononi, G., Kötter, R.: The Human Connectome: A Structural Description of the Human Brain. PLoS Comput. Biol. 1(4), e42 (2005) [161] Spreng, R.N.: The Fallacy of a “Task-Negative” Network. Front. Psychol. 3, 145 (2012) [162] Steinmayr, R., Beauducel, A., Spinath, B.: Do Sex Differences in a Faceted Model of Fluid and Crystallized Intelligence Depend on the Method Applied? Intelligence 38(1), 101–110 (2010) [163] Sun, T., Zhang, C.H.: Sparse Matrix Inversion with Scaled LASSO. J. Mach. Learn. Res. 14(1), 3385–3418 (2013) [164] Tagliazucchi, E., Laufs, H.: Decoding Wakefulness Levels from Typical FMRI Resting-State Data Reveals Reliable Drifts Between Wakefulness and Sleep. Neuron 82(3), 695–708 (2014) [165] Thomas, R.M., Gallo, S., Cerliani, L., Zhutovsky, P., El-Gazzar, A., van Wingen, G.: Classifying Autism Spectrum Disorder Using the Temporal Statistics of Resting-State Functional MRI Data with 3D Convolutional Neural Networks. Front. Psychiatry 11, 440 (2020) [166] Tuzel, O., Porikli, F., Meer, P.: Pedestrian Detection via Classification on Riemannian Manifolds. IEEE Trans. Pattern Anal. Mach. Intell. 30(10), 1713–1727 (2008) [167] Van Dijk, K.R., Hedden, T., Venkataraman, A., Evans, K.C., Lazar, S.W., Buckner, R.L.: Intrinsic Functional Connectivity as a Tool for Human Connectomics: Theory, Properties, and Optimization. J. Neurophysiol. 103(1), 297–321 (2010) [168] Van Essen, D.C., Smith, S.M., Barch, D.M., Behrens, T.E., Yacoub, E., Ugurbil, K., Consortium, W.M.H., et al.: The WU-Minn Human Connectome Project: An Overview. NeuroImage 80, 62–79 (2013) [169] Van Essen, D.C., Ugurbil, K., Auerbach, E., Barch, D., Behrens, T., Bucholz, R., Chang, A., Chen, L., Corbetta, M., Curtiss, S.W., et al.: The Human Connectome Project: A Data Acquisition Perspective.
NeuroImage 62(4), 2222–2231 (2012) [170] Varoquaux, G., Baronnet, F., Kleinschmidt, A., Fillard, P., Thirion, B.: Detection of Brain Functional-Connectivity Difference in Post-Stroke Patients Using Group-Level Covariance Modeling. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 200–208. Springer (2010) [171] Varoquaux, G., Sadaghiani, S., Pinel, P., Kleinschmidt, A., Poline, J.B., Thirion, B.: A Group Model for Stable Multi-Subject ICA on FMRI Datasets. NeuroImage 51(1), 288–299 (2010) [172] de Vos, F., Koini, M., Schouten, T.M., Seiler, S., van der Grond, J., Lechner, A., Schmidt, R., de Rooij, M., Rombouts, S.A.: A Comprehensive Analysis of Resting State FMRI Measures to Classify Individual Patients with Alzheimer’s Disease. NeuroImage 167, 62–72 (2018) [173] Vos, T., Allen, C., Arora, M., Barber, R.M., Bhutta, Z.A., Brown, A., Carter, A., Casey, D.C., Charlson, F.J., Chen, A.Z., et al.: Global, Regional, and National Incidence, Prevalence, and Years Lived with Disability for 310 Diseases and Injuries, 1990–2015: A Systematic Analysis for the Global Burden of Disease Study 2015. The Lancet 388(10053), 1545–1602 (2016) [174] Wang, R., Guo, H., Davis, L.S., Dai, Q.: Covariance Discriminative Learning: A Natural and Efficient Approach to Image Set Classification. In: Computer Vision and Pattern Recognition (CVPR). pp. 2496–2503. IEEE (2012) [175] Wang, Y., Kang, J., Kemmer, P.B., Guo, Y.: An Efficient and Reliable Statistical Method for Estimating Functional Connectivity in Large Scale Brain Networks Using Partial Correlation. Front. Neurosci. 10, 123 (2016) [176] Wechsler, S., de Cassia Nakano, T., da Silva Domingues, S.F., Rosa, H.R., da Silva, R.B.F., Silva-Filho, J.H., Minervino, C.A.d.S.M.: Gender Differences on Tests of Crystallized Intelligence. Eur. J. Educ.
7(1), 59–72 (2014) [177] Weis, S., Patil, K.R., Hoffstaedter, F., Nostro, A., Yeo, B.T., Eickhoff, S.B.: Sex Classification by Resting State Brain Connectivity. Cereb. Cortex 30(2), 824–835 (2020) [178] Weng, S.J., Wiggins, J.L., Peltier, S.J., Carrasco, M., Risi, S., Lord, C., Monk, C.S.: Alterations of Resting State Functional Connectivity in the Default Network in Adolescents with Autism Spectrum Disorders. Brain Res. 1313, 202–214 (2010) [179] Witelson, S., Beresh, H., Kigar, D.: Intelligence and Brain Size in 100 Postmortem Brains: Sex, Lateralization and Age Factors. Brain 129(2), 386–398 (2006) [180] Wold, H.: Soft Modeling by Latent Variables: The Nonlinear Iterative Partial Least Squares Approach. Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett pp. 520–540 (1975) [181] Wong, E., Anderson, J.S., Zielinski, B.A., Fletcher, P.T.: Riemannian Regression and Classification Models of Brain Networks Applied to Autism. In: International Workshop on Connectomics in Neuroimaging. pp. 78–87. Springer (2018) [182] Wong, E., Awate, S.P., Fletcher, P.T.: Adaptive Sparsity in Gaussian Graphical Models. In: International Conference on Machine Learning (ICML). pp. 311–319 (2013) [183] Wu, K., Taki, Y., Sato, K., Hashizume, H., Sassa, Y., Takeuchi, H., Thyreau, B., He, Y., Evans, A.C., Li, X., et al.: Topological Organization of Functional Brain Networks in Healthy Children: Differences in Relation to Age, Sex, and Intelligence. PloS One 8(2), e55347 (2013) [184] Xu, C., Li, C., Wu, H., Wu, Y., Hu, S., Zhu, Y., Zhang, W., Wang, L., Zhu, S., Liu, J., et al.: Gender Differences in Cerebral Regional Homogeneity of Adult Healthy Volunteers: A Resting-State FMRI Study. Biomed Res. Int. 2015 (2015) [185] Xu, T., Yang, Z., Jiang, L., Xing, X.X., Zuo, X.N.: A Connectome Computation System for Discovery Science of Brain. Sci. Bull. 60(1), 86–95 (2015) [186] Yan, C., Zang, Y.: DPARSF: A MATLAB Toolbox for Pipeline Data Analysis of Resting-State FMRI. 
Front. Syst. Neurosci. 4, 13 (2010) [187] Ye, M., Yang, T., Qing, P., Lei, X., Qiu, J., Liu, G.: Changes of Functional Brain Networks in Major Depressive Disorder: A Graph Theoretical Analysis of Resting-State FMRI. PloS One 10(9), e0133775 (2015) [188] Yeo, B.T., Krienen, F.M., Sepulcre, J., Sabuncu, M.R., Lashkari, D., Hollinshead, M., Roffman, J.L., Smoller, J.W., Zöllei, L., Polimeni, J.R., et al.: The Organization of the Human Cerebral Cortex Estimated by Intrinsic Functional Connectivity. J. Neurophysiol. 106(3), 1125–1165 (2011) [189] Yerys, B.E., Gordon, E.M., Abrams, D.N., Satterthwaite, T.D., Weinblatt, R., Jankowski, K.F., Strang, J., Kenworthy, L., Gaillard, W.D., Vaidya, C.J.: Default Mode Network Segregation and Social Deficits in Autism Spectrum Disorder: Evidence from Non-Medicated Children. NeuroImage: Clinical 9, 223–232 (2015) [190] Yger, F., Sugiyama, M.: Supervised Log-Euclidean Metric Learning for Symmetric Positive Definite Matrices. arXiv Preprint arXiv:1502.03505 (2015) [191] Ypma, R.J., Moseley, R.L., Holt, R.J., Rughooputh, N., Floris, D.L., Chura, L.R., Spencer, M.D., Baron-Cohen, S., Suckling, J., Bullmore, E.T., et al.: Default Mode Hypoconnectivity Underlies a Sex-Related Autism Spectrum. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 1(4), 364–371 (2016) [192] Yu, K., Salzmann, M.: Second-Order Convolutional Neural Networks. arXiv preprint arXiv:1703.06817 (2017) [193] Yu, Q., Erhardt, E.B., Sui, J., Du, Y., He, H., Hjelm, D., Cetin, M.S., Rachakonda, S., Miller, R.L., Pearlson, G., et al.: Assessing Dynamic Brain Graphs of Time-Varying Connectivity in FMRI Data: Application to Healthy Controls and Patients with Schizophrenia. NeuroImage 107, 345–355 (2015) [194] Yuan, M., Lin, Y.: Model Selection and Estimation in the Gaussian Graphical Model. Biometrika 94(1), 19–35 (2007) [195] Zalesky, A., Fornito, A., Bullmore, E.T.: Network-Based Statistic: Identifying Differences in Brain Networks.
NeuroImage 53(4), 1197–1207 (2010) [196] Zhang, C., Dougherty, C.C., Baum, S.A., White, T., Michael, A.M.: Functional Connectivity Predicts Gender: Evidence for Gender Differences in Resting Brain Connectivity. Hum. Brain Mapp. 39(4), 1765–1776 (2018) [197] Zhao, Q., Kwon, D., Pohl, K.M.: A Riemannian Framework for Longitudinal Analysis of Resting-State Functional Connectivity. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 145–153. Springer (2018) |
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6hp8x2n |



