| Title | Data analysis and visualization using basis selection for matrix approximation |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Computing |
| Author | Perry, Daniel J. |
| Date | 2017 |
| Description | This dissertation is about analyzing and visualizing datasets using basis selection techniques for matrix approximation. A large portion of the previous work in basis selection and matrix approximation has been focused entirely on new algorithms to improve specific measures of quality and has been largely motivated by the goals of reducing runtime and minimizing the error introduced. We contribute to that body of knowledge, but we also enlarge the types of motivating problems and interesting applications available to basis selection techniques. Specifically, in addition to contributing to well-studied problems, such as the computational aspects of kernel-based learning and general low-rank matrix approximations, we also introduce two real-world problems where basis selection aids in significant ways: subset-based visualization and proxy-construction for uncertainty quantification of resource-demanding simulations. We hope this dissertation will motivate others to study and extend ideas and techniques that are specifically motivated by these fascinating problems. The full set of concepts discussed here can be categorized into two fundamental ideas: using appropriate basis selection to improve human interpretability of datasets and basis selection to address computational burden. We present these ideas in a collection of five papers. The first paper introduces a novel subset-based visualization motivated by an application to topology optimization design exploration, and emphasizes the ability of a subset matrix to visually summarize a dataset. The second and third papers address computational limitations in kernel-based learning, introducing a novel basis search technique for the Nystrom approximation and a random-projection type approximation, respectively. The fourth paper introduces a novel algorithm and analysis related to general subset-based matrix approximation, touching on both computational and interpretation aspects of basis selection. The fifth paper considers a novel basis selection approach to proxy-function construction for faster uncertainty quantification of compute-intensive simulations. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Computer science |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | © Daniel J. Perry |
| Format | application/pdf |
| Format Medium | application/pdf |
| ARK | ark:/87278/s6bg7x99 |
| Setname | ir_etd |
| ID | 1536060 |
| OCR Text | DATA ANALYSIS AND VISUALIZATION USING BASIS SELECTION FOR MATRIX APPROXIMATION

by Daniel J. Perry

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computing, School of Computing, The University of Utah, December 2017.

Copyright © Daniel J. Perry 2017. All Rights Reserved.

THE UNIVERSITY OF UTAH GRADUATE SCHOOL
STATEMENT OF DISSERTATION APPROVAL

The dissertation of Daniel J. Perry has been approved by the following supervisory committee members:
Ross Whitaker, Chair (date approved 23 Oct 2017)
Robert Michael Kirby, Member (date approved 23 Oct 2017)
Jeffrey Phillips, Member (date approved 23 Oct 2017)
Aditya Bhaskara, Member (date approved 23 Oct 2017)
Braxton Osting, Member (date approved 23 Oct 2017)
and by Ross Whitaker, Chair/Dean of the Department/College/School of Computing, and by David B. Kieda, Dean of The Graduate School.

ABSTRACT

This dissertation is about analyzing and visualizing datasets using basis selection techniques for matrix approximation. A large portion of the previous work in basis selection and matrix approximation has been focused entirely on new algorithms to improve specific measures of quality and has been largely motivated by the goals of reducing runtime and minimizing the error introduced. We contribute to that body of knowledge, but we also enlarge the types of motivating problems and interesting applications available to basis selection techniques. Specifically, in addition to contributing to well-studied problems, such as the computational aspects of kernel-based learning and general low-rank matrix approximations, we also introduce two real-world problems where basis selection aids in significant ways: subset-based visualization and proxy-construction for uncertainty quantification of resource-demanding simulations. We hope this dissertation will motivate others to study and extend ideas and techniques that are specifically motivated by these fascinating problems. The full set of concepts discussed here can be categorized into two fundamental ideas: using appropriate basis selection to improve human interpretability of datasets and basis selection to address computational burden. We present these ideas in a collection of five papers. The first paper introduces a novel subset-based visualization motivated by an application to topology optimization design exploration, and emphasizes the ability of a subset matrix to visually summarize a dataset. The second and third papers address computational limitations in kernel-based learning, introducing a novel basis search technique for the Nyström approximation and a random-projection type approximation, respectively. The fourth paper introduces a novel algorithm and analysis related to general subset-based matrix approximation, touching on both computational and interpretation aspects of basis selection. The fifth paper considers a novel basis selection approach to proxy-function construction for faster uncertainty quantification of compute-intensive simulations.

For Becca

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
CHAPTERS
1. INTRODUCTION
  1.1 Basis selection for improved human interpretability
    1.1.1 Dimension reduction
    1.1.2 Subset selection
  1.2 Basis selection to address computational burden
    1.2.1 Machine learning
      1.2.1.1 Nyström approximation in kernel-based methods
      1.2.1.2 Random-projection approximation methods
      1.2.1.3 Column subset selection as a preprocessing step
    1.2.2 Simulation science
  1.3 Overview and summary of contributions
    1.3.1 Overview
    1.3.2 Summary
  1.4 References
2. VISUALIZATION OF TOPOLOGY OPTIMIZATION DESIGNS WITH REPRESENTATIVE SUBSET SELECTION
  2.1 Introduction
  2.2 Topology optimization for industrial design
  2.3 Related work
    2.3.1 Dimension reduction
    2.3.2 Subset-based visualization
    2.3.3 Structure topology optimization analysis
  2.4 Subset-based visualization of topology optimization designs
    2.4.1 Representative subset
    2.4.2 Simultaneous subset and coefficient computation
    2.4.3 Subset size
    2.4.4 Visual encoding
    2.4.5 Linked views
  2.5 Evaluation
    2.5.1 Subset selection evaluation
      2.5.1.1 Methods
      2.5.1.2 Datasets
      2.5.1.3 User-based evaluation of STO designs
      2.5.1.4 Evaluation
    2.5.2 Full system evaluation
      2.5.2.1 Methods
      2.5.2.2 Datasets
      2.5.2.3 Evaluation
      2.5.2.4 Case studies
      2.5.2.5 Additional examples
  2.6 Discussion
  2.7 References
3. NYSTRÖM SKETCHES
  3.1 Introduction
  3.2 Background
    3.2.1 Nyström coreset and sketch
    3.2.2 PCA and kPCA
    3.2.3 kPCA projection error and Nyström approximation
  3.3 General formulation of the Nyström sketch learning problem
    3.3.1 Optimal Nyström sketch is the PCA basis for the trivial lifting function
  3.4 Optimal Nyström sketch using nonlinear least-squares methods
    3.4.1 Linear kernel
    3.4.2 Symmetric stationary kernel
    3.4.3 Batch algorithms
    3.4.4 Online algorithms
    3.4.5 Robust online algorithms
  3.5 Experiments
    3.5.1 Methods
    3.5.2 Parameter selection
    3.5.3 Optimal Nyström sketch learning
    3.5.4 Online extension
    3.5.5 Robust extension
  3.6 Conclusion
  3.7 References
4. STREAMING KERNEL PRINCIPAL COMPONENT ANALYSIS
  4.1 Introduction
    4.1.1 Related work
    4.1.2 Our results vs. previous streaming
  4.2 Algorithms and analysis
    4.2.1 Spectral error analysis
    4.2.2 Frobenius error analysis
  4.3 Experiments
  4.4 References
  4.A High probability bound for RFFMaps
    4.A.1 High probability "for all" bound for RFFMaps
    4.A.2 "For each" bound for RFFMaps
5. DETERMINISTIC COLUMN SUBSET SELECTION
  5.1 Introduction
  5.2 Related work
  5.3 Deterministic column sampling
  5.4 Experiments
    5.4.1 Algorithms
    5.4.2 Datasets
    5.4.3 Synthetic
    5.4.4 Real
  5.5 Analysis of tail oblivious methods
    5.5.1 Tail oblivious methods
    5.5.2 Proofs
      5.5.2.1 Proof outline
      5.5.2.2 Construction of matrix A
      5.5.2.3 Analyzing the reconstruction error
      5.5.2.4 Proof of Theorem 5.5.1
  5.6 Conclusion
  5.7 References
6. ALLOCATION STRATEGIES FOR HIGH-FIDELITY MODELS IN THE MULTIFIDELITY REGIME
  6.1 Introduction
  6.2 Previous work
    6.2.1 Gramian non-parametric multifidelity algorithm
    6.2.2 Subset selection
  6.3 Allocation strategies for multifidelity simulation
    6.3.1 Relaxation of the exact subset problem
    6.3.2 Group orthogonal matching pursuit
    6.3.3 Theoretical discussion
  6.4 Experimental results
    6.4.1 Methods compared
    6.4.2 Burgers equation
      6.4.2.1 Setup
      6.4.2.2 Analysis
    6.4.3 Double pendulum
      6.4.3.1 Setup
      6.4.3.2 Analysis
    6.4.4 Compressible flow simulation
      6.4.4.1 Setup
      6.4.4.2 Analysis
    6.4.5 Structure topology optimization
      6.4.5.1 Setup
      6.4.5.2 Analysis
  6.5 Conclusion
  6.6 References
7. CONCLUSION
  7.1 Additional motivating examples
    7.1.1 Subset-based visualization
    7.1.2 Proxy construction for uncertainty quantification
  7.2 Out-of-sample Nyström approximation
  7.3 Objectives guide heuristics

LIST OF FIGURES
1.1 Eigenfaces example
2.1 Structure topology optimization visualization interface
2.2 Structure topology optimization design space
2.3 Diagram of the proposed subset-based visualization
2.4 Representative subset examples
2.5 Linked scatterplot and detail view
2.6 CelebA reconstruction and coverage tests
2.7 Alternative non-negative weighted subset views
2.8 Beam reconstruction using PCA compared to a subset reconstruction
2.9 GOMP-NN view of STO2 dataset
2.10 PCA view of STO2 dataset
3.1 Results for synthesized swirl dataset
3.2 Projection error in batch mode
3.3 Results for real-world datasets
3.4 Results using robust method
4.1 RandomNoisy dataset results
4.2 CPU dataset results
4.3 Adult dataset results
4.4 Forest dataset results
5.1 Synthetic Gaussian dataset details
5.2 Real datasets spectra
5.3 Deterministic results on the synthetic datasets
5.4 Probabilistic results on the synthetic datasets
5.5 PCA embedding of data showing different leverage scores
5.6 Deterministic results on the real datasets
5.7 Probabilistic results on the real datasets
6.1 Burgers equation results
6.2 Double pendulum results
6.3 Compressible flow simulation
6.4 Compressible flow simulation point samples
6.5 Compressible flow results
6.6 Structure topology optimization diagram
6.7 Structure topology optimization samples
6.8 Structure topology optimization results
6.9 Structure topology optimization comparison
6.10 Structure topology optimization samples

LIST OF TABLES
2.1 Shape reconstruction error test results for STO D1
2.2 Shape feature coverage test results using expert list for STO D1
2.3 Shape feature coverage test results using nonexpert list for STO D1
2.4 Relationships reported by domain experts
2.5 Prior known relationships assessments by domain experts
3.1 Data sets and parameters
4.1 Asymptotic train time, test time, and space
5.1 Real and synthetic datasets examined
6.1 Theoretical comparison
6.2 Summary of methods used

ACKNOWLEDGEMENTS

I thank my advisor, Ross Whitaker, for providing guidance towards interesting problems and, especially, for helping me to learn to notice and challenge the many assumptions one makes about one's own work in any technical endeavor. I also thank my committee, Mike Kirby, Jeff Phillips, Aditya Bhaskara, and Braxton Osting, for the many helpful discussions, lessons, and collaboration on interesting projects. I additionally thank two other faculty who served on my initial committee, Tom Fletcher and Sarang Joshi, who also provided encouragement and direction towards interesting problems. Thanks to the SCI Institute and the School of Computing for bringing together the outstanding, world-class people and facilities without which I could not have finished this achievement. Thanks to my parents, family, and friends for the ongoing support and encouragement for pursuing my dreams. Thanks to my kids, Mark, Hannah, and Clara, for always putting a smile on my face, especially on the tough days. Most of all, my love and thanks to my wife Becca, to whom this work is dedicated, for her unending support and love.

CHAPTER 1
INTRODUCTION

This dissertation is about analyzing and visualizing datasets using various basis selection techniques for matrix approximation. Here we use the phrase basis selection to refer to a set of low-rank matrix approximation techniques. That basis set can be an orthogonal basis of the data space, a nonorthogonal set of basis elements, or, in some cases, a subset of the data matrix. To be precise, consider a data matrix $A \in \mathbb{R}^{d \times n}$ with $n$ columns, or data points, each with $d$ elements.
We seek an approximation $B \in \mathbb{R}^{p \times m}$, where $p \le d$ and $m \le n$, which minimizes an objective function $g(A, B)$ measuring the underlying task-specific goal. A well-known example is to find the best rank-$k$ approximation of $A$, where we set $p = d$ and $m = k$, with $k \le \operatorname{rank}(A)$, and we seek to minimize $g(A, B) = \|A - B B^{\top} A\|_F^2$ subject to $B^{\top} B = I_k$. This type of basis selection has a well-known solution that can be efficiently computed: take the top $k$ left singular vectors of a singular value decomposition (SVD), also commonly referred to as a principal component analysis (PCA) [1]. Another example is to select a subset of matrix columns that optimally spans the column space of $A$, and use that subset in place of the full matrix. In this case, we set $p = d$ and select $m$ columns from $A$, where $m \ll n$, to minimize $g(A, B) = \|A - B B^{+} A\|_F^2$, with $B^{+}$ indicating the Moore-Penrose pseudoinverse of $B$. This is commonly referred to as the column subset selection problem (CSSP) and, in contrast to PCA, has been shown to be NP-complete [2, 3]. In this dissertation, we will consider a variety of basis selection strategies with differing objectives, computability, and approximation guarantees. We use the very general terms analyze and visualize datasets to refer to a range of analytical tasks that include visual summarization, proxy function construction for high-cost simulations, nonlinear dimension reduction, and other related tasks, each with different objectives for the corresponding approximation matrix $B$. The goals of the analytical tasks we will explore here can be generally described in two categories: improving human interpretability and addressing computational cost.
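As a concrete illustration of these two objectives, the following minimal NumPy sketch (not from the dissertation; the matrix sizes and the uniformly random column choice are illustrative assumptions) evaluates the PCA reconstruction error and the error of a naive column subset. A principled CSSP algorithm tries to close the gap between the two.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 50, 200, 5
A = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))   # roughly rank-k data
A += 0.01 * rng.standard_normal((d, n))                          # plus a little noise

# Rank-k approximation: B holds the top-k left singular vectors (the PCA basis).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :k]                                         # satisfies B^T B = I_k
pca_err = np.linalg.norm(A - B @ B.T @ A, "fro") ** 2

# Column subset selection: C holds m actual columns of A (chosen naively here).
m = k
C = A[:, rng.choice(n, size=m, replace=False)]
css_err = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, "fro") ** 2

print(f"PCA objective  ||A - B B^T A||_F^2 = {pca_err:.4f}")
print(f"CSSP objective ||A - C C^+ A||_F^2 = {css_err:.4f}  (never below the PCA value)")
```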
1.1 Basis selection for improved human interpretability

Dimension reduction (DR) has been an important tool for improving human interpretability of high-dimensional datasets [4]. A typical DR method seeks an approximation of size $p \ll d$, $m = n$, ranging from the SVD objective to various nonlinear or statistically motivated objectives. The dimension-reduced data can then be used, for example, in a 2D or 3D visualization to improve understanding of the data or to aid in a domain-specific task. While many DR techniques are optimal in terms of data compression or in creating a useful 2D embedding, one drawback to a large class of DR techniques is that the basis in which the dimension-reduced data are presented is abstract by design and can therefore be difficult to interpret. For example, while PCA has provably optimal properties in terms of variance preservation and squared reconstruction error, it can be difficult in many cases to understand what each of the abstract principal component directions represents. Here we present an alternative dimension reduction technique, which purposely focuses on selecting a DR basis that is directly interpretable. The technique relies on the selection of a representative subset of the dataset. This subset selection uses $p = d$ and $m \ll n$ with a variety of potential objectives to guide the basis selection. The proposed approach typically requires a larger number of basis elements to be selected (i.e., 6 to 12 instead of 2 or 3), which results in a different choice of visualization technology: parallel coordinates instead of a 2D or 3D scatterplot. The number of basis elements is somewhat limited by the amount of display space of the visualization, and we discuss such decisions in detail in the appropriate chapters.

1.1.1 Dimension reduction

One of the most widely used DR techniques is principal component analysis (PCA) [1]. PCA seeks a linear projection to a lower dimension that maximally preserves the variance of the dataset and minimizes the reconstruction error. We discussed this objective above, but here we would consider the matrix $BA \in \mathbb{R}^{p \times n}$ the approximation matrix instead of just $B \in \mathbb{R}^{p \times d}$. Other methods perform DR with alternative goals in mind. For example, independent component analysis (ICA) [5] can perform DR with the goal of maximizing the statistical independence of the estimated components, and multidimensional scaling (MDS) [6, 7] can perform DR with the goal of optimally preserving interobject distances, computed in the original space with a chosen distance metric. When MDS is used with a so-called nonlinear distance metric in the original space, the resulting DR is nonlinear. Many other nonlinear DR techniques, including nonlinear extensions of PCA, such as kernel PCA [8], or nonlinear geometric DR methods, such as Isomap [9] and local linear embedding (LLE) [10], have been used to better model the geometric relationship of the data. These approaches can provide arguably better representations of data on nonlinear manifolds. Likewise, probabilistic nonlinear DR methods, such as stochastic neighbor embedding (SNE) [11], its t-distribution variant t-SNE [12], and parametric embedding (PE) [13], perform DR with the goal of preserving similarity, but do so by minimizing the difference between a probabilistic model (such as a Gaussian or Student t-distribution) of the points in the original and dimension-reduced spaces. All these DR methods are commonly used to present the data in a 2D or 3D scatterplot and related types of visual encoding [12]. This type of visual presentation is useful for observing clusters and other spatial patterns in the data and, especially when combined with other linked views, can be quite powerful for understanding the dataset as a whole. A drawback to the DR approaches described above is that the dimensions or axes onto which the dataset is projected are often abstract and can be difficult to understand directly. Improving DR from an interpretation standpoint has been addressed in the literature [14, 15, 16]. These methods aid in the use of standard DR techniques. For example, [14] and [15] both provide important tools for either finding the right point of view of the DR scatterplot or enhancing the choice of DR parameters for existing systems. In contrast, [16] automatically selects a subset to represent clusters in the DR. Even with this assisted tooling, there are also philosophical concerns with interpreting the abstract dimensions as concrete ideas [17].

1.1.2 Subset selection

One approach to alleviate the concerns raised about traditional DR is to use a subset of the elements themselves to represent the entire dataset. A subset of the data matrix is another type of basis selection and can aid in human understanding. This idea has been explored in both the data mining and visualization communities. In data mining, for example, [17] discusses the need for matrix decompositions that take human interpretation into account, and proposes the CUR matrix decomposition. A similar procedure presented in [18] and implemented in [19], known as the interpolative decomposition (ID), argues for a decomposition that interpolates among columns from the input data to preserve the original structure.
CUR, ID, and related methods are presented as alternatives to more popular approaches such as the SVD. The bases found using CUR and ID are more naturally understood than the singular vectors of an SVD. One clear example of this is the so-called eigenfaces [20] approach to face recognition. While the eigenfaces decomposition, shown in Figure 1.1, has many nice computational aspects for face recognition itself, trying to make sense of what features each eigenface represents can prove challenging. The idea of using a subset for DR is important and has been explored in various forms [21, 22, 23, 24, 25, 26]. In [26], the authors use examples to define the extent of an embedding space because the examples provide an undiluted representation of the underlying set. In another set of examples, [16] and [21] both use a subset of the dataset to represent the contents of data clusters or to identify points in a 2D embedded layout, because the subset presents a natural summary of the data without any confusing abstraction or obfuscation. The works [22, 23, 24, 25] all use subsets because of their numerical closeness to the remaining data elements, allowing for the creation of powerful embedding methods. Subsets are a powerful representation because they contain details of the dataset, they avoid abstraction, and, when selected appropriately, they can accurately represent their neighbors. The authors in [23] use greedy residual-based sampling to select the subset and PE, referred to above, to create a 2D embedding. Least square projection (LSP) [22] and part-linear multidimensional projection (PLMP) [24] use a subset of the data to learn a linear low-dimensional embedding, which is then applied to the full dataset. LSP is motivated by increasing the DR quality by limiting the learned projection to a set of control points, while PLMP is primarily motivated by computational challenges for extremely high-dimensional datasets. LSP and PLMP both use clustering methods (k-medoids and bisecting k-means) to select the control points. Local affine multidimensional projection (LAMP) [25] provides for user selection of the subset and then uses that subset to produce a locally affine projection for the full dataset. Another interactive technique is presented in [26], where the user can interactively select basis extents from specific samples in the dataset, for the purpose of controlling how the scatterplot data are displayed. Design galleries [21] is another interesting example of using subsets to improve human interpretation of a dataset. The authors in [21] present a 2D embedded layout and summarize the visual content of clusters or positions in the embedding using small multiples of a subset of the full dataset. This dissertation will present work that makes use of a subset-selection basis that improves human interpretation of a dataset and directly addresses some of the concerns raised above.

1.2 Basis selection to address computational burden

When using a selected basis in place of the original matrix in a computation, there is typically some benefit in terms of efficiency from the simple fact that the basis is smaller in size (i.e., some $p \ll d$ or $m \ll n$) than the full matrix. Here we will discuss this simple scenario as well as more nuanced instances where the matrix approximation provides computational advantages in more indirect ways. We consider the computational advantages these methods provide in two general applications: machine learning and simulation science.
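A minimal sketch of this simple scenario, under stated assumptions (the sizes, the uniformly sampled subset, and all names here are illustrative, not anything specified in the dissertation): once a small basis $B$ and its coefficients are stored, a downstream computation such as a product with $A^{\top}A$ can be carried out without touching the full matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, m = 500, 5000, 50
A = rng.standard_normal((d, m)) @ rng.standard_normal((m, n))   # (nearly) rank-m data

cols = rng.choice(n, size=m, replace=False)      # stand-in for a real selection rule
B = A[:, cols]                                   # basis: m actual data columns, d x m
W = np.linalg.lstsq(B, A, rcond=None)[0]         # coefficients so that A ~ B @ W

# Downstream example: a product with A^T A done entirely through B and W.
x = rng.standard_normal(n)
exact = A.T @ (A @ x)                            # touches the full d x n matrix
approx = W.T @ (B.T @ (B @ (W @ x)))             # touches only B (d x m) and W (m x n)

rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"relative error: {rel_err:.2e}")
print(f"entries stored: full A = {A.size}, basis B plus coefficients W = {B.size + W.size}")
```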
1.2.1 Machine learning

1.2.1.1 Nyström approximation in kernel-based methods

Kernel-based learning methods have gained popularity in the last few decades. Specifically, approaches such as kernel PCA [27], spectral clustering [28], and many others have been used successfully on a variety of problems; see, e.g., [29, 30, 31]. However, like other nonparametric learning models, their application to large datasets can be limited by the need to retain all the training data. One of the most successful approximation techniques for kernel-based learning, generally referred to as the Nyström approximation, was introduced in [32] and further analyzed in [33, 34]; it uses a small subset of the data to approximate the full Gram matrix. The primary difficulty with the Nyström approximation is selecting the right subset: if the subset is too large, unneeded computational cost is incurred, and if the subset is not representative, too much error is incurred. Because of this tradeoff, there has been significant work in the area of Nyström subset selection (see, e.g., [35, 36, 37, 38, 39]), where different methods have been used to select a Nyström basis $B \in \mathbb{R}^{d \times m}$, where $B$ is a subset of the original data matrix $A$. In contrast to previous work, this dissertation will present some novel approaches to constructing an appropriate Nyström basis by using an optimization formulation with the Nyström approximation loss as the objective function, resulting in a matrix approximation $B \in \mathbb{R}^{d \times m}$ whose columns do not have to be a subset of $A$. By removing the subset constraint, the resulting basis reduces the reconstruction error of the Nyström approximation.

1.2.1.2 Random-projection approximation methods

Recently, there has been significant effort to develop techniques that make use of random projections to estimate the lifting function in a typical kernel-based method directly [40, 41, 42, 43]. These methods can be thought of as a type of random basis for kernel approximation, where instead of storing a subset of the data from the input set, a set of random directions is retained for computing the nonlinear lifting. In other words, $A$ in this case is a set of $n$ points in a potentially infinite-dimensional reproducing kernel Hilbert space (RKHS), and a method like [40] generates a finite approximation $A' \in \mathbb{R}^{p \times n}$. These types of approaches are particularly useful in a streaming setting because the random basis is unbiased, making it less susceptible to dramatic changes in the data stream, which can be a concern for a Nyström approximation (i.e., the Nyström samples may need to adapt as the data change). Consequently, we also present here an adaptation of this basis selection approach appropriate to the streaming setting. The work presented introduces a second approximation matrix $B \in \mathbb{R}^{p \times m}$, $m \ll n$, that approximates $A'$ and can be used to estimate KPCA projections. The approach improves the computational reach of a random-projections-based approximation to KPCA by reducing the amount of space needed.
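A minimal sketch of the two kernel approximations discussed above, assuming an RBF kernel and uniformly sampled landmark points (the dissertation's contribution is precisely a better choice of the Nyström basis; the uniform choice here is only a placeholder):

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    # k(x, y) = exp(-gamma * ||x - y||^2), evaluated for all pairs of rows of X and Y.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
n, d, m, gamma = 1000, 10, 50, 0.5
X = rng.standard_normal((n, d))
K = rbf_kernel(X, X, gamma)                  # full n x n Gram matrix

# Nystrom: m landmark columns of K stand in for the whole matrix.
idx = rng.choice(n, size=m, replace=False)
C = rbf_kernel(X, X[idx], gamma)             # n x m
W = C[idx, :]                                # m x m
K_nys = C @ np.linalg.pinv(W) @ C.T          # rank-m Nystrom approximation of K

# Random Fourier features: data-oblivious random directions with z(x)^T z(y) ~ k(x, y).
p = 500
Omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, p))
b = rng.uniform(0.0, 2.0 * np.pi, size=p)
Z = np.sqrt(2.0 / p) * np.cos(X @ Omega + b)
K_rff = Z @ Z.T

for name, Kh in [("Nystrom", K_nys), ("RFF", K_rff)]:
    rel = np.linalg.norm(K - Kh, "fro") / np.linalg.norm(K, "fro")
    print(f"{name}: relative Gram-matrix error {rel:.3f}")
```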
The advantages of performing feature selection as a preprocessing step include removing redundant features and reducing computational needs in the later machine learning stages. Using a small subset of the columns for reconstruction often leads to more interpretable solutions to learning tasks that may be performed on the reduced data [18, 17, 47]. Using 8 a feature subset instead of less interpretable DR, such as PCA, is particularly important when later stages allow the possibility of feature interpretability as well, such as lasso regression [48] or decision trees [49] and related techniques. Specifically, we are interested in an approximation B ∈ Rd×m of the matrix A ∈ Rd×n , where the columns of B are subsets of A. A successful way to capture the notion above is via the so-called column subset selection problem (CSSP), which seeks to minimize k A − BB+ Ak2F . This dissertation will also present some novel algorithms and an analysis of CSSP, which can provide advantages in terms of human interpretation as well as in reducing computational burden. 1.2.2 Simulation science One of the major research challenges in simulation science is enabling a scientist's ability to quantify the effects of uncertainty of important simulation parameters. A common approach to understanding the effect of a parameter is to evaluate an ensemble of simulations for various parameter values. This approach is reasonable when multiple runs of a simulation model or experiment are easily obtained. Unfortunately, a simulation model or discretization that is more true to life, or has higher fidelity, requires additional computational resources due to, for example, an increased number of discrete elements or more expensive modeling of complex phenomena. For these reasons, a high-fidelity model simulation can incur significant computation cost, and repeating such a simulation for a sufficient number of times to understand parameter effects can quickly become infeasible. Recent work has introduced an effective solution to this problem by using multiple fidelities of simulation models where less-expensive, lower-fidelity versions of the highfidelity model are used to learn the parametric structure of a simulation. This structure is utilized to choose a small parameter ensemble at which high-fidelity runs are assembled, providing insight into the finer details and effects of the simulation parameters [50, 51]. One important decision in this allocation process is which of the parameter values should be used in the less costly low- and medium-fidelity simulations, and which of the values are worth the more costly high-fidelity simulation. This decision becomes extremely important for situations where it is computationally infeasible to run the high-fidelity simulation more than a small number of times. This dissertation presents work that reformulates this selection process as a subset selection problem, and shows improved high-fidelity 9 simulation proxy construction in terms of reconstruction error as compared to previous approaches. The approach considers a large set of lower cost low-fidelity simulations in the matrix A ∈ Rd×n , with the goal of selecting the best subset matrix B ∈ Rd×m , m n that minimizes the reconstruction error k A − BB+ Ak2F . Then we use the parameters corresponding to the best subset of A for the more expensive high-fidelity simulations. 
1.3 Overview and summary of contributions

The remainder of this dissertation describes a set of work built around the central idea of using basis selection for matrix approximation in a variety of different application areas. These various applications are presented as a set of papers prepared for the specific domain of interest, and each has either been published or is under review in the respective domain area. Overall, each method takes the same approach of selecting a representative basis and using that basis to either improve human interpretability or address computational problems.

1.3.1 Overview

We first present, in Chapter 2, a novel visualization approach based on a subsample basis. The subsample basis is selected using a modified form of dictionary learning. Using this carefully chosen subset as a basis, we visually summarize the remainder of the dataset using a parallel-coordinates-based visual encoding with the subset elements as the parallel coordinate axes. Because the subset must visually summarize the entire dataset, it is important that it be chosen carefully to represent the visual features of the dataset. Using a set of pilot studies and computational experiments, we show that the proposed dictionary-learning-based subset selection technique has specific benefits over comparable approaches in terms of reconstruction error and human-interpretable feature selection. We also use a user session with domain experts to establish the utility of the tool as a whole. Overall, we find that by carefully choosing a visual basis to represent a dataset, we are able to effectively summarize the dataset for the specific visualization tasks we explored.

In Chapter 3, we present a novel approach to Nyström approximation. Previous work on Nyström approximation has focused primarily on using subsets for Gram matrix approximation, but here we present an alternative approach that selects a new, previously unseen basis for the Nyström estimation. This basis is chosen by formulating a nonlinear least-squares problem around the Nyström estimation error and solving for a basis that minimizes the error. After finding the Nyström basis, the resulting approximation error is much smaller than corresponding subset-basis estimations. We demonstrate the method on a variety of synthetic and real-world datasets and show that the method is competitive in terms of reconstruction error if having a smaller Nyström basis is worth the upfront investment.

While the Nyström approximation has excellent approximation error characteristics, random-direction-based approximations for kernels are also useful when a data-oblivious approximation is important. This can be particularly useful in a streaming setting where data modes could change over time. In Chapter 4, we present a novel form of streaming kernel PCA which takes advantage of this alternative kernel approximation method by combining it with established matrix sketching techniques. We show results on a variety of synthetic and real-world datasets, and demonstrate important improvements in space and runtime as compared to competing methods.

Chapter 5 presents novel theoretical and experimental work in the area of column subset selection, which is a form of basis selection where the subset is the basis. Constructing a subsample basis can be important for preprocessing of datasets for more expensive downstream machine learning models.
Many models will allow a user to connect classification or regression results back to specific features of the dataset. However, such a connection is difficult to understand unless the dataset features used as input to the model are easily understood. Subset selection of features is one way to maintain human interpretability of the dataset features while still gaining some computational advantages. We show a set of improvements to previous subset selection techniques as well as a novel theoretical analysis of the general deterministic subset selection problem.

Uncertainty quantification of a simulation is difficult when the simulation is sufficiently large or computationally intensive to make repeated evaluations prohibitive. Previous work has proposed the idea of constructing a proxy from a carefully chosen set of simulation parameters, where guidance for that parameter choice is given by running a cheaper, lower-fidelity version of the simulation many times. In Chapter 6, we present a novel experimental analysis of subset-selection techniques for selecting this representative set of parameters in this subset-based proxy construction. Here the subset basis being selected is chosen to optimally represent other simulations that are too expensive to run, using the lower-fidelity simulation set. Dominant in those experiments is the dictionary-learning-based subset selection technique we explored in Chapter 2, which we found empirically to be superior in many ways to previous subsampling techniques for this problem. We explore this comparison over a variety of different simulation problems.

1.3.2 Summary

Chapters 2 through 6 of this dissertation are reprints of published papers and papers under review.
• Chapter 2 presents a novel visualization approach based on a subsample basis, including a user-based analysis of how well different types of basis selection techniques work in terms of interpretability [52],
• Chapter 3 introduces a novel approach to Nyström approximation for nonlinear dimension reduction using a carefully chosen basis of out-of-sample elements for the Nyström matrix [53],
• Chapter 4 shows an alternative random-directions-based approach to approximating kernel PCA in a streaming setting using a well-known matrix approximation algorithm [54],
• Chapter 5 discusses some important innovations in deterministic column selection using statistical leverage scores [55],
• Chapter 6 discusses the allocation strategy of a multifidelity simulation technique as a subset selection problem, resulting in significant improvements to a simulation proxy function [56].

1.4 References

[1] I. T. Jolliffe, Principal Component Analysis. New York, NY, USA: Springer Verlag, 1986.
[2] W. J. Welch, "Algorithmic complexity: Three NP-hard problems in computational statistics," Journal of Statistical Computation and Simulation, vol. 15, no. 1, pp. 17-25, 1982.
[3] Y. Shitov, "Column subset selection is NP-complete," arXiv:1701.02764, 2017. [Online]. Available: https://arxiv.org/abs/1701.02764
[4] S. Liu, D. Maljovec, B. Wang, P. Bremer, and V. Pascucci, "Visualizing high-dimensional data: Advances in the past decade," Eurographics Conference on Visualization (EuroVis), 2015.
[5] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. John Wiley & Sons, 2001.
[6] J. B. Kruskal and M. Wish, Multidimensional Scaling. Sage, 1978, vol. 11.
[7] S. Ingram, T. Munzner, and M. Olano, "Glimmer: Multilevel MDS on the GPU," IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 2, pp. 249-261, 2009.
[8] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[9] J. B. Tenenbaum, V. De Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[10] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323-2326, 2000.
[11] G. E. Hinton and S. T. Roweis, "Stochastic neighbor embedding," in Advances in Neural Information Processing Systems, 2002, pp. 833-840.
[12] L. Van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, 2008.
[13] T. Iwata, K. Saito, N. Ueda, S. Stromsten, T. D. Griffiths, and J. B. Tenenbaum, "Parametric embedding for class visualization," Neural Computation, vol. 19, no. 9, pp. 2536-2556, 2007.
[14] D. B. Coimbra, R. M. Martins, T. T. Neves, A. C. Telea, and F. V. Paulovich, "Explaining three-dimensional dimensionality reduction plots," Information Visualization, vol. 15, no. 2, pp. 154-172, 2016.
[15] R. M. Martins, D. B. Coimbra, R. Minghim, and A. C. Telea, "Visual analysis of dimensionality reduction quality for parameterized projections," Computers & Graphics, vol. 41, pp. 26-42, 2014.
[16] P. Joia, F. Petronetto, and L. G. Nonato, "Uncovering representative groups in multidimensional projections," in Computer Graphics Forum, vol. 34, no. 3. Wiley Online Library, 2015, pp. 281-290.
[17] M. W. Mahoney and P. Drineas, "CUR matrix decompositions for improved data analysis," Proceedings of the National Academy of Sciences, vol. 106, no. 3, pp. 697-702, 2009.
[18] H. Cheng, Z. Gimbutas, P.-G. Martinsson, and V. Rokhlin, "On the compression of low rank matrices," SIAM Journal on Scientific Computing, vol. 26, no. 4, pp. 1389-1404, 2005.
[19] P.-G. Martinsson, V. Rokhlin, Y. Shkolnisky, and M. Tygert, "ID: A software package for low-rank approximation of matrices via interpolative decompositions, version 0.2," 2008.
[20] M. A. Turk and A. P. Pentland, "Face recognition using eigenfaces," in Proceedings of Conference on Computer Vision and Pattern Recognition. IEEE, 1991, pp. 586-591.
[21] J. Marks, B. Andalman, P. A. Beardsley, W. Freeman, S. Gibson, J. Hodgins, T. Kang, B. Mirtich, H. Pfister, W. Ruml et al., "Design galleries: A general approach to setting parameters for computer graphics and animation," in Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques. ACM Press/Addison-Wesley Publishing Co., 1997, pp. 389-400.
[22] F. V. Paulovich, L. G. Nonato, R. Minghim, and H. Levkowitz, "Least square projection: A fast high-precision multidimensional projection technique and its application to document mapping," Transactions on Visualization and Computer Graphics, vol. 14, no. 3, pp. 564-575, 2008.
[23] Y. Chen, L. Wang, M. Dong, and J. Hua, "Exemplar-based visualization of large document corpus," Transactions on Visualization and Computer Graphics, vol. 15, no. 6, 2009.
[24] F. V. Paulovich, C. T. Silva, and L. G. Nonato, "Two-phase mapping for projecting massive data sets," IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1281-1290, 2010.
[25] P. Joia, D. Coimbra, J. A. Cuminato, F. V. Paulovich, and L. G. Nonato, "Local affine multidimensional projection," Transactions on Visualization and Computer Graphics, vol. 17, no. 12, pp. 2563-2571, 2011.
Endert, "Interaxis: Steering scatterplot axes via observation-level interaction," Transactions on Visualization and Computer Graphics, vol. 22, no. 1, pp. 131-140, 2016. [27] B. Schölkopf, A. Smola, and K.-R. Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998. [28] A. Y. Ng, M. I. Jordan, Y. Weiss et al., "On spectral clustering: Analysis and an algorithm," Advances in Neural Information Processing Systems, vol. 2, pp. 849-856, 2002. [29] J. Wang, J. Lee, and C. Zhang, "Kernel trick embedded gaussian mixture model," in Algorithmic Learning Theory. Springer, 2003, pp. 159-174. [30] P. Snape and S. Zafeiriou, "Kernel-pca analysis of surface normals for shape-fromshading," in Proceedings of Conference on Computer Vision and Pattern Recognition. IEEE, 2014, pp. 1059-1066. [31] E. Schubert, A. Zimek, and H.-P. Kriegel, "Generalized outlier detection with flexible kernel density estimates," in Proceedings of 14th International Conference on Data Mining. SIAM, 2014, pp. 542-550. 14 [32] C. Williams and M. Seeger, "Using the nyström method to speed up kernel machines," in Proceedings 14th Annual Conference on Neural Information Processing Systems, no. EPFL-CONF-161322, 2001, pp. 682-688. [33] P. Drineas and M. W. Mahoney, "On the nyström method for approximating a gram matrix for improved kernel-based learning," Journal of Machine Learning Research, vol. 6, pp. 2153-2175, 2005. [34] A. Gittens and M. W. Mahoney, "Revisiting the nyström method for improved largescale machine learning," Journal of Machine Learning Research, vol. 28, no. 3, pp. 567- 575, 2013. [35] M. Ouimet and Y. Bengio, "Greedy spectral embedding," in Proceedings 10th International Workshop Artificial Intelligence and Statistics, 2005, pp. 253-260. [36] A. J. Smola and B. Schökopf, "Sparse greedy matrix approximation for machine learning," in Proceedings 17th International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 2000, pp. 911-918. [37] A. J. Smola, O. L. Mangasarian, and B. Schölkopf, "Sparse kernel feature analysis," in Classification, Automation, and New Media. Springer, 2002, pp. 167-178. [38] M. E. Tipping, "Sparse kernel principal component analysis," in Proceedings Annual Conference on Neural Information Processing Systems, 2001, pp. 633-639. [39] C. Alzate and J. A. Suykens, "Kernel component analysis using an epsilon-insensitive robust loss function," Transactions on Neural Network Learning Systems, vol. 19, no. 9, pp. 1583-1598, 2008. [40] A. Rahimi and B. Recht, "Random features for large-scale kernel machines," in Proceedings Annual Conference on Neural Information Processing Systems, 2007, pp. 1177- 1184. [41] --, "Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning," in Proceedings Annual Conference on Neural Information Processing Systems, 2009, pp. 1313-1320. [42] Q. Le, T. Sarlós, and A. Smola, "Fastfood-approximating kernel expansions in loglinear time," in Proceedings International Conference on Machine Learning, 2013. [43] D. Lopez-Paz, S. Sra, A. Smola, Z. Ghahramani, and B. Schölkopf, "Randomized nonlinear component analysis," arXiv preprint arXiv:1402.0119, 2014. [44] T. F. Chan and P. C. Hansen, "Some applications of the rank revealing qr factorization," Journal of Scientific and Statistical Computing, vol. 13, no. 3, pp. 727-741, 1992. [45] M. Gu and S. C. 
Eisenstat, "Efficient algorithms for computing a strong rank-revealing QR factorization," Journal of Scientific Computing, vol. 17, no. 4, pp. 848-869, 1996. [46] G. H. Golub and C. F. Van Loan, Matrix Computations. JHU Press, 2012. 15 [47] D. Anderson, C. Melgaard, S. Du, M. Mahoney, K. Wu, and M. Gu, "Spectral gap error bounds for improving CUR matrix decomposition and the Nyström method," Proceedings International Conference on Artificial Intelligence and Statistics, vol. 38, pp. 19-27, 2015. [48] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), pp. 267-288, 1996. [49] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5-32, 2001. [50] A. Narayan, C. Gittelson, and D. Xiu, "A stochastic collocation algorithm with multifidelity models," Journal of Computing, vol. 51, no. 3, pp. 2005-2035, 2013. [51] X. Zhu, A. Narayan, and D. Xiu, "Computational aspects of stochastic collocation with multifidelity models," Journal of Uncertainty Quantification, vol. 2, no. 1, pp. 444-463, 2014. [52] D. J. Perry, V. Keshavarzzadeh, S. Y. Elhabian, R. M. Kirby, M. Gleicher, and R. T. Whitaker, "Visualization of topology optimization designs with representative subset selection," Submitted Transactions on Visualization and Computer Graphics, 2017. [53] D. J. Perry, B. Osting, and R. T. Whitaker, "Nystrom sketches," in Accepted Proceedings Joint European Conference Machine Learning and Knowledge Discovery in Databases. Springer International Publishing, 2017. [54] M. Ghashami, D. J. Perry, and J. Phillips, "Streaming kernel principal component analysis," in Proceedings Conference Artificial Intelligence and Statistics, 2016, pp. 1365-1374. [55] D. J. Perry and R. T. Whitaker, "Augmented leverage score sampling with bounds," in Proceedings Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer International Publishing, 2016, pp. 543-558. [56] D. J. Perry, R. M. Kirby, A. Narayan, and R. T. Whitaker, "Allocation strategies for high fidelity models in the multifidelity regime," Submitted Journal of Uncertainty Quantification, 2017. [57] F. Samaria and A. Harter, "Parameterisation of a stochastic model for human face identification," Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, December 1994. 16 Figure 1.1. Eigenfaces example: The first six PCA components of the Olivetti face dataset from AT&T Labs Cambridge [57]. While these components are useful computationally for face recognition and related tasks, it can be difficult for a human to interpret what these abstract components represent. CHAPTER 2 VISUALIZATION OF TOPOLOGY OPTIMIZATION DESIGNS WITH REPRESENTATIVE SUBSET SELECTION This chapter is a preprint of, Visualization of topology optimization designs with representative subset selection, Daniel J. Perry, Vahid Keshavarzzadeh, Shireen Y. Elhabian, Robert M. Kirby, Michael Gleicher, and Ross T. Whitaker, submitted to Transactions in Visualization and Computer Graphics in 2017, used with permission of the authors. 18 1 Visualization of topology optimization designs with representative subset selection Daniel J Perry, Vahid Keshavarzzadeh, Shireen Y Elhabian, Robert M Kirby, Member, IEEE, Michael Gleicher, Member, IEEE, Ross T Whitaker, Fellow, IEEE, Abstract-An important new trend in additive manufacturing is the use of optimization to automatically design industrial objects, such as beams, rudders or wings. 
Topology optimization, as it is often called, computes the best configuration of material over a 3D space, typically represented as a grid, in order to satisfy or optimize physical parameters. Designers using these automated systems often seek to understand the interaction of physical constraints with the final design and its implications for other physical characteristics. Such understanding is challenging because the space of designs is large and small changes in parameters can result in radically different designs. We propose to address these challenges using a visualization approach for exploring the space of design solutions. The core of our novel approach is to summarize the space (ensemble of solutions) by automatically selecting a set of examples and to represent the complete set of solutions as combinations of these examples. The representative examples create a meaningful parameterization of the design space that can be explored using standard visualization techniques for high-dimensional spaces. We present evaluations of our subset selection technique and show that the overall approach addresses the needs of expert designers.

Index Terms: topology optimization, interpolative decomposition, PCA

1 INTRODUCTION

Designers in various industries (e.g., machinery construction, 3D printing, automotive and aerospace) are often tasked with creating products that meet certain constraints, such as making the most efficient use of material (and reducing weight or manufacturing costs), achieving a specified level of thermal conductivity, or bearing a specified mechanical load. Meanwhile, additive manufacturing (AM) has removed the limitations of traditional machine-tool-based manufacturing and opened the doors for radically new physical designs. In AM, material is typically distributed within a particular design space, for a given set of design parameters and constraints, to optimize product performance criteria. As an emerging component of the engineering design process, structural topology optimization (STO) has offered a principled approach for addressing sets of competing design goals by optimizing material distribution in the given design space subject to problem-specific parameters and constraints [1], [2]. Recent advances in manufacturing technologies have made the STO approach to product design even more promising and important. For example, additive manufacturing can more directly map such optimal material layouts to physical realizations [3], [4]. While material placement is found automatically, other design parameters, such as boundary conditions, external force parameters, etc., cannot be set arbitrarily because the number and range of values of these parameters can significantly impact solution accuracy and convergence. Designers using topology optimization must also understand the implications of different parameters and other physical constraints to properly select designs and incorporate potentially innovative ideas from the optimization solutions into more traditional design pipelines. We refer to this choice of parameters and resulting STO design possibilities as the design space. However, building a more holistic understanding is challenging due to the large design space resulting from the large number of possible parameters and the high dimensionality of the space in which optimized topologies live (e.g., the space of all possible designs).
Furthermore, for domain experts there is a surprising variation in designs from small changes of the design parameters that is often difficult for a designer to anticipate, thereby complicating the process of understanding optimal designs and incorporating human experience and knowledge into the engineering design pipeline. The nonlinearity and nonconvexity of the topology optimization problem limit the availability of closed-form, analytical relationships between input parameters and optimal designs [5]. Therefore, ensembles of optimized topologies are often generated to aid in exploring the underlying high-dimensional solution space. Despite the computational burden of obtaining each structurally optimal topology, typically a sufficiently space-filling sampling of the design parameter space is used to find an ensemble of representative design solutions. Exploration of such an ensemble can be performed by viewing ensemble elements, i.e., optimized topologies, each in turn. However, this task can become labor-intensive and time-consuming, and can make gleaning general ensemble trends difficult. One typical approach to analyzing ensembles in this setting, in hopes of understanding the full space, is to deploy a dimensionality reduction technique, e.g., principal component analysis (PCA) [5], to reduce the associated degrees of freedom (i.e., topology space dimensions) in representing each ensemble element. This type of analysis is convenient and effective in terms of representation error (being optimal in the ℓ2 reconstruction error), but the resulting basis functions are often difficult to understand in terms of the original designs. Furthermore, the reduced dimensions often do not provide an adequate representation of important features in the data (e.g., they are usually abstractions summarizing major trends rather than enumerating all or most possibilities). However, feature coverage information, which relates information about the individual ensemble elements, is important because those elements contain information about design details and strategies (often tied to the design parameters and application-specific performance criteria) that are often critical to an engineer's understanding of the design space.

(Affiliations: D. Perry, V. Keshavarzzadeh, S. Elhabian, R. Kirby, and R. Whitaker are with the SCI Institute, University of Utah, Salt Lake City, UT 84112. E-mail: {dperry@cs,vahid@sci,shireen@sci,kirby@cs,whtaker@cs}.utah.edu. M. Gleicher is with the Department of Computer Sciences, University of Wisconsin - Madison, Madison, WI 53706. E-mail: gleicher@cs.wisc.edu. Manuscript received August 31, 2017.)

Fig. 1. A structure topology optimization (STO) design dataset shown using a representative subset based visualization (left side), along with supporting linked views (right side). The main (left) view shows a subset chosen for optimal representation with the remaining data points represented by the lines in the parallel coordinate view (colored by average stress of the design). The top right shows the relationship between a single parameter, position, and the compliance of the design. The bottom right view shows a close-up view between two of the example axes, allowing a more detailed inspection of any relationships. In our expert case study, this view allowed the domain experts to identify an interesting relationship between position, compliance, and the corresponding shapes by viewing all this information together.
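To make the preceding PCA-based reduction concrete, the following is a minimal Python/NumPy sketch (not the authors' implementation) of how an ensemble of flattened designs might be reduced with PCA; the array name `designs` and the shapes are illustrative assumptions only.

    # Minimal sketch: PCA of a design ensemble, assuming `designs` is an
    # (n, d) array of flattened STO designs (n samples, d grid cells).
    import numpy as np

    def pca_basis(designs, k):
        """Return the mean, the top-k principal components, and coordinates."""
        mean = designs.mean(axis=0)
        centered = designs - mean
        # Economy SVD: rows of Vt are the principal directions ("eigen-designs").
        U, S, Vt = np.linalg.svd(centered, full_matrices=False)
        components = Vt[:k]               # (k, d) abstract basis elements
        coords = centered @ components.T  # (n, k) low-dimensional coordinates
        return mean, components, coords

    # The components minimize L2 reconstruction error, but each one mixes many
    # design features, which is what makes them hard to read as actual designs.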
To summarize, some of the problems encountered by a designer using STO which we plan to address include:
• a high-dimensional design space due to a large number of parameters and variables,
• an unintuitive connection between design parameters and resulting optimal (non-unique) designs, and
• a lack of interpretability in traditional dimension reduction, such as PCA.
In this paper, we propose a new visualization technique that is meant to address the challenges of the topology optimization visualization problem. The proposed method uses an automatically chosen subset of the ensemble itself as both a summary of the dataset and a basis from which to view the remaining data points. This summary provides a better representative coverage of the features in the dataset than alternatives such as PCA. Using the subset as a basis, the full dataset is represented as a combination of these example points, resulting in a new coordinate system that can be visualized using conventional visualization techniques. To make this clear, an example of our final visualization tool is shown in Figure 1, although we will explain the various components of the tool throughout the paper. The choice of the subset and the method of combination become important in this context, and we provide some methodology for those choices. This results in a novel view of the topology optimization datasets that has the summarizing characteristics favored in PCA but with a basis set that is easy to relate back to the original data ensemble and design space. We evaluate the proposed approach by considering several case studies as part of a visualization system, as well as specific studies validating the subset decisions. Other visualization work has examined individual design visualization [3], but to our knowledge, no one has previously explored visualization paradigms to understand the design space of an ensemble of topology optimization solutions. While our focus is topology optimization design spaces, we note that the proposed approach has potential implications in other domains with similar properties, such as simulation ensembles, large image datasets, and large text corpora. This work makes the following novel contributions:
• characterizing and addressing the problem of visualization of topology optimization designs,
• proposing a novel, subset-based method for visualizing high-dimensional data to aid in topology optimization, and
• novel evaluation methods and evaluation results in comparing subset selection for optimal design visualization.

2 TOPOLOGY OPTIMIZATION FOR INDUSTRIAL DESIGN

Structural topology optimization (STO) has emerged as a powerful tool in designing various high performance structures, from medical implants and prosthetics to jet engine components [6], [7], [8]. STO is different from shape optimization, in which the topology is predetermined and only the boundaries are optimized [9]. Therefore STO can be used to generate totally new design strategies with significantly better performance. STO has received significant attention in the past couple of decades due to the popularization of additive manufacturing [10]. Some complicated designs may not be realizable with traditional manufacturing techniques such as casting or machining, which have manufacturing constraints such as tool access or shape limitations associated with a molding process. In those cases, there has to be a compromise between optimal designs and manufacturable designs.
AM tackles these practical limitations and is capable of producing designs with complicated geometries and/or topologies. In addition, AM can be much faster than traditional techniques. Therefore, the combination of STO as the computational engine and AM as the manufacturing procedure is perceived as an emerging method for producing high performance designs that have not been realizable heretofore [11]. For the purposes of simplifying this discussion, we consider a certain class of prototypical STO problems in which the optimization finds the best material layout in the design space in order to maximize the stiffness (or, equivalently, minimize the deflection or compliance) of the structure subject to a material volume constraint. The response of the structure for a set of loading and boundary conditions is typically computed/simulated via the finite element method (FEM). The topology optimization yields a solution in the form of binary maps that indicate material placement.

Fig. 2. STO design space. Left: The loading on the right edge is parameterized by a vertical position, p, and a loading angle, θ, along with the filter size, ρ, which controls the length scale in the topology optimization. Right: Some of the more obvious connections between the force angle parameter choice, θ, and resulting STO designs. A shallow angle of force encourages the material to be centrally located into what we refer to as a beam or beam-like structure (right-top). As the angle becomes more inclined, the resulting design includes smaller and smaller connections, ultimately becoming lattice-like because the collection of smaller beams forms a kind of lattice (right-middle, -bottom).

Our experiments and evaluations will use solutions from the design of a cantilever beam that is clamped on the left edge and is under a static load on the right edge. The variable loading is parameterized by the vertical loading position, p, and the angle (θ) of the loading direction. The STO is further constrained by a filter size parameter, ρ, which determines the granularity of the material placements and effectively controls the size of the structures that are allowed to form in the solution. This design space is summarized on the left side of Figure 2. Some specific design results are shown on the right side of Figure 2. In order to facilitate discussion, we will introduce some colloquial terminology we will use throughout the paper to describe various aspects of designs in this space. We will refer to designs which include a primary structure as a beam or beam-like, similar to the example at the top of Figure 2, and we will refer to examples with a collection of smaller connecting structures as a lattice or lattice-like, with one example shown at the bottom of Figure 2. Along those lines, we will describe the design in the middle of Figure 2 as having lattice-like features with a beam structure along the top side of the design. Additional examples can be found in a later part of the paper in Figure 4, as well as in many of the visualizations throughout.

3 RELATED WORK

3.1 Dimension reduction

The proposed method falls in the general class of dimensionality reduction (DR) techniques, but is designed to address specific challenges of feature summarization and coverage in the topology optimization context. However, because it shares many characteristics with more general DR, we review that related work first to identify specific differences and similarities to existing methods.
An important aspect of our approach is to use examples from the dataset to summarize and represent the rest of the data. The idea of using a subset for DR is important and has been explored in other forms [12], [13], [14], [15], [16], [17]. In [17] they use examples to define the extents of an embedding space because the examples provide an undiluted representation of the underlying set. In another set of examples, [18] and [12] both use a subset of the dataset to represent the contents of data clusters or to identify points in a 2D embedded layout, because the subset presents a natural summary of the data without any confusing abstraction or obfuscation. The works [13], [14], [15], [16] all use subsets because of their numerical closeness to the remaining data elements, allowing for the creation of powerful embedding methods. Subsets are powerful representations because they contain details of the dataset, they avoid abstraction, and, when selected appropriately, they can accurately represent neighbors. One important goal of this work, which differentiates it from most DR approaches, is that we require a basis or axes after the DR transformation that easily relate back to the original data and provide explicit information on the existence of human-identifiable features (see Section 4.1 for a more precise treatment of this requirement). For the specific application to STO designs presented here, the axes should be easily related back to examples of STO designs, because these are physically meaningful and optimal, and become important for connecting designs and parameters. From the above discussion, subsets are clearly an excellent choice to satisfy this interpretability constraint of STO summarization. One of the most widely used DR techniques is principal component analysis (PCA) [19]. PCA seeks a linear projection to a lower dimension that maximally preserves the variance of the dataset. While PCA is optimal in linear representation, the resulting basis is often quite difficult to interpret and frequently has only a vague relationship to the individual input data samples. In this paper we will examine PCA with respect to the STO application domain and show examples from an expert case study where using PCA as the basis made otherwise simple tasks quite difficult. These same problems can also be found in many optimal basis or embedding techniques, such as independent component analysis (ICA) [20] and multidimensional scaling (MDS) [21], [22]. There are many nonlinear DR techniques, including nonlinear extensions of PCA, such as kernel PCA [23], or nonlinear geometric DR methods, such as Isomap [24] and locally linear embedding (LLE) [25], that have been used to better model the geometric relationship of the data. While these approaches have arguably better representations of data on nonlinear manifolds, they do not directly address the interpretation problem, and, indeed, the resulting parameterizations often have no physical interpretation. Likewise, probabilistic nonlinear DR methods, such as stochastic neighbor embedding (SNE) [26], its t-distribution variant t-SNE [27], and parametric embedding (PE) [28], provide no clear interpretation relative to the original ensemble. Improving DR from an interpretation standpoint has also been addressed in the literature [18], [29], [30]. While these methods aid in the use of standard DR techniques, they do not address the need for the axes to directly encode related human-identifiable features.
For example, [29] and [30] both provide important tools for either finding the right point of view of the DR scatterplot or enhancing the choice of DR parameters for existing systems. Unfortunately, both cases, as with PCA, fail to address the underlying problem that the DR axes obfuscate many of the human-identifiable features. [18] more closely addresses our needs by automatically selecting a subset to represent clusters in the DR. We compare to this general idea of clusters being representative of the data by using a k-medoids clustering for subset selection in some of our tests. The proposed approach is different in that it uses the subset, however it is determined, as the axes of the DR, while they perform a DR and then use a subset to represent the resulting clusters. Additionally, the subset selection method from [18] has some fundamental problems which we directly address in our subset methodology. Their method first computes a singular value decomposition (SVD), closely related to PCA, and then finds points similar to the singular vectors (PCA modes). While this will work for specific datasets where points happen to lie near modes, this will not always be the case (this is similar to expecting the points to always lie near an average of the points). We instead propose selecting a subset that directly minimizes this representation error.

3.2 Subset-based visualization

The proposed work is strongly motivated by previous visualization research that has demonstrated the effectiveness of data subsets for understanding the structure of a population. For instance, [14] uses greedy residual-based sampling to select the subset and PE, referred to above, to create two-dimensional embeddings. Another set of related techniques is least square projection (LSP) [13] and part-linear multidimensional projection (PLMP) [15], which use a subset of the data to learn a linear low-dimensional embedding. This is then applied to the full dataset. LSP is motivated by increasing the DR quality by limiting the learned projection to a set of control points, while PLMP is primarily motivated by computational challenges for extremely high-dimensional datasets. LSP and PLMP both use clustering methods (k-medoids and bisecting k-means) to select the control points. To understand the effectiveness of this, we evaluate k-medoids as a method for selecting a visualization subset and compare it to several other representation-motivated methods in this paper. Both LSP and PLMP result in a lower dimensional space whose axes are weighted combinations of the subset, similar to PCA. Local affine multidimensional projection (LAMP) [16] provides for user selection of the subset and then uses that subset to produce a locally affine projection for the full dataset. Another interactive technique was presented in [17], where the user can interactively select basis extents from specific samples in the dataset, for the purposes of controlling how the scatter plot data is displayed. The proposed method differs from both in that the exemplars themselves are the axes and are selected automatically (rather than by the user), allowing it to scale to large subsets and large ensembles. Additionally, the proposed method automatically selects the subset based on some specific motivating criteria.
Another point of difference is that [17] is focused entirely on a 2D scatter plot embedding, while we have used the proposed method almost entirely in higher-dimensional representations using alternative visualization strategies, such as parallel coordinates. Design galleries [12] presents a 2D embedded layout and summarizes the visual content of clusters or positions in the embedding using small multiples of a subset of the full dataset. This approach to using subsets for summarizing the feature content of data points is quite effective, but is limited to 2D and 3D embeddings, which are often insufficient to adequately represent the high-dimensional data. This same limitation of embeddings also applies to many of the DR approaches described above. In the proposed approach, the dimensionality is only limited by the number of examples that can be shown in a row at the chosen screen and image resolutions.

3.3 Structural topology optimization analysis

Because we are not contributing directly to the methods of solving STO problems, we restrict our discussion to relevant work that analyzes or visualizes this type of data (for a more complete review of STO in general see, for example, [2]). In the domain of visualization, [3] introduced a complete system for finding and visualizing individual STO solutions. They present an impressive array of STO design spaces and visualization results; however, their work focuses on the visualization of a single, complicated design space. Here, we are focused on understanding an ensemble of designs found by sampling different points in the parameter space or by different results from a stochastic process. The work in [5] also has some overlap with this paper, because they use ensembles of designs to learn data-driven models, relating the parameters to design solutions via a neural network. Their underlying motivation is to reduce the computational burden of finding new design solutions in the large design space. However, rather than learning a data-driven relationship, we are instead interested in visualizing the data in such a way as to allow the human designer to understand that relationship, and then apply that knowledge in future design problems. Additionally, [5] relies on PCA as the primary DR stage of their pipeline. This makes sense for a purely data-driven approach which does not address interpretability, and therefore has the previously described limitations.

Fig. 3. Diagram of the components of the proposed subset-based visualization using parallel coordinates. Component (A) shows the main parallel coordinate display that plots the computed weights of all data points on the chosen subset; by selecting and filtering this view, one gains an idea of what portion of the population is similar to the coordinate bases. Component (B) shows the subset axis figures. This subset represents the visual features of the dataset, and viewing the weights of the remaining points provides some intuitive idea of what each of the individual shapes consists of by considering the larger weights. Component (C) highlights a single coordinate axis, where the line positions along this axis are the weights of all points related to the highlighted subsample element.

4 SUBSET-BASED VISUALIZATION OF TOPOLOGY OPTIMIZATION DESIGNS

We propose a technique for visualizing a large collection of high-dimensional designs by first selecting a representative subset, using that subset as the visual basis, encoding the remaining data as linear combinations of those elements, and then displaying the resulting subset and encodings in standard visualization tools such as parallel coordinates [31] (or star coordinates [32], [33], scatter plot matrices [34], etc.). An example is shown in Figure 3, where we have highlighted the main components of the subset-based visualization. The two main components are the subset display at the bottom (marked (B)) and the parallel coordinate plot of the weights relating the other data points to that subset (marked (A)). Because these are the main components of the visualization, careful selection of the type of subset and the properties of the weight relationship becomes very important. We additionally highlight a single coordinate axis in (C), with line positions along the axis corresponding to the reconstruction weights of all points using that element.

4.1 Representative subset

The subset selection method can be motivated by different goals. For example, if we believe that data naturally stratifies into groups, a cluster-center-based subsample might be appropriate. In contrast, a random set of samples represents an unbiased subset. Because the typical screen size limits the number of basis elements that can be displayed (see Figure 3), it becomes critical to choose a subset that is both relatively small and that accurately represents the entire set of input data. For this reason, we formulate the representation goals of the subset decision precisely and then pursue optimum satisfaction of these goals. We define a subset as representative if linear combinations of that subset minimize the reconstruction error of the rest of the dataset.

If we express the input data in matrix form, X \in \mathbb{R}^{d \times n}, where each column x_i \in \mathbb{R}^d is a data point and each row is a feature or dimension, the subset-based representation can be formulated so that X \approx R = D\tilde{B}, where R \in \mathbb{R}^{d \times n} is the representation (or reconstruction), D \in \mathbb{R}^{d \times m} is a subset of the data, and \tilde{B} \in \mathbb{R}^{m \times n} holds the coefficients (or weights) used to combine the subset into the approximate reconstructions. To find the most representative subset would mean simply minimizing the difference between the representation R and the input data X, or

    \min_{\tilde{B}, D \subset A} E(\tilde{B}, D),    (1)

where

    E(\tilde{B}, D) = \sum_i \| x_i - D \tilde{b}_i \|_2^2    (2)
                    = \| X - D \tilde{B} \|_F^2,    (3)

in which we have used D \subset A to denote that D is a column subset of A.
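As a concrete illustration of the objective in Eqs. (1)-(3), the following minimal Python sketch (not the authors' code) measures how well a candidate column subset reconstructs the full data matrix; the non-negative variant anticipates Eq. (4) below. The function name and the use of SciPy's nnls routine are illustrative assumptions.

    # Minimal sketch of Eqs. (1)-(3): given a candidate subset of columns of X,
    # fit weights and measure the squared Frobenius reconstruction error.
    import numpy as np
    from scipy.optimize import nnls

    def subset_reconstruction_error(X, subset, nonnegative=True):
        """X: (d, n) matrix whose columns are data points; subset: column indices."""
        D = X[:, subset]                         # d x m subset matrix
        B = np.zeros((len(subset), X.shape[1]))  # m x n weight matrix
        for i in range(X.shape[1]):
            if nonnegative:
                B[:, i], _ = nnls(D, X[:, i])    # min ||D b - x_i||_2 s.t. b >= 0
            else:
                B[:, i] = np.linalg.lstsq(D, X[:, i], rcond=None)[0]
        return np.linalg.norm(X - D @ B, 'fro') ** 2, B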
In considering our specific STO problem, we found that domain experts are reliant on certain visual elements of the resulting designs, which we refer to as human-identifiable features, or just features. For STO designs these include, for example, the number of internal holes, the number of thick beam structures, thin beam structures, how latticed the design is, etc. Consider some of the designs shown in Figure 4 (bottom two rows). These features are particularly important when making connections between parameters and designs. This idea of human-identifiable features is more general than just STO designs and can change, as appropriate to the domain. As an intuitive example, we might consider the set of human faces that have features such as hair length, presence of facial hair, a hat, earrings, etc., as seen in the sketched human faces (from [35]) in Figure 4 (top). These facial features are groups of data values (pixels) that have a correlated behavior across multiple images. For instance, longer hair is seen as sets of values/pixels on either side of the head that have dark values. Being able to identify the subset with the most representative features reduces computationally to finding the subset whose features (groups of pixels) maximally correlate with the rest of the dataset, which matches well with our formulation above.

Fig. 4. Representative subsets from face-sketch, STO D1, and STO D2. Top: representative subset from a face-sketch dataset [35]. Middle: from the STO D1 design dataset with varying parameters and constant initialization. Bottom: from the STO D2 design dataset with constant parameters and varying initializations.

In the estimation and machine learning literature, researchers have observed that non-negative matrix factorization (NMF) has the property of extracting coherent features [36], by insisting that feature-capturing basis elements approximate the data by combining in a constructive manner. In other words, the non-negative constraint on weights tends to discourage the representation of data features through a negative-positive combination of large numbers of basis elements (as with PCA) and requires features (correlated values) to be represented explicitly in the basis elements. This effect can be seen in well-known negative-positive weighted decompositions such as Eigenfaces [37], which have face-like shapes but are difficult to interpret as clear facial features. Additionally, we will describe in our expert case study in Section 5.2.4 experiences where non-negative weights were easier for the participant to process mentally in terms of interpreting the combination of examples from the subset, reinforcing the importance of a non-negative weighting for relevant tasks. With these details in mind, we further refine our goal of finding a representative subset which captures the most representative features by constraining the reconstruction weights, \tilde{B}, to be non-negative. Thus, the non-negative formulation of the subset problem is:

    \min_{\tilde{B} \ge 0, D \subset A} E(\tilde{B}, D).    (4)

4.2 Simultaneous subset and coefficient computation

We now propose an approach for solving the optimization problem presented in the previous section. We can rewrite the objective function as

    E(\tilde{B}, D) = \| X - D \tilde{B} \|_F^2    (5)
                    = \| X - X B \|_F^2 = \hat{E}(B, m),    (6)

where B \in \mathbb{R}^{n \times n}, b_{i,:} = \tilde{b}_{j,:}, and \tilde{b}_{j,:} is the j-th row corresponding to the j-th data element in D, the same as the i-th data element in X. The smaller matrix \tilde{B} can be thought of as a submatrix of the larger, sparser B, where m specific (nonzero) rows are preserved. Instead of building B from \tilde{B}, we are interested in solving for B directly (and therefore obtaining the embedded \tilde{B}). The absolute optimum of the objective \| X - X B \|_F^2 would result in B = I. However, if we constrain or regularize B so that it is row-sparse and non-negative, as described above, we can recover an optimal B with the desired properties. The regularization approach leads to this formulation for subset selection,

    B^* = \arg\min_{B \ge 0} \| X - X B \|_F^2 + \lambda \| B \|_{2,1},    (7)

where \| B \|_{2,1} = \sum_i \| B_{i,:} \|_2 is a mixed \ell_{2,1} norm that induces row sparsity, so that only a subset of the rows has nonzero entries, and B \ge 0 means element-wise non-negativity (b_{i,j} \ge 0 for all i, j). The resulting B matrix encodes both the subset selection (corresponding to the nonzero rows) and the weights connecting the rest of the dataset to the chosen subset (the values of the matrix). This is a convenient result, as both the subset selection and the weight computation occur simultaneously. This problem is known as a grouped lasso problem, and many solvers exist for efficiently estimating the matrix B [38], [39], [40]. We have found that the choice of method does not affect the resulting subset significantly, and have settled on using the fast grouped orthogonal matching pursuit (GOMP) solver [41], which was used in all results presented in this work.
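The following is a simplified greedy stand-in for the row-sparse, non-negative selection problem in Eq. (7); it is not the GOMP solver of [41], only a sketch of the idea of growing the subset one element at a time while fitting non-negative weights. All names are illustrative, and the brute-force candidate search is for exposition rather than efficiency.

    # Simplified greedy sketch (not GOMP [41]): grow the subset one column at a
    # time, keeping the candidate that most reduces the non-negative
    # reconstruction error of the whole dataset.
    import numpy as np
    from scipy.optimize import nnls

    def nn_error(X, subset):
        # Non-negative reconstruction error of all columns of X from X[:, subset].
        D = X[:, subset]
        err = 0.0
        for i in range(X.shape[1]):
            b, _ = nnls(D, X[:, i])
            err += np.sum((X[:, i] - D @ b) ** 2)
        return err

    def greedy_nn_subset(X, m):
        subset = []
        for _ in range(m):
            remaining = [j for j in range(X.shape[1]) if j not in subset]
            best = min(remaining, key=lambda j: nn_error(X, subset + [j]))
            subset.append(best)
        return subset

    # Usage sketch: pick 8 representative designs from an ensemble X (d x n).
    # subset = greedy_nn_subset(X, m=8)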
4.3 Subset size

One important parameter to consider is the size of the subset. This parameter presents an important tradeoff. A larger subset will provide the opportunity to represent more unique features in the dataset, but a large enough subset could become cluttered, redundant, or even confusing. However, we have found that the screen space required for the proposed layout in Figure 3 is ultimately the limiting factor in the subset size, and so our recommendation is to use as many as the screen layout will allow, within reason. Another factor, which we have found useful in the decision, is to measure reconstruction error. For some datasets, viewing the reconstruction error as a function of subset size can be helpful in determining a good tradeoff. In all of the experiments here, we use a subset of size 8, because we found this was appropriate for the screen displays used in our experiments and found it sufficiently large to capture the most important variations in the dataset.

4.4 Visual encoding

The method above places each design as a coordinate point in a medium-dimensional (e.g., d = 8) space that is optimized to be representative and capture significant features. We have investigated a variety of methods for visually encoding the resulting representative subset coordinates, including scatter plot matrices, parallel coordinates, and star coordinates. Because of the non-negative combinations, the star-coordinate plot was promising. However, the documented ambiguities in star-coordinate visualizations prevented sound exploration of the data and pattern discovery [34], [42]. The scatter plot matrix was difficult to use because of the large number of zero coefficients in the non-negative case, resulting in individual scatter plots that had points mainly clustered near the origin, the properties of which are unclear but are an area of ongoing investigation. Consequently, for this work we found the parallel coordinate visualization of the subset coordinates to be more effective, because it makes the coefficient relationships more visually apparent, better disambiguates data points, and, with the addition of interactive selection and filtering, makes it easier to find important relationships among data points. These motivations combined result in the proposed visualization shown in Figure 3.
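As a rough illustration of the parallel-coordinate encoding described in Section 4.4, the sketch below (assuming matplotlib; the weight matrix `B`, the exemplar `labels`, the score-based coloring, and the viridis colormap are all illustrative assumptions, not the authors' tool) draws each data point as a polyline over the m exemplar axes.

    # Minimal sketch: parallel-coordinate view of the (m, n) non-negative
    # weight matrix B, one axis per chosen exemplar, one polyline per design.
    import numpy as np
    import matplotlib.pyplot as plt

    def parallel_coordinates(B, labels, color_values=None):
        m, n = B.shape
        x = np.arange(m)
        if color_values is None:
            colors = ['gray'] * n
        else:
            cv = np.asarray(color_values, dtype=float)
            cv = (cv - cv.min()) / (np.ptp(cv) + 1e-12)
            colors = plt.cm.viridis(cv)            # e.g., color by average stress
        fig, ax = plt.subplots(figsize=(8, 3))
        for i in range(n):
            ax.plot(x, B[:, i], color=colors[i], alpha=0.3, linewidth=0.8)
        ax.set_xticks(x)
        ax.set_xticklabels(labels, rotation=45, ha='right')
        ax.set_ylabel('reconstruction weight')
        return fig, ax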
4.5 Linked views

As described previously, the topology optimization designs are useful for the STO task only when visualized coincident with the parameters that gave rise to each design and the corresponding scores related to expected functionality (e.g., maximum stress). Therefore, we augment the subset-based parallel coordinates view with a scatter plot showing combinations of the various parameters and scores. This additional view is then linked to the parallel coordinates view by using the same coloring, point selection, and filtering effects in both views. We include two types of linked scatter plots. The first is a detailed view of the two selected axes from the parallel plots that we call the exemplar scatter plot. The second is a scatter plot that can use several options chosen by the user, from a t-SNE or Isomap embedding of the dataset to a scatter plot of specific parameters (angle, position, filter size) or score values (max stress, average stress, compliance). This additional visualization is a traditional scatter plot of parameter data, augmenting the subset display, and therefore the concerns about traditional DR techniques are not directly relevant. For example, some DR methods make cluster identification simple, and connecting that view to the proposed subset-based view would allow for fast interrogation of cluster attributes by visualization of the subset representations. Additionally, we include a linked detail view that allows the user to view a specific data point and related details such as weight values. Figure 5 shows a few samples of each of these linked views, and Figure 1 shows a sample of them combined as a single visualization tool.

Fig. 5. Linked scatter plot and detail view. Top: the exemplar scatter plot allowing a detail view of two of the parallel coordinate axes, and the second scatter plot view that can be set to a scatter plot of parameters (shown) or an embedding like t-SNE, Isomap, etc. as needed. Middle: an example of the detail view of a single data point, including specific parameter values and the weights used in a linear reconstruction. Bottom: the colormap used in all examples, representing low to high values of the user-selected parameter. For this figure, points are colored according to the average stress of the corresponding design.

5 EVALUATION

We assessed the method in two parts. The first part, subset selection evaluation, focuses on evaluating the proposed subset selection method, using a quantitative analysis of feature coverage for each subset selection approach considered and a quantitative assessment of representation error in a combined subset and weight computation. The second part, full system evaluation, focuses on the overall visualization with a quantitative analysis of successful task completion for the domain experts and a qualitative assessment of usage from a set of domain expert case studies. In addition to the evaluation, we include a final clarifying example to show how the full system works on even very challenging datasets.

5.1 Subset selection evaluation

An important characteristic of the resulting subset is including examples of a broad range of user-observable features of the dataset elements. The importance of this feature coverage arises from the need to represent a variety of points outside the chosen subset that will each include a potentially different mixture of those features. Likewise, any features not represented in the subset by definition will not be available in the visualization. We consider our approach of comparing feature coverage a novel contribution, as we are unaware of previous work that has used feature coverage tests in this way to evaluate subset selection. This approach to measuring the level of coverage of the observable features allows a more direct, objective view of how well the subset represents what a user would see had they looked at all the elements of the dataset one by one. The novelty is most easily seen in image-based datasets where observable features are difficult to extract automatically from the data elements. Specifically, we select the subset using only the data elements, ignoring any manually generated labels that would not be available in a completely new, unlabeled dataset that we would want to process. After selecting the subset, we use human-labeled visual features to determine quantitatively how well the subset covers the observable features: if the subset has a high coverage percentage, then it represents the dataset well, and if it has a low coverage percentage, then it does not represent the full dataset well. This comparison approach provides a strong indication of the quality of the subset from the point of view of representing the observable features in the full dataset.

5.1.1 Methods

We compared the proposed subset selection method, solved using non-negative group orthogonal matching pursuit (GOMP-NN), with the deterministic form of the interpolative decomposition subset method (ID) [43], [44], the k-medoids method (KM) [45], and uniformly random subsets (RAND). The formulation of the proposed GOMP-NN approach was described previously in Section 4.2. ID is an appropriate alternative because it also simultaneously computes the weights, but it instead uses a pivoted-QR-type [44] subset selection, computing both positive and negative weights. KM (and the related k-means) is a very well-known approach to clustering and/or quantizing a set of data, which also identifies cluster centers. Because KM does not explicitly prescribe a weight computation, we will look at both non-negative (KM-NN) and positive-negative weight options (KM-PN). RAND is a common method to subsample a dataset in an unbiased way. RAND also does not prescribe a specific type of weighting, so we compare against both options (RAND-NN and RAND-PN). One critical difference between GOMP-NN and either KM-NN or RAND-NN is that the non-negativity also drives the subset selection in GOMP-NN. In contrast, for KM-NN or RAND-NN, the non-negative weights can only be computed after the subset selection. In terms of asymptotic runtime, for a dataset with n elements of dimension d, GOMP is O(mn^2 d), ID is O(mnd), KM is O(mndt), and RAND is O(1), where t is the number of iterations to convergence for KM. For datasets where n is the largest dimension, GOMP is the most expensive, followed by KM, and then ID. RAND is the fastest with a constant cost. However, the STO design datasets have a data dimension much larger than the number of points (d > n). Because of the high cost of solving for a single design, this situation will be common for general STO design ensembles. As the dimension of the design space increases, the cost of each solution will also increase, limiting the number of samples that can be solved. This indicates that asymptotic cost in n will become less of an issue as larger systems are solved. From this vantage point, the runtime differences between GOMP, ID, and KM become less pronounced. KM is the only non-greedy method in terms of subset selection. For example, choosing m versus m + 1 subset elements can result in two different subset choices, while for both GOMP and ID the m + 1 subset is the same as the m subset except for the additional subset element found in the latest round. This is important when evaluating different size subsets, because as the subset grows ID and GOMP both only require an additional sample to be found, whereas for KM the entire subset must be recomputed.
Overall, for the STO design datasets we have not found a significant difference in runtime cost among the methods (excepting RAND), and so limit the runtime discussion to this summary.

5.1.2 Datasets

We include an evaluation of subset method results on two datasets. One, CelebA, is a database of celebrity face images used in [46]. This dataset is of interest for the current problem because of the intuitive nature of the dataset type (everyone understands common face features). Also, this specific dataset contains a set of 40 human-derived facial features amenable to the comparison, such as "mustache", "eye glasses", or "pointy nose", etc. Evaluating this dataset also supports the claim that the proposed technique is more general than only STO design datasets. The original dataset consists of 2 x 10^5 color face images of size 178 x 218. Because this dataset is not the primary focus of the paper, we subsampled it both spatially (downsampled by 4) and randomly in the number of images (to 2 x 10^4). As the STO datasets are much smaller than this downsampled version, we consider this compromise acceptable. The other, STO D1, is an STO design dataset, generated by collaborators/coauthors, with a design space of 40 x 80 pixels and 10^3 samples, where the structure of interest is attached to the left edge of the domain, and a point force is applied at some position p in [-20, 20] and angle θ in [0, π] on the right edge. The optimization is further constrained by a filter size ρ in [1.1, 2.5] that controls how thin/thick structures are allowed in the solution (see Figure 2). A Monte Carlo sampling among these three parameters was used (10^3 samples), with the same constant initialization. Because this dataset did not have a predefined set of human-defined features like CelebA, we generated and evaluated them within a small-scale user study. A sample of the dataset is shown in the middle portion of Figure 4.

5.1.3 User-based evaluation of STO designs

To generate the human-defined features, we recruited one STO expert and eight different nonexpert graduate students. We asked the STO expert to specify a set of physically relevant features to describe the set of designs in a way that a nonexpert could understand, including things such as the number of internal holes, the number of external holes, the thickness of the components, etc. We then asked three of the nonexpert participants to generate a set of visual features from their own assessment of the dataset. The expert and nonexperts each generated the feature set while being shown five random subsets of size 25 (125 total samples) from the dataset. One of the resulting nonexpert lists was not a binary feature set (as per the instructions) and had to be thrown out. After verifying that both of the other nonexpert lists were binary feature sets (they could be evaluated with an approximate "yes" or "no" answer for any example), we randomly chose one of the remaining two nonexpert lists for the remainder of the study, so that comparisons could be made across participants. We then asked five of the nonexperts to assess which of the subsets contained at least one instance of each feature in the expert list, and four to do the same assessment but with the nonexpert list. Each nonexpert was completely blind to which subset they were assessing, and the subsets were shown in a different order to each person. The assessed subsets were generated by GOMP-NN, ID, KM, and five different subsets from RAND, resulting in eight total subsets being assessed by each participant for each list. To compare two subsets, we simply count how many features from each list were present in the subset; a subset which captures more of the features will better represent the remaining elements of the dataset.
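A minimal sketch of this feature-coverage count (assuming a binary, human-labeled feature matrix; names and the matrix layout are illustrative, not the authors' code):

    # `features` is an (n, f) binary matrix of human-labeled features
    # (features[i, j] = 1 if design i exhibits feature j); a feature is
    # covered if at least one subset member exhibits it.
    import numpy as np

    def feature_coverage(features, subset):
        covered = features[subset].any(axis=0)   # (f,) boolean vector
        return int(covered.sum())

    # Example: compare two candidate subsets of size 8 on the same label matrix.
    # feature_coverage(features, subset_gomp) vs. feature_coverage(features, subset_rand)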
5.1.4 Evaluation

We evaluate the representation of each subset by reconstruction error and by human-defined feature set coverage (count).

CelebA faces dataset. In terms of reconstruction error using non-negative weights, for the subset sizes evaluated, GOMP-NN was always shown to be more representative than RAND-NN and KM-NN, while KM-NN did better than random for some subset sizes. For positive-negative weights, ID-PN and KM-PN did better than random for many sizes, but not all. Better here means the score was better than the mean plus one standard deviation of the corresponding random subsets. These quantitative representation results are shown in the top of Figure 6. However, our comparison should also have some basis in how humans interpret the data. To establish this connection, we also consider the human-defined feature comparison. The results showed that GOMP-NN did better than random for all but two subset sizes, while among the others only KM really did better than random once, as shown in Figure 6. Note that different algorithms chose very different faces to get a similar number of features.

Fig. 6. CelebA reconstruction error and coverage tests. Top: the reconstruction error for each subset type of size 2-12. Those using non-negative (NN) weights are shown with a solid plot line; positive-negative (PN) weighted reconstructions are shown with a dashed plot line. Bottom: the feature count found (out of 80) in each subset type. The random mean and standard deviation shown were computed over 10 and 10^2 samples, respectively, per measure.

Topology optimization designs. The above human-face analysis showed that the proposed approach finds a representative subset in both the reconstruction error sense and the more qualitative human-defined-feature sense. For the specific case of STO designs, we performed a similar comparison. However, as described in Section 5.1.3, we created the feature list and feature assessment through a user-based evaluation strategy. This limited our evaluation to a single subset size, which we chose to be 8 because that was the maximum number of subsamples we could show on the screen without the subsample resolution becoming too low in our visual layout. We will now summarize three types of results: a projection error comparison, an expert-feature-list-based assessment, and a nonexpert-feature-list-based assessment. In the projection-error-based assessment we found that GOMP-NN was the most representative of the NN methods, while KM-PN was the most representative of the PN methods. In both cases all methods compared were more representative than the average random subset error. These results are summarized in Table 1.

TABLE 1
Shape reconstruction error test results for STO D1. The reconstruction error for each subset type of size 8. Those using non-negative (NN) weights are shown in the top part; positive-negative (PN) weighted reconstructions are shown in the lower part. The random mean and standard deviation were computed over 10^2 samples.
Method     Reconstruction error    Better than random?
GOMP-NN    628.51                  Y
KM-NN      850.44                  Y
RAND-NN    893.55 (35.39)          N/A
ID         585.46                  Y
KM-PN      537.59                  Y
RAND-PN    609.33 (26.52)          N/A

We also performed a similar comparison for a qualitative assessment of human-derived features. Rather than tag these features in all the designs, we had the participants assess the number of features present in each subset directly. This was a more reasonable amount of work for the number of participants we had, but it meant we were limited in the number of random subsets we could evaluate. Even with the limited samples, we were able to show that for the expert feature list four of the five participants found that at least one of the nonrandom methods did better than random. Three of the five showed that both GOMP-NN and ID were better than RAND, while only one of the five showed KM-NN was better than RAND. Overall we interpret these results, summarized in Table 2, as support for the methods specifically designed to select a representative set.

TABLE 2
Shape feature coverage test results using the expert list for STO D1. The feature count found (out of 45) from the expert feature list for each subset type of size 8, and whether this is better than random. Here, better means more than the mean plus one standard deviation. The random mean and standard deviation were computed over five samples given to the same participant.
User   GOMP-NN   KM-NN   ID   RAND         GOMP?   KM?   ID?   Any?
1      30        30      31   27.0 (4.1)   N       N     N     N
2      23        23      26   23.6 (2.3)   N       N     Y     Y
3      27        23      26   24.6 (1.9)   Y       N     N     Y
4      17        17      17   14.4 (2.5)   Y       Y     Y     Y
5      25        23      24   21.8 (1.6)   Y       N     Y     Y

For the nonexpert feature list, three out of four participants found that an optimal method performed better than RAND: specifically, three out of four for GOMP-NN, two out of four for ID, and one out of four for KM-NN. While these results make little distinction among the types of subset methods, overall we take away that a method specifically designed to find a representative subset does in fact do better than RAND at feature representation. These results are summarized in Table 3.

TABLE 3
Shape feature coverage test results using the nonexpert list for STO D1. The feature count found (out of 17) from the nonexpert feature list for each subset type of size 8, and whether this is better than random. The random mean and standard deviation were computed over five samples given to the same participant.
User   GOMP-NN   KM-NN   ID   RAND         GOMP?   KM?   ID?   Any?
1      11        12      12   11.2 (0.8)   N       N     N     N
2      14        12      13   12.4 (1.5)   Y       N     N     Y
3      13        7       11   7.6 (0.9)    Y       N     Y     Y
4      15        13      15   11.0 (1.0)   Y       Y     Y     Y
5.2 Full system evaluation

Here we present an analysis of the full system from an STO domain expert point of view. For this analysis we had two domain experts (both coauthors) sit down and critically assess whether the tool successfully aids in building intuition around parameters, shapes, and quantitative design scores. While the domain experts are coauthors, we purposely had both play no role in the visualization development itself, limited mainly to helping the other authors understand what aspects of the data are important to explore and suggesting some visual interactions that could be helpful.

5.2.1 Methods

We use the same set of subset-selection and weight-computation methods explored in Section 5.1. However, because manual visual assessment is more limiting in time than the quantitative assessments done earlier, we limited the comparison to GOMP-NN, KM-NN, and ID. Because prior work uses PCA to analyze these types of datasets, we also included a PCA version of the parallel coordinates view with the appropriate positive-negative weighting, referred to as PCA, shown in Figure 7. This provided four different axes and weighting choices, two with non-negative weighting and two with positive-negative weighting.

Fig. 7. Alternative non-negative weighted subset views. Left (PCA as basis): the same visualization approach but using a PCA basis for the axes. Users found this approach very confusing, both for the axes representation and for attempting to combine them mentally using positive and negative weights. Right (ID as basis): subset-based visualization using ID with its positive-negative weights. Users also found this combination confusing because it was difficult to manually subtract one image from another.

5.2.2 Datasets

The same parameter-varying STO design dataset, STO D1 (described in Section 5.1.2), was used for this qualitative study. We also included a second STO design dataset, STO D2, where the three parameters were fixed to p = 0, θ = π/4, and ρ = 1.1, respectively, but the initial configuration was initialized randomly, resulting in 1000 different designs. Because of the nonconvex nature of the STO problem, the resulting ensemble had a large variety of latticed designs, although the total design variance was not as great as in STO D1. The domain experts also had much less initial intuition about this dataset. A sample from both datasets is shown in Figure 4.

5.2.3 Evaluation

Before viewing the final visualization tool, we asked the two domain experts to report any relationships they were already aware of between parameters, designs, and scores. We then translated those into a set of tasks: confirm each relationship known beforehand. The number of these prior known relationships that can be confirmed by each version of the tool is used as a quantitative measure for comparison. During the case studies, the experts were unaware of which type of subset-weight combination was shown (labeled A, B, C), except for PCA because of the obvious nature of the axes figures. In addition to evaluating how many prior relationships were confirmed, we also asked the experts to explore the data with the tool and report any additional relationships or other insights gained during its use, which we report as well.

Task completion assessment. Most of the relationships the domain experts were aware of before the visualization focus session involved simple parameter-to-shape relationships, such as position and contact point, angle and structure composition (lattice vs. beam), and filter size and beam thickness. There was one relationship between shape and scoring: having many holes should lead to high stress values. Finally, there was one relationship among scores, specifically that stress and compliance should correlate well. One detail to note is that there was almost no prior intuition about the second dataset with fixed parameters and random initializations. We summarize the relationships identified before using the visualization technique, along with assigned numbers for easy reference, in Table 4 (top portion). While exploring the datasets the experts arrived at several unknown, or at least unstated, relationships between the parameters and scores, such as position and compliance, and angle and compliance.
They also confirmed some unstated assumptions, such as that all designs are variations between purely beam and purely lattice designs. The experts were also able to gain some intuition about the various shapes that arose in the second dataset. Specific relationships found in the study are shown in the lower portion of Table 4, with a number and asterisk to distinguish them from the first group. After allowing for exploration, we tasked the experts with confirming each of the relationships they had identified before using the tool. Using the complete tool, the experts were able to confirm all prior relationships quite quickly with any of the basis types. Because of the multiple components in the visualization, we next tried to understand at a finer granularity how each piece aided in the confirmation process by removing some components. One particular modification that revealed a more informative comparison was to remove the detail plot entirely, so that the expert had to rely entirely on the parallel coordinate plot both to summarize the dataset designs and to tease out specific details of the designs and how they relate to the parameters and scores. This change resulted in different performance among the methods. Specifically, the experts found it very difficult to confirm all but one of the known relationships using the PCA basis or the ID basis.

Fig. 8. Beam reconstruction using PCA compared to a subset reconstruction. Top: A straight beam design and how it is represented using PCA modes. Note that none of the basis elements looks like a beam, but by combining them in positive and negative fashion the original beam can be reconstructed quite accurately. Bottom: The same straight beam represented by a subset. Because some of the basis (subset) elements do look like a beam, it is quite easy to mentally reconstruct what the data element would look like if you did not have the design itself to view (perhaps due to space). Note that a similar problem would occur with subset visualization if the subset did not contain a beam structure, which is another motivation for using the proposed subset selection technique: by definition it is the optimal subset for reconstruction using non-negative weights.

When asked why it was difficult, they were quite critical of the PCA basis set and weights, as it was "impossible to combine the bases, especially with the positive-negative weights". In essence, the experts found it very difficult to mentally combine the PCA basis to understand any of the reconstructions. One particularly illuminating example is that none of the PCA bases showed an example of a beam, because all beam representations were recreated by subtracting lattice-looking bases from each other; an example is shown in Figure 8. The ID subset visualization was also difficult for the experts to use in confirming relationships. Even though the axes were a subset of the designs, the positive-negative weights made the mental combination of examples difficult. This confirms our hypothesis that constructive combinations of features allow for easier interpretation of data represented as mixtures of basis elements. In contrast, the experts were able to confirm most of the prior relationships using the KM-NN and GOMP-NN bases.
Consider the example comparison in Figure 8, where mentally extracting a beam design from the PCA basis is quite difficult, in contrast to the very obvious beam structure indicated by the subset visualization example. For some of the relationships, the experts claimed only a partial confirmation because they wanted to view more samples of some of the more detailed aspects of the dataset than were available. That experience outlines one limitation of the approach: while the subset-based view provides a good overview of the dataset with feature details, it is difficult to extrapolate to feature details not represented in the subset. Most of this discussion has centered around the first of the two STO datasets, but results were also investigated for the second dataset. Most of the prior assumptions were not applicable to the second dataset because the parameters were fixed, but for the two that did apply we observed similar results. The precise results are summarized in Table 5.

TABLE 4
Relationships reported by domain experts. The relationships reported before viewing the visualization tool are summarized in the top portion of the table with assigned numbers. The relationships reported (discovered) while using the tool are listed in the lower half of the table (numbers with asterisks).
relationship  description
1    position and right-side contact point
2    angle horizontal, then structure horizontal
3    low angle and varying positions, then shallow-angle straight structure
4    filter size controls structure thickness
5    more holes in structure, then higher stress
6    compliance and stress are correlated
7*   symmetric-quadratic relationship between position and compliance/stress
8*   position choice induces an upper limit on compliance (lowest at 0)
9*   negative correlation between angle and compliance
10*  all of the designs are variations of a beam or lattice design (confirmation)

TABLE 5
Prior known relationship assessments by domain experts. Verification of the known relationships described in Table 4 using various versions of the visualization tool (by changing axis and visualization components). Detail indicates use of the visualization with the detail view; otherwise the detail view was excluded. Results are shown for the first (STO D1) and second (STO D2) datasets. Here Yes (Y) means confirmed, No (N) means cannot confirm, and Partial (P) means confirmed but wanting more samples to be completely sure.
relationship  Detail(D1)  GOMP-NN(D1)  ID(D1)  KM-NN(D1)  PCA(D1)  Detail(D2)  GOMP-NN(D2)  ID(D2)  KM-NN(D2)  PCA(D2)
1             Y           Y            N       P          N        -           -            -       -          -
2             Y           P            N       P          N        -           -            -       -          -
3             Y           Y            N       P          N        -           -            -       -          -
4             Y           P            N       N          N        -           -            -       -          -
5             Y           P            N       P          N        Y           P            N       P          N
6             Y           Y            Y       Y          Y        Y           Y            Y       Y          Y

5.2.4 Case studies
We now present several case studies from the domain expert session. These case studies detail some of the uses we found for the tool and identify some of the specific strengths and limitations of the proposed subset-based visualization.

Subset-based parallel coordinate view as surrogate to the full dataset. As part of the relationship confirmation task above, we forced the domain experts not to use the detail view (individual elements), and the resulting use of the visualization was quite illuminating. When limited to the subset-based parallel coordinate view, the experts selected points in the parameter scatter plot and then looked at which of the subset elements carried large weights. For the case of non-negative weighted views, they used those values to mentally consider what the element looks like and draw a conclusion.
This subset-based view usage is in contrast to the individual element view, where the experts selected points from the parameter scatter plot and looked at the corresponding element. To summarize, the subset-based view was being used as a surrogate for the full dataset in completing the confirmation tasks mentioned above. Note that in this same setting, when the subset-based view was changed to use positive-negative weights or the PCA basis elements, this summarizing ability of the parallel coordinate view was apparently lost. In considering this aspect of the subset-based parallel coordinate view, the domain experts commented that it "allows for a whole-view perspective of the data, instead of viewing the data elements individually".

Limitations of the subset-based view. One of the tasks was to confirm that the filter size influenced the size of the structures in the design (a smaller filter size results in smaller lattice structures and other features). However, because the subset contains only a limited number of the smaller lattice structure examples, the expert was unable to confirm the relationship fully, marking it as "partially" confirmed. Because the experts had previously confirmed the relationship using individual instances, they were expecting to see more examples of the relationship than were available. The experience highlights one limitation of the method: when a feature is not included in the subset, the surrogate nature of the view is not nearly as true to the underlying dataset. This point emphasizes that the choice of subset is crucial to success, and why we advocate the use of the GOMP-NN-based selection.

Confirmation of the design-shape aspect of parameter-score relationships. The domain experts were able to discover new relationships among parameters and scores, for example relationships 7*, 8*, and 9* in Table 4. These relationships have implications for the design shapes themselves, and in each case the experts were able to use the subset-based view to understand that relationship. For example, for relationship 9* they were able to show that the steeper angle resulted in a more latticed structure that had more holes and higher stress values.

Exemplar scatter plot as summary device. The exemplar scatter plot, although not used extensively by the experts, did become useful for understanding and confirming general trends in the dataset, as well as for understanding the parallel coordinates plot at times. For example, the domain experts were able to confirm that the STO designs can be largely grouped into two large sets: those that have a large central beam and those that are mainly lattice structure. This confirmation was accomplished by viewing the exemplar scatter plot with a beam structure on one axis and a purely lattice structure on the other axis.

Parameter scatter plot as driver for discovery. For the full tool, the parameter-based scatter plots (e.g., a scatter plot of angle and compliance), colored by a third variable such as maximum stress, proved to be the primary tool used to drive discovery. The basic workflow consists of the experts looking at relationships on that plot and then using the linked nature of the views to explore visual aspects of those relationships in the subset-based visualization and the detail view.

Limitations of the PCA- and ID-based views. The limitations of the PCA view became very evident in the domain expert study.
While using the subset view, they were able to confirm and gauge general visual trends in the data, but when using the PCA version of the tool the users became quite frustrated. Some of the phrases used to describe their reaction are: "I can't see any beams in the modes", "convoluted examples... look like photo negatives... or ghosts of the structures", and "the axis are no longer intuitive". The positive-negative weights also played a role in the confusion. This was discovered by using the ID-based subset view, which uses subsets but with positive-negative weights. While the axes were easier for the users to understand, trying to mentally subtract examples from each other was difficult to do. When the bases were PCA modes, the task became even more difficult.

5.2.5 Additional examples
To further discuss some of the points made above, and to further demonstrate how the system benefits from the use of a subset-based display of the designs, we now present some examples from the more challenging STO D2 dataset. These examples were not included in the case study but are presented here to further illuminate the points brought up previously. Note that the STO D2 dataset mentioned above, used as part of the expert case study, is more challenging to understand, even for an expert, because the only parameter that was changed was the initial material placement. In other words, the angle, position, and filter size parameters are all constant. This makes the dataset difficult to interpret because there is no longer a clear connection between parameter choice and the resulting STO design. However, we will show that the system is still quite useful for understanding the connection between attributes, such as stress, and the visual design. The subset-based summary with the STO D2 dataset loaded is shown in Figure 9. The display uses the GOMP-NN basis and weights, with a high-stress example selected and highlighted in the left figure and a low-stress example highlighted in the right figure. Because the subset-based view captures and shows a subset of the dataset, which in this case exposes the differing lattice structure, it becomes quite clear from the visualization that the more lattice structure there is in the design, the higher the stress becomes. Now contrast that very clear interpretation with what we see in the PCA-based view shown in Figure 10. Again we show the same high-stress example in the left part of Figure 10 and the same low-stress example in the right part. Now, because of the choice of PCA basis, it is quite difficult to understand what relationship, if any, can be seen using this view of the data. This example reinforces what we observed on other datasets, such as STO D1, CelebA, and the sketch face database: in terms of interpretability, a subset-based view provides superior summarization to PCA and related basis choices.

6 DISCUSSION
Strengths. The representative subset view appears to be an appropriate technique for summarizing a dataset of visual features, and specifically for STO designs. By linking the subset-based view to the STO parameters through coloring and interactive selection, the visualization allows for a parameter-driven exploration of the dataset while the subset view provides the needed dataset visualization, avoiding the need to display all individual designs for most tasks.
The method also appears suitable for more general datasets with visual features, and potentially for nonvisual features given appropriate linked views.

Limitations. Extrapolation to features not present in the subset is difficult, which can limit the number of detailed relationships that can be found using the subset view. Additionally, our proposed method is limited to design spaces where linear combinations work. While we have found this to work for STO designs, some design spaces will require a different approach. In terms of our assessment, we limited ourselves (on purpose, for matters of scope) to 2D design spaces, but in the future it would be interesting to consider higher dimensional design spaces. While we explored several aspects of the method, our assessment was primarily based around the STO design dataset and the CelebA image database. In the future we would like to explore how this approach benefits other data types, such as text and other nonspatial datasets.

Fig. 9. GOMP-NN view of the more challenging STO D2 dataset (high-stress example selected on the left; low-stress example selected on the right). This example shows how visual features can be directly connected to low and high stress using a subsample-based view (more lattice structure correlates with high stress).

Fig. 10. PCA view of the more challenging STO D2 dataset (high-stress example selected on the left; low-stress example selected on the right). This example shows how difficult it is to find a connection between low and high stress and the design features using a PCA-based view.

Conclusions. The proposed visualization method has significant advantages over traditional dimension-reduction-based techniques in terms of allowing a domain expert to extrapolate and infer details of the dataset from a small, carefully chosen subset. For STO design this specifically allows connecting parameters and score values back to visual design features in many cases. The suggested optimally representative GOMP-NN-based subset selection has been shown to be the most representative in terms of reconstruction error for non-negative weighting and, in a more qualitative sense, more representative of human-determined features than a random subset. Finally, expert case studies have identified some of the strengths and limitations of the proposed approach.

ACKNOWLEDGMENTS
D. Perry and V. Keshavarzzadeh acknowledge that their part of this research was sponsored by ARL under Cooperative Agreement Number W911NF-12-2-0023. S. Elhabian, R. M. Kirby, and R. Whitaker acknowledge support from DARPA TRADES HR0011-17-2-0016. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of ARL, DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
CHAPTER 3
NYSTRÖM SKETCHES
This chapter is a reprint of: Springer Lecture Notes in Computer Science, Nyström sketches, v. 10534, 2017, pp. 401-416, Daniel J. Perry, Braxton Osting, and Ross T. Whitaker, copyright Springer International Publishing AG 2017, with permission of Springer, and originally presented at the 2017 Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD) in September 2017.

Nyström sketches
Daniel J. Perry, Braxton Osting, and Ross T. Whitaker
University of Utah, USA, {dperry@cs,osting@math,whitaker@cs}.utah.edu

Abstract. Despite prolific success, kernel methods become difficult to use in many large-scale unsupervised problems because of the evaluation and storage of the full Gram matrix. Here we overcome this difficulty by proposing a novel approach: compute the optimal small, out-of-sample Nyström sketch, which allows for fast approximation of the Gram matrix via the Nyström method. We demonstrate and compare several methods for computing the optimal Nyström sketch and show how this approach outperforms previous state-of-the-art Nyström coreset methods of similar size. We further demonstrate how this method can be used in an online setting and explore a simple extension to make the method robust to outliers in the training data.
Keywords: Nyström approximation, kernel PCA, kernel preimage

1 Introduction
Kernel-based learning methods have gained popularity in the last few decades due to their simplicity and powerful capability. Specifically, for unsupervised problems kernel PCA [18], spectral clustering [12], and many other methods have been used successfully on a variety of problems; see, e.g., [25, 22, 19]. However, like other nonparametric learning models, their application to large data sets can be limited by the need to retain all the training data. One of the most successful approximation techniques, generally referred to as the Nyström approximation, was introduced in [26] and further analyzed in [5], where a small subset of the data, which we will call a Nyström coreset, is used to approximate the full Gram matrix. The primary difficulty with Nyström approximation in an unsupervised setting is selecting the right coreset to populate the dictionary: if the coreset is too large, unneeded computational cost is incurred, and if the coreset does not contain a representative set, then too much error is incurred.
While there has been significant work in the area of coreset selection (see, e.g., [3, 1, 2, 13, 20, 24, 21]), these methods all select a subset of the training data for use in the Nyström approximation. This is counterintuitive because it is well known [14, 7] that the optimal basis on which to project data in an unsupervised setting is the PCA basis, which for clarity we will refer to as a sketch instead of a coreset, because the summary basis is a derivation (via SVD) rather than a subset of the input data. We will show that, for very small element sizes, a Nyström sketch obtained by solving a particular optimization problem (similar to the kPCA optimization problem) is better able to describe the data (has smaller projection error) than a Nyström coreset of similar size. In applications, one frequently needs to project a dataset onto the coreset, which is a computationally intensive task, and a smaller coreset/sketch (with similar projection error) reduces these computational costs.

More recently, there has been significant effort to develop techniques that make use of random projections to estimate the lifting function directly [15, 16, 9, 11]. These methods can be thought of as a type of random sketch for kernel approximation, where instead of storing a coreset of the data from the input set, a set of random directions is retained for computing the nonlinear lifting. While theoretically interesting and demonstrably useful, there has also been work exhibiting significant advantages of the Nyström coreset approach [29], leading to the conclusion that, with regard to projection error, sampling from the dataset itself will always yield specific advantages over random projections. Here we demonstrate that, with the same size dictionary, the proposed Nyström sketch obtains superior results to a random-projection-based kernel approximation.

In this paper, we propose a novel approach to kernel approximation that uses a Nyström sketch, similar to a PCA basis, instead of a Nyström coreset [5] or a random projection approximation [15, 9]. By formulating this Nyström sketch as an optimization problem, we also incorporate well-known optimality ideas from the immensely successful PCA basis selection.

2 Background
While Nyström approximation of the Gram matrix is generally useful in both supervised and unsupervised settings, in order to focus the discussion we will primarily be investigating how different methods compare in estimating kernel PCA (kPCA), and our primary measure of accuracy will be the projection error resulting from the different approximations to kPCA.

2.1 Nyström coreset and sketch
Because there is some ambiguity around the terms matrix coreset and matrix sketch, and to clarify discussion within this paper, we now introduce stricter definitions of coreset and sketch with respect to a given data matrix $X \in \mathbb{R}^{d \times n}$, $X = [x_1, \ldots, x_n]$, for the purposes of Nyström approximation.

Definition 1. A Nyström coreset is a subset matrix $R \in \mathbb{R}^{d \times m}$, $R = [r_1, \ldots, r_m]$, where each of the columns $r_i$ has a corresponding column in the input dataset, $r_i = x_j$, so that $R = XQ$, where $Q \in \mathbb{R}^{n \times m}$ is a column sampling matrix chosen for use in a Nyström approximation.

Definition 2. A Nyström sketch is a matrix $S \in \mathbb{R}^{d \times c}$, $S = [s_1, \ldots, s_c]$, where each of the columns is derived from the input dataset for purposes of Nyström approximation, but the columns do not necessarily need to come from $X$; in this sense the columns are "out of sample".
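As a small illustration of the distinction in Definitions 1 and 2, the sketch below builds a coreset by column sampling ($R = XQ$) and, in contrast, a sketch whose columns are derived from the data but need not appear in it. Using k-means centers for the out-of-sample columns is purely our own example of such a derivation, not the method proposed in this chapter.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
d, n, m = 5, 200, 4
X = rng.normal(size=(d, n))          # data matrix, columns are points

# Nystrom coreset (Definition 1): R = X Q for a column-sampling matrix Q.
cols = rng.choice(n, size=m, replace=False)
Q = np.zeros((n, m))
Q[cols, np.arange(m)] = 1.0
R_coreset = X @ Q                    # each column of R_coreset is a column of X

# Nystrom sketch (Definition 2): columns derived from X but "out of sample"
# (k-means centers are just one convenient example of such a derivation).
S_sketch = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X.T).cluster_centers_.T

print(R_coreset.shape, S_sketch.shape)   # both are d x m
```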
A Nyström sketch is a more general matrix, and so while a sketch could be a coreset (if the derived columns happened to correspond to columns in the dataset), a sketch is not necessarily a coreset.

2.2 PCA and kPCA
Linear PCA can be computed via a truncated SVD of $X = U \Sigma V^T$ at cost $O(n^3)$, or the top $p$ components can be computed using the Lanczos method (similar to power iteration) at cost $O(npz)$, where $z$ is the number of matrix multiplies required to compute one eigenvector and depends on the singular value gap. For computational efficiency the eigendecomposition is typically performed on the outer product or covariance matrix, $X X^T U = U \Sigma^2$, when $d \ll n$, or on the inner product or Gram matrix, $X^T X V = V \Sigma^2$, in situations where $d \gg n$. In either case, a subspace of dimension $p$ is selected based on the eigenvalues, by selecting the subspace spanned by the $p$ eigenvectors corresponding to the largest $p$ eigenvalues.

KPCA is a nonlinear extension of PCA, where the data elements are lifted into a higher dimensional feature space $\mathcal{H}$ using a nonlinear lifting function $\phi : \mathbb{R}^d \to \mathcal{H}$ prior to PCA, resulting in a data matrix in $\mathcal{H}$, $\Phi := \phi(X) = [\phi(x_1), \ldots, \phi(x_n)]$, and corresponding decomposition $\Phi = U \Sigma V^T$. The idea is that PCA in $\mathcal{H}$ will reveal nonlinear relationships in the lower dimensional input space $\mathbb{R}^d$. Because the dimension of $\mathcal{H}$ is potentially infinite, we compute the right singular vectors spanning $\mathrm{Im}(\Phi^T)$ using the Gram matrix eigendecomposition, $\Phi^T \Phi V = V \Sigma^2$. We recover the left singular vectors of $\Phi$ spanning $\mathrm{Im}(\Phi)$ as $U = \Phi V \Sigma^{-1}$. A projection onto the kPCA subspace can be computed using the reproducing kernel,
\[ y_i = \phi(x_i)^T U = \phi(x_i)^T \Phi V \Sigma^{-1} = k(x_i, X) V \Sigma^{-1}, \]
where the kernel function $k(a, b) = \phi(a)^T \phi(b)$ corresponds to an inner product in feature space. The lifting function $\phi(\cdot)$ and corresponding reproducing kernel $k(\cdot, \cdot)$ are typically chosen so that computing $k(a, b)$ is more efficient than the explicit expansion and evaluation of the inner product in $\mathcal{H}$.

2.3 kPCA projection error and Nyström approximation
For a collection of data points $\{x_i\}_{i=1}^n$, the mean projection error, $E^p_{kpca}$, for projection onto a $p$-dimensional PCA subspace is the mean of the squared lengths of the orthogonal projection residuals,
\[ E^p_{kpca} = \frac{1}{n} \sum_{i=1}^n \|(I - V_p V_p^T) \phi(x_i)\|^2 = \frac{1}{n} \|(I - V_p V_p^T) \Phi\|_F^2. \quad (1) \]
Here $\|\cdot\|_F$ denotes the Frobenius norm. The error can be computed using a kernel trick,
\[ E^p_{kpca} = \frac{1}{n} \sum_{i=1}^n \|(I - V_p V_p^T) \phi(x_i)\|^2 = \frac{1}{n} \sum_{i=1}^n k(x_i, x_i) - k(X, x_i)^T k(X, X)_p^{-1} k(X, x_i), \quad (2) \]
where $k(X, X)_p$ is the best rank-$p$ approximation of the matrix $k(X, X)$.

Consider a Nyström coreset specified by the matrix $R \in \mathbb{R}^{d \times m}$. Let $\Phi_R = \phi(R)$ have SVD $\Phi_R^T = U_R \Sigma_R V_R^T$, and let $k(R, R) = \Phi_R^T \Phi_R \in \mathbb{R}^{m \times m}$ be the Gram matrix of the dictionary elements. We define the mean projection error, $E_{kpca}(R)$, for projection onto $\Phi_R$ analogously to (1) and simplify this expression using the kernel trick from (2),
\[ E_{kpca}(R) := \frac{1}{n} \|(I - V_R V_R^T) \Phi\|^2 = \frac{1}{n} \sum_{i=1}^n k(x_i, x_i) - k(R, x_i)^T k(R, R)^{-1} k(R, x_i). \quad (3) \]
This approximation to $k(x_i, x_i)$ is known as the Nyström approximation, which we aim to optimize.
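Equation (3) is straightforward to evaluate once a kernel is fixed. The following sketch, which is our own illustration rather than the authors' code, computes the mean projection error of (3) for a Gaussian kernel; the kernel choice and the small ridge added before inversion are assumptions made for numerical stability.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for the columns of A (d x p) and B (d x q)."""
    sq = (np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-sq / (2.0 * sigma**2))

def projection_error(X, R, sigma, ridge=1e-10):
    """Mean projection error E_kpca(R) of Eq. (3), evaluated with the kernel trick."""
    K_RR = gaussian_kernel(R, R, sigma) + ridge * np.eye(R.shape[1])
    K_Rx = gaussian_kernel(R, X, sigma)             # m x n, column i is k(R, x_i)
    k_xx = np.ones(X.shape[1])                      # k(x, x) = 1 for the Gaussian kernel
    # Nystrom approximation of k(x_i, x_i): k(R, x_i)^T k(R, R)^{-1} k(R, x_i)
    approx = np.sum(K_Rx * np.linalg.solve(K_RR, K_Rx), axis=0)
    return float(np.mean(k_xx - approx))
```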
3 General formulation of the Nyström sketch learning problem
We now consider learning an optimal Nyström sketch with $m < n$ elements that describes the collection of $n$ data points $\{x_i\}_{i=1}^n \subset \mathbb{R}^d$. Consider a sketch specified by the matrix $R \in \mathbb{R}^{d \times m}$ with columns given by the $m$ sketch elements $\{r_j\}_{j=1}^m \subset \mathbb{R}^d$.

Definition 3 (Nyström sketch learning problem). Compute the Nyström sketch $R^\star$ by solving the optimization problem
\[ \min_{R \in \mathbb{R}^{d \times m}} E_{kpca}(R), \quad (4) \]
where the objective function is defined in (3).

We will show below that the solution to this Nyström sketch learning problem for a linear kernel is the PCA basis of dimension $m$. For any nonlinear feature space, the Nyström sketch learning problem essentially tracks the preimage of the kPCA basis as it is computed. We discuss later how this approach differs from precomputing the kPCA basis and then finding a preimage afterwards, and give some empirical evidence showing that the proposed sketch learning method performs much better in practice.

3.1 Optimal Nyström sketch is the PCA basis for the trivial lifting function
As an example, we consider the Nyström sketch learning problem (4) for the trivial lifting function, $\phi(x) = x$, with corresponding reproducing kernel $k(x, y) = x^T y$. In this case, the objective function for the Nyström sketch learning problem can be written
\[ E_{kpca}(R) = \frac{1}{n} \sum_{i=1}^n x_i^T x_i - x_i^T R (R^T R)^{-1} R^T x_i. \quad (5) \]
Assume $m < d$. We observe that minimizing $E_{kpca}$ over all $R \in \mathbb{R}^{d \times m}$ is equivalent to maximizing
\[ \sum_{i=1}^n x_i^T R (R^T R)^{-1} R^T x_i = \mathrm{tr}\left( X^T R (R^T R)^{-1} R^T X \right) = \langle X X^T,\, R (R^T R)^{-1} R^T \rangle_F, \]
where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product. The quantity $R (R^T R)^{-1} R^T$ is simply the projection onto the image of $R$. Since the optimal Nyström sketch clearly has rank $m$, we can let $\tilde{U} \in \mathbb{R}^{d \times m}$ have columns which form an orthonormal basis for the image of $R$. Thus, $R (R^T R)^{-1} R^T = \tilde{U} \tilde{U}^T$. It follows from von Neumann's trace inequality that $\langle X X^T, \tilde{U} \tilde{U}^T \rangle_F \le \sum_{i=1}^m \lambda_i(X X^T)$, with equality only when $\mathrm{span}(\tilde{U})$ is the span of the first $m$ left singular vectors of $X$. Thus the optimal Nyström sketch is any rank-$m$ matrix $R$ with image equal to the span of the first $m$ left singular vectors of $X$, i.e., the rank-$m$ PCA basis.
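This linear-kernel result is easy to verify numerically; the minimal check below is our own illustration (not from the paper) and compares $E_{kpca}(R)$ for the rank-$m$ PCA basis against random coresets.

```python
import numpy as np

def linear_projection_error(X, R):
    """E_kpca(R) of Eq. (5) for the linear kernel: mean squared residual of
    projecting the columns of X onto the image of R."""
    P = R @ np.linalg.solve(R.T @ R, R.T)                 # projection onto Im(R)
    return float(np.mean(np.sum((X - P @ X)**2, axis=0)))

rng = np.random.default_rng(0)
d, n, m = 10, 500, 3
X = rng.normal(size=(d, n)) * np.linspace(3.0, 0.1, d)[:, None]   # anisotropic data

U, _, _ = np.linalg.svd(X, full_matrices=False)
err_pca = linear_projection_error(X, U[:, :m])            # top-m PCA basis as the sketch

err_random = min(
    linear_projection_error(X, X[:, rng.choice(n, m, replace=False)])
    for _ in range(20)
)
print(err_pca <= err_random)   # the PCA basis attains the smallest error (True)
```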
4 Optimal Nyström sketch using non-linear least-squares methods
In this section we show how the Nyström sketch learning problem (4) can be formulated as a nonlinear least squares problem, for which a variety of optimization algorithms, such as Gauss-Newton or Levenberg-Marquardt, can be applied to find the optimal Nyström sketch, $R$ [27]. To apply such methods, we will require the gradient and Hessian of $E_{kpca}(R)$ with respect to the sketch, $R$. We assume that the sketch $R$ and kernel $k$ have been chosen such that $k(R, R)$ is an invertible matrix. We first compute
\[ \frac{\partial}{\partial R_{jk}} \, k(R, x)^T k(R, R)^{-1} k(R, x) = 2 \frac{\partial k(R, x)^T}{\partial R_{jk}} k(R, R)^{-1} k(R, x) + k(R, x)^T \frac{\partial k(R, R)^{-1}}{\partial R_{jk}} k(R, x). \]
To compute the second term, we use the identity
\[ \frac{\partial}{\partial R_{jk}} k(R, R)^{-1} = -k(R, R)^{-1} \frac{\partial k(R, R)}{\partial R_{jk}} k(R, R)^{-1}. \]
Thus we have
\[ \frac{\partial}{\partial R_{jk}} \, k(R, x)^T k(R, R)^{-1} k(R, x) = 2 \frac{\partial k(R, x)^T}{\partial R_{jk}} k(R, R)^{-1} k(R, x) - k(R, x)^T k(R, R)^{-1} \frac{\partial k(R, R)}{\partial R_{jk}} k(R, R)^{-1} k(R, x). \quad (6) \]
Now write the objective
\[ E_{kpca}(R) = \frac{1}{n} \sum_{i=1}^n \|(I - V_R V_R^T) \phi(x_i)\|^2 = \frac{1}{n} \sum_{i=1}^n \|r_i\|^2 = \frac{1}{n} \|\mathcal{R}\|^2, \]
where the $i$-th residual vector and the total residual vector are defined
\[ r_i = \phi(x_i) - V_R V_R^T \phi(x_i) \quad \text{and} \quad \mathcal{R} = \left[ r_1^T \cdots r_n^T \right]^T. \]
Using (6), the gradient of $E_{kpca}(R)$ is computed
\[ \frac{\partial E_{kpca}}{\partial R_{jk}} = \frac{1}{n} \sum_{i=1}^n \frac{\partial \|r_i\|^2}{\partial R_{jk}} = -\frac{1}{n} \sum_{i=1}^n \left[ 2 \frac{\partial k(R, x_i)^T}{\partial R_{jk}} k(R, R)^{-1} k(R, x_i) - k(R, x_i)^T k(R, R)^{-1} \frac{\partial k(R, R)}{\partial R_{jk}} k(R, R)^{-1} k(R, x_i) \right]. \]
In particular, we find that the gradient of $E_{kpca}(R)$ can be evaluated in terms of the kernel function. As examples, we compute the gradient for the linear kernel and a general symmetric stationary kernel below.

4.1 Linear kernel
For the linear kernel, we have $k(x, y) = x^T y$, $k(R, x) = R^T x$, and $k(R, R) = R^T R$, so
\[ \frac{\partial [R^T R]_{l,m}}{\partial R_{j,k}} = R_{j,m} \delta_{k,l} + R_{j,l} \delta_{k,m}, \quad (7) \]
\[ \frac{\partial [R^T x]_l}{\partial R_{j,k}} = x_j \delta_{k,l}, \quad (8) \]
where $\delta_{a,b} = 1$ if $a = b$ and $0$ if $a \neq b$.

Algorithm 1 A least-squares based method for solving the Nyström sketch learning problem (4).
Input: Mercer kernel $k(\cdot, \cdot)$, input data $X$, and sketch size $m$.
Output: Nyström sketch $R \in \mathbb{R}^{d \times m}$.
Initialize: Let $R$ be a random subset of $X$, or use, e.g., [13].
Solve (4) using the Levenberg-Marquardt method (or an alternative non-linear least squares method) to find the optimal parameter $R$.
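A minimal batch instantiation of the least-squares formulation in Algorithm 1 can be written with an off-the-shelf nonlinear least-squares solver. The sketch below is our illustration under our own assumptions, not the authors' implementation: it uses per-point residuals $\sqrt{k(x_i, x_i) - k(R, x_i)^T k(R, R)^{-1} k(R, x_i)}$ with a Gaussian kernel (so $k(x, x) = 1$), relies on the `gaussian_kernel` helper from the earlier sketch, and lets SciPy estimate the Jacobian numerically rather than using the analytic gradients of (6)-(8).

```python
import numpy as np
from scipy.optimize import least_squares

def learn_sketch_lsq(X, m, sigma, ridge=1e-10, seed=0):
    """Batch Nystrom sketch learning in the spirit of Algorithm 1:
    minimize E_kpca(R) by nonlinear least squares over the sketch entries."""
    d, n = X.shape
    rng = np.random.default_rng(seed)
    R0 = X[:, rng.choice(n, size=m, replace=False)]   # initialize with a random subset of X

    def residuals(r_flat):
        R = r_flat.reshape(d, m)
        K_RR = gaussian_kernel(R, R, sigma) + ridge * np.eye(m)
        K_Rx = gaussian_kernel(R, X, sigma)
        approx = np.sum(K_Rx * np.linalg.solve(K_RR, K_Rx), axis=0)
        # one residual per data point; squaring and summing recovers n * E_kpca(R)
        return np.sqrt(np.maximum(1.0 - approx, 0.0))

    sol = least_squares(residuals, R0.ravel(), method="trf")
    return sol.x.reshape(d, m)
```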
4.2 Symmetric stationary kernel
For any symmetric stationary kernel of the form $k(x, y) = f(\|x - y\|^2)$,
\[ \frac{\partial [k(R, R)]_{l,m}}{\partial R_{j,k}} = f'(\|R_l - R_m\|^2) \, 2 \left[ (R_{j,k} - R_{j,m}) \delta_{l,k} + (R_{j,m} - R_{j,l}) \delta_{m,k} \right], \]
\[ \frac{\partial [k(R, x)]_l}{\partial R_{j,k}} = f'(\|R_l - x\|^2) \, 2 \delta_{l,k} \left[ R_{j,k} - x_j \right]. \]

4.3 Batch algorithms
An algorithm for solving the Nyström sketch problem (4) is given in Algorithm 1. In this formulation we explicitly take advantage of the least-squares structure of the optimal Nyström sketch problem. By using the special structure of a nonlinear least-squares problem we are able to perform better than comparable gradient descent based methods. In Section 5, we compare the performance of Algorithm 1 with current state-of-the-art Nyström coreset selection and with other optimization strategies such as BFGS on both real and simulated data sets.

A second algorithm for estimating the optimal Nyström sketch is to first solve for the optimal basis in feature space using kPCA and then compute the preimage of the basis. Because there often is not a one-to-one mapping between input space and feature space, the preimage is an estimation problem; however, it seems intuitive that if the optimal basis in feature space is the kPCA basis, then finding the corresponding preimage of those bases should provide a reasonable Nyström sketch. While this approach has been used in other contexts—for example, [4] use preimages of their incremental kPCA basis as a sketch—we have not seen it proposed as an explicit solution to the Nyström sketch learning problem. We consequently propose this technique as a "straw hat" approach to solving the sketch learning problem (details in Algorithm 2). However, in Section 5, we exhibit several instances where a direct solution to the Nyström sketch learning problem via Algorithm 1 results in a more descriptive Nyström sketch. Furthermore, the online and robust extensions to Algorithm 1 introduced in Sections 4.4 and 4.5 are not directly transferable to the preimage approach.

Algorithm 2 A preimage based solution to the Nyström sketch learning problem.
Input: Mercer kernel $k(\cdot, \cdot)$, input data $X$, and sketch size $m$.
Output: Approximate optimal Nyström sketch $R \in \mathbb{R}^{d \times m}$.
Run kPCA to compute the bases $\{V_1, \ldots, V_m\}$ of the kPCA space of dimension $m$.
Apply a preimage algorithm, such as [17], to each $V_i$ independently, resulting in $m$ sketch elements, $d_i = \mathrm{preimage}(V_i)$.

Algorithm 3 A stochastic gradient descent based solution to the online Nyström sketch learning problem (4).
Input: Mercer kernel $k(\cdot, \cdot)$, input data $X$, sketch size $m$, and batch size $b$.
Output: Optimal Nyström sketch $R \in \mathbb{R}^{d \times m}$.
Initialize: Let $R$ be the first $m$ elements observed of $X$.
Update: Solve (4) using stochastic gradient descent over the next $b$ observed data elements, $\{x_i, \ldots, x_{i+b}\}$.

4.4 Online algorithm
In addition to the batch formulation in Algorithm 1, we also propose two explicit extensions that are useful in online and noisy contexts. First, by design, a least squares solver will need to revisit each data point repeatedly, which becomes costly as the data set grows. While Algorithm 1 can be used for small- to medium-sized data sets, it does not scale to very large datasets. To address this problem we modify Algorithm 1 to use a stochastic gradient descent (SGD) method, resulting in Algorithm 3. We explore how this algorithm performs on large data sets (large enough that the Gram matrix cannot be explicitly formed, making the approach in Algorithm 2 untenable) by comparing to current state-of-the-art kernel approximation techniques such as Random Kitchen Sinks [15, 16] and Fastfood [9]. We are able to demonstrate, using the same sized dictionaries, that the proposed optimal Nyström sketch obtains a smaller projection error on an unseen test set than these state-of-the-art kernel approximation techniques.

4.5 Robust online algorithm
A second extension is the use of an alternative loss to least-squares, possibly biasing the solution toward certain desirable properties. Because the batch and online algorithms can use gradient descent methods instead of least squares methods, it is possible to extend the underlying model to be something other than an $\ell_2$-norm loss. One particular loss of interest in data mining is the least-absolute-deviations or $\ell_1$-norm loss, which has some nice properties, such as being more robust to outliers than the $\ell_2$-norm loss. Other work has adapted batch kPCA to use an $\ell_1$-norm loss; for example, [28] developed an algorithm to solve for the kPCA basis using an $\ell_1$-norm loss, which yields a subspace basis that characterizes the primary signal in the data more than the outliers. We propose the following modified projection error,
\[ E_{rkpca}(R) = \frac{1}{n} \sum_{i=1}^n \|(I - V_R V_R^T) \phi(x_i)\|_1 \approx \frac{1}{n} \sum_{i=1}^n \sqrt{k(x_i, x_i) - k(R, x_i)^T k(R, R)^{-1} k(R, x_i) + \epsilon}, \quad (9) \]
where the approximation is valid as $\epsilon \to 0$. This approximation is (i) differentiable for $\epsilon > 0$ and (ii) allows for evaluation using the kernel trick. The derivative of (9) with respect to the sketch can be calculated as in Section 4. To compute solutions of (9), we propose a gradient descent based method in the batch setting or SGD in the online setting, as detailed in Algorithm 4.

Algorithm 4 A gradient descent based solution to the robust Nyström sketch learning problem (9).
Input: Mercer kernel $k(\cdot, \cdot)$, input data $X$, sketch size $m$, batch size $b$, and regularization $\epsilon$.
Output: Optimal Nyström sketch $R \in \mathbb{R}^{d \times m}$.
Initialize: Let $R$ be the first $m$ elements observed of $X$.
Update: Apply (stochastic) gradient descent to find the optimal parameter $R$ for the least-absolute-deviations problem (9) after observing $b$ new data elements $\{x_i, \ldots, x_{i+b}\}$.
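The smoothed objective (9) only changes the per-point loss from a squared term to a square-root term. A minimal sketch of the evaluation follows, again using the Gaussian-kernel helper defined earlier (our illustration, with $\epsilon$ as the smoothing parameter).

```python
import numpy as np

def robust_projection_error(X, R, sigma, eps=1e-6, ridge=1e-10):
    """Smoothed least-absolute-deviations error E_rkpca(R) of Eq. (9)."""
    K_RR = gaussian_kernel(R, R, sigma) + ridge * np.eye(R.shape[1])
    K_Rx = gaussian_kernel(R, X, sigma)
    approx = np.sum(K_Rx * np.linalg.solve(K_RR, K_Rx), axis=0)
    # per-point residual magnitude instead of its square; eps keeps the loss differentiable
    return float(np.mean(np.sqrt(np.maximum(1.0 - approx, 0.0) + eps)))
```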
5 Experiments
We ran several experiments to explore how well the proposed Nyström sketch learning strategy works compared to current state-of-the-art methods in Nyström coreset selection. To compare methods in batch mode, we generate a simple simulated dataset of a swirl with outliers, very similar to the swirl data set presented in [28]. This synthesized dataset facilitates exploring and visualizing the performance of the various selection techniques, including the effects of the robust loss function. For a view of real-world data, we also compared the batch algorithms on a subsample of the large forest cover data set from the UCI repository [10]. To compare the methods in an online setting, we use the forest cover, cpu, and two gas datasets, gas-CO and gas-methane, from the UCI repository [10, 6]. These datasets are large enough that it becomes difficult to compute kPCA exactly, making Algorithm 1 or Algorithm 2 unusable. Instead we compare to random-projection-based kernel approximation methods. We first ran the proposed algorithm on a training subset and compared the resulting Nyström approximation to the current state-of-the-art method for Gaussian kernels on large data sets, random Fourier features (RFF) [15]. We note that the proposed method is learning from the training data while the RFF method is not; this is part of the point: by learning a small sketch for Nyström approximation, the proposed method is able to outperform the current state-of-the-art kernel approximation methods for similar sized sketches.

5.1 Methods
We present results using the least-squares method, lsq (Algorithm 1); the preimage method (Algorithm 2); and the stochastic gradient descent method, sgd (Algorithm 3). For comparison we also include two non-stochastic gradient based variations on Algorithm 3: the first uses the BFGS method, which we refer to as bfgs, and the second uses a typical first order gradient descent method, gd. For details on the Levenberg-Marquardt method used in lsq, gradient descent, and the BFGS method we refer the reader to [27]. To compare to the current state-of-the-art kernel approximations, we use the seminal random Fourier features, rff, from [15, 16]. The preimage method makes use of the preimage approach in [17], which uses a fixed-point iteration solver to compute the preimage. We initially evaluated the preimage approaches in [8], which uses an interpolation of a neighborhood of points, and [23], which combines both methods, but found that [17] performed best overall on the problems considered. To compare against Nyström coreset methods we use the well studied uniform random coreset selection method, random [5]. Although simple, this method is a common benchmark for Nyström coreset selection; we include it here for reference to other coreset methods.

5.2 Parameter selection
We specifically present examples using the Gaussian kernel,
\[ k(x_1, x_2) = \exp\left( -\frac{\|x_1 - x_2\|_2^2}{2\sigma^2} \right), \]
because of its universal use and to demonstrate that our method performs well even on problems where the feature space has high curvature and infinite feature dimension. The parameter $\sigma$ for the synthetic swirl data set was chosen small enough to "spread out" the curve as a line in feature space. For the other datasets $\sigma$ was selected according to the average inter-point distance of the data set; the specific parameters are given in Table 1. For each dataset we estimated the number of kPCA components, $p$, needed to capture 90% of the kPCA spectral energy and set that as the smallest Nyström sketch size. We then explore sketches of that size and larger, $m \ge p$.
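The bandwidth heuristic described above is simple to reproduce. The sketch below is our illustration: it sets $\sigma$ to the average inter-point distance of a subsample and estimates the smallest sketch size $p$ from the kernel spectrum; whether to center the kernel matrix before the eigendecomposition is a detail the text does not specify, so the uncentered Gram matrix is used here as an assumption.

```python
import numpy as np

def select_parameters(X, energy=0.90, max_points=1000, seed=0):
    """Pick sigma as the average inter-point distance and p as the number of
    kernel eigenvalues needed to capture the requested spectral energy."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    idx = rng.choice(n, size=min(n, max_points), replace=False)
    S = X[:, idx]

    # average pairwise Euclidean distance over the subsample
    sq = (np.sum(S**2, axis=0)[:, None] + np.sum(S**2, axis=0)[None, :] - 2.0 * S.T @ S)
    dists = np.sqrt(np.maximum(sq, 0.0))
    k = len(idx)
    sigma = float(np.sum(dists) / (k * (k - 1)))

    # smallest sketch size: eigenvalues of the (uncentered) Gram matrix
    K = gaussian_kernel(S, S, sigma)
    eigvals = np.sort(np.linalg.eigvalsh(K))[::-1]
    p = int(np.searchsorted(np.cumsum(eigvals) / np.sum(eigvals), energy) + 1)
    return sigma, p
```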
5.3 Optimal Nyström sketch learning
The first experiment used a synthetic swirl dataset. This type of dataset is interesting in an unsupervised setting because it provides some view into how well the method is learning the (obvious) structure in the data. We generated a swirl with some clear outlier data in a way similar to [28]; specifically, we sample uniformly from the region
\[ \{ (0.07\gamma + \alpha)(\cos\gamma, \sin\gamma) : \alpha \in [0, 0.1],\ \gamma \in [0, 6\pi] \}, \]
as well as uniformly from a square region off to the side, to provide some indication of the effect of obvious outliers on the result. We sample a total of 1000 points, with 10% of those sampled from the outlier region. A summary of attributes for all data sets used is given in Table 1.
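The swirl construction above translates directly into a sampler; the following sketch generates the curve points and the off-to-the-side outlier block. The exact placement and extent of the outlier square are not specified in the text, so the values used here are assumptions.

```python
import numpy as np

def make_swirl(n=1000, outlier_frac=0.10, seed=0):
    """Sample {(0.07*g + a)(cos g, sin g) : a in [0, 0.1], g in [0, 6*pi]}
    plus a square block of outliers off to the side."""
    rng = np.random.default_rng(seed)
    n_out = int(round(outlier_frac * n))
    n_in = n - n_out

    g = rng.uniform(0.0, 6.0 * np.pi, size=n_in)
    a = rng.uniform(0.0, 0.1, size=n_in)
    swirl = (0.07 * g + a) * np.stack([np.cos(g), np.sin(g)])     # 2 x n_in

    # outlier square "off to the side"; its location and size are illustrative assumptions
    outliers = rng.uniform([1.8, -0.8], [2.3, -0.3], size=(n_out, 2)).T

    return np.hstack([swirl, outliers])                            # 2 x n data matrix

X_swirl = make_swirl()
```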
Table 1. Data sets and parameters.
data set              d    n        σ
swirl                 2    1000     0.10
forest (subsampled)   54   1000     2.59 × 10^3
cpu-train             21   6554     5.88 × 10^5
cpu-test              21   819      5.88 × 10^5
forest-train          54   522910   2.53 × 10^3
forest-test           54   58102    2.53 × 10^3
gas-CO-train          16   3708261  6.44 × 10^3
gas-CO-test           16   500000   6.44 × 10^3
gas-methane-train     16   3678504  4.68 × 10^3
gas-methane-test      16   500000   4.68 × 10^3

We are specifically interested in how the results change when the restriction to Nyström coreset selection is relaxed to a more general Nyström sketch. We have shown that the optimal Nyström sketch for a linear kernel is given by the first p PCA basis vectors. We hypothesize that relaxing the coreset restriction to a sketch learning problem in the nonlinear kernel case also provides an advantage. We present a visual and quantitative comparison of Nyström coreset selection and sketch learning using the random, lsq, and preimage methods on the swirl data set in Figure 1. The plot of projection error shows that the proposed lsq approach performs best for each size of coreset or sketch, and that both the lsq and preimage sketches perform better than a random coreset of the same size. Visually we can see that each method obtains good coverage of the structure of the swirl, but lsq places the sketch elements at more strategic locations; for example, the center of the swirl, with its high curvature, receives more samples, while areas with less curvature have lower sampling density. Note that the random method maintains a coreset, i.e., "in-sample" points, while the preimage and lsq methods can obtain sketch elements that are "out-of-sample". We emphasize that this flexibility in sketch learning provides advantages over coreset selection when the sketch/coreset is restricted in size.

Fig. 1. Results running the various solution methods in batch mode on a synthesized swirl dataset (panels: random, lsq, bfgs, sgd, preimage, and a projection-error comparison). Top row and bottom left figures are visualizations comparing the sketch and coreset selection for each method. Bottom right figure is a plot of the projection error of each method.

Fig. 2. Projection error (left) and runtime (right) for various solution methods in batch mode on a small subsample of the forest cover dataset.

We are also interested in understanding how various methods perform at solving the optimal Nyström sketch learning problem. For this we ran a second experiment on a uniform random subsample of the UCI forest cover data set [10]. We solved for the optimal sketch using lsq, preimage, and random, as well as two gradient descent methods, bfgs and sgd. The results are plotted in Figure 2. Note that all three solvers for the optimal formulation, lsq, bfgs, and sgd, perform best for all sketch sizes. It is interesting that the preimage method does not perform as well as the sketch size increases. We also plot the runtime of the various methods. Note that the optimal solvers also all take the most time, and grow in cost as the sketch size increases. This indicates that computing an optimal sketch will primarily be worth the effort for very small sketches. We note that lsq runs faster than sgd only because we gave lsq a very loose convergence criterion (1.0 × 10^-3), while sgd must run through the entire data set. The bfgs method had the same convergence criterion but takes longer than both lsq and sgd.

Fig. 3. Results for various solution methods in online mode on the real world datasets. (left) The projection error for the sgd and rff methods on the cpu dataset. (right) A summary of similar results for the forest, gas-CO, and gas-methane datasets, using sketches of size 4 and 5 (d4 and d5).

Fig. 4. Results for the gradient-descent-based methods on the robust sketch learning problem with a synthesized swirl dataset (panels: sgd, gd, bfgs). Compare to Figure 1.

5.4 Online extension
As described previously, using a gradient descent method to solve the least-squares problem opens the possibility of using a method like stochastic gradient descent to extend our model to an online setting, as detailed in Algorithm 3. The basic idea is that, while observing new data, we can continually update the sketch so that we always have a very representative Nyström sketch. To evaluate this idea, we simulated an online setting by splitting each of the UCI forest and cpu datasets into a training set and a test set. We then run the sketch learning algorithm on the (large) training set to learn the modified sketch as we "observe" new data. Since we are interested in using the kPCA in some kind of dimension reduction or other application, we evaluate the learned sketch on unseen data (the test set). Because the datasets in an online setting are potentially very large, methods that need to revisit each data point repeatedly, like lsq and bfgs, are too slow. However, the sgd method is specifically useful in an online setting. Note that, because the data set is too large for direct decomposition, the preimage method cannot be used on datasets of this size. Consequently we compare the sgd method with the current state-of-the-art in kernel approximation, rff, in the online setting.
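Algorithm 3's online update can be prototyped with a generic gradient estimate. The sketch below is our simplification: it takes one stochastic step per mini-batch using a finite-difference gradient of the batch objective (via `scipy.optimize.approx_fprime`), whereas the paper works with the analytic kernel gradients of Section 4; it also reuses the `gaussian_kernel` helper defined earlier.

```python
import numpy as np
from scipy.optimize import approx_fprime

def sgd_sketch_update(R, X_batch, sigma, lr=0.1, ridge=1e-10):
    """One stochastic update of the Nystrom sketch on a mini-batch (Algorithm 3 spirit)."""
    d, m = R.shape

    def batch_error(r_flat):
        Rc = r_flat.reshape(d, m)
        K_RR = gaussian_kernel(Rc, Rc, sigma) + ridge * np.eye(m)
        K_Rx = gaussian_kernel(Rc, X_batch, sigma)
        approx = np.sum(K_Rx * np.linalg.solve(K_RR, K_Rx), axis=0)
        return float(np.mean(1.0 - approx))

    grad = approx_fprime(R.ravel(), batch_error, 1e-6)
    return R - lr * grad.reshape(d, m)

# Streaming usage: initialize R with the first m observed points, then update per batch.
# for X_batch in stream_of_batches:
#     R = sgd_sketch_update(R, X_batch, sigma)
```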
The results of this experiment for the cpu dataset are shown in Figure 3 (left). Note that while sgd was trained over a large training data set (cpu-train), the rff method does not require any training. Consequently, the projection error for rff consists of projecting the data point onto the preselected random directions, taking the inner product, and then computing the difference with the exact kernel inner product. Because the sgd method is trained on the same distribution of data and computes an approximately optimal sketch, it performs better for all sketch sizes evaluated. The results on the forest, gas-CO, and gas-methane datasets are summarized in Figure 3 (right). This plot summarizes the projection error for the three datasets using rff, sgd, and the ℓ1 version of sgd. Each method was evaluated on a sketch of size 4 or 5, as indicated. Again these experiments specifically point out the advantage of using a learned sketch and the Nyström approximation over using a random basis of the same size. While random projection methods excel when the basis size is very large, for very small sketches the Nyström approximation obtains a better projection error, as shown here.

5.5 Robust extension
We ran similar experiments using the robust extension described in Algorithm 4. The results for the swirl dataset are shown in Figure 4. Note that, compared to the least-squares results in Figure 1 (especially lsq), the robust formulation places fewer sketch elements in the outlier section of the swirl data set. This is an important effect caused by the use of the ℓ1 norm, reducing the effect of outliers on the sketch learning. We also used the robust extension in an online setting on the gas-CO and gas-methane datasets, with results in Figure 3 (right). Note how in the gas-CO case the ℓ1 result improved on the ℓ2 result, while for the gas-methane dataset the results were similar to the ℓ2 results. This real-world example illustrates one case, the gas-CO dataset, where a robust result improved generalization.

6 Conclusion
We introduced a novel out-of-sample dictionary learning method and have shown that it describes the data better than current state-of-the-art methods; in particular, for the same sized dictionaries, the proposed method yields dictionaries with smaller projection error. We proved that for the linear kernel the proposed method gives a dictionary that is equivalent to the PCA subspace of the given size. We also demonstrated that, because of the difficulty of mapping back into the input space, the proposed method for finding the optimal dictionary performs better than the preimage of the kPCA subspace basis.

Acknowledgements B. Osting was partially funded by NSF DMS 16-19755.

References 1. Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel methods with statistical guarantees. arXiv, pages 1-17, 2014. 2. Carlos Alzate and Johan AK Suykens. Kernel component analysis using an epsilon-insensitive robust loss function. Neural Networks, IEEE Transactions on, 19(9):1583-1598, 2008. 3. Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed adaptive sampling for kernel matrix approximation. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 54:1421-1429, 2017. 4. Tat-Jun Chin and David Suter. Incremental kernel principal component analysis. Image Processing, IEEE Transactions on, 16(6):1662-1674, 2007. 5. Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. The Journal of Machine Learning Research, 6:2153-2175, 2005. 6.
Jordi Fonollosa, Sadique Sheik, Ramón Huerta, and Santiago Marco. Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sensors and Actuators B: Chemical, 215:618-629, 2015. 7. Ian Jolliffe. Principal component analysis. Wiley Online Library, 2002. 8. JT-Y Kwok and Ivor W Tsang. The pre-image problem in kernel methods. Neural Networks, IEEE Transactions on, 15(6):1517-1525, 2004. 9. Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood-approximating kernel expansions in loglinear time. In Proceedings of the international conference on machine learning, 2013. 10. M. Lichman. UCI machine learning repository, 2013. 11. David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. arXiv preprint arXiv:1402.0119, 2014. 12. Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and an algorithm. Advances in neural information processing systems, 2:849-856, 2002. 13. Marie Ouimet and Yoshua Bengio. Greedy spectral embedding. In Proc. 10th Int. Workshop on Artificial Intelligence and Statistics, pages 253-260. Citeseer, 2005. 14. Karl Pearson. Principal components analysis. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 6(2):559, 1901. 15. Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177-1184, 2007. 16. Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems, pages 1313-1320, 2009. 48 17. Bernhard Schölkopf, Sebastian Mika, Alex Smola, Gunnar Rätsch, and KlausRobert Müller. Kernel pca pattern reconstruction via approximate pre-images. In ICANN 98, pages 147-152. Springer, 1998. 18. Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural computation, 10(5):1299- 1319, 1998. 19. Erich Schubert, Arthur Zimek, and Hans-Peter Kriegel. Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, pages 542- 550, 2014. 20. Alex J Smola, Olvi L Mangasarian, and Bernhard Schölkopf. Sparse kernel feature analysis. In Classification, Automation, and New Media, pages 167-178. Springer, 2002. 21. Alex J Smola and Bernhard Schökopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 911-918. Morgan Kaufmann Publishers Inc., 2000. 22. Patrick Snape and Stefanos Zafeiriou. Kernel-pca analysis of surface normals for shape-from-shading. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1059-1066. IEEE, 2014. 23. Ana R Teixeira, Ana Maria Tomé, Kurt Stadlthanner, and Elmar Wolfgang Lang. Kpca denoising and the pre-image problem revisited. Digital Signal Processing, 18(4):568-580, 2008. 24. Michael E Tipping. Sparse kernel principal component analysis. In Advances in Neural Information Processing Systems, pages 633-639, 2001. 25. Jingdong Wang, Jianguo Lee, and Changshui Zhang. Kernel trick embedded gaussian mixture model. In Algorithmic Learning Theory, pages 159-174. Springer, 2003. 26. Christopher Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. 
In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, number EPFL-CONF-161322, pages 682-688, 2001.
27. S. J. Wright and J. Nocedal. Numerical optimization, volume 2. Springer New York, 1999.
28. Yingchao Xiao, Huangang Wang, Wenli Xu, and Junwu Zhou. L1 norm based KPCA for novelty detection. Pattern Recognition, 46(1):389-396, 2013.
29. Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In Advances in neural information processing systems, pages 476-484, 2012.

CHAPTER 4

STREAMING KERNEL PRINCIPAL COMPONENT ANALYSIS

This chapter is a reprint of "Streaming kernel principal component analysis," Proceedings of Machine Learning Research, v. 51, 2016, pp. 1365-1374, Mina Ghashami, Daniel J. Perry¹, and Jeff M. Phillips, copyright retained by the authors, reprinted with permission of the authors and originally presented at the 2016 conference on Artificial Intelligence and Statistics in May 2016.

¹ First and second authors contributed equally

Streaming Kernel Principal Component Analysis

Mina Ghashami, School of Computing, University of Utah, Salt Lake City, UT 84112, ghashami@cs.utah.edu
Daniel J. Perry, SCI Institute, University of Utah, Salt Lake City, UT 84112, dperry@cs.utah.edu
Jeff M. Phillips, School of Computing, University of Utah, Salt Lake City, UT 84112, jeffp@cs.utah.edu

Abstract

Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture nonlinear structures within large data sets, and is a central tool in data analysis and learning. To allow for nonlinear relations, typically a full n × n kernel matrix is constructed over n data points, but this requires too much space and time for large values of n. Techniques such as the Nyström method and random feature maps can help towards this goal, but they do not explicitly maintain the basis vectors in a stream and take more space than desired. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in n, and also improves the dependence on the error parameter. Our technique combines random feature maps with recent advances in matrix sketching; it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.

1 Introduction

Principal component analysis (PCA) is a well-known technique for dimensionality reduction, and has many applications including visualization, pattern recognition, and data compression [11]. Given a set of centered d-dimensional (training) data points A = [a1; . . . ; an] ∈ Rn×d, PCA diagonalizes the covariance matrix C = (1/n) AᵀA by solving the eigenvalue equation Cv = λv. However, when the data points lie on a highly nonlinear space, PCA fails to concisely capture the structure of the data. To overcome this, several nonlinear extensions of PCA have been proposed, in particular Kernel Principal Component Analysis (KPCA) [24]. The basic idea of KPCA is to implicitly map the data into a nonlinear feature space of high (or often infinite) dimension and perform PCA in that space [24]. The nonlinear map is often denoted as φ : Rd → H, where H is a Reproducing Kernel Hilbert Space (RKHS).
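As a point of reference for the costs discussed in this chapter, the following is a minimal batch KPCA for a Gaussian kernel, computed through the n × n Gram-matrix route described next. It is only a sketch (the bandwidth, centering, and scaling conventions are assumptions, and the function names are illustrative); its quadratic space and cubic time are exactly what the streaming approach avoids.

```python
import numpy as np

def gaussian_gram(A, sigma=1.0):
    # Full n x n Gram matrix G_ij = K(a_i, a_j) -- the object streaming KPCA avoids storing.
    sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def exact_kpca(A, k, sigma=1.0):
    # Batch KPCA: center the Gram matrix, then take its top-k eigenvectors.
    n = A.shape[0]
    G = gaussian_gram(A, sigma)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Gc = H @ G @ H
    vals, vecs = np.linalg.eigh(Gc)              # eigenvalues in ascending order
    vals, vecs = vals[::-1][:k], vecs[:, ::-1][:, :k]
    # Rows are the k-dimensional "loadings" of each training point.
    return vecs * np.sqrt(np.maximum(vals, 0))

A = np.random.default_rng(1).normal(size=(200, 4))
print(exact_kpca(A, k=2).shape)    # (200, 2) -- but memory grows as n^2 and time as n^3
```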
While direct computation of PCA in RKHS is infeasible, we can invoke the so called kernel trick which exploits the fact that PCA interacts with data through only pair-wise inner products. That is hφ(x), φ(y)iH = K(x, y), for all x, y ∈ Rd for a kernel function K; we represent this as the n × n gram matrix G. However, KPCA suffers from high space and computational complexity in storing the entire kernel (gram) matrix G ∈ Rn×n and in computing the decomposition of it in the training phase. Then in the testing phase it spends O(nd) time to evaluate the kernel function for any arbitrary test vector with respect to all training examples. Although one can use low rank decomposition approaches [4, 23, 18, 7] to reduce the computational cost to some extent, KPCA still needs to compute and store the kernel matrix. There have been two main approaches towards resolving this space issue. First approach is the Nyström [26] which uses a sample of the data points to construct a much smaller gram matrix. Second approach is using feature maps [22] which provide an approximate but explicit embedding of the RKHS into Euclidean space. As we describe later, both approaches can be made to operate in a stream, approximating the KPCA result in less than O(n2 ) time and space. Once these approximations are formed, they reveal a D n dimensional space, and typically a kdimensional subspace found through linear PCA in RD , which captures most of the data (e.g., a low rank-k approximation). There are two main purposes of these 51 Streaming Kernel Principal Component Analysis D- and k-dimensional subspaces; they start with mapping a data point x ∈ Rd into the D-dimensional space, and then often onto the k-dimensional subspace. If x is one of the training data points, then the k-dimensional representation can be used as a concise "loadings" vector. It can be used in various down-stream training and learning tasks wherein this k-dimensional space, can assume linear relations (e.g., linear separators, clustering under Euclidean distance) since the nonlinearity will have already been represented through the mapping to this space. If x is not in the training set, and the training set represents some underlying distribution, then we can assess the "fit" of x to this distribution by considering the residual of its representation in the D-dimensional space when projected to the k-dimensional space. We let Test time refer to this time for mapping a single point x to the D-dimensional and k-dimensional spaces. The value of k needed to get a good fit depends on the choice of kernel and its fit to the data; but D depends on the technique. For instance in (regular) KPCA D = n, in Nyström D = O(1/ε2 ), when using random feature maps with [22] D = O((1/ε2 ) log n), where ε ∈ (0, 1) is the error parameter. We propose a new streaming approach, named as SKPCA, that will only require D = O(1/ε). We prove bounds and show empirically that SKPCA greatly outperforms existing techniques in Test time, and is also comparable or better in other measures of Space (the cost of storing this map, and space needed to construct it), and Train time (the time needed to construct the map to the D-dimensional and kdimensional spaces). Background and Notation. We indicate matrix A is n × d dimensional as A ∈ Rn×d . Matrices A and Z will be indexed by their row vectors A = [a1 ; a2 ; . . . ; an ] while other matrices V, U, W, . . . will be indexed by column vectors V = [v1 , v2 , . . . , vd ]. 
We use In for the n-dimensional identity matrix and 0n×d as the full zero matrix of dimension n × d. The FrobepP 2 nius norm of a matrix A is kAkF = i=1 kai k kAxk and the spectral norm is kAk2 = supx∈Rd kxk . We denote transpose of a matrix as AT . The singular value decomposition of matrix A ∈ Rn×d is denoted by [U, S, V ] = svd(A). If n ≥ d it guarantees that A = U SV T , U T U = In , V T V = Id , U ∈ Rn×n , V ∈ Rd×d , and S = diag(s1 , s2 , . . . , sd ) ∈ Rn×d is a diagonal matrix with s1 ≥ s2 ≥ . . . ≥ sd ≥ 0. Let Uk and Vk be matrices containing the first k columns of U and V , respectively, and Sk = diag(s1 , s2 , . . . , sk ) ∈ Rk×k . The matrix Ak = Uk Sk VkT is the best rank k approximation of A in the sense that Ak = arg minC:rank(C)≤k kA − Ck2,F . We denote by πB (A) the projection of rows of A on the span of the rows of B. In other words, πB (A) = AB † B where (·)† indicates taking the Moore-Penrose psuedoinverse. Finally, expected value of a matrix is defined as the matrix of expected values. 1.1 Related Work Matrix Sketching. Among many recent advancements in matrix sketching [27, 20], we focus on those that compress a n × d matrix A into an ` × d matrix B. There are several classes of algorithms based on row/column sampling [4, 2] (very related to Nyström approaches [5]), random projection [23] or hashing [3] which require ` ≈ c/ε2 to achieve ε error. The constant c depends on the algorithm, specific type of approximation, and whether it is a "for each" or "for all" approximation. A recent and different approach, Frequent Directions (FD) [18], uses only ` = 2/ε to achieve the error bound kAT A − B T Bk2 ≤ εkAk2F , and runs in time O(nd/ε). We use a modified version of this algorithm in our proposed approach. Incremental Kernel PCA. Techniques of this group update/augment the eigenspace of kernel PCA without storing all training data. [13] adapted incremental PCA [9] to maintain a set of linearly independent training data points and compute top d eigenvectors such that they preserve a θ-fraction (for a threshold θ ∈ (0, 1)) of the total energy of the eigenspace. However this method suffers from two major drawbacks. First, the set of linearly independent data points can grow large and unpredictably, perhaps exceeding the capacity of the memory. Second, under adversarial (or structured sparse) data, intermediate approximations of the eigenspace can compound in error, giving bad performance [6]. Some of these issues can be addressed using online regret analysis assuming incoming data is drawn iid (e.g., [15]). However, in the adversarial settings we consider, FD [18] can be seen as the right way to formally address these issues. Nyström-Based Methods for Kernel PCA. Another group of methods [26, 5, 8, 14, 25], known as Nyström, approximate the kernel (gram) matrix G with a low-rank matrix Ĝ, by sampling columns of G. The original version [26] samples c columns with replacement as C and estimates Ĝ = CW −1 C T , where W is the intersection of the sampled columns and rows; this method takes O(nc2 ) time and is not streaming. Later [5] used sampling with replacement and approximated G as Ĝ = CWk† C T . They proved if sampling Pn probabilities are of form pi = G2ii / i=1 G2ii , then for ε ∈ (0, 1) and δ ∈ (0, 1), P a Frobenius error bound n kG− Ḡk kF ≤ kG−Gk kF +ε i=1 G2ii holds with prob4 ability 1 − δ for c = O((k/ε ) log(1/δ)), and aP spectral n error bound kG − Ḡk k2 ≤ kG − Gk k2 + ε i=1 G2ii holds with probability 1 − δ for c = O((1/ε2 ) log(1/δ)) 52 Mina Ghashami, Daniel J. 
Perry, Jeff M. Phillips samples. There exist conditional improvements, e.g., ) where µ denotes the [8] shows with c = O(µ k ln(k/δ) ε2 coherence of the top k-dimensional eigenspace of G, n that kG − Ḡk k2 ≤ (1 + (1−ε)c )kG − Gk k2 . Random Fourier Features for Kernel PCA. In this line of work, the kernel matrix is approximated via randomized feature maps. The seminal work of [22] showed one can construct randomized feature maps Z : Rd → Rm such that for any shift-invariant kernel K(x, y) = K(x − y) and all x, y ∈ Rd , E[hZ(x), Z(y)i] = K(x, y) and if m = O((d/ε2 ) log(n/δ)), then with probability at least 1−δ, |hZ(x), Z(y)i − K(x, y)| ≤ ε. Using this mapping, instead of implicitly lifting data points to H by the kernel trick, they explicitly embed the data to a lowdimensional Euclidean inner product space. Subsequent works generalized to other kernel functions such as group invariant kernels [17], min/intersection kernels [21], dot-product kernels [12], and polynomial kernels [10, 1]. This essentially converts kernel PCA to linear PCA. In particular, Lopez et al. [19] proposed randomized nonlinear PCA (RNCA), which is an exact linear PCA on the approximate data feature maps matrix Z ∈ Rn×m . They showed the approximation error is bounded as E[kĜ−Gk2 ] ≤ Θ((n log n)/m), where Ĝ = ZZ T is not actually constructed. 1.2 Our Result vs. Previous Streaming In this paper, we present a streaming algorithm for computing kernel PCA where the kernel is any shiftinvariant function K(x, y) = K(x − y). We refer to our algorithm as SKPCA (Streaming Kernel PCA) throughout the paper. Transforming the data to a mdimensional random Fourier feature space Z ∈ Rn×m (for m n) and maintaining an approximate ` dimensional subspace W ∈ Rm×` (` m), we are able to show that for G̃ = ZW W T Z T , the bound kG − G̃k2 ≤ εn holds with high probability. Our algorithm requires O(dm + m`) space for storing m feature functions and the `-dimensional eigenspace W . Moreover SKPCA needs O(dm + ndm + nm`) = O(nm(d + `)) time to compute W , and permits O(dm+m`) test time to first transfer the data point to m-dimensional feature space and then update the eigenspace W . We compare with two streaming algorithms. RNCA[19]: It achieves E[kĜ−Gk2 ] ≤ Θ((n log n)/m), for Ĝ = ZZ T and Z ∈ Rn×m being the data feature map matrix. Using Markov's inequality it is easy to show that with any constant probability kĜ − Gk2 ≤ εn if m = O((log n)/ε). We extend this (in Appendix A.1) to show with m = O((log n)/ε2 ) then kĜ − Gk2 ≤ εn with high probability 1 − 1/n. This algorithm takes O(dm+nmd+nm2 ) = O(nmd+nm2 ) time to construct m feature functions, apply them to n data points and compute the m × m covariance matrix (adding n m × m outer products). Moreover, it takes O(dm + m2 ) space to store the feature functions and covariance matrix. Testing on a new data point x ∈ Rd is done by applying the m feature functions on x, and projecting to the rank-k eigenspace in O(dm + mk) Nyström [5]: It approximates the original Gram matrix with Ḡ = CWk† C T . For shift-invariant Pn kernels, the sampling probabilities are pi = G2ii / i=1 G2ii = 1/n, hence one can construct W ∈ Rc×c in a stream using c independent reservoir samplers. Note setting k = n (hence Gk = G), their spectral error bound translates to kG− Ḡk2 ≤ εn for c = O((1/ε2 ) log(1/δ)). Their algorithm requires O(nc + dc2 log n) time to do the sampling and construct W . It also needs O(cd + c2 ) space for storing the samples and W . 
The test time step on a point x ∈ Rd evaluates K(x, y) on each sampled data point y, taking O(cd) time, and projects onto the c-dimensional and k-dimensional bases in O(c² + ck) time; in total it requires O(cd + c²) time. For both RNCA and Nyström we calculate the eigendecomposition once, at cost O(m³) or O(c³), respectively, at Train time. Since SKPCA maintains this decomposition at all steps, and Testing may occur at any step, this favors RNCA and Nyström.

Table 1 summarizes the train/test time and space usage of the above-mentioned algorithms. KPCA is included in the table as a benchmark. All bounds are stated for a high-probability guarantee with δ = 1/n. As a result, c = O((log n)/ε²) for Nyström and m = O((log n)/ε²) for RNCA and SKPCA. One can use Hadamard fast Fourier transforms (Fastfood) [16] in place of Gaussian matrices to improve the train/test time and space usage of SKPCA and RNCA. These matrices allow us to compute random feature maps in time O(m log d) instead of O(md), and to store the feature functions in space O(m) instead of O(md). Since d was relatively small in our examples, we did not observe much empirical benefit from this approach, and we omit it from our further discussions. We see SKPCA wins on Space and Train time by a factor (log n)/ε over RNCA, and similarly on Space and Test time by factors (log n)/ε and (log n)/(ε²k) over Nyström. When d is constant and ε < ((log² n)/n)^{1/3} it improves Train time over Nyström. It is the first method to use Space sublinear (logarithmic) in n and sub-quartic in 1/ε, and to have Train time sub-quartic in 1/ε, even without counting the eigendecomposition cost.

Table 1: Asymptotic Train time, Test time, and Space for SKPCA, KPCA [24], RNCA [19], and Nyström [5] to achieve ‖G′ − G‖₂ ≤ εn with high probability (KPCA is exact) and with Gaussian kernels.

KPCA: Train time O(n²d + n³); Test time O(n² + nd); Space O(n² + nd).
Nyström: Train time O((n log n)/ε² + (d log³ n)/ε⁴ + (log³ n)/ε⁶); Test time O((d log n)/ε² + (log² n)/ε⁴); Space O((d log n)/ε² + (log² n)/ε⁴).
RNCA: Train time O((nd log n)/ε² + (n log² n)/ε⁴ + (log³ n)/ε⁶); Test time O((d log n)/ε² + (k log n)/ε²); Space O((d log n)/ε² + (log² n)/ε⁴).
SKPCA: Train time O((nd log n)/ε² + (n log n)/ε³); Test time O((d log n)/ε² + (k log n)/ε²); Space O((d log n)/ε² + (log n)/ε³).

2 Algorithm and Analysis

In this section, we describe our algorithm Streaming Kernel Principal Component Analysis (SKPCA) for approximating the eigenspace of a data set which lies on a nonlinear manifold and is received in streaming fashion, one data point at a time.

Algorithm 1 SKPCA
Input: A ∈ Rn×d as data points, a shift-invariant kernel function K, and ℓ, m ∈ Z+
Output: Feature maps [f1, · · · , fm] and their approximate best ℓ-dimensional subspace W
  [f1, · · · , fm] = FeatureMaps(K, m)
  B ← 0_{ℓ×m}
  for i ∈ [n] do
    zi = √(2/m) [f1(ai), · · · , fm(ai)]
    insert zi as a row of B
    if B has no zero-valued rows then
      [Y, Σ, W] ← svd(B)
      B ← √(max{0, Σ² − Σ²_{ℓ/2,ℓ/2} I_ℓ}) · Wᵀ
  Return [f1, · · · , fm] and W   (W ∈ Rm×ℓ)

SKPCA consists of two implicit phases. In the first phase, a set of m data-oblivious random feature functions (f1, · · · , fm) is computed to map data points to a low-dimensional Euclidean inner product space. These feature functions are used to map each data point ai ∈ Rd to zi ∈ Rm. In the second phase, each approximate feature vector zi is fed into a modified FrequentDirections [18], a small-space streaming algorithm for computing an approximate set of singular vectors, the matrix W ∈ Rm×ℓ. However, in the actual algorithm these phases are not separated.
The feature mapping functions are precomputed (oblivious to the data), so the approximate feature vectors are immediately fed into the matrix sketching algorithm, and we never need to fully materialize and store the full n × m matrix Z. Also, perhaps unintuitively, we do not sketch the m-dimensional column space of Z, but rather its n-dimensional row space. Yet, since the resulting ℓ-dimensional row space of W (with ℓ ≪ m) encodes a lower dimensional subspace within Rm, it serves to represent our kernel principal components. Pseudocode is provided in Algorithm 1.

Approximate feature maps. To make the algorithm concrete, we consider the approximate feature maps described in the general framework of Rahimi and Recht [22]; we label this instantiation of the FeatureMaps function as Random Fourier FeatureMaps (or RFFMaps). It works for positive definite shift-invariant kernels K(x, y) = K(x − y) (e.g., the Gaussian kernel K(x, y) = (1/2π)^{d/2} exp(−‖x − y‖²/2)). It computes a randomized feature map z : Rd → Rm so that E[z(x)ᵀz(y)] = K(x, y) for any x, y ∈ Rd. To construct the mapping z, they define m functions of the form fi(x) = cos(riᵀx + γi), where ri ∈ Rd is a sample drawn at random from the Fourier transform of the kernel function, and γi ∼ Unif(0, 2π] is drawn uniformly at random from the interval (0, 2π]. Applying each fi to a data point x gives the ith coordinate of z(x) in Rm as z(x)i = √(2/m) fi(x). This implies each coordinate has squared value (z(x)i)² ≤ 2/m. We consider m = O((1/ε²) log n) and ℓ = O(1/ε).

Space: We store the m functions fi, for i = 1, . . . , m; since each function uses a d-dimensional vector ri, this takes O(dm) space in total. We compute the feature map z(x) and get an m-dimensional row vector z(ai) for each data point ai ∈ A, which is then used to update the sketch B ∈ Rℓ×m in FrequentDirections. Since we need an additional O(ℓm) for storing B and W, the total space usage of Algorithm 1 is O(dm + ℓm) = O((d log n)/ε² + (log n)/ε³).

Train time: Applying the feature map takes O(n · dm) time and computing the modified FrequentDirections sketch takes O(nℓm) time, so the training time is O(ndm + nℓm) = O(n log n (d/ε² + 1/ε³)).

Test time: For a test point x_test, we can lift it to Rm in O(dm) time using {f1, . . . , fm}, and then use W to project it to Rℓ in O(ℓm) time. In total it takes O(dm + ℓm) = O((d + 1/ε)/ε² · log n) time.

Although W approximates the eigenspace of A, it will be useful to analyze the error by also considering all of the data points lifted to Rm as the n × m matrix Z, and then its projection to Rℓ as Z̃ = ZW ∈ Rn×ℓ. Note that Z̃ would need an additional O(nℓ) space to store, and another pass over A (similarly for Nyström and RFF); we do not compute and store Z̃, we only analyze it.

2.1 Spectral Error Analysis

Let G = ΦΦᵀ be the exact kernel matrix in the RKHS. Let Ĝ = ZZᵀ be an approximate kernel matrix using Z ∈ Rn×m; it consists of mapping the n points to Rm using m RFFMaps. Then we consider G̃ = ZWWᵀZᵀ, the kernel matrix that could be constructed from the output W of Algorithm 1 using RFFMaps. We will ultimately show that ‖G − G̃‖₂ ≤ εn.
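Before turning to the analysis, the following is a minimal end-to-end sketch of the Algorithm 1 pipeline (RFFMaps feeding a FrequentDirections-style shrink) for a Gaussian kernel. It is illustrative rather than a faithful reimplementation of the chapter's experiments (which were run in Julia); the helper names and the simplified shrink step are assumptions.

```python
import numpy as np

def rff_maps(d, m, sigma=1.0, rng=None):
    # Random Fourier feature maps for the Gaussian kernel K(x,y) = exp(-||x-y||^2 / (2 sigma^2)).
    rng = rng if rng is not None else np.random.default_rng(0)
    R = rng.normal(scale=1.0 / sigma, size=(m, d))      # frequencies from the kernel's Fourier transform
    gamma = rng.uniform(0, 2 * np.pi, size=m)
    return lambda a: np.sqrt(2.0 / m) * np.cos(R @ a + gamma)   # z(a) in R^m

def skpca(stream, d, m, ell, sigma=1.0):
    # Streaming KPCA sketch: feed RFF vectors into a Frequent-Directions style sketch B (ell x m).
    z = rff_maps(d, m, sigma)
    B = np.zeros((ell, m))
    next_row = 0
    for a in stream:
        B[next_row] = z(a)
        next_row += 1
        if next_row == ell:                              # no zero rows left: shrink
            _, S, Wt = np.linalg.svd(B, full_matrices=False)
            S2 = np.maximum(S ** 2 - S[ell // 2] ** 2, 0.0)
            B = np.sqrt(S2)[:, None] * Wt
            next_row = ell // 2                          # the bottom half of the rows is now zero
    _, _, Wt = np.linalg.svd(B, full_matrices=False)
    return z, Wt.T                                       # feature map and W in R^{m x ell}

rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 5))
z, W = skpca(data, d=5, m=64, ell=8)
loading = W.T @ z(data[0])       # ell-dimensional representation of one point, in O(dm + ell*m) time
print(W.shape, loading.shape)
```

The key design point, matching the discussion above, is that the n × m matrix Z is never stored: each feature vector is consumed immediately by the ℓ × m sketch.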
The main technical challenge is that FD bounds are typically for the covariance matrix W T W not the gram matrix W W T , and thus new ideas are required. We show in Lemma A.1 in Appendix A.1 that with m = O((1/ε2 ) log(n/δ)) then kG − Ĝk2 ≤ εn with probability at least 1 − δ. Note this is a "for all" result which (see Lemma 2.3) is equivalent to, for all unit vectors x that |kΦT xk2 − kZ T xk2 | ≤ εn. If we loosen this to a "for each" result, where the above inequality holds for any one unit vector x (say we only want to test with a single vector xtest ), then we show in Appendix A.2 that this holds with m = O((1/ε2 ) log(1/δ)). This makes partial progress towards an open question [1] (can m be independent of n?) about creating oblivious subspace embeddings for Gaussian kernel features. Next, we show that applying the modified Frequent Directions step to Z does not asymptotically increase the error. To do so, we first show that spectrum of Z along directions that FrequentDirections fails to capture is small. We prove this for any n × m matrix A that is approximated as B ∈ R`×m by FrequentDirections. Lemma 2.1. Consider an A ∈ Rn×m matrix with m ≤ n, and let B be an ` × m matrix resulting from running Frequent Directions on A with ` rows. For any unit vector y ∈ Rn with ky T AB † Bk = 0, it holds that ky T Ak2 ≤ kA − Ak k2F /(` − k), for all k ≤ `, including k = 0 where A − Ak = A. Proof. Let [U, S, V ] = svd(A) be the svd of A. Consider any unit vector y ∈ Rn that lies in the column space of A and the null space of B, that is ky T AB † Bk = 0 and ky T AA† k = kyk = 1. Since U = [u1 , u2 , . . . , un ] provides an orthonormal basis for Rn , we can write y= n X i=1 αi ui such that αi = hy, ui i, n X αi2 = 1 i=1 Now to see that kBxk = 0, we will assume that kBxk > 0 and prove a contradiction. Since kBxk > 0, then x is not in the null space of B, and kπB (x)k > 0 for any unit vector x. Let Σ = diag(σ1 , . . . , σ` ), assuming σ1 ≥ σ2 ≥ . . . ≥ σ` > 0, are the singular values of B, and W = [w1 , . . . , w` ] ∈ Rmx` are its right singular vectors. Then kBxk = kΣW xk and if kπB (x)k > 0, then setting Σ̄ = diag(1, 1, . . . , 1) and B̄ = Σ̄W = W` to remove the scaling from B, we have kπB̄ (x)k > 0. Similarly, if ky T U SV T B † Bk = ky T πB (A)k = 0, then setting S̄ = diag(1, . . . , 1) and Ā = U S̄V T to remove scale from A, we have ky T πB (Ā)k = 0. Hence 0 < kπB̄ (x)k = kxB † Bk = kxW` W`T k =k m X̀ X̀ X hx, wj ik = k αi hvi , wj ik j=1 j=1 i=1 and 0 = ky T πB (Ā)k = k =k m X i=1 m X hy, ui iviT W` W`T k i=1 αi viT W` k = k m X̀ X j=1 i=1 αi hvi , wj ik. Since last terms of each line match, we have a contradiction, and hence kBxk = 0. Lemma 2.2. Let Z̃ = ZW , and G̃ = Z̃ Z̃ T = ZW W T Z T be the corresponding gram matrix from Z ∈ Rn×m and W ∈ Rm×` constructed via Algorithm 1 with ` = 2/ε. Comparing to Ĝ = ZZ T , then kG̃ − Ĝk2 ≤ εn. Proof. Consider any unit vector y ∈ Rn , and note that y T Z = [y T Z]W + [y T Z]⊥W where [y T Z]W = y T ZW W T lies on the column space spanned by W , and [y T Z]⊥W = y T Z(I −W W T ) is in the null space of W . Then first off ky T Zk2 = k[y T Z]W k2 +k[y T Z]⊥W k2 since two components are perpendicular to each other. Second [y T Z]W W = y T ZW W T W = y T ZW and [y T Z]⊥W W = y T Z(I −W W T )W = y T Z(W −W ) = 0. Knowing these two we can say Pn Since 1 = Pi=1 αi2 = kyk = ky T AA† k = m T ky T Um Um k = i=1 αi2P , therefore αi = 0 P for i > m. m m 2 2 Moreover ky T Ak2 = = P i=1 s2i αi2 . 
i=1 si hy, ui i m This implies there exists a unit vector x = i=1 αi vi ∈ Rm with αi = hx, vi i = hy, ui i for i = 1, · · · , m such that ky T Ak = kAxk and importantly kBxk = 0, which we will prove shortly. y T (ZZ T − Z̃ Z̃ T )y Then, due to the Frequent Directions bound [7], for any unit vector x̄ ∈ Rm , kAx̄k2 − kB x̄k2 ≤ kA − Ak k2F /(` − k), and for our particular choice of x with kBxk = 0, we obtain ky T Ak = kAxk2 ≤ kA − Ak k2F /(` − k), as desired. = k[y T Z]⊥W k2 . = (y T Z)(y T Z)T − (y T Z)W W T (y T Z)T = ky T Zk2 − [y T Z]W + [y T Z]⊥W W W T ([y T Z]W + [y T Z]⊥W )T = ky T Zk2 − (y T ZW )(y T ZW )T = k[y T Z]W k2 + k[y T Z]⊥W k2 − ky T ZW k2 The last inequality holds because ky T ZW k = ky T ZW W T k = k[y T Z]W k as W is an orthonormal matrix. 55 Streaming Kernel Principal Component Analysis To show k[y T Z]⊥W k ≤ εkZk2F , consider vector v = y T Z(I − W W T )Z † and let y ∗ = v/kvk. Clearly y ∗ satisfies requirement of Lemma 2.1 as it is a unit vector in Rn and ky ∗ ZW W T k = 0 as ∗ T T † T T ky ZW W k = ky Z(I − W W )Z ZW W k/kvk of G−G0 . Recall that one can write each eigenvalue as Λi,i = uTi (G − G0 )ui , and the definition of the Froben P nius norm implies kG − G0 k2F = Λ2i,i Hence i=1 kG − G0 k2F = = ky T Z(I − W W T )W W T k/kvk = 0. 2 T T T T k[y Z]⊥W k = ky Z(I − W W )k † 2 ∗ 2 2 = ky Z(I − W W )Z Zk = ky Zk kvk ≤ kZk2F kvk2 /` ≤ εnkvk2 . It is left to show that kvk ≤ T1. † For that, note πZW πZ (y T ) = πZW (y ZZ ) = y T ZZ † (ZW )(ZW )† = y T ZW (ZW )† = πZW (y T ). The finally we obtain i=1 0 Lemma 2.4. Given that kG−G √ k2 ≤ εn we can bound kG − G0k kF ≤ kG − Gk kF + ε kn. Proof. Let [u1 , . . . , un ] and [v1 , . . . , vn ] be eigenvectors of G and G − G0 , respectively. Then kG − G0k k2F = = ky T ZZ † − y T ZW W T Z † k2 = kπZ (y T ) − πZW (y T )k2 + = kπZ (y T ) − πZW (πZ (y T ))k2 ≤ Combining Lemmas A.1 and 2.2 and using triangle inequality, we get our main result. ≤ T Theorem 2.1. Let G = ΦΦ be the exact kernel matrix over n points. Let G̃ = ZW T W Z T be the result of Z from m = O((1/ε2 ) log(n/δ)) RFFMaps and W from running Algorithm 1 with ` = 4/ε. Then with probability at least 1 − δ, we have kG − G̃k2 ≤ εn. kxk=1 Proof. Recall we can rewrite spectral norm as kG − G0 k2 = max |xT Gx − xT G0 x| = max |xT ΦΦT x − kxk=1 kxk=1 xT Y Y T x| = max |kΦT xk2 − kY T xk2 |. First line folkxk=1 lows by definition of top eigenvalue of a symmetric matrix, and last line is true because kyk2 = y T y for any vector y. Thus if kG − G0 k2 ≤ εn where G0 = Y Y T could be reconstructed by any of the algorithms we consider, then it implies maxkxk=1 |kΦT xk2 − kY T xk2 | ≤ εn. We can now generalize the spectral norm bound to Frobenius norm. Let G−G0 = U ΛU T be the eigen decomposition n X k X (viT (G − G0k )vi )2 i=1 (viT (G − G0k )vi )2 i=k+1 ≤ kπZ (y T )k2 ≤ kyk2 = 1. Lemma 2.3. kG − G0 k2 = max |kΦT xk2 − kY T xk2 |. i=1 Therefore kG−G0 kF ≤ εn1.5 . We can also show a more interesting bound by considering Gk and G0k , the best rank k approximations of G and G0 respectively. kvk2 = ky T Z(I − W W T )Z † k2 2.2 Frobenius Error Analysis Let the true gram matrix be G = ΦΦT , and consider G0 = Y Y T , for any Y including when Y = ZW . First we write the bound in terms of Φ and Y . i=1 (uTi (G − G0 )ui )2 n n X X = (kΦT ui k2 − kY T ui k2 )2 ≤ (εn)2 ≤ ε2 n3 . Therefore it satisfies ky ∗ Zk ≤ kZ −Zk k2F /(`−k). 
Since kZk2F ≤ 2n for k = 0 and ` = 2/ε, we obtain T n X = k n X X (viT (G − G0k )vi )2 + (viT Gvi )2 i=1 i=k+1 k X i=1 (viT (ΦΦT − Y Y T )vi )2 + n X (uTi Gui )2 i=k+1 k X (kΦT vi k2 − kY T vi k2 )2 + kG − Gk k2F i=1 ≤ k(εn)2 + kG − Gk k2F . The second transition is true because G0 is positive semidefinite, therefore viT (G − G0k )vi ≤ viT Gvi , and third transition holds because if ui is ith eigenvector of G then, uTi Gui ≥ viT Gvi where vi is ith eigenvector of G − G0 . Taking square root yields √ kG − G0k k2F ≤ kG − G0k kF + εn k. Thus we can get error bounds for the best rank-k approximation of the data in RKHS that depends on "tail" kG − Gk kF which is typically small. We can √ also make the second √ term εn k equal to ε0 n by using a value of ε = ε0 / k in the previously described algorithms. 3 Experiments We measure the Space, Train time, and Test time of our SKPCA algorithms with ` taking values {2, 5, 10, 20, 30, 50}. We use spectral and Forbenious 56 Mina Ghashami, Daniel J. Perry, Jeff M. Phillips 10-4 101 102 Train time (sec) Kernel Spectral Error Kernel Frobenius Error 103 10-1 10-2 101 103 sample size 102 101 100 10-1 103 sample size 102 200 400 600 800 1000 1200 1400 sample size 10-4 101 102 103 104 space 105 106 10-1 Test time (sec) Kernel Spectral Error Kernel Frobenius Error 102 10-2 101 102 103 104 space 105 106 101 100 10-1 10-2 200 400 600 800 1000 1200 1400 sample size RNCA Nystrom SKPCA (2) SKPCA (5) SKPCA (10) SKPCA (20) SKPCA (30) SKPCA (50) Figure 1: RandomNoisy dataset showing Error and Test and Train time versus Sample Size (m or c) and Space. error measures and compare against the Nyström and the RNCA approach using RFFMaps. All methods are implemented in Julia, run on an OpenSUSE 13.1 machine with 80 Intel(R) Xeon(R) 2.40GHz CPU and 750 GB of RAM. Data sets. We run experiments on several real (CPU, Adult, Forest) and synthetic (RandomNoisy) data sets. Each data set is an n × d matrix A (CPU is 7373 × 21, Forest is 523910 × 54, Adult is 33561 × 123, and RandomNoisy is 20000 × 1000) with n data points and d attributes. For each a random subset is removed as the test set of size 1000, except CPU where the test set size is 800. We generate the RandomNoisy synthetic dataset using the approach by [18]. We create A = SDU + F/ζ, where SDU is an s-dimensional signal matrix (for s < d) and F/ζ is (full) d-dimensional noise with ζ controlling the signal to noise ratio. Each entry Fi,j of F is generated i.i.d. from a normal distribution N (0, 1), and we use ζ = 10. For the signal matrix, S ∈ Rn×s again we generate each Si,j ∼ N (0, 1) i.i.d; D is diagonal with entries Di,i = 1 − (i − 1)/d linearly decreasing; and U ∈ Rs×d is just a random rotation. We set s = 50. Error measures. We consider two error measures comparing the true gram matrix G and an approximated gram matrix (constructed in various ways). Kernel Spectral Error = kG − G0 k2 /n represents the worst case error. Kernel Frobenius Error = kG − G0 kF /n2 represents the global error. We normalized the error measures by 1/n and 1/n2 , respectively, so they are comparable across data sets. These measures require another pass on the data to compute, but give a more holistic view of how accurate our approaches are. We measure the Space requirements of each algorithm as follows. SKPCA sketch has space md + m`, Nyström is c2 + cd, and RNCA is m2 + md, where m is the number of RFFMaps, and c is the number of samples in Nyström. In our experiments, we set m and c similarly, calling these parameters Sample Size. 
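As a concrete reference for the setup just described, here is a minimal sketch of the RandomNoisy construction and the two normalized error measures; the function names are illustrative and the sizes are scaled down.

```python
import numpy as np

def random_noisy(n=20000, d=1000, s=50, zeta=10.0, rng=None):
    # RandomNoisy data: rank-s signal S D U plus full-dimensional noise F / zeta.
    rng = rng if rng is not None else np.random.default_rng(0)
    S = rng.normal(size=(n, s))
    D = np.diag(1.0 - np.arange(s) / d)               # linearly decreasing signal strengths
    U = np.linalg.qr(rng.normal(size=(d, s)))[0].T    # s x d matrix with orthonormal rows (random rotation)
    F = rng.normal(size=(n, d))
    return S @ D @ U + F / zeta

def kernel_errors(G, G_approx):
    # Normalized error measures used in the experiments: spectral (worst case) and Frobenius (global).
    n = G.shape[0]
    diff = G - G_approx
    return np.linalg.norm(diff, 2) / n, np.linalg.norm(diff, 'fro') / n ** 2

A = random_noisy(n=2000, d=100, s=20)                 # smaller sizes for a quick check
print(A.shape)
```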
Note that Sample Size and Space usage are different: both RNCA and Nyström have Space quadratic in Sample Size, while for SKPCA it is linear. Results. Figures 1, 2, and 3 show log-log plots of results for Random Noisy, CPU, and Adult datasets. See also Figure 4 for Forest in the Appendix. For small Sample Size we observe that Nyström performs quite well under all error measures, corroborating results reported by Lopez et al. [19]. However, all methods have a very small error range, typically less than 0.01. For Kernel Frobenius Error we typically observe a cross-over point where RNCA and often most versions of SKPCA have better error for that size. Under Kernel Spectral Error we often see a crossover point for SKPCA, but not for RNCA. We suspect that this is related to how FD only maintains the most dominate directions while ignoring other (potentially spurious) directions introduced by the RFFMaps. In general, SKPCA has as good or better error than RNCA for the same size, with smaller size being required with smaller ` values. This difference is more pronounced in Space than Sample Size, where our theoretical results expect a polynomial advantage. In timing experiments, especially Train time we see SKPCA has a very large advantage. As a function of Sample Size RNCA is slowest for the Train time, and Nyström is slowest for Test time by several 57 10-4 102 103 104 space 10-2 105 101 102 101 100 103 sample size 10-1 10-2 101 106 102 10-1 103 10-4 101 10-1 200 400 600 800 1000 1200 1400 sample size Test time (sec) 102 sample size Kernel Spectral Error Kernel Frobenius Error 101 Train time (sec) Kernel Spectral Error Kernel Frobenius Error Streaming Kernel Principal Component Analysis 102 103 104 space 105 100 10-1 10-2 106 200 400 600 800 1000 1200 1400 sample size 10-4 102 10-1 10-2 sample size 101 100 101 103 102 103 sample size 200 400 600 800 1000 1200 1400 sample size 10-4 101 102 103 104 space 105 106 Test time (sec) 101 Kernel Spectral Error Kernel Frobenius Error 101 Train time (sec) Kernel Spectral Error Nystrom SKPCA (2) SKPCA (5) SKPCA (10) SKPCA (20) SKPCA (30) SKPCA (50) Figure 2: CPU dataset showing Error and Test and Train time versus Sample Size (m or c) and Space. Kernel Frobenius Error RNCA 10-1 10-2 101 102 103 104 space 105 106 100 10-1 10-2 200 400 600 800 1000 1200 1400 sample size RNCA Nystrom SKPCA (2) SKPCA (5) SKPCA (10) SKPCA (20) SKPCA (30) SKPCA (50) Figure 3: Adult dataset showing Error and Test and Train time versus Sample Size (m or c) and Space. orders of magnitude. In both cases all versions of SKPCA are among the fastest algorithms. For the Train time results, RNCA's slow time is dominated by summing n outer products, of dimensions m × m. This is avoided in SKPCA by only keeping the top ` dimensions, and only requiring similar computation on the order of ` × m, where typically ` m. Nyström only updates the c × c gram matrix when a new point replaces an old one, expected c log n times. Nyström is comparatively very slow in Test time. It computes a new row and column of the gram matrix, and projects onto this space, taking O(cd + c2 ) time. Both RNCA and SKPCA avoid this by directly computing an m dimensional representation of a test data point in O(dm) time. Recall we precompute the eigen- structure for RNCA and Nyström, whereas SKPCA maintains it at all times, so if this step were counted, SKPCA's advantage here would be even larger. Summary. Our proposed method SKPCA has superior timing and error results to RNCA, by sketching in the kernel feature space. 
Its error is typically a bit worse than a Nyström approach, but the difference is quite small, and SKPCA is far superior to Nyström in Test time, needed for any data analysis. 58 Mina Ghashami, Daniel J. Perry, Jeff M. Phillips References [1] Haim Avron, Huy L. Nguyen, and David P. Woodruff. Subspace embeddings for the polynomial kernel. In NIPS, 2014. [2] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of 20th ACM-SIAM Symposium on Discrete Algorithms, 2009. [3] Kenneth L Clarkson and David P Woodruff. Low rank approximation and regression in input sparsity time. In Proceedings of the 45th Annual ACM symposium on Theory of computing, 2013. [4] Petros Drineas, Ravi Kannan, and Michael W Mahoney. Fast monte carlo algorithms for matrices ii: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158- 183, 2006. [5] Petros Drineas and Michael W Mahoney. On the nyström method for approximating a gram matrix for improved kernel-based learning. The Journal of Machine Learning Research, 6:2153-2175, 2005. [6] Mina Ghashami, Amey Desai, and Jeff M. Phillips. Improved practical matrix sketching with guarantees. In Proceedings 22nd Annual European Symposium on Algorithms, 2014. [7] Mina Ghashami and Jeff M. Phillips. Relative errors for deterministic low-rank matrix approximations. In SODA, pages 707-717, 2014. [8] Alex Gittens and Michael W Mahoney. Revisiting the nystrom method for improved large-scale machine learning. arXiv preprint arXiv:1303.1849, 2013. [9] Peter M Hall, A David Marshall, and Ralph R Martin. Incremental eigenanalysis for classification. In BMVC, volume 98, pages 286-295, 1998. [10] Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. arXiv preprint arXiv:1312.4626, 2013. [11] Ian Jolliffe. Principal component analysis. Wiley Online Library, 2005. [12] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. arXiv preprint arXiv:1201.6530, 2012. [13] Shosuke Kimura, Seiichi Ozawa, and Shigeo Abe. Incremental kernel pca for online learning of feature space. In Computational Intelligence for Modelling, Control and Automation, 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on, volume 1, pages 595-600. IEEE, 2005. [14] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the nyström method. The Journal of Machine Learning Research, 13(1):981-1006, 2012. [15] Dima Kuzmin and Manfred K Warmuth. Online kernel pca with entropic matrix updates. In Proceedings of the 24th international conference on Machine learning, pages 465-472. ACM, 2007. [16] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood-approximating kernel expansions in loglinear time. In Proceedings of the international conference on machine learning, 2013. [17] Fuxin Li, Catalin Ionescu, and Cristian Sminchisescu. Random fourier approximations for skewed multiplicative histogram kernels. In Pattern Recognition, pages 262-271. Springer, 2010. [18] Edo Liberty. Simple and deterministic matrix sketching. In KDD, pages 581-588, 2013. [19] David Lopez-Paz, Suvrit Sra, Alex Smola, Zoubin Ghahramani, and Bernhard Schölkopf. Randomized nonlinear component analysis. arXiv preprint arXiv:1402.0119, 2014. [20] Michael W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3, 2011. 
[21] Subhransu Maji and Alexander C Berg. Maxmargin additive classifiers for detection. In Computer Vision, 2009 IEEE 12th International Conference on, pages 40-47. IEEE, 2009. [22] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177-1184, 2007. [23] Tamás Sarlós. Improved approximation algorithms for large matrices via random projections. In FOCS, pages 143-152, 2006. [24] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In Artificial Neural Networks- ICANN'97, pages 583-588. Springer, 1997. 59 Streaming Kernel Principal Component Analysis [25] Ameet Talwalkar and Afshin Rostamizadeh. Matrix coherence and the nystrom method. arXiv preprint arXiv:1004.2008, 2010. [26] Christopher Williams and Matthias Seeger. Using the nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, number EPFL-CONF-161322, pages 682-688, 2001. [27] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10:1-157, 2014. . 60 Mina Ghashami, Daniel J. Perry, Jeff M. Phillips APPENDIX A σ2 = k i=1 E[Ei2 ] High Probability Bound for RFFMaps Pm =E =E Here we extend the analysis of Lopez et al. [19] to show that the Fourier Random Features of Rahimi and Recht [22] approximate the spectral error with their approximate Gram matrix within εn with high probability. A.1 High Probability Bound "for all" Bound for RFFMaps In our proof we use the Bernstein inequality on sum of zero-mean random matrices. Matrix Bernstein Inequality: Let X1 , · · · , Xd ∈ Rn×n be independent random matrices such that for all 1 ≤ i ≤ d, E[Xi ] = 0 and kXi k2 ≤ R for a fixed constant R. Define P variance parameter as σ 2 = Pd d T max{k i=1 hE[Xi Xi ]k, k i=1iE[Xi XiT ]k}. Then for Pd 2 all t ≥ 0, Pr ≥ t ≤ 2n · exp 3σ2−t i=1 Xi +2Rt . 2 Using this inequality, [19] bounded E[kG − Ĝk2 ]. Here we employ similar ideas to improve this to a bound on kG − Ĝk2 with high probability. Lemma A.1. For n points, let G = ΦΦT ∈ Rn×n be the exact gram matrix, and let Ĝ = ZZ T ∈ Rn×n be the approximate kernel matrix using m = O((1/ε2 ) log(n/δ)) RFFMaps. Then kG − Ĝk ≤ εn with probability at least 1 − δ. Proof. Consider m independent random variables 1 1 Ei = m G − zi ziT . Note that E[Ei ] = m G − E[zi ziT ] = n×n 0 [22]. Next we can rewrite 1 kEi k2 = G − zi ziT m 2 1 = E[ZZ T ] − zi ziT m 2 and thus bound 1 1 kE[ZZ T ]k2 + kzi ziT k2 ≤ E[kZk22 ] + kzi k2 m m 2n 2n 4n ≤ + = m m m kEi k2 ≤ The first inequality is correct because of triangle inequality, and second inequality is achieved using Jensen's inequality on expected values, which states kE[X]k ≤ E[kXk] for any random variable X. Last inequality uses the bound on the norm of zi as kzi k2 ≤ 2n 2 2 m , and therefore kZk2 ≤ kZkF ≤ 2n. To bound σ 2 , due to symmetry of matrices Ei , simply E[Ei2 ]k2 . Expanding " 1 G − zi ziT m 2 # 1 G2 2 T T T + kz k z z − (z z G + Gz z ) i i i i i i i m2 m it follows that 2n G2 1 + E[zi ziT ] − (E[zi ziT ]G + G E[zi ziT ]) m2 m m 1 1 = 2 (G2 + 2nG − 2G2 ) = 2 (2nG − G2 ) m m E[Ei2 ] ≤ The first inequality holds by kzi k2 ≤ 2n/m, and sec1 G. Therefore ond inequality is due to E[zi ziT ] = m σ2 = m X i=1 ≤ E[Ei2 ] 2 ≤ 1 (2n G − G2 ) m 2 2 2n 1 2n 1 3n2 kGk2 + kG2 k2 ≤ + kGk22 ≤ m m m m m the second inequality is by triangle inequality, and the last 2 ≤ Tr(G) = n. 
Setting M = Pm inequality Pmby kGk 1 T i=1 Ei = i=1 ( m G − z:,i z:,i ) = G − Ĝ and using Bernstein inequality with t = εn we obtain ! h i −(εn)2 Pr kG − Ĝk2 ≥ εn ≤ 2n exp 2 4n 3( 3n m ) + 2( m )εn 2 −ε m ≤δ = 2n exp 9 + 8ε Solving for m we get m ≥ 9+8ε ε2 log(2n/δ), so with probability at least 1 − δ for m = O( ε12 log(n/δ)), then kG − Ĝk2 ≤ εn. A.2 For Each Bound for RFFMaps Here we bound kΦT xk2 − kZ T xk2 , where Φ and Z are mappings of data to RKHS by RFFMaps, respectively and x is a fixed unit vector in Rn . Note that Lemma A.1 essentially already gave a stronger proof, where using m = O((1/ε2 ) log(n/δ)) the bound kG − Ĝk2 ≤ εn holds along all directions (which makes progress towards addressing an open question of constructing oblivious subspace embeddings for Gaussian kernel features spaces, in [1]). The advantage of this proof is that the bound on m will be independent of n. Unfortunately, in this proof, going from the "for each" bound to the stronger "for all" bound would seem to require a net of size 2O(n) and a union bound resulting in a worse "for all" bound with m = O(n/ε2 ). On the other hand, main objective of Test time procedure, which is mapping a single data point to the 61 10-4 101 102 Train time (sec) 104 Kernel Spectral Error Kernel Frobenius Error Streaming Kernel Principal Component Analysis 10-1 10-2 101 103 sample size 102 102 101 100 103 sample size 103 200 400 600 800 1000 1200 1400 sample size 10-4 101 102 103 104 space 105 10-1 Test time (sec) Kernel Spectral Error Kernel Frobenius Error 101 10-2 10-1 10-2 101 106 100 102 103 104 space 105 106 200 400 600 800 1000 1200 1400 sample size RNCA Nystrom SKPCA (2) SKPCA (5) SKPCA (10) SKPCA (20) SKPCA (30) SKPCA (50) Figure 4: Results for Forest dataset. Row 1: Kernel Frobenius Error (left), Kernel Spectral Error (middle) and Train Time (right) vs. Sample size. Row 2: Kernel Frobenius Error (left), Kernel Spectral Error (middle) vs. Space, and Test Time vs. Sample size (right) D-dimensional or k-dimensional kernel space is already interesting for what the error is expected to be for a single vector x. This scenario corresponds to the "for each" setting that we will prove in this section. In our proof, we use a variant of Chernoff-Hoeffding inequality, stated next. Consider a set of r independent random variables {X1 , · · · , Xr } where 0 ≤ Xi ≤ Pr ∆. Let M = i=1 Xi , then for any α ∈ (0, 1/2), Pr [|M − E[M ]| > α] ≤ 2 exp 2 −2α r∆2 . For this proof we are more careful with notation about rows and column vectors. Now matrix Z ∈ Rn×m can be written as a set rows [z1,: ; z2,: ; . . . , zn,: ] where each zi,: is a vector of length m or a set of columns [z:,1 , z:,2 , . . . , z:,d ], where each z:,j is a vector of length n. We denote the (i, j)-th entry of this matrix as zi,j . Theorem A.1. For n points in any arbitrary dimension and a shift-invariant kernel, let G = ΦΦT ∈ Rn×n be the exact gram matrix, and Ĝ = ZZ T ∈ Rn×n be the approximate kernel matrix using m = O((1/ε2 ) log(1/δ)) RFFMaps. Then for any fixed unit vector x ∈ Rn , it holds that kΦT xk2 − kZ T xk2 ≤ εn with probability at least 1−δ. i=1 i=1 = m X i=1 = n X = n X j=1 = n X j=1 = n X j=1 = D X i=1 = D X i=1 j=1 n n X n X X 2 E (zji xj ) + 2 zji zki xj xk j=1 x2j j=1 i=1 Proof. Note Rn is not the dimension of data. Consider any unit vector x ∈ Rn . Define m independent random variables {Xi = hz:,i , xi2 }m i=1 . 
We can bound each Xi as 0 ≤ Xi ≤ kz:,i k2 ≤ 2n/m therefore ∆ = 2n/m for Pm Xi = kZ T xk2 , we observe m m n X X X 2 2 E[M ] = E hz:,i , xi = E ( zji xj ) all Xi s. Setting M = E " m X i=1 2 zji # j=1 k>j +2 n n X X x2j E [hzj,: , zj,: i] + 2 x2j hφj,: , φj,: i + 2 x2j D X i=1 φ2ji + 2 xj xk E j=1 k>j n X n X j=1 k>j n n X X j=1 k>j n X n X m X zji zki i=1 # xj xk E [hzj,: , zk,: i] xj xk hφj,: , φk,: i xj xk j=1 k>j " D X φji φki i=1 n n X n X X 2 2 xj φji + 2 xj xk φji φki j=1 j=1 k>j hφ:,i , xi2 = kΦT xk2 Since x is a fixed unit vector, it is pulled out of all expectations. Using the Chernoff-Hoeffding bound and setting α = εn yields Pr |kΦT xk2 − kZ T xk2 | > εn ≤ −2(εn)2 2 exp m(2n/m) = 2 exp −2ε2 m ≤ δ. Then we 2 solve for m = (1/(2ε2 )) ln(2/δ) in the last inequality. CHAPTER 5 DETERMINISTIC COLUMN SUBSET SELECTION This chapter presents a collection of important results in the area of deterministic column subset selection. A large portion of this chapter consists of parts from, Springer Lecture Notes in Computer Science, Augmented Leverage Score Sampling with Bounds, v. 9852, 2016, pp. 543-558, Daniel J. Perry and Ross T. Whitaker, copyright Springer International Publishing AG 2016, with permission of Springer and originally presented at the 2016 Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD) in September 2016. The novel theoretical analysis presented here was derived later in close collaboration with Aditya Bhaskara.1 1 Aditya Bhaskara, bhaskara@cs.utah.edu, School of Computing, University of Utah 63 5.1 Introduction In many situations within machine learning and data analysis, it is desirable to represent a dataset using either a subset of the features, to ease interpretation, or a subset of the data, to identify a representative coreset of the data, which can be thought of as column (or row) subset selection of the data matrix. For example, many kernelized machine learning methods can be more quickly approximated using a Nyström method, which requires the selection of a coreset [1]. There has been substantial interest in selecting an optimal subset of columns for a data matrix [2, 3, 4, 5]. To be precise, we are interested in the column subset selection problem (CSSP) recently examined in [3]: Definition 1. Column subset selection problem. Let A ∈ Rd×n and let m < n be a sampling parameter. Find m columns for A - denoted as B ∈ Rd×m that minimize k A − BB† Akη , (5.1) for η ∈ { F, 2}, and where B† denotes the Moore-Penrose pseudo-inverse. Previous work has investigated the use of leverage scores, defined below, for use in selecting the column samples using both deterministic and random methods [2]. We use the definition provided in [3], Definition 2. Leverage scores. Let Vk ∈ Rn× x contain the top k right singular vectors of a d × n matrix A with rank ρ = rank( A) ≥ k. Then the (rank-k) leverage score of the i-th column of A is defined as (k) li = k[Vk ]i,: k22 , i = 1, 2, . . . , n. (5.2) Here, [Vk ]i,: denotes the i-th row of Vk . Leverage scores have a long tradition of being very useful in both deterministic and random solutions. However, the SVD produces other useful information and one could ask whether taking advantage of the other components in the SVD could improve the subsample selection. 
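For concreteness, a minimal computation of the rank-k leverage scores of Definition 2 (a sketch; the function name is illustrative):

```python
import numpy as np

def leverage_scores(A, k):
    # Rank-k leverage scores (Definition 2): squared row norms of the top-k right singular vectors.
    # A is d x n, so the right singular vectors index the n columns of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k].T                      # n x k
    return (Vk ** 2).sum(axis=1)       # l_i = ||[V_k]_{i,:}||_2^2

A = np.random.default_rng(0).normal(size=(10, 200))   # d = 10 features, n = 200 columns
l = leverage_scores(A, k=3)
print(l.shape, l.sum())                # the scores sum to k
```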
Consider this alternative formulation of the CSSP objective, where we make the assumption that the matrix BB T is invertible (only for this specific discussion): 64 k A − BB† Akη (5.3) k A − B( BT ( BBT )−1 ) Akη (5.4) kUΣV T − Wd WdT UΣV T kη (5.5) k(U − Wd WdT U )ΣV T kη (5.6) where A = UΣV T and B = WΨH T are the respective SVDs. In other words, this objective encodes the idea of selecting a B such that the columns space W aligns with the column space of U (corresponding to full A), and the projection of the data points ΣV T onto that basis. In this form, it becomes more apparent why the V T is important in selecting a subset of columns. The standard leverage scores are drawn from the rows of V = A T UΣ† , which are the data projected onto the principal components and rescaled ("whitened") according to the singular values ("variances") so that the covariance is identity. The leverage score is essentially the distance from the origin in the k-subspace spanned by Vk . However, the above objective function dictates the "unscaled" or "unwhitened" data points ΣV T , not V T , are projected, which indicates an "unscaled" leverage score would be more appropriate for CSSP. This intuition informs the core premise of this paper - to propose an augmented leverage score, Definition 3. Augmented leverage scores. Let Yk = Vk Σk ∈ Rn×k contain the top k singular values multiplied with the right singular vectors of a d × n matrix A with rank ρ = rank( A) ≥ k. Then the (rank-k) augmented leverage score of the i-th column of A is defined as (k) lˆi = k[Yk ]i,: k22 , i = 1, 2, . . . , n. (5.7) We show that by using this augmentation, the empirical performance for both deterministic and probabilistic sampling strategies substantially improves. We additionally present a novel error analysis of any deterministic sampling that ignores the tail-error in the column selection algorithm, which generally includes deterministic leverage sampling. 65 5.2 Related work As described in Section 5.1, leverage score sampling has a long tradition and recently there as been substantial interest in analyzing and extending it [6, 7]. Specifically, [2] presents a very thorough examination of CSSP in general, including several proofs to help in understanding and analyzing various approaches to CSSP. [3] shows a deterministic error bound for the traditional leverage score sampling with the following bound: k A − BB† Ak2ζ < 1 · k A − Ak k2ζ 1−e (5.8) for ζ ∈ {2, F }, where Ak is the best rank-k approximation to A. Here, instead, we propose making a very simple modification to leverage scores that substantially improves the quality of the column subsample found, with regard to the CSSP error shown in (5.1). We further empirically compare the proposed augmented leverage score sampling to traditional leverage score sampling in both a deterministic and probabilistic setting. Related work also includes CUR and Nyström approximation techniques that are not specifically targeted at solving CSSP, but are performing related subsampling of datasets with error bound guarantees. CUR [4] is an approximation technique that first computes a column subsample C of A, then a row subsample R of A, and then an interpolative matrix U that minimizes the approximation error k A − CURk2,F . However, as described in [4], the column subsample step uses a leverage sample, and so by comparing to leverage sampling directly, we gain some idea of how the proposed method compares to CUR. 
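A minimal sketch of the augmented scores of Definition 3 together with the CSSP error of (5.1) that a selected column subset is evaluated against; the function names are illustrative.

```python
import numpy as np

def augmented_leverage_scores(A, k):
    # Augmented rank-k leverage scores (Definition 3): squared row norms of Y_k = V_k Sigma_k.
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    Yk = Vt[:k].T * s[:k]              # n x k, each column scaled by its singular value
    return (Yk ** 2).sum(axis=1)

def cssp_error(A, cols, norm_ord='fro'):
    # CSSP objective ||A - B B^+ A|| for a chosen set of column indices (eq. 5.1).
    B = A[:, cols]
    proj = B @ np.linalg.pinv(B) @ A
    return np.linalg.norm(A - proj, norm_ord)

A = np.random.default_rng(1).normal(size=(10, 200))
scores = augmented_leverage_scores(A, k=3)
top = np.argsort(scores)[::-1][:5]     # deterministically keep the 5 highest-scoring columns
print(cssp_error(A, top), cssp_error(A, np.arange(5)))
```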
The Nyström method [1, 8] is a matrix approximation technique where a column subsample B of A is found such that k A − BUB T k2,F is minimized. [9] obtains a direct relationship between Nyström and CSSP by formulating the Nyström approximation as a CSSP by setting B = A1/2 S. However, all the subsampling approaches discussed use either a naive subsampling (uniformly random) or a leverage score sampling, so a comparison of our method to leverage score and uniform random sampling is sufficient. Our new analysis shows that a general bound on deterministic leverage score sampling is not possible. The bound presented for deterministic leverage sampling in [3] was for a very specific case with leverage scores following a precise power-law distribution. The analysis presented here shows why such assumptions are necessary to obtain such a 66 bound. 5.3 Deterministic column sampling In the following formulation and analysis, we make use of the singular value decomposition (SVD) of A ∈ Rd×n , A = UΣV T , with the left singular vectors U T U = UU T = I, U ∈ Rd×d , the right singular vectors V T V = VV T = I, V ∈ Rn×n , and the singular value matrix Σ ∈ Rd×n a diagonal matrix with entries Σii = σi ( A) being the singular values of A, sorted so that σ1 ( A) ≥ σ2 ( A) ≥ . . .. The rank-k approximation Ak = AVk VkT = Uk Σk VkT ∈ Rd×n , with Uk ∈ Rd×k , Σk ∈ Rk×k , V ∈ Rn×k is the optimal rank-k matrix approximation under the Frobenius norm of the difference, k A − Ak k2F = k∆k k2F = ∑ik=1 σi (∆k ) and under the spectral norm k∆k k22 = σ1 (∆k ). We will use the shorthand, σ̂i2 ( Ak ) = σi2 ( Ak ) , σk2 ( Ak ) which is a scaled (unitless) singular value. Note also that for a symmetric matrix M = A T A = UΣ2 U T , the eigenvalues λi ( M) correspond to the singular values, so that λi ( M) = σi2 ( M). Another shorthand, Π B ( A) = BB† A, is the projection of A onto the column space of B. Finally, ei is the i-axis aligned unit vector. Algorithm 1 AugmentedLeverageScoreSampler( A, k, θ ) Input A ∈ Rd×n , k, θ Compute Yk = Vk Σ̂k ∈ Rn×k (top k singular values normalized so that Σii = σ̂i (Σk ) and right singular vectors of A multiplied together) for i = 1, 2, . . . , n (k) lˆi = k[Yk ]i,: k22 end for (k) Let lˆi 's be sorted: (k) (k) (k) (k) lˆ ≥ . . . ≥ lˆ ≥ lˆ ≥ . . . ≥ lˆn (5.9) 1 i +1 i Find index c ∈ {1, . . . , n} such that: m = arg min m m ∑ i =1 (k) lˆi ! >θ . (5.10) If m < k, set m = k. Output S ∈ Rn×m , s.t. AS has the top m columns of A. The corresponding probabilistic algorithm for augmented leverage scores samples points randomly according to the probability pi = lˆi , ∑ j lˆi instead of sorting. 67 5.4 Experiments 5.4.1 Algorithms For the deterministic subsampling problem, we compare the proposed deterministic augmented leverage score sampling method (aug-det) from Algorithm 1 with the traditional deterministic leverage score sampling method (lev-det) from [3] and the QR-pivotbased sampling method (qr) from [10]. QR-pivot-based sampling is done by computing the QR decomposition with pivoting and then using the resulting pivot matrix for column selection as described in [11]. The leverage score methods are described in more detail in Section 5.2. In the probabilistic case, we compare the proposed probabilistic augmented leverage score sampling method (aug-prob) (similar to Algorithm 1 as described in Section 5.3) with the traditional probabilistic leverage score sampling method (lev-prob) from [3], and the ubiquitous uniformly random sampling (unif). 
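To tie the pieces together, here is a minimal sketch of the proposed aug-det sampler (Algorithm 1) and its probabilistic counterpart aug-prob; normalizing the cumulative score so that θ acts as an energy fraction is an interpretation of the pseudocode, and all names are illustrative.

```python
import numpy as np

def augmented_scores(A, k):
    # Scores from Algorithm 1: row norms of Y_k = V_k * sigma_hat_k, with sigma_hat_i = sigma_i / sigma_k.
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_hat = s[:k] / s[k - 1]                       # unitless (rescaled) singular values
    Yk = Vt[:k].T * s_hat
    return (Yk ** 2).sum(axis=1)

def aug_det(A, k, theta):
    # Deterministic sampling: take the highest-scoring columns until their cumulative
    # (normalized) score exceeds theta; keep at least k columns.
    scores = augmented_scores(A, k)
    order = np.argsort(scores)[::-1]
    frac = np.cumsum(scores[order]) / scores.sum()
    m = max(int(np.searchsorted(frac, theta) + 1), k)
    return order[:m]

def aug_prob(A, k, m, rng=None):
    # Probabilistic variant: sample m columns with probability proportional to the scores.
    rng = rng if rng is not None else np.random.default_rng(0)
    scores = augmented_scores(A, k)
    return rng.choice(len(scores), size=m, replace=False, p=scores / scores.sum())

A = np.random.default_rng(2).normal(size=(10, 200))
print(aug_det(A, k=3, theta=0.9))
print(aug_prob(A, k=3, m=8))
```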
The algorithms were compared on the basis of the CSSP projection space error, kA − BB† Akη for both the Frobenius norm (η = F) and spectral norm (η = 2), as well as the projection error sum over all the data points, which is ∑in=1 kAi − BB† Ai k2 , where B is selected using the corresponding algorithm. 5.4.2 Datasets We compared the performance of the methods using synthetic datasets, with specific spectral properties, and several real datasets. We use two small synthetic datasets to examine how the spectral makeup of the dataset affects the results. The first dataset we sampled from a multidimensional normal distribution with a spectrum, s L (i ), which decays linearly from d = 10, s L (i ) = d − i, which was then rotated to take it off axis. The second synthetic dataset was chosen to see the effect of a power-law decay on the spectrum. As described in [3], this model is of interest because many datasets will have a spectrum that decays quickly. The second was constructed similarly to the first, but a power-law decay was used, s P (i ) = d , i 1+ η with η = 0.5. The sizes of both datasets are summarized in Table 5.1, and the spectrum of each is shown in Figure 5.1, in the top left and bottom left plots. As both leverage score sampling techniques rely on the selection of a rank-k subspace, we selected the SVD subspace of size k that 68 captures 90% of the spectral energy and report the specific k chosen in Table 5.1. We chose a diverse set of real datasets to provide insight into how the algorithms compare on real data. We summarize the size and dimensionality of each dataset used in Table 5.1, and show the corresponding spectrum plots along the left column of Figure 5.2. The first dataset, diabetes, is a small dataset of various health measurements, with the goal of finding some relationship to diabetes used in [12], which has some spectral properties somewhat similar to the power-law decaying synthetic dataset. The second dataset, enron, is an email social graph derived from the popular Enron public email dataset from [13, 14]. This dataset exhibits a mostly linear decay in the spectrum. The forest cover, adult, and census dataset are all taken from the UCI data repository [15], and were chosen because of their use in other related machine learning and data analysis tasks, which helps in comparison to other (future) algorithms. 5.4.3 Synthetic The leverage and augmented leverage scores for the synthetic datasets are shown in the middle column of Figure 5.1, and a visualization of the resulting samples selected using the deterministic approach is shown in the far right column. In the center column plots, the leverage scores have been normalized so that the largest value is 1.0. The scores have also been sorted in descending order to make the visual comparison easier. This view of the scores makes the effect of the augmentation on the scores more obvious - for any point whose "magnitude energy" is in the directions with dominant spectral values, the value is preserved or increased, while for any point with "magnitude energy" in the directions corresponding to smaller spectral values, the score is decreased. This contrast in point selection is illustrated in the 2D principal component analysis (PCA) visualization in the right column of Figure 5.1. 
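As a rough sketch of how such an experiment can be assembled (my own illustration; the exact scalings used in the paper may differ, and the power-law spectrum below, $s_P(i) = d / i^{1+\eta}$, is my reading of the garbled formula), the following generates a rotated Gaussian dataset with a prescribed spectral decay and evaluates the three error measures described above for a given column subset:

import numpy as np

def gaussian_with_spectrum(n, sigmas, rng):
    # Zero-mean Gaussian samples whose covariance has the given spectrum,
    # rotated by a random orthogonal matrix to take the data off axis.
    d = len(sigmas)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q @ (rng.standard_normal((d, n)) * np.asarray(sigmas, dtype=float)[:, None])

def cssp_errors(A, idx):
    # Frobenius, spectral, and summed per-point projection errors for column subset idx.
    B = A[:, idx]
    R = A - B @ (np.linalg.pinv(B) @ A)
    return (np.linalg.norm(R, "fro"),
            np.linalg.norm(R, 2),
            np.sum(np.linalg.norm(R, axis=0)))

rng = np.random.default_rng(0)
d, eta = 10, 0.5
A_lin = gaussian_with_spectrum(1000, [d - i for i in range(d)], rng)                       # linear decay
A_pow = gaussian_with_spectrum(1000, [d / i ** (1 + eta) for i in range(1, d + 1)], rng)   # power-law decay
print(cssp_errors(A_lin, [0, 1, 2]))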
While the traditional leverage score selects points that are important in several directions, the augmented leverage score selects points that are primarily important in the dominant PC modes - note how the augmented leverage score points are mainly on the outer edges of this data projection. Now consider how the augmented score performs on these synthetic datasets as shown in Figure 5.3 and Figure 5.4. The first column of both figures shows results measured under the Frobenius norm, the center column under the spectral norm, and the right column 69 under the projection error. The rank-k approximation error is included in each plot for reference. The rank-k approximation is constant because it is independent of the column subsample size. Consider the deterministic results in Figure 5.3, specifically the first row showing the linear decay dataset. In each error measurement, the augmented leverage score performs better at column selection, and the QR-pivot-based selection performs the best. Now consider the power-law decay dataset in the second row. In the power-law decay case, both the augmented leverage score and the QR-pivot have a significant advantage over the traditional leverage score for the first several samples, which indicates that both QR and augmented leverage are taking advantage of the fast drop in spectral error, while traditional leverage sampling does not. This connection provides a nice insight into not only why augmented leverage scores perform better, but also why QR-based sampling performs so well. The connection also confirms that taking into account the spectrum of the dataset can be important to the CSSP, for certain kinds of data. Observe the probabilistic results in Figure 5.4 and note that for this case, we compare the two leverage score approaches to a uniformly random sampling of the dataset. For these synthetic results, each algorithm was run 100 times. The solid plot line shows the mean result, while the vertical error bars indicate the standard deviation at each sample point. The first row shows the probabilistic results for the linear decaying spectrum dataset, and for this specific case, the three algorithms perform very similarly. In contrast, in the bottom row, when considering the power-law spectrum decay data, the augmented leverage score performs significantly better than both the uniformly random and the traditional leverage score sampling. Again, these results reinforce the idea that, especially for datasets with a fast decay, taking into account the spectral magnitudes is important to the probabilistic CSSP. 5.4.4 Real A similar comparison was run on the real datasets. In Figure 5.2, in the right column, the leverage scores are shown - again these are normalized so that the maximum leverage score is 1.0, and sorted in descending order for visual comparison. Note how the varying spectra strongly influence the resulting scores. The enron dataset has one of the most dramatic changes - it shows a very gradual decrease in leverage scores over the samples, 70 while after augmentation, the leverage score values drop off very quickly. The adult dataset also has a dramatic decrease in leverage score value (and increased decay rate), while the forest dataset decreases the decay rate of the leverage scores. Figure 5.5 shows the spatial layout of the deterministic leverage samples in 2D PCA space for the diabetes and enron datasets. The left plot in Figure 5.5 for the diabetes dataset shows results similar to those for the synthetic datasets. 
The leverage score appears to select points independent of the dominant directions, whereas the augmented and QRbased samples select points that are clearly important in the dominant directions. The right side plot shows the resulting sample for the enron graph dataset. In this case, there is overlap between the methods: they all pick very similar columns. The deterministic method performance comparison is shown in Figure 5.6; note the probabilistic results are in Figure 5.7. For these probabilistic results, the algorithms were each run 100 times (except enron, which was run only 5 times), and the average is plotted with the standard deviation shown by the error bars. Consider the diabetes dataset in the top row, where we noted a moderate change in leverage score distribution, and significant differences in actual leverage sampling in Figure 5.5. These modifications resulted in a significant improvement in the errors reported. In almost every case, the augmented leverage scores performed significantly better than the traditional leverage score. As we noted above, the enron dataset in the second row had dramatic changes in the leverage score distribution, but viewing the actual deterministic samples taken, the samples were similar. This small change is reflected in the error scoring where, for the deterministic approach, the three methods performed similarly. The forest dataset proved interesting in that we noted the augmentation led to a slower decay in leverage score magnitude. This change results in some of the augmented leverage score performing slightly worse in the deterministic algorithm, although the story is different for the probabilistic case where the mean is always below the traditional leverage score mean error. 71 5.5 Analysis of tail oblivious methods 5.5.1 Tail oblivious methods First, we present a formal definition of a tail oblivious column subset selection algorithm, of which deterministic leverage sampling is a specific example: Definition 4. A column selection algorithm is said to be k-tail-oblivious if, given an input A and a parameter k, the algorithm works only with the matrix Ak (i.e., the best rank-k approximation to A) in order to come up with its choice of columns. We will now present some proofs characterizing how many samples a general tail oblivious method requires to satisfy a multiplicative constraint. We note that the constants in the proofs that follow are often not optimal, and can be improved; we have chosen constants that make the presentation easy to follow. 5.5.2 Proofs In this section, we prove that any tail oblivious algorithm (Definition 4) has to choose at √ least n · 10α columns in order to achieve an approximation ratio of 10−α . Formally, we show the following: Theorem 5.5.1. Let k, m, n be given parameters with k ≤ m n. Let Alg be a deterministic k-tail-oblivious algorithm with resulting approximation guarantee on an input matrix A, k A − ΠS ( A)k2F < C k A − Ak k2F , with C > 1 the desired approximation factor, and S the selected subset. Then there exists an input A with n columns such that the required number of selected columns, m, is bounded from below by q nk C. We start with a high-level outline of our proof. 5.5.2.1 Proof outline We will construct the matrix A such that its projection to its top-k-SVD subspace (i.e., Ak ) is essentially an identity matrix of size k. Now, suppose Alg uses Ak (as it is tailoblivious) and picks a set of columns S of size m. 
The idea is to then make sure that in A, the chosen columns S have a high "tail error," i.e., for j ∈ S, we ensure that k( A − Ak )e j k is 72 δ, for some appropriately chosen parameter, and have a tail error zero for the columns that are not chosen. (This seems like a cyclical argument, as adding an error term to some of the columns could change the SVD! This is indeed a key challenge in our proof.) Once we have such a property, we prove that the error in reconstructing the columns 6∈ S is at least (roughly) δ2 nk , per column. That per-column error forces the overall squared reconstruction error to be ≈ n · δ2 /m, which is significantly larger than the "initial" tail error, roughly mδ2 . √ The tradeoff between these quantities gives rise to the m ≈ nk in the theorem. 5.5.2.2 Construction of matrix A Let us thus proceed with the construction of the matrix A that will be used in our argument. Crucial to our analysis will be the fact that the best rank-k approximation Ak is the identity matrix. We start by constructing a matrix B with orthogonal rows, and where any nonzero entry has value 1, specifically the matrix has the structure, x1,1 0 .. . x1,2 0 ... ... x1,p 0 0 x2,1 0 x2,2 ... ... .. . 0 0 ... 0 0 0 ... 0 x2,p ... ... .. . 0 0 ... ... 0 0 0 ... xk,1 ... xk,p where all xi,j = 1, k > 0, and p > 1. For example if p = 3, k = 4, we would have the matrix, 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 √ Clearly any matrix B with the above form has SVD B = Ik ( pIk )( √1p B). Let the sets Gi ⊂ [n], | Gi | = p denote all of columns in the same direction, so that dik=1 Gi = [n]. By running the tail-oblivious selection algorithm Alg, we obtain a subset index set S ⊂ [n] with |S| = m. Let Si ⊂ Gi , denote the selection of columns in direction i, and let mi = |Si | ≤ p denote the number of columns chosen, ∑i mi = m. Claim 1. m ≥ k and mi ≥ 1∀i ∈ [k ]. Proof. Otherwise there would be a direction without representation, resulting in potentially significant error (Alg unbounded). 73 Claim 2. At most k 2 of the column directions can have mi ≥ Proof. Otherwise ∑i mi = m ≥ n 8, n 4k , a large number of columns. which contradicts the assumption that Alg chooses a small number of columns. (< Let Q ⊂ [k ] be the set of column directions for which the algorithm Alg chose a small n 4k ) number of columns Q = {i : mi < n 4p }. We now add rows to the matrix B to form the matrix A. Algorithm 2 shows how we can add rows, while maintaining the condition that Ak = B. This is key to handling the cyclicity in the proof outline above. Algorithm 2 Augment( B, S, Q) Input B, S, Q defined above. let q = ∑i∈Q ni let A be a (k + q) × n matrix, with the first k rows equivalent to B, and zeros everywhere else. let T be the subset index set left over after selection, T = [n] \ S let j = 1 and R = {} an empty set. for i ∈ S if bi has same direction as one of Q select a column br , r ∈ T that has the same direction as bi set A j,i = δ and A j,r = −δ. set j = j + 1 and R = R ∪ r, T = T \ r end if end for Output A, R, T To be clear, consider running Algorithm 2 on the example B matrix above, with p = 3, k = 4, and assume that Alg selected S = {1, 4, 7, 10}. We would expect the following corresponding A matrix: 1 0 0 0 δ 0 0 0 1 0 0 0 −δ 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 δ 0 0 0 1 0 0 0 −δ 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 δ 0 0 0 1 0 0 0 −δ 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 δ 0 0 0 1 0 0 0 −δ 0 0 0 1 0 0 0 0 74 Lemma 5.5.1. Let B be a k × n matrix constructed as described above. 
Let A be a matrix formed by running Algorithm 2. Note that the r = | R| additional rows v1 , v2 , . . . , vr ∈ Rn satisfy (a) kvi k2 ≤ 2δ2 for all i, and (b) vi are all orthogonal to the row span of B. Then the best rank-k approximation of A is B, i.e., Ak = B. The proof is obtained by observing that the SVDs of B and the matrix whose rows are vi can be concatenated to obtain the SVD of A, and then observing that the singular values of the latter matrix are all ≤ 1. Proof of Lemma 5.5.1. Let B = XΣY be the SVD of B, where the dimensions are X (k × k ), Σ(k × k ), Y (k × n). By assumption, we have Σii > 1 for all i ∈ [k ]. Now, consider the SVD of the (r × n) matrix whose rows are the vectors vi . Let this be X 0 Σ0 Y 0 , where the dimensions are (r × r ), (r × r ) and (r × n), respectively. Define Y 00 to be a (k + r ) × n matrix formed by concatenation of the rows of Y and Y 0 . Also, let X 0 be a (k + r ) × (k + r ) matrix formed by two diagonal blocks X and X 0 , with zeroes in the off-diagonal. Analogously, define Σ00 to be the (k + r ) × (k + r ) formed by concatenating Σ and Σ0 . Now, the condition kvi k ≤ 2δ2 implies that Σii0 ≤ 1 for all i.2 Now, we note that our matrix A is precisely X 00 Σ00 Y 00 . Further, all the columns of X 00 and the rows of Y 00 are orthogonal and have a unit norm. Since the diagonal entries of Σ are all > 1 and those of Σ0 are all ≤ 1, it follows that the largest k singular values correspond to the portion from Σ, and thus we have the desired conclusion. We will next show that the A generated by Algorithm 2 will satisfy the conclusion of Theorem 5.5.1, thus proving the theorem. We will now bound the k-SVD error. Lemma 5.5.2. For the matrix A constructed as above, we have k A − Ak k2F ≤ 2mδ2 . √ Proof. We use the bound δ = 1/ m, together with Lemma 5.5.1 to conclude that Ak = B, which immediately implies the lemma. 2 Which follows from the well-known inequality that the largest singular value of a matrix with r columns √ is at most r times the length of the largest column. Equality holds iff all the columns are identical. 75 5.5.2.3 Analyzing the reconstruction error Next, we analyze the error when we reconstruct A from the chosen columns AS . Before we do this, we state an important property about reconstruction using the columns of A. Lemma 5.5.3. Suppose we want to write e j using the vectors e j + δez1 , e j + δez2 , . . . , e j + δezs for some s, j ∈ [s + 1], and where z = [s + 1] \ j. For any α1 , . . . , αs and δ2 < s ke j − ∑ αi (e j + δezi )k22 ≥ i =1 Proof. s ke j − ∑ αi (e j + δezi )k22 = i =1 If (1 − ∑is=1 αi )2 < 5.5.2.4 δ2 s s 4 we have δ2 . s s s i =1 i =1 ∑ α2i δ2 + (1 − ∑ αi )2 , then ∑is=1 αi ≥ 12 . Thus ∑is=1 α2i ≥ Ω( 1s ) (using Cauchy-Schwartz). Proof of Theorem 5.5.1 We can now place all the pieces together, to prove that reconstructing A using AS incurs a large error. Let S, R, and T be disjoint subsets of [n] as described in Algorithm 2. Consider the p columns Gi , with i ∈ Q. Gi can be further partitioned into the part that was selected, Si ; the part matched with the selected set in Algorithm 2, Ri = Gi \ (Si ∪ Ti ); and the left over part Ti , where |Si | ≤ n 4k , | Ri | = |Si |, and | Ti | = p − 2|Si | ≥ n 2k . The set Ri can be perfectly reconstructed using Si , but Ti incurs error due to the augmented component. Lemma 5.5.4. Every set Ti incurs error δ2 n mi 2k . Proof. Consider any column z with column index in a specific Ti . z is a vector in Rk+m , and the last m entries are all 0 (otherwise it would be in Si or Ri ). 
Now, consider a "reconstruction" of z using the columns of AS , i.e., let z0 = ∑i∈S αi Ai be any linear combination of the Ai , i ∈ S. The reconstruction error is, by design, the situation discussed in Lemma 5.5.3. This is where we use the fact that vi has precisely one coordinate of value δ in the construction of A using Algorithm 2. By that construction, we have that each Ai has one nonzero entry in the last m coordinates, and that the position of 76 this entry is different for different Ai , for i ∈ S. Thus, the sum of squares of the error for each column i in direction group Q j is bounded from below by Lemma 5.5.4 results in a lower bound on the total error of series). δ2 n m j 2k nδ2 2k (Lemma 5.5.3). ∑i 1 ni ≥ nδ2 k 8m (harmonic We can now combine the results from this section to prove our main result. Proof of Theorem 5.5.1. Lemmas 5.5.4 and 5.5.2 can now be used to deduce our main result, along with our choice of δ and m. Suppose the algorithm Alg approximates the error to a factor better than C. Then, the two lemmas imply that nδ2 k ≥ C · 2mδ2 . 8m Thus, having a C approximation implies that, m = Ω 5.6 q nk C . Conclusion In conclusion, we have shown motivation for taking into account the scale when computing leverage scores, which we call an augmented leverage score. We have shown empirically the method benefits both the probabilistic and the deterministic leverage score sampling algorithms on a variety of datasets, as well as a few cases where the improvement is less noticeable or even made the result slightly worse. We have also introduced a novel analysis of the general tail-oblivious deterministic columns selection algorithm, of which the deterministic leverage score is a special case. The analysis shows that it is difficult to obtain a multiplicative error bound with a small number of columns in this case. 5.7 References [1] P. Drineas and M. W. Mahoney, "On the Nyström method for approximating a gram matrix for improved kernel-based learning," Proceedings of Learning Theory, vol. 3559, pp. 323-337, 2005. [2] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, "Near-optimal column-based matrix reconstruction," SIAM Journal on Computing, vol. 10598, no. i, pp. 1-27, 2014. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/12086755X [3] D. Papailiopoulos, A. Kyrillidis, and C. Boutsidis, "Provable deterministic leverage score sampling," arXiv preprint, pp. 997-1006, 2014. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2623330.2623698 77 [4] P. Drineas, M. Mahoney, and S. Muthukrishnan, "Relative-error cur matrix decompositions," SIAM Journal of Matrix Analysis and Applications2, vol. 30, no. 2, pp. 844-881, 2008. [5] S. Paul, M. Magdon-Ismail, and P. Drineas, "Column selection via adaptive sampling," pp. 1-9, 2015. [Online]. Available: http://arxiv.org/abs/1510.04149 [6] M. M. W. Mahoney, "Randomized algorithms for matrices and data," arXiv preprint arXiv:1104.5557, p. 49, 2011. [Online]. Available: http://arxiv.org/abs/1104.5557 [7] M. W. Mahoney, "Algorithmic and statistical perspectives on large-scale data analysis," p. 33, 2010. [Online]. Available: http://arxiv.org/abs/1010.1609 [8] A. Gittens and M. W. Mahoney, "Revisiting the Nystrom method for improved largescale machine learning," International Conference on Machine Learning, vol. 28, pp. 1-45, 2013. [9] A. Gittens, "The spectral norm error of the naive Nystrom extension," arXiv preprint arXiv:1110.5305, pp. 1-9, 2011. [Online]. 
Available: http://arxiv.org/abs/1110.5305
[10] G. Golub, "Numerical methods for solving linear least squares problems," Numerische Mathematik, vol. 7, no. 1, pp. 206-216, 1965.
[11] T. F. Chan and P. C. Hansen, "Some applications of the rank revealing QR factorization," SIAM Journal on Scientific Computing, vol. 13, no. 3, pp. 727-741, 1992.
[12] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, "Least angle regression," Annals of Statistics, vol. 32, no. 2, pp. 407-499, 2004.
[13] B. Klimt and Y. Yang, "Introducing the Enron corpus," in CEAS, 2004.
[14] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, "Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters," Internet Mathematics, vol. 6, no. 1, pp. 29-123, 2009.
[15] M. Lichman, "UCI machine learning repository," 2013. [Online]. Available: http://archive.ics.uci.edu/ml

Table 5.1. Real and synthetic datasets examined, with number of points n, dimension d, and subspace dimension k that captures 90% of the spectral energy.
name                        n       d      k
Synthetic linear decay      1000    10     9
Synthetic power-law decay   1000    10     6
Diabetes                    442     9      6
Enron                       3000    3000   2108
Forest cover                58102   54     5
Adult                       16281   123    73
Census                      2273    119    8

Figure 5.1. Synthetic Gaussian datasets. The first row shows the spectrum (top left), the leverage and augmented leverage scores (top center), and a PCA-based visualization of the leverage score samples for the deterministic algorithm (top right), all for the linear decay dataset. The second row shows the same results but for the power-law decay dataset.

Figure 5.2. A plot of the spectrum, and a plot comparing the leverage and augmented leverage scores for each of the real datasets evaluated.
Figure 5.3. Deterministic results on the synthetic datasets. The first row is a synthetic Gaussian dataset with linear spectral decay. The second row is a synthetic Gaussian dataset with a power-law decay. Results are shown for each dataset measured under a Frobenius norm error (left column), a spectral norm error (center column), and a projection error (right column).

Figure 5.4. Probabilistic results on the synthetic datasets. The first row is a synthetic Gaussian dataset with linear spectral decay. The second row is a synthetic Gaussian dataset with a power-law decay. Results are shown for each dataset measured under a Frobenius norm error, a spectral norm error, and a projection error. For these synthetic results, the algorithm was run 100 times and the line shows the mean result, while the vertical error bar indicates the standard deviation.

Figure 5.5. Embedding of the data into the first two dimensions of a principal component analysis (PCA), with the deterministic sampling results of the traditional leverage scores (lev-det), proposed augmented leverage scores (aug-det), and QR-pivot samples (qr).
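For reference, here is a small matplotlib sketch (my own, purely illustrative) of the kind of 2D PCA embedding used in Figures 5.1 and 5.5, highlighting the columns chosen by each selection method:

import numpy as np
import matplotlib.pyplot as plt

def pca_embed_2d(A):
    # Project the columns of the d x n data matrix onto its first two principal components.
    Ac = A - A.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Ac, full_matrices=False)
    return (U[:, :2].T @ Ac).T          # n x 2 embedding

def plot_selections(A, selections):
    # selections: dict mapping a label (e.g. "lev-det", "aug-det", "qr") to column indices.
    Z = pca_embed_2d(A)
    plt.scatter(Z[:, 0], Z[:, 1], s=5, c="lightgray", label="data")
    for label, idx in selections.items():
        plt.scatter(Z[idx, 0], Z[idx, 1], s=40, label=label)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()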
Figure 5.6. Deterministic results on each of the real datasets. Results are shown for each dataset measured under a Frobenius norm error (left column), a spectral norm error (center column), and a projection error (right column).

Figure 5.7. Probabilistic results on each of the real datasets. For these results, each algorithm was run 100 times on each dataset (except enron, which was run only five times); the mean is plotted and the standard deviation is shown with error bars.

CHAPTER 6
ALLOCATION STRATEGIES FOR HIGH-FIDELITY MODELS IN THE MULTIFIDELITY REGIME

This chapter is a preprint of, Allocation strategies for high fidelity models in the multifidelity regime, Daniel J. Perry, Robert M. Kirby, Akil Narayan, and Ross T. Whitaker, submitted to SIAM Journal on Uncertainty Quantification in 2017, used with permission of the authors.

Allocation strategies for high fidelity models in the multifidelity regime∗
Daniel J. Perry†, Robert M.
Kirby‡ , Akil Narayan§ , and Ross T. Whitaker¶ Abstract. We propose a novel approach to allocating resources for expensive simulations of high fidelity models when used in a multifidelity framework. Allocation decisions that distribute computational resources across several simulation models become extremely important in situations where only a small number of expensive high fidelity simulations can be run. We identify this allocation decision as a problem in optimal subset selection, and subsequently regularize this problem so that solutions can be computed. Our regularized formulation yields a type of group lasso problem. Our numerical results compare performance of our algorithmic allocation against a variety of other strategies, including those based on classical linear algebraic pivoting routines and those derived from more modern machine learning-based methods. We demonstrate on well known synthetic problems and more difficult real-world simulations that a solution to the relaxed optimal subset selection problem performs better than the alternatives. Key words. multifidelity models, column subset selection, mixed norm regularization AMS subject classifications. 65D05 1. Introduction. A persistent challenge in uncertainty quantification is estimating the effect of uncertain parameters on quantities of interest. A common approach to understanding the effect of a parameter is to evaluate an ensemble of simulations for various parameter values. This approach is reasonable when multiple runs of a simulation model or experiment are easily obtained. Unfortunately, a simulation model or discretization that is more true-to-life, or has higher fidelity, requires additional computational resources due to, for example, increased number of discrete elements or more expensive modeling of complex phenomena. For these reasons a high-fidelity model simulation can incur significant computation cost, and repeating such a simulation for a sufficient number of times to understand parameter effects can quickly become infeasible. Recent work has introduced an effective solution to this problem by using multiple fidelities of simulation models where less-expensive, lower-fidelity versions of the high-fidelity model are used to learn the parametric structure of a simulation. This structure is utilized to choose a small parameter ensemble at which high-fidelity runs are assembled, providing insight into ∗ Submitted to the editors August 23, 2017. Funding: D. Perry and R.M. Kirby acknowledge that their part of this research was sponsored by ARL under Cooperative Agreement Number W911NF-12-2-0023. A. Narayan is partially supported by AFOSR FA9550-15-10467 and DARPA EQUiPS N660011524053. R. Whitaker acknowledges support from DARPA TRADES HR001117-2-0016. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of ARL, DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. 
† School of Computing and SCI Institute, University of Utah, Salt Lake City, UT (dperry@cs.utah.edu), ‡ School of Computing and SCI Institute, University of Utah, Salt Lake City, UT (kirby@cs.utah.edu), § Department of Mathematics and SCI Institute, University of Utah, Salt Lake City, UT (akil@sci.utah.edu), ¶ School of Computing and SCI Institute, University of Utah, Salt Lake City, UT (whitaker@cs.utah.edu), 1 86 2 DANIEL J. PERRY, ROBERT M. KIRBY, AKIL NARAYAN, AND ROSS T. WHITAKER the finer details and effects of the simulation parameters [19, 33]. One important decision in this allocation process is which parameter values should be used in the less costly lowand medium-fidelity simulations, and which values are worth the more costly high-fidelity simulation. This decision becomes extremely important for situations where it is physically impossible to run the high-fidelity simulation more than a small number of times, say O(10) times. By assuming the simulation realizations are elements in a Hilbert space, previous work uses a Gram matrix associated to a parameter ensemble to learn the parametric structure. Standard linear algebra tools, such as the Cholesky decomposition, are utilized to rank "important" parameter values where high-fidelity simulations are run. The multifidelity situation is special among subset selection problems since we cannot add additional elements to a high-fidelity subset due to the significant marginal cost of additional computation. Thus choosing the best subset becomes quite critical in order to effectively utilize a very small number of high-fidelity simulations. While the classical linear algebraic approaches mentioned previously perform reasonably well, this approach ignores the extensive work done in subset selection in a more general context in both the data analysis and machine learning literature [9, 5, 3, 4, 21, 1]. Our main contributions in addressing this problem are as follows: • Our work includes substantial experimental comparisons with a braod range of subset selection algorithms. These include classical methods such as randomized sampling, single-layer Gaussian processes, and pivoted numerical linear algebra routines, along with more modern machine learning-based strategies such as leverage sampling and neural networks. We conclude that a particular group orthogonal matching pursuit (GOMP) algorithm yields superior results for the problems we consider. • We apply existing GOMP approximation theory to the multifidelity case, providing justification for the competitive performance of GOMP compared to other algorithms. • We empirically show that the GOMP multifidelity procedure is effective when applied to a variety of nontrivial large-scale problems in computational science, such as compressible fluid dynamics and structural topology optimization. In particular, we observe that GOMP-based methods yield superior results in almost every situation we have tried. We note that there are many alternative approaches to tackling the multifidelity problem, e.g., [15, 24, 20, 22]. These alternatives are substantially different in scope, applicability, and goals, and so we leave a direct comparison between our methodology and these alternatives for future work. 2. Previous work. We will now consider the multifidelity proxy model introduced in [19] and further studied in [33]. 
The authors of [19] proposed using samples from a low-fidelity model to inform the parameter selection of the much more costly high-fidelity samples and to compute the reconstruction weights of the high-fidelity proxy model based on those lowfidelity samples. The weights are computed from the Gramian (also known as the Gram matrix or inner product matrix) of the low-fidelity samples, and so the resulting model can be considered a non-parametric model with weights derived from the low-fidelity Gramian. For simplicity we will consider only a two-level ("bifidelity") situation with only a single low- and 87 ALLOCATION STRATEGIES FOR HIGH FIDELITY MODELS IN THE MULTIFIDELITY REGIME 3 high-fidelity model. However, this procedure can be applied to many levels [19]. 2.1. Gramian non-parametric multifidelity algorithm. We consider the following general multifidelity setup: Let z ∈ D ⊂ Rq , q ≥ 1, be a common parametric input into two simulations models of differing fidelities. We use uL and uH to denote the outputs of the low- and high-fidelity models, respectively, with uL (z) a vector-valued output of the low-fidelity model evaluated at parameter value z. The models produce outputs in vector spaces V L and V H , respectively. To summarize our notation: uL : D → V L , uH : D → V H . The vectors uL and uH are assumed to represent any spatial/temporal effects of interest. The multifidelity regime occurs when uH is more faithful to reality than uL , but requires greater computational effort to simulate. That is, (1) uH (z) − u(z) uL (z) − u(z) , Cost uH (z) Cost uL (z) , where u(z) represents reality1 , and Cost(·) is the requisite computational burden of evaluating the argument. In practice the assumptions (1) frequently imply that the dimensions of the vector spaces satisfy dim V L = dL dH = dim V H , but it is not necessary to assume this and the multifidelity algorithm does not require this assumption. Due to the high computational cost, the number of high fidelity simulations is limited to m, with m = O(10) being a common bound. In contrast, the low-fidelity model is much more computationally affordable with n m low-fidelity simulations available. Let γ = {z1 , . . . , zn } be a set of sample points in D, which define matrices L L aL . . . aL aL AL = aL ∈ Rd ×n , n j = u (zj ) , 1 2 H H aH . . . aH ∈ Rd ×n . aH AH = aH n j = u (zj ) , 1 2 Let S ⊂ [n] denote a generic set of column indices, and for A ∈ Rd×n having columns aj we define S = j1 , . . . , j|S| , AS = aj1 aj2 . . . aj|S| . The authors in [19] showed that the structure of AL can be used to identify a small number of column indices S ⊂ [n], |S| = m, so that AL S can be used to form a rank-m approximation H to AL , and AH S can be used to form a rank-m approximation to A . Precisely, they form the approximations L † L L † L AL ≈ AL (2) A , AH ≈ AH A . S AS S AS 1 We are intentionally being vague in defining u(z) and the norms in (1). Such relations can be established through formal means, such as convergence of discretizations of mathematical models, or through more informal means, such as expert belief or knowledge in superiority of the high-fidelity model. 88 4 DANIEL J. PERRY, ROBERT M. KIRBY, AKIL NARAYAN, AND ROSS T. WHITAKER An important observation in the above approximations is that the representation for AH requires only AH S , i.e., it only requires m = |S| evaluations of the high-fidelity model. 
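To illustrate the approximations in (2)-(3), here is a minimal sketch (my own, not code from [19]) of the bifidelity reconstruction: the coefficients are solved for in the low-fidelity space, and only the |S| selected parameter values ever require a high-fidelity run. The callable high_fidelity_run is a hypothetical stand-in for launching the expensive simulation at ensemble index j.

import numpy as np

def bifidelity_reconstruct(A_low, idx, high_fidelity_run):
    # Least-squares coefficients of every low-fidelity snapshot in the selected basis:
    # C = (A^L_S)^+ A^L, so A^L ~ A^L_S C and, per (2), A^H ~ A^H_S C.
    C = np.linalg.lstsq(A_low[:, idx], A_low, rcond=None)[0]        # |S| x n coefficients
    AH_S = np.column_stack([high_fidelity_run(j) for j in idx])     # d_H x |S| high-fidelity runs
    return AH_S @ C                                                 # d_H x n approximation of A^H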
The construction of S is performed in a greedy fashion, precisely as the first m ordered pivots in T a pivoted Cholesky decomposition of AL AL . While AH only represents uH on a discrete set γ, the procedure above forms the approximation uH (z) ≈ (3) m X ci uH (zji ), i=1 where the coefficients the expansion coefficients of uL (z) in a least-squares approximation L ci are m with the basis u (zji ) i=1 . Thus, the high-fidelity simulation may actually be evaluated at any location z ∈ D if uL (z) is known. This procedure has the following advantages: • Once m high-fidelity simulations have been computed and stored, an approximation (3) to the high-fidelity model is constructed having the computational complexity of only the low-fidelity model, cf. (3). • The subset S is identified via analysis of the inexpensive low-fidelity model, so that a very large set γ may be used to properly capture the parametric variation over D. • It is not necessary for the spatiotemporal features of the low-fidelity to mimic those of the high-fidelity model. Section 6.2 in [19] reveals that uL and uH may actually be entirely disparate models, yet the approximation (3) can be accurate. 2.2. Subset selection. The critical portion of the previous section's algorithm is the identification of the column subset S. In order to guarantee convergence of the high-fidelity approximation, one must also have some correlation between the parametric variation of the high- and low-fidelity models, cf. Theorem 4.2 of [19]. In this paper, we assume such correlation exists and focus exclusively on the problem of identification of S. Abstractly, this is a problem of subset selection among the n columns of AL , and thus this problem focuses entirely on the low-fidelity model. To emphasize this and to simplify notation we dispense with the L and H superscripts, hereafter writing A ← AL , aj ← aL j. The subset selection problem, identification of S, is an allocation problem. That is, given an accurate but expensive high-fidelity simulation, how do we allocate high-fidelity computational resources across the set of possible simulations γ? Noting the approximations (2), one way to accomplish this is to choose S in a way that captures as much variation in the low-fidelity model as possible. Definition 2.1. Column subset selection problem (CSSP). Find the column subset matrix C = AS, where C ∈ Rd×m , S ∈ Rn×m is a column selection matrix, and the m ≤ n columns of C are a subset of the n columns of A, so that (4) is minimized. r = kRk2F = kA − CC† Ak2F 89 ALLOCATION STRATEGIES FOR HIGH FIDELITY MODELS IN THE MULTIFIDELITY REGIME 5 The matrix S above mirrors the subset S discussed earlier, and C = AS . The exact CSSP is an NP-complete problem [27], requiring the evaluation of all subsets of size m. However there have been a large variety of strategies proposed to address CSSP; we summarize a selection of these strategies below: • Leverage sampling [4, 21, 10, 25]. The singular value decomposition (SVD) of A is computed or approximated, and this information is used to choose a column subset of A, either randomly or deterministically. Leverage sampling has attractive theoretical guarantees, e.g., kA − CC† Ak2F ≤ (1 + )kA − Ak k2F where C is chosen according to the leverage score and m is O(1/ ). (The best rank-k matrix Ak is defined by the Eckart-Young-Mirsky theorem.) In our results section we observe that leverage sampling on our datasets is less competitive than other sampling techniques. • The CUR matrix decomposition [17]. 
This performs a column (C) and row (R) subset selection on A, resulting in an element-based decomposition of a matrix, so that A ≈ CUR, where U = C† AR† . The best known CUR matrix error bounds are obtained by using leverage sampling for the column and row subset selection [17], for that reason we consider our experimental comparison to leverage sampling and random sampling as sufficient to characterize related methods such as the CUR decomposition technique. • Deterministic interpolative decompositions (ID) [9]. These are similar in spirit to CUR; ID methods frequently use a Gram-Schmidt-based column-pivoted QR method for subset selection, resulting in an approximation A ≈ CB, where C is a column subset of A and B is a coefficient matrix that minimizes the approximation error. (E.g., B = C † A is possible choice for B.) In this paper we will use the pivoted QR method as an exemplar of deterministic ID algorithms. • Pivoted QR decompositions [12, 14]. One historical use of these routines was to select a column subset in order to reliably compute the least-squares solution of a linear system. For CSSP we run a pivoted QR routine and then sample the columns in order identified by the pivots [8, 21]. QR based subset selection has a much worse known theoretical guarantee with respect to the rank-k tail bound kA − Ak k2F than leverage sampling, but empirically we have observed it perform much better. Both [19] and [33] propose using a pivoted Cholesky routine in a manner that is algebraically equivalent to a pivoted QR method. Alternative linear algebraic pivoting strategies can be used as approaches to CSSP, such as partial or full pivoted LU [13]. However, besides their mention in [33] as a possible direction, we have not observed their use in existing CSSP work. We note that [1] showed a different type of bound for a general greedy strategy, of which the pivoted QR, LU, and Cholesky methods are specific instances. The theoretical bound shown for the general greedy strategy is kCC† Ak2F ≥ (1− )kDD† Ak2F where D is the optimal subset of size m (defined by Definition 2.1). This bound is also of interest because large values of kCC† Ak2F are related to small values of kA − CC† Ak2F . All the strategies above are described as applied to a set of Euclidean vectors, but each only requires a set of elements in a Hilbert space. Consequently, by using the same Hilbert space assumptions from prior work [19], we are able to make use of a broad class of CSSP algorithms for the allocation portion of the multifidelity procedure under consideration. 90 6 DANIEL J. PERRY, ROBERT M. KIRBY, AKIL NARAYAN, AND ROSS T. WHITAKER 3. Allocation strategies for multifidelity simulation. 3.1. Relaxation of the exact subset problem. We now reformulate a regularized version of the CSSP as a mixed-norm constrained least-squares problem. Each column of the subset matrix C is taken from the full dataset, and therefore cj ∈ {ai |i = 1, . . . n}. This representation can be formulated so that R = CB̃, where B̃ ∈ Rm×n is a coefficient matrix that essentially combines the coreset elements, or atoms, via weights. Each data point is therefore approximated as, ai ≈ Cb̃i , (5) and the residual is (6) r= X i kai − Cb̃i k22 = kA − CB̃k2F (7) = kA − ABk2F , (8) where B ∈ Rn×n , bi,: = b̃j,: , and b̃j,: is the j-th row corresponding to the j-th data element in C, the same as the i-th data element in A. The smaller matrix B̃ can be thought of as a submatrix of the larger more sparse B, where m specific (nonzero) rows are preserved. 
Instead of building B from B̃, we are interested in solving for B directly which minimizes r. The optimum of the objective kA − ABk2F would result in B = I, however if we constrain or regularize B so that it is row-sparse as described above, then we can compute a matrix B with the desired properties. This leads to this regularized formulation for column subset selection, B∗ = arg min kA − ABk2F + λkBT k0,0 (9) B P where kMk0,0 = i kM:,i k0 is a mixed `0,0 norm that induces column sparsity. Note the equivalence of the objective in (9) to CSSP objective in (4) above where AB = CC† A, and the columns selected for C correspond to the non-zero rows of B. Because of the mixed `0,0 norm we still have an NP-complete problem. However P in this new form, we can relax the mixed norm penalty to an `2,1 norm, where kMk2,1 = i kM:,i k2 , (10) B∗ = arg min kA − ABk2F + λkBT k2,1 B This relaxed form also induces row sparsity of B but is no longer NP-complete and can be solved exactly. The type of problem shown in (10) is known in the optimization literature as a group lasso problems [31]. Group lasso problems have been well studied and there are a variety of algorithms available for their solution. For the results in this paper we use group orthogonal matching pursuit (GOMP). Although this is a greedy algorithm and not optimal in general, it performs well for selecting subsets of columns [16]. Two other methods that have demonstrated efficiency in solving this type of optimization are the group least angle regression (GLARS) 91 ALLOCATION STRATEGIES FOR HIGH FIDELITY MODELS IN THE MULTIFIDELITY REGIME 7 and alternating direction method of multipliers (ADMM) [31, 6, 11]. We have found GOMP and GLARS to be the most useful overall for the exploration of various coreset sizes, while ADMM is well-suited to extremely large datasets. While GOMP can be computed over a discrete data matrix A, it only really makes use of the inner products of the data vectors, so that the algorithm only assumes a Hilbert space over the data points. This means that the approach can be used in the context of continuous simulation vectors uL (zi ) as long as they exist in a proper Hilbert space. A pseudocode presentation of GOMP for column subset selection with continuous vectors is shown in Algorithm 1. Algorithm 1 Group orthogonal matching pursuit for subset selection of continuous vectors Input u(γ) - input vectors, λ - regularization parameter controlling sparsity, - precision parameter Output B ∈ Rm×p , coefficient matrix, A, the subset indices. procedure GOMP(u(γ), λ, ) Set Qij = hu(zi ), u(zj )i and Z = Q, the inner product matrices. Initialize S = Z , A = ∅ while kBT k2,1 < λ or kA − ABk2F < do S = Z − QB . update residual ci = ||Si ||2 i∗ = arg maxi cAc . element with max group correlation A = A ∪ {i∗ } BA = Z†A,A ZA,: end while return B, A end procedure 3.2. Group orthogonal matching pursuit. We now present some theoretical justification for why GOMP is competitive for the CSSP. Our two results are Theorem 3.1 and 3.2. In particular, Theorem 3.2 shows that GOMP can identify a small representative column subset under some assumptions on the matrix and the representative set. Let A ∈ Rd×n be a matrix whose n columns consist of snapshot or feature vectors. An implicit assumption is that n d, but this is not necessary. We assume that the columns of A have unit norm, i.e., if aj is the jth column of A, then we assume that (11) kaj k2 = 1. 
Our goal is to construct a low-rank approximation to every feature (column) in A using linear combinations of a (small) subset of feature vectors in A. As described in the previous section, our strategy for accomplishing this is to solve the following optimization problem for the matrix B ∈ Rn×n : (12) min kA − ABk2F + λ B T 2,1 , 92 8 DANIEL J. PERRY, ROBERT M. KIRBY, AKIL NARAYAN, AND ROSS T. WHITAKER where BT 2,1 = n X j=1 kbj,: k2 , and bj,: the jth row of B. The mixed norm k · k2,1 promotes column sparsity of its argument. In the context of the product AB, column sparsity of B T implies that a small number of columns of A are used to approximate the remaining columns. Because we solve (12) using a group orthogonal matching pursuit (GOMP) algorithm, we proceed to rewrite this problem in the language of GOMP algorithms. Let ai , i ∈ [n], denote the ith column of A. We call a set of (column) indices S ⊂ [n] a basis set for A if span ai i ∈ S = range (A) . Basis sets are not unique in general. A composite set of indices equals [n]\S for a given basis set S. Our first result is a consistency result: given a basis set that satisfies reasonable assumptions, GOMP can identify this set. Theorem 3.1. Let A ∈ Rd×n have columns ai ∈ Rd , and let Sg ⊂ [n] be a basis set for A. For each j ∈ [n], there is a unique expansion of aj in columns of the basis set: X (13a) aj = Ci,j ai i∈Sg If (13b) C̄ := max j∈[n]\Sg X i∈Sg |Ci,j | < 1 then GOMP identifies the set Sg after |Sg | steps. Proof. For any matrices A and B of appropriate size, the following identity holds: vec B T AT = (A ⊗ I n ) vec B T , where ⊗ is the Kronecker product, and vec (·) is the vectorization (vertical concatenation of columns) of its argument. Defining d := vec B T , f := vec AT , and H = (A ⊗ I n ), we can write our desired relation A ≈ AB in vectorized format: (14) Hd ≈ f . The goal of (12), aiming to promote row sparsity of B, is now translated into a group sparsity of entries of d = vec (B). Each group of elements in d corresponds to an individual row of B. We seek to prove that GOMP applied to this system recovers the basis groups. We will let G denote a generic subset of [n2 ]. We define the groups Gi ⊂ [n2 ], i = 1, . . . , n as sets of n sequential indices, Gi = {n(i − 1) + 1, n(i − 1) + 2, . . . , ni} , 93 ALLOCATION STRATEGIES FOR HIGH FIDELITY MODELS IN THE MULTIFIDELITY REGIME 9 and thus G1 , . . . , Gn partition [n2 ]. We denote by H G the matrix H restricted to the Gindexed columns H. The basis columns Sg can be translated into a set of basis groups Gg : Sg ⊂ [n] =⇒ Gg ⊂ [n2 ], Gg = ∪i∈Sg Gi The complement of Sg in [n], and the associated set Gd , are defined accordingly Sd := [n]\Sg , Gd = ∪i∈Sd Gi . We use AS to denote selection of the S-indexed columns of A, and H G to denote selection of the G-indexed columns of H. Under the assumption (11), we have H TGj H Gj = I n for each group j = 1, . . . , n. Row sparsity of B means that we wish to choose a small (sparse) number of groups Gj . The theorem is proven if we can show that GOMP applied to (14) identifies groups of indices associated with the column indices Sg . Consider the matrix, M = ATSg ASg . Since Sg is a set of basis indices, then M is invertible. This implies + + −1 T H+ = A ⊗ I = A ⊗ I = M A n n S g Sg ⊗ I n Gg Sg Then H+ Gg H Gd = M −1 ATSg ⊗ I n (ASd ⊗ I n ) = M −1 ATSg ASd ⊗ I n . Define C := M −1 ATSg ASd ∈ Rs×(n−s) , and note that it is comprised of the entries Ci,j defined in (13a). We need to introduce a mixed vector norm. 
Let q ∈ N be arbitrary; for a vector x ∈ Rqn , we define x1 q x2 X x = . ∈ Rqn , xj ∈ Rn =⇒ kxk2,1 = kxj k2 . .. j=1 xq Given such a vector x, we use notation xj to mean a length-n vector from the elements of x as above. We have k(C ⊗ I n ) vk2,1 = = n−s X j=1 (cj ⊗ v j ) n−s X s X j=1 i=1 ≤ 2,1 n−s X |Ci,j | kv j k2 ≤ j=1 kcj ⊗ v j k2,1 max 1≤j≤n−s s X i=1 ! n−s X |Ci,j | kv j k2 = kCk1 kvk2,1 j=1 Chaining together these results with the assumption (13b), we have shown H+ Gg H Gd ∗2,1 := sup v∈R(n−s)n \{0} k(C ⊗ I n ) vk2,1 kvk2,1 < 1. Under this condition, Corollary 1 in [16] shows that GOMP applied to Hd ≈ f identifies the basis groups in d; this is equivalent to GOMP identifying the basis columns of A. 94 10 DANIEL J. PERRY, ROBERT M. KIRBY, AKIL NARAYAN, AND ROSS T. WHITAKER The condition (13b) is a nontrivial requirement, and is difficult to verify in practice. In particular, the set Sg defining a basis group is not unique for most matrices of interest, but many of these sets will not satisfy (13b). One significant disadvantage of this analysis is that the condition (13b) cannot be checked before the set Sg has been identified. For A ∈ Rd×n , assuming d ≤ n and that A has full rank, we require |Sg | = d in order to satisfy the relation (13a). However, d can be very large and so in practice we will identify a set of features S with size |S| < d; this suggests that (13a) can only be satisfied approximately in this case. Following the analysis in [16], we can provide robust version of the above theorem, showing success of GOMP when the relation (13a) is satisfied only approximately. We require a little more notation to proceed: With A and Sg as in Theorem 3.1, let ASg be as defined in the proof of that theorem, i.e., the submatrix of A formed from the columns indexed by Sg . Now define the smallest eigenvalue of ATSg ASg : (15) λ̄ := λ ATSg ASg = min v∈R|Sg | \{0} ATSg ASg v kvk2 2 >0 where the inequality is true since the columns of ASg are a basis for range(A). Theorem 3.2. Let A ∈ Rd×n have normalized columns ai ∈ Rd , and let Sg ⊂ [n] be a basis set for A such that the columns of A satisfy X (16a) aj = Ci,j ai + nj , i∈Sg and assume that each nj ∈ Rd for j = [n]\Sg has independent and identically distributed components, each a normal random variable with mean 0 and variance σ 2 . For any η ∈ (0, 1/2), let the stopping tolerance ε in the GOMP Algorithm 1 satisfy p σ 2 · n · d log(2 · n · d/η) (16b) ε> . 1 − C̄ Assume that the Ci,j satisfy (13b), and further that they satisfy v √ uX u n 2 8 ε (16c) min t . Ci,j > i∈Sg λ̄ j=1 Then GOMP identifies the set Sg and computes a solution B satisfying r 2 log(|Sg |/η) . max |Bi,j − Ci,j | ≤ σ i,j λ̄ Proof. Let C be the matrix formed from the elements Ci,j , which is row-sparse, having non-zero entries only when i ∈ Sg . As before, we seek a computational approximation B to the (unknown) C. The model (16a) results in the linear system AB = AC + N 95 ALLOCATION STRATEGIES FOR HIGH FIDELITY MODELS IN THE MULTIFIDELITY REGIME 11 where N is a concatenation of the columns nj . (We define ni = 0 for i ∈ Sg .) Following the proof of Theorem 3.1, the vectorization of the transpose of the above equation is Hd = f + g where g = vec N T and f = vec C T AT . The above equation holds when d has group sparsity defined by the set Sg . We seek to show that GOMP applied to this system, having knowledge only of H and f + g, can identify groups of d corresponding the column indices Sg for A. 
The system above is simply a "noisy" version of GOMP, where g is a noise vector. We seek to apply Theorem 3 in [16] to show the conclusions. Assumptions (16) and (13b) are among the stipulations of this theorem. The remaining stipulation is

(17)  λ_min(H_{G_g}^T H_{G_g}) ≤ 1,

where λ_min(·) denotes the minimum eigenvalue of the symmetric matrix input. We can show this via applications of Kronecker product properties. First we have

H_{G_g}^T H_{G_g} = (A_{S_g} ⊗ I_n)^T (A_{S_g} ⊗ I_n) = (A_{S_g}^T A_{S_g}) ⊗ (I_n I_n).

If λ_i, i = 1, …, |S_g|, are the eigenvalues of A_{S_g}^T A_{S_g}, then they are also the eigenvalues of (A_{S_g}^T A_{S_g}) ⊗ (I_n I_n), each having multiplicity n. Thus,

λ_min(H_{G_g}^T H_{G_g}) = λ_min(A_{S_g}^T A_{S_g}) = λ̄,

with λ̄ defined in (15). The matrix A_{S_g}^T A_{S_g} is positive-definite and has diagonal entries equal to 1 since the columns a_i are normalized. Thus,

|S_g| = trace(A_{S_g}^T A_{S_g}) = Σ_{i=1}^{|S_g|} λ_i ≥ |S_g| λ̄,

so that λ̄ ≤ 1, showing (17). Having shown this, we may apply Theorem 3 in [16], which yields the conclusion of the theorem.

Note that (16a) can be satisfied with |S_g| < d even if A has full rank. When the columns of A are feature vectors that are outputs of scientific models, the assumption that columns are perturbed by random noise is not realistic. However, the main purpose of Theorem 3.2 is to show that GOMP is robust to relaxations of (13a).

3.3. Theoretical discussion. We discuss here how the various methods we have investigated compare in terms of error guarantees, number of columns, and asymptotic runtime needed to achieve those guarantees. In Table 1 the relationship between ε and m for deterministic leverage score sampling assumes a power-law distribution on the leverage scores, parameterized as ℓ_i = 1/i^{1+η}; we differentiate this case by writing m̂. Cholesky, QR, LU, and GOMP can all be analyzed using results from [1], which analyzes a general greedy column subset algorithm. For QR we report the multiplicative error bound from [14], although the results from [1] apply as well.

Table 1
Theoretical comparison among the subset-selection methods discussed here.

  Algorithm     | Runtime | Theoretical error                            | ε, m
  Leverage [21] | O(n³)   | ‖A − CC†A‖_F² ≤ (1+ε)‖A − A_k‖_F²            | m̂ = max{(2k/ε)^{1/(1+η)}·(1/η), 2k/ε, k/η}
  Cholesky [19] | O(n²m)  | ‖CC†A‖_F² ≥ (1−ε)‖DD†A‖_F²                   | m = 16k/(ε·σ_min(DD†))
  QR [14]       | O(n²m)  | ‖A − CC†A‖_2 ≤ sqrt(1+n−k)·2^k ‖A − A_k‖_2   | m = k
  LU [13]       | O(n²m)  | ‖CC†A‖_F² ≥ (1−ε)‖DD†A‖_F²                   | m = 16k/(ε·σ_min(DD†))
  GOMP          | O(n²m²) | ‖CC†A‖_F² ≥ (1−ε)‖DD†A‖_F²                   | m = 16k/(ε·σ_min(DD†))

Overall we observe that leverage sampling has the most preferable theoretical error bound, because the bound is multiplicative and depends on the level of error, ε, which is satisfactory in the application. The remaining methods share the error bound from [1], which does not relate directly to the best rank-k approximation. QR additionally has a multiplicative relationship to the best rank-k approximation, but its coefficient grows exponentially with respect to k.
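For readers who prefer a computational view of the selection analyzed above, the following Python/NumPy sketch illustrates a GOMP-style greedy column selection: at each step it scores every unselected column by its total correlation with the current residual, and then re-fits all columns on the selected set by least squares. This is a minimal illustration under the assumption of a dense in-memory matrix, not the exact Algorithm 1 referenced in Section 3.1; the function name and the simple stopping rule are ours.

    import numpy as np

    def gomp_column_selection(A, m, tol=1e-8):
        # A: (d, n) matrix whose columns are the feature vectors; m: subset size.
        d, n = A.shape
        S = []                                   # selected column indices, in order
        R = A.copy()                             # residual: the part of A not yet explained
        C = np.zeros((0, n))
        for _ in range(m):
            # group score: total correlation of each candidate column with the residual
            scores = np.linalg.norm(A.T @ R, axis=1)
            scores[S] = -np.inf                  # never re-select a column
            S.append(int(np.argmax(scores)))
            # orthogonal step: re-fit every column of A on the selected columns
            C, *_ = np.linalg.lstsq(A[:, S], A, rcond=None)
            R = A - A[:, S] @ C
            if np.linalg.norm(R) < tol:
                break
        B = np.zeros((n, n))
        B[S, :] = C                              # row-sparse coefficient matrix, A @ B ≈ A
        return S, B

In this reading, Theorem 3.1 says that when the coherence condition (13b) holds, such a greedy loop recovers a basis set in |S_g| steps.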
4. Experimental results.

4.1. Methods compared. We use a collection of methods from the literature, drawn both from previous work in multifidelity simulation approximation and from the more general CSSP literature. Previous work in multifidelity simulation approximation has primarily used pivoted Cholesky (chol) for subset selection, although authors have alluded to the possibility of using other pivoted decomposition methods, such as partially pivoted LU decomposition (lu) and pivoted QR decomposition (qr). All of these decomposition methods follow the same general approach of selecting the element with the largest residual norm after removing the contribution of previously selected elements; the exception is LU, which instead selects the element with the maximum residual entry magnitude rather than the full norm. From the CSSP literature we compare against leverage score sampling (lev), which uses the singular value decomposition (SVD) of the Gram matrix to make the selection. Finally, we compare against the proposed method, group orthogonal matching pursuit (gomp), which was explained in detail in Section 3.1. We also include results using a uniformly random selection (rand) and the best rank-k approximation (rank-k) as contextual benchmarks. All methods used are summarized in Table 2.

Table 2
Summary of methods used in the empirical comparison for choosing the resource allocation.

  Abbreviation | Algorithm
  rand         | Uniform random
  lev          | Deterministic leverage score [21]
  qr           | Pivoted QR [14]
  chol         | Pivoted Cholesky [19]
  lu           | Partially pivoted LU [13]
  gomp         | Group orthogonal matching pursuit (Section 3.1)
  rank-k       | Best rank-k approximation [13]

We report error as a squared ℓ2 error normalized by the sum of squared ℓ2 norms of the original data,

E = Σ_i ‖X_i − X̃_i‖_2² / Σ_i ‖X_i‖_2²,

where X_i is the ith simulation and X̃_i is the proxy estimate for the ith simulation.

4.2. Burgers equation. We use the viscous Burgers equation to evaluate the method in a well-studied and familiar setting. We introduce uncertain perturbations to the left boundary condition and the viscosity parameter, similar to the setting studied in [33]. Specifically, we use

(18)  u_t + u u_x = ν u_xx,  x ∈ (−1, 1),
(19)  u(−1) = 1 + δ,  u(1) = −1,

where ν ∼ U(0.1, 1) and δ ∼ U(0, 0.1) are both uniform random variables on the respective ranges. This same problem with only the boundary perturbation (fixed viscosity) was studied in [33] because it is extremely sensitive to boundary perturbations [30]. We found that adding a second element of uncertainty, around the viscosity parameter ν, makes the problem more interesting from the perspective of attempting a linear approximation.

4.2.1. Setup. We sampled this parameter space using 400 total samples, taking 20 uniform grid samples within the boundary perturbation range, (0, 0.1), and 20 uniform grid samples within the viscosity perturbation range, (0.1, 1.0). For each parameter pair, we ran the low-fidelity and high-fidelity simulations for a fixed number of simulation steps (t = 1.5 × 10^6), resulting in 400 simulation results for the low-fidelity model and another 400 for the high-fidelity model.

4.2.2. Analysis. The ultimate goal is to understand how each subset selection method performs when the low-fidelity data are used for selection, while the error is computed with respect to the high-fidelity estimation. To test this, we use each method to select a subset and reconstruction weights using the low-fidelity dataset, and then test the reconstruction error on both the low-fidelity samples and the high-fidelity samples. We observed that all methods improved with each additional sample available, but certain methods do better overall, and some outperform all others when the number of samples is extremely small. This latter case is important for our setting, in which only a few samples can be computed with the high-fidelity model.
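A minimal sketch of this evaluation protocol is shown below, assuming NumPy arrays A_low and A_high whose columns hold the low- and high-fidelity simulations for the same parameter samples. For simplicity the least-squares weights fit on the low-fidelity columns are reused directly for the high-fidelity columns; the actual Gramian-weighted construction of [19] is more involved, so this should be read as an illustration of the test, not of that method.

    import numpy as np

    def normalized_error(X, X_hat):
        # E = sum_i ||x_i - x_hat_i||^2 / sum_i ||x_i||^2, with one simulation per column
        return float(np.sum((X - X_hat) ** 2) / np.sum(X ** 2))

    def evaluate_subset(A_low, A_high, S):
        # A_low:  (d_low, n)  low-fidelity simulations, one per column
        # A_high: (d_high, n) high-fidelity simulations for the same parameters
        # S:      column indices selected using only the low-fidelity data
        W, *_ = np.linalg.lstsq(A_low[:, S], A_low, rcond=None)  # weights from low fidelity
        E_low = normalized_error(A_low, A_low[:, S] @ W)
        # the same subset and weights are reused with the few selected high-fidelity runs
        E_high = normalized_error(A_high, A_high[:, S] @ W)
        return E_low, E_high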
We found that, overall, deterministic leverage-score sampling performed the worst, even after a relatively large number of samples had been selected. Even the random sampling method appears to do better on average, which is surprising. The LU-based partial-pivoting sampling approach did better than random, but worse overall than the remaining methods. In all of our experiments, QR- and Cholesky-based sampling performed nearly identically. This can be understood as follows: when acting on a symmetric positive-definite matrix like the Gramian, the QR and Cholesky pivoting strategies become nearly equivalent. Finally, the proposed GOMP-based sampling approach performs the best overall, particularly in the very small subset setting. These results are summarized in Figure 1, which shows the reconstruction error in the low-fidelity model and in the high-fidelity model against the exact solution.

Figure 1. Burgers equation: Reconstruction error among subset methods for solutions to the Burgers equation. Left: the reconstruction error on the rest of the low-fidelity simulation dataset when using the indicated subset chosen by the various methods. Right: the reconstruction error for the high-fidelity simulations with the subset chosen using the low-fidelity samples.

4.3. Double pendulum. We consider a classic double pendulum problem. A similar problem was also considered in [19], and we use the same setup and assumptions here. The problem is parameterized by the two pendulum angles, θ1 and θ2, the lengths of the pendulums, ℓ1 and ℓ2, the mass of each pendulum, m1 and m2, and the gravity coefficient g. The high-fidelity model corresponds to the solution of Equation (6.5) in [19], parameterized by (m2, ℓ2) and computed using a strong-stability-preserving Runge-Kutta method. The low-fidelity model uses the same parameterization but a corresponding linear approximation; see Equation (6.7) in [19] for details. In contrast to [19], we use a Euclidean inner product for the examples shown here.

4.3.1. Setup. For the high-fidelity model, we use a time step Δt = 10^{-2} until T = 15, resulting in a high-fidelity time-series vector of size 1501. The low-fidelity model uses Δt = 0.25 until T = 15, resulting in a 61-dimensional time-series vector. The uncertain parameter m2 is sampled along the interval [0.25, 0.75], and ℓ2 is sampled from the range [0.25, 4]. We sample the low-fidelity model using 20 uniform grid samples along each parameter range, resulting in 400 simulations.

4.3.2. Analysis. For this example, we observed a much larger gap between the lowest and highest performers. This gap indicates that the choice of subset selection matters more for this problem, probably due to the more strongly nonlinear relationship exhibited here than in the Burgers example above.
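As an illustration of the sampling described in Section 4.3.1, a small sketch of assembling the low-fidelity ensemble follows; the solver call is a hypothetical placeholder for the linearized pendulum model of [19], and the module name is ours.

    import numpy as np
    from my_models import solve_pendulum_lowfi  # hypothetical low-fidelity solver

    m2_vals = np.linspace(0.25, 0.75, 20)   # uncertain mass of the second pendulum
    l2_vals = np.linspace(0.25, 4.0, 20)    # uncertain length of the second pendulum

    # each run is assumed to return a 61-sample time series (dt = 0.25, T = 15)
    columns = [solve_pendulum_lowfi(m2=m2, l2=l2, dt=0.25, T=15.0)
               for m2 in m2_vals for l2 in l2_vals]
    A_low = np.column_stack(columns)        # shape (61, 400): one simulation per column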
The error observed in the low-fidelity model, shown on the left of Figure 2, demonstrates that the GOMP-based approach achieves superior performance at almost every subset size, until the number of samples becomes so large that the choice of subset selection no longer matters. Using the low-fidelity subset choices to compute reconstructions for the high-fidelity model, shown on the right side of Figure 2, achieves similar improvements. In the high-fidelity case, the model errors are much larger, making any gains more important.

Figure 2. Double pendulum: Reconstruction error among subset methods for solutions to the double pendulum problem. Left: the reconstruction error on the rest of the low-fidelity simulation dataset when using the indicated subset chosen by the various methods. Right: the reconstruction error for the high-fidelity simulations with the subset chosen using the low-fidelity samples.

A random subset selection does quite well in comparison to these deterministic approaches. However, we note that it would rarely make sense to allocate even a single high-fidelity simulation based purely on a random selection, due to the risk of any specific subset performing quite poorly. While in expectation the random case performs well, any specific random subset may not, unless the subset size becomes large enough that the subset selection algorithm is no longer important. However, having arbitrarily large numbers of high-fidelity simulations is typically not possible in real-world situations.

4.4. Compressible flow simulation. Compressible flow simulations are used to study how fluids behave while flowing in and around obstacles. Understanding these situations in detail is vital in many domains, especially in the design and engineering of bridges, buildings, and airplane wings. For example, simulating the flow of air around a building helps inform designers of the effects on the surrounding environment, including pedestrians, cars, trees, and other buildings. We present here results from a 2D compressible flow simulation around a cylindrical object, which relates directly to the design of pylons on a bridge or the beam supports of aircraft wings. The simulation finds a solution to the compressible Navier-Stokes equations, which in two dimensions can be written as

(20)  ∂q/∂t + ∂f/∂x + ∂g/∂y = 0,

where q is the vector of conserved variables and f = f(q, ∇q) and g = g(q, ∇q) are the flux vectors. These can be decomposed as

(21)  f = f_i − f_v,  g = g_i − g_v,

where f_i, g_i are the inviscid fluxes and f_v, g_v are the corresponding viscous fluxes. The viscous fluxes take the form

(22)  f_v = (0, τ_xx, τ_yx, u τ_xx + v τ_yx + k T_x)^T,  g_v = (0, τ_xy, τ_yy, u τ_xy + v τ_yy + k T_y)^T,

where τ is the stress tensor,

(23)  τ_xx = 2μ(u_x − (u_x + v_y)/3),  τ_yy = 2μ(v_y − (u_x + v_y)/3),  τ_xy = τ_yx = μ(v_x + u_y),

with μ the dynamic viscosity and k the thermal conductivity. The corresponding solution is well known, and we refer the reader to classic texts, such as [18], for appropriate background.
We are specifically interested in two uncertain parameters: the Reynolds number, Re, which relates the velocity of the fluid to its viscosity, and the Mach number, Ma, which relates the speed of the fluid to the local speed of sound. We ran the simulations using the compressible flow module of the Nektar++ suite [7]. In the simulation, we use the following relationships between the Reynolds number and the dynamic viscosity, and between the Mach number and the farfield velocity:

(24)  μ = ρ∞ u∞ / Re,  u∞ = Ma · sqrt(γ p∞ / ρ∞),

where p∞, ρ∞, and u∞ denote the farfield pressure, density, and x-component of the velocity, respectively, and γ is the ratio of specific heats. An example of a single time point in one of the simulations is shown on the left side of Figure 3.

Figure 3. Compressible flow simulation. Left: An example of one of the compressible flow simulations used in this experiment, showing the entire field of mass-density (ρ) values. The simulation models a fluid flowing past a cylinder; we are specifically interested in the mass-density and energy values that emerge in the immediate wake of the cylinder. Right: A zoomed-in diagram showing the locations of the 15 sample points taken from each simulation for the analysis. These points were specifically positioned in the immediate wake of the cylinder; the location of this diagram is indicated by the rectangle in the image on the left.

4.4.1. Setup. We evaluated the Reynolds number, Re, and the Mach number, Ma, at regular intervals, with 5 samples between 200 and 450 and 6 samples between 0.2 and 0.45, respectively. For each parameter pair, we ran the simulation for 1 × 10^7 steps to achieve 1 unit of non-dimensionalized simulation time, and for each time step we recorded 4 values at 15 spatial locations in the immediate vicinity of the cylinder. The spatial locations are shown in Figure 3. We are primarily interested in the fluid flow after it transitions into a steady shedding regime, so we ignore the time steps in the startup regime, which for this problem we observed to be 30% of the steps, or 4 × 10^4 steps. This results in a feature vector with 6 × 10^4 time steps and 15 × 4 = 60 values for each step. Once the simulation leaves the transient regime, each of these simulation sample points is a periodic signal whose form depends nonlinearly on the parameters chosen. To align the signals, we detect the period length, find the cyclic peak, and translate all signals so that the first peaks align. We then crop all signals to a single period; the cropping was done to ensure a nice decay of singular values in the ensemble. A sample of a single positional point and variable for various Reynolds and Mach numbers is shown in Figure 4 for reference. The corresponding Hilbert space is induced by the inner product of these period-length signals. We ran both the low- and high-fidelity simulations over the specified parameter values in order to compute the proxy approximation error using the various subset selection methods.
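The peak-alignment preprocessing described above can be sketched as follows. This is a simplified stand-in that estimates the period from the autocorrelation of the post-transient signal; the exact detection used in the experiments is not specified beyond the description above, so the function and its details are ours.

    import numpy as np

    def align_and_crop(signal):
        # signal: 1D array holding the post-transient portion of one recorded quantity.
        # Simplified sketch: estimate the period from the autocorrelation, shift to the
        # first cyclic peak, and keep exactly one period.
        x = signal - signal.mean()
        ac = np.correlate(x, x, mode="full")[x.size - 1:]
        period = int(np.argmax(ac[1:]) + 1)       # first strong repeat of the signal
        start = int(np.argmax(x[:period]))        # first cyclic peak within one period
        return signal[start:start + period]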
4.4.2. Analysis. The results of the comparison, in terms of how each of the subset selection methods performs, are shown in Figure 5. Here we observe slightly different results for the high-fidelity versus the low-fidelity errors. In the low-fidelity case, the methods perform similarly to the previous experiments: the GOMP, Cholesky, QR, and leverage methods all perform the same after a certain number of samples, and all perform better than random sampling. For the high-fidelity results, we observed that Cholesky and QR perform similarly to random, while GOMP performs the best. One of the primary differences between GOMP and QR is that GOMP performs selection based on residual correlation in the Hilbert space, while QR uses residual magnitude. These results indicate that, for this type of simulation data, correlation is more indicative of high-fidelity sample importance than magnitude; this is another reason to consider GOMP when choosing the subset selection technique.

Figure 4. Compressible flow simulation point samples over time. Top: Samples of the mass-density field at a single positional point, (1.5, 0.0), located in the immediate wake of the cylinder in Figure 3. The differing signal shapes are due to changes in the Reynolds and Mach numbers of the simulation; the figure shows 10 randomly sampled Reynolds-Mach pairs. Bottom: Same as above, but for the energy field.

4.5. Structure topology optimization. A structure topology optimization (STO) problem involves solving a non-convex optimization problem for the optimal material placement to satisfy a physical property such as stiffness or maximum stress. STO has emerged as a powerful tool in designing various high-performance structures, from medical implants and prosthetics to jet engine components [28, 26, 32]. The optimization shown here is a canonical STO problem in which the optimization finds the best material layout in the design space in order to maximize the stiffness (or, equivalently, minimize the deflection or compliance) of the structure subject to a material volume constraint. The response of the structure for a set of loading and boundary conditions is typically computed/simulated via the finite element method (FEM). The topology optimization yields a solution in the form of binary maps that indicate material placement. Additional details of the specific optimization and example implementations can be found in [2]. The problem considered here is parameterized by three variables: the vertical position of the loading, p, the loading angle, θ, and the filter size, ρ, which controls the scale of contiguous blocks in the material placement. This is summarized in Figure 6.

4.5.1. Setup. The topology optimization takes place on a 2D rectangular grid of size n_row × n_col. For this experiment, the position variable, p, was restricted to p ∈ [−0.5 n_row, 0.5 n_row], corresponding to the extent of the right boundary depicted in Figure 6. The angle parameter was restricted to θ ∈ [0, π], corresponding to the full range between pointing straight down and pointing straight up. The filter parameter was restricted to ρ ∈ [1.1z, 2.5z], corresponding to the scale of features allowed in the final simulation, where z is the appropriate ratio coefficient depending on the resolution s_x, s_y. We sampled the three parameter spaces uniformly at random until 1000 sample triples were found. Those triples were then used to run 1000 separate topology simulations. Low-fidelity corresponds to the problem solved with n_row = 40, n_col = 80, z = 1, and high-fidelity corresponds to the problem solved on a much larger region, n_row = 80, n_col = 160, z = 2.
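Before turning to the topology optimization results, the distinction drawn in Section 4.4.2 between residual-magnitude selection (QR/Cholesky style) and residual-correlation selection (GOMP style) can be made concrete with two toy scoring rules. These are schematic sketches of the criteria only, not the pivoted implementations compared in the experiments.

    import numpy as np

    def next_by_magnitude(R, selected):
        # QR/Cholesky-style rule: pick the column with the largest residual norm
        norms = np.linalg.norm(R, axis=0)
        norms[list(selected)] = -np.inf
        return int(np.argmax(norms))

    def next_by_correlation(A, R, selected):
        # GOMP-style rule: pick the column most correlated with the residual of every sample
        scores = np.linalg.norm(A.T @ R, axis=1)
        scores[list(selected)] = -np.inf
        return int(np.argmax(scores))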
Figure 5. Compressible flow: Projection error among subset methods for solutions to the compressible flow problem. Left: the reconstruction error on the rest of the low-fidelity simulation dataset when using the indicated subset chosen by the various methods. Right: the reconstruction error for the high-fidelity simulations with the subset chosen using the low-fidelity samples.

Figure 6. Structure topology optimization diagram. The loading on the right edge is parameterized by a vertical position, p, and a loading angle, θ; the filter size, ρ, controls the length scale in the topology optimization.

4.5.2. Analysis. Our analysis takes into account both the low-fidelity and high-fidelity spaces. Specifically, we analyze how well the subset selected by each method reconstructs the remaining low-fidelity samples. We also consider how the method reconstructs the high-fidelity samples using the same subset. We observed that the GOMP approach performed the best for every subset size examined. The QR and Cholesky methods also performed quite well in terms of low-fidelity reconstruction error. The average over random subsets performed relatively well initially, but all the other methods perform better at larger subset sizes.

The high-fidelity case differed more for this dataset than in the previous simulations examined. Specifically, GOMP maintains its position as the best performer, and it also gains a considerable edge at the larger subset sizes. Surprisingly, Cholesky and QR perform well at first, but then perform worse as the subset size grows. The average over random subsets does better than the LU and leverage approaches. The performance comparison among the different methods is summarized in Figure 8. The poorer performance of QR for larger subset sizes is, we hypothesize, due to the larger differences between the low- and high-fidelity solutions in this dataset. Because of the non-convex nature of the topology optimization problem, we observe some larger differences among the various solutions; an example is shown in Figure 7. We hypothesize that because GOMP bases the decision to use a subset element largely on its overall correlation with the remaining data points, rather than on simple residual error magnitude, it is able to capture the more important trends of the samples, which carry over to the high-fidelity case.

Figure 7. Structure topology optimization: Samples from both low- and high-fidelity solutions. Each row shows the low- and high-fidelity solutions using the same parameters; note the variety of changes in the solution caused simply by moving from a lower-resolution domain to a higher-resolution domain.

Our previous analysis has focused on the subset size. A larger subset size for the same error ultimately translates into a higher runtime cost to obtain the same approximation error. To make this relationship clearer, we also compared the methods directly in terms of simulation runtime and reconstruction error. A popular alternative approach to generating proxy functions for applications in uncertainty analysis of simulations is to use Gaussian process (GP) regression to estimate unseen simulation results (see, e.g.,
[15, 24, 23]). However, a GP approach relies on a large number of high-fidelity simulations from which to derive the proxy function, which is directly at odds with computationally constrained high-fidelity problems. We illustrate this point by including a GP proxy using the same, small number of samples used in the multifidelity approach. We also consider an alternative neural-network-based proxy method proposed specifically for topology optimization in [29]. In [29], the authors proposed training a standard feedforward neural network, also known as a multilayer perceptron (MLP), as a proxy for the exact solution given the problem constraints. They first reduce the dimension of the dataset to an 80-dimensional space using PCA, and then train the MLP to regress to the PCA weights given the problem constraints. We implemented their approach and obtained comparable results for the same number of solution designs used in their training (400). However, this approach also assumes the availability of sufficient data to train the model, which is not always the case for high-fidelity models in the wild. For comparison, we used both an MLP-based approach with no PCA reduction and an MLP trained on the PCA weights, where the reduced space was chosen to capture 90% of the singular value energy. The results are summarized in Figure 9.

The left side of Figure 9 compares all the methods on a log scale. Because of the lack of training data, the MLP model performs quite poorly. The GP model initially performs much better than the MLP model on the smaller numbers of samples, but after reaching 10 to 15 samples the two become quite similar in performance. Both the GP and MLP models accrue considerable error because the limited samples do not sufficiently represent the high-fidelity space. In contrast, the Gramian-weighted nonparametric proxy method introduced in [19] performs well using only a small number of high-fidelity samples because it makes direct use of the low-fidelity structure through the low-fidelity Gramian. In considering how the subset selection of the nonparametric proxy method is done, we found advantages of using a GOMP-based subset similar to those in the previous examples. Specifically, we note that, given the same amount of high-fidelity simulation time, we can improve the error considerably over previous methods.

Figure 9. Structure topology optimization: Average reconstruction error per simulation runtime cost among subset methods, over an ensemble of 50 random datasets with 500 samples each; the standard deviation is shown as dashed error bars around each point. The reconstruction error is computed for the high-fidelity simulations with the approximation subset indices chosen using the low-fidelity samples. Simulation runtime cost is the average cost of running the specific high-fidelity simulations needed to obtain the corresponding error. The runtime required for each simulation varies because each can require a different number of iterations to converge, so an average over the several samples is shown. Left: A log-scale comparison including a Gaussian process (GP), a multilayer perceptron (MLP), and an MLP trained on PCA projections (MLP-pca) similar to [29]. Right: The same comparison on a linear scale, showing only the results using the proposed multifidelity technique.

There are some significant differences in the observed reconstruction error performance among the different subset methods. For the topology optimization problem, these differences can be seen easily by viewing the subsets of solutions chosen and considering their differences. Figure 10 shows the first 15 samples chosen by each of the methods, in order of selection. Note that QR and Cholesky chose the same subsets for this dataset, and for brevity we show only QR.
All of the methods initially select a similar solution, one of the solid-beam structure solutions. This is consistent with our understanding of each of the methods: the leverage method selects the sample with the most statistical leverage; QR, Cholesky, and LU select the sample with the largest residual; and GOMP selects the sample with the largest correlation with all of the samples. Note how QR selects a sample with some slant, which probably results in a larger-magnitude image, while GOMP selects a beam with a straight orientation, which is probably more highly correlated with more of the samples. The methods start to differentiate after a few samples: statistical leverage continues to select samples with similar large beam structures, because those solutions also have large statistical leverage scores (leverage scores do not take prior samples into account as the other methods do). The order of the QR and Cholesky selections is interesting: because the first sample was slanted one way, the residual error indicated that a slant the other way was the next best choice according to the magnitude criterion. After the slanted beams, other solutions with large beams as well as small lattices were chosen, probably because these also resulted in large residual magnitudes. The GOMP approach, in contrast, selected two large-beam solutions and then selected a solution made entirely of lattices, probably because that distribution of material correlated more highly with the remaining residuals. We hypothesize that because GOMP uses residual correlation instead of residual magnitude, it is able to do better in the high-fidelity space even when there are some differences between the low- and high-fidelity solutions for the topology optimization problem.

5. Conclusion. We have presented a novel approach to the allocation of resources for high-fidelity simulations in a multifidelity regime. The proposed method is motivated by a first-principles analysis of the problem and results in a substantial improvement in projection error per unit of simulation cost.
The proposed method was compared to previous allocation strategies, as well as to existing column subset selection problem solutions, on several simulation datasets. This work also proposed a connection between the allocation problem and the classic subset selection problem in the machine learning and data mining domains.

Acknowledgments. We would like to thank Dr. Vahid Keshavarzzadeh for providing the data and assistance for the structure topology optimization example, and Mr. Ashok Jallepalli for assistance with generating the fluid simulation example.

REFERENCES

[1] J. Altschuler, A. Bhaskara, G. Fu, V. Mirrokni, A. Rostamizadeh, and M. Zadimoghaddam, Greedy column subset selection: New bounds and distributed algorithms, 2016, https://arxiv.org/abs/1605.08795.
[2] E. Andreassen, A. Clausen, M. Schevenels, B. S. Lazarov, and O. Sigmund, Efficient topology optimization in MATLAB using 88 lines of code, Structural and Multidisciplinary Optimization, 43 (2011), pp. 1-16.
[3] F. Bach, Sharp analysis of low-rank kernel matrix approximations, International Conference on Learning Theory, 30 (2013), pp. 185-209, https://arxiv.org/abs/1208.2015.
[4] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, Near-optimal column-based matrix reconstruction, SIAM Journal on Computing, (2014), pp. 1-27, https://doi.org/10.1137/12086755X.
[5] C. Boutsidis, M. W. Mahoney, and P. Drineas, An improved approximation algorithm for the column subset selection problem, in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009, pp. 968-977, https://arxiv.org/abs/0812.4293.
[6] S. Boyd, Alternating direction method of multipliers, Proceedings of the 51st IEEE Conference on Decision and Control, 3 (2011), pp. 1-44, http://www.nowpublishers.com/product.aspx?product=MAL&doi=2200000016.
[7] C. D. Cantwell, D. Moxey, A. Comerford, A. Bolis, G. Rocco, G. Mengaldo, D. De Grazia, S. Yakovlev, J.-E. Lombard, D. Ekelschot, et al., Nektar++: An open-source spectral/hp element framework, Computer Physics Communications, 192 (2015), pp. 205-219.
[8] T. F. Chan and P. C. Hansen, Some applications of the rank revealing QR factorization, SIAM Journal on Scientific Computing, 13 (1992), pp. 727-741.
[9] H. Cheng, Z. Gimbutas, P. G. Martinsson, and V. Rokhlin, On the compression of low rank matrices, SIAM Journal on Scientific Computing, 26 (2005), pp. 1389-1404, https://doi.org/10.1137/030602678.
[10] M. B. Cohen and C. Musco, Ridge leverage scores for low-rank approximation, 2015, https://arxiv.org/abs/1511.07263.
[11] W. Deng, W. Yin, and Y. Zhang, Group sparse optimization by alternating direction method, SPIE Optical Engineering + Applications, 2013, p. 88580R, https://doi.org/10.1117/12.2024410.
[12] G. Golub, Numerical methods for solving linear least squares problems, Numerische Mathematik, 7 (1965), pp. 206-216.
[13] G. H. Golub and C. F. Van Loan, Matrix Computations, vol. 3, JHU Press, 2012.
[14] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM Journal on Scientific Computing, 17 (1996), pp. 848-869, https://doi.org/10.1137/0917055.
[15] L. Le Gratiet, Multi-fidelity Gaussian process regression for computer experiments, PhD thesis, Université Paris-Diderot - Paris VII, 2013.
[16] C. Lozano, G.
Swirszcz, and N. Abe, Group orthogonal matching pursuit for variable selection and prediction, in Advances in Neural Information Processing Systems (NIPS), 2009, pp. 1-9.
[17] M. W. Mahoney and P. Drineas, CUR matrix decompositions for improved data analysis, Proceedings of the National Academy of Sciences, 106 (2009), pp. 697-702, https://doi.org/10.1073/pnas.0803205106.
[18] B. R. Munson, D. F. Young, and T. H. Okiishi, Fundamentals of Fluid Mechanics, New York, 3 (1990).
[19] A. Narayan, C. Gittelson, and D. Xiu, A stochastic collocation algorithm with multifidelity models, SIAM Journal of Computing, 51 (2013), pp. 2005-2035, https://doi.org/10.1137/090750688.
[20] L. W.-T. Ng and M. Eldred, Multifidelity uncertainty quantification using non-intrusive polynomial chaos and stochastic collocation, American Institute of Aeronautics and Astronautics, Apr. 2012, https://doi.org/10.2514/6.2012-1852.
[21] D. Papailiopoulos, A. Kyrillidis, and C. Boutsidis, Provable deterministic leverage score sampling, in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), 2014, pp. 997-1006, https://doi.org/10.1145/2623330.2623698.
[22] B. Peherstorfer, K. Willcox, and M. Gunzburger, Optimal model management for multifidelity Monte Carlo estimation, SIAM Journal on Scientific Computing, 38 (2016), pp. A3163-A3194, https://doi.org/10.1137/15M1046472.
[23] P. Perdikaris, M. Raissi, A. Damianou, N. Lawrence, and G. Karniadakis, Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling, Proceedings of the Royal Society A, 473 (2017), p. 20160751.
[24] P. Perdikaris, D. Venturi, J. O. Royset, and G. E. Karniadakis, Multi-fidelity modelling via recursive co-kriging and Gaussian-Markov random fields, Proceedings of the Royal Society A, 471 (2015), p. 20150018, https://doi.org/10.1098/rspa.2015.0018.
[25] D. J. Perry and R. T. Whitaker, Augmented leverage score sampling with bounds, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2016, pp. 543-558.
[26] T. A. Reist, J. Andrysek, and W. L. Cleghorn, Topology optimization of an injection moldable prosthetic knee joint, Computer-Aided Design and Applications, 7 (2010), pp. 247-256.
[27] Y. Shitov, Column subset selection is NP-complete, arXiv preprint arXiv:1701.02764, 2017.
[28] A. Sutradhar, G. H. Paulino, M. J. Miller, and T. H. Nguyen, Topological optimization for designing patient-specific large craniofacial segmental bone replacements, Proceedings of the National Academy of Sciences, 107 (2010), pp. 13222-13227.
[29] E. Ulu, R. Zhang, and L. B. Kara, A data-driven investigation and estimation of optimal topologies under variable loading configurations, Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 4 (2016), pp. 61-72.
[30] D. Xiu and G. E. Karniadakis, Supersensitivity due to uncertain boundary conditions, International Journal for Numerical Methods in Engineering, 61 (2004), pp. 2114-2138.
[31] M. Yuan and Y.
Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68 (2006), pp. 49-67.
[32] J.-H. Zhu, W.-H. Zhang, and L. Xia, Topology optimization in aircraft and aerospace structures design, Archives of Computational Methods in Engineering, 23 (2016), pp. 595-622.
[33] X. Zhu, A. Narayan, and D. Xiu, Computational aspects of stochastic collocation with multifidelity models, SIAM/ASA Journal on Uncertainty Quantification, 2 (2014), pp. 444-463, https://doi.org/10.1137/130949154.

Figure 10. Structure topology optimization samples: the samples chosen (in order) by each algorithm; panels show Leverage, QR, LU, and GOMP. Because Cholesky and QR select the same samples, we show only the results for QR in the interest of space. Note that all methods first select samples with a large primary beam and then become quite different afterwards. The differences are caused by how the methods make the next selection; for example, GOMP uses maximum correlation with the residual, while QR/Cholesky use maximum residual magnitude.

CHAPTER 7

CONCLUSION

In this dissertation, we have presented a set of novel approaches to data analysis and visualization using a set of carefully motivated basis selection techniques. The work has focused on two broad areas: basis selection to improve human interpretation of datasets and basis selection to reduce computational burden. We will now highlight a set of more fundamental ideas within this work and potential future directions.

7.1 Additional motivating examples

The majority of subset-based matrix approximation work is motivated by an established set of problems, such as PCA, covariance estimation, Nyström approximation, etc. An interesting contribution of this dissertation is a set of motivating problems that are different from these yet directly connected to important applications.

7.1.1 Subset-based visualization

The first new motivating example is subset-based visualization. While other work has used subset selection in the process of visualization, the work presented here has made the subset central to the visualization itself. Specifically, the subset selected from a topology optimization (or other) dataset is used directly to summarize the visual aspects of the dataset. One interesting area of additional research is to identify alternative objectives and how they would summarize the dataset differently. For example, we have emphasized reconstruction as the primary goal and have shown that such an objective does a good job of feature coverage. However, there are other objectives that would result in summaries with alternative strengths, such as convexity, sparsity, proximity to the corresponding exemplar, etc. These types of questions can lead to different types of objectives and different types of corresponding bases and algorithms.

7.1.2 Proxy construction for uncertainty quantification

Another new motivating example is subset selection for use in proxy construction. This motivation has some strong similarities to the more traditional matrix approximation goals, but this example has an important difference: the ultimate goal of the selected basis is to reduce error in another set of data.
To be precise, the setting is: given a set of low-fidelity simulations in a matrix A_low, can we find an approximation B_low that corresponds to parameters that will best describe the corresponding high-fidelity matrix A_high? The difference is subtle but important: the goal is to select a subset that will be good for the high-fidelity matrix, using the low-fidelity data. It would be interesting to explore alternative strategies for estimating the high-fidelity behavior while using only the low-fidelity data.

7.2 Out-of-sample Nyström approximation

Another fundamental idea presented here is that a basis selection for a Nyström approximation does not have to be a subset matrix and, in fact, when constructed in a specific way, can outperform a subset matrix of similar size. This idea is important because it contrasts with some assumptions in previous work about how to best construct a Nyström matrix. Another avenue of research would be to pursue this type of approximation with faster, more heuristic tools. Here we have explicitly solved the objective of minimizing the Nyström approximation error, but perhaps there are approximations that could obtain similar improvements without as much upfront cost.

7.3 Objectives guide heuristics

Heuristic approaches will often run much faster than exact solutions. However, in the column subset selection portion of this work, we found that by moving the heuristic, in this case leverage sampling, closer to the objective, we observed better empirical results from the sampling. Similarly, by considering the error in the objective, and how ignoring the tail can result in significant estimation error, we were able to obtain a better theoretical characterization of the deterministic column subset. Other heuristics could be adjusted in similar ways, resulting in improved basis selection algorithms or better characterizations. |
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6bg7x99 |



