| Title | Learning latent structures via Bayesian nonparametrics: new models and efficient inference |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Computing |
| Author | Rai, Piyush |
| Date | 2013-08 |
| Description | Latent structures play a vital role in many data analysis tasks. By providing compact yet expressive representations, such structures can offer useful insights into the complex and high-dimensional datasets encountered in domains such as computational biology, computer vision, natural language processing, etc. Specifying the right complexity of these latent structures for a given problem is an important modeling decision. Instead of using models with an a priori fixed complexity, it is desirable to have models that can adapt their complexity as the data warrant. Nonparametric Bayesian models are motivated precisely by this desideratum, offering a flexible modeling paradigm for data without limiting the model complexity a priori. The flexibility comes from the model's ability to adjust its complexity adaptively with data. This dissertation is about nonparametric Bayesian learning of two specific types of latent structures: (1) low-dimensional latent features underlying high-dimensional observed data where the latent features could exhibit interdependencies, and (2) latent task structures that capture how a set of learning tasks relate to each other, a notion critical in the paradigm of Multitask Learning where the goal is to solve multiple learning tasks jointly in order to borrow information across similar tasks. Another focus of this dissertation is on designing efficient approximate inference algorithms for nonparametric Bayesian models. Specifically, for the nonparametric Bayesian latent feature model where the goal is to infer the binary-valued latent feature assignment matrix for a given set of observations, the dissertation proposes two approximate inference methods. The first one is a search-based algorithm to find the maximum-a-posteriori (MAP) solution for the latent feature assignment matrix.
The second one is a sequential Monte-Carlo-based approximate inference algorithm that allows processing the data one example at a time while being space-efficient in terms of the storage required to represent the posterior distribution of the latent feature assignment matrix. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Bayesian learning; Bayesian nonparametrics; Machine learning |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | Copyright © Piyush Rai 2013 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 924,147 Bytes |
| Identifier | etd3/id/3460 |
| ARK | ark:/87278/s6bs21fj |
| DOI | https://doi.org/doi:10.26053/0H-F53V-VC00 |
| Setname | ir_etd |
| ID | 197014 |
| OCR Text | LEARNING LATENT STRUCTURES VIA BAYESIAN NONPARAMETRICS: NEW MODELS AND EFFICIENT INFERENCE by Piyush Rai A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science School of Computing The University of Utah August 2013 Copyright © Piyush Rai 2013 All Rights Reserved The University of Utah Graduate School STATEMENT OF DISSERTATION APPROVAL The dissertation of Piyush Rai has been approved by the following supervisory committee members: Hal Daume III, Chair August 16, 2012 Date Approved Jordan Boyd-Graber, Member January 29, 2013 Date Approved Lawrence Carin, Member January 29, 2013 Date Approved Thomas P. Fletcher, Member August 16, 2012 Date Approved Suresh Venkatasubramanian, Member August 16, 2012 Date Approved and by Alan L. Davis, Chair of the Department of School of Computing and by Donna M. White, Interim Dean of The Graduate School. ABSTRACT Latent structures play a vital role in many data analysis tasks. By providing compact yet expressive representations, such structures can offer useful insights into the complex and high-dimensional datasets encountered in domains such as computational biology, computer vision, natural language processing, etc. Specifying the right complexity of these latent structures for a given problem is an important modeling decision. Instead of using models with an a priori fixed complexity, it is desirable to have models that can adapt their complexity as the data warrant. Nonparametric Bayesian models are motivated precisely by this desideratum, offering a flexible modeling paradigm for data without limiting the model complexity a priori. The flexibility comes from the model's ability to adjust its complexity adaptively with data. 
This dissertation is about nonparametric Bayesian learning of two specific types of latent structures: (1) low-dimensional latent features underlying high-dimensional observed data where the latent features could exhibit interdependencies, and (2) latent task structures that capture how a set of learning tasks relate to each other, a notion critical in the paradigm of Multitask Learning where the goal is to solve multiple learning tasks jointly in order to borrow information across similar tasks. Another focus of this dissertation is on designing efficient approximate inference algorithms for nonparametric Bayesian models. Specifically, for the nonparametric Bayesian latent feature model where the goal is to infer the binary-valued latent feature assignment matrix for a given set of observations, the dissertation proposes two approximate inference methods. The first one is a search-based algorithm to find the maximum-a-posteriori (MAP) solution for the latent feature assignment matrix. The second one is a sequential Monte-Carlo-based approximate inference algorithm that allows processing the data one example at a time while being space-efficient in terms of the storage required to represent the posterior distribution of the latent feature assignment matrix. CONTENTS ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi CHAPTERS 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Overview of Methods and Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1.1 Nonparametric Bayesian Latent Factor Models . . . 
. . . . . . . . . . . . . . 3 1.1.2 Nonparametric Bayesian Learning of Latent Task Structures . . . . . . . 4 1.1.3 Efficient Inference for the Nonparametric Latent Feature Models . . . . 5 1.1.4 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.1 Nonparametric Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Dirichlet Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Indian Buffet Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.4 Kingman's Coalescent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5 Factor Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.6 Mixture of Factor Analyzers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.7 Multitask Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3. NONPARAMETRIC BAYESIAN SPARSE LATENT FACTOR MODEL . . 15 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.2 Nonparametric Bayesian Factor Regression . . . . . . . . . . . . . . . . . . . . . . . . 16 3.2.1 Nonparametric Gene-Factor Model . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.2 Feature Selection Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.2.3 Hierarchical Factor Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.4 Full Model and Extension to Factor Regression . . . . . . . . . . . . . . . . . 18 3.3 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . 19 3.3.1 Sampling the IBP Matrix Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.3.2 Sampling the Sparse IBP Vector T. . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.3 Sampling the Real-valued Matrix V . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.3.4 Sampling the Factor Matrix F . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.5 Sampling the Idiosyncratic Noise Term . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.6 Sampling IBP Hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.3.7 Sampling the Factor Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.5.1 Nonparametric Gene-Factor Modeling and Variable Selection . . . . . . 23 3.5.2 Hierarchical Factor Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.5.3 Factor Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.6 Conclusions and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4. NONPARAMETRIC BAYESIAN FACTOR ANALYSIS WITH MULTIPLE MODALITIES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2 Canonical Correlation Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2.1 Probabilistic CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3 The Infinite Canonical Correlation Analysis Model . . . . . . . . . . . . . . . . . . 38 4.3.1 The Infinite CCA Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . 39 4.3.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3.3 Sampling B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3.4 Sampling V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3.5 Sampling Z . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.4 Multitask Learning Using Infinite CCA . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.4.1 Fully Supervised Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.4.2 A Semisupervised Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.5.1 Infinite CCA Results on Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . 43 4.5.2 Infinite CCA Applied to Multilabel Prediction . . . . . . . . . . . . . . . . . . 44 4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5. MULTITASK LEARNING USING NONPARAMETRIC BAYESIAN PREDICTOR SUBSPACES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.2 Latent Subspace Model for Task Parameters . . . . . . . . . . . . . . . . . . . . . . . 49 5.3 Infinite Latent Subspace Models for Multitask Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.3.1 An Augmented Model for Learning Task Basis . . . . . . . . . . . . . . . . . 53 5.4 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4.1 Sampling B . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.4.2 Sampling V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 5.4.3 Sampling Aθ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.4.4 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.4.5 Sampling in the Augmented Model . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.5 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 5.8 A Mixture of Subspaces Model for Multitask Learning . . . . . . . . . . . . . . . 60 5.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6. NONPARAMETRIC MIXTURE OF SUBSPACES FOR MULTITASK LEARNING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.2 Mixture of Factor Analyzers-based Generative Model for MTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 6.2.1 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 6.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.5 Future Work and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 7. BEAM SEARCH-BASED MAP INFERENCE FOR THE INDIAN BUFFET PROCESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . 77 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 7.2 Infinite Latent Feature Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 7.3 Search-based MAP Estimate for IBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 7.4 Upper Bounding the Prior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 7.4.1 Upper Bounding w.r.t. Already Selected Dishes . . . . . . . . . . . . . . . . 81 7.4.2 Upper Bounding w.r.t. the New Dishes . . . . . . . . . . . . . . . . . . . . . . . 82 7.5 Upper Bounding the Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 7.5.1 A Trivial Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 7.5.2 An Inadmissible Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 7.5.3 A Clustering-Based Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 7.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 7.6.1 Baselines and Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 7.6.2 E-Coli Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 7.6.3 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 7.6.4 Factor Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.6.5 (Approximate) MAP as an Initializer . . . . . . . . . . . . . . . . . . . . . . . . . 89 7.6.6 Comparison with Greedy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 7.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 7.8 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 8. 
SPACE-EFFICIENT SEQUENTIAL INFERENCE FOR THE INDIAN BUFFET PROCESS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 8.2 Particle Filtering for IBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 8.3 Improved Particle Filtering for IBP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 8.3.1 Computing the Mixture Weights . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 8.3.2 Sampling from the Proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 8.3.3 Computing the Conditional Probabilities . . . . . . . . . . . . . . . . . . . . . . 100 8.3.4 The Full Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 8.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 8.4.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 8.4.2 Block-Images Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 8.4.3 Breast-Cancer Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 8.4.4 Computation vs Storage Trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 8.4.5 Comparison with Batch Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 8.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 8.6 Future Work and Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 8.7 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 9. CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . 108 9.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . 108 APPENDIX: NONPARAMETRIC MIXTURE OF SUBSPACES FOR MULTITASK LEARNING: INFERENCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 LIST OF FIGURES 1.1 Factor Analysis with relationship among latent factors. xn is a high-dimensional observation, fn are the low-dimensional latent factors, and A is the factor-loading matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.1 Pictorial illustration of the IBP with N = 4 and eventual K = 4 unique latent features. In the IBP, customers correspond to datapoints and dishes correspond to latent features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2 Pictorial illustration of an n-coalescent with n = 15 individuals . . . . . . . . . . 12 2.3 A basic Factor Analysis model. xn is a high-dimensional observation, fn are the low-dimensional latent factors, A is the factor-loading matrix. . . . . . . 13 3.1 The graphical model for nonparametric Bayesian Factor Regression. X consists of response variables as well. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Training and test data are combined together and test responses are treated as missing values to be imputed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3 Adding a new node to the tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4 True factor loadings for the synthetic data with P=50, K=8 generated using connectivity matrix of e-coli data. White rectangles represent active sites. The data also have added noise with signal-to-noise-ratio of 10. . . . . . . . . . . 24 3.5 Inferred factor loadings (with our approach) for the synthetic data with P=50, K=8 generated using connectivity matrix of e-coli data. . . . . . . . . . . . . . . . . 
25 3.6 Inferred factor loadings with the evolutionary-search-based approach. . . . . . 26 3.7 IBP based sparse Factor Analysis without feature selection . . . . . . . . . . . . . . 27 3.8 IBP based sparse Factor Analysis with variable selection . . . . . . . . . . . . . . . 28 3.9 Bayesian factor regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.10 Hierarchical factor modeling results. (a) Factor loadings for e-coli data. (b) Inferred hierarchy for e-coli data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.11 Hierarchical factor modeling results. (a) Factor loadings for breast-cancer data. (b) Inferred hierarchy for breast-cancer data. . . . . . . . . . . . . . . . . . . . . 32 3.12 Convergence plots: (a) MSE on the breast-cancer data for BFRM (horizontal line), our model with Gaussian (top red curved line) and Coalescent (bottom blue curved line) priors. (b) Log-likelihoods for our model with Gaussian (bottom red curved line) and Coalescent (top blue curved line) priors. . . . . . . 33 4.1 The graphical model depicts the fully supervised case when all variables X and Y are observed. The semisupervised case can have X and/or Y consisting of missing values as well. The graphical model structure remains the same . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.1 Predictor subspace model. Top: our basic model. Bottom: the augmented model using both task parameters and input data. X in the augmented model can additionally also consist of unlabeled data. Noise hyperparameters not shown for the sake of brevity. In both the models, the shaded nodes are observed, and the remaining ones (including the matrix consisting of task parameters, and the noise hyperparameters) are latent variables to be learned. 52 5.2 Performance comparison between both our Multitask Learning models, and Bayesian logistic regression trained separately for each task. 
Top: Accuracy with varying training data size. Bottom: AUC score with varying training data size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 6.1 A graphical depiction of our model. The task parameters θ are sampled from a DP-IBP mixture and used to generate the Y values. . . . . . . . . . . . . . . . . . . 65 6.2 The hierarchical model. The cluster indicator variable z is implicit in the draw from the DP. The Beta-Bernoulli draw for bkt approximates the IBP for large K (actual K will be inferred from the data). . . . . . . . . . . . . . . . . . . 65 6.3 Variational approximation. Top: the distribution being approximated. Bottom: Our approximating Q distribution (note: P(Y |θ) is lower-bounded directly) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.4 Synthetic Data. Top: Plot of the correlation matrix of the ground-truth weight vectors of the 50 tasks. Bottom: Inferred correlation matrix . . . . . . . . 71 6.5 Average accuracies w.r.t. varying amount of training data (top: landmine data, bottom: 20ng data). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 7.1 The generic IBP search algorithm (takes the scoring function as input). . . . . . 79 7.2 Scalability results of various algorithms for the E-Coli dataset . . . . . . . . . . . 87 7.3 Scalability results of various algorithms for the Synthetic dataset . . . . . . . . . 88 7.4 Log-likelihood scores for random vs search-based MAP initialized Gibbs Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 8.1 Inference quality vs number of particles. (Top) Error vs number of particles on synthetic data. (Middle) Error vs number of particles on block-images data. (Bottom) Log-joint-probability vs number of particles on breast-cancer data. 
Results for each sampler are averaged over 10 runs with random initializations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 LIST OF TABLES 3.1 Factor regression results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.1 Results on the multilabel classification task. Bold face indicates the best performance. Model-1 and Model-2 scores are averaged over 10 runs with different initializations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.1 Comparison of Bayesian logistic regression, pooling approach, kernel stick-breaking approach (yaxue), our basic model (model-1), and our augmented model (model-2), for two multilabel datasets. Bold face implies the best performance. Results are averaged over 10 runs with different initializations. 59 6.1 Mean squared error (MSE) of various methods on multitask regression problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6.2 Multitask classification accuracies of various methods on the Landmine and 20ng datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 7.1 Results on the E-coli data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 7.2 Latent factor-based classification results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 8.1 Comparison with batch methods (first and second column: block-images data; third and fourth column: breast-cancer data) . . . . . . . . . . . . . . . . . . . . . 106 ACKNOWLEDGMENTS I will forever be thankful to Hal Daume for being all I could ask for in an adviser. Hal provided me with just the right amount of "hand-holding" in my initial days when I was totally new to machine learning, had faith in me, and then gave me independence and flexibility while I started forging my own path. 
Hal is incredible in so many ways. The way he balances technical discussions with the right amount of technical rigor and intuition is a rare quality. I will always strive to match the high standards he has set both as a researcher and a teacher. I thank Suresh Venkatasubramanian for being a great source of motivation, for his infectious enthusiasm about research, for the constant encouragement, and useful advice. Suresh has the amazing ability of seeing the connections between concepts from seemingly different topics, and discussions with him always brought useful insights. I thank Jordan Boyd-Graber, Larry Carin, and Tom Fletcher for serving on my PhD committee and providing helpful feedback. I thank Scott DuVall at the VA Healthcare System for involving me in various projects in his group and helping me contribute. I thank Sneha Kasera for the encouragement and also involving me in the project on perimeter distinction. I thank Matthias Seeger for having me as a summer intern where I learned a great deal about variational methods and compressive sensing. I also thank Jeff Phillips - although I did not have many interactions with him, I always learned something whenever we did meet. I am thankful to Ellen Riloff for sheltering me in the NLP lab for all these years. Many thanks go to the excellent School of Computing staff, especially Ann Carlstrom and Karen Feinauer for all the help with the paperwork, logistics, and beyond. I must thank all my co-authors - Amit Goyal, Jiarong Jiang, Abhishek Kumar, Amrish Kapoor, Alexandre Passos, Avishek Saha, Anusua Trivedi, and Junxing Zhang - for collaborating with me on various projects. I thank Abhishek for various brainstorming sessions on the phone and Skype. Interactions with him always led to useful insights on many problems. I thank Avishek for his constant volley of questions and discussions which resulted in various research ideas. It was great collaborating with Alexandre. Discussions with him were always insightful. 
I admire Jiarong for her perseverance. I also thank members of the NLP/ML/Data group at Utah - Nathan, Ruihong, Lalindra, Ashequl, Siddharth Patwardhan, David Price, Sean Igo, Jagadeesh, Arvind, John Moeller, and Parasaran - for their company and various discussions. I thank the students of the Machine Learning class I taught in Fall 2011. Interactions with them gave me new perspectives into teaching (thanks to Ross Whitaker for giving me the opportunity to teach this class). My friends in Utah made my stay here truly worthwhile. Words would never be enough to express what I feel for Niladrish and Anusua. I will always cherish their constant care and support, and all the fun and wonderful times spent with them. I was also extremely lucky to have as roommates (at different points in time) Sachin Goyal, Naveen Muralimanohar, Avishek Saha, Soumya Kar, and (possibly my last roommate) Arijit Banerjee. I thank all of them for their caring ways, friendship, and their great company. I thank Swetha for the friendship, and for the exquisite use of her artistic sensibilities in drawing my avatars. Thanks to Protonu Basu for his visits to my house and making me make tea for him. Finally, this thesis would never have come into being without the love and support of my parents and family. I dedicate this thesis to my mother and father whose strength through the struggles and hardships of life kept me motivated and driven. They may not be eloquent with their wishes but I know they must be very happy to see me reach this point. Special thanks to Madhavi for her love and distractions. CHAPTER 1 INTRODUCTION The ubiquity of complex and high-dimensional datasets is presenting ever-increasing challenges in modern-day data analysis problems. 
More and more application domains are nowadays witnessing the phenomenon of data deluge: advances in microarray technology have made it feasible to acquire high-throughput gene-expression measurements; the explosion of the world-wide-web has led to the creation of text and other multimedia collections of enormous scales; prevalence of networks of various types (social networks, coauthorship networks, etc.) has generated huge amounts of data about the social-personal preferences of people; and so on. Translating this wealth of information into useful knowledge is not always easy and often requires uncovering and understanding the latent structures that underlie these data. A natural and principled way of accomplishing this is to come up with a statistical model of the data generation process in terms of these underlying latent structures, and to explain the data in terms of these structures. Such an explanation of the data can help in uncovering the complex relationships underlying the data and, by providing a succinct and rich representation, can also help in dealing with problems resulting from the noisy and high-dimensional nature of the data. A key question is how to model these latent structures. Probabilistic modeling (Bishop, 2006), by virtue of providing a flexible and natural generative model of the data, is an appealing way of modeling the data. In particular, taking a Bayesian approach to probabilistic modeling allows incorporating prior knowledge about these structures and gives a principled and coherent way of performing inference in the model. This is done by specifying a prior distribution on the model parameters and using Bayes' rule to compute the posterior distribution of the model parameters given data. Complexity control is an important issue while specifying any model. Bayesian methods provide an elegant way of accomplishing this by endowing each model parameter with a prior distribution which (implicitly) acts as a regularizer. 
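The prior-to-posterior recipe just described can be made concrete with its simplest conjugate instance. The snippet below is purely illustrative (it is not code from this dissertation): a Beta prior on a coin's bias is updated with Bernoulli observations, and the prior pseudo-counts `a` and `b` play exactly the regularizer role mentioned above. The function name and numbers are hypothetical.

```python
def beta_bernoulli_posterior(a, b, observations):
    """Conjugate update: Beta(a, b) prior + Bernoulli data -> Beta posterior.

    Bayes' rule reduces to adding the observed counts of ones (heads)
    and zeros (tails) to the prior pseudo-counts.
    """
    heads = sum(observations)
    tails = len(observations) - heads
    return a + heads, b + tails

# A uniform Beta(1, 1) prior updated with four coin flips.
a_post, b_post = beta_bernoulli_posterior(1.0, 1.0, [1, 1, 0, 1])
posterior_mean = a_post / (a_post + b_post)
```

Here the posterior mean (4/6) lies between the empirical frequency (3/4) and the prior mean (1/2): the prior's implicit regularization at work.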
However, specifying the right level of complexity remains a challenge. Parametric prior distributions assume a fixed model complexity that is independent of the data. This is undesirable since fixing the model complexity a priori, before even seeing the data, seems unnatural. Ideally, it is desirable to have models that are flexible enough to adjust their complexity as warranted by the data. Nonparametric Bayesian methods (Gershman and Blei, 2012) are designed precisely with this motivation. These methods provide a flexible modeling paradigm for data without restricting the model complexity a priori, which avoids the need for model selection, a process that is both time-consuming and error-prone. Moreover, nonparametric methods allow the model complexity to adapt as more and more data are observed: the model can "create parameters" to explain the data as and when the data warrant it. This is more appropriate than having a model with a predefined, fixed complexity that does not depend on the data. This thesis focuses on developing new nonparametric Bayesian models for learning latent structures, and designing efficient inference methods for these models. Specifically, two types of latent structures are considered in this thesis: (1) low-dimensional latent factors underlying high-dimensional data, with the additional property that the latent factors are not independent of each other but are related via an a priori unknown structure, and (2) latent task structures capturing how a set of multiple learning tasks (e.g., classification or regression) relate to each other, and leveraging this task structure for sharing information across multiple tasks in order to improve learning. This paradigm is commonly known as learning to learn (Heskes, 2000) or Multitask Learning (Caruana, 1997). Efficient inference in nonparametric Bayesian models remains an open problem.
To this end, this thesis presents two efficient inference methods for the Indian Buffet Process (Ghahramani et al., 2007), which is a nonparametric Bayesian latent feature model. In particular, for the nonparametric Bayesian latent feature model, which posits each observation as being generated by a small (and a priori unknown) number of latent features, the thesis presents two inference methods: (1) a search-based maximum-a-posteriori (MAP) inference method, and (2) a Sequential-Monte-Carlo-based fully Bayesian inference method that allows processing one observation at a time, while maintaining a compact approximation of the posterior distribution.

1.1 Overview of Methods and Contributions

Here, we give a brief overview of the methods developed as part of this thesis. In particular, the thesis can be divided into three parts: (1) designing nonparametric Bayesian latent factor models for high-dimensional data, (2) designing nonparametric Bayesian models for capturing and leveraging the latent relatedness structure for jointly solving multiple learning tasks, and (3) designing efficient inference methods for nonparametric Bayesian latent feature models.

1.1.1 Nonparametric Bayesian Latent Factor Models

The first contribution of this thesis is a nonparametric Bayesian Factor Analysis model with the following key properties: (1) the number of latent factors need not be known, (2) the latent factors are not assumed to be independent of each other, and (3) not all observed features in the data are considered relevant for the Factor Analysis task. Property (2) is of particular interest in many problems. For example, in gene-expression analysis, where the factors correspond to biological pathways, the pathways are known to be related to each other. In topic-modeling-based text analysis, factors correspond to topics and the topics tend to be related to each other (by varying degrees); see Figure 1.1 for a pictorial illustration.
Having a Factor Analysis model that captures the dependencies among the factors is therefore desirable. Our model also naturally extends to the task of factor regression (West, 2003), which involves simultaneous learning of latent factors and predicting the responses associated with each sample, given a set of training samples with their responses. The nonparametric latent factor model (Ghahramani et al., 2007) has the limitation that it can only learn latent factors underlying a single feature representation of the objects. Often, however, objects are associated with multiple feature representations. For example, a given collection of webpages can be represented using different types of features such as the page-text, the anchor-text on hyperlinks pointed towards them, the images appearing in them, the social tags associated with them, and so on. For such cases, the thesis presents a nonparametric Bayesian Canonical Correlation Analysis (CCA) model that allows learning latent factors shared across multiple feature representations (or modalities).

Figure 1.1. Factor Analysis with relationship among latent factors. x_n is a high-dimensional observation, f_n are the low-dimensional latent factors, and A is the factor-loading matrix.

Another useful application of such a model is to the problem of multilabel prediction using CCA, where one modality is the features in each example and the other modality is the responses/labels associated with each example; the role of CCA is to perform a response/label-guided latent feature extraction. These latent features can then be used with a supervised learner.
1.1.2 Nonparametric Bayesian Learning of Latent Task Structures

The second contribution of this thesis is a general nonparametric Bayesian framework for inferring how a set of learning problems relate to each other, and leveraging this knowledge to jointly solve these problems. This problem setting is commonly known as Multitask Learning (Caruana, 1997). Multitask Learning critically relies on the assumption of how different tasks relate to each other. An incorrect assumption not supported by the dataset can end up hurting the performance. It is therefore desirable to have a Multitask Learning model that, instead of having an a priori fixed notion of task relatedness, can adapt its assumption based on the data. With this motivation in mind, the thesis presents a generative model of the task parameters (e.g., the weight vectors of a linear classification/regression model), assuming that the task parameters of multiple tasks are drawn from a shared Mixture of Factor Analyzers (MFA) model (Ghahramani and Hinton, 1997). By giving it a nonparametric Bayesian treatment, the resulting model achieves considerable modeling flexibility and is shown to subsume several previously proposed Multitask Learning models as its special cases, while being more flexible and robust than these models.

1.1.3 Efficient Inference for the Nonparametric Latent Feature Models

The third contribution of this thesis is designing efficient inference methods for nonparametric Bayesian methods, in particular, for the Indian Buffet Process (IBP), which is a nonparametric latent feature model (Ghahramani et al., 2007). To this end, the thesis develops two approximate inference methods:

• The first method is a beam-search-based approximate maximum-a-posteriori (MAP) inference method for the IBP.
This method is motivated by the fact that in many practical cases, we do not require the full posterior distribution of the latent feature assignment matrix but only seek the single highest-probability (MAP) solution. In such cases, a fast method that can provide the MAP estimate may be more desirable than sampling-based methods such as MCMC, or optimization-based methods such as variational inference, which explore the full posterior distribution and are therefore usually slow.

• The second method is a Sequential-Monte-Carlo-based inference method which provides samples from the full posterior distribution and has the appealing property that it can process the observations in an online manner (i.e., one observation at a time). This is desirable both for scalability and for many practical scenarios where observations arrive one at a time. Moreover, our proposed method is an improvement over the existing SMC-based method for the IBP, which ignores the most recent observation, whereas ours allows incorporating the most recent observation in the inference. We show that our proposed method leads to improved inference quality as well as a considerably more succinct representation of the posterior distribution as compared to the standard SMC-based inference for the IBP (Wood and Griffiths, 2007).

1.1.4 Thesis Statement

Nonparametric Bayesian methods, combined with efficient inference strategies, can provide flexible ways to design models that can (a) learn low-dimensional latent features from high-dimensional data, (b) infer relatedness of these latent features, and (c) solve multiple related learning problems by inferring latent shared predictive structures.

1.2 Thesis Organization

The rest of the chapters of the thesis are organized as follows:

Chapter 2 provides a brief background on the models and concepts on which the subsequent chapters are based.
In particular, it covers nonparametric Bayesian methods such as the Dirichlet Process, the Indian Buffet Process, and Kingman's Coalescent. In addition, the chapter provides a brief background on Factor Analysis and Multitask Learning.

Chapter 3 describes the nonparametric Bayesian Factor Analysis model. We discuss how the model learns the correct number of latent factors, allows the factors to be related via an a priori unknown hierarchy, and filters away noisy features in the data for more robust Factor Analysis.

Chapter 4 describes the multiview generalization of the nonparametric latent factor model. In particular, we describe how we can use the Indian Buffet Process to design a nonparametric Bayesian version of the Canonical Correlation Analysis model.

Chapter 5 describes a nonparametric Bayesian model we propose for Multitask Learning. The model is based on the assumption that the weight vectors of a collection of potentially related tasks live on a low-dimensional subspace. This is equivalent to the weight vectors being generated as a linear combination of a set of basis tasks. We describe how, by placing a Factor Analysis model on the weight vectors, Multitask Learning can be accomplished, and how using the Indian Buffet Process allows us to circumvent model selection issues in such a model.

Chapter 6 builds on the model described in Chapter 5. We show how replacing the single factor analyzer by a mixture of factor analyzers allows us to capture considerably richer notions of task relatedness and can provide a general framework for modeling task relatedness.

Chapter 7 describes our proposed beam-search algorithm for doing approximate MAP inference for the IBP. By experimental comparisons with other state-of-the-art methods, we show that this method can be a viable alternative to methods based on sampling or variational inference.
Chapter 8 describes our proposed Sequential-Monte-Carlo-based (SMC) inference method for the IBP, and discusses its differences with the standard SMC-based inference method for the IBP proposed in (Wood and Griffiths, 2007).

Chapter 9 presents a discussion and concludes with some possible directions for future work.

CHAPTER 2

BACKGROUND

This chapter provides a brief background on nonparametric Bayesian methods, in particular the Dirichlet Process, the Indian Buffet Process, and Kingman's Coalescent, which will be used as building blocks for the models developed in this thesis. The chapter also provides a brief background on Latent Factor Analysis, Mixture of Factor Analyzers, and Multitask Learning, for which the proposed models in the thesis have been developed.

2.1 Nonparametric Bayesian Methods

In any data analysis task, choosing the appropriate model complexity is a critical issue. For example, in data clustering, one needs to specify the number of clusters; in dimensionality reduction, one needs to specify the dimensionality of the lower-dimensional space; in regression or classification, one needs to specify the functional form of the input-output relationship, which is typically a parametric model defined by a fixed set of parameters. In all these cases, the number of parameters (number of clusters, dimensionality of the lower-dimensional space, or the number of parameters in the regression/classification model) does not depend on the data and needs to be specified a priori. Nonparametric Bayesian methods take an entirely different approach to this problem of model selection. Instead of prespecifying the model complexity a priori, these methods assume the model to have an unbounded complexity to begin with, and let the eventual complexity be determined by the amount of data. Essentially, these methods can adapt the model complexity by creating parameters as and when dictated by the data. Note that the name nonparametric here is somewhat a misnomer.
It does not mean that the model does not have any parameters. It means that the number of parameters is potentially infinite but limited by the data. What is important here is that it does not need to be specified a priori.

2.2 Dirichlet Process

The Dirichlet Process (DP) is a prior distribution over discrete distributions (Ferguson, 1973). Discreteness implies that if one draws samples from a distribution drawn from the DP, the samples will cluster: new samples take the same value as older samples with some positive probability. A DP is defined by two parameters: a concentration parameter α and a base measure G0. The sampling process defining the DP draws the first sample from the base measure G0. Each subsequent sample takes on a new value drawn from G0 with probability proportional to α, or reuses a previously drawn value with probability proportional to the number of samples having that value. This property makes it suitable as a prior for effectively infinite mixture models, where the number of mixture components can grow as new samples are observed. Our mixture of factor analyzers-based MTL model uses the DP to model the mixture components so we do not need to specify their number a priori.

2.3 Indian Buffet Process

The Indian Buffet Process (IBP) (Ghahramani et al., 2007) is a nonparametric Bayesian prior that defines a distribution over infinite binary matrices. The IBP was originally motivated by the need to model the latent feature structure of a given set of observations. The IBP, due to its flexibility, has been a model of choice in a variety of nonparametric Bayesian applications, such as factorial structure learning, learning causal structures, modeling dyadic data, modeling overlapping clusters, and others (Ghahramani et al., 2007). In the latent feature model, each observation can be thought of as consisting of a set of latent features.
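The sequential sampling scheme of the DP from Section 2.2 (reuse an old value with probability proportional to its count, draw a fresh one with probability proportional to α) can be made concrete with a short simulation, often called the Chinese restaurant process view. This is only an illustrative sketch; the function name and the choice of a standard Gaussian base measure G0 are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def crp_sample(n, alpha, base=lambda: rng.normal()):
    """Draw n samples from G ~ DP(alpha, G0) via the Chinese restaurant
    process; G0 defaults to N(0, 1) purely for illustration."""
    values, counts, samples = [], [], []
    for _ in range(n):
        # reuse value j w.p. proportional to counts[j]; new value w.p. prop. to alpha
        probs = np.array(counts + [alpha], dtype=float)
        j = rng.choice(len(probs), p=probs / probs.sum())
        if j == len(values):
            values.append(base())   # a fresh atom drawn from the base measure
            counts.append(1)
        else:
            counts[j] += 1
        samples.append(values[j])
    return samples
```

Running this repeatedly shows the clustering property: the number of distinct values grows only logarithmically with n for fixed α.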
Given an N × D matrix X of N observations having D features each, we can consider a decomposition of the form X = ZA + E, where Z is an N × K binary feature-assignment matrix describing which features are present in each observation: Z_{n,k} is 1 if feature k is present in observation n, and is 0 otherwise. A is a K × D matrix of feature scores, and the matrix E consists of observation-specific noise. A crucial issue in such models is choosing the number K of latent features. The standard formulation of the IBP lets us define a prior over the binary matrix Z such that it can have an unbounded number of columns, and it can thus be a suitable prior in problems dealing with such structures. The IBP derivation starts by defining a finite model over the K columns of an N × K binary matrix:

P(Z) = \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma\!\left(m_k + \frac{\alpha}{K}\right)\Gamma(N - m_k + 1)}{\Gamma\!\left(N + 1 + \frac{\alpha}{K}\right)}

Here m_k = \sum_{i} Z_{ik}. In the limiting case K → ∞, it was shown in (Ghahramani et al., 2007) that the binary matrix Z generated by the IBP is equivalent to one produced by a sequential stochastic process. This process can best be understood by a culinary analogy of customers coming to an Indian restaurant and selecting dishes from an infinite array of dishes. In this analogy, customers represent observations (rows of X) and dishes represent latent features (columns of Z). Customer 1 selects Poisson(α) dishes to begin with. Thereafter, each incoming customer n selects an existing dish k with probability m_k/n, where m_k denotes how many previous customers chose that particular dish. Customer n then goes on to additionally select Poisson(α/n) new dishes. This process generates a binary matrix Z with rows representing customers and columns representing dishes. Many real-world datasets have a sparseness property, which means that each observation depends only on a subset of all the K latent features. This means that the binary matrix Z is expected to be reasonably sparse for many datasets.
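The culinary process just described is straightforward to simulate. The following sketch (function name and use of NumPy are our own illustrative choices) generates one draw of Z from the one-parameter IBP:

```python
import numpy as np

rng = np.random.default_rng(0)

def ibp_sample(N, alpha):
    """Simulate one draw of the binary matrix Z from the one-parameter
    IBP with N customers (rows) and concentration alpha."""
    K = rng.poisson(alpha)                 # dishes tried by customer 1
    rows = [np.ones(K, dtype=int)]
    m = np.ones(K)                         # m_k: popularity of each dish so far
    for n in range(2, N + 1):
        # existing dish k is chosen with probability m_k / n
        old = (rng.random(len(m)) < m / n).astype(int)
        k_new = rng.poisson(alpha / n)     # Poisson(alpha / n) new dishes
        m = np.concatenate([m + old, np.ones(k_new)])
        rows = [np.pad(r, (0, k_new)) for r in rows]
        rows.append(np.concatenate([old, np.ones(k_new, dtype=int)]))
    return np.vstack(rows)
```

The number of columns (dishes) in the returned matrix is random and grows roughly as α log N, while each row stays sparse, matching the sparseness property discussed above.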
These properties make the IBP a suitable choice for capturing the underlying sparsity in addition to automatically discovering the number of latent features. Figure 2.1 shows a pictorial illustration of the IBP.

Figure 2.1. Pictorial illustration of the IBP with N = 4 and eventual K = 4 unique latent features. In the IBP, customers correspond to datapoints and dishes correspond to latent features.

2.4 Kingman's Coalescent

Our model makes use of a latent hierarchical structure over factors; we use Kingman's Coalescent (Kingman, 1982) as a convenient prior distribution over hierarchies. Kingman's Coalescent originated in the study of population genetics for a set of single-parent organisms. The Coalescent is a nonparametric model over a countable set of organisms. It is most easily understood in terms of its finite-dimensional marginal distributions over n individuals, in which case it is called an n-coalescent. We then take the limit n → ∞. In our case, the individuals are factors. The n-coalescent considers a population of n organisms at time t = 0. We follow the ancestry of these individuals backward in time, where each organism has exactly one parent at time t < 0. The n-coalescent is a continuous-time, partition-valued Markov process, which starts with n singleton clusters at time t = 0 and evolves backward, coalescing lineages until there is only one left. We denote by t_i the time at which the ith coalescent event occurs (note t_i ≤ 0), and by δ_i = t_{i−1} − t_i the time between events (note δ_i > 0). Under the n-coalescent, each pair of lineages merges independently with exponential rate 1, so δ_i ∼ Exp(\binom{n−i+1}{2}). With probability one, a random draw from the n-coalescent is a binary tree with a single root at t = −∞ and n individuals at time t = 0. We denote the tree structure by π. The marginal distribution over tree topologies is uniform and independent of coalescent times, and the model is infinitely exchangeable. We therefore consider the limit as n → ∞, called the coalescent.
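The n-coalescent generative process (waiting times δ_i ∼ Exp(C(n−i+1, 2)) between merges of uniformly chosen lineage pairs) can be simulated in a few lines. This is a sketch of the prior only, not of any inference procedure; the event representation is our own illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_n_coalescent(n):
    """Sample a tree from the n-coalescent prior.

    Returns merge events as (time, left_child, right_child, new_node)
    tuples; leaves are 0..n-1 and times t_i <= 0 run backward from t = 0.
    """
    lineages = list(range(n))      # lineages alive at the current time
    events, t, next_id = [], 0.0, n
    for i in range(1, n):          # the i-th coalescent event
        k = len(lineages)          # k = n - i + 1 lineages remain
        rate = k * (k - 1) / 2     # each of the C(k, 2) pairs merges at rate 1
        t -= rng.exponential(1.0 / rate)   # delta_i ~ Exp(C(k, 2))
        a, b = rng.choice(k, size=2, replace=False)
        left, right = lineages[a], lineages[b]
        for idx in sorted((a, b), reverse=True):
            lineages.pop(idx)
        lineages.append(next_id)   # the merged (parent) lineage
        events.append((t, left, right, next_id))
        next_id += 1
    return events
```

Since the pair to merge is chosen uniformly at each event, the induced distribution over tree topologies is uniform, matching the exchangeability property noted above.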
See Figure 2.2 for a pictorial illustration. Once the tree structure is obtained, one can define an additional Markov process to evolve over the tree. One common choice is a Brownian diffusion process. In Brownian diffusion in D dimensions, we assume an underlying positive semi-definite diffusion covariance Λ ∈ R^{D×D}. The root is a D-dimensional vector z. Each nonroot node in the tree is drawn Gaussian with mean equal to the value of its parent, and variance δ_i Λ, where δ_i is the time that has passed. Recently, Teh et al. (Teh et al., 2008) proposed efficient bottom-up agglomerative inference algorithms for the coalescent. These (approximately) maximize the probability of π and the δ_i, marginalizing out internal nodes by Belief Propagation. If we associate with each node in the tree a mean message y and a variance message v, the messages are updated as in Eq. (2.1), where i is the current node and l_i and r_i are its children:

v_i = \left[ (v_{l_i} + (t_{l_i} - t_i)\Lambda)^{-1} + (v_{r_i} + (t_{r_i} - t_i)\Lambda)^{-1} \right]^{-1}    (2.1)
y_i = \left[ y_{l_i} (v_{l_i} + (t_{l_i} - t_i)\Lambda)^{-1} + y_{r_i} (v_{r_i} + (t_{r_i} - t_i)\Lambda)^{-1} \right] v_i

Figure 2.2. Pictorial illustration of an n-coalescent with n = 15 individuals.

2.5 Factor Analysis

Factor Analysis (Bartholomew and Knott, 1999) is the task of explaining data by means of a small set of latent factors. One of the first applications of Factor Analysis can be found in the psychology community, in an attempt to explain intelligence using a small set of latent traits or factors. More formally, given a set of observations {x_1, . . . , x_N}, Factor Analysis attempts to explain each observation x_n ∈ R^D using a smaller number of latent factors f_n ∈ R^K (K ≪ D) as follows:

x_n = A f_n + ε_n

where A denotes the factor loading matrix of size D × K and ε_n denotes the observation-specific noise (typically assumed to be Gaussian) not explained by the latent factors. Figure 2.3 shows a pictorial illustration of a standard Factor Analysis model.
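As a concrete illustration of the generative equation x_n = A f_n + ε_n, the following sketch samples a synthetic Factor Analysis dataset; the dimensions and noise scale are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: D observed features, K latent factors (K << D), N samples.
D, K, N = 50, 5, 200
A = rng.normal(size=(D, K))            # factor loading matrix
F = rng.normal(size=(K, N))            # latent factors, f_n ~ N(0, I_K)
E = 0.1 * rng.normal(size=(D, N))      # observation-specific Gaussian noise
X = A @ F + E                          # column n is x_n = A f_n + eps_n
```

Inference runs in the opposite direction: given only X, Factor Analysis recovers A and F (up to rotation and scaling of the latent space).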
2.6 Mixture of Factor Analyzers

A mixture of factor analyzers (MFA) model generalizes the standard Factor Analysis model by assuming that, for each observation, we first select a factor analyzer from a collection of factor analyzers and then generate the observation using that factor analyzer.

Figure 2.3. A basic Factor Analysis model. x_n is a high-dimensional observation, f_n are the low-dimensional latent factors, and A is the factor-loading matrix.

Suppose z(n) denotes the index of the chosen factor analyzer for the nth observation x_n. The generative story for this observation under the MFA can be written as:

x_n = μ_{z(n)} + A_{z(n)} f_n + ε_n

Note that, unlike in standard Factor Analysis, in an MFA we also have a mean μ ∈ R^D associated with each factor analyzer. Therefore, each factor analyzer is parameterized by a pair {μ, A} of a mean and a factor loading matrix. An MFA model can also be seen as a local dimensionality reduction method, with different local factor analyzers performing dimensionality reduction in different regions of space. Seen another way, an MFA model performs data clustering while simultaneously performing dimensionality reduction within each cluster. This can be especially useful in clustering high-dimensional data when the number of datapoints is small. Standard clustering methods such as a mixture of Gaussians would be prone to overfitting in such high-dimensional, small sample-size cases because they fit a mixture of full-rank Gaussians. On the other hand, an MFA can be seen as fitting a mixture of low-rank Gaussians (note that a factor analyzer is akin to a low-rank Gaussian), thereby preventing overfitting.

2.7 Multitask Learning

Learning problems do not exist in a vacuum. Often, one is tasked with developing not one, but many classifiers for different tasks.
In these cases, there is often not enough data to learn a good model for each task individually; real-world examples include prioritizing email messages across many users' inboxes (Aberdeen et al., 2011) and recommending items to users on web sites (Ning and Karypis, 2010). In these settings, it is advantageous to transfer or share information across tasks. Multitask Learning (MTL) (Caruana, 1997) encompasses a range of techniques to share statistical strength across models for various tasks, and allows learning even when the amount of labeled data for each individual task is very small. Most MTL methods achieve this improved performance by assuming some notion of similarity across tasks. For example:

• Parameters of all the tasks are close to a shared mean parameter. Probabilistically, this is equivalent to the parameters of all the tasks being drawn from a shared Gaussian distribution (Chelba and Acero, 2006).

• Parameters of all the tasks exhibit a clustering structure (Jacob and Bach, 2008, Xue et al., 2007b). Probabilistically, this is equivalent to the parameters of all the tasks being drawn from a mixture of Gaussian distributions.

• Parameters of all the tasks live on a low-dimensional subspace (Rai and Daumé III, 2010), or all the tasks have a common set of relevant features (Argyriou et al., 2007).

• Task relationships can be captured by modeling the task covariance matrix (Bonilla et al., 2007, Zhang and Yeung, 2010).

Choosing a model whose task similarity assumptions are consistent with the given Multitask Learning problem is critical; incorrect assumptions can end up degrading the performance.

CHAPTER 3

NONPARAMETRIC BAYESIAN SPARSE LATENT FACTOR MODEL

In this chapter, we describe our proposed nonparametric Bayesian Factor Analysis model that simultaneously learns the number of factors as well as the relationships among the factors.
Moreover, our method also allows simultaneous feature selection, so that only relevant features in the data participate in the Factor Analysis.

3.1 Introduction

Factor Analysis is the task of explaining data by means of a set of latent factors. Factor regression couples this analysis with a prediction task, where the predictions are made solely on the basis of the factor representation. The latent factor representation achieves two-fold benefits: (1) discovering the latent process underlying the data; (2) simpler predictive modeling through a compact data representation. In particular, (2) is motivated by the problem of prediction in the "large P small N" paradigm (West, 2003), where the number of features P greatly exceeds the number of examples N, potentially resulting in overfitting. We address three fundamental shortcomings of standard Factor Analysis approaches (Beal et al., 2005, Sabatti and James, 2005, Sanguinetti et al., 2006, West, 2003): (1) we do not assume a known number of factors; (2) we do not assume factors are independent; (3) we do not assume all features are relevant to the Factor Analysis. Our motivation for this work stems from the task of reconstructing regulatory structure from gene-expression data. In this context, factors correspond to regulatory pathways. Our contributions thus parallel the needs of gene pathway modeling. In addition, we couple predictive modeling (for factor regression) within the Factor Analysis framework itself, instead of having to model it separately. Our factor regression model is fundamentally nonparametric. In particular, we treat the gene-to-factor relationship nonparametrically by proposing a sparse variant of the Indian Buffet Process (IBP) (Ghahramani et al., 2007), designed to account for the sparsity of relevant genes (features). We couple this IBP with a hierarchical prior over the factors.
This prior captures the fact that pathways are fundamentally related: some are involved in transcription, some in signaling, some in synthesis. The nonparametric nature of our sparse IBP requires that the hierarchical prior also be nonparametric. A natural choice is Kingman's Coalescent (Kingman, 1982), a popular distribution over infinite binary trees. Since our motivation is an application in bioinformatics, our notation and terminology will be drawn from that area. In particular, genes are features, samples are examples, and pathways are factors. However, our model is more general. An alternative application might be to a collaborative filtering problem, in which case our genes might correspond to movies, our samples might correspond to users, and our pathways might correspond to genres. In this context, all three contributions of our model still make sense: we do not know how many movie genres there are; some genres are closely related (romance to comedy versus to action); many movies may be spurious. Our model uses a variant of the Indian Buffet Process (Section 2.3) to model the feature-factor (i.e., gene-pathway) relationships. We further use Kingman's Coalescent (Section 2.4) to model latent pathway hierarchies.

3.2 Nonparametric Bayesian Factor Regression

Recall the standard Factor Analysis problem: X = AF + E, for standardized data X. X is a P × N matrix consisting of N samples [x_1, ..., x_N] of P features each. A is the factor loading matrix of size P × K, and F = [f_1, ..., f_N] is the factor matrix of size K × N. E = [e_1, ..., e_N] is the matrix of idiosyncratic variations. K, the number of factors, is known. Recall that our goal is to treat the Factor Analysis problem nonparametrically, to model feature relevance, and to model hierarchical factors. For expository purposes, it is simplest to deal with each of these issues in turn. In our context, we begin by modeling the gene-factor relationship nonparametrically (using the IBP).
Next, we propose a variant of the IBP to model gene relevance. We then present the hierarchical model for inferring factor hierarchies. We conclude with a presentation of the full model and our mechanism for modifying the Factor Analysis problem to factor regression.

3.2.1 Nonparametric Gene-Factor Model

We begin by directly using the IBP to infer the number of factors. Although the IBP has been applied to nonparametric Factor Analysis in the past (Ghahramani et al., 2007), the standard formulation places the IBP prior on the factor matrix (F), associating samples (i.e., a set of features) with factors. Such a model assumes that the sample-factor relationship is sparse. However, this assumption is inappropriate in the gene-expression context, where it is not the factors themselves but the associations among genes and factors (i.e., the factor loading matrix A) that are sparse. In such a context, each sample depends on all the factors, but each gene within a sample usually depends only on a small number of factors. Thus, it is more appropriate to model the factor loading matrix (A) with the IBP prior. Note that since A and F are related to each other via the number of factors K, modeling A nonparametrically allows our model to also have an unbounded number of factors. For most gene-expression problems (West, 2003), a binary factor loading matrix (A) is inappropriate. Therefore, we instead use the Hadamard (element-wise) product of a binary matrix Z and a matrix V of reals. Z and V are of the same size as A. The Factor Analysis model, for each sample i, thus becomes:

x_i = (Z ⊙ V) f_i + e_i

We have Z ∼ IBP(α, β). α and β are IBP hyperparameters and have vague gamma priors on them. Our initial model assumes no factor hierarchies, and hence the prior over V would simply be a Gaussian: V ∼ Nor(0, σ_v^2 I) with an inverse-gamma prior on σ_v. F has a zero-mean, unit-variance Gaussian prior, as used in standard Factor Analysis.
Finally, e_i ∼ Nor(0, Ψ) models the idiosyncratic variations of genes, where Ψ is a P × P diagonal matrix diag(Ψ_1, ..., Ψ_P). Each entry Ψ_p has an inverse-gamma prior on it.

3.2.2 Feature Selection Prior

Typical gene-expression datasets are of the order of several thousands of genes, most of which are not associated with any pathway (factor). In the above, these are accounted for only by the idiosyncratic noise term. A more realistic model is that certain genes simply do not participate in the Factor Analysis. In the culinary analogy, some of the genes that enter the restaurant leave before selecting any dishes. We will refer to such genes as "spurious". We add an additional prior term to account for such spurious genes, effectively leading to a sparse solution (over the rows of the IBP matrix). It is important to note that this notion of sparsity is fundamentally different from the conventional notion of sparsity in the IBP. The sparsity in the IBP is over columns, not rows. To see the difference, recall that the IBP contains a "rich get richer" phenomenon: frequently selected factors are more likely to get reselected. Consider a truly spurious gene and ask whether it is likely to select any factors. If some factor k is already frequently used, then a priori, this gene is more likely to select it. The only downside to selecting it is the data likelihood; by setting the corresponding value in V to zero, there is no penalty. Our sparse-IBP prior is identical to the standard IBP prior with one exception: each customer (gene) p is associated with a Bernoulli random variable T_p that indicates whether it samples any dishes. The T_p share a parameter ρ, which, in turn, is given a Beta prior with parameters a, b.

3.2.3 Hierarchical Factor Model

In our basic model, each column of the matrix Z (and the corresponding column in V) is associated with a factor. These factors are considered unrelated.
To model the fact that factors are, in fact, related, we introduce a factor hierarchy. Kingman's coalescent (Kingman, 1982) is an attractive prior to integrate with the IBP for several reasons: it is nonparametric and describes exchangeable distributions, which means it can model a varying number of factors, and efficient inference algorithms exist for it (Teh et al., 2008).

3.2.4 Full Model and Extension to Factor Regression

Our proposed graphical model is depicted in Figure 3.1. The key aspects of this model are the IBP prior over Z, the sparse binary vector T, and the coalescent prior over V. In standard Bayesian factor regression (West, 2003), Factor Analysis is followed by the regression task, and the regression is performed only on the basis of F rather than the full data X. For example, a simple linear regression problem would involve estimating a K-dimensional parameter vector θ with regression value θ⊤F. Our model, on the other hand, integrates the factor regression component into the nonparametric Factor Analysis framework itself. We do so by prepending the responses y_i to the expression vectors x_i and joining the training and test data (see Figure 3.2).

Figure 3.1. The graphical model for nonparametric Bayesian Factor Regression. X consists of response variables as well.

The unknown responses in the test data are treated as missing variables to be iteratively imputed in our MCMC inference procedure. It is straightforward to see that this is equivalent to fitting another sparse model relating factors to responses. Our model thus allows the Factor Analysis to take the regression task into account as well. In the case of binary responses, we add an extra probit regression step to predict binary outcomes from real-valued responses.

3.3 Inference

Exact inference is intractable in our model and, therefore, we use Gibbs sampling with a few Metropolis-Hastings steps to perform approximate inference.
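Concretely, each Gibbs sweep cycles through the conditional updates derived in the following subsections. As one illustrative piece, the sketch below (hypothetical function names, NumPy only, assuming the Gaussian likelihood of the model above) resamples an existing entry of Z using the two-parameter IBP conditional p(Z_ik = 1 | −) ∝ m_{−i,k}/(P + β − 1) · p(X | Z, V, F, Ψ) described in Section 3.3.1:

```python
import numpy as np

def log_lik(X, Z, V, F, psi):
    """Gaussian log-likelihood of X under X = (Z*V)F + noise,
    with diagonal noise variances psi (length P)."""
    R = X - (Z * V) @ F
    return (-0.5 * np.sum(R**2 / psi[:, None])
            - 0.5 * X.shape[1] * np.sum(np.log(2 * np.pi * psi)))

def resample_existing_Zik(X, Z, V, F, psi, i, k, beta, rng):
    """One Gibbs draw for an existing entry Z[i,k]:
    p(Z_ik=1|-) proportional to m_{-i,k}/(P+beta-1) * p(X|Z,V,F,Psi)."""
    P = Z.shape[0]
    m = Z[:, k].sum() - Z[i, k]              # m_{-i,k}
    logp = np.empty(2)
    for val in (0, 1):
        Z[i, k] = val
        prior = (P + beta - 1 - m) if val == 0 else m
        # guard log(0): a dish chosen by nobody else has prior mass 0
        logp[val] = np.log(max(prior, 1e-300)) + log_lik(X, Z, V, F, psi)
    p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
    Z[i, k] = float(rng.random() < p1)
    return Z[i, k]
```

The remaining updates (new dishes, T, V, F, Ψ, hyperparameters) follow analogously from the conditionals given below.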
3.3.1 Sampling the IBP Matrix Z

Sampling Z consists of sampling existing dishes, proposing new dishes, and accepting or rejecting them based on the acceptance ratio in the associated M-H step. For sampling existing dishes, an entry of Z is set to 1 with probability
p(Z_ik = 1 | X, Z_{-ik}, V, F, Ψ) ∝ (m_{-i,k} / (P + β − 1)) p(X | Z, V, F, Ψ),
and set to 0 with probability
p(Z_ik = 0 | X, Z_{-ik}, V, F, Ψ) ∝ ((P + β − 1 − m_{-i,k}) / (P + β − 1)) p(X | Z, V, F, Ψ),
where m_{-i,k} = Σ_{j≠i} Z_jk is the number of other customers who chose dish k.

For sampling new dishes, we use an M-H step in which we simultaneously propose η = (K_new, V_new, F_new), where K_new ∼ Poisson(αβ/(β + P − 1)). We accept the proposal with acceptance probability (following Meeds et al., 2007) a = min{1, p(rest | η*) / p(rest | η)}, where p(rest | η) is the likelihood of the data given parameters η and η* is the proposed state. We propose V_new from its prior (either Gaussian or coalescent) but, for faster mixing, we propose F_new from its posterior.

Figure 3.2. Training and test data are combined together and test responses are treated as missing values to be imputed.

Sampling V_new from the coalescent is slightly more involved. As shown pictorially in Figure 3.3, proposing a new column of V corresponds to adding a new leaf node to the existing coalescent tree. In particular, we need to find a sibling s for the new node y′ and an insertion point on the branch joining the sibling s to its parent p (the grandparent of y′). Since the marginal distribution over trees under the coalescent is uniform, the sibling s is chosen uniformly over the nodes in the tree. We then use importance sampling to select an insertion time for the new node y′ between t_s and t_p, according to the exponential distribution given by the coalescent prior (our proposal distribution is uniform). This gives an insertion point in the tree, which corresponds to the new parent of y′. We denote this new parent by p′ and the time of insertion by t.
The predictive density of the newly inserted node y′ can be obtained by marginalizing out the parent p′. This yields Nor(y_0, v_0), given by
v_0 = [(v_s + (t_s − t)Λ)^{-1} + (v_p + (t − t_p)Λ)^{-1}]^{-1}
y_0 = [y_s / (v_s + (t_s − t)Λ) + y_p / (v_p + (t − t_p)Λ)] v_0.

Figure 3.3. Adding a new node to the tree.

Here, y_s and v_s are the messages passed up through the tree, while y_p and v_p are the messages passed down through the tree (compare to Eq (2.1)).

3.3.2 Sampling the Sparse IBP Vector T

In the sparse IBP prior, recall that we have P additional variables T_p, indicating whether gene p "eats" any dishes. T_p is drawn from a Bernoulli with parameter ρ, which, in turn, is given a Bet(a, b) prior. For inference, we collapse ρ and Ψ and obtain a Gibbs posterior over T_p of the form
p(T_p = 1 | ·) ∝ (a + Σ_{q≠p} T_q) Stu(x_p | (Z_p ⊙ V_p)F, g/h, g)
p(T_p = 0 | ·) ∝ (b + P − Σ_{q≠p} T_q) Stu(x_p | 0, g/h, g),
where Stu is the nonstandard Student's t-distribution and g, h are the hyperparameters of the inverse-gamma prior on the entries of Ψ.

3.3.3 Sampling the Real-valued Matrix V

When V has a Gaussian prior, we sample V from its posterior p(V_{g,j} | X, Z, F, Ψ) = Nor(V_{g,j} | μ_{g,j}, Σ_{g,j}), where
Σ_{g,j} = (Σ_{i=1}^N F_{j,i}² / ψ_g + 1/σ_v²)^{-1} and μ_{g,j} = Σ_{g,j} (Σ_{i=1}^N F_{j,i} X*_{g,i}) ψ_g^{-1}.
We define X*_{g,i} = X_{g,i} − Σ_{l≠j} A_{g,l} F_{l,i}, with A = Z ⊙ V. The hyperparameter σ_v on V has an inverse-gamma prior, and its posterior has the same form. For the case with a coalescent prior on V, we have
Σ_{g,j} = (Σ_{i=1}^N F_{j,i}² / ψ_g + 1/v_{0j})^{-1} and μ_{g,j} = Σ_{g,j} (Σ_{i=1}^N F_{j,i} X*_{g,i} / ψ_g + y_{0g,j}/v_{0j}),
where y_0 and v_0 are the Gaussian posterior parameters of the leaf node added to the coalescent tree (see Eq (2.1)), corresponding to the column of V being sampled.
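The Gaussian-prior update for a single entry V_{g,j} can be sketched as follows. This is a minimal illustration of the posterior mean and variance above (hypothetical function name, not the full sampler):

```python
import numpy as np

def resample_V_entry(X, Z, V, F, psi, sigma_v, g, j, rng):
    """Gibbs draw for V[g,j] under a Nor(0, sigma_v^2) prior:
    posterior variance  s  = (sum_i F_ji^2/psi_g + 1/sigma_v^2)^{-1}
    posterior mean      mu = s * (sum_i F_ji * Xstar_gi) / psi_g,
    where Xstar removes the contribution of all factors l != j."""
    A = Z * V
    # residual of row g with factor j's contribution removed
    Xstar = X[g] - (A[g] @ F - A[g, j] * F[j])
    s = 1.0 / (np.sum(F[j] ** 2) / psi[g] + 1.0 / sigma_v ** 2)
    mu = s * np.sum(F[j] * Xstar) / psi[g]
    V[g, j] = mu + np.sqrt(s) * rng.normal()
    return V[g, j]
```

The coalescent-prior case replaces the zero prior mean and variance σ_v² with the leaf posterior (y_{0g,j}, v_{0j}) as given above.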
3.3.4 Sampling the Factor Matrix F

We sample F from a normal distribution with mean F̄ = A⊤(AA⊤ + Ψ)^{-1}X and covariance Σ = I − A⊤(AA⊤ + Ψ)^{-1}A, where A = Z ⊙ V.

3.3.5 Sampling the Idiosyncratic Noise Term

We place an inverse-gamma prior on the diagonal entries of Ψ, and the posterior is also inverse-gamma: p(ψ_p | ·) = IG(g + N/2, h / (1 + (h/2) tr(E⊤E))), where E = X − (Z ⊙ V)F.

3.3.6 Sampling IBP Hyperparameters

We sample the IBP hyperparameter α from its posterior: p(α | ·) = Gam(K₊ + a, b / (1 + b H_P(β))), where K₊ is the number of active features at any moment and H_P(β) = Σ_{i=1}^P 1/(β + i − 1). β is sampled from a prior proposal using an M-H step.

3.3.7 Sampling the Factor Tree

We use the Greedy-Rate1 algorithm of (Teh et al., 2008).

3.4 Related Work

A number of probabilistic approaches have been proposed in the past for the problem of gene-regulatory network reconstruction (Beal et al., 2005, Sabatti and James, 2005, Sanguinetti et al., 2006, West, 2003). Some take into account information on the prior network topology (Sabatti and James, 2005), which is not always available. Most assume the number of factors is known. To get around this, one can perform model selection via Reversible Jump MCMC (Green, 1995) or evolutionary stochastic model search (Carvalho et al., 2008). Unfortunately, these methods are often difficult to design and may take quite long to converge. Moreover, they are difficult to integrate with other forms of prior knowledge (e.g., factor hierarchies). A somewhat similar approach to ours is the infinite independent component analysis (iICA) model of (Knowles and Ghahramani, 2007), which treats Factor Analysis as a special case of ICA. However, their model is limited to Factor Analysis and does not take into account feature selection, factor hierarchies, or factor regression. As a generalization of the standard ICA model, (Bach and Jordan, 2003) proposed a model in which the components can be related via a tree-structured graphical model.
It, however, assumes a fixed number of components. Structurally, our model with Gaussian V (i.e., no hierarchy over factors) is most similar to the Bayesian Factor Regression Model (BFRM) of (West, 2003). BFRM assumes a sparsity-inducing mixture prior on the factor loading matrix A. Specifically, A_pk ∼ (1 − π_pk)δ_0(A_pk) + π_pk Nor(A_pk | 0, τ_k), where δ_0(·) is a point mass centered at zero. To complete the model specification, they define π_pk ∼ (1 − ρ_k)δ_0(π_pk) + ρ_k Bet(π_pk | sr, s(1 − r)) and ρ_k ∼ Bet(ρ_k | av, a(1 − v)). Integrating out π_pk gives A_pk ∼ (1 − vρ_k)δ_0(A_pk) + vρ_k Nor(A_pk | 0, τ_k). It is interesting to note that the nonparametric prior of our model (factor loading matrix defined as A = Z ⊙ V) is actually equivalent to the (parametric) sparse mixture prior of BFRM as K → ∞. To see this, note that our prior on the factor loading matrix A (composed of Z having an IBP prior and V having a Gaussian prior) can be written as A_pk ∼ (1 − ρ_k)δ_0(A_pk) + ρ_k Nor(A_pk | 0, σ_v²) if we define ρ_k ∼ Bet(1, αβ/K). It is easy to see that, for BFRM where ρ_k ∼ Bet(av, a(1 − v)), setting a = 1 + αβ/K and v = 1 − αβ/(aK) recovers our model in the limiting case K → ∞.

3.5 Experiments

In this section, we report our results on synthetic and real datasets. We compare our nonparametric approach with the evolutionary-search-based approach proposed in (Carvalho et al., 2008), which is a nonparametric extension of BFRM. We used the gene-factor connectivity matrix of the E. coli network (described in (Pournara and Wernisch, 2007)) to generate a synthetic dataset having 100 samples of 50 genes and 8 underlying factors. Since we knew the ground truth for the factor loadings in this case, this dataset was ideal for testing efficacy in recovering the factor loadings (binding sites and number of factors). We also experimented with a real gene-expression breast-cancer dataset having 251 samples of 226 genes and 5 prominent underlying factors (we know this from domain knowledge).
3.5.1 Nonparametric Gene-Factor Modeling and Variable Selection

For the synthetic dataset generated by the E. coli network, the results compare the actual network used to generate the data (Figure 3.4), the factor loading matrix inferred by our method (Figure 3.5), and that inferred by BFRM (Figure 3.6). As shown in Figure 3.5, our method recovered exactly the same number (8) of factors, and almost exactly the same factor loadings (binding sites and number of factors) as the ground truth. In comparison, the evolutionary-search-based BFRM overestimated the number of factors, and the inferred loadings are clearly off from the actual loadings (even modulo column permutations).

Figure 3.4. True factor loadings for the synthetic data with P=50, K=8, generated using the connectivity matrix of the E. coli data. White rectangles represent active sites. The data also have added noise with a signal-to-noise ratio of 10.

Figure 3.5. Inferred factor loadings (with our approach) for the synthetic data with P=50, K=8, generated using the connectivity matrix of the E. coli data.

Figure 3.6. Inferred factor loadings with the evolutionary-search-based approach.

Our results on real data are shown in Figure 3.7, Figure 3.8, and Figure 3.9. To see the effect of variable selection on these data, we also introduced spurious genes by adding 50 random features to each sample. We observe the following: (1) without variable selection, spurious genes result in an overestimated number of factors and falsely discovered factor loadings for the spurious genes (see Figure 3.7); (2) variable selection, when on, effectively filters out spurious genes without overestimating the number of factors (see Figure 3.8). We also investigated the effect of noise on the evolutionary-search-based approach; it resulted in an overestimated number of factors, plus falsely discovered factor loadings for spurious genes (see Figure 3.9). To conserve space, we do not show the case with no spurious genes in the data, but it turns out that variable selection does not filter out any of the 226 relevant genes in that case.

Figure 3.7. IBP-based sparse Factor Analysis without feature selection.

Figure 3.8. IBP-based sparse Factor Analysis with variable selection.

Figure 3.9. Bayesian factor regression model.

3.5.2 Hierarchical Factor Modeling

Our results with hierarchical factor modeling are shown in Figure 3.10 and Figure 3.11 for the synthetic and real data. As shown, the model correctly infers the gene-factor associations, the number of factors, and the factor hierarchy. There are several ways to interpret the hierarchy. From the factor hierarchy for the E. coli data (Figure 3.10 (b)), we see that column 2 (corresponding to factor 2) of the V matrix is the most prominent one (it regulates the highest number of genes) and is closest to the tree root, followed by the column to which it looks most similar. Columns corresponding to less prominent factors are located further down in the hierarchy (with appropriate relatedness). Figure 3.11 (b) can be interpreted in a similar manner for the breast-cancer data. The hierarchy can be used to find factors in order of their prominence: the higher up the hierarchy we cut the tree, the more prominent the factors we discover. For instance, if we are interested only in the top 2 factors in the E. coli data, we can cut the tree above the sixth coalescent point.
This is akin to agglomerative clustering, which is usually done post hoc; in contrast, our model discovers the factor hierarchies as part of the inference procedure itself. At the same time, there is no degradation in data reconstruction (in the mean-squared-error sense) or in the log-likelihood when compared to the case with a Gaussian prior on V (see Figure 3.12; they actually improve). We also show in Section 3.5.3 that hierarchical modeling results in better predictive performance on the factor regression task. Empirical evidence also suggests that the factor hierarchy leads to faster convergence, since most of the unlikely configurations are never visited, being constrained by the hierarchy.

Figure 3.10. Hierarchical factor modeling results. (a) Factor loadings for the E. coli data. (b) Inferred hierarchy for the E. coli data.

Figure 3.11. Hierarchical factor modeling results. (a) Factor loadings for the breast-cancer data. (b) Inferred hierarchy for the breast-cancer data.

Figure 3.12. Convergence plots: (a) MSE on the breast-cancer data for BFRM (horizontal line), and for our model with Gaussian (top red curve) and coalescent (bottom blue curve) priors. (b) Log-likelihoods for our model with Gaussian (bottom red curve) and coalescent (top blue curve) priors.
3.5.3 Factor Regression

We report factor regression results for binary and real-valued responses, comparing both variants of our model (Gaussian V and coalescent V) against three different approaches: logistic regression, BFRM, and fitting a separate predictive model on the discovered factors (see Table 3.1 and Figure 3.12). The breast-cancer dataset has two binary response variables (phenotypes) associated with each sample. For this binary prediction task, we split the data into a training set of 151 samples and a test set of 100 samples. This is essentially a transduction setting, as described in Section 3.2.4 and shown in Figure 3.2. For the real-valued prediction task, we treated a 30×20 block of the data matrix as held-out data and predicted it from the remaining entries of the matrix. This method of evaluation is akin to the task of image reconstruction (Verbeek et al., 2004). The results are averaged over 20 random initializations, and the low error variances suggest that our method is fairly robust with respect to initialization.

3.6 Conclusions and Discussion

We have presented a fully nonparametric Bayesian approach to sparse factor regression, modeling the gene-factor relationship using a sparse variant of the IBP. The true power of nonparametric priors is evidenced by the ease with which task-specific models can be integrated into the framework. Both gene selection and hierarchical factor modeling are straightforward extensions of our model that do not significantly complicate the inference procedure, but lead to improved model performance and more understandable outputs. We applied Kingman's coalescent as a hierarchical model on V, the matrix modulating the expression levels of genes in factors. Table 3.1.
Factor regression results.

Model      Binary (% error, std dev)   Real (MSE)
LogReg     17.5 (1.6)                  -
BFRM       19.8 (1.4)                  0.48
Nor-V      15.8 (0.56)                 0.45
Coal-V     14.6 (0.48)                 0.43
PredModel  18.1 (2.1)                  -

CHAPTER 4

NONPARAMETRIC BAYESIAN FACTOR ANALYSIS WITH MULTIPLE MODALITIES

In this chapter, we present a generalization of the nonparametric Bayesian latent factor model and show how we can extract latent factors shared between two or more modalities. We consider a special case of supervised dimensionality reduction for the multilabel prediction setting using Canonical Correlation Analysis (CCA), where the first modality is the features of the data and the second modality is the label matrix. However, our model is more general and can be applied to latent factor learning from multimodal data such as a collection of webpages, which can be represented using different types of features: the page text, the anchor text of hyperlinks pointing to them, the images appearing in them, the social tags associated with them, and so on.

4.1 Introduction

Learning with examples having multiple labels is an important problem in machine learning and data mining. Such problems are encountered in a variety of application domains. For example, in text classification, a document (e.g., a newswire story) can be associated with multiple categories. Likewise, in bioinformatics, a gene or protein usually performs several functions. All these settings suggest a common underlying problem: predicting multivariate responses. When the responses come from a discrete set, the problem is termed multilabel classification. This setting is a special case of Multitask Learning (Caruana, 1997) in which predicting each label is a task and all the tasks share a common source of input. An important characteristic of these problems is that the labels are not independent of each other but often have significant correlations with each other.
A naïve approach to learning in such settings is to train a separate classifier for each label. However, such an approach ignores the label correlations and leads to suboptimal performance (Ueda and Saito, 2003). In this chapter, we show how Canonical Correlation Analysis (CCA) (Hotelling, 1936) can be used to exploit label relatedness, learning multiple prediction problems simultaneously. CCA is a useful technique for modeling dependencies between two (or more) sets of variables. One important application of CCA is supervised dimensionality reduction, albeit in the more general setting where each example has several labels. In this setting, CCA on an input-output pair (X, Y) can be used to project the inputs X to a low-dimensional space directed by the label information Y. This makes CCA an ideal candidate for extracting useful predictive features from data in the context of multilabel prediction problems.

The classical CCA formulation, however, has certain inherent limitations. It is nonprobabilistic, which means that it cannot deal with missing data and precludes a Bayesian treatment, which can be important if the dataset is small. An even more crucial issue is choosing the number of correlation components, which is traditionally dealt with by cross-validation or model selection (Wang, 2007). Another issue is the potential sparsity (Sriperumbudur et al., 2009) of the underlying projections, which is ignored by the standard CCA formulation. Building upon the recently suggested probabilistic interpretation of CCA (Bach and Jordan, 2005), we propose a nonparametric, fully Bayesian framework that can deal with each of these issues. In particular, the proposed model can automatically select the number of correlation components and effectively capture the sparsity underlying the projections.
Our framework is based on the Indian Buffet Process (Ghahramani et al., 2007), a nonparametric Bayesian model for discovering the latent feature representation of a set of observations. In addition, our probabilistic model can deal with missing data and, in the supervised dimensionality reduction case, can incorporate any additional unlabeled data one may have access to, making our CCA algorithm work in a semisupervised setting. Thus, apart from being a general, nonparametric, fully Bayesian solution to the CCA problem, our framework can readily be applied to learning useful predictive features from labeled (or partially labeled) data in the context of learning a set of related tasks.

This chapter is organized as follows. Section 4.2 introduces the CCA problem and its recently proposed probabilistic interpretation. In Section 4.3, we describe our general framework for infinite CCA. Section 4.4 gives a concrete example of an application (multilabel learning) where the proposed approach can be applied. In particular, we describe a fully supervised setting (when the test data are not available at training time), and a semisupervised setting with partial labels (when we have access to the test data at training time). We describe our experiments in Section 4.5, and discuss related work in Section 8.5, drawing connections between the proposed method and previous approaches to this problem.

4.2 Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a useful technique for modeling the relationships among a set of variables. CCA computes a low-dimensional shared embedding of a set of variables such that the correlation among the variables is maximized in the embedded space. More formally, given a pair of variables x ∈ R^{D1} and y ∈ R^{D2}, CCA seeks linear projections u_x and u_y such that the variables are maximally correlated in the projected space.
The correlation coefficient between the two variables in the embedded space is given by
ρ = (u_x⊤ xy⊤ u_y) / √((u_x⊤ xx⊤ u_x)(u_y⊤ yy⊤ u_y)).
Since the correlation is not affected by rescaling of the projections u_x and u_y, CCA is posed as a constrained optimization problem:
max_{u_x, u_y} u_x⊤ xy⊤ u_y, subject to: u_x⊤ xx⊤ u_x = 1, u_y⊤ yy⊤ u_y = 1.
It can be shown that the above formulation is equivalent to solving the following generalized eigenvalue problem:
[0, Σ_xy; Σ_yx, 0] [u_x; u_y] = ρ [Σ_xx, 0; 0, Σ_yy] [u_x; u_y],
where Σ denotes the covariance matrix of size D × D (with D = D1 + D2) obtained from the data samples X = [x_1, ..., x_n] and Y = [y_1, ..., y_n].

4.2.1 Probabilistic CCA

Bach and Jordan (Bach and Jordan, 2005) gave a probabilistic interpretation of CCA by posing it as a latent variable model. To see this, let x and y be two random vectors of size D1 and D2, and consider the following latent variable model:
z ∼ Nor(0, I_K), K ≤ min{D1, D2}
x ∼ Nor(μ_x + W_x z, Ψ_x), W_x ∈ R^{D1×K}, Ψ_x ⪰ 0
y ∼ Nor(μ_y + W_y z, Ψ_y), W_y ∈ R^{D2×K}, Ψ_y ⪰ 0.
Equivalently, we can write the above as [x; y] ∼ Nor(μ + Wz, Ψ), where μ = [μ_x; μ_y], W = [W_x; W_y], and Ψ is a block-diagonal matrix with Ψ_x and Ψ_y on its diagonal; [.; .] denotes row-wise concatenation. The latent variable z is shared between x and y. Bach and Jordan (2005) showed that, given the maximum likelihood solution for the model parameters, the expectations E(z|x) and E(z|y) of the latent variable z lie in the same subspace that classical CCA finds, thereby establishing the equivalence between the above probabilistic model and CCA.

The probabilistic interpretation opens the door to several extensions of the basic setup of (Bach and Jordan, 2005), which used maximum likelihood for parameter estimation. However, it still assumes an a priori fixed number of canonical correlation components. Another important issue, usually ignored, is the sparsity of the underlying projection matrix.
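The classical formulation above can be implemented directly. The sketch below (hypothetical helper names, NumPy only) solves the generalized eigenvalue problem in its equivalent whitened-SVD form, in which the singular values of Σ_xx^{-1/2} Σ_xy Σ_yy^{-1/2} are the canonical correlations; a small ridge term is added for numerical stability:

```python
import numpy as np

def cca(X, Y, reg=1e-8):
    """Classical CCA for column-sample matrices X (D1 x N), Y (D2 x N).
    Returns projection matrices for x and y and the canonical
    correlations (singular values of the whitened cross-covariance)."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    N = X.shape[1]
    Sxx = X @ X.T / N + reg * np.eye(X.shape[0])
    Syy = Y @ Y.T / N + reg * np.eye(Y.shape[0])
    Sxy = X @ Y.T / N

    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        w, U = np.linalg.eigh(S)
        return U @ np.diag(1.0 / np.sqrt(w)) @ U.T

    Wx, Wy = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, rho, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    # back-transform to projections u_x, u_y in the original spaces
    return Wx @ U, Wy @ Vt.T, rho
```

In practice, one then inspects rho to decide how many components carry significant correlation, which is exactly the model-selection burden the infinite CCA model of the next section removes.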
4.3 The Infinite Canonical Correlation Analysis Model

Recall that the CCA problem can be defined as [x; y] ∼ Nor(Wz, Ψ) (assuming centered data). A crucial issue in the CCA model is choosing the number of canonical correlation components, which is set to a fixed value in classical CCA (and even in its probabilistic extensions). In a Bayesian formulation of CCA, one can use the Automatic Relevance Determination (ARD) prior (Bishop, 1999) on the projection matrix W as a way to select this number. However, it would be more appropriate to have a principled way to determine this number automatically from the data.

We propose a nonparametric Bayesian model that selects the number of canonical correlation components automatically. More specifically, we use the Indian Buffet Process (Ghahramani et al., 2007) (Section 2.3) as a nonparametric prior on the projection matrix W. The IBP prior allows W to have an unbounded number of columns, which provides a way to automatically determine the dimensionality K of the latent space associated with Z.

4.3.1 The Infinite CCA Model

In our proposed framework, the matrix W consisting of the canonical correlation vectors is modeled using an IBP prior. However, since W can be real-valued and the IBP prior is defined only for binary matrices, we represent the (D1 + D2) × K matrix W as B ⊙ V, where B = [B_x; B_y] is a (D1 + D2) × K binary matrix, V = [V_x; V_y] is a (D1 + D2) × K real-valued matrix, and ⊙ denotes their element-wise (Hadamard) product. We place an IBP prior on B, which automatically determines K, and a Gaussian prior on V. Note that B and V have the same number of columns. Under this model, the two random vectors x and y are modeled as x = (B_x ⊙ V_x)z + E_x and y = (B_y ⊙ V_y)z + E_y. Here, z is shared between x and y, and E_x and E_y are observation-specific noise. In the full model, X = [x_1, ..., x_N] is a D1 × N matrix consisting of N samples of D1 dimensions each, and Y = [y_1, ...
, y_N] is another matrix consisting of N samples of D2 dimensions each. The generative story for our basic model (see Figure 4.1) is:
B ∼ IBP(α)
V ∼ Nor(0, σ_v² I), σ_v ∼ IG(a, b)
Z ∼ Nor(0, I)
[X; Y] ∼ Nor((B ⊙ V)Z, Ψ),
where Ψ is a block-diagonal matrix of size D × D (with D = D1 + D2) with Ψ_x and Ψ_y on its diagonal. Both Ψ_x and Ψ_y have inverse-Wishart priors on them. Since our model is probabilistic, it can also deal with missing entries in X or Y. This is particularly important in the case of supervised dimensionality reduction (i.e., X consisting of inputs and Y of associated responses) when the labels for some of the inputs are unknown, making it a model for semisupervised dimensionality reduction with partially labeled data. In addition, placing the IBP prior on the projection matrix W (via the binary matrix B) also helps capture the sparsity in W (see the Results section for evidence).

Figure 4.1. The graphical model depicts the fully supervised case when all variables X and Y are observed. The semisupervised case can have X and/or Y containing missing values as well; the graphical model structure remains the same.

4.3.2 Inference

We take a fully Bayesian approach by treating everything as latent variables and computing posterior distributions over them. We use Gibbs sampling with a few Metropolis-Hastings steps to do inference in this model. In what follows, D denotes the data [X; Y], B = [B_x; B_y], and V = [V_x; V_y].

4.3.3 Sampling B

Sampling the binary IBP matrix B consists of sampling existing dishes, proposing new dishes, and accepting or rejecting them based on the acceptance ratio in the associated M-H step. For sampling existing dishes, an entry of B is set to 1 with probability
p(B_ik = 1 | D, B_{-ik}, V, Z, Ψ) ∝ (m_{-i,k} / D) p(D | B, V, Z, Ψ),
and set to 0 with probability
p(B_ik = 0 | D, B_{-ik}, V, Z, Ψ) ∝ ((D − m_{-i,k}) / D) p(D | B, V, Z, Ψ),
where m_{-i,k} = Σ_{j≠i} B_jk is the number of other customers who chose dish k.
For sampling new dishes, we use an M-H step in which we simultaneously propose η = (K_new, V_new, Z_new), where K_new ∼ Poisson(α/D). We accept the proposal with acceptance probability a = min{1, p(rest | η*) / p(rest | η)}, where p(rest | η) is the probability of the data given parameters η and η* is the proposed state. We propose V_new from its prior (Gaussian) but, for faster mixing, we propose Z_new from its posterior.

4.3.4 Sampling V

We sample the real-valued matrix V from its posterior, which is a normal distribution with variance Σ_{i,k} = (Σ_{n=1}^N Z_{k,n}² / ψ_i + 1/σ_v²)^{-1} and mean μ_{i,k} = Σ_{i,k} (Σ_{n=1}^N Z_{k,n} D*_{i,n}) ψ_i^{-1}, where we define D*_{i,n} = D_{i,n} − Σ_{l≠k} B_{i,l} V_{i,l} Z_{l,n}. The hyperparameter σ_v on V has an inverse-gamma prior, and its posterior has the same form. Note that the number of columns of V is the same as the number of columns of the IBP matrix B.

4.3.5 Sampling Z

We sample Z from its posterior, which is a normal distribution with mean Z̄ = W⊤(WW⊤ + Ψ)^{-1}D and covariance Σ = I − W⊤(WW⊤ + Ψ)^{-1}W, where W = B ⊙ V.

Note that, in our sampling scheme, we treated the matrices B_x and B_y simply as parts of the big IBP matrix B and sampled them together using a single IBP draw. However, one could also sample them separately as two distinct IBP matrices. This would require different IBP draws for B_x and B_y, with some modification of the existing Gibbs sampler. Different IBP draws could result in different numbers of nonzero columns in B_x and B_y. To deal with this, one could first sample B_x (say, having K_x nonzero columns) and B_y (say, having K_y nonzero columns), introduce |K_x − K_y| extra dummy columns in the matrix having the smaller number of nonzero columns, and set all such columns to zero. The effective K for each iteration of the Gibbs sampler would then be max{K_x, K_y}. A similar scheme could be followed for the corresponding real-valued matrices V_x and V_y, sampling them in conjunction with B_x and B_y, respectively.
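As a concrete illustration of the Z update above, a minimal NumPy sketch (hypothetical function name; assumes Ψ is supplied as a full D × D matrix) draws all columns of Z jointly, since every sample shares the same posterior covariance:

```python
import numpy as np

def resample_Z(D, B, V, Psi, rng):
    """Draw the latent matrix Z column-wise from its Gaussian posterior:
    mean  W^T (W W^T + Psi)^{-1} D,
    cov   I - W^T (W W^T + Psi)^{-1} W,
    where W = B * V (Hadamard product)."""
    W = B * V
    K = W.shape[1]
    G = W.T @ np.linalg.inv(W @ W.T + Psi)    # K x Dtot
    mean = G @ D                               # K x N posterior means
    cov = np.eye(K) - G @ W                    # shared posterior covariance
    # jitter keeps the Cholesky factorization numerically stable
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(K))
    return mean + L @ rng.standard_normal(mean.shape)
```

The update for F in Section 3.3.4 has exactly the same form, with A = Z ⊙ V playing the role of W.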
4.4 Multitask Learning Using Infinite CCA

Having set up the framework for infinite CCA, we now describe its applicability to the problem of Multitask Learning. In particular, we consider the setting where each example is associated with multiple labels, so that predicting each individual label becomes a task to be learned. Although one could learn a separate model for each task, doing so would ignore the label correlations. This makes borrowing information across tasks crucial, and makes it imperative to share statistical strength across all the tasks. With this motivation, we apply our infinite CCA model to capture the label correlations and to learn better predictive features from the data by projecting it to a subspace directed by the label information. It has been shown, both empirically and theoretically (Yu et al., 2006), that incorporating label information in dimensionality reduction indeed leads to better projections if the final goal is prediction.

More concretely, let X = [x_1, ..., x_N] be a D × N matrix of predictor variables, and Y = [y_1, ..., y_N] an M × N matrix of response variables (i.e., the labels), with each y_i being an M × 1 vector of responses for input x_i. The labels can take real (for regression) or categorical (for classification) values. The infinite CCA model is applied to the pair X and Y, which is akin to doing supervised dimensionality reduction of the inputs X. Note that the generalized eigenvalue problem posed in this supervised setting of CCA involves the cross-covariance matrix Σ_XY and the label covariance matrix Σ_YY. Therefore, the projection takes into account both the input-output correlations and the label correlations.
Such a subspace is therefore expected to consist of much better predictive features than one obtained by a naïve feature extraction approach such as simple PCA, which completely ignores the label information, or approaches like Linear Discriminant Analysis (LDA) that do take into account label information but ignore label correlations. Multitask learning using the infinite CCA model can be done in two settings, supervised and semisupervised, depending on whether or not the inputs of the test data are involved in learning the shared subspace Z.

4.4.1 Fully Supervised Setting

In the supervised setting, CCA is done on labeled data (X, Y) to give a single shared subspace Z ∈ R^{K×N} that is good across all tasks. A model is then learned in the Z subspace to learn M task parameters {θ_m} ∈ R^{K×1}, where m ∈ {1, ..., M}. Each of the parameters θ_m is then used to predict the labels for the test data of task m. However, since the test data are still D dimensional, we need to either separately project them down onto the K dimensional subspace and do predictions in this subspace, or "inflate" each task parameter back to D dimensions by applying the projection matrix W_x and do predictions in the original D dimensional space. The first option uses the fact that P(Z|X_te) ∝ P(X_te|Z)P(Z), which is a Gaussian Nor(μ_{Z|X}, Σ_{Z|X}) with μ_{Z|X} = (W_x^T Ψ_x^{-1} W_x + I)^{-1} W_x^T Ψ_x^{-1} X_te and Σ_{Z|X} = (W_x^T Ψ_x^{-1} W_x + I)^{-1}. With the second option, we can inflate each learned task parameter back to D dimensions by applying the projection matrix W_x. We choose the second option for the experiments. We call this fully supervised setting model-1.

4.4.2 A Semisupervised Setting

In the semisupervised setting, we combine training data and test data (with unknown labels) as X = [X_tr, X_te] and Y = [Y_tr, Y_te], where the labels Y_te are unknown. The infinite CCA model is then applied to the pair (X, Y), and the part of Y consisting of Y_te is treated as latent variables to be imputed.
With this model, we also get the embeddings for the test data, and thus both training and testing take place in the K dimensional subspace, unlike model-1, in which training is done in the K dimensional subspace but predictions are made in the original D dimensional space. We call this semisupervised setting model-2.

4.5 Experiments

Here, we report our experimental results on several synthetic and real-world datasets. We first show our results with the infinite CCA as a stand-alone algorithm for CCA by using it on a synthetic dataset, demonstrating its effectiveness in capturing the canonical correlations. We then report our experiments on applying the infinite CCA model to the problem of Multitask Learning on two real-world datasets.

4.5.1 Infinite CCA Results on Synthetic Data

In the first experiment, we demonstrate the effectiveness of our proposed infinite CCA model in discovering the correct number of canonical correlation components, and in capturing the sparsity pattern underlying the projection matrix. For this, we generated two datasets of dimensions 25 and 10, respectively, each having 100 samples. For this synthetic dataset, we knew the ground truth (i.e., the number of components and the underlying sparsity of the projection matrix). In particular, the dataset had 4 correlation components with 63% sparsity in the true projection matrix. We then ran both the classical CCA and the infinite CCA algorithm on this dataset. Looking at all the correlations discovered by classical CCA, we found that it discovered 8 components having significant correlations, whereas our model correctly discovered exactly 4 components (we extract the MAP samples for W and Z output by our Gibbs sampler).
Thus, on this small dataset, standard CCA indeed seems to be finding spurious correlations, indicating a case of overfitting (the overfitting problem of classical CCA was also observed in (Klami and Kaski, 2007) when comparing Bayesian versus classical CCA). Furthermore, as expected, the projection matrix inferred by classical CCA had no exact zero entries, and even after thresholding significantly small absolute values to zero, the uncovered sparsity was only about 25%. On the other hand, the projection matrix inferred by the infinite CCA model had 57% exact zero entries and 62% zero entries after thresholding very small values, thereby demonstrating its effectiveness in also capturing the sparsity patterns.

4.5.2 Infinite CCA Applied to Multilabel Prediction

In the second experiment, we use the infinite CCA model to learn a set of related tasks in the context of multilabel prediction. For our experiments, we use two real-world multilabel datasets (Yeast and Scene) from the UCI repository. The Yeast dataset consists of 1500 training and 917 test examples, each having 103 features. The number of labels (or tasks) per example is 14. The Scene dataset consists of 1211 training and 1196 test examples, each having 294 features. The number of labels per example for this dataset is 6. We compare the following models in our experiments.

• Full: Train separate classifiers (SVM) on the full feature set for each task.

• PCA: Apply PCA on training and test data and then train separate classifiers for each task in the low-dimensional subspace. This baseline ignores the label information while learning the low-dimensional subspace.

• CCA: Apply classical CCA on training data to extract the shared subspace, learn a separate model (i.e., task parameters) for each task in this subspace, project the task parameters back to the original D dimensional feature space by applying the projection W_x, and do predictions on the test data in this feature space.
• Model-1: Use our supervised infinite CCA model to learn the shared subspace using only the training data (see Section 4.4.1).

• Model-2: Use our semisupervised infinite CCA model to simultaneously learn the shared subspace for both training and test data (see Section 4.4.2).

The performance metrics used are overall accuracy, F1-Macro, F1-Micro, and AUC (Area Under ROC Curve). For PCA and CCA, we chose the K that gives the best performance, whereas this parameter was learned automatically for both of our proposed models. The results are shown in Table 4.1. As we can see, both of the proposed models do better than the other baselines. Of the two proposed models, we see that model-2 does better in most cases, suggesting that it is useful to incorporate the test data while learning the projections. This is possible in our probabilistic model since we can treat the unknown Ys of the test data as latent variables to be imputed while doing the Gibbs sampling. We note here that our results are for cases where we only had access to a small number of related tasks (Yeast has 14, Scene has 6). We expect the performance improvements to be even more significant when the number of (related) tasks is high.

4.6 Related Work

A number of approaches have been proposed in the recent past for the problem of supervised dimensionality reduction of multilabel data. These include Partial Least Squares (Arenas-García et al., 2006), multilabel informed latent semantic indexing (Yu et al., 2005), and multilabel dimensionality reduction using dependence maximization (MDDM) (Zhou, 2008). None of these, however, deal with the case when the data are only partially labeled. Somewhat similar in spirit to our approach is the work on supervised probabilistic PCA (Yu et al., 2006), which extends probabilistic PCA to the setting where we also have access to labels. However, it assumes a fixed number of components and does not take into account sparsity of the projections.
The CCA-based approach to supervised dimensionality reduction is closely related to the notion of dimension reduction for regression (DRR), which is formally defined as finding a low-dimensional representation z ∈ R^K of inputs x ∈ R^D (K ≪ D) for predicting multivariate outputs y ∈ R^M. An important notion in DRR is that of sufficient dimensionality reduction (SDR) (Fukumizu et al., 2004, Globerson and Tishby, 2003), which states that given z, x and y are conditionally independent, i.e., x ⊥⊥ y | z. As we can see in the graphical model shown in Figure 4.1, the probabilistic interpretation of CCA yields the same condition, with X and Y being conditionally independent given Z.

Table 4.1. Results on the multilabel classification task. Bold face indicates the best performance. Model-1 and Model-2 scores are averaged over 10 runs with different initializations.

Model      Yeast                                    Scene
           Acc     F1-macro  F1-micro  AUC          Acc     F1-macro  F1-micro  AUC
Full       0.5583  0.3132    0.3929    0.5054       0.7565  0.3445    0.3527    0.6339
PCA        0.5612  0.3144    0.4648    0.5026       0.7233  0.2857    0.2734    0.6103
CCA        0.5441  0.2888    0.3923    0.5135       0.7496  0.3342    0.3406    0.6346
Model-1    0.5842  0.3327    0.4402    0.5232       0.7533  0.3630    0.3732    0.6517
Model-2    0.6156  0.3463    0.4954    0.5386       0.7664  0.3742    0.3825    0.6686

Among the DRR-based approaches to dimensionality reduction for real-valued multilabel data, Covariance Operator Inverse Regression (COIR) exploits the covariance structures of both the inputs and the outputs (Kim and Pavlovic, 2009). Please see (Kim and Pavlovic, 2009) for more details on the connection between COIR and CCA. Besides the DRR-based approaches, the problem of extracting useful features from data, particularly with the goal of making predictions, has also been considered in other settings. The information bottleneck (IB) method (Tishby et al., 1999) is one such example.
Given input-output pairs (X, Y), the information bottleneck method aims to obtain a compressed representation T of X that can account for Y. IB achieves this using a single tradeoff parameter to represent the tradeoff between the complexity of the representation of X, measured by I(X; T), and the accuracy of this representation, measured by I(T; Y), where I(·; ·) denotes the mutual information between two variables. In another recent work (Ji and Ye, 2009), a joint learning framework is proposed that performs dimensionality reduction and multilabel classification simultaneously.

In the context of CCA as a stand-alone problem, sparsity is another important issue. In particular, sparsity improves model interpretability and has been receiving considerable attention recently. Existing work on sparsity in CCA includes the double barrelled lasso, which is based on a convex least squares approach (Shawe-Taylor, 2008), and CCA as a sparse solution to the generalized eigenvalue problem (Sriperumbudur et al., 2009), which constrains the cardinality of the solution to the generalized eigenvalue problem to obtain a sparse solution. Another recent solution is based on a direct greedy approach that bounds the correlation at each stage (Wiesel et al., 2008).

The probabilistic approaches to CCA include the works of (Klami and Kaski, 2007) and (Archambeau and Bach, 2008), both of which use an automatic relevance determination (ARD) prior (Bishop, 1999) to determine the number of relevant components, which is a rather ad hoc way of doing so. In contrast, the nonparametric Bayesian alternative proposed here is a more principled method for determining the number of components. We note that the sparse Factor Analysis model proposed in (Rai and Daumé III, 2008) actually falls out as a special case of our proposed infinite CCA model if one of the datasets (X or Y) is absent and the noise covariance matrix Ψ is diagonal.
Moreover, the sparse Factor Analysis model is limited to Factor Analysis, whereas the proposed model can be seen as an infinite generalization of both an unsupervised problem (sparse CCA) and a (semi)supervised problem (dimensionality reduction using CCA with full or partial label information), with the latter being especially relevant for Multitask Learning in the presence of multiple labels.

Finally, Multitask Learning has been tackled using a variety of different approaches, primarily depending on what notion of task relatedness is assumed. Some examples include tasks generated from an IID space (Baxter, 2000), and learning multiple tasks using a hierarchical prior over the task space (Daumé III, 2009, Xue et al., 2007b), among others. In this work, we consider multilabel prediction in particular, based on the premise that a set of such related tasks share an underlying low-dimensional feature space (Ji et al., 2008) that captures the task relatedness.

4.7 Conclusion

We have presented a nonparametric Bayesian model for the Canonical Correlation Analysis problem to discover the dependencies between two sets of variables. In particular, our model does not assume a fixed number of correlation components; this number is determined automatically based only on the data. In addition, our model enjoys sparsity, making it more interpretable. The probabilistic nature of our model also allows dealing with missing data. Finally, we demonstrated the model's applicability to the problem of multilabel learning, where our model, directed by label information, can be used to automatically extract useful predictive features from the data.

CHAPTER 5

MULTITASK LEARNING USING NONPARAMETRIC BAYESIAN PREDICTOR SUBSPACES

In this chapter, we present a nonparametric Bayesian model for the problem of Multitask Learning.
Our model is based on the assumption that the task parameters (e.g., the weight vectors of regression or classification tasks) of the multiple tasks live on a low-dimensional linear subspace. This model will form the building block of a more general model for Multitask Learning that will be presented in the next chapter.

5.1 Introduction

Many learning settings consist of multiple prediction problems that are related to each other in some way. A common instance is multivariate regression or multilabel classification, where each example is associated with several response variables (real-valued for regression, and discrete-valued for classification). For example, given a document, one may be interested in predicting its topic category as well as its author. Clearly, such tasks are expected to be related. A simple way to handle such multiple prediction problems would be to treat them as separate problems and learn separate models for each of them. Such an approach, however, fails to exploit any correlations there may be among these tasks, and it is desirable to share information across tasks if they are related.

Motivated by this idea, a number of techniques have been proposed to exploit task relatedness in order to better learn a set of related tasks, rather than learning them individually. This is commonly known as Multitask Learning (Caruana, 1997), "learning to learn" (Heskes, 2000), inductive bias (Baxter, 2000), or predicting multivariate responses (Breiman and Friedman, 1997), where multiple tasks are pooled together with the goal of improving the generalization performance of all the tasks. The idea is to use some aspect that can be shared across all the tasks in order to share their individual statistical strengths, compensating for the paucity of labeled examples. In this chapter, we consider one such aspect, namely a shared predictor subspace.
The assumption here is that all the task parameters share an underlying basis space, which accounts for the task relatedness. Each individual task can then be represented as a linear combination of a set of basis tasks. Our predictor subspace model is similar in spirit to (Zhang et al., 2006, 2008). In this work, we propose a nonparametric, fully Bayesian framework that can learn this subspace without making any parametric assumptions (e.g., the framework does not assume the intrinsic dimensionality of the subspace to be known a priori). We present two models to learn such a subspace, with special emphasis on cases where the number of tasks and/or the number of examples per task is small. In this chapter, we concentrate on Bayesian linear regression (for regression) and Bayesian logistic regression (for classification). The framework, however, is general and can accommodate a variety of different probabilistic discriminative models. In addition, being a hierarchical Bayesian model, it can easily be extended to a mixture-of-subspaces setting (described in the next chapter), which allows the task parameters to share a nonlinear manifold.

In Section 5.2, we describe the problem setup and our basic framework for modeling task relatedness. Section 5.3 describes both our models. Section 5.4 discusses inference in our model, and Section 5.6 reports experimental results; the remaining sections discuss related work and the mixture extension of our work, and conclude the chapter.

5.2 Latent Subspace Model for Task Parameters

To model task relatedness, we assume that the tasks have an underlying basis space and each actual task is a linear combination of the basis vectors (which act as "source" tasks). More specifically, suppose we have M tasks (regression/classification) represented by task parameters θ_1, ..., θ_M, where θ_m ∈ R^D is the task parameter for the m-th task.
We assume the following generative model for each task parameter:

θ_m = Z A_m + ε_m

Here, Z ∈ R^{D×K} is a matrix in which each column is a D dimensional basis vector, A_m ∈ R^{K×1} is the set of coefficients for the mth task parameter, and ε_m is task-specific noise. The matrix Z under this model defines the latent space underlying the set of predictors, and is shared across all tasks, capturing the task relatedness. The same generative model, with all task parameters grouped together in a matrix Θ = [θ_1 ... θ_M] ∈ R^{D×M}, can be written in matrix form as Θ = Z A_θ + E, where A_θ = [A_1 ... A_M]. Together, the matrix Z of basis tasks and the coefficients [A_1 ... A_M] give the task parameters a parsimonious representation, where each D × 1 task parameter vector is represented by a vector of size K × 1, with K ≪ D. Finally, each column ε_m of the D × M matrix E explains the task-specific idiosyncrasies and is assumed to be drawn from a multivariate Gaussian with diagonal covariance matrix Ψ = diag(ψ_11, ..., ψ_DD).

At first blush, such a setup may resemble Factor Analysis (Bartholomew and Knott, 1999, Rai and Daumé III, 2008). However, unlike Factor Analysis, i.e., an X = ZA + E type of setup where the data X are observed, in this case the matrix Θ of task parameters is not observed. So the "data" itself is a latent variable in this model (the others being Z, A, E, and the associated hyperparameters). The goal is to learn Θ along with all the other latent variables, harnessing the data available from all the tasks. Also note that this is a supervised setting, unlike standard Factor Analysis.

A crucial issue in the model is determining the intrinsic dimensionality and sparsity of the underlying predictor subspace defined by Z. We propose a nonparametric Bayesian model based on the Indian Buffet Process (Section 2.3) (Ghahramani et al., 2007) to deal with both these issues.
The dimensionality K of the latent space and the degree of sparsity of the basis space defined by Z are automatically determined by the IBP prior. Note that the sparsity of Z is akin to imposing an ℓ1-type regularization on Z as in the Lasso framework, or assuming a Laplace prior on the columns of Z: Z_k ∼ ∏_{d=1}^D Laplace(0, η).

5.3 Infinite Latent Subspace Models for Multitask Learning

Our goal is to simultaneously learn several prediction tasks. In the rest of the exposition and in our experiments, we consider the special case of multilabel prediction, where each input x is associated with multiple labels. Therefore, predicting each label is a task. Our framework is, however, more general and can also be applied to cases where each prediction problem has its own source of inputs.

In the multilabel setting, learning the prediction task for the mth label amounts to learning the task parameter θ_m. Formally, given training data D = {(x_1, y_1^m), ..., (x_N, y_N^m)} for task m, where x_i ∈ R^D and y_i^m is a real-valued (for regression) or binary-valued (for classification) response, a learning task parameterized by θ_m can be defined as:

Regression: y_i^m ∼ Nor(θ_m^T x_i, ρ^2)
Classification: y_i^m ∼ Bin(1/(1 + e^{−θ_m^T x_i}))

To follow a more compact notation, we denote the inputs [x_1, ..., x_N] by an N × D matrix X, the responses for all the M tasks by an N × M matrix Y, and the M task parameters by a D × M matrix Θ = [θ_1 ... θ_M] ∈ R^{D×M}. With this notation, we can define the prediction setting as a probabilistic model Y | Θ, X ∼ Nor(XΘ, ρ^2 I) for regression (Bayesian linear regression), and Y | Θ, X ∼ Bin(1/(1 + e^{−XΘ})) for classification (Bayesian logistic regression). Recall our original setup Θ = Z A_θ + E. We wish to model the matrix Z using the Indian Buffet Process (IBP), thereby automatically choosing the intrinsic dimensionality of the task basis space defined by Z.
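As a concrete illustration, a draw from the IBP prior can be simulated via its restaurant construction; the sketch below (sizes and α are arbitrary) also combines the binary draw with a Gaussian matrix to form a sparse real-valued Z, anticipating the B ⊙ V construction described next:

```python
import numpy as np

def sample_ibp(alpha, D, rng):
    """Draw a binary matrix from the IBP(alpha) prior: customer i takes each
    existing dish k with probability m_k / i (m_k = number of previous takers),
    then samples Poisson(alpha / i) brand-new dishes."""
    dishes, counts = [], []
    for i in range(1, D + 1):
        for k in range(len(dishes)):
            take = int(rng.random() < counts[k] / i)
            dishes[k].append(take)
            counts[k] += take
        for _ in range(rng.poisson(alpha / i)):
            dishes.append([0] * (i - 1) + [1])   # new dish: only customer i has it
            counts.append(1)
    if not dishes:
        return np.zeros((D, 0), dtype=int)
    return np.array(dishes, dtype=int).T          # D x K, with K random

rng = np.random.default_rng(0)
B = sample_ibp(2.0, 10, rng)                      # binary IBP draw, D = 10 rows
V = rng.standard_normal(B.shape)                  # Gaussian real-valued matrix
Z = B * V                                         # sparse real-valued Z = B ⊙ V
print(B.shape[0], (Z[B == 0] == 0).all())         # 10 True
```

Note that the number of columns K is itself random under the prior, which is exactly what lets the model choose its own dimensionality.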
However, since the IBP defines a distribution over binary matrices and Z needs to be a real-valued matrix, we model Z as B ⊙ V, the element-wise product of a binary matrix B and a real-valued matrix V, both of size D × K. We place an IBP prior over the binary matrix B and a Gaussian prior over the real-valued matrix V. Our complete hierarchical model is the following (the corresponding graphical model is shown in Figure 5.1: Top; the error term is not shown for brevity):

Y ∼ Nor(XΘ, ρ^2 I) (regression)
Y ∼ Bin(1/(1 + e^{−XΘ})) (classification)
Θ = (B ⊙ V) A_θ + E
B ∼ IBP(α)
V ∼ Nor(0, σ_v^2 I), σ_v ∼ IG(a, b)
A_θ ∼ Nor(0, σ_θ^2 I), σ_θ ∼ IG(c, d)
E ∼ Nor(0, Ψ), ψ_dd ∼ IG(e, f)

Figure 5.1. Predictor subspace model. Top: our basic model. Bottom: the augmented model using both task parameters and input data. X in the augmented model can additionally also consist of unlabeled data. Noise hyperparameters are not shown for brevity. In both models, the shaded nodes are observed, and the remaining ones (including the matrix Θ consisting of the task parameters, and the noise hyperparameters) are latent variables to be learned.

Here Θ, which is itself a latent variable, acts as the "data" in the model and depends on the other latent variables in the model and the data from the actual tasks (B, V, A_θ, E, X, Y). Our proposed model learns Θ (along with the latent subspace underlying Θ) by sharing information across all the tasks.

5.3.1 An Augmented Model for Learning Task Basis

Learning the task subspace Z (= B ⊙ V) reliably requires a reasonable amount of data. In the basic model, the only available "data" for learning Z is Θ (which, under our probabilistic model, is itself a latent variable to be learned). Given related but only a small number of tasks M, the D × M matrix Θ may not be enough to reliably learn the basis Z. This motivates our second model, which additionally uses the inputs X from each task to improve the learning of Z.
Under this model (shown in Figure 5.1: Bottom), it is assumed that the task parameters Θ and the inputs X both share the same basis space Z, with different mixing matrices A_θ and A_x, respectively. This model can be thought of as simultaneously discovering both the task parameter basis and the latent features underlying the data X. Furthermore, under this model, the data matrix X need not consist only of examples for which labels are known. So, the matrix X shown in Figure 5.1 (bottom) can additionally also consist of unlabeled examples, which are relatively easier to obtain.

The reason for having the inputs share the same subspace as the task parameters can be explained using a Representer theorem (Schölkopf et al., 2001) argument: write the solution of a regularized loss function as θ = Σ_i α_i x_i (assuming a linear kernel). Now, if we write each input vector x_i as a combination of basis vectors (Z a_i + ε_i), then (after rearranging the coefficients) one can also write the task parameter θ as a combination of the same basis vectors defined by Z. Therefore, it makes sense to have both X and Θ share the same subspace.

5.4 Inference

We take a fully Bayesian approach for inference in this model. Inference is akin to the Gibbs sampler for the IBP (Ghahramani et al., 2007), except for the following differences:

• The matrix Z is no longer a binary matrix but is expressed as an element-wise product of the binary matrix B and the real-valued matrix V. So both B and V need to be sampled in conjunction in our model.

• The latent variable Θ acts as the "data" and therefore needs to be sampled from its posterior P(Θ | D, B, V, A_θ), where D = {(x_1, y_1^m), ..., (x_N, y_N^m)}, m = 1, ..., M, denotes the actual data the model has access to.

Inference in our model is done using Gibbs sampling with a few Metropolis-Hastings steps. The sampler draws posterior samples of B, V, A_θ, Θ, and the remaining hyperparameters of the model.
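As an illustration of the first of these draws, here is a minimal sketch of resampling a single existing entry of B, following the IBP conditionals given in the next subsection; `loglik` is a hypothetical stand-in for log p(Θ | B, V, A_θ, Ψ):

```python
import numpy as np

def gibbs_update_B_entry(B, i, k, loglik, rng):
    """Resample an existing entry B[i,k]: prior odds m_{-i,k} : (D - m_{-i,k}),
    each weighted by the likelihood of Theta under that setting."""
    D = B.shape[0]
    m = B[:, k].sum() - B[i, k]          # m_{-i,k}: other customers using dish k
    if m == 0:                            # singleton dishes are handled by the M-H step
        return B
    logp = np.empty(2)
    for b in (0, 1):
        B[i, k] = b
        prior = m / D if b == 1 else (D - m) / D
        logp[b] = np.log(prior) + loglik(B)
    p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))   # normalize the two cases
    B[i, k] = int(rng.random() < p1)
    return B

# Toy check with a likelihood that prefers ones (hypothetical stand-in).
rng = np.random.default_rng(4)
B = np.array([[1, 0], [1, 1], [0, 1]])
B = gibbs_update_B_entry(B, 0, 1, lambda M: 5.0 * M.sum(), rng)
print(B[0, 1] in (0, 1))  # True
```

Computing the two cases in log space and normalizing at the end avoids underflow when the likelihood terms are small.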
Here, we describe the sampling equations for all latent variables in our basic model. Sampling the hyperparameters (α, σ_v, etc.) is straightforward and we skip it due to space limitations.

5.4.1 Sampling B

Sampling the binary IBP matrix B consists of sampling existing dishes, proposing new dishes, and accepting or rejecting them based on the acceptance ratio in the associated M-H step. For sampling existing dishes, an entry of B is set to 1 according to p(B_ik = 1 | Θ, B_{−ik}, V, A_θ, Ψ) ∝ (m_{−i,k}/D) p(Θ | B, V, A_θ, Ψ), and set to 0 according to p(B_ik = 0 | Θ, B_{−ik}, V, A_θ, Ψ) ∝ ((D − m_{−i,k})/D) p(Θ | B, V, A_θ, Ψ), where m_{−i,k} = Σ_{j≠i} B_{jk} is the number of other customers who chose dish k. For sampling new dishes, we use an M-H step where we simultaneously propose η* = (K_new, V_new, A_θ^new), with K_new ∼ Poisson(α/D). We accept the proposal with acceptance probability a = min{1, p(rest|η*) / p(rest|η)}, where p(rest|η) is the probability of the data given parameters η. We propose V_new from its prior (Gaussian) but, for faster mixing, we propose A_θ^new from its posterior (a Gaussian).

5.4.2 Sampling V

We sample the real-valued matrix V from its posterior: p(V_{i,k} | Θ, B, A_θ, Ψ) ∼ Nor(V_{i,k} | μ_{i,k}, Σ_{i,k}), with Σ_{i,k} = (Σ_{n=1}^M A_{θk,n}^2 / ψ_i + 1/σ_v^2)^{-1} and μ_{i,k} = Σ_{i,k} ψ_i^{-1} (Σ_{n=1}^M A_{θk,n} Θ*_{i,n}). We define Θ*_{i,n} = Θ_{i,n} − Σ_{l=1, l≠k}^K (B_{i,l} V_{i,l}) A_{θl,n}. The hyperparameter σ_v on V has an inverse-gamma prior and the posterior also has the same form.

5.4.3 Sampling A_θ

We sample A_θ from its posterior p(A_θ | Θ, B, V, Ψ) ∼ Nor(A_θ | Ā, Σ), where Ā = Z^T(ZZ^T + Ψ)^{-1}Θ and Σ = I − Z^T(ZZ^T + Ψ)^{-1}Z, with Z = B ⊙ V.

5.4.4 Sampling Θ

The posterior for Θ can be written as P(Θ | D, B, V, A_θ) ∝ P(Y | X, Θ) P(Θ). The prior on Θ is a Gaussian Nor((B ⊙ V)A_θ, Ψ). For the likelihood term, there are two cases. For regression, the likelihood P(Y | X, Θ) is Gaussian, so the posterior is available in closed form and is easy to sample from.
Specifically, the posterior P(Θ | D, B, V, A_θ) is a Gaussian Nor(μ_θ, Σ_θ), with

μ_θ = Σ_θ (Ψ^{-1} (B ⊙ V) A_θ + β X^T Y)
Σ_θ^{-1} = Ψ^{-1} + β X^T X

where β is the precision (inverse variance) of the Gaussian likelihood term P(Y | X, Θ). For classification, however, the likelihood is no longer Gaussian, so we lose conjugacy. There are several ways to deal with this. One way is to use a Laplace approximation to the posterior (Bishop, 2006). Another possibility is to use the variational method proposed in (Jaakkola and Jordan, 1996) to approximate a non-Gaussian likelihood by a Gaussian one. We instead use another approach, based on the auxiliary-variable Gibbs sampler for logistic regression (Holmes and Held, 2006), which is more appropriate in the Gibbs sampling scheme we employ.

The auxiliary variable sampler (Holmes and Held, 2006) for logistic regression associates with each response y_i ∈ {0, 1} an auxiliary variable ỹ_i = x_i^T θ + ε_i with ε_i ∼ Nor(0, λ_i), such that y_i = 1 if ỹ_i > 0, and 0 otherwise. λ_i is assigned a Kolmogorov-Smirnov distribution. With a normal prior Nor(b, v) on θ, the posterior on θ is still a Gaussian:

θ | ỹ, λ ∼ Nor(μ_θ, Σ_θ)
μ_θ = Σ_θ (v^{-1} b + β X^T W ỹ)
Σ_θ = (v^{-1} + β X^T W X)^{-1}
W = diag(λ_1, ..., λ_N), ỹ = [ỹ_1, ..., ỹ_N]′

where the posterior over each auxiliary variable ỹ_i is a truncated normal, which can be sampled from using standard techniques:

ỹ_i | θ, x_i, y_i, λ_i ∼ Nor(x_i^T θ, λ_i) I(ỹ_i > 0) if y_i = 1
ỹ_i | θ, x_i, y_i, λ_i ∼ Nor(x_i^T θ, λ_i) I(ỹ_i ≤ 0) if y_i = 0

In our case, the mean and covariance of the normal prior over Θ are given by b = (B ⊙ V)A_θ and v = Ψ, respectively.

5.4.5 Sampling in the Augmented Model

The sampling steps in our augmented model are essentially the same as in the basic model, except that we replace the D × M matrix Θ by the D × (M + N) matrix [Θ X]. As in the basic model, Θ still needs to be sampled as above, whereas X stays fixed and does not have to be sampled.
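For the regression case of Section 5.4.4, the closed-form Gaussian posterior over Θ can be sketched as follows; all sizes are arbitrary and the helper name is ours:

```python
import numpy as np

def theta_posterior_regression(B, V, A, Psi, X, Y, beta):
    """Closed-form Gaussian posterior over Theta (regression case):
    Sigma_theta = (Psi^{-1} + beta X^T X)^{-1},
    mu_theta    = Sigma_theta (Psi^{-1} (B * V) A_theta + beta X^T Y)."""
    prior_mean = (B * V) @ A                        # (B ⊙ V) A_theta, D x M
    Pinv = np.linalg.inv(Psi)
    Sigma = np.linalg.inv(Pinv + beta * X.T @ X)    # X is N x D
    mu = Sigma @ (Pinv @ prior_mean + beta * X.T @ Y)
    return mu, Sigma

rng = np.random.default_rng(5)
D_, K_, M_, N_ = 5, 2, 3, 50                        # hypothetical sizes
B = rng.integers(0, 2, (D_, K_)).astype(float)
V = rng.standard_normal((D_, K_))
A = rng.standard_normal((K_, M_))
Psi = np.eye(D_)
X = rng.standard_normal((N_, D_))
Y = X @ rng.standard_normal((D_, M_)) + 0.1 * rng.standard_normal((N_, M_))
mu, Sigma = theta_posterior_regression(B, V, A, Psi, X, Y, beta=10.0)
print(mu.shape, Sigma.shape)  # (5, 3) (5, 5)
```

Note that the same D × D posterior covariance is shared by all M columns of Θ, so only one matrix inverse is needed per sweep.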
We note here that although a fully Bayesian solution can be slow for data having a large number of features (since each feature corresponds to a customer in the IBP model), one may address this by using a number of recently proposed alternatives to Gibbs sampling for the IBP (Doshi and Ghahramani, 2009a) that can be as much as an order of magnitude faster.

5.5 Prediction

Having learned the task parameters Θ, we use them to make predictions on the test data. For test data x of the mth task, the prediction can be written as

p(y|x) = ∫ p(y|x, θ_m) p(θ_m | μ_m, Σ_m) dθ_m

which essentially averages over the predictions made by each of the posterior samples of θ_m, where μ_m and Σ_m are the mean and covariance parameters for the mth task. Since the posterior averaging can be computationally expensive, it can also be replaced by θ̂_m, the MAP estimate of θ_m. Prediction for x then simply requires plugging in the MAP estimate: p(y|x) = p(y|x, θ̂_m).

5.6 Experiments

We present our experimental results on two real-world multilabel classification datasets (Yeast and Scene) from the UCI repository, comparing our models against independently trained Bayesian logistic regression, a pooling-based approach, and another state-of-the-art Multitask Learning baseline. The Yeast dataset consists of 1500 training and 917 test examples, each having 103 features. The number of labels (or tasks) per example is 14. The Scene dataset consists of 1211 training and 1196 test examples, each having 294 features. The number of labels per example for this dataset is 6. We use the following baselines:

• LR: Independent Bayesian logistic regression.

• pool: Pooling all data and learning a single model.

• yaxue: The matrix stick-breaking-process-based Multitask Learning model proposed in (Xue et al., 2007a).

In the experimental results (Figure 5.2 and Table 5.1), we refer to our basic model as model-1, and the augmented model with input data as model-2.
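Looping back to the prediction step of Section 5.5, the cheaper MAP plug-in option can be sketched as follows (a minimal sketch; the function name and sizes are ours):

```python
import numpy as np

def predict_map(Theta_map, X_test, task="classification"):
    """Plug-in prediction with the MAP estimate of the task parameters:
    p(y|x) = p(y | x, theta_hat_m), applied to all M tasks at once."""
    scores = X_test @ Theta_map               # N x M linear scores
    if task == "classification":
        return 1.0 / (1.0 + np.exp(-scores))  # per-task label probabilities
    return scores                             # regression: plug-in mean response

rng = np.random.default_rng(6)
Theta_map = rng.standard_normal((4, 3))       # D = 4 features, M = 3 tasks
X_test = rng.standard_normal((7, 4))
P = predict_map(Theta_map, X_test)
print(P.shape)                                # (7, 3)
```

Full posterior averaging would instead loop this computation over the stored Gibbs samples of Θ and average the resulting probabilities.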
Note that all the multitask approaches compared here use Logistic Regression as the base classifier. We use overall accuracy, F1-Macro and F1-Micro (Yang, 1997), and AUC (Area Under ROC Curve) as the performance metrics. The Gibbs samplers used in Baye |
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6bs21fj |



