Algorithms and coresets for large-scale kernel smoothing

Algorithms and coresets for large-scale kernel smoothing

Title	Algorithms and coresets for large-scale kernel smoothing
Publication Type	dissertation
School or College	College of Engineering
Department	Computing
Author	Zheng, Yan
Date	2017
Description	Kernel smoothing provides a simple way of finding structures in data sets without the imposition of a parametric model, for example, nonparametric regression and density estimates. However, in many data-intensive applications, the data set could be large. Thus, evaluating a kernel density estimate or kernel regression over the data set directly can be prohibitively expensive in big data. This dissertation is working on how to efficiently find a smaller data set that can approximate the original data set with a theoretical guarantee in the kernel smoothing setting and how to extend it to more general smooth range spaces. For kernel density estimates, we propose randomized and deterministic algorithms with quality guarantees that are orders of magnitude more efficient than previous algorithms, which do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. Our algorithms are applicable to any large-scale data processing framework. We then further investigate how to measure the error between two kernel density estimates, which is usually measured either in L1 or L2 error. In this dissertation, we investigate the challenges in using a stronger error, L ∞ (or worst case) error. We present efficient solutions for how to estimate the L∞ error and how to choose the bandwidth parameter for a kernel density estimate built on a subsample of a large data set. We next extend smoothed versions of geometric range spaces from kernel range spaces to more general types of ranges, so that an element of the ground set can be contained in a range with a non-binary value in [0,1]. We investigate the approximation of these range spaces through ϵ-nets and ϵ-samples. Finally, we study coresets algorithms for kernel regression. The size of the coresets are independent of the size of the data set, rather they only depend on the error guarantee, and in some cases the size of domain and amount of smoothing. We evaluate our methods on very large time series and spatial data, demonstrate that they can be constructed extremely efficiently, and allow for great computational gains.
Type	Text
Publisher	University of Utah
Subject	Computer science
Dissertation Name	Doctor of Philosophy
Language	eng
Rights Management	© Yan Zheng
Format	application/pdf
Format Medium	application/pdf
ARK	ark:/87278/s6n05pch
Setname	ir_etd
ID	1426684
OCR Text	Show ALGORITHMS AND CORESETS FOR LARGE-SCALE KERNEL SMOOTHING by Yan Zheng A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computing School of Computing The University of Utah August 2017 Copyright c Yan Zheng 2017 All Rights Reserved The University of Utah Graduate School STATEMENT OF DISSERTATION APPROVAL The dissertation of Yan Zheng has been approved by the following supervisory committee members: Jeffrey Phillips , Chair(s) Feifei Li , Member Suresh Venkatasubramanian , Member April 20 2017 Date Approved April 20 2017 Date Approved April 20 2017 Date Approved Vivek Srikumar , Member April 20 2017 Date Approved Clayton Scott , Member May 1 2017 Date Approved by Ross Whitaker , Chair/Dean of the Department/College/School of Computing and by David B. Kieda , Dean of The Graduate School. ABSTRACT Kernel smoothing provides a simple way of finding structures in data sets without the imposition of a parametric model, for example, nonparametric regression and density estimates. However, in many data-intensive applications, the data set could be large. Thus, evaluating a kernel density estimate or kernel regression over the data set directly can be prohibitively expensive in big data. This dissertation is working on how to efficiently find a smaller data set that can approximate the original data set with a theoretical guarantee in the kernel smoothing setting and how to extend it to more general smooth range spaces. For kernel density estimates, we propose randomized and deterministic algorithms with quality guarantees that are orders of magnitude more efficient than previous algorithms, which do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. Our algorithms are applicable to any large-scale data processing framework. We then further investigate how to measure the error between two kernel density estimates, which is usually measured either in L1 or L2 error. In this dissertation, we investigate the challenges in using a stronger error, L• (or worst case) error. We present efficient solutions for how to estimate the L• error and how to choose the bandwidth parameter for a kernel density estimate built on a subsample of a large data set. We next extend smoothed versions of geometric range spaces from kernel range spaces to more general types of ranges, so that an element of the ground set can be contained in a range with a non-binary value in [0, 1]. We investigate the approximation of these range spaces through #-nets and #-samples. Finally, we study coresets algorithms for kernel regression. The size of the coresets are independent of the size of the data set, rather they only depend on the error guarantee, and in some cases the size of domain and amount of smoothing. We evaluate our methods on very large time series and spatial data, demonstrate that they can be constructed extremely efficiently, and allow for great computational gains. CONTENTS ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix CHAPTERS 1. 2. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 1.2 1.3 1.4 1.5 1 5 6 7 9 Kernel Density Estimates and Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . Coresets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kernel Range Spaces and KDE Coreset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . KDE in Geometric Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . EFFICIENT CORESETS FOR KERNEL DENSITY ESTIMATES . . . . . . . . . . . . . 11 2.1 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2 Warm-up: One Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Efficient Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Two Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Baseline Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 Randomized MergeReduce Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2.1 The MergeReduce framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2.2 More Efficient Reduce Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2.3 Streaming and Distributed MergeReduce . . . . . . . . . . . . . . . . . . . . . 2.3.2.4 Other Extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.3 Deterministic Z-Order Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.4 Efficient Query Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.5 Parallel and Distributed Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.6 Higher Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Two Dimensions: Centralized . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.2 Two Dimensions: Parallel and Distributed . . . . . . . . . . . . . . . . . . . . . . . . 2.4.3 Two Dimensions: Is RS Good Enough? . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.4 One Dimension: Parallel and Distributed . . . . . . . . . . . . . . . . . . . . . . . . . 3. 11 13 15 16 16 17 17 19 22 23 23 25 25 26 26 28 31 33 34 L• ERROR AND BANDWIDTH SELECTION FOR KDES . . . . . . . . . . . . . . . . . 36 3.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Unit Kernels or Normalized Kernels? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.2 Why s Is Given? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.3 Why L• Error? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 38 38 39 3.1.4 Computing Coreset Q of P . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.5 Related Work on Bandwidth Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 Computing err ( P, Q) Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 Approximation Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.2 Generation of Evaluation Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Bandwidth Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.1 Refining the Bandwidth for Coresets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.2 Lipschitz Properties of h . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3.3 Random Golden Section Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.2 Evaluating Point Generation for err X . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.3 Choosing New Bandwidth Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4.4 Compare with the Traditional Bandwidth Selection Method . . . . . . . . . . 3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4. GENERALIZED KERNEL RANGE SPACE AND (#, t )-NET . . . . . . . . . . . . . . . . 60 4.1 Definitions and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.1.1 Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 New Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 Smoothed Range Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.2 #-Sample in a Smoothed Range Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2.3 (#, t )-Net in a Smoothed Range Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Linking and Properties of (#, t )-Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3.1 Relations between Smoothed Range Spaces and Linked Binary Range Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Min-Cost Matchings within Cubes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Constructing #-Samples for Smoothed Range Spaces . . . . . . . . . . . . . . . . . . . . 4.5.1 Discrepancy for Smoothed Halfspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.2 From a Single Smoothed Halfspace to a Smoothed Range Space . . . . . . . 4.5.3 #-Samples for Smoothed Halfspaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5.4 Generalization to d Dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5. 40 41 42 42 46 48 48 49 50 51 51 53 55 58 59 62 62 63 64 65 66 67 67 69 73 74 75 75 76 CORESETS FOR KERNEL REGRESSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.1 Coresets for Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.2 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.3 Our Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 Subset Selection Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Progressive Grid-Based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 Accuracy of Random Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.2 Accuracy of Grid-Based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.1 Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.2 Effectiveness of Coresets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v 77 77 78 79 80 81 82 83 86 89 90 90 91 5.4.3 Efficiency of Coresets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.4 Consistency with Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.5 Progressive Grid-Based Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.6 High Dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6. 93 94 95 96 96 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 APPENDIX: PROOFS FOR CORESET ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 vi LIST OF FIGURES 1.1 Kernel density estimate (KDE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Kernel regression of synthetic data with bandwidth 50 (left) and 200 (right). . . 5 1.3 Range Spaces and an example of #-sample. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Example with 10,000 points in [0, 1]2 generated on a circle or line with N (0, 0.005) noise; 25% of points are uniform background noise. The generating function is reconstructed with KDE with s = 0.05 (upper left), and its persistence diagram based on the superlevel set filtration is shown (upper middle). A coreset [149] of the same data set with only 1,384 points (lower left) and persistence diagram (lower middle) are shown, again using KDE. This associated confidence interval contains the dimension 1 homology features (red triangles) suggesting they are noise; this is because it models data as iid - but the coreset data is not iid, it subsamples more intelligently. We also show persistence diagrams of the original data based on the sublevel set filtration of the standard distance function (upper right, with no useful features due to noise) and the kernel distance (lower right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1 Intersection between a matching and a disk, solid red lines are included in EM \ B. Left shows Grid matching and right shows Z-order with every other edge in a matching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2 Effect of guaranteed error # on the centralized G-MR, G-MR+RS, Z, Z+RS. . . 29 2.3 Centralized G-MR, B-MR, Greedy-MR, KH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4 Effect of measured `• error err and N on centralized IFGT, RS, G-MR+RS, Z+RS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.5 Effect of measured `• error err on distributed and parallel G-MR+RS, Z+RS, RS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.6 Effect of n on G-MR+RS, Z+RS, RS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.7 Effect of # and `• error err on G-MR+RS, Z+RS, RS for high precision case. . . . 34 2.8 Effect of varying measured err on RS, SS on one-dimensional data. . . . . . . . . . . 35 3.1 KDEs with different bandwidths showing daily, weekly, and yearly trends. Left shows a full year of data, and right shows one week of data. . . . . . . . . . . . 39 3.2 Illustration of the Minkowski sum of a ball of radius s and convex hull of P [ Q. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.3 3.4 Example with need for larger w for coreset Q. . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 Two curves, one of which is Lipschitz. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.5 Results on one-dimensional synthetic data to choose best evaluating point generation techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6 Visualization of KDE P and KDE Q for two-dimensional synthetic data using different bandwidth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.7 Visualization of KDE P and KDE Q for two-dimensional real data using different bandwidth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.8 Results on one-dimensional real data to choose the best evaluating point generation techniques. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 3.9 Choosing the best evaluation set X for on two-dimensional synthetic (left) and real (right) data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 3.10 w ⇤ = arg minw err X ( P, s, Q, w ) in R1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.11 w ⇤ = arg minw expX ( P, s, Q, w ) in R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 3.12 w ⇤ = arg minw expX ( P, s, Q, w ) in R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.1 Illustration of the smoothed halfspace, and smoothed polynomial surface, with function value of three points { p1 , p2 , p3 } defined using a triangle kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 4.2 (T3) edges. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.1 Example improvement of Aggregate-Neighbor (right) over G-Aggregate (left). Input P = {(1 100), (2 40), (3 0), (15 50), (16 50), (17 50)}. With G-Aggregate with g = 2, the coreset Q = {(1.5 70), (3 0), (15.5 50), (17 50)}. The largest errors occur at x < 0 and around x = 9. If we add the extra points at the empty grid cells with nonempty neighbor grids, i.e., Aggregate-Neighbor, then L• is significantly reduced. We add three points ( 1 98.3124), (5 3.2559), and (13 50). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.2 The maximum L• error found based on the number of evaluation points on real (left) and synthetic data (middle) when Px ⇢ R1 , and real data (right) when Px ⇢ R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.3 5.4 L• error for coresets when also testing sparse regions on real data(left) and synthetic data(middle) when Px ⇢ R, and real data(right) when Px ⇢ R2 . . . . . 92 Comparison of construction time and query time for real data set with Px ⇢ R. 94 5.5 Comparison of construction time and query time for real data set with Px ⇢ R2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.6 The relation between L• error and bandwidth for real data (left), synthetic data (middle) for Px ⇢ R and real data (right) for Px ⇢ R2 . . . . . . . . . . . . . . . . . 95 5.7 5.8 The relation between window size and L• error for progressive G-Aggregate method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 Left: L• error for coresets of high dimensional data sets. Right: Running time to generate the coresets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 viii ACKNOWLEDGEMENTS I could never have enjoyed my last five years so much without the guidance, support, and efforts of so many people with whom I have interacted. First, I am extremely grateful to have been under the guidance of my great advisor, Jeff Phillips, who has demonstrated to me how to be an excellent researcher, advisor, teacher, and friend. I greatly appreciate his patience while I learned to follow his clear and rigorous approach to geometry and research. I have learned from him not just research and thinking skills, but also principles that I want to follow in my future life. I also want to thank the rest of my committee, Feifei Li, Suresh Venkatasubramanian, Vivek Srikumar, and Clayton Scott, for providing meaningful guidance on the process towards my dissertation. Especially I want to thank Feifei for showing me how to design rigorous experiments and how to apply kernel smoothing to the database area. I thank Suresh for welcoming me to his group meeting during my early stages of research. He can always simplify complicated problems and his contagious enthusiasm in research encourages me. I thank Vivek and Clay for helping me to connect my research to machine learning and giving me different aspects to understand my research. I have been so lucky to work in the wonderful Data Group at the University of Utah, surrounded by many great colleagues. With a ton of fresh and encouraging discussions on research and other fun, I have enjoyed working here every single day. To list a few: Mina Ghashami, Jeffrey Jestes, Robert Christensen, Mingwang Tang, Min Du, Michael Matheny, Pingfan Tang, Wangchao Le, John Moeller, Chi Zhang, Chenxu Ding, Samira Daruki, Dustin Webb, Waiming Tai, Guineng Zheng, and Sunipa Dev. Thank you all for providing an interesting and collaborative research environment. I also want to thank people in the machine learning and NLP group, Haibo Ding, Xingyuan Pan, Jie Cao, and Ruihong Huang, for your useful discussions for inspirations. I would also like to thank my dear friends I met in the School of Computing, Xia Dong, Shuying Liang, Xing Lin, Liang Zhou, Shuqin Zeng, Peihong Zhu, Wei Liu, Rui Dai, Limou Wang, and Yi Ou, for either helping me in my research and study or frequently teaching me the real wisdom of life, and with whom I shared so many vivid memories in Utah. Many thanks to the staff at the School of Computing in the University of Utah. Special thanks go to Karen Feinauer, Ann Carlstrom, and Robert Barber for your immense support, so that I could entirely concentrate on my research. I also want to thank people in the Atmospheric Sciences department, my second home in the University of Utah. I thank John Horel for supporting me during my master study and gap between master and Ph.D. study, and offering me such interesting projects to work on, which helped me develop my research interests in data mining and machine learning. I thank Chris Galli for guiding me towards the frontier of technology. I also need to thank those at Nankai University for helping shape my interests in computer science and research, especially Qingren Wang, Guangshun Shi, Jiyi Li, and Yuanfang Liu. Guangshun not only introduced me to research but also cultivated my interest and skills; he taught me the importance of organization. I also acknowledge the funding that made this dissertation possible. I thank the National Science Foundation and School of Computing for directly funding my research. I am also thankful for the funding I received through grants to my advisor Jeff Phillips from NSF CCF-1350888, IIS-1251019, and ACI-1443046. Finally, I want to thank my parents, Fuxiang Zheng and Chunmei Cai, and parents-inlaw, Jianquan Gao and Heping Shang, for their constant support, confidence, and pride. Their encouragement throughout this process led to resolved and constant progress. Special thanks to my husband, Yang Gao, for his love, care, and support, not only in life, but also in study and research. Last but not least, my two adorable girls, Sunny Gao and Annie Gao, your love teaches me how to be a better mom, your smiles and sweetness are good seasonings to my tense Ph.D. study. x CHAPTER 1 INTRODUCTION 1.1 Kernel Density Estimates and Kernel Regression A kernel density estimate (KDE) is a statistically-sound method to estimate a continuous distribution from a finite set of points. This is an increasingly common task in many areas, such as outlier detection [20, 21], human motion tracking [17], financial data modeling [13], geometric inference [106] and anomaly detection [120]. In many scientific computing and data-intensive applications, the input data set P is a finite number of observations or measurements made for some real-world phenomena that can be best described by some random variable V with an unknown probability distribution function (pdf) f . Given a data set P of size n consisting of values from a domain M, a kernel density estimate is a function f P that for any input in M (not necessarily in P) describes the density at that location. It is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. That said, we view P as a finite, independent and identically distributed (iid) data sample drawn from a random variable V that is governed by an unknown distribution f . We are interested in estimating the shape of this function f . The kernel density estimate f P approximates the density of f at any possible input point x 2 M [100, 116]. Figure 1.1 visualizes the kernel density estimate (KDE) in both 1 and 2 dimensions, using real data sets (a web trace in 1D and a spatial data set from openstreetmap in 2D). Black dots represent a subset of points from P, and the blue curves or regions represent the KDE constructed from P. We restrict the kernels Ks : R d ⇥ R d ! R + that satisfy the following properties: (K1) symmetric and shift invariance: There exists a function k : R + ! R + such that for any p, x 2 R d , we have K ( p, x ) = K ( x, p) = k (k p `2 distance between p and x. x k), where k p x k denotes the 2 (a) KDE in 1D. (b) KDE in 2D. Figure 1.1. Kernel density estimate (KDE). (K2) monotonicity: For z < z0 then k (z) > k (z0 ). In addition to the above properties of the kernels, it is convenient to enforce one of two other properties. A normalized kernel satisfies Z x 2R d Ks ( p, x )dx = 1, (1.1) so that the kernel and the kernel density estimate are probability distributions. A unit kernel satisfies Ks ( x, x ) = 1 so that 0  Ks ( p, x )  1, (1.2) which ensures that KDE P ( x )  1. Unlike normalized kernel, the changing of bandwidth does not affect the coefficient of kernel function, so Ks ( p, x ) = k (k p x k/s). In this dissertation, we will enforce the kernel to be either normalized kernel or unit kernel in different chapters. Examples of kernels include (described here for R d ): • Gaussian Kernel: Ks ( p, x ) = 1 sd (2p )d/2 • Laplacian Kernel: Ks ( p, x ) = 1 sd cd d! • Triangular Kernel: Ks ( p, x ) = d sd cd • Epanechnikov Kernel: Ks ( p, x ) = • Ball Kernel: Ks ( p, x ) = { sd c1 d 1 exp( k x 1 max{0, 1 d +2 2sd cd if k p pk2 /s2 ), exp( k x pk/s), kx max{0, 1 pk/s }, kx x k  s; o.w. 0}, pk2 /s2 }, or 3 where cd = d rd p 2 G( d2 +1) is the volume of the unit d-dimensional sphere. These are shown as normalized kernels; to make them unit kernels, the coefficient is simply set to 1. We use the Gaussian kernel by default throughout the dissertation (the most widely used kernel in the literature), although some scenarios favor the Epanechnikov kernel [122, 126]. All kernel definitions have a s term to controls the amount of data smoothing. Choosing the appropriate s is an important problem and there is a large amount of literature on doing so [77, 113, 132]. In Chapter 3, we will investigate how to choose the bandwidth parameter for a kernel density estimate built on a subsample of a large data set. • Kernel density estimate: Given such a kernel Ks and a point set P in d-dimensions, a kernel density estimate is formally defined as a function KDE P that for any query x 2 R d evaluates as KDE P ( x ) = 1 \| P\| Â Ks ( p, x). (1.3) p2 P • Kernel distance: The kernel distance [52, 70, 78, 105] is a metric [95, 127] between two point sets P, Q (as long as the kernel used is characteristic [127], a slight restriction of being positive definite [6, 140], this includes the Gaussian and Laplace kernels). Define a similarity between the two point sets as k ( P, Q) = 1 1 \| P\| \| Q\| Â Â Ks ( p, q). (1.4) p2 P q2 Q Then the kernel distance between two point sets is defined as DK ( P, Q) = q k ( P, P) + k ( Q, Q) 2k ( P, Q). (1.5) When we let point set Q be a single point x, then k ( P, x ) = KDE P ( x ). If Ks is positive definite, it is said to have the reproducing property [6,140]. This implies that Ks ( p, x ) is an inner product in some reproducing kernel Hilbert space (RKHS) HK . Specifically, there is a lifting map f : R d ! HK so that Ks ( p, x ) = hf( p), f( x )iHK , and moreover, the entire set P can be represented as F( P) = Â p2 P f( p), which is a single p element of HK and has a norm kF( P)kHK = k ( P, P). A single point x 2 R d also has a p norm kf( x )kHK = K ( x, x ) in this space. • Kernel regression: Kernel regression [96, 142] is a powerful non-parametric technique for understanding scalar-valued 1-dimensional (and higher dimensional) data sets. It 4 has distinct advantages over linear or polynomial regression techniques in that it does not impose a possibly restrictive or over-fitting model on the data. Rather it uses a kernel similarity function to describe a smooth weighted average over the points. This allows the predicted function to locally adapt to the values of the data. These advantages have led to wide use of the kernel regression to predict, model, and visualize data from stocks [144] to weather monitoring [124] to quantified self [131]. We now consider an input data set P ⇢ R d+1 . We decompose this into the first d explanatory coordinates denoted Px ⇢ R d and the last dependent one Py ⇢ R. Most examples in this dissertation we discuss have d = 1 where it is common to think of these data items Px as times, but many approaches generalize for larger values of d. Then each data item p 2 P is also associated with a scalar data value py (the set of these comprises Py ). In this setting, Px is the same as P in the previous KDE definition. We will only consider P ⇢ R d+1 in the circumstance of kernel regression, mainly in Chapter 5, and keep P ⇢ R d in all the other chapters. With the new setting, for any query point q 2 R d , KDE P ( q ) = 1 \| P\| Â K ( p x , q ). (1.6) p2 P We say a weighted kernel density estimate (WKDE) replaces this with a weighted sum WKDE P ( q ) = 1 \| P\| Â K ( p x , q) py . (1.7) p2 P Finally, the (Nadaraya-Watson) kernel regression function is defined for a query point q 2 R d as KR P ( q ) = Â p2 P K ( p x , q ) py WKDE P ( q ) . = KDE P ( q ) Â p2 P K ( p x , q ) (1.8) This maps each domain point in R d to an estimate in the space R of scalar values; it takes a weighted average (defined by kernel similarity) of the scalar values nearby. Figure 1.2 visualizes the kernel regression and its original data of a synthetic data set with bandwidth 50 and 200. There are various other forms of kernel regression [112], but in this dissertation we focus on the Nadaraya-Watson variety [96,142] as it has an important, long history and has been widely used in areas such as image processing [129] and economics [12]. Moreover, 5 Figure 1.2. Kernel regression of synthetic data with bandwidth 50 (left) and 200 (right). since it does not try to pass through every data point, it is the most robust to outliers that are pervasive in large data, which often by necessity cannot be carefully filtered. 1.2 Coresets A coreset is a reduced data set that can be used as proxy for the full data set; the same algorithm can be run on the coreset as the full data set, and the result on the coreset approximates that on the full data set. It is often required or desired that the coreset is a subset of the original data set, but in some cases, this is relaxed. A weighted coreset is one where each point is assigned a weight, perhaps different than it had in the original set. A strong coreset provides error guarantees for all queries. #-net and #-sample are two examples of coresets defined in range space. • Range space: A range space ( P, A) consists of a ground set P and a family of ranges A of subsets from P. In this dissertation, we consider ranges that are defined geometrically, for instance when P is a point set and A are all subsets defined by a ball, that is any subset of P which coincides with P \ B for any ball B. A can also be all subsets defined by an axis-aligned rectangle or a half space (Figure 1.3(a)). • #-approximation (or #-sample): Given a range space ( P, A), it is a subset Q such that \| A\ P\| \| P\| \| A\ Q\| \| Q\|  # for any A 2 A. • #-net: Given a range space ( P, A), it is a subset Q ⇢ P, so for any A 2 A such that \| A \ P\| #\| P\|, then A \ Q 6= ∆. For instance, the data set P in Figure 1.3(b) may be a point set describing the location of 6 A (a) Different types of ranges. (b) An example of #-sample, where the black points are original data points and the set of red points is an #-sample. Figure 1.3. Range Spaces and an example of #-sample. twitter users in Utah, and the family of subsets A may be all subsets of twitter users within a fixed radius from a query point. Such a subset is shaded in green in Figure 1.3(a). An #-sample is a subset Q ⇢ P of the ground set such that for any range A 2 A , the fraction of points from Q in A is different from the fraction of points from P in A by at most #. Thus if the set of red points in Figure 1.3(b) is an #-sample of twitter users in Utah, and we ask what fraction of the twitter users in Utah are within 20 miles of Salt Lake City, we can give an answer within # error using the set of red points. Furthermore, the same number of samples would work for the entire state of Utah with the same guarantees and for any query disk (e.g., the fraction of Utah twitter users within 40 miles of Provo). For the case of #-net, P is still the point set describing the location of twitter users in Utah. Now we care about large enough events with many tweets, the area A such that \| A \ P\| #\| P\|. Thus Q is a subset of twitter users that witness every large enough event. 1.3 Kernel Range Spaces and KDE Coreset By smoothing out the boundary of the binary ranges, we introduce the smoother family of range spaces called kernel range spaces to deal with noisy data. Thus we cannot simply say a point is in a range or not; instead, we assign a value between [0, 1] to a point in a range. 7 • Kernel range space: A kernel range space [78] is an extension of the combinational concept of a range space. ( P, K) defined by a point set P ⇢ R d and a set of kernels K ( x, ·) represented by a fixed kernel K and an arbitrary center point x 2 R d . In the binary range space, for a range A centered at x, the fraction of the points in A is represented by \| A\ P\| . \| P\| In the kernel range space, it turns to be 1 \| P\| Â p2 P Ks ( p, x ), where Ks ( x, ·) is the corresponding range centered at x with bandwidth s, which is exactly the definition of KDE P ( x ). Thus the definition of #-sample of kernel density estimates (KDE coreset) is the same as #-sample in kernel range spaces. • #-sample in kernel range spaces (#-sample of kernel density estimates) [103]: Given a kernel range space ( P, K), it is a subset Q ⇢ P, such that max \| KDE P ( x ) x 2R d KDE Q ( x )\| = k KDE P KDE Q k•  #, (1.9) #-sample in kernel range spaces can be very useful in many data-intensive applications, since evaluating a kernel density estimate over P directly takes O(n), which can be prohibitively expensive in big data. So #-sample in kernel range spaces gives us another point set Q, such that KDE Q approximates KDE P well. The error is guaranteed within user-defined parameter #. 1.4 KDE in Geometric Inference As we discussed before, kernel density estimates can be used in many areas. Here we give an example to show its power in geometric inference and topological data analysis. Geometry and topology have become essential tools in modern data analysis: geometry to handle spatial noise and topology to identify the core structure. Topological data analysis (TDA) has found applications spanning protein structure analysis [46, 86] to heart modeling [51] to leaf science [110], and is the central tool of identifying quantities like connectedness, cyclic structure and intersections at various scales. Yet it can suffer from spatial noise in data, particularly outliers. Given an unknown compact set S ⇢ R d and a finite point cloud P ⇢ R d that comes from S under some process, geometric inference aims to recover topological and geometric properties of S from P. The offset-based (and more generally, the distance function-based) approach for geometric inference reconstructs a geometric and topological approximation of S by offsets from P (e.g., [23-27]). 8 Kernel density estimates were brought to geometric inference and TDA by paper [106]; as a coauthor of this paper, we first analyzed the stability of kernel density estimates and the kernel distance in the context of geometric inference. We accomplished this by showing that a similar set of properties hold for the kernel distance with respect to a measure µ, (in CCM place of distance to a measure dµ,m [25]), defined as 0 dµK ( x ) = DK (µ, x ) = q k (µ, µ) + k ( x, x ) 2k (µ, x ). (1.10) It satisfies all the distance-like properties and there are further advantages of the kernel distance. (i) Its sublevel sets conveniently map to the superlevel sets of a kernel density estimate. (ii) It is Lipschitz with respect to the smoothing parameter s when the input x is fixed. (iii) As s tends to • for any two probability measures µ, n, the kernel distance is bounded by the Wasserstein distance: lims!• DK (µ, n)  W2 (µ, n). Here W2 is the ⇣R ⌘1/2 Wasserstein distance [137]: W2 (µ, n) = infp 2P(µ,n) Rd ⇥Rd \|\| x y\|\|2 dp ( x, y) between two measures, where dp ( x, y) measures the amount of mass transferred from location x to location y and p 2 P(µ, n) is a transference plan [137]. Most importantly, it has a small coreset representation, which allows for sparse repre- sentation and efficient, scalable computation. In particular, an #-sample in kernel range spaces for a measure µ: Q of µ is a finite point set whose size only depends on # > 0, such that maxx2Rd \| KDE µ ( x ) KDE Q ( x )\| = k KDE µ KDE Q k•  #. These coresets preserve inference results, such as shapes, Betti numbers and persistence diagrams. Persistence diagrams summarize the persistence of all homological features, such as connected components, tunnels, voids, etc. of a topological space in a single diagram. The birth and death times are the y- and x-coordinates of the persistence diagram, thus significant features with high persistence are far from the diagonal. To calculate the persistent homology of a space, the space should first be represented as a simplicial complex, which is very time consuming. While there exist simplicial sparsification approaches, which reduce the size akin to coresets, they still involve the complicated process of building a simplicial complex of the full data set. Using KDE coresets is the first simplification result that preserves topological features without constructing the simplicial complex on the full data set, and thus can speed up TDA for large data significantly. Figure 1.4 give us such an example with persistence diagrams generated by TDA package for R [49]. 9 Distance Function Diagram 0.02 Death 0.01 2 0 0.00 1 Birth 3 0.03 KDE Diagram 1 2 3 0.00 0.01 0.02 0.03 Death Birth KDE Diagram Kernel Distance Diagram Death 0.98 0.97 2 0 0.96 1 Birth 0.99 3 1.00 1.01 4 0 0 1 2 3 Death 4 0.96 0.97 0.98 0.99 1.00 1.01 Birth Figure 1.4. Example with 10,000 points in [0, 1]2 generated on a circle or line with N (0, 0.005) noise; 25% of points are uniform background noise. The generating function is reconstructed with KDE with s = 0.05 (upper left), and its persistence diagram based on the superlevel set filtration is shown (upper middle). A coreset [149] of the same data set with only 1,384 points (lower left) and persistence diagram (lower middle) are shown, again using KDE. This associated confidence interval contains the dimension 1 homology features (red triangles) suggesting they are noise; this is because it models data as iid - but the coreset data is not iid, it subsamples more intelligently. We also show persistence diagrams of the original data based on the sublevel set filtration of the standard distance function (upper right, with no useful features due to noise) and the kernel distance (lower right). 1.5 Outline This dissertation improves on the construction of coresets in kernel density estimates and kernel regression in several ways relevant to improve the efficiency and effectiveness. It builds several algorithmic techniques that can be used for coreset construction problem in kernel smoothing setting. It also illustrates these techniques on several large data sets. Chapter 2 describes randomized and deterministic algorithms for computing #-samples for kernel density estimates with quality guarantees that are orders of magnitude more efficient than previous algorithms. These algorithms do not require knowledge of the kernel or its bandwidth parameter and are easily parallelizable. We demonstrate how to 10 implement our ideas in a centralized setting and in MapReduce, although our algorithms are applicable to any large-scale data processing framework. Extensive experiments on large real data sets demonstrate the quality, efficiency and scalability of our techniques. This chapter is mainly based on [149] with Jeffery Jestes, Jeff M. Phillips and Feifei Li. Chapter 3 is an extension of the problem described in Chapter 1, which investigates the challenges in using L• (or worst case) error, a stronger measure than L1 or L2 . This chapter presents efficient solutions to two linked challenges: how to evaluate the L• error between two kernel density estimates and how to choose the bandwidth parameter for a kernel density estimate built on a subsample of a large data set. This chapter is based on [150] with Jeff M. Phillips. Chapter 4 extends smoothed versions of geometric range spaces to more general types of ranges and then considers approximation of these range spaces through #-nets and #samples. We characterize when size bounds for #-samples on kernels can be extended to these more general smoothed range spaces. We also describe new generalizations for #-nets to these range spaces and show when results from binary range spaces can carry over to these smoothed ones. This chapter is based on [107] with Jeff M. Phillips. Chapter 5 describes coresets for kernel regression. Kernel regression is an essential and ubiquitous tool for non-parametric data analysis, particularly popular among time series and spatial data. The size of the coresets for kernel regression is also independent of the raw number of data points, rather they only depend on the error guarantee, and in some cases the size of domain and amount of smoothing. We evaluate our methods on very large time series and spatial data, and demonstrate that they incur negligible error, can be constructed extremely efficiently and allow for great computational gains. This chapter is based on [151] with Jeff M. Phillips. Chapter 6 concludes the dissertation by providing some applications and future directions. CHAPTER 2 EFFICIENT CORESETS FOR KERNEL DENSITY ESTIMATES 2.1 Background and Related Work Around the 1980s, kernel density estimates became the defacto way in statistics to represent a continuous distribution from a discrete point set [126], with the study initiated much earlier [116]. However, this work often implied brute force (O(n) time) solutions to most queries. The problem of evaluating kernel density estimates is a central problem in statistical physics and numerical analysis. These problems are often posed as n-body simulations where the force-interactions between all pairs of points need to be calculated [4], and the pairwise force is up to a constant described by the Gaussian kernel. This has resulted in many indexing type techniques that up to constant precisions can evaluate the kernel density estimate at all n points in roughly O(n) time. These techniques are sometimes called fast multipole methods [19], and in practice these are typically implemented as quad trees that calculate the distance to roots of subtrees instead of all pairs when the distance becomes far enough. Numerical approximation techniques called the (Improved) Fast Gauss Transform (IFGT) [56,114,145,146] can further improve these approaches. However, the IFGT approach (in general fast multipole methods) is based on heuristics and does not offer formal theoretical guarantees on the approximation-time trade-off. In order to have a formal theoretical guarantee to derive an (#, d)-approximation, random sampling is a baseline method, but it requires O( #12 log 1d ) samples to be included in Q, which could lead to expensive query evaluations, especially for small # and/or d values. A recent technique using discrepancy theory [103] creates a small representation of a kernel density estimate (smaller than the random sampling approach) while still bounding the `• error. It works by creating a min-cost matching of points in P; that is, P is decomposed into \| P\|/2 pairs so that the sum over all distances between paired points is 12 minimized. Then it randomly removes one point from each pair, reducing the size of P by half. This process is repeated until either the desired size subset or the tolerable error level is reached. However, computing the min-cost matching [48] is expensive, so this approach is only of theoretical interest and not directly feasible for large data. Yet this will serve as the basis for a family of our proposed algorithms. A powerful type of kernel is a reproducing kernel [6, 101] (an example is the Gaussian kernel) that has the property that K ( p, q) = h p, qiHK ; that is, the similarity between objects p and q defines an inner-product in a reproducing kernel Hilbert space (RKHS) HK . This inner-product structure (the so-called "kernel trick") has led to many powerful techniques in machine learning; see [119, 125] and references therein. Most of these techniques are not specifically interested in the kernel density estimate; however, the RKHS offers the property that a single element of this space essentially represents the entire kernel density estimate. Motivated by the task of constructing samples from Markov random fields, Chen et al. [30] introduced a technique called kernel herding suitable for characteristic kernels (including Gaussian and Laplace kernels). Bach et al. [7] showed that this algorithm can be interpreted under the Frank-Wolfe framework [33]. Harvey and Samadi [68] further 0 revisited kernel herding in the context of a general mean approximation problem in R d . 0 That is, consider a set P0 of n points in R d , find a subset Q0 ⇢ P0 so that k P0 where P0 and Q0 are the Euclidean averages of P0 and Q0 , respectively. Q0 k  #, Kernel density estimates have been used in the database and data mining community for density and selectivity estimations, e.g., [61, 147]. However, the focus in these works is how to use kernel density estimates for approximating range queries and performing selectivity estimation, rather than computing approximate kernel density estimates for fast evaluations. When the end-objective is to use a kernel density estimate to do density or selectivity estimation, one may also use histograms [60, 75, 81, 109] or range queries [54, 55, 71, 139] to achieve similar goals, but these do not have the same smoothness and statistical properties of kernel density estimates [126]. Nevertheless, the focus of this work is on computing approximate kernel density estimates that enable fast query evaluations, rather than exploring how to use kernel density estimates in different application scenarios (which is a well-explored topic in the literature). 13 2.2 Warm-up: One Dimension Efficient construction of approximate kernel density estimates in one dimension is fairly straightforward. However, it is still worth investigating these procedures in more detail since to our knowledge, this has not been done at truly large scale before, and the techniques developed will be useful in understanding the higher dimensional version. • Baseline method: random sampling (RS). A baseline method for constructing an approximate kernel density estimate in one dimension is random sampling. It is well known that [30, 103] if we let Q be a random sample from P of size O((1/#2 ) log(1/d)), then with probability at least 1 d, the random sample Q ensures that k KDE P KDE Q k•  #. That said, the first technique (RS) follows from this observation directly and just randomly samples O((1/#2 ) log(1/d)) points from P to construct a set Q. In the centralized setting, we can employ the one pass reservoir sampling technique [138] to implement RS efficiently. For large data that is stored in distributed nodes, RS can still be implemented efficiently using the recent results on generating random samples from distributed streams [35]. The construction cost is O(n). The approximate KDE has a size O((1/#2 ) log(1/d)), and its query cost (to evaluate KDE Q ( x ) for any input x) is O((1/#2 ) log(1/d)). RS can be used as a preprocessing step for any other technique, e.g., for any technique that constructs a KDE over P, we run that technique over a random sample from P instead. This may be especially efficient at extremely large scale (where n 1/#2 ) and where sampling can be done in an efficient manner. This may require that we initially sample a larger set Q than the final output to meet the approximation quality required by other techniques. • Grouping selection (GS). A limitation in RS is that it requires a large sample size (sometimes the entire set) in order to guarantee a desired level of accuracy. As a result, its size and query cost becomes expensive for small # and d. Hence, we introduce another method, called the grouping selection (GS) method. It leverages the following lemma, known as the g-perturbation. Lemma 1. Consider n arbitrary values {g1 , g2 , . . . , gn } such that kgi k  g for each i 2 [n]. Then 14 let Q = {q1 , q2 , . . . , qn } such that qi = pi + gi for all pi 2 P. Then k KDE P KDE Q k•  g/s. Proof. This follows directly from the (1/s)-Lipschitz condition on kernel K (which states that the maximum gradient of K is (1/s)), hence perturbing all points by at most g affects the average by at most g/s. Using Lemma 1, we can select one point q in every segment ` of length #s from P and assign a weight to q that is proportional to the number of points from P in `, to construct an #-sample KDE of P. Specifically, GS is implemented as follows. After sorting P if it is not already sorted, we sweep points from smallest to largest. When we encounter pi , we scan until we reach the first p j such that pi + #s < p j . Then we put pi (or the centroid of pi through p j 1) in Q with weight w( pi ) = ( j i )/n. Since Q constructed by GS is weighted, the evaluation of KDE Q ( x ) should follow the weighted query evaluation as specified in equation 1.7. Theorem 1. The method GS gives an #-sample of kernel density estimate of P. Proof. The weighted output of GS Q corresponds to a point set Q0 that has w(q) unweighted points at the same location of each q 2 Q; then KDE Q = KDE Q0 . We claim that Q0 is an #s-perturbation of P, which implies the theorem. To see this claim, we consider any set { pi , pi+1 , . . . , p j 1} of points that are grouped to a single point q 2 Q. Since all of these points are within an interval of length at most #s, each pi+` is perturbed to a distinct point qi0+` 2 Q (at location q) that is at distance gi+`  #s. GS's construction cost is O(n) if P is sorted, or O(n log n) otherwise. Its query cost is O(\| Q\|), which in the worst case could be \| Q\| = \| P\|. And Q may be large depending on how densely points in P are co-located and the values of # and s. However, GS can be used as a postprocessing step on top of any other method, such as using GS over the output of RS. This takes little overhead if the points are already sorted, such as in the output of SS (see below). • Sort-selection (SS). Our best method (SS) offers an #-sample of kernel density estimate using only O( 1# ) samples in one dimension, a significant improvement over random sampling. Consider a one-dimensional sorted point set P = { p1 , p2 , . . . , pn } where pi  15 pi+1 for all i. Let Pj = { pi 2 P \| ( j 1)#n < i  j#n} for integer j 2 [1, d1/#e] such that P = [ Pj . Then the coreset Q = {q1 , q2 , . . . , qd1/#e } such that each q j = pd( j With this coreset, it has the following result: 1 2 ) #n e from P. Lemma 2. The method SS gives an #-sample of kernel density estimate of P, such that k KDE P KDE Q k•  #. We now can construct the SS method. It simply selects pd( j 1 2 ) #n e from P into Q for each j 2 [1, d1/#e]. This requires that P is sorted, and this can be done efficiently at very large scale using external merge sort. However, we can do better. Note that #approximate quantiles for d1/#e quantile values are sufficient to construct Q, and we can use an efficient streaming or distributed algorithm for computing these #-approximate quantiles [34, 57, 74, 87, 88]. In particular, we only need to find the (n #n 2 )th 3#n 5#n #n 2 th, 2 th, 2 th, ..., quantile values from P. And it is easy to verify that #-approximations of these quantiles are sufficient to establish the result in Lemma 2. Using the #-approximate quantiles, SS has a construction cost of O(n log 1# ); otherwise its construction cost is O(n log n). In either case, its size is only O( 1# ) and its query cost is also just O( 1# ). 2.2.1 Efficient Evaluation Once we have obtained a set Q above so KDE Q approximates KDE P , we need to efficiently answer queries of KDE Q ( x ) for any x 2 R. The first obvious choice is a brute force computation (BF) where we directly calculate 1 \| Q\| Âq2Q K (q, x ). This has little overhead and its cost is obviously O(\| Q\|). It is most efficient if \| Q\| is particular small. A second approach (MP) is to use the one-dimensional variant of multipole methods. We build a B-tree (or binary tree if Q fits in memory) on Q. Each node v will store the weight (or count) wv of all nodes in the subtree and the smallest vs and largest vl coordinates of the subtree. We traverse the tree as follows, starting at the root. If x 2 [ws , wl ], visit each child and return their sum to the parent. If \|K (vs , x ) wv K ((vl K (vl , x )\|  #, then return vs )/2, x ) to the parent. Else, visit each child and return their sum to the parent. This approach may improve the query evaluation time in practice, especially when \| Q\| is large. 16 2.3 Two Dimensions In R2 , we first describe baseline methods from the literature or based on simple extensions to existing methods. We then introduce our new methods. The first uses a randomized technique based on matchings, and the second is deterministic and based on space-filling curves. 2.3.1 Baseline Methods • Random Sampling (RS). The first baseline (labeled RS) is to simply random sample a set Q from P. The same bound of O((1/#2 ) log(1/d)) on the sample size from one dimension still holds in two-dimensions, although the constant in the big-Oh is likely larger by a factor of about 2. • Improved Fast Gauss Transform (IFGT). A class of methods is based on fast multipole methods [19]. In practice in two dimensions these are implemented as quad trees which calculate the distance to roots of subtrees instead of all pairs when the distance becomes far enough. The Improved Fast Gauss Transform (IFGT) [56,114,145,146], is the state-ofthe-art for fast construction and evaluation of approximate kernel density estimates (although only with Gaussian kernels). It improves multipole approaches by first building a k-center clustering over the data set P, and then just retaining a Hermite expansion of the kernel density estimates for the points associated with each k-centers. However, the IFGT method is based on heuristics and does not offer any formal theoretical guarantees on the approximation-size trade-off. As a result, it involves a number of parameters that cannot be easily and intuitively set in order to derive a desired level of efficiency and accuracy tradeoffs. • Kernel herding (KH). Yet another possible approach is to explore the reproducing kernel Hilbert space (RKHS). As discussed in Section 2.1, the technique "kernel herding" [7, 30, 68] showed that iteratively and greedily choosing the point p 2 P, which when added to Q most decreases the quantity k KDE P KDE Q k in RKHS. However, this still takes O(\| Q\|n) time to construct \| Q\| since at each step each point in P \ Q needs to be evaluated to determine how much it will decrease the error. 17 2.3.2 Randomized MergeReduce Algorithm An interesting theoretical result built on discrepancy [28, 92] theory was recently proposed in [103] for constructing a small set Q so KDE Q approximates KDE P . It extends a classic method for creating #-approximations of Chazelle and Matousek [29] (see also [28, 92]), to #-sample of kernel density estimates. These results are mostly of theoretical interest, the straightforward adaption is highly inefficient. Next we explain and overcome these inefficiencies. 2.3.2.1 The MergeReduce framework Our algorithm leverages the framework of Chazelle and Matousek [29] and its generalizations. We first describe our overall framework, and then elaborate the most critical reduce step in further details. Roughly speaking, our algorithm repeatedly runs a mergethen-reduce step. Hence, we denote this framework as the MergeReduce algorithm. Suppose the desired size of the compressed set Q is \| Q\| = k. The framework proceeds in three phases: an initialization phase, the combination phase, and the optional clean-up phase. The first phase is implicit, and the last phase is of theoretical interest only. In the initialization phase, we arbitrarily decompose P into disjoint sets of size k; call these P1 , P2 , . . . , Pn/k . Since this is arbitrary, we can group data that is stored together into the same partition. This works well for distributed data or streaming data where consecutively encountered data are put in the same partition. The combination phase proceeds in log(n/k ) rounds. In each round, of the remaining sets of size k, it arbitrarily pairs them together, which we dub the merge step. For each pair, say Pi and Pj , it runs a reduce step on the union of 2k points to create a single set of size k. At a high level, the reduce step has two parts. The first part is what we call a matching operation. It produces k pairs of points in a certain fashion over the 2k input points. The second part is trivial: it randomly selects one point from each pair to produce the final output of size k. The merging operation is the most critical part in a reduce step, and we will elaborate after presenting the overall MergeReduce framework. That said, after i rounds of merge and reduce, there are (n/k )/2i sets of size k. This continues until there is one set remaining. Note that again, the fact that we can pair sets (Pi and Pj ) arbitrarily in a merge step is extremely convenient in a distributed or streaming 18 setting. The clean-up phase is not needed if the combination phase is run as above; see Section 2.3.2.4 for remaining details. Importantly, we also note that this entire MergeReduce framework can be preceded by randomly sampling O(1/#2 ) points from P which are then treated as the input. Then only log(1/#) merge-reduce rounds will be needed. • The min-cost matching. The key part of this framework is to construct a matching in a set of points P. Suppose \| P\| = n, a matching consists of n 2 pairs of points, and every point in P belongs to exactly one pair in a matching. The recent result from [103] implies that one can use the min-cost matching to derive an #-sample of kernel density estimate. Note that a min-cost matching is a matching so that the sum of distances between matched points is minimized. Unfortunately, by using a min-cost matching, the algorithm in [103] is impractical. Here is why. It is well known that a min-cost matching over n points can be done in O(n3 ) time using Edmonds' Blossom algorithm [48]. This quite complicated algorithm involves nonregular recursion, and it is clearly not scalable for large data sets. We use the state-of-theart implementation [80] as a baseline for the matching operation and label it as BlossomMR (MergeReduce with Blossom min-cost matching). This implementation of the Blossom algorithm requires first calculating K ( p, p0 ) for all O(n2 ) pairs p, p0 2 P as input, which is part of the overall cost (which is O(n3 )). There have been theoretical improvements [136] to Edmonds' algorithm for points in R2 . These algorithms are considerably more complicated than that of Edmonds and no known efficient implementation exists; most likely the improvements are theoretical in nature only. We have the following results concerning the Blossom-MR algorithm, and the BlossomMR+RS algorithm that first randomly samples O(1/#2 ) points. Theorem 2. For a point set P ⇢ R2 with n points, we can construct Q giving an #-sample of KDE (with constant probability) in • O( #n2 2 log n log • O(n + 1 #4 1 #) time and \| Q\| = O( 1# log n q log 1# ) using Blossom-MR, and log3 1# ) time and \| Q\| = O( 1# log1.5 1# ) using Blossom-MR+RS. 19 Lastly, the following greedy algorithm provides a two-approximation to the cost of the min-cost matching [42]. It finds the closest pair of points in P, matches them, removes them from P, and repeats. This algorithm can be implemented in O(n2 log n) time as follows. Calculate all O(n2 ) pairwise distances and place them in a priority queue. Repeatedly remove the smallest pair from the queue (in O(log n) time). If both points are still in P, match them and mark them as no longer in P. We refer to the merge-reduce framework with this matching algorithm as the Greedy-MR method. Greedy-MR does improve the running time over Blossom-MR, however, it is still quite expensive and not scalable for large data. Furthermore, the result produced by Greedy-MR is not known to provide any approximation guarantees on the kernel density estimate. 2.3.2.2 More Efficient Reduce Step The Blossom-MR algorithm and its heuristic variant Greedy-MR are too expensive to be useful for large data. Thus, we design a much more efficient matching operation, while still ensuring an #-sample of kernel density estimate. For any matching M, we produce an edge map EM of that matching M as EM = {e( p, q) \| ( p, q) 2 M} where e( p, q) is an undirected edge connecting p and q. Given a disk B, for e( p, q) 2 EM define e( p, q) \ B as follows. • If both p and q are not covered by B, e( p, q) \ B = ∆. • If both p and q are covered by B, e( p, q) \ B = e( p, q). • If either p or q is covered by B but not both. Suppose p is covered by B, and e( p, q) intersects the boundary of B at a point s, e( p, q) \ B = e( p, s). Then, we define EM \ B as: EM \ B = {e( p, q) \ B \| e( p, q) 2 Em }. (2.1) An example of EM \ B is shown in Figure 2.1. Essentially, we only want it to consider edges with at least one endpoint within B, and only the subset of the edge that is within B. We observe that in order to produce an #-sample of kernel density estimate, the property the matching requires is in regards to any unit disk B. Specifically, we want CM,B = to be small. Let CM = maxB CM,B . Â e( p,q)2 E M \ B kp q k2 (2.2) 20 Figure 2.1. Intersection between a matching and a disk, solid red lines are included in EM \ B. Left shows Grid matching and right shows Z-order with every other edge in a matching. The result in [103] says if M is the min-cost matching (minimizes Â( p,q)2 M k p qk), then CM = O(1). But calculating the min-cost matching, both exactly and approximately, is expensive as we have shown in Section 2.3.2.1. • The grid matching. Here we present a novel solution which does have a bound on CM and is efficient, including at very large scales. We progress in rounds until all points are matched. Starting with i = 0, in round i, we consider a grid Gi on R2 where each grid cell p has edge length l#,i = 2s#2i 2 . Define cell gr,c 2 Gi as [rl#,i , (r + 1)l#,i ] ⇥ [cl#,i , (c + 1)l#,i ] for integers r, c. Inside of each cell, match points arbitrarily. Only the unmatched points (one from any cell with an odd number of points) survive to the next round. Each cell in Gi+1 is the union of 4 cells from Gi , thus in rounds i > 0, it can have at most 4 points. We refer to this matching algorithm as Grid; see an example in Figure 2.1. For simpler analysis, we assume s > #c for some constant c 1; typically s # and # < 1, so this assumption is very mild. Alternatively, if we do not know s ahead of time, we can recursively divide points that fall in the same grid cell into smaller and smaller levels until at most two are left in each cell (like a quad tree), and then move back up the recursion stack only with unmatched points; a careful implementation can be done in O(n log n) time [65]. Let P0 be unmatched points after round 0, let P0 = P \ P0 , and Q 21 the final output. Lemma 3. Let M0 ( P0 ) be the matching on P0 in Grid. For each edge ( p, q) 2 M0 ( P0 ), let P̂ be where (w.l.o.g.) q is moved to location p. Then P̂ is a #s/2-perturbation of P0 . p Proof. Since all points matched in round 0 are in a grid cell of size at most #s 2/4, the p p point q in any edge is moved at most 2 · #s 2/4 = #s/2. Lemma 4. Let M0 ( P0 ) be the matching by Grid on P0 . Then CM0 = O(log(1/#)) and Grid runs in O(n log 1# ) time. Proof. In each round, there are at most 2 matched pairs per grid cell. Each such pair has p edge length at most 2l#,i = s#2i 1 , and there are at most (1/s#2i 5/2 )2 grid cells that intersect any unit ball. Thus the total squared-length of all matchings in round i is at p most ((1/s#2i 5/2 )2 · (s#2i 1 )2 = 2 2. After log(1/s#) + 1 rounds, the total length of all p matchings is at most 2 2(log(1/s#) + 1), and in any ball, there are at most 4 remaining unmatched points. The last 4 points can each account for at most a squared-length of 4 within a unit ball B, so the total weight of edges in any unit ball B is at most CM  p 2 2 log(1/s#) + 19 = O(log(1/#)). Each round takes O(n) time, and we can match points arbitrarily in O(n) time after the log(1/s#) + 1 rounds. We observe in most common scenarios CM is close to 1. With the Grid matching algorithm, we can instantiate the MergeReduce framework to get a Grid-MR algorithm, or if we first sample a Grid-MR+RS algorithm. Theorem 3. For a point set P ⇢ R2 with n points, we can construct Q giving an #-sample of KDE (with constant probability) in • O(n log 1# ) time and \| Q\| = O( 1# log n log1.5 1# ) using Grid-MR, and • O(n + 1 #2 log 1# ) time and \| Q\| = O( 1# log2.5 1# ) using Grid-MR+RS. Since Grid takes only O(n log 1# ) time, the benefit of the initialization phase to split the data set into n/k pieces does not out-weigh its overhead in a centralized setting. In particular, we just run Grid once on all ni points remaining in round i. This does not affect the runtime or error bounds. 22 Compared to the Blossom-MR and Greedy-MR algorithms, Grid-MR produces an #sample of kernel density estimate with about the same size, but with much cheaper construction cost. Grid-MR's running time only linearly depends on n, making it an ideal candidate to scale out to massive data. 2.3.2.3 Streaming and Distributed MergeReduce Since the reduce step (the key computational component of the MergeReduce framework) is only run on select subsets, this allows the full framework to generalize to distributed and streaming settings. • Streaming variant. The streaming algorithm follows techniques for approximate range counting [8, 128]. Consecutive points are assigned to the same partitions Pi , and we pair and reduce partitions whenever there are two that represent the same number of points (one that has been merged i times represents k2i points). This means we only need to keep log nk partitions, and thus only O(k log nk ) points in memory at any given time. The dependence on n can be completely removed by first randomly sampling. The final structure of the streaming algorithm has weighted points, where if a point is in a partition that has been reduced i times, its weight is 2i . These weights can be removed to P\| create just 5\| Q\| points instead of \| Q\| log \|\|Q by running the in memory matching algorithm \| on weighted points [90]. In particular, we can modify a matching algorithm to work with weighted points, specifically consider the Grid algorithm. In the 0th phase of Grid, a point with weight 2i represents 2i points that all fall in the same cell, and can be matched with themselves (this can be done by ignoring this point until the ith phase when its weight is 1). • Distributed variant. This framework is efficient on distributed data. Use the streaming algorithm on each distributed chunk of data, and then pass the entire output of the streaming algorithm to a central node, which can merge and reduce the union. The error caused by reducing on the central node is already accounted for in the analysis. Again, the dependence on \| P\| can be removed by first sampling. 23 2.3.2.4 Other Extension An alternative version of the combination phase is possible for the MergeReduce algorithm. Specifically, it considers some reduce step that takes time O(n b ) on a set of size n, and instead sets k = 4( b + 2)\| Q\| (where \| Q\| is the desired size of the final output), and on every ( b + 3)th round, pairs sets but does not reduce them. Then the clean-up phase is used to reduce the single remaining set repeatedly until it is of size \| Q\|. When b is a constant, this saves a O(log n) from the size of the output Q (or O(log 1# ) if we sampled first) [29]. More specifically, the output of this MergeReduce variant is then a set of size \| Q\| = O(CM 1# log0.5 1# ) for a reduce step that uses a matching algorithm which produces an output with cost CM . If we use Grid, then Grid-MR produces an output of size \| Q\| = O( 1# log1.5 1# ). And Blossom-MR outputs Q of size \| Q\| = O( 1# log0.5 1# ) in O( #n2 log 1# ) time. But in practice, this variant is more complicated to implement and usually the overhead out-weighs its benefit. Hence we do not use this variant in this paper. 2.3.3 Deterministic Z-Order Selection Inspired by one-dimensional sort-section (SS) and randomized two-dimensional GridMR algorithm, we propose a new deterministic two-dimensional technique based on space filling curves. A space filling curve puts a single order on two- (or higher-) dimensional points that preserves spatial locality. They have great uses in databases for approximate high-dimensional nearest-neighbor queries and range queries. The single order can be used for a (one-dimensional) B+ tree, which provides extremely efficient queries even on massive data sets that do not fit in memory. In particular, the Z-order curve is a specific type of space filling curve that can be interpreted as implicitly ordering points based on the traversal order of a quad tree. That is, if all of the points are in [0, 1]2 , then the top level of the quad tree has four children over the domains c1 = [0, 12 ] ⇥ [0, 12 ], c2 = [ 12 , 1] ⇥ [0, 12 ], c3 = [0, 12 ] ⇥ [ 12 , 1], and c4 = [ 12 , 1] ⇥ [ 12 , 1]. And each child's four children itself divide symmetrically as such. Then the Z-order curve visits all points in the child c1 , then all points in c2 , then all points in c3 , and all points in c4 (in the shape of a backwards ‘Z'); and all points within each child are also visited in such a Z-shaped order. See an example in Figure 2.1. Thus, this defines a complete order on 24 them, and the order preserves spatial locality as well as a quad tree does. The levels of the Z-order curve (and associated quad tree) are reminiscent of the grids used in the matching technique Grid. This insight leads to the following algorithm. Compute the Z-order of all points, and of every two points of rank 2i and 2i + 1, discard one at random; repeat this discarding of half the points until the remaining set is sufficiently small. This approach tends to match points in the same grid cell, as with Grid, but is also algorithmically wasteful since the Z-order does not change between rounds. Thus, we can improve the algorithm by using insights from SS. In particular, we just run SS on the Z-order of points. So to collect k = \| Q\| points, let the ith point retained qi 2 Q be the point in rank order d(i points in the Z-order. 1 \| P\| 2 ) k e. This selects one point from each set of \| P\| k In fact, by setting k = O( 1# log n log1.5 1# ), if we randomly select one point from each Z-order rank range [(i 1) \| Pk \| , i \| Pk \| ] (call this algorithm Zrandom), then the resulting set Q has about the same guarantees as the Grid-MR algorithm, including Zrandom+RS, which preprocesses by random sampling. Theorem 4. For a point set P ⇢ R2 with n points, we can construct Q giving an #-sample of KDE (with constant probability) in • O(n log n) time and \| Q\| = O( 1# log n log1.5 1# ) using Zrandom, and • O(n + 1 #2 log 1# ) time and \| Q\| = O( 1# log2.5 1# ) using Zrandom+RS. Proof. We prove this result by imagining that Zrandom does something more complicated than it actually does in order to relate it to Grid-MR. That is, we pretend that instead of just selecting a single point at random from each range [(i 1)\| P\|/\| Q\|, i \| P\|/\| Q\|] in the Z-order rank, we proceed in a series of log(\| P\|/\| Q\|) rounds, and in each round, the set P is reduced in size by half. Specifically, in each round, out of every two consecutive points in the Z-order (rank 2i and 2i + 1), we retain one at random. Since the Z-order is consistent across rounds, this results in a random point in the rank interval [(i as desired. 1)\| P\|/\| Q\|, i \| P\|/\| Q\|] Now if we consider the levels of the implicit quad tree defining the Z-order, this is equivalent to the grid used in Grid-MR. In each round, there are at most 3 edges within grid cell at level j, but that are not entirely in levels smaller than j. Since we still only care 25 about O(log(1/#)) levels of the grid, the squared distance of these edges in that cell level accounts for at most O(1) inside a unit square. Thus, CM is still at most O(log(1/#)). The remainder of the proof is the same as in Theorem 3. However, we find the implementation of the following deterministic algorithm to be more effective; but as the randomness is necessary for the proof, we do not provide bounds. The construction of the Z-order can be done efficiently using its interpretation as bitinterleaving. Specifically, the z-value of a point is calculated by interleaving bits from the most significant bit to the least significant bit in the binary representation of a point's coordinates. For example, a two-dimensional point (3, 5) expressed in its binary representation is (011, 101). Its z-value is then (011011)2 = 27. Then we do not need to completely sort all points by z-value in order to select the proper k points, we can just approximately do this so that we have one point selected from each set of \| P\| k points in the sorted order. This can be done using an #-approximate quantiles algorithm that is accurate up to # = 1/k. This guarantees the existence in the quantile structure and its identification of at least one point within every consecutive set of \| P\| k points. We can just return this point for each range [(i 1) \| Pk \| , i \| Pk \| ] of ranks. We refer to this method as Zorder. Note that, following the discussion for SS in Section 2.2, we can use any of the existing efficient, large-scale #-approximate quantiles algorithms in either the centralized or the distributed setting. 2.3.4 Efficient Query Evaluation We can do efficient evaluations in R2 similar to BF and MP in R1 , as discussed in Section 2.2.1. In fact, BF is exactly the same. MP uses a quad tree instead of a binary tree. It stores a bounding box at each node instead of an interval, but uses the same test to see if the difference between K ( x, ·) is within # on the furthest and closest point in the bounding box to decide if it needs to recur. 2.3.5 Parallel and Distributed Implementation As with one-dimensional methods, our two-dimensional methods can be efficiently implemented in distributed and streaming settings. Any algorithm using the MergeReduce framework can run in distributed and parallel fashion. As a result, Grid-MR, Zrandom, and Zorder are very efficient in any distributed and parallel environments, and they ex- 26 tend especially well for the popular MapReduce framework where data in each split is processed in a streaming fashion. Among the baseline methods, RS can easily be implemented in any distributed and parallel environments. It takes some effort to make IFGT run in distributed and parallel fashion, but it is possible; we omit the details. Lastly, the KH can be approximated (without any bounds) by running its greedy step in each local piece of data independently, and then merging the results from local nodes. 2.3.6 Higher Dimensions All two-dimensional algorithms described can be extended to higher dimensions. In particular, KH, RS, Blossom-MR, Greedy-MR extend with no change, while Grid-MR and Zorder-MR extend in the obvious way of using a d-dimension grid or space-filling curve. In R d , the theoretical size bounds for RS increases to O( #12 (d + log 1d )) [78]; Grid-MR, and Zrandom-MR increases to O(dd/2 /#2 4 d +2 d log1+ d+2 1 # log n) (and a log 1# factor less for Blossom-MR). The increase in the second set is due to only being able to bound d CM = max B Â e( p,q)2 E M \ B kp qkd instead of equation (2.2) since the number of grid cells intersected by a unit ball now grows exponentially in d, and thus we need to balance that growth with the dth power of the edge lengths. The stated bounds, then results from [103] with an extra log n factor (which can be turned into a log 1# by first random sampling) because we do not use the impractical process described in Section 2.3.2.4. In no case does the MergeReduce framework need to be altered, and so the construction times only increase by a factor d. 2.4 Experiments We test all methods in both the single-thread, centralized environment and the distributed and parallel setting. In the centralized case, we implemented all methods in C++ and obtained the latest implementation of the IFGT method from the authors in [114, 145, 146]. We then used MapReduce as the distributed and parallel programming framework. In particular, we implemented and tested all methods in a Hadoop cluster with 17 machines. The centralized experiments were executed over a Linux machine running a single Intel i7 cpu at 3.20GHz. It has 6GB main memory and an 1TB hard 27 disk. The distributed and parallel experiments were executed over a cluster of 17 machines running Hadoop 1.0.3. One of the 17 machines has an Intel Xeon(R) E5649 cpu at 2.53 GHz, 100 GB of main memory, and a 2TB hard disk. It is configured as both the master node and the name node of the cluster. The other 16 machines in the Hadoop cluster (the slave nodes) share the same configuration as the machine we used for the centralized experiments. One TaskTracker and DataNode daemon run on each slave. A single NameNode and JobTracker run on the master. The default HDFS (Hadoop distributed file system) chunk size is 64MB. • Data sets. We executed our experiments over two large real data sets. In two dimensions, we used the OpenStreet data from the OpenStreetMap project. Each data set represents the points of interest on the road network for a US state. The entire data set has the road networks for 50 states, containing more than 160 million records in 6.6GB. For our experiments, we focus on only the 48 contiguous states, excluding Hawaii and Alaska. Each record is a two-dimensional coordinate, represented as 2 4-byte floating points. We denote this data as the US data set. In one dimension, we employed the second real data set, Meme, which was obtained from the Memetracker project. It tracks popular quotes and phrases which appear from various sources on the Internet. The goal is to analyze how different quotes and phrases compete for coverage every day and how some quickly fade out of use while others persist for long periods of time. A record has 4 attributes, the URL of the website containing the memes, the time Memetracker observed the memes, a list of the observed memes, and links accessible from the website. We preprocess the Meme data set, and convert each record to have an 8-byte double to represent the time of a single observed meme and the URL of the website which published the meme, e.g. if an original record contained a list of 4 memes published at a given time at a website, 4 records would be produced in the new data set. We view these records as a set of timestamps in 1d, reflecting the distribution of the Meme's publication time. After this preprocessing step, the Meme data set contains more than 300 million records in 10.3GB. In both 1d and 2d, whenever needed, we randomly sample records from the full US or the full Meme data set to obtain a data set of smaller size. Figure 1.1 visualizes the kernel density estimates built by our MergeReduce algorithms in 1d and 2d, over the full Meme 28 and US data sets, respectively (but only very few data points were plotted, to ensure that the figures are legible). • General setup. In all experiments, unless otherwise specified, we vary the values of one parameter of interest, while keeping the other important parameters at their default values. Also by default, we randomly generate 5,000 test points to evaluate the accuracy of an approximate kernel density estimate. Among these 5,000 points, 4,000 were randomly selected from the data set P, and the other 1,000 were randomly selected from the domain space M of P (but not from P itself). We use err to denote the observed `• error from an approximate kernel density estimate Q, which is computed from the evaluations of these 5,000 test points in KDE Q and KDE P , respectively. We try to compare different methods by setting a proper # value (the desired error in theory) for each of them so that they achieve the same observed error. All experiments report the average of 10 random trials. 2.4.1 Two Dimensions: Centralized • Default setup. Our default data set is a US data set with 10 million records. The default failure probability d for the random sampling method (RS) is set to 0.001. To save space in the legend in all figures, we used G-MR and Z to denote the Grid-MR and Zorder methods, respectively, and method+RS to denote the version of running method over a random sample of P (of size O( #12 log 1d )), instead of running method over P directly. The default s value in any kernel density estimate is 200, on a domain of roughly 50,000 ⇥ 50,000. • Our method+RS. We first study the effect of running our methods over a random sample of P, when compared against the results from running them directly over P. Figure 2.2 shows the results when we vary the value of the common input parameter for all of them, #. Not surprisingly, as shown in Figure 2.2(a), the observed errors in all methods are smaller than the desired error #. All methods produce smaller observed errors when smaller # values were used. Furthermore, under the same # value, G-MR+RS and Z+RS produce higher errors than their respective counterpart does, namely, G-MR and Z. However, the difference is fairly small. In fact, when err is about 10 no difference between G-MR+RS (Z+RS) and G-MR (Z). 2, there are almost 29 −2 10 12 Size (bytes) −3 10 6 G-MR G-MR+RS Z Z+RS 10 8 G-MR G-MR+RS Z Z+RS 5 10 4 10 6 −4 10 0.005 0.01 0.05 0.1 ε (a) err vs. #. 4 0.005 3 0.01 0.05 0.1 ε (b) Construction time. 10 0.005 7 10 Communication (bytes) 14 G-MR G-MR+RS Z Z+RS Time (seconds) err 10 G-MR G-MR+RS Z Z+RS 6 10 5 10 4 10 3 0.01 0.05 ε (c) Size. 0.1 10 0.005 0.01 0.05 0.1 ε (d) Communication. Figure 2.2. Effect of guaranteed error # on the centralized G-MR, G-MR+RS, Z, Z+RS. More importantly, however, running a method over a random sample of P saves valuable construction time as shown in Figure 2.2(b). Figure 2.2(c) indicates that the sizes of the final approximate kernel density estimates constructed by different methods are almost the same. This is because whether running a method over P or a random sample of P, the final size of the kernel density estimate Q is largely determined by # only. This also means that the query time of these methods is almost the same, since the query time depends on only \| Q\|. Hence, we have omitted this figure. Finally, we also investigate the impact to the communication cost if we run these meth- ods in a cluster. Figure 2.2(d) shows that this impact is not obvious. Since G-MR and Z are very effective in saving communication cost already, collecting a random sample first does not lead to communication savings. In contrast, doing so might even hurt their communication cost (if the sample size is larger than what they have to communicate in our MergeReduce framework). Nevertheless, all methods are very efficient in terms of the total bytes sent: a few megabytes in the worst case when # = 0.005 over a cluster for 10 million records in P. In conclusion, these results show that in practice, one can use our method+RS to achieve the best balance in construction time and accuracy for typically required observed errors. • Our method vs. baseline methods. We first investigate the construction cost of different alternatives in instantiating our MergeReduce framework, with a different algorithm for the matching operation. We also include Kernel Herding (KH) as introduced in Section 2.3.1. In order to let these baseline methods complete in a reasonable amount of time, we used smaller data sets here. From the analysis in Section 2.3.2, Blossom-MR (denoted as B-MR, representing the theoretical algorithm [103]) with a O(nk2 ) cost for output size k and Greedy-MR with a O(nk log n) cost are much more expensive than G-MR. Similarly, 30 KH with a O(nk) cost is also much more expensive than G-MR which runs in roughly O(n log k ) time. This is even more pronounced in Figures 2.3(a) and 2.3(b), showing the construction time of different methods when varying the observed error err and the size of the data sets, respectively. G-MR is several orders of magnitude faster and more scalable than the other methods. So the only competing baseline methods we need to consider further are the RS and IFGT methods. We compare these methods against our methods, G-MR+RS and Z+RS in Figure 2.4, using the default 10 million US data set. Figure 2.4(a) indicates that, to achieve the same observed error, RS is the most efficient method in terms of the construction time. However, our methods G-MR+RS and Z+RS are almost as efficient. In contrast, IFGT is almost one order of magnitude slower. In terms of the query time, Figure 2.4(b) shows that all methods achieve a similar query time given the same observed error, though IFGT does slightly better for very small err values. Note that we used the multipole (MP) query evaluation method for the kernel density estimates built from RS, G-MR+RS, and Z+RS. On the other hand, Figure 2.4(c) shows that the kernel density estimates built from both IFGT and RS have much larger size than that produced in our methods G-MR+RS and Z+RS, by 2 to 3, and 1 to 2 orders of magnitude, respectively. Finally, Figure 2.4(d) indicates that all methods have very good scalability in their construction time when the data set size increases from 1 million to 20 million, but G-MR+RS, Z+RS, and RS are clearly much more efficient than IFGT. In conclusion, our methods are almost as efficient as RS in terms of building a kernel 3 10 4 G-MR B-MR KH Greedy-MR Time (seconds) Time (seconds) 10 2 10 0 10 1 10 −1 10 −3 −2 10 −4 10 G-MR B-MR KH Greedy-MR 10 −3 −2 10 10 −1 10 5 15 err (a) Build time vs. err. Figure 2.3. Centralized G-MR, B-MR, Greedy-MR, KH. 50 n (x102) (b) Build time vs. n. 100 31 3 10 0.8 Time (seconds) Time (seconds) Z+RS 1 10 0 10 −4 10 −3 10 err −2 10 (a) Construction time vs err. G-MR+RS IFGT RS Z+RS 7 10 0.6 0.4 G-MR+RS IFGT RS G-MR+RS IFGT RS Z+RS 2 Time (seconds) G-MR+RS IFGT RS Size (bytes) 2 10 5 10 Z+RS 10 1 10 0 10 0.2 0 −4 10 −1 10 3 −3 10 err (b) Query Time. −2 10 10 −4 10 −3 10 err (c) Size. −2 10 1 5 10 6 n (x10 ) 20 (d) Construction time vs n. Figure 2.4. Effect of measured `• error err and N on centralized IFGT, RS, G-MR+RS, Z+RS. density estimate, and they are much more efficient than IFGT. Our methods also share similar query time as IFGT and RS, while building much smaller kernel density estimates (in size) than both IFGT and RS. 2.4.2 Two Dimensions: Parallel and Distributed • Default setup. In this case, we change the default data set to a US data set with 50 million records, while keeping the other settings the same as those in Section 2.4.1. Furthermore, since IFGT is much slower in building a kernel density estimate (even more so for larger data sets), and it is also a heuristic without theoretical guarantees (in contrast to the other 3 methods), in the distributed and parallel setting, we focus on comparing our methods against the RS method. Moreover, IFGT works only for the Gaussian kernel and needs to be provided the bandwidth s ahead of time, whereas our methods and RS do not. For space, we omit experiments showing varying s has mild effects on our algorithms and RS, but can lead to strange effects in IFGT. • Our methods vs. RS. We compare the performance of our methods, G-MR+RS and Z+RS, against RS on the aforementioned Hadoop cluster. Figure 2.5 reports the results when we vary the observed error err for different methods (by adjusting their # values properly so they output roughly the same err). Figure 2.5(a) shows that RS is the most efficient method to construct an approximate kernel density estimate for small err values, but G-MR+RS and Z+RS become almost as efficient as RS once err is no smaller than 10 4 which is sufficient for many practical applications. In those cases, all three methods take less than 40 seconds to construct an approximate KDE for 50 million records. All methods are highly communication-efficient, as shown in Figure 2.5(b). There are almost 32 9 RS Time (seconds) 100 80 60 40 20 −5 10 10 G-MR+RS Z+RS RS 7 10 6 10 −4 −3 10 −2 10 10 10 −5 10 7 10 6 10 4 −4 −3 10 10 −2 10 10 −5 10 −4 −3 10 err (b) Communication. 1 RS 10 err (a) Construction time. Z+RS G-MR+RS 5 5 4 10 8 10 8 10 Time (seconds) Z+RS Size (bytes) G-MR+RS Communication (bytes) 120 10 −2 10 G-MR+RS RS 0.6 0.4 0.2 −5 10 −4 −3 10 10 −2 10 err err (c) Size. Z+RS 0.8 (d) Query time. Figure 2.5. Effect of measured `• error err on distributed and parallel G-MR+RS, Z+RS, RS. no difference among the 3 methods: they communicate only a few MBs over the cluster to achieve an err of 10 4, and tens or hundreds of KBs for err between 10 2 and 10 3. In terms of the size of the approximate KDE Q , not surprisingly, RS is always the largest. By our analysis in Sections 2.3.2 and Sections 2.3.3, \| Q\| is O( #12 log 1d ) for RS, and only O( 1# log2.5 1# ) for both G-MR+RS and Z+RS. This is clearly reflected in Figure 2.5(c), where \| Q\| is 1-2 orders of magnitude larger from RS than from G-MR+RS and Z+RS. Finally, we investigate the query time using Q in Figure 2.5(d). In general, the query time should be linear to \| Q\|. But in practice, since we have used the fast query evaluation technique, the multipole (MP) method as shown in Section 2.3.4, all three methods have similar query time. We thus conclude that our methods, G-MR+RS and Z+RS, perform better than RS. More importantly, they also have much better theoretical guarantees (in order to achieve to same desired error # in theory), which is critical in practice since users typically use a method by setting an # value, and for the same # value, our methods will outperform RS by orders of magnitude (O( 1# log2.5 1# ) vs. O( #12 log 1d )). Nevertheless, to be as fair as possible, we experimentally compared methods by first running a few trials to set a proper # value so each method has roughly the same observed error err. We also study the scalability of these methods in Figure 2.6 by varying the size of the data from 30 million to 100 million records. Not surprisingly, all three methods are extremely scalable in both their construction time (almost linear to n) and communication cost (almost constant to n). Communication only increases by 0.8 ⇥ 105 bytes when the total communication is about 9 10 ⇥ 105 bytes; such increase is due to the overhead in Hadoop to handle more splits (as n increases), rather than the behavior of our algorithms, which is independent from n in communication cost. 33 5 9.6 Communication (bytes) Time (seconds) 40 35 30 25 G-MR+RS 30 50 70 n (x106) Z+RS RS 100 (a) Time. x 10 9.4 9.2 9 8.8 G-MR+RS 30 50 Z+RS 70 n (x106) RS 100 (b) Communication. Figure 2.6. Effect of n on G-MR+RS, Z+RS, RS. 2.4.3 Two Dimensions: Is RS Good Enough? To show our advantages over random sampling, we provide more centralized experiments here for higher precision. The trends in the parallel and distributed setting will be similar. Getting higher precision results from a kernel density estimate can be very desirable. The error of kernel density estimates over samples is with respect to KDE P ( x ), but for large spatial data sets, often only a small fraction of points have non-negligible effect on a query x, so dividing by \| P\| can make KDE P ( x ) quite small. Here it can be of critical importance to have very small error. Another use case of kernel density estimates is in n-body simulations in statistical physics [4], where high precision is required to determine the force vector at each step. Furthermore, note that a user typically uses these methods with a desirable error as the input, which is set as the input error parameter, the # value; even though the observed error err on a data set may be smaller than #. In that case, all of our methods have (roughly) a O( 1# / log 1# ) factor improvement in the KDE size, which is a critical improvement especially when # needs to be small (for high precision applications). We observe experiments in Figure 2.7 which compares G-MR+RS and Z+RS with RS in terms of construction time and size of the samples. (Note that figures for query time were omitted for the interest of space; but not surprisingly, they are roughly linear to the KDE size). We plot the figures based on input error parameter # and observed error err. For the plot with respect to # (Figure 2.7(a), 2.7(c)), when # becomes smaller than 5 ⇤ 10 4, the construction time and size for RS remain constant as the sample size needed for RS becomes larger than the size of the original data set. Since we then don't need to sample, 34 Z+RS 30 RS Z+RS 10 RS 25 20 10 G-MR+RS Z+RS RS G-MR+RS 20 15 Z+RS RS 8 10 8 Size (bytes) Time (seconds) Time (seconds) 30 9 G-MR+RS 10 Size (bytes) G-MR+RS 7 10 6 7 10 10 10 6 10 0 0.05 0.1 0.5 1 ε (x10−3) (a) Construction Time. 5 −6 10 −5 10 err −4 10 (b) Construction Time. 0.05 0.1 0.5 1 ε (x10−3) −6 −5 10 10 err (c) Size. −4 10 (d) Size. Figure 2.7. Effect of # and `• error err on G-MR+RS, Z+RS, RS for high precision case. the construction time and observed error for RS are 0. For small # values, G-MR+RS and Z+RS are clever enough to test if random sampling is beneficial, and if not, the random sampling step is bypassed. For the higher precision observed error, the results (Figures 2.7(b), 2.7(d)) clearly demonstrate the superiority of our proposed methods over RS, reducing both construction time and saving orders of magnitude in terms of both query time and size. We cannot achieve higher precision for RS when the observed error is smaller than 10 5, since in those cases, # is small enough that the random sample size is as big as the size of the original data set (i.e., KDE from a random sample consists of the entire original data set, leading to no error). 2.4.4 One Dimension: Parallel and Distributed • Default setup. We now shift our attention in 1d. The default data set is Meme with 50 million records, d = 0.001, and s at 1 day, over 273 days of data. Since GS (grouping selection) is a complementary technique that works with any other method, we focus on RS and SS (sort selection). We only report the results from the parallel and distributed setting. The trends in the centralized setting are similar. • SS vs.RS. By varying the observed error err, Figure 2.8 compares the two methods across different metrics. To achieve smaller observed errors in constructing KDE Q , SS is more efficient as shown in Figure 2.8(a), but RS becomes faster for larger observed errors. A similar trend is observed for the communication cost in Figure 2.8(b). In terms of reducing the size of Q, SS does a much better job than RS as shown in Figure 2.8(c), \| Q\| in SS is 1-4 orders of magnitude smaller than \| Q\| in RS, the gap is particularly large for smaller observed errors. This translates to the query time in Figure 2.8(d), where evaluating KDE Q ( x ) using Q produced by SS is much faster than doing so over Q from RS for most observed errors. When err becomes large, around 10 2, the RS method 35 9 50 40 30 −5 10 SS −3 −1 10 (a) Construction time. SS 10 RS SS RS 1 10 7 10 7 10 5 10 5 10 3 10 3 10 err 2 10 RS Time (seconds) 60 9 10 RS Size (bytes) SS Communication (bytes) Time (seconds) 70 10 −5 10 −3 10 err −1 10 (b) Communication. −1 10 −2 10 1 10 0 10 −3 −5 10 −3 10 err (c) Size. −1 10 10 −5 10 −3 10 err −1 10 (d) Query time. Figure 2.8. Effect of varying measured err on RS, SS on one-dimensional data. catches up, corresponding to the sizes of Q becoming closer. In conclusion, SS in general performs better than or comparable to RS on observed error. But it has much better theoretical guarantees in size and query time as shown in Section 2.2 ( 1# vs. O( #12 log 1d )), which is critical in practice since users generally use a method by setting an # value. CHAPTER 3 L• ERROR AND BANDWIDTH SELECTION FOR KDES In the second chapter, in order to speed up evaluating kernel density estimate, we produce a coreset representation Q of the data which can be used as proxy for the true data P while guaranteeing approximation error on size and runtime. The size of Q depends only on the required error, not on any properties of P; these go beyond just randomly sampling Q from P. Written concretely, given P, and some error parameter # > 0, the goal is to construct a point set Q to ensure L• ( P, Q) = err ( P, Q) = max \| KDE P ( x ) x 2R d KDE Q ( x )\|  #, (3.1) or err ( P, s, Q, w ) if the bandwidths s and w for KDE P and KDE Q are under consideration. This line of work shows that an L• error measure, compared with L1 or L2 error, is a more natural way to assess various properties about kernel density estimates. They assume s is given and w is always the same as s. However, this is not necessarily true. In this chapter, we will investigate how to choose a bandwidth w for KDE Q ( x ) under L• error given P, s, Q. Thus, we empirically study two concrete problems: 1. Given two point sets P, Q ⇢ R d and a kernel K, find the points x = arg minx err ( P, Q). 2. Given two point sets P, Q ⇢ R d , a kernel K, and a bandwidth s, estimate w = arg minw err ( P, s, Q, w ). It should be apparent the first problem is a key subproblem for the second, but it is also quite interesting in its own right. We will observe L• is a strictly stronger measure than L1 or L2 , yet can still be assessed. To the best of our knowledge, we provide the first rigorous empirical study of how to measure this L• error in practice in an efficiently way, following theoretical investigations demonstrating it should be possible. 37 Bandwidth parameter is hugely important in the resulting KDE, and hence, there have been a plethora of proposed approaches [15, 38, 40, 41, 45, 50, 62-64, 73, 76, 82, 89, 117, 118, 121-123, 126, 130, 141, 148] to somehow automatically choose the "correct" value. These typically attempt to minimize the L2 [122, 126] or L1 error [40, 41] (or less commonly, other error measures [89]) between KDE P and some unknown distribution µ that it is assumed P is randomly drawn from. Perhaps unsurprisingly, for such an abstract problem, many different methods can produce wildly different results. In practice, many practitioners choose a bandwidth value in an ad-hoc manner through visual inspection and domain knowledge. In this chapter, we argue that the choice of bandwidth should not be completely uniquely selected. Rather, this value provides a choice of scale at which the data is inspected, and for some data sets, there can be more than one correct choice depending on the goal. We demonstrate this on real and synthetic data in one and two dimensions. As an intuitive one-dimensional example, given temperature data collected from a weather station, there are very obvious modal trends at the scale of 1 day and at the scale of 1 year, and depending on which phenomenon one wishes to study, the bandwidth parameter should be chosen along the corresponding scale, so it is totally reasonable if we assume s for KDE P is given. Via examinations of problem (2), we observe that in some cases (but not all), given P, Q, and s, we can choose a new bandwidth w (with w > s) so that err ( P, s, Q, w ) is significantly smaller than the default err ( P, s, Q, s). This corresponds with fine-grained phenomenon disappearing with less data (as \| Q\| < \| P\|), and has been prognosticated by theory about L2 [126] or L1 [40] error where the optimal bandwidth for KDE Q is a strictly shrinking function of \| Q\|. Yet we urge more caution than this existing bandwidth theory indicates since we only observe this phenomenon in specific data sets with features present at different scales (like the daily/yearly temperature data example in Section 3.4). • Organization. Section 3.1 formalizes and further motivates the problem. Section 3.2 addresses problem (1), and Section 3.3 problem (2). Then Section 3.4 describes detailed experimental validations of our proposed approaches. Finally, Section 3.5 provides some concluding thoughts. 38 3.1 3.1.1 Background and Motivation Unit Kernels or Normalized Kernels? Unit kernels are more natural to estimate the L• errors of kernel density estimates [103, 146] since the range of values are in [0, 1]. For normalized kernels as s varies, the only bound in the range is [0, •). Moreover, unit kernels, under a special case, correspond to the total variation distance of probability measures. In probability theory, the total variation distance for two probability measures P and Q on a sigma-algebra F of subsets of sample space W is defined as: d( P, Q) = sup \| P( A) A2F Q( A)\|. (3.2) Terms P( A), resp. Q( A), refer to the probability restricted to subset A. If we use F as the set of all balls of radius s, so A is one such ball, then P( A) is the fraction of points of P falling in A. Hence P( A) can be viewed as the KDE P,s ( x ) under the ball kernel, where x is the center of ball A. When Q is the coreset of P, then Q( A) is the fraction of points of Q falling in A, so it can be viewed as the KDE Q,s ( x ) under the ball kernel. In this sense, the total variance distance is the L• , specifically err ( P, Q) where K is the ball kernel. The total variation distance can also be mapped to other unit kernels if F can admit weighted subsets, not just subsets. However, normalized kernels are more useful in bandwidth selection. In this case, there is a finite value for s 2 (0, •) which minimizes the L1 or L2 error between KDE P,s and KDE Q,s , whereas for unit kernels, this is minimized for s ! 0. But recall unit and normalized kernels are only different in the scaling coefficient, so given one setting, it is simple to convert to the other without changing the bandwidth. Hence we use both types of kernels in different scenarios: unit kernels for choosing the coresets, and normalized kernel for problem (1) and problem (2). 3.1.2 Why s Is Given? Recall that problem (2) takes as given two point sets P and Q as well as a bandwidth s associated with P, and then tries to find the best bandwidth w for Q so that KDE P,s is close to KDE Q,w . This is different from how the "bandwidth selection problem" is typically posed [40, 126] : a single point set Q is given with no bandwidth, and it is assumed that Q 39 is drawn randomly from an unknown distribution. We break from this formulation for two reasons. First, we often consider the point set Q chosen as a coreset from P, and this may not be randomly from P; in fact, more intricate techniques [103, 149] can obtain the same error with much smaller size sets Q. These non-random samples break most modeling assumptions essential to the existing techniques. Second, the choice of bandwidth may vary largely within the same data set, and these varied choices may each highlight a different aspect of the data. As an extended example, consider temperature data (here we treat a reading of 50 degrees as 50 data points at that time) from a MesoWest weather station KSLC read every hour in all of 2012. This results in 8760 total readings, illustrated in Figure 3.1. For three bandwidth values of 3, 72, and 1440, KDEs are shown to represent daily, weekly, and yearly trends. All are useful representations of the data; there is no "one right bandwidth.". Section 3.4 shows a two-dimensional example of population densities where similarly there are several distinct reasonable choices of bandwidths. 3.1.3 Why L• Error? As mentioned, the most common error measures for comparing KDEs are the L1 or L2 error defined for p = {1, 2} as L p ( P, Q) = k KDE P KDE Q k p = ✓Z x 2R d \| KDE P ( x ) KDE Q ( x )\| p ◆1/p . (3.3) Figure 3.1. KDEs with different bandwidths showing daily, weekly, and yearly trends. Left shows a full year of data, and right shows one week of data. 40 Although this integral can be reasoned about, it is difficult to estimate precisely. Rather many techniques only evaluate at the points P and simply calculate ⇣ 1 \| KDE P ( p) \| P\| qÂ 2P KDE Q ( p )\| p ⌘1/p These average over the domain or P; hence if \| KDE P ( x ) . (3.4) KDE Q ( x )\|  # for all x, then L p ( P, Q) is also at most #. That "for all" bound is precisely what is guaranteed by L• ( P, Q), hence it is a stronger bound. Another reason to study L• error is that it preserves the worst case error. This is particularly important when KDE P ( x ) values above a threshold trigger an alarm. For instance in tracking densities of tweets, too much activity in one location may indicate some event worth investigating. L1 or L2 errors from a baseline may be small, but still have high error in one location either triggering a false alarm, or missing a real event. 3.1.4 Computing Coreset Q of P In this chapter, we define the coreset as a point set Q the same as 1.9, such that for any query point x, evaluating KDE Q ( x ) is much faster than evaluating KDE P ( x ) and KDE Q ( x ) can approximate KDE P ( x ) very well based on L2 [31] and L• [149] [103]. In Chapter 2, we summarized the approaches to finding such a coreset in one and two dimensions. For a point set P with n points, in one dimension, random sampling method can give us a size O((1/#2 ) log(1/d) coreset with construction cost O(n); sortselection method can construct a coreset \| Q\| = O(#) with the cost is O(n log 1# ) if using #-approximation quantiles, or O(n log n) otherwise. In two dimension, random sampling gives the same bound on the sample size and the cost as in one dimension; kernel herding takes O(\| Q\|n) time to construct Q and claims they need O(1/#) steps; min-cost matching with merge reduce framework and random sampling as preprocessing step can be constructed in O(n + 1 #4 log3 1# ) time and \| Q\| = O( 1# log1.5 1# ); both Grid matching and Z-order selection methods with merge reduce framework give us a coreset with \| Q\| = O( 1# log2.5 1# ) in O(n + 1 #2 log 1# ) time by using random sampling as the preprocessing step. In high dimensions, Phillips [103] states that one can construct Q in O(n/#2 ) time of size O((1/#)2d/(d+2) logd/(d+2) (1/#)) with bounded L• error. 41 3.1.5 Related Work on Bandwidth Selection There is a vast literature on bandwidth selection under the L1 [40, 41] or L2 [122, 126] metric. In these settings, Q is drawn, often at random, from a unknown continuous distribution µ (but µ can be evaluated at any single point x). Then the goal is to choose w to minimize kµ and k KDE µ,w KDE Q,w k{1,2} . KDE Q,w k. This can be conceptualized in two steps as kµ KDE µ,w k This first step is minimized as w ! 0 and the second step is minimized as w ! •. Together, there is a value w{1,2} 2 (0, •) that minimizes the overall objective. The most common error measure for w under L2 are Integrated Squared Errors (ISE) R ISE(w ) = x2Rd ( KDE Q,w µ)2 dx and its expected value, the Mean Integrated Squared R Error (MISE) MISE(w ) = EQ⇠µ [ x2Rd ( KDE Q,w µ)2 dx ]. As MISE is not mathematically tractable, often approximations such as the Asymptotic Mean Squared Error (AMISE) or others [126, 130] are used. Cross-validation techniques [15, 45, 62, 117, 118, 123] are used to evaluate various parameters in these approximations. Alternatively, plug-in methods [76, 121,141] recursively build approximations to µ using KDE Q,wi , and then refine the estimate of wi+1 using KDE Q,wi . A series of Bayesian approaches [16,38,50,73,82,148] build on these models and select w using MCMC approaches. An alternative to these L2 approaches is using an L1 measure, like integrated absolute R error (IAE) of KDE Q,w is I AE(w ) = x2Rd \| KDE Q,w µ\|dx, which has simple interpretation of being the area between the two functions. Devroye and Györfi [40] describe several ro- bustness advantages (better tail behavior, transformation invariance) to these approaches. Several of the approximation approaches from L2 can be extended to L1 [64]. Perplexingly, however, the bandwidths generated by these methods can vary quite drastically! In this chapter, we assume that some bandwidth is given to indicate the intended scale, and then we choose a bandwidth for a sparser point set. Hence the methods surveyed above are not directly comparable to our proposed approaches. But we still include the experiment results from some of the above methods to show that different approaches give quite different "optimal" bandwidth, which in another way shows us there are more than one correct bandwidth for some data set. 42 Computing err ( P, Q) Error 3.2 The goal of this section is to calculate err ( P, Q) = max \| KDE P ( x ) x 2R d KDE Q ( x )\|. (3.5) For notational convenience G ( x ) = \| KDE P ( x ) KDE Q ( x )\|. (3.6) We focus on the case where K is a unit Gaussian. Since even calculating maxx2Rd KDE P ( x ) (which is a special case of err ( P, Q) where Q is empty) appears hard, and only constant factor approximations are known [1, 106], we will not calculate err ( P, Q) exactly. Unfortunately, these approximation techniques [1, 106] for maxx2Rd KDE P ( x ) do not easily extend to estimating err ( P, Q). They can focus on dense areas of P, since the maximum must occur there, but in err ( P, Q), these dense areas may perfectly cancel out. These approaches are also much more involved than the strategies we will explore. 3.2.1 Approximation Strategy Towards estimating err ( P, Q), which is optimized over all of R d , our strategy is to generate a finite set X ⇢ R d , and then return err X ( P, Q) = maxx2X G ( x ). Our goal in the generation of X is so that in practice, our returned estimate err X ( P, Q) is close to err ( P, Q), but also so that under this process as \| X \| ! • then formally err X ( P, Q) ! err ( P, Q). We say such a process converges. We formalize this in two steps. First we show that G ( x ) is Lipschitz-continuous, hence a point x̂ 2 R d close to the point x ⇤ = arg maxx2Rd G ( x ) will also have error value close to x ⇤ . Then given this fact, we will need to show that our strategy will, for any radius r, as \| X \| ! • generate a point x̂ 2 X so that k x ⇤ x̂ k  r. This will be aided by the following structural theorem on the location of x ⇤ . (M is illustrated in Figure 3.2.) Theorem 5. For Ks a unit Gaussian kernel, and two point sets P, Q 2 R d , M is the Minkowski sum of a ball of radius s and the convex hull of P [ Q, then x ⇤ = arg maxx2Rd G ( x ) must be in M. We want to prove Theorem 5 in one and two dimensions. For simplicity, we assume Q ⇢ P so P = P [ Q. 43 p1 Figure 3.2. Illustration of the Minkowski sum of a ball of radius s and convex hull of P [ Q. First, we work on the weighted one-dimensional data, and extend to two dimensions using that the cross section of a two-dimensional Gaussian is still a one-dimensional Gaussian. We focus on when P and Q use the same bandwidth s, and a unit kernel Ks . We start to examine two points in one dimension, and without loss of generality, we assume p1 = d and p2 = d for d 0, and that the coreset of P is Q = { p2 }. We assign the weight for p1 as w1 and the weight for p2 as w2 . Plug in P, Q, and the weight for each point, G ( x ) = \| KDE P ( x ) KDE Q ( x )\| G(x) = We assume w1 discuss when x is expanded as following: 1 w1 exp 2 (x d )2 1 w2 exp 2 2s2 ( x + d )2 . 2s2 w2 , the largest error point must be closer to p1 . So we only need to 0, then 12 w1 exp G(x) = 1 w1 exp 2 ( x d )2 2s2 (x ( x + d )2 2s2 1 2 w2 exp d )2 1 w2 exp 2 2s2 , thus ( x + d )2 . 2s2 Lemma 5. For Ks a unit Gaussian kernel, P = { p1 , p2 } and Q = { p2 } where p1 = d and p2 = d, when x 0, function G ( x ) has only one local maximum, which is between d and d + s and G ( x ) is decreasing when x > d. Proof. By taking the derivative of G ( x ), we can get dG ( x ) 1 = w2 exp dx 2 1 w1 exp 2 When 0  x < d, both 12 w2 exp dG ( x ) dx ( x + d )2 2s2 > 0, so G ( x ) is always increasing. ( x + d )2 x + d 2s2 s2 ( x d )2 x d . 2s2 s2 x +d s2 and 1 2 w1 exp ( x d )2 2s2 x d s2 > 0, thus 44 When x = d, 2d2 2d s2 s2 dG ( x ) 1 = w2 exp dx 2 0. To understand x > d we examine the ratio function r(x) = Since both exp 2xd s2 ( x + d )2 2s2 ( x d )2 2s2 1 2 w2 exp 1 2 w1 exp and x +d x d x +d s2 x d s2 = 2xd x + d . s2 x d w2 exp w1 are decreasing and positive, r ( x ) and thus dG ( x ) dx is decreas- ing when x > d. When x = d + s, the ratio function is r (d + s) = w2 2sd + 2d2 s + 2d exp( ) . w1 s2 s We can view the above equation as a function of variable d. r (d) = w2 2sd + 2d2 s + 2d exp( ) , w1 s2 s and take the derivative of r (d): dr (d) = dd With d w2 w1  dG ( x ) dx 0 then dr (d) dd 4d(d + s) w2 2sd + 2d2 exp ( )  0. s3 w1 s2  0, and thus r (d) is a decreasing function which attains maximum 1 when d = 0; thus r (d)  1. So when x = d + s, 0 when x = d and d and d + s making dG ( x ) dx dG ( x ) dx dG ( x ) dx  0. Due to the fact that is decreasing when x > d, there is only one point between = 0. When 0  x < d, then maximum point of G ( x ) between d and d + s when x dG ( x ) dx > 0. There is only one 0. From Lemma 5, we show that the evaluation point having largest error is between d and d + s. Due to the symmetry of p1 and p2 , when w1  w2 , G ( x ) gets its largest error between d and d s. With the results on both sides, we now show the maximum value point of G ( x ) can't be outside s distance of Conv( P). Now we discuss the case for n points in one dimension. Lemma 6. For Ks a unit Gaussian kernel, P has n points and \| Q\| = \| P\|/2, arg maxx2R1 G ( x ) for one-dimensional data is within s distance of Conv( P). 45 Proof. Suppose n = 2k, P = { p1 , p2 , p3 , p4 , ..., p2k 1, p2k }, choose any k points in Q. Then pair any point in Q with any point in P not in Q, so each point in P is in exactly one pair. For simplicity we set Q = { p1 , p3 , ..., p2k 1} and the pairs are { p1 , p2 }, { p3 , p4 }, ..., { p2k 1, p2k }. Suppose e1 = arg maxx2R1 G ( x ) is not within s distance of Conv( P) and p1 is the point closest to e1 . Based on Lemma 5, for P has only two points, function G ( x ) is decreasing as a point outside s moves away from p1 . So if we choose another point e2 infinitesimally closer to p1 , and we set P1 = { p1 , p2 }, Q1 = { p1 } , GP1 ,Q1 (e2 ) has larger value than GP1 ,Q1 (e1 ). Since p1 is the closest point in P, for any other set P2 = { p3 , p4 }, Q2 = { p3 }, e2 is closer to P2 than e1 is to P2 , hence GP2 ,Q2 (e2 ) is also larger than GP2 ,Q2 (e1 ). The same result holds for all pairs { p2i 1, p2i }, where i is from 1 to k. So G (e2 ) > G (e1 ), which contradicts the assumption that e1 = arg maxx2R1 G ( x ). So the largest error evaluation point should be within s distance of Conv( P). In two dimensions we show a similar result. We illustrate the Minkowski sum M of a set of points P with a ball of radius s in Figure 3.2. Theorem 6. For Ks a unit Gaussian kernel, and two point sets P, Q 2 R2 , \| Q\| = \| P\|/2, arg maxx2R2 G ( x ) should be within the Minkowski sum M of a ball of radius s and Conv( P). Proof. Suppose the largest error position e1 = arg maxx2R2 G ( x ) 62 M, then for some direction v, no point in the convex hull of P is closer than s to e1 after both are projected onto v. Then since any cross section of a Gaussian is a one-dimensional Gaussian (with reduced weight), we can now invoke the one-dimensional result in Lemma 6 to show that e1 is not the largest error position along the direction v, thus e1 6= arg maxx2R2 G ( x ). So arg maxx2R2 G ( x ) should be within the Minkowski sum M of a ball of radius s and Conv( P). We will not focus on proving theoretical bounds on the rate of convergence of these processes since they are quite data dependent, but will thoroughly empirically explore this rate in Section 3.4. As \| X \| grows, the max error value in X will consistently approach some error value (the same value for several provably converging approaches), and we can then have some confidence that as these processes plateau, they have successfully estimated err ( P, Q). Our best process WCen6 converges quickly (e.g., \| X \| = 100); it is likely that the maximum error is approximately achieved in many locations. 46 Now, as a basis for formalizing these results, we first show G ( x ) is Lipschitz continuous. Recall a function f : R d ! R is Lipschitz continuous if there exists some constant b such that for any two points x, y 2 R d that \| f ( x ) f (y)\|/k x yk  b. This result follows from the Gaussian kernel (as well as all other kernels except the Ball kernel) also being Lipschitz continuous. Then, since the function f ( x ) = KDE P ( x ) KDE Q ( x ) is a finite weighted sum of Gaussian kernels, each of which is Lipschitz continuous, so is f ( x ). Since taking absolute value does affect Lipschitz continuity, the claim holds. 3.2.2 Generation of Evaluation Points We now consider strategies to generate a set of points X so that err X ( P, Q) is close to err ( P, Q). Recall that M, the Minkowski sum of a ball of radius s with the convex hull of P [ Q, must contain the point x ⇤ which results in err ( P, Q). In practice, it is typically easier to use B, the smallest axis-aligned bounding box that contains M. And for discussion, we assume Q ⇢ P so P = P [ Q. • Rand: Choose each point at random from B. Since x ⇤ 2 M ⇢ B, eventually some point x 2 X will be close enough to x ⇤ , and this process converges. • Orgp: Choose points randomly from P. This process does not converge since the maximum error point may not be in P. However, section 3.4 shows that this process converges to its limit very quickly. So many of the following proposed approaches will attempt to adapt this approach while still converging. • Orgp+N: Choose points randomly from the original point set P then add Gaussian noise with bandwidth s, where s is the bandwidth of K. Since the Gaussian has infinite support, points in X can be anywhere in R d , and will eventually become close enough to x ⇤ . So this process converges. • Grid: Place a uniform grid on B (we assume each grid cell is a square) and choose one point in each grid. For example, in two dimension, if four evaluation points are needed, the grid would be 2 ⇥ 2 and if nine points are needed, it would be 3 ⇥ 3. So with this method, the number of evaluation points is a non-prime integer. Since x ⇤ 2 B, and eventually the grid cell radius is arbitrarily small, then some point x 2 X is close enough to x ⇤ . Thus, this process converges. 47 • Cen{E[m]}: Randomly select one point p1 from the original point set P and randomly choose m neighbor points of p1 within the distance of 3s. m is chosen through a Exponential process with rate 1/E[m]. Then we use the centroid of the selected neighbor points as the evaluation point. This method is inspired by [47], which demonstrates interesting maximums of KDEs at the centroids of the data points. Since P is fixed, the centroid of any combination of points in P is also finite, and the set of these centroids may not include x ⇤ . So, this process does not converge. We next modify it in a way so it does converge. • WCen{E[m]}: Randomly select one point p1 from the original point set P and select the neighbor point pn 2 P as candidate neighbor proportional to exp( \|\| pn p1 \|\|2 ), 2s2 where s is the bandwidth for K. The smaller the distance between pn and p1 , the higher probability it will be the chosen. Repeat to choose m total points including p1 , where again, m is from an Exponential process with rate 1/E[m]. Now refine the m neighbor points so with probability 0.9, it remains the original point pn 2 P, with the remaining probability it is chosen randomly from a ball of radius s centered at pn . Next, we assign each point a random weight in [0, 1] so that all weights add to 1. Then finally, the evaluation point is the weighted centroid of these points. This method retains much of the effectiveness of Cen, but does converge. Without the 0.1 probability rule of being in a ball of radius s around each point, this method can generate any points within the convex hull of P. That 0.1 probability allows it to expand to M, the Minkowski sum of the convex hull of P with a ball of radius s. And since x ⇤ 2 M, by Theorem 5 then this process converges. • Comb: Rand + Orgp: The combination of method Rand and Orgp, of which 20% points generated from B and 80% points generated from original points. The 20% of points from Rand guarantees convergence, but retain most empirical properties of Orgp. This was used before [149] with little discussion. Section 3.4 describes extensive experiments on both synthetic and real data to evaluate these methods. The weighted centroid method WCen{E[m]} with large parameter (e.g., E[m] = 6) works very well for one and two dimensions, and also converges. 48 3.3 Bandwidth Selection In this section, we consider being given two point sets P, Q ⇢ R2 , a kernel K, and a bandwidth s associated with P. We consider K as a normalized Gaussian kernel, and the case where Q is a coreset of P. The goal is to find another bandwidth w to associate with Q so that err ( P, s, Q, w ) is small. 3.3.1 Refining the Bandwidth for Coresets These coresets are constructed assuming that KDE Q uses the same bandwidth s as KDE P . But can we improve this relationship by using a different bandwidth w for Q? The theory for L1 and L2 error (assuming Q is a random sample from P) dictates that as \| Q\| decreases, the bandwidth w should increase. This intuition holds under any error measure since with fewer data points, the KDE should have less resolution. It also matches the L• theoretical error bounds described above. We first reinforce this with a simple two-dimensional example. Consider point set P = [{ P1 , P2 , P3 , P4 } in Figure 3.3(a), the radius of the circle represents the bandwidth s for P. Figure 3.3(b) gives the coreset Q of P: Q = [{ Q1 , Q2 , Q3 , Q4 }, each Qi contains only one black point. Now suppose our evaluation point is point e. If we use the original bandwidth s, KDE Q,s (e) = 0 with ball kernel, but if we use w, which is the radius of a larger circle centered at e, then KDE Q,w (e) > 0, so the error is decreased. But, we do not want w too large, as it would reach the points in other Qi , which is not the case for s in P, so the error would be increased again. Thus, there seems to be a good choice for w > s. But the situation of finding the wopt that minimizes h(w ) = err ( P, s, Q, w ) is more e P1 P3 P2 P4 (a) Original data set. e Q1 Q2 Q3 Q4 (b) Coreset. Figure 3.3. Example with need for larger w for coreset Q. 49 complicated. For each w it is itself a maximization over x 2 R d . There may in fact be more than one local minimum for w in h(w ). However, equipped with the WCen6 procedure to evaluate err ( P, Q), we propose a relatively simple optimization algorithm. We can perform a golden section search over w, using WCen6 to obtain a set X and evaluate err X ( P, s, Q, w ). Such a search procedure requires a convex function for any sort of guarantees, and this property may not hold. However, we show next that h(w ) has some restricted Lipschitz property, so that with random restarts it should be able to find a reasonable local minimum. This is illustrated in Figure 3.4, where the curve that is Lipschitz either has a large, relatively convex region around the global minimum, or has shallow local minimums. The other curve without a Lipschitz property has a very small convex region around the global maximum, and any search procedure will have a hard time finding it. 3.3.2 Lipschitz Properties of h In general, however, h(w ) is not Lipschitz in w. But, we can show it is Lipschitz over a restricted domain, specifically when w > s and when 1/s  A for some absolute constant A. Define y(w, a) = 1 2pw 2 Lemma 7. For any w \| a2 1/p \| A3 . exp( a2 /(2w 2 )). s and 1/s  A, y(w, a) is b-Lipschitz with respect to w, with b = Proof. By taking the first derivative of y(w ), we have dy(w, a) = ( a2 dw 1 )w p 3 exp( a2 /(2w 2 )). And thus C1 C2 opt Figure 3.4. Two curves, one of which is Lipschitz. 50 dy(w, a) = \| a2 dw 1/p \|w 3 exp( a2 /(2w 2 ))  \| a2 1/p \|s 3  \| a2 1/p \| A3 . So, the absolute value of largest slope of function y(w, a) is b = \| a2 is b-Lipschitz continuous on w. s and 1/s  A, h(w ) is b-Lipschitz with respect to w, for b = Theorem 7. For any w 1 \| Q\| Âq2Q \|( x ⇤ q )2 Proof. If KDE P,s ( x ⇤ ) 1/p \| A3 , thus y(w, a) 1/p \| A3 where x ⇤ = arg maxx2R2 \| KDE P,s ( x ) KDE Q,w ( x ⇤ ) KDE Q,w ( x )\|. then h(w ) = \| KDE P,s ( x ⇤ ) = KDE P,s ( x ⇤ ) = KDE P,s ( x ⇤ ) KDE Q,w ( x ⇤ )\| ✓ ◆ 1 1 ( x ⇤ q )2 exp 2 \| Q\| qÂ 2w 2 2 Q 2pw 1 y(w, ( x ⇤ \| Q\| qÂ 2Q q))). Since h(w ) is a linear combination of \| Q\| functions of y(w, a) plus a constant and y(w, a) is Lipschitz continuous, based on the Lemma 7 h(w ) is Lipschitz continuous on w. We can get the same result if KDE P,s ( x ⇤ )  KDE Q,w ( x ⇤ ). In both directions, the first derivative of the function is bounded, so h(w ) is bounded. 3.3.3 Random Golden Section Search From the above properties, we design a search procedure that will be effective in finding the bandwidth w minimizing err ( P, s, Q, w ). The random golden section search is based on the golden section search [79], a technique to find extremum in a strictly unimodal function. To find a minimum, it successively narrows a range [`, r ] with known function values h(`), h(m1 ), h(m2 ), and h(r ) with ` < m1 < m2 < r and with both h(m1 ), h(m2 ) less than h(`) and h(r ). If h(m1 ) < h(m2 ) the new search range is [`, m2 ], otherwise, it is [m1 , r ]. In either case, a new fourth point is chosen according to the golden ratio in such a way that the interval shrinks by a constant factor on each step. However, h(w ) in our case can be a multimodal function, thus golden section search is not guaranteed to work. We apply random restarts as follows. Starting with a range [` = s, r = 10s], we choose one middle point at m = ls for l ⇠ Unif(1, 10). If h(m) > h(r ), we increase r by a factor 10 until it is (e.g., r = 100s). Then the second middle point 51 is chosen using the golden ratio, and the search is run deterministically. We repeat with several random values l. 3.4 Experiments Here we run an extensive set of experiments to validate our techniques. We compare KDE P with kernel density estimate under smaller coreset KDE Q for both synthetic and real data in one and two dimensions. To show our methods work well in large data sets, we use the large synthetic data set (0.5 million) and real data set (1 million) in two dimension. 3.4.1 Data Sets We consider data sets that have different features at various scales, so that as more data are present using a small bandwidth more fine-grain features are brought out, and a larger bandwidth only shows the coarse features. Our real data set in one dimension is the temperature data in Figure 3.1, with default s = 72 (3 days). We use parameter # = 0.02 to generate a coreset Q with the Sort-selection technique [149]. We can simulate such multiscale features more precisely. On a domain [0, 1] we generate P recursively, starting with p1 = 0 and p2 = 1. Next we consider the interval between [ p1 , p2 ] and insert two points at p3 = 2/5 and p4 = 3/5. There are now 3 intervals [ p1 , p3 ], [ p3 , p4 ], and [ p4 , p2 ]. For each interval [ pi , p j ] we recursively insert 2 new points at pi + (2/5) · ( p j pi ) and at pi + (3/5) · ( p j pi ), until \| P\| = 19684. The KDE of this data set with s = 0.01 is shown in Figure 3.5(d), along with that of a coreset Q of size \| Q\| = 100. We construct the two-dimensional synthetic data set in a similar way. The data are in [0, 1]2 starting with four points p1 = (0, 0), p2 = (0, 1), p3 = (1, 0), p4 = (1, 1). We recurse on this rectangle by adding 4 new points in the middle m: the x-coordinates are either at the 0.5-quantile or 0.8-quantile of the x-coordinates, and it is the same for new y-coordinates. These 4 new points create 9 smaller empty rectangles. We further recurse on each of these rectangles until \| P\| = 532900. The KDE P with s = 0.01 is shown in Figure 3.6(a). We use Grid matching [149] to generate a coreset Q with # = 0.1 and size \| Q\| = 1040. Under the original bandwidth s, the KDE Q is shown in Figure 3.6(b); due to a small bandwidth this KDE has many more modes than the original, which motivates the larger bandwidth KDE shown in Figure 3.6(c). For real data with multiple scales in two dimension we consider OpenStreetMap data 52 (a) Centroid method. (b) Weighted centroid method. (c) Compare with all the other methods. (d) KDE P,s and bandwidth. KDE Q,w with original and best Figure 3.5. Results on one-dimensional synthetic data to choose best evaluating point generation techniques. (a) KDE P,s=0.01 . (b) KDE Q,w =0.01 . (c) KDE Q,w =0.013 . Figure 3.6. Visualization of KDE P and KDE Q for two-dimensional synthetic data using different bandwidth. 53 from the state of Iowa. Specifically, we use the longitude and latitude of all highway data points, then rescale so it lies in [0, 1] ⇥ [0, 1]. It was recognized in the early 1900s [133] that agricultural populations, such as Iowa, exhibited population densities at several scales. For ease of experiments, we use the original data of size \| P\| = 1155102 with s = 0.01, and Q as a smaller coreset with # = 0.1 and \| Q\| = 1128. These are illustrated in Figure 3.7. 3.4.2 Evaluating Point Generation for err X To find the best evaluation point generation techniques, we compare the various ways to generate a set X to evaluate err X ( P, Q). The larger numbers are better, so we want to find point sets X so that err X ( P, Q) is maximized with \| X \| small. As most of our methods are random, five evaluation point sets are generated for each method and the average err X ( P, Q) is considered. We start in one dimension, and investigate which parameter of the Cen and WCen methods work best. We will then compare the best in class against the remaining approaches. Recall the parameter E[m] determines the expected number of points (under a Exponential process) chosen to take the centroid or weighted centroid of, respectively. We only show the test result with E[m] from 2 to 7, since the results are similar when E[m] is larger than 7, and the larger the parameter the slower (and less desirable) the process. The results are plotted in Figure 3.5 on the one-dimensional synthetic data. Specifically, Figure 3.5(a) shows the Cen method and Figure 3.5(b) the WCen method. Both methods plateau, for some parameter setting, after around \| X \| = 100, with WCen more robust to parameter choice. In particular, both WCen converges slightly faster but with not much pattern across the choice of parameter. We use Cen6 and WCen6 as representatives. We next compare these approaches directly against each other as well as Rand, Orgp, Orgp+N, Grid, and Comb in Figure 3.5(c). WCen6 appears the best in this experiment, but it has been selected as best WCen technique from random trials. The Rand and Grid techniques which also converge perform well, and are simpler to implement. Similar results are seen on the real one-dimensional data in Figure 3.8. We can take best in class from Cen and WCen parameter choices, shown as Cen6 and WCen6 in Figure 3.8(a) and Figure 3.8(b). These perform well and similar to the simpler Rand, Grid, and Orgp in Figure 3.8(c). Since Rand and Grid also converge, in one dimension we would 54 (a) KDE P,s=0.01 . (b) KDE Q,w =0.01 . (c) KDE Q,w =0.032 . Figure 3.7. Visualization of KDE P and KDE Q for two-dimensional real data using different bandwidth. (a) Centroid method. (c) Compare with all the other methods. (b) Weighted centroid method. (d) KDE P,s and width. KDE Q,w with original and best band- Figure 3.8. Results on one-dimensional real data to choose the best evaluating point generation techniques. 55 recommend one of these simple methods. For two-dimensional data, the techniques perform a bit differently. We again start with the Cen and WCen methods as shown in Figure 3.9 on real and synthetic data. The convergence results are not as good as in one dimension, as expected, and it takes roughly \| X \| = 1000 points to converge. All methods perform roughly the same for various parameter settings, so we use Cen6 and WCen6 as representatives. Comparing against all techniques in Figure 3.9, most techniques perform roughly the same relative to each other, and again WCen6 appears to be a good choice to use. The notable exceptions is that Grid and Rand perform worse in two dimension than in one dimension; likely indicating that the data dependent approaches are more important in this setting. 3.4.3 Choosing New Bandwidth Evaluation We now apply a random golden section search to find new bandwidth values for coresets on one-dimensional and two-dimensional synthetic and real data. In all random trials, we always find the same local minimum, and report this value. We will see that a value w > s can often give better error results, both visually and empirically, by smoothing out the noise from the smaller coresets. Figures 3.10(a) and 3.10(b) show evaluation of err X ( P, s, Q, w ) for various w values chosen while running the random golden section search on synthetic and real one-dimensional data. In both cases, setting w = s (as w = 0.01 and w = 72, respectively) gives roughly twice as much error as using an omega roughly twice as large (w = 0.017 and w = 142, respectively). We can see even more dramatic results in the two-dimensional data in Figure 3.11. We observe in Figure 3.11(a) on synthetic data that with the original w = s = 0.01 that the error is roughly 3.6, but by choosing w = 0.013 we can reduce the error to roughly 2.7. This is also shown visually in Figure 3.6 where a small coreset Q is chosen, and in Figure 3.6(b) the large-scale pattern in KDE P,s is replaced by many isolated points; KDE Q,w =0.013 increases the bandwidth and the desired visual pattern re-emerges. On real data, a similar pattern is seen in Figure 3.11(b). The original w = s = 0.01 has error roughly 3.0, and an w = 0.032 (more than 3 times larger) gives error about 1.1. This extra smoothing is illustrated in Figure 3.7. 56 (a) Centroid methods. (b) Centroid method. (c) Weighted centroid method. (d) Weighted centroid method. (e) All methods. (f) All methods. Figure 3.9. Choosing the best evaluation set X for on two-dimensional synthetic (left) and real (right) data. 57 (a) Synthetic data (b) Real data Figure 3.10. w ⇤ = arg minw err X ( P, s, Q, w ) in R1 . (a) Synthetic data (b) Real data Figure 3.11. w ⇤ = arg minw expX ( P, s, Q, w ) in R2 . Thus, we see that it is indeed useful to increase the bandwidth of kernel density estimates on a coreset, even though theoretical bounds already hold for using the same bandwidth. We show that doing so can decrease the error by a factor of 2 or more. Since we consider w = s, and only decrease the error in the process, we can claim the same theoretical bounds for the new w value. It is an open question of whether one can prove tighter coreset bounds by adapting the bandwidth value. 58 3.4.4 Compare with the Traditional Bandwidth Selection Method In this section, we include some bandwidth selection results for the two-dimensional synthetic and real data using the traditional bandwidth selection method surveyed in Section 3.1.5, including biased cross-validation (BCV) bandwidth selection method, leastsquares cross-validation (LSCV) bandwidth selection method, plug-in (PI) bandwidth selection method and smoothed cross-validation (SCV) bandwidth selection method. We use the kernel smoothing R package(ks), which was originally introduced by Duong (2007) [44] and improved in 2014. In the experiment, our data set is normalized and we assume data in each dimension are independent and share the same bandwidth, so we use larger value from the main diagonal of bandwidth matrix computed from the R package. For the two-dimensional synthetic data set, we use the same coreset with \| Q\| = 1040, the bandwidth using the above four methods are wBCV = 0.0085, w LSCV = 0.024, w PI = 0.0036, wSCV = 0.0043. For the two-dimensional real data set, with the coreset Q = 1128, the bandwidth chosen from the four methods are w BCV = 0.0078, w LSCV = 0.0003, w PI = 0.0029, wSCV = 0.004. The corresponding error trends compared to our method for these two data sets are shown in Figure 3.12, where wOPT denotes the optimal bandwidth from our method. Both of these figures show our method can achieve the smallest error comparing to the baseline methods and all the baseline methods tend to give smaller bandwidth and not stable for different data sets. (a) Synthetic data Figure 3.12. w ⇤ = arg minw expX ( P, s, Q, w ) in R2 . (b) Real data 59 3.5 Conclusion This chapter considers evaluating kernel density estimates under L• error, and how to use these criteria to select the bandwidth of a coreset. Since L• error is stronger than the more traditional L1 or L2 errors, it provides approximation guarantees for all points in the domain, and aligns with recent theoretical results [103] of kernel range space, it is worth rigorously investigating. We propose several methods to efficiently evaluate the L• error between two kernel density estimates and provide a convergence guarantee. The method Grid works well, and is very simple to implement in R1 . In R2 , methods that adapt more to the data perform much better, and our technique WCen is shown to be accurate and efficient on real and synthetic data. We then use these technique to select a new bandwidth value for coresets that can improve the error by a factor of 2 to 3. We demonstrate this both visually and empirically on real and synthetic data sets. CHAPTER 4 GENERALIZED KERNEL RANGE SPACE AND (#, t )-NET This chapter considers traditional sample complexity problems but adapts to when the range space (or function space) smoothes out its boundary. This is important in various scenarios where either the data points or the measuring function are noisy. Similar problems have been considered in specific contexts of functions classes with a [0, 1] range or kernel density estimates. We extend and generalize these results, motivated by scenarios such as the following. (S1) Consider maintaining a random sample of noisy spatial data points (say twitter users with geo-coordinates), and we want this sample to include a witness to every large enough event. However, because the data coordinates are noisy, we use a kernel density estimate to represent the density. And moreover, we do not want to consider regions with a single or constant number of data points which only occur due to random variations. In this scenario, how many samples do we need to maintain? (S2) Next, consider a large approximate (say high-dimensional image feature [2]) data set, where we want to build a linear classifier. Because the features are approximate (say due to feature hashing techniques), we model the classifier boundary to be randomly shifted using Gaussian noise. How many samples from this data set do we need to obtain a desired generalization bound? (S3) Finally, consider one of these scenarios in which we are trying to create an informative subset of the enormous full data set, but have the opportunity to do so in ways more intelligent than randomly sampling. On such a reduced data set one may want to train several types of classifiers, or to estimate the density of various subsets. Can we generate a smaller data set compared to what would be required by random sampling? 61 The traditional way to study related sample complexity problems is through range spaces (a ground set X, and family of subsets A) and their associated dimension (e.g., VCdimension [135]). We focus on a smooth extension of range spaces defined on a geometric ground set. Specifically, consider the ground set P to be a subset of points in R d , and let A describe subsets defined by some geometric objects, for instance, a halfspace or a ball. Points p 2 R d that are inside the object (e.g., halfspace or ball) are typically assigned a value 1, and those outside a value 0. In our smoothed setting, points near the boundary are given a value between 0 and 1, instead of discretely switching from 0 to 1. In learning theory, these smooth range spaces can be characterized by more general notions called P-dimension [108] (or Pseudo-dimension) or V-dimension [134] (or "fat" versions of these [3]) and can be used to learn real-valued functions for regression or density estimation, respectively. In geometry and data structures, these smoothed range spaces are of interest in studying noisy data. Our work extends some recent work [78,103] which examines a special case of our setting that maps to kernel density estimates, and matches or improves on related bounds for non-smoothed versions. We next summarize the main contributions in this chapter. • We define a general class of smoothed range spaces (Section 4.2.1), with application to density estimation and noisy agnostic learning, and we show that these can inherit sample complexity results from linked non-smooth range spaces (Corollary 4.3.1). • We define an (#, t )-net for a smoothed range space (Section 4.2.3). We show how this can inherit sampling complexity bounds from linked non-smooth range spaces (Theorem 9), and we relate this to non-agnostic density estimation and hitting set problems. • We provide discrepancy-based bounds and constructions for #-samples on smooth range spaces requiring significantly fewer points than uniform sampling approaches (Theorems 11 and 12), and also smaller than discrepancy-based bounds on the linked binary range spaces. 62 4.1 Definitions and Background Recall that we will focus on geometric range spaces ( P, A) where the ground set P ⇢ R d and the family of ranges A are defined by geometric objects. It is common to approximate a range space in one of two ways, as an #-sample (aka #-approximation) or an #-net as we defined in Chapter 1. Given a range space ( P, A) where \| P\| = m, then pA (m) describes the maximum number of possible distinct subsets of P defined by some A 2 A. If we can bound, pA (m)  Cmn for absolute constant C, then ( P, A) is said to have shatter dimension n. For instance, the shatter dimension of H halfspaces in R d is d, and for B balls in R d is d + 1. For a range space with shatter dimension n, a random sample of size O((1/#2 )(n + log(1/d))) is an #-sample with probability at least 1 d [85, 135], and a random sample of size O((n/#) log(1/#d)) is an #-net with probability at least 1 d [69, 98]. An #-sample Q is sufficient for agnostic learning with generalization error #, where the best classifier might misclassify some points. An #-net Q is sufficient for non-agnostic learning with generalization error #, where the best classifier is assumed to have no error on P. The size bounds can be made deterministic and slightly improved for certain cases. An #-sample Q can be made of size O(1/#2n/(n+1) ) [91] and this bound can be no smaller [92] in the general case. For balls B in R d which have shatter-dimension n = d + 1, this can be improved to O(1/#2d/(d+1) logd/(d+1) (1/#)) [10, 92], and the best-known lower bound is O(1/#2d/(d+1) ). For axis-aligned rectangles R in R d which have shatter-dimension n = 2d, this can be improved to O((1/#) logd+1/2 (1/#)) [84]. For #-nets, the general bound of O((n/#) log(1/#)) can also be made deterministic [91], and for halfspaces in R4 the size must be at least W((1/#) log(1/#)) [99]. But for halfspaces in R3 the size can be O(1/#) [67, 93], which is tight. By a simple lifting, this also applies for balls in R2 . For other range spaces, such as axis-aligned rectangles in R2 , the size bound is Q((1/#) log log(1/#)) [5, 99]. 4.1.1 Kernels A kernel is a bivariate similarity function K : R d ⇥ R d ! R + , which can be normalized so K ( x, x ) = 1 (which we assume through this chapter). 63 A kernel range space [78, 103] ( P, K) is an extension of the combinatorial concept of a range space ( P, A) (or to distinguish it we refer to the classic notion as a binary range space). It is defined by a point set P ⇢ R d and a kernel K. An element Kx of K is a kernel K ( x, ·) applied at point x 2 R d ; it assigns a value in [0, 1] to each point p 2 P as K ( x, p). If we use a ball kernel, then each value is exactly {0, 1} and we recover exactly the notion of a binary range space for geometric ranges defined by balls. A binary range space ( P, A) is linked to a kernel range space ( P, K) if the set { p 2 P \| K ( x, p) t } is equal to P \ A for some A 2 A, for any threshold value t. [78] showed that an #-sample of a linked range space ( P, A) is also an #-kernel sample of a corresponding kernel range space ( P, K). Since all range spaces defined by symmetric, shift-invariant kernels are linked to range spaces defined by balls, they inherit all #-sample bounds, including that random samples of size O((1/#2 )(d + log(1/d)) provide an #-kernel sample with probability at least 1 d. Then [103] showed that these bounds can be imp proved through discrepancy-based methods to O(((1/#) log(1/#d))2d/(d+2) ), which is p O((1/#) log(1/#d)) in R2 . A more general concept has been studied in learning theory on real-valued functions, where a function f as a member of a function class F describes a mapping from R d to [0, 1] (or more generally R). A kernel range space where the linked binary range space has bounded shatter-dimension n is said to have bounded V-dimension [134] (see [3]) of n. Given a ground set X, then for ( X, F ) this describes the largest subset Y of X which can be shattered in the following sense. Choose any value s 2 [0, 1] for all points y 2 Y, and then for each subset of Z ⇢ Y, there exists a function f 2 F so f (y) > s if y 2 Z and f (y) < s if y 2 / Z. The best sample complexity bounds for ensuring Q is an #-sample of P based on V-dimension are derived from a more general sort of dimension (called a P-dimension [108] where in the shattering definition, each y may have a distinct s(y) value) requires \| Q\| = O((1/#2 )(n + log(1/d))) [85]. As we will see, these V-dimension-based results are also general enough to apply to the to-be-defined smooth range spaces. 4.2 New Definitions In this chapter, we extend the notion of kernel range spaces to other smoothed range spaces that are "linked" with common range spaces, e.g., halfspaces. These inherit the 64 construction bounds through the linking result of [78], and we show cases where these bounds can also be improved. We also extend the notion of #-nets to kernels and smoothed range spaces, and show linking results for these as well. 4.2.1 Smoothed Range Spaces Here, we will define the primary smoothed combinatorial object that we will examine, starting with halfspaces, and then generalizing. Let Hw denote the family of smoothed halfspaces with width parameter w, and let ( P, Hw ) be the associated smoothed range space where P ⇢ R d . Given a point p 2 P, then smoothed halfspace h 2 Hw maps p to a value vh ( p) 2 [0, 1] (rather than the traditional {0, 1} in a binary range space). We first describe a specific mapping to the function value vh ( p) that will be sufficient for the development of most of our techniques. Let F be the (d 1)-flat defining the boundary of halfspace h. Given a point p 2 R d , let p F = arg minq2 F k p closest to p. Now we define vh,w ( p) = 8 > >1 > > <1 + 2 1 > > >2 > :0 1 k p pF k 2 w 1 k p pF k 2 w qk describe the point on F p 2 h and k p pF k p2 / h and k p p2 / h and k p pF k < w p F k w. p 2 h and k p w pF k < w These points within a slab of width 2w surrounding F can take on a value between 0 and 1, where points outside of this slab revert back to the binary values of either 0 or 1. We can make this more general using a shift-invariant kernel k (k p where k w (k p follows. x k) = k (k p x k) = K ( p, x ), x k/w) allows us to parameterize by w. Define vh,w ( p) as vh,w ( p) = ( 1 1 2 k w (k p 1 2 k w (k p p F k) p F k) p2h p2 / h. For brevity, we will omit the w and just use vh ( p) when clear. These definitions are equivalent when using the triangle kernel. But, for instance, we could also use a Epanechnikov kernel or Gaussian kernel. Although the Gaussian kernel does not satisfy the restriction that only points in the width 2w slab take non {0, 1} values, we can use techniques from [103] to extend to this case as well. This is illustrated in Figure 4.1. Another property held by this definition that we will exploit is the slope V of these kernels is bounded by V = O(1/w) = c/w, for some constant c; the constant c = 1/2 for triangle and Gaussian, and c = 1 for Epanechnikov. 65 G F w w p1F 0 p3 p3F w w p1G p1 1 0 p3 p3G p1 2w 1 p2G p2 p2F p2 k p pF k 3w/2 3w/4 w/2 p1 p2 p3 ?( p 2 h) TRUE TRUE FALSE vh ( p) 1 7/8 1/4 2w Figure 4.1. Illustration of the smoothed halfspace, and smoothed polynomial surface, with function value of three points { p1 , p2 , p3 } defined using a triangle kernel. Finally, we can further generalize this by replacing the flat F at the boundary of h with a polynomial surface G. The point pG = arg minq2G k p qk replaces p F in the above definitions. Then the slab of width 2w is replaced with a curved volume in R d ; see Figure 4.1. For instance, if G defines a circle in R d , then vh defines a disc of value 1, then an annulus of width 2w where the function value decreases to 0. Alternatively, if G is a single point, then we essentially recover the kernel range space, except that the maximum height is 1/2 instead of 1. We will prove the key structural results for polynomial curves in Section 4.4, but otherwise focus on halfspaces to keep the discussion cleaner. The most challenging elements of our results are all contained in the case with F as a (d 4.2.2 1)-flat. #-Sample in a Smoothed Range Space It will be convenient to extend the notion of a kernel density estimate to these smoothed range space. A smoothed density estimate SDE P is defined for any h 2 Hw as SDE P ( h ) = 1 \| P\| Â v h ( p ). (4.1) p2 P An #-sample Q of a smoothed range space ( P, Hw ) is a subset Q ⇢ P such that max \| SDE P (h) h 2 Hw SDE Q ( h )\| (4.2)  #. Given such an #-sample Q, we can then consider a subset H̄w of Hw with bounded integral (perhaps restricted to some domain like a unit cube that contains all of the data P). If we can learn the smooth range ĥ = arg maxh2H̄w SDE Q (h), then we know SDE P (h⇤ ) SDE Q ( ĥ )  #, where h⇤ = arg maxh2H̄w SDE P (h), since SDE Q (ĥ) SDE Q ( h⇤ ) SDE P ( h⇤ ) #. Thus, such a set Q allows us to learn these more general density estimates with generalization error #. 66 We can also learn smoothed classifiers, like scenario (S2) in the introduction, with generalization error #, by giving points in the negative class a weight of 1; this requires separate (#/2)-samples for the negative and positive classes. 4.2.3 (#, t )-Net in a Smoothed Range Space We now generalize the definition of an #-net. Recall that it is a subset Q ⇢ P such that Q "hits" all large enough ranges (\| P \ A\|/\| P\| #). However, the notion of "hitting" is now less well-defined since a point q 2 Q may be in a range but with value very close to 0; if a smoothed range space is defined with a Gaussian or other kernel with infinite support, any point q will have a nonzero value for all ranges! Hence, we need to introduce another parameter t 2 (0, #), to make the notion of hitting more interesting in this case. A subset Q ⇢ P is an (#, t )-net of smoothed range space ( P, Hw ) if for any smoothed range, h 2 Hw such that SDE P (h) #, then there exists a point q 2 Q such that vh (q) t. The notion of #-net is closely related to that of hitting sets. A hitting set of a binary range space ( P, A) is a subset Q ⇢ P so every A 2 A (not just the large enough ones) contains some q 2 Q. To extend these notions to the smoothed setting, we again need an extra parameter t 2 (0, #), and also need to only consider large enough smoothed ranges, since there are now an infinite number even if P is finite. A subset Q ⇢ P is an (#, t )-hitting set of smoothed range space ( P, Hw ) if for any h 2 Hw such that SDE P (h) #, then SDE Q (h) t. In the binary range space setting, an #-net Q of a range space ( P, A) is sufficient to learn the best classifier on P with generalization error # in the non-agnostic learning setting, that is assuming a perfect classifier exists on P from A. In the density estimation setting, there is not a notion of a perfect classifier, but if we assume some other properties of the data, the (#, t )-net will be sufficient to recover them. For instance, consider (like scenario (S1) in the introduction) that P is a discrete distribution so for some "event" points p 2 P, there is at least an #-fraction of the probability distribution describing P at p (e.g., there are more than #\| P\| points very close to p). In this setting, we can recover the location of these points since they will have probability at least t in the (#, t )-net Q. 67 4.3 Linking and Properties of (#, t )-Nets First we establish some basic connections between #-sample, (#, t )-net, and (#, t )-hitting set in smoothed range spaces. In binary range spaces, an #-sample Q is also an #-net, and a hitting set is also an #-net; we show a similar result here up to the covering constant t. Lemma 8. For a smoothed range space ( P, Hw ) and 0 < t < # < 1, an (#, t )-hitting set Q is also an (#, t )-net of ( P, Hw ). Proof. The (#, t )-hitting set property establishes for all h 2 Hw with SDE P (h) t. Since SDE Q (h) = SDE Q ( h ) 1 \| Q\| #, then also Âq2Q vh (q) is the average over all points q 2 Q, then it implies that at least one point also satisfies vh (q) t. Thus Q is also an (#, t )-net. In the other direction, an (#, t )-net is not necessarily an (#, t )-hitting set since the (#, t )net Q may satisfy a smoothed range h 2 Hw with a single point q 2 Q such that vh (q) t, but all others q0 2 Q \ {q} having vh (q0 ) ⌧ t, and thus SDE Q (h) < t. Theorem 8. For 0 < t < # < 1, an (# t )-sample Q in smoothed range space ( P, Hw ) is an (#, t )-hitting set in ( P, Hw ), and thus also an (#, t )-net of ( P, Hw ). Proof. Since Q is the (# t )-sample in the smoothed range space, for any smoothed range h 2 Hw we have \| SDE P (h) SDE Q ( h )\| separately. If SDE P (h) #, when SDE P (h) SDE Q ( h ) SDE Q ( h ), SDE P ( h ) And more simply, when SDE Q (h) t. We consider the upper and lower bound # (# we have t) SDE P ( h ) # (# and SDE P (h) t ) = t. # t, then SDE Q (h) t. Thus, in both situations, Q is an (#, t )-hitting set of ( P, Hw ). And then by Lemma 8, Q is also an (#, t )-net of ( P, Hw ). 4.3.1 Relations between Smoothed Range Spaces and Linked Binary Range Spaces Consider a smoothed range space ( P, Hw ), and for one smoothed range h 2 Hw , examine the range boundary F (e.g. a (d 1)-flat, or polynomial surface) along with a symmetric, shift invariant kernel K that describes vh . The superlevel set (vh )t is all points x 2 R d such that vh ( x ) t. Then recall a smoothed range space ( P, Hw ) is linked to a 68 binary range space ( P, A) if every set { p 2 P \| vh ( p) t } for any h 2 Hw and any t > 0, is exactly the same as some range A \ P for A 2 A. For smoothed range spaces defined by halfspaces, then the linked binary range space is also defined by halfspaces. For smoothed range spaces defined by points, mapping to kernel range spaces, then the linked binary range spaces are defined by balls. Joshi et.al. [78] established that given a kernel range space ( P, K), a linked binary range space ( P, A), and an #-sample Q of ( P, A), then Q is also an #-kernel sample of ( P, K). An inspection of the proof reveals the same property holds directly for smoothed range spaces, as the only structural property needed is that all points p 2 P, as well as all points q 2 Q, can be sorted in decreasing function value K ( p, x ), where x is the center of the kernel. For smoothed range space, this can be replaced with sorting by vh ( p). Corollary 4.3.1 ( [78]). Consider a smoothed range space ( P, Hw ), a linked binary range space ( P, A), and an #-sample Q of ( P, A) with # 2 (0, 1). Then Q is an #-sample of ( P, Hw ). We now establish a similar relationship to (#, t )-nets of smoothed range spaces from (# t )-nets of linked binary range spaces. Theorem 9. Consider a smoothed range space ( P, Hw ), a linked binary range space ( P, A), and an (# t )-net Q of ( P, A) for 0 < t < # < 1. Then Q is an (#, t )-net of ( P, Hw ). Proof. Let \| P\| = n. Then, since Q is an (# \| P \ A\| (# t )n, then Q \ A 6= ∆. Suppose h 2 Hw has SDE P (h) A 2 A be the range such that (# t )-net of ( P, A), for any range A 2 A, if # and we want to establish that SDE Q (h) t )n points with largest vh ( pi ) values are exactly the points in A. We now partition P into three parts (1) let P1 be the (# largest vh values, (2) let y be the point in P with (# be the remaining n n(# t. Let t )n 1 points with t )nth largest vh value, and (3) let P2 t ) points. Thus, for every p1 2 P1 and every p2 2 P2 we have vh ( p2 )  vh (y)  vh ( p1 )  1. Now using our assumption n · SDE P (h) n · SDE P (h) = Â p1 2 P1 n# we can decompose the sum v h ( p1 ) + v h ( y ) + Â p2 2 P2 v h ( p2 ) and hence using upper bounds vh ( p1 )  1 and vh ( p2 )  vh (y), n#, 69 vh (y) n# n# Â v h ( p1 ) (n(# t) p1 2 P1 Â p2 2 P2 1) · 1 v h ( p2 ) (n n(# t ))vh (y). t) nt = t. n Solving for vh (y) we obtain vh (y) n nt + 1 n(# t ) + 1 n nt n(# Since ( P, A) is linked to ( P, Hw ), there exists a range A 2 A that includes precisely P1 [ y (or more points with the same vh (y) value as y). Because Q is an (# t )-net of ( P, A), Q contains at least one of these points, lets call it q. Since all of these points have function value vh ( p) vh (y) t, then vh (q) t. Hence, Q is also an (#, t )-net of ( P, Hw ), as desired. This implies that if t  c# for any constant c < 1, then creating an (#, t )-net of a smoothed range space, with a known linked binary range space, reduces to computing an #-net for the linked binary range space. For instance, any linked binary range space with shatter-dimension n has an #-net of size O( n# log 1# ), including halfspaces in R d with n = d and balls in R d with n = d + 1; hence there exists (#, #/2)-nets of the same size. For halfspaces in R2 or R3 (linked to smoothed halfspaces) and balls in R2 (linked to kernels), the size can be reduced to O(1/#) [67, 93, 111]. 4.4 Min-Cost Matchings within Cubes Before we proceed with our construction for smaller #-samples for smoothed range spaces, we need to prepare some structural results about min-cost matchings. Following some basic ideas from [103], these matchings will be used for discrepancy bounds on smoothed range spaces in Section 4.5. In particular, we analyze some properties of the interaction of a min-cost matching M and some basic shapes ( [103] considered only balls). Let P ⇢ R d be a set of 2n points. A matching M ( P) is a decomposition of P into n pairs { pi , qi } where pi , qi 2 P and each pi (and qi ) is in exactly one pair. A min-cost matching is the matching M that minimizes cost1 ( M, P) = Âin=1 k pi qi k. The min-cost matching can be computed in O(n3 ) time by [48] (using an extension of the Hungarian algorithm from the bipartite case). In R2 , it can be calculated in O(n3/2 log5 n) time [136]. 70 Following [103], again we will base our analysis on a result of [11] which says that if P ⇢ [0, 1]d (a unit cube) then for d a constant, costd ( M, P) = Âin=1 k pi qi kd = O(1), where M is the min-cost matching. We make no attempt to optimize constants, and assume d is constant. One simple consequence, is that if P is contained in a d-dimensional cube of side length `, then costd ( M, P) = Âin=1 k pi qi kd = O(`d ). We are now interested in interactions with a matching M for P in a d-dimensional cube of side length ` C`,d (call this shape an (`, d)-cube), and more general objects; in particular Cw a (w, d)-cube and, Sw a slab of width 2w, both restricted to be within C`,d . Now for such an object Ow (which will either be Cw or Sw ) and an edge { p, q} where line segment pq intersects Ow define point p B (resp. q B ) as the point on segment pq inside Ow closest to p (resp. q). Note if p (resp. q) is inside O then p B = p (resp. q B = q), otherwise, it is on the boundary of Ow . For instance, see C20w in Figure 4.2. Define the length of a matching M restricted to an object Ow ⇢ R d as Â r(Ow , M ) = (q,p)2 M n min (2w)d , k p B o q B kd . Note this differs from a similar definition by [103] since that case did not need to consider when both p and q were both outside of Ow , and did not need the min{(2w)d , . . .} term because all objects had diameter 2. q q qB qB w 20w pB pB p Figure 4.2. (T3) edges. 20w p 71 Lemma 9. Let P ⇢ C`,d , where d is constant, and M be its min-cost matching. For any (w, d)-cube Cw ⇢ C`,d we have r(Cw , M) = O(wd ). Proof. We cannot simply apply the result of [11] since we do not restrict that P ⇢ Cw . We need to consider cases where either p or q or both are outside of Cw . As such, we have three types of edges we consider, based on a cube C20w of side length 20w, and with center the same as Cw . (T1) Both endpoints are within C20w of edge length at most p d20w. (T2) One endpoint is in Cw , the other is outside C20w . (T3) Both endpoints are outside C20w . For all (T1) edges, the result of Bern and Eppstein can directly bound their contribution to r(Cw , M) as O(wd ) (scale to a unit cube, and rescale). For all (T2) edges, we can also bound their contribution to r(Cw , M) as O(wd ), by extending an analysis of [103] when both Cw and C20w are similarly proportioned balls. This analysis shows there are O(1) such edges. We now consider the case of (T3) edges, restricting to those that also intersect Cw . We argue there can be at most O(1) of them. In particular, consider two such edges { p, q} and { p0 , q0 }, and their mappings to the boundary of C20w as p B , q B , p0B , q0B ; see Figure 4.2. If k pB p0B k  10w and kq B min-cost matching since k p q0B k  10w, then we argue next that this cannot be part of a p0 k + kq q0 k < k p qk + k p0 q0 k, and it would be better to swap the pairing. Then it follows from the straight-forward net argument below that there can be at most O(1) such pairs. We first observe that k p B k p0B k pB p0B k + kq B q0B k  10w + 10w < 20w + 20w  k p B q0B k. Now we can obtain our desired inequality using that k p q B k + kq B p0B k + k p0B qk (and similar for k p0 q0 k) and that k p p0 k by triangle inequality (and similar for kq q0 k). qk = k p p0 k  k p qB k + pB k + pB k + k pB Next, we describe the net argument that there can be at most O(d2 · 22d ) = O(1) such pairs with k p B p0B k > 10w and kq B q0B k > 10w. First place a 5w-net N f on each (d 1)- dimensional face f of C20w so that any point x 2 f is within 5w of some point h 2 N f . We can construct N f of size O(2d ) with a simple grid. Then let N = S f N f as the union 72 of the nets on each face; its size is O(d · 2d ). Now for any point p 2 / C20w let h ( p) = h k be the closest point in N to p B . If two points p and p0 have h ( p) = arg minh 2N k p B p0 k  10w. Hence, there can be at most O((d · 2d )2 ) edges with { p, q} h ( p0 ) then k p mapping to unique h ( p) and h (q) if no other edge { p0 , q0 } has k p B kq B q0B k  10w. p0B k  10w and Concluding, there can be at most O(d2 · 22d ) = O(1) edges in M of type (T3), and the sum of their contribution to r(Cw , M) is at most O(wd ), completing the proof. Lemma 10. Let P ⇢ C`,d , where d is constant, and let M be its min-cost matching. For any width 2w slab Sw restricted to C`,d we have r(Sw , M ) = O(`d Proof. We can cover the slab Sw with O((`/w)d 1) 1 w ). (w, d)-cubes. To make this concrete, we cover C`,d with d`/wed cubes on a regular grid. Then in at least one basis direction (the one closest to orthogonal to the normal of F), any column of cubes can intersect Sw in at most 4 cubes. Since there are d`/wed 1 such columns, the bound holds. Let Cw be the set of these cubes covering Sw . Restricted to any one such cube Cw , the contribution of those edges to r(Sw , M) is at most O(wd ) by Lemma 9. Now we need to argue that we can just sum the effect of all covering cubes. The concern is that an edge goes through many cubes, only contributing a small amount to each r(Cw , M) term, but when the total length is taken to the dth power it is much more. However, since each edge's contribution is capped at (2w)2 , we can say that if any edge goes through more than O(1) cubes, its length must be at least w, and its contribution in one such cube is already W(w), so we can simply inflate the effect of each cube towards r(Sw , M) by a constant. In particular, consider any edge pq that has p 2 Cw . Each cube has 3d 1 neighboring cubes, including through vertex incidence. Thus, if edge pq passes through more than 3d cubes, q must be in a cube that is not one of Cw0 's neighbors. Thus, it must have length at least w; and hence its length in at least one cube Cw0 must be at least w/3d , with its 2 contribution to r(Cw0 , M) > wd /(3d ). Thus, we can multiply the effect of each edge in 2 r(Cw , M ) by 3d 2d = O(1) and be sure it is at least as large as the effect of that edge in r(Sw , M). Hence 73 2 r ( Sw , M )  3d 2d  O (1) Â r(Cw , M ) Â O(wd ) Cw 2Cw Cw 2Cw = O((`/w)d = O(`d 1 1 ) · O(wd ) w ). We can apply the same decomposition as used to prove Lemma 10 to also prove a result for a w-expanded volume Gw around a degree g polynomial surface G. A degree g polynomial surface can intersect a line at most g times, so for some C`,d the expanded surface Gw \ C`,d can be intersected by O( g(`/w)d 1) (w, d)-cubes. Hence, we can achieve the following bound. Corollary 4.4.1. Let P ⇢ C`,d , where d is constant, and let M be its min-cost matching. For any volume Gw defined by a polynomial surface of degree g expanded by a width w, restricted to C`,d we have r( Gw , M ) = O( g`d 4.5 1 w ). Constructing #-Samples for Smoothed Range Spaces In this section, we build on the ideas from [103] and the new min-cost matching results in Section 4.4 to produce new discrepancy-based #-sample bounds for smoothed range spaces. The basic construction is as follows. We create a min-cost matching M on P, then for each pair ( p, q) 2 M, we retain one of the two points at random, halving the point set. We repeat this until we reach our desired size. This should not be unfamiliar to readers familiar with discrepancy-based techniques for creating #-samples of binary range spaces [28, 92]. In that literature similar methods exist for creating matchings "with low-crossing number". Each such matching formulation is specific to the particular combinatorial range space one is concerned with. However, in the case of smoothed range spaces, we show that the min-cost matching approach is a universal algorithm. It means that an #-sample Q for one smoothed range space ( P, Hw ) is also an #-sample for any other 0 ), perhaps up to some constant factors. We also show how smoothed range space ( P, Hw these bounds can sometimes improve upon #-sample bounds derived from linked range spaces; herein the parameter w will play a critical role. 74 4.5.1 Discrepancy for Smoothed Halfspaces To simplify arguments, we first consider P ⇢ R2 extending to R d in Section 4.5.4. Let c : P ! { 1, +1} be a coloring of P, and define the discrepancy of ( P, Hw ) with coloring c as discc ( P, Hw ) = maxh2Hw \| Â p2 P c( p)vh ( p)\|. Restricted to one smoothed range h 2 Hw this is discc ( P, h) = \| Â p2 P c( p)vh ( p)\|. We construct a coloring c using the min-cost matching M of P; for each { pi , qi } 2 M we randomly select one of pi or qi to have c( pi ) = +1, and the other c(qi ) = 1. We next establish bounds on the discrepancy of this coloring for a V-bounded smoothed range space ( P, Hw ), i.e., where the gradient of vh is bounded by V  c1 /w for a constant c1 (see Section 4.2.1). For any smoothed range h 2 Hw , for each pair { p j , q j } in the matching M, we can now define a random variable X j = c( p j )vh ( p j ) + c(q j )vh (q j ). This allows us to rewrite discc ( P, h) = \| Â j X j \|. We can also define a variable D j = 2\|vh ( p j ) vh (q j )\| such that X j 2 { D j /2, D j /2}. Now, following the key insight from [103], we can bound Â j D2j using results from Section 4.4, which shows up in the following Chernoff bound from [43]: Let { X1 , X2 , . . .} be independent random variables with E[ X j ] = 0 and X j = { D j /2, D j /2} then h Pr discc ( P, h) i a = Pr h Â Xj j i a  2 exp 2a2 Â j D2j ! . (4.3) Lemma 11. Assume P ⇢ R2 is contained in some cube C`,2 , and with min-cost matching M defining c, and consider a V-bounded smoothed halfspace h 2 Hw associated with slab Sw . Let  r(Sw , M)  c2 (`w) for constant c2 (see definition of r in Section 4.4). Then Pr discc ( P, h) > q p C w` log(2/d)  d for any d > 0 and constant C = c1 2c2 . Proof. Using the gradient of vh is at most V = c1 /w and \|vh ( p j ) q j k} we have Â D2j = Â 4(vh ( p j ) j j vh (q j )\|  V max{2w, k p j vh (q j ))2  4V2 r(Sw , M)  4c21 /w2 · c2 `w = 4c21 c2 `/w, where the second inequality follows by Lemma 10 which shows that r(Sw , M) = Â j max {(2w)2 , k p j q j k2 }  c2 (`w). We now study the random variable discc ( P, h) = \| Âi Xi \| for a single h 2 Hw . Invoking p (4.3) we can bound Pr[discc ( P, h) > a]  2 exp( a2 /(2c21 c2 `/w)). Setting C = c1 2c2 and 75 a=C q ` w 4.5.2  log(2/d) reveals Pr discc ( P, h) > C q ` w log(2/d)  d. From a Single Smoothed Halfspace to a Smoothed Range Space The above theorems imply small discrepancy for a single smoothed halfspace h 2 Hw , but this does not yet imply small discrepancy discc ( P, Hw ), for all choices of smoothed halfspaces simultaneously. And in a smoothed range space, the family Hw is not finite, since even if the same set of points have vh ( p) = 1, vh ( p) = 0, or are in the slab Sw , infinitesimal changes of h will change SDE P (h). So in order to bound discc ( P, Hw ), we will show that there are polynomial in n number of smoothed halfspaces that need to be considered, and then apply a union bound across this set. The proof is deferred to the full version. Theorem 10. For P ⇢ R2 of size n, for Hw , and value Y(n, d) = O can choose a coloring c such that Pr[discc ( P, Hw ) > Y(n, d)]  d. 4.5.3 q ` w log nd ) for d > 0, we #-Samples for Smoothed Halfspaces To transform this discrepancy algorithm to #-samples, let f (n) = discc ( P, Hw )/n be the value of # in the #-samples generated by a single coloring of a set of size n. Solving q ` for n in terms of #, the sample size is s(#) = O( 1# w` log w#d ). We can then apply the MergeReduce framework [29]; iteratively apply this random coloring in O(log n) rounds on disjoint subsets of size O(s(#)). Using a generalized analysis (c.f., Theorem 3.1 in [102]), we have the same #-sample size bound. Theorem 11. For P ⇢ C`,2 ⇢ R2 , with probability at least 1 q 1 ` ( P, Hw ) of size O( # w` log w#d ). d, we can construct an #-sample of To see that these bounds make rough sense, consider a random point set P in a unit square. Then setting w = 1/n will yield roughly O(1) points in the slab (and should p p roughly revert to the non-smoothed setting); this leads to discc ( P, Hw ) = O( n log(n/d)) p and an #-sample of size O((1/#2 ) log(1/#d)), basically the random sampling bound. But setting w = # so about #n points are in the slab (the same amount of error we allow in p p an #-sample) yields discc ( P, Hw ) = O((1/ #n) · log(n/d)) and the size of the #-sample 76 to be O( 1# p log(1/#d)), which is a large improvement over O(1/#4/3 ), and the best bound known for non-smoothed range spaces [92]. 4.5.4 Generalization to d Dimensions We now extend from R2 to R d for d > 2. Using results from Section 4.4 we implicitly get a bound on Â j Ddj , but the Chernoff bound we use requires a bound on Â j D2j . As in [103], we can attain a weaker bound using Jensen's inequality over at most n terms !d/2 !2/d 1 2 1 ⇣ 2 ⌘d/2 Â Dj so Â D2j  n1 2/d Â Ddj . Â n Dj n j j j j Replacing this bound and using r(Sw , M )  O(`d 1 w) (4.4) in Lemma 11 and considering V = c1 /w for some constant c1 results in the next lemma. Its proof is deferred to the full version. Lemma 12. Assume P ⇢ R d is contained in some cube C`,d and with min-cost matching M, and consider a V-bounded smoothed halfspace h 2 Hw associated with slab Sw . Let r(Sw , M)  p ⇥ ⇤ c2 (`d 1 w) for constant c2 . Then Pr discc ( P, h) > Cn1/2 1/d (`/w)1 1/d log(2/d)  d for p any d > 0 and C = 2c1 (c2 )1/d . For all choices of smoothed halfspaces, applying the union bound, the discrepancy is p increased by a log n factor, with the following probabilistic guarantee, Pr[discc ( P, Hw ) > Cn 1/2 1/d (`/w) 1 1/d q log(n/d)]  d. Ultimately, we can extend Theorem 11 to the following. Theorem 12. For P ⇢ C`,d ⇢ R d , where d is constant, with probability at least 1 d, we can ⇣ ⇣ q ⌘2d/(d+2) ⌘ ` 2 ( d 1 ) / ( d + 2 ) construct an #-sample of ( P, Hw ) of size O (`/w) · 1# log w#d . Note this result addresses scenario (S3) from the introduction where we want to find a small set (the #-sample) so that it could be much smaller than the d/#2 random sampling bound, and allows generalization error O(#) for agnostic learning as described in Section 4.2.2. When `/w is constant, the exponents on 1/# are also better than those for binary ranges spaces (see Section 4.1). CHAPTER 5 CORESETS FOR KERNEL REGRESSION 5.1 5.1.1 Basic Definitions Coresets for Kernel Regression The brute force solution of kernel regression is time consuming as each computation calculates KDE and WKDE, which takes O(\| P\|) time. In this chapter, we show how to scalably apply kernel regression to massive scalar-valued data sets. The main idea is to approximate P with a coreset S where Sx ⇢ Px , and in some cases, this can be relaxed, but each s 2 S can potentially be given a scalar value sy different from the associated original point. In particular, the coreset S should act as a proxy for P, so that for any q the value KR S ( q ) should approximate KR P (q). The coreset S should be substantially smaller than P, while also preserving the strong approximation guarantees. Any query to KR S takes time at most proportional to \|S\| instead of \| P\|, so the size of S directly impacts the efficiency of interacting with KR S . Moreover, if the construction of S is efficient (and ours is roughly as fast as reading the data, or sorting if needed), then the time to compute m values of the kernel regression (common for say visualization) is also reduced by \| P\|/\|S\|, after factoring the build time. Here are a list of scenarios where such coresets are essential. • The data are too big to store. For example, Square Kilometer Array, the world's largest radio telescope, receives several terabytes of data per second. Most of the data are in scalar values, such as baseline-corrected power flux density, sensitivity, and receiver temperature, so kernel regression is a good way to track those scalar values over time. However storing all of these data is a challenging problem, let alone analyzing them. Instead of storing all of it, a coreset for kernel regression would keep relevant data that provably behaves like the original data, but needs much less space. 78 • The data are large and the older parts requires less accuracy. For instance, in analyzing trends in system log data, we want more accurate recent data, but allow more imprecision in historical data. For example, in the CloudLab [115] central power database, power data serve as a way to monitor the cloud performance. They have scalar values and change gradually overtime but may have noisy fluctuations. Kernel regression is a good way to track this, and older data can be kept with less precision. • These data are for interactive analysis. To interact with very large data stored on disk one can first analyze a small coreset, and then refine to a larger coresets with more accuracy as more precision is needed; this is much more efficient than bringing all relevant data to disk for each query. Instead, we can maintain several layers of different sized coresets. For instance, in spatial data systems, such as Mesowest [72], temperature is connected with each geo-coordinate, to show temperature across the United States, a coarse level coreset is sufficient. But to zoom the map to see the temperature at the state or city level, then a more detailed coreset is required. 5.1.2 Our Approach To formalize the meaning of KR S (q) is "close" to KR P (q), we focus on worst-case error guarantees (L• as opposed to L2 or L1 more common to KDEs); this ensures we do not have any spurious regression values. This is essential for data analysis, since we want to be able to find important trends and detect outlier points, and also not be fooled into thinking we observe a nonexistent trend or a nonexistent outlier. Beyond that, the error function should not be affected by either a shift or a scaling of a scalar value, since this is equivalent to changing the units (e.g., Celsius to Fahrenheit). As such a natural bound will be absolute error difference with the bound depending on some quantity that depends on the scaling. We will use M = max p,p0 2 P \| py p0y \|, the maximum difference between scalar values, so as the scale of the units on py changes, M does at the same rate. In particular, we are interested in a coreset S of a data set P such that for some domain U ⇢ R d that max \| KR P (q) q 2U KR S ( q )\|  #M. 79 Our coresets S have size depending linearly on 1/# and sometimes D = max p,p0 2 P k p x p0x k/s. It is worth noting that in setting U = R d , such a result may not be possible. The kernel regression definition KR P (q) has in its denominator KDE P (q), so when KDE P (q) is very close to 0, then KR P (q) is very unstable. So, we consider a domain U which is defined by a mild condition on KDE P (q); in particular that KDE P (q) is above some very small value r. To further put this error bound in perspective, consider the relative error maxq2U KR S ( q ) KR P ( q ) instead. This is unstable whenever KR P (q) is close to 0. And, furthermore, the q where KR P ( q ) is small, depends entirely on the units chosen for the py values. For instance, we could have py = 32 Fahrenheit (not be close to 0) or py = 0 Celsius, which is exactly 0 and makes any relative error requirement imply no error at all. Since the change of units is meaningless, this error measure is not feasible. 5.1.3 Our Results Our results focus mainly on Px ⇢ R1 and Px ⇢ R2 (so the x-coordinate(s) naturally represent time or spatial coordinates), although many aspects extend naturally to high dimensions. We first bound the accuracy of a kernel regression coreset formed by random sampling; these are the first known bounds for the sample complexity of kernel regression. It is of particular interest since in many cases the data set provided on input is itself a random sample from some much larger data set or distribution we do not have access to (e.g., a 1% stream from Twitter). So if the input data are indeed sampled, our bounds measure the error present before any analysis is applied. However, random sampling performs poorly compared to most other methods we consider, so it makes sense to further compress them. We analyze (theoretically and empirically) several straight-forward aggregation techniques to construct coresets. These are of particular interest since they mimic common online aggregation techniques [18]. We also propose some modifications which demonstrate sizable empirically improvements. Interestingly, effective coresets for KDEs [149], do not perform the best for kernel regression. In particular, our recommended method G-Aggregate for Px ⇢ R1 , carefully aggre- gates data over a fixed size nonempty grid cells; it takes O(\| P\|) after sorting the data. For 80 Algorithm 1: Z-order (Z) 1: Sort data Px in Z-order; set h = \| P \| / \| S \| 2: Choose a random number in r = [0, h 1] 3: for i 1 to \|S\| do 4: Put Pr+h·(i 1) into S 5: return S Algorithm 2: Z-Aggregate (ZA) 1: Sort data Px in Z-order; set h = \| P \| / \| S \| 2: for i 1 to \|S\| do 3: Pi = [ Ph·(i 1) , · · · , Ph·i ] 4: Put average of all the points in Pi into S 5: return S Px ⇢ R2 , we recommend Aggregate-Neighbor, which carefully adds a few points to the coreset from G-Aggregate. For a data sets Px ⇢ R d , these both produce a coreset S of size O(D/#r)d , where D = max p,p0 2 P k p x KDE Px ( q ) > r that \| KR P (q) KR S ( q )\| p0x k/s, and guarantees for any q 2 R d with  #M, where M = max p,p0 2 P \| py p0y \|. Moreover, these methods are simple to implement and work extremely well on real and synthetic data sets. 5.1.4 Related Work This is the first work to address sample complexity and coreset size for NadarayaWatson kernel regression. There is an enormous body of work on other types of coresets, see the recent survey on coresets [104], including many for parametric regression variants like least-square regression [14] and l p regression [37]. The only nonparametric regression coreset we are aware of is a form of kernel regression [143] related to the smallest enclosing ball. It predicts the value at a point q 2 R d as f (q) = b + Â p2 P a p K ( p x , q) with loss function Â p2 P max{0, \| f ( p x ) py \| #̄}, for a parameter #̄. Then it finds a set of O(1/#) nonzero a p parameters (corresponding with points in the coreset) so many points satisfy \| f ( p x ) were attempted. py \|  #̄(1 + #). No implementations Rather, we believe the most related work involves coresets for kernel density estimates [9, 78, 103, 149] as mentioned above. We extend some of these results and show 81 others do not work well when translated to the regression variant of this problem. 5.2 Subset Selection Methods We next describe several natural approaches to compress scalar-valued spatial data. Some of these are likely in use in existing data aggregation frameworks (e.g., RFF [18]), but as far as we know have not been analyzed in how they preserve kernel regression values. • Random sampling (RS): This method simply draws a uniform random sample S from the data set P. This is probably the most common data aggregation method anywhere. In other cases, it is often assumed that even before aggregating data, the data is only a random sample of some unseen larger "true" data set. This is known to approximate kernel density estimates [9, 58, 78], and we will show extends to kernel regression. Algorithm 3: G-Aggregate (GA) 1: Map Px into grid Gg 2: for g 2 Gg ( P ) do 3: Put average of all the points in Pg into S 4: return S Algorithm 4: Aggregate-Neighbor (AN) 1: Map Px into grid Gg 2: for g 2 Gg ( P ) do 3: Put average of all the points in Pg into S 4: for g 2 Gg ( P̄ ) adjacent to Gg ( P ) do 5: For center c of g, put (c, KR P (c)) in S 6: return S • k-Center (kCen): This method creates a k-center clustering of Px using the greedy Gonzalez algorithms [53]; that finds a set of k center points which (with a factor 2) minimizes the distance to the furthest data point. This is inspired by both a recent way to approximate the kernel mean (equivalent to the KDE) [36] and also the initial step in (improved) fast Gauss transforms [146]. It takes O(kn) time to find the center set, and then data points can be aggregated to the closest center in as much time. 82 • Sorting-based approaches: For Px ⇢ R1 , these methods just sort the points, and choose S as evenly spaced points in the sorted order. Inspired by the KDE coreset algorithm in Chapter 1, we can extend to higher dimensions using the Z-order space-filling curve to implicitly define a single ordering over the data points which attempts to preserve spatial locality. Hence, we refer to it as Z-order (Z). Also, inspired by this approach we take a random point from each block in the sorted order instead of the first or last of each block deterministically. As an extension, we propose Z-Aggregate (ZA) which is more careful on how it represents each interval. It again sorts the x-coordinate(s) of P by Z-order, and then for a set of consecutive points Pi of size h, [ h(i sy = 1 \| Pi \| 1), hi ) for i = 1, 2, . . . , k, choose s x = Â p2 Pi py as the ith point in S. 1 \| Pi \| Â p2 Pi p x and • Grid-based approaches: Define a grid Gg into square grid cells (intervals for Px ⇢ R) of side length g. It will be convenient to designate Gg ( P) as the nonempty grid cells, and Gg ( P̄) as the empty grid cells. For a grid g, define Pg ⇢ P = { p 2 P \| p x 2 g}, the points which fall in g. In the basic method Grid (G), for each g 2 Gg ( P), randomly place one point from Pg into S, and give it a weight \| Pg \|. In an extension G-Aggregate (GA), for each g 2 Gg ( P) we create a new point to place in S as (s x , sy ) defined s x = 1 \| Pg \| Â p2 Pg p x and sy = 1 \| Pg \| Â p2 Pg py . The above algorithms can be subtly further improved by adding extra points in the empty grids with nonempty grids as neighbors, we call this Aggregate-Neighbor (AN). Specifically, these empty but adjacent cells generate a point at the cell center c with value equal to the kernel regression value KR P (c). This takes a bit longer than just performing an aggregate, but these empty but adjacent cells are few so the time burden is negligible. This is inspired by the work [22] and the illustrative toy example in Figure 5.1. We will see the improvement is especially significant for Px ⇢ R2 . 5.2.1 Progressive Grid-Based Approaches In many scenarios, Px ⇢ R and this coordinate represents time. Let the current time t NOW : x = 0, and so all other values are negative (say 5 hours ago is x = 5). In these settings, we might only examine windows of the data over x 2 [ T, 0], that is including now, and up to T time units into the past. Further, we can assume over any view we 83 Figure 5.1. Example improvement of Aggregate-Neighbor (right) over G-Aggregate (left). Input P = {(1 100), (2 40), (3 0), (15 50), (16 50), (17 50)}. With G-Aggregate with g = 2, the coreset Q = {(1.5 70), (3 0), (15.5 50), (17 50)}. The largest errors occur at x < 0 and around x = 9. If we add the extra points at the empty grid cells with nonempty neighbor grids, i.e., Aggregate-Neighbor, then L• is significantly reduced. We add three points ( 1 98.3124), (5 3.2559), and (13 50). would set the bandwidth s so that D = max p,p0 2 P k p x p0x k/s = T/s is upper bounded; otherwise, the smoothing is below the resolution of the what can fit in a view window (its too noisy). For these scenarios, we design a progressive approach where we allow more errors for older data points. Extending the grid-based approaches, as data becomes older (new points arrive) we increase the grid resolution g, and further compress the data. Specifically, we divide P into regions R1 , . . . , Rr so the resolution gi used in region Ri is gi = ai 1g , 1 where a is a constant (we use a = 1.5 in our experiments). Setting the width of region width( Ri ) = ai 1 width( R 1) ensures that there are the same number of grid cells in each region. Then, for a fixed resolution in the first region, the size of the coreset will grow only logarithmically with time. 5.3 Analysis We start by providing some structural lemmas that relate approximations of kernel density estimates and weighted kernel density estimates to kernel regression. Then we will use these results to bound the accuracy of specific techniques. Our goal in each case is to show that the coreset S approximates the full data set P in the following sense for parameters r, # 2 (0, 1). For any q 2 R d such that KDE P (q) > r, 84 then \| KR P (q) where M = max p,p0 2 P \| py KR S ( q )\|  #M, p0y \|. Then call S an (r, #)-coreset of P. We believe such strong worst case bounds should be surprising. If we revisit Figure 5.1 we can observe that removing one point can cause error in KR P (q) KR S ( q ) on the order of M (in this case M/4). • Structural results: We need a few definitions and previous results before we can begin stating our new structural tools. Recall in Chapter 4, a data set X and a family of subsets A define a range space ( X, A), and the range space's VC-dimension (informally) describes the combinatorial complexity of the ranges. A relative (r, #)-approximation of ( X, A) is a set Y \| A \ X\| max \|X\| A 2A \| A \ Y\|  # max \|Y \| ⇢ \| A \ X\| ,r . \|X\| Similarly, define a relative (r, #)-approximation of ( P, K ) for kernel K as a set S such that max \| KDE P ( x ) x 2R d KDE S ( x )\|  # max{ KDE P ( x ), r}. Define a (r, #)-approximation for kernel regression of P as a set S such that KDE P (q) \| KR S (q) where M = max p,p0 2 P \| py ( X, A), so KR P ( q )\| r, then  #M, p0y \|. Define a (non-relative) #-approximation Y of a range space max A 2A \| A \ X\| \|X\| \| A \ Y\|  #. \|Y \| It is know an #-approximation can be constructed, with probability at least 1 d via a random sample S of size O((1/#2 )(n + log 1/d)), and a relative (r, #)-approximation with size O((1/r#2 )(n log(1/r) + log(1/d))) [66, 85]. Given an #-approximation S of a range space linked to K, then it is known [78] that it is also a (non-relative) #-approximation of ( P, K ). In the Appendix, we generalize this linking result (roughly following the structure of the proof in [78]) to relative (r, #)-approximations. Theorem 13. For any kernel K : R d ⇥ R d ! R + linked to a range space (R d , A), a relative (r, #)- approximation S of ( P, A) is a (rK + , 2#)-approximation of ( P, K ), where K + = max p,q2 P K ( p, q). 85 Next we provide a sufficient condition for (r, #)-approximation for kernel regression. Lemma 13. For error parameters a, b, r > 0, with a  1/2, consider a point set P ⇢ R d+1 . Let S be a coreset of P so that for any query point q 2 R d , both \| KDE P (q) \| WKDE P (q) KDE S ( q )\| WKDE S ( q )\| Then for any q 2 R d such that KDE P (q) < a max{ KDE P (q), r} < bM. r, then \| KR S (q) KR P ( q )\|  4(a + b/r) M. Proof. Change the units of py so all values lie between 1 and 2. The shifting of these values does not change the approximation factor bM, but the rescaling of the range changes the bound to \| WKDE P (q) WKDE S ( q )\|  b, and also ensures 1  KR P ( q ) values have no bearing on KDE P (q).  2. And recall, py By using the Gaussian kernel we have KDE S (q) > 0 and also 0  WKDE P (q)  2. Thus, we can consider relative error bounds, using KDE P (q) > r, and hence also WKDE P (q) > r. KR S ( q ) KR P ( q ) = WKDE S ( q ) KDE P ( q ) WKDE P ( q ) KDE S ( q ) ✓ b 1 WKDE P ( q ) b =1 ◆✓ b r 1 ◆ a ab + 1 + a (1 + a) WKDE P (q) WKDE P ( q ) 1 a 1+a a Next, we see the relative error bound is slightly different in the other direction. KR S ( q ) KR P ( q ) =  WKDE S ( q ) KDE P ( q ) WKDE P ( q ) KDE S ( q ) ✓ 1+ = 1+ b WKDE P ( q ) b ◆✓ + 1+ a 1 + a (1 b a ab  1+ + + r 1 a (1 a ) r b a = 1+ + (1 a ) r 1 a 2b  1+ + 2a r WKDE P ( q ) 1 a a ◆ ab a) WKDE P (q) 86 Together these imply KR S ( q ) KR P ( q ) following additive error \| KR P (q) 2 [1 KR S ( q )\| b/r a, 1 + 2b/r + 2a]. This translates to the  2( b/r + a) KR P (q)  2( b/r + a)2M = 4(a + b/r) M. We also need another property about the slope of the Gaussian kernel. This is the only bound specific to the Gaussian kernel, so for any other kernels with a similar bound (e.g., Triangle, Epanechnikov) the remaining analysis and algorithms can apply. Lemma 14. A unit Gaussian kernel K ( x ) = exp( x2 /2s2 ) is 1/s-Lipschitz. dK ( x ) x2 dx = exp( 2s2 )( d2 K ( x ) = 0. We get x = dx2 Proof. By taking the first derivative of K with respect to x, we have Take the second derivative and thus \| dKdx(x) \| d2 K ( x ) dx2 = exp( 2 x2 )( sx 4 2s2 1 ) s2 and set has the maximum values on x = ±s, equals to exp( 1 1 2 )( s ) unit Gaussian kernel is 1/s-Lipschitz. 5.3.1 x ). s2 ±s,  1/s. So a Accuracy of Random Sampling We start by analyzing how kernel regression is preserved under random sampling. In many cases the "input" data to a problem should actually be modeled as a random sample of some much larger set, or it may be done as a first pass on data to reduce its complexity. The key structural result will be on sampling weighted sets. Lemma 15. For a weighted point set ( P, w) with P ⇢ R d of size n, then a random sample of points Q ⇢ P of size s = O((1/#2 )(d + log(1/d))), with probability at least 1 any B 2 B where M = max p2X w( p) 1 n Â w( p) p2 P\ B 1 s Â p2 Q\ B d, satisfies for w( p)  #M, min p2X w( p). Proof. First assume max p2X w( p) = 1 and that min p2X w( p) = 0; then M = 1. Otherwise, we can simply "change the units" by uniformly shifting and scaling all w values to reach this scenario. 87 We first consider ( X, w) as a point set P ⇢ R d+1 , where the y-coordinate is w( p). Then, we consider the range space ( P, A) where A defines the set of subsets induced by ranges which are balls in the first d coordinates, and an interval in the y-coordinate; we refer to them as hypercylinders. The range space has VC-dimension O(d). For a given query B 2 B on X (B is the ball in R d on the x-coordinates), we are interested hypercylinders A 2 A so that the x-coordinates are restricted to those in our query choice of B. In fact, we can break the hypercylinder R, which implicitly has a y-interval of [0, 1], up into h = c/# disjoint hypercylinders (design constant c so that c/# is an integer), each with the same ball B in x-coordinates and a y width of #/c. Let Pi be the set P restricted to ith such y interval. We can round all values within the interval to a value vi = i · (c/#), incurring at most #/c error. Then if each ith piece's sample Qi is off in count by ai and \| Âi ai \|  #n/2, then we can say the total error is at most \| P\|#/c + n#/2. Setting c 2, ensures the total as is at most #n as desired. However, individually bounding each ai to be small is hard. If there are few points in one of the levels, then we get a poor estimate on the count in Qi using standard techniques. Instead we can bound Âi ai in aggregate. By the definition of #-samples, if Q is an #/2j sample of ( P, A), then \| Âr=i ai \|  #/2 · n for all i, j 2 [1, h ]. And this holds by our random sample with probability at least 1 d. Now we can write the total error from ( P, w) to ( Q, w) in an ball B 2 B as 1 n Â h w( p) = p2 P\ I 1 Â w( p) n iÂ =1 p 2 P \ B i h  1 Â (vi + #/c) n iÂ =1 p 2 P \ B i h # 1 + vi \| Pi \ B\| c n iÂ =1 h ⇣ ⌘ # 1 n  + Â vi ai + \| Qi \ B \| c n i =1 s = h = h # 1 1 + Â vi ai + Â Â vi c n i =1 s 1=1 p 2 Q \ B i h   h # 1 1 + Â ai + Â Â (w( p) + #/c) c n i =1 s i =1 p 2 Q \ B i 2# # 1 + + c 2 s Â p2 Q\ B w ( p ). 88 Setting c 4, and repeating the argument symmetrically to show the lower bound, we obtain that for any B 2 B 1 n Â w( p) p2 P\ B 1 s Â p2 Q\ B w( p)  #. These results generalize to weighted kernel density estimates, for centrally-symmetric and non-increasing (as function of distance from center) kernels, following [78]. The only change is using the weighted bound in Lemma 15, in place of where Joshi et.al. used the unweighted bound in the definition of a ball-range space linked with the aforementioned kernels. Theorem 14. Consider any kernel K : R d ⇥ R d ! R + linked to (R d , B). For a weighted point set ( X, w) with X ⇢ R d , then a random sample Q ⇢ X of size s = O((1/#2 )(d + log(1/d))), with probability at least 1 d, for any x 2 R d satisfies \| WKDE X,w ( x ) where M = max p2X w( p) WKDE Q,w ( x )\|  #M, min p2X w( p). Now we are ready to show the main result. Theorem 15. Consider a point set P ⇢ R d+1 of arbitrary size, and parameters r, # 2 (0, 1). If S is a uniform sample from P of size O( #21r2 (d log(1/r) + log(2/d))), then with probability at least 1 d, the set S is a (r, #)-approximation for kernel regression on P. Proof. For a binary range space (such as ( P, B)) with constant VC-dimension [135] n, a random sample S of size k = O( (#01)2 r (n log(1/r) + log(2/d))) provides an (r, #0 )-sample with probability at least 1 d/2 [66, 85]. Theorem 13 gives a linking result for kernel density estimate, implying that this is also a relative (r, 2#0 )-coreset for a kernel where K ( x, x ) = 1. This satisfies the first condition of Lemma 13 with a = 2#0 . Second, we invoke Theorem 14 so that we have with probability at least 1 \| WKDE P (q) with b = #0 r. WKDE S ( q )\| d/2 that  (#0 r) M, hence satisfying the second condition of Lemma 13 89 Setting #0 = #/16 invoking Lemma 13, then with probability at least 1 1 d, for any q 2 R d that \| KR S (q) 5.3.2 KR P ( q )\| (d/2 + d/2) =  4(#/8 + (#r/16)/r) M  #M. Accuracy of Grid-Based Approaches We first bound the error in Grid. This implies other related algorithms (G-Aggregate, Aggregate-Neighbor) will have the same asymptotic error bounds for d constant. Theorem 16. Grid with g = #sr p 8 d R d +1 . produces S, a (r, #)-coreset for the kernel regression of P ⇢ Proof. We will prove bounds on both error in KDE P and WKDE P separately, then combine them with Lemma 13. This algorithm maps all points Pg for a grid cell g 2 Gg to a p single point, and by reweighting, changes each points location by at most g d. Using p that K is (1/s)-Lipschitz, this changes KDE P by at most g d/s in KDE S . Only considering KDE P ( q ) r, then \| KDE P (q) KDE S ( q )\|  p g d rs max{r, KDE P (q)}. For WKDE P the analysis is similar, but we may also replace py for a point p 2 Pg with a different sy . We can bound \| py sy \|  M = max p,p0 2 P \| py p \| WKDE P (q) WKDE S (q)\|  g dM/s. p0y \|. Hence for all q 2 R d , then Combining these two bounds together with Lemma 13 we obtain (for q with KDE P (q) r) that \| KR P (q) KR S ( q )\| p  4( grsd + p g d s /r ) M We bound coreset size with D = max p,p0 2 P k p x = 4( 8# + #r 8 /r ) M = #M. p0x k/s. Corollary 5.3.1. For P ⇢ R d+1 for constant d, methods Grid, G-Aggregate, and Aggregate- Neighbor, run in O(\| P\|) time, and return S a (r, #)-coreset for kernel regression of P of size at most O((D/#r)d ). • Accuracy of progressive grid-based methods If the size width( R1 ) of the first region in the progressive methods is a constant, there are at most O(log D) regions. Set g = #sr/8 · ai 1, so each region has a grid with O(1/#r) cells. Corollary 5.3.2. For P ⇢ R2 , under any allowable view window of size T and scaling so T/s is fixed, then the progressive Grid approach achieves an (r, #)-coreset for kernel regression of P of size at most O((1/#r) log D). 90 • Accuracy bounds for other methods Despite bounds for \| KDE P (q) KDE S ( q )\| for other methods (e.g., Z-order) we are not able to show these for WKDE P and, hence KR P . 5.4 Experiments Here we run an extensive set of experiments to validate our methods. We compare KR P where Px ⇢ R1 , Px ⇢ R2 , and Px ⇢ R6 with kernel regression under smaller coreset KR S for both synthetic and real data. To show our methods work well in large data sets, we use large real data set (n = 2 million and 24 million) and synthetic data (n = 1 million) for Px ⇢ R1 , and real data set (n = 1 million) for Px ⇢ R2 . Our algorithms scale well beyond these sizes, but evaluating error was prohibitive. 5.4.1 Data Sets For real data, we consider "Individual Household Electric Power Consumption" data set on UCI Machine Learning Repository. The number of instances is 2,075,259, we use the first three attributes to do kernel regression. Date, time (together for x-value), and global active power (for y-value): household global minute-averaged active power (in kilowatt). This data set has gaps on the x-axis, and kernel regression does a nice job of interpolating those gaps. To demonstrate the effectiveness of progressive grids, we use a "CloudLab" data set. CloudLab [115] is cloud computing platform, and we have obtained a trace of power usage from the Utah site with 400 million values. We use the most recent 10-month window which has size 24,351,363. The time series synthetic data Px ⇢ R1 is generated using formula: yi = c + fyi 1 + N (0, s), where the x-coordinates are i = 1, 2, · · · and yi is the corresponding y coordinates. It mimics a stock price so the next data depends on the previous one plus some random noise. In the experiment, we set c = 0, f = 1, y0 = 10, s = 1 and generate 1 million points. The original data and regression based on the first 10,000 points with bandwidth 50 and 200 is shown on Figure 1.2. For Px ⇢ R2 real data set, we consider OpenStreetMap data from the state of Iowa. Specifically, we use the longitude and latitude of all highway data points as Px and time stamp as Py . Kernel regression on this data set can give a good approximation of when the highway data point is added. 91 For the high-dimensional experiment, we consider two data sets. One is house price data set (CAD) in StatLib [97] and the other is Physicochemical Properties of Protein Tertiary Structure Data set (CASP) from UCI machine learning repository. CAD data set contains 20,640 observations on housing prices with 9 economic covariates and CASP data set has 45,730 data points for 10 random variables. For both data sets, we use the first 6 features to do the kernel regression. 5.4.2 Effectiveness of Coresets Coresets guarantee that kernel regression error is bounded for all values of q 2 R (as long as the data are not too sparse). But evaluating at all of these points, is by definition, impossible. As a result, we evaluate over a very fine covering of evaluation points (in our case 128,000 for Px ⇢ R1 and 512,000 for Px ⇢ R2 ). We have plotted error as the number of evaluation points increase and observed that all methods clearly converge well before this many samples. In more detail, we randomly generate a evaluation point q in the domain R for Px ⇢ R and q in the domain R2 for Px ⇢ R2 , without the restriction KDE P (q) > r. With fixed coreset size 64,000, we experiment on the number of evaluation points from 1,000 to 128,000 for Px ⇢ R and 1,000 to 512,000 for Px ⇢ R2 in Figure 5.2. As the number of evaluation points increases, the value of maximum error in the domain will consistently approach some error value and we can then have some confidence that we have the correct worst case error as this processes plateaus. Under all the subset selection methods (Figure 5.2), the errors are steady at size 128,000 for Px ⇢ R and 512,000 for Px ⇢ R2 , so we use evaluation points of size 128,000 for Px ⇢ R and 512,000 for Px ⇢ R2 in the following experiments. Since all the methods are randomized algorithms, we run all the subset selection meth- ods ten times and use the average errors as the final results. The bandwidth is set to 400 for the real data set in R1 , 50 for the synthetic data set in R1 , and 50 for real data in R2 ; other bandwidths have similar performance. Figure 5.3 shows all the methods converge as the size of the coreset increases. The exception is Z-Aggregate on real data in R; on inspection, the problem occurs in sparse regions, similar to Figure 5.1. G-Aggregate and Aggregate-Neighbor (and sometimes Z-Aggregate) work significantly better compared to all the other methods in all data sets 92 Figure 5.2. The maximum L• error found based on the number of evaluation points on real (left) and synthetic data (middle) when Px ⇢ R1 , and real data (right) when Px ⇢ R2 . Figure 5.3. L• error for coresets when also testing sparse regions on real data(left) and synthetic data(middle) when Px ⇢ R, and real data(right) when Px ⇢ R2 . with Px ⇢ R. They consistently decrease, at certain sizes have one or two orders of magnitude less error, and obtain virtually no error at size about 50,000. Even when the size of the coreset is small, G-Aggregate and Aggregate-Neighbor have very small errors and converge very fast when the size increases. For Px ⇢ R2 , Aggregate-Neighbor achieve noticeably smaller error, but G-Aggregate (and Grid) perform well and are simpler. In particular, G-Aggregate and Aggregate-Neighbor stay at least one order of magnitude smaller in error than Random Sampling. This indicates it is much better to aggregate based on x-value than just randomly sample. It also justifies further thinning the data with these methods if the data should be modeled as a random sample since the additional error introduced would be negligible compared to what was already present due to the sampling. The grid-based methods also consistently outperform the (z-order) sorting-based methods, so it is better to compress based on the x-coordinate change, rather than on the number of points. Filling in a few neighbor values (the -Neighbor method) can also result in significant 93 gains in accuracy for Px ⇢ R2 , but for Px ⇢ R, it does not show much improvement, and sometimes performs worse. The larger error per points (Figure 5.3, real data) is mainly due to the extra points added without much error reduction. It seems just aggregating does a good enough job for Px ⇢ R, but for Px ⇢ R2 more complicated situations arise where this extra step is helpful. 5.4.3 Efficiency of Coresets To show the efficiency of our methods, we compare the construction time and query time based on the coreset compared to the original data set (denoted Org) for the data set Px ⇢ R. For both comparisons, inspired by the Improved Fast Gauss Transform [146] and other fast kernel evaluation methods, for each query point, only the neighbor points within ten bandwidth are queried to calculate the kernel regression values. The construction time includes building the tree structure for the local data query, plus the time to generate the coreset. The query time are based on 128,000 evaluation points. From Figure 5.4 for Px ⇢ R , Grid, G-Aggregate, Random Sample, Z-order, and Z-Aggregate have the most efficient construction times, roughly as fast as just reading the data. Note that k-Center becomes quite slow for large coreset size. Similarly, Grid, G-Aggregate, and Random Sample are very efficient for Px ⇢ R2 (Figure 5.5), but Z-order and Z-Aggregate have noticeable overhead compared to the grid-based methods (and have no accuracy or analysis advantage). In both settings, there is also considerable time overhead to running Aggregate-Neighbor, which has a slight accuracy advantage for Px ⇢ R2 - thus, it is probably only worth it if preprocessing time on these scales are not of much importance but accuracy for Px ⇢ R2 is. For the query time, all the methods improve at least 2 orders of magnitude over using the original data. Their query times are all about the same; this is as expected since they all produce a coreset of the same size, which can be used as proxy for the full data set in precisely the same way. • Main take-away In conclusion, G-Aggregate is the best algorithm in terms of effectiveness and efficiency for Px ⇢ R, with Aggregate-Neighbor has better accuracy in Px ⇢ R2 , but has some increased overhead in construction time. They are orders of magnitude faster than using the original data (and best among all proposed methods) 94 Figure 5.4. Comparison of construction time and query time for real data set with Px ⇢ R. Figure 5.5. Comparison of construction time and query time for real data set with Px ⇢ R2 . and have extremely small error even for small coreset sizes (again the best among all proposed methods). For data sets of size 1 or 2 million, they achieve very small error using only 10,000 points and almost no error around 100,000 points. They (especially G-Aggregate) are very simple to implement, and about as fast to construct as reading the data. As we have seen in Section 5.3, we are also able to prove very strong error guarantees for these methods. 5.4.4 Consistency with Bandwidth In Figure 5.6, we test the consistency of the algorithms by varying the bandwidth. We fix the number of evaluation point as 128,000 and the coreset size 62,499. By varying the bandwidth from 40 to 800 for real data (left) of Px ⇢ R, 10 to 160 for synthetic data (middle) for Px ⇢ R and real data (right) for Px ⇢ R2 , the errors are decreasing for all the methods. This matches with our analysis in Section 5.3 and aligns with the notion that the more we smooth the data, the more stable it is, and the fewer data points we 95 Figure 5.6. The relation between L• error and bandwidth for real data (left), synthetic data (middle) for Px ⇢ R and real data (right) for Px ⇢ R2 . actually need. Again, G-Aggregate consistently performs the best or among the best of our methods. The exception is the real data in R2 (right), where for very large bandwidths, the simple methods Z-order and Random Sample dominate. In this setting, these data are so smoothed that these methods exactly or roughly amount to a random sample, and work better than trying to fit gridded data to circularly smoothed estimates. Note, we do not attempt to automatically choose the bandwidth, as this should be a choice of the user to determine the scale they examine the data [150]. 5.4.5 Progressive Grid-Based Approaches We evaluate the progressive grid-based approach on the CloudLab data. The total coreset size is 316,485, using G-Aggregate in each region. And we use 256,000 evaluation points. We evaluate the algorithm at four smoothing choices s = {10, 15, 30, 45}. For each s, we gradually increase the window size T, starting at 1 day (86,400), up to 10 months (2.5 · 107 ), as shown in Figure 5.7. We see that as a new region is reached, and the grid size enlarges, then so does the error. Also, as long as T/s is bounded by 4 · 104 , the error stays under 0.01. Figure 5.7. The relation between window size and L• error for progressive G-Aggregate method. 96 5.4.6 High Dimension For simplicity, we only compare G-Aggregate and Random Sampling. When increasing the grid size, the number of empty grid increases as well, so we observe that the size of coreset does not increase exponentially. By increasing the number of grids cells from 10 to 20, that is a factor 2 in each dimension, the average number of nonempty grids for CAD data set are {902, 1367, 1863, 2538, 3150, 3791} and for CASP data set are {1034, 1554, 2122, 2742, 3543, 4342}. The relationship of L• error and coreset size is shown in the Figure 5.8(left), using bandwidths 3 and 3.8, respectively. The error decreases when the size of coreset increases for both methods. For the same coreset size, Random Sampling perform better than G-Aggregate, and its running time (right figure in Figure 5.8) is much less. This aligns with our theoretical bounds. For example, for grid size 206 , the G-Aggregate method takes about 250s, while the Random Sampling takes only around 3s. So we recommend the simple and fast method Random Sampling to generate coreset for kernel regression for high dimension data sets. 5.5 Conclusion We describe several algorithms for coresets for kernel regression. Many (random sampling, order-based thinning, and grid-based thinning) are common heuristics. As we demonstrate on data sets with millions of points, those based on grids work much better, and that small modification of aggregating and sometimes filling in sparse-neighborhood boundaries can make large difference in error reduction. With our best methods, massive data sets can be drastically reduced in size and have negligible error. Find our code and data at: http://www.cs.utah.edu/~yanzheng/kernel-reg/. Figure 5.8. Left: L• error for coresets of high dimensional data sets. Right: Running time to generate the coresets. CHAPTER 6 CONCLUSION The rate at which scientists and businesses are producing data is increasing at an unstoppable rate. The advancement of science and industry becomes heavily dependent on understanding these data sets. Kernel smoothing provides a simple way of finding structures in data sets without the imposition of a parametric model, in this dissertation, we consider kernel regression and kernel density estimates. However, the enormity of the data size precludes brute force approaches of analyzing it. Thus, data summarization is an important tool for dealing with massive data. Coresets enable accurate query answering while requiring much lower resources, and can be much faster. In Chapter 2 and Chapter 5, we have demonstrated that the effectiveness of coresets in data size reduction, and thus reduce the computation time of kernel density estimates and kernel regression significantly. As we have seen an example of applying KDE coreset in topological data analysis, there are numerous other challenges to apply KDE coreset. For example, in [107], we propose a new definition of #-net under kernels, called (#, t )-net of kernel range space. It tries to answer questions of what is the sample size we need to maintain to include a significant witness for every large enough event of noisy spatial data points, for example, twitter users with geo-coordinates. Putting this result to visualization purpose, we get a very quick way to visualize the large twitter data without looking at the whole data set: we only need one witness point for each large enough event, and do not need to care about the small events or outlier. By tuning the parameter # and t, we can control how much outlier we want to filter out. This method gives users a freedom to decide how much outlier we want to filter out and only keep visually curious regions. Another challenge is applying kernel smoothing technique into other applications. For instance, spatial scan statistics [83], as a effective anomaly detection method, computes the maximum discrepancy region obtained by scanning the spatial region under study with 98 a set of circular regions of various radii. The discrepancy score for each region is based on a likelihood ratio test statistic constructed to detect significant over density under the Poisson or Bernoulli model. Putting kernel in the classic spatial scan statistics setting, the circular regions or other hard boundary geometric regions turn to be smooth ranges, the value of each point in each range relates how close it is to the kernel center, and thus statistically robust to spatial noise. As discussed in [36], kernel density estimates is a kernel mean that estimates the density of the data. The kernel mean embedding (KME) is another form of kernel mean that maps the probability distribution into a reproducing kernel Hilbert space. Since some of the kernels discussed in this dissertation, for example, Gaussian kernel and Laplacian kernel, are also symmetric positive definite kernels, our methods can apply to both kernel means, which have many applications in machine learning. For example, anomaly detection [39] and mean-shift clustering [32] for KDE, kernel two sample test [59] and support measure machines [94] for KME. In summary, this dissertation provides a contribution to the problem of creating coresets for kernel smoothing with approximation guarantees. There is still plenty of work left to be done in this area in order to be able to understand the kernel smoothing and how to use it in various applications. This is becoming a central problem to large noisy data analysis, and the techniques presented herein will hopefully be used as building blocks for future progress in this direction. APPENDIX PROOFS FOR CORESET ANALYSIS A.1 Linking and (r, #)-Approximations for Kernel Regression Theorem 17. For any kernel K : R d ⇥ R d ! R + linked to a range space (R d , A), a (r, #)approximation S of ( P, A) for S ⇢ R d is a (rK + , 2#)-approximation of ( P, K ), where K + = max p,q2 P K ( p, q). Proof. We first give the definition of k. For two point sets P, S, define a similarity between the two point sets as k ( P, S) = 1 1 \| P\| \|S\| Â Â K( p, s), p2 P s2S and when the pointset S only contains one single point s and a subset P0 ⇢ P, we have k P ( P0 , s) = (1/\| P\|) Â p2 P0 K ( p, s). Then we follow the same technique in proof of Theorem 5.1 in [78], suppose q is any query point, we can sort all pi 2 P in similarity to q so that pi < p j (and by notation i < j) if K ( pi , q) > K ( p j , q). Thus any super-level set containing p j also contains pi for i < j. We can now consider the one-dimensional problem on this sorted order from q. We now count the deviation D ( P, S, q) = KDE P (q) KDE S ( q ) from p1 to pn using a charging scheme. That is each element s j 2 S is charged to g = \| P\|/\|S\| points in P. For simplicity we will assume that g is an integer, otherwise we can allow fractional charges. We now construct a partition of P slightly differently, for positive and negative D ( P, S, q) values, corresponding to undercounts and overcounts, respectively. • Undercount of KDE S (q): For undercounts, we partition P into 2\|S\| sets { P10 , P1 , P20 , P2 , ..., P\|0S\| , P\|S\| } of consecutive points by the sorted order from q. Starting with p1 , we place points in set Pj0 or Pj following their sorted order. Recursively on j and i, starting at j = 1 and i = 1, we place each pi in Pj0 as long as K ( pi , q) > K (s j , q)(this may be empty). Then we place the next g points pi into Pj . After g points are placed in Pj , we begin with Pj0+1 , until all of P has been placed in some set. Let t  \|S\| be the index of the last set Pj 100 such that \| Pj \| = g. Note that for all pi 2 Pj (for j  t) we have K (s j , q) kS ({s j }, q) K ( pi , q), thus k P ( Pj , q). We can now bound the undercount as D ( P, S, q) = \|S\| Â j =1 k P ( Pj , q) \|S\| kS ({s j }, q) + Â k P ( Pj0 , q) j =1 t +1  Â kP ( Pj0 , q) j =1 since the first term is at most 0 and \| Pj0 \| = 0 for j > t + 1. Now consider a super-level set H 2 A containing all points before st+1 ; H is the smallest range that contains every non-empty Pj0 . Because (for j  t) each set Pj can be charged to s j , then Âtj=1 \| Pj \ H \| = g\|S \ H \|. And because S is an (r, #)-approximation of ( P, A), then n \|P \ H\| o \|S \ H \| ,r  # max \|S\| \| P\| \|P \ H\| \| P\| 1 t +1 0 1 t +1 0 \| P \| = \| Pj \ H \| j \| P\| jÂ \| P\| jÂ =1 =1 t 1 t +1 0 ( Â \| Pj \ H \| + Â \| Pj \ H \| g\|S \ H \|) \| P \| j =1 j =1 n \|P \ H\| o \| P \ H \| \|S \ H \| = ,r  # max \| P\| \|S\| \| P\| = We can now bound D ( P, S, q)  When \| P\ H \| \| P\| r, and r Â p2 P10 D ( P, S, q)   t +1 t +1 j =1 j=1 p2 Pj Â kP ( Pj0 , q) = K ( p,q) \| P\| t +1 . Â Â0 j=1 p2 Pj # \| P\| Â Â0 K ( p, q) \| P\| #, K ( p, q) \| P\| Â K( p, q) + Â 0 p2 P p2 P1 K ( p, q) \| P\|  # KDE P (q) + #r  2# max{ KDE P (q), r}. The second inequality is because, all the points in Pj has larger K (·, q) values than the points in Pj0+1 and 1 \| P\| \| P\ H \| +1 0 Âtj= 1 \| Pj \|  # \| P\| . 101 When \| P\ H \| \| P\|  r, D ( P, S, q)   • Overcount of KDE S ( q ): t +1 Â Â0 j=1 p2 Pj K ( p, q) \| P\| 1 t +1 0 + \| Pj \|K  #rK + . \| P\| jÂ =1 The analysis for overcounts is similar to undercounts, but we partition the data in a reverse way: we partition P into 2\|S\| sets { P1 , P10 , P2 , P20 , ..., P\|S\| , P\|0S\| } of consecutive points by the sorted order from q (some of the sets may be empty). Starting with pn (the furthest point from q) we place points in sets Pj0 or Pj following their reverse-sorted order. Recursively on j and i, starting at j = \|S\| and i = n, we place each pi in Pj0 as long as K ( pi , q) < K (s j , q) (this may be empty). Then we place the next g points pi into Pj . After g points are placed in Pj , we begin with Pj0 1 , until all of P has been placed in some set. Let t  \|S\| be the index of the last set Pj such that \| Pj \| = g (the smallest such j). Note that for all pi 2 Pj (for j t) we have K (s j , q)  K ( pi , q), thus kS ({s j }, q)  k P ( Pj , q). We can now bound the (negative) overcount as D ( P, S, q) = t Â k P ( Pj , q) kS ({s j }, q) k P ( Pj , q) kS ({s j }, q) + Â k P ( Pj0 , q) j=\|S\| 1 + Â j=t 1 k P ( Pt 1 , q) \|S\| j =1 1 Â j=t 1 kS ({s j }, q) since the first full term is at least 0, as is each k P ( Pj , q) and k P ( Pj0 , q) term in the second and third terms. We will need the one term k P ( Pt 1 , q) related to P. Now using that S is an (r, #)-sample of ( P, A), we will derive a bound on t. We consider the maximal super-level set H 2 A such that no points H 2 P are in Pj0 for any j. This is the largest set where each point p 2 P can be charged to a point s 2 S such that K ( p, q) > K (s, q), and thus presents the smallest (negative) overcount. In this case, H \ P = \w j=1 Pj for some w and H \ S = \w j=1 { s j }. Since t  w, then \| H \ P \| = ( w (w t + 1)\| P\|/\|S\| + \| Pt 1\| t + 1) g + \| Pt and \| H \ S\| = w. With the definition of (r, #)-sample, 1\| = 102 n \|P \ H\| o \|P \ H\|  # max ,r \| P\| \| P\| \|S \ H \| \|S\| If \| P\ H \| \| P\| r, t 1 \|S\| \| Pt 1 \| \| P\| \|S \ H \| \| P \ H \| \|S\| \| P\| w (w t + 1)\| P\|/\|S\| = \|S\| \| P\| t 1 \| Pt 1 \| \|S\| \| P\| \|  # \| P\|\PH , the same as \| D ( P, S, q) k P ( Pt 1 , q) k ( Pt 1 , q) = \| P\| # The second inequality is because any p 2 Pj , K (s j , q)  K ( p, q). If \| P\ H \| \| P\| mini0 2 Pt  r, 1 t 1 \|S\| K ( pi0 , q) \| Pt 1 \| \| P\| Â1j=w \| Pt 1 \| \| P\| \| Pt 1 \| \| P\| t 1 \|S\| 1 Â \| # \| P\|\PH , then \| j=t 1 kS ({s j }, q) Â1j=t 1 k ( Pj , q) \| P\| K (s j , q) \|S\| # KDE P (q) \| # \| P\|\PH and for each s j with j  w, for \| t 1 \|S\|  #r, the same as t \| Pt 1 \| \| P\| 2  #r\|S\| + \|S\|\| Pt 1 \| \| P\| 1. Letting pi = D ( P, S, q) k P ( Pt = 1 , q) kS ({st 1 }, q ) + 1 Â j=t 2 kS ({s j }, q) k ( Pt 1 , q) \| P\| K (st 1 , q) \|S\|\| Pt 1 \| K+ (#r\|S\| + 1) \|S\| \| P\| \|S\| ⇣ g \| P \| ⌘ g · K (s , q) k ( P , q) t 1 t 1 t 1 #rK + + K + \| P\| \| P\| ⇣ g \|P \| ⌘ ⇣ g \|P \| ⌘ t 1 t 1 #rK + + K + K ( pi , q ) \| P\| \| P\| #rK + So when \| P\ H \| \| P\| r, S is an (r, 2#)- approximation of ( P, K), and when (#rK + )-approximation of ( P, K). \| P\ H \| \| P\|  r, it is a REFERENCES [1] P. K. Agarwal, S. Har-Peled, H. Kaplan, and M. Sharir, Union of random minkowski sums and network vulnerability analysis, in Proceedings of 29th Symposium on Computational Geometry, 2013. [2] A. F. Alessandro Bergamo, Lorenzo Torresani, Picodes: Learning a compact code for novel-category recognition, in NIPS, 2011. [3] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, Scale-sensitive dimensions, uniform convergence, and learnability, Journal of ACM, 44 (1997), pp. 615-631. [4] A. W. Appel, An efficient program for many-body simulation, SIAM J.Scientific and Statistical Computing, 6 (1985), pp. 85-103. [5] B. Aronov, E. Ezra, and M. Sharir, Small size #-nets for axis-parallel rectangles and boxes, Siam Journal of Computing, 39 (2010), pp. 3248-3282. [6] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society, 68 (1950), pp. 337-404. [7] F. Bach, S. Lacoste-Julien, and G. Obozinski, On the equivalence between herding and conditional gradient algorithms, arXiv preprint arXiv:1203.4523, (2012). [8] A. Bagchi, A. Chaudhary, D. Eppstein, and M. T. Goodrich, Deterministic sampling and range counting in geometric data streams, ACM Transactions on Algorithms, 3 (2007), p. 16. [9] S. Balakrishnan, B. T. Fasy, F. Lecci, A. Rinaldo, A. Singh, and L. Wasserman, Statistical inference for persistent homology, tech. rep., ArXiv:1303.7117, March 2013. [10] J. Beck, Irregularities of distribution I, Acta Mathematica, 159 (1987), pp. 1-49. [11] M. Bern and D. Eppstein, Worst-case bounds for subadditive geometric graphs, Proceedings 9th Symposium on Computational Geometry, (1993). [12] R. Blundell and A. Duncan, Kernel regression in empirical microeconomics, Journal of Human Resources, (1998), pp. 62-87. [13] T. Bouezmarni and J. V. Rombouts, Nonparametric density estimation for multivariate bounded data, J. Statistical Planning and Inference, 140 (2010), pp. 139-152. [14] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, Near-optimal coresets for leastsquares regression, IEEE Trans. Information Theory, 59 (2013). [15] A. W. Bowman, An alternative method of cross-validation for the smoothing of density estimates, Biometrika, 71 (1984), pp. 353-360. 104 [16] M. J. Brewer, A bayesian model for local smoothing in kernel density estimation, Statistics and Computing, 10 (2000), pp. 299-309. [17] T. Brox, B. Rosenhahn, D. Cremers, and H.-P. Seidel, Nonparametric density estimation with adaptive, anisotropic kernels for human motion tracking, in Human Motion- Understanding, Modeling, Capture and Animation, Springer, 2007, pp. 152-165. [18] J. D. Brutlag, Aberrant behavior detection in time series for network monitoring, in System Administration Conference (LISA), USENIX, 2000. [19] P. B. Callahan and S. R. Kosaraju, Algorithms for dynamic closest-pair and n-body potential fields, in SODA, 1995. [20] R. J. Campello, D. Moulavi, A. Zimek, and J. Sander, Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Transactions on Knowledge Discovery from Data (TKDD), 10 (2015), p. 5. [21] Y. Cao, H. He, H. Man, and X. Shen, Integration of self-organizing map (SOM) and kernel density estimation (KDE) for network intrusion detection, in SPIE Europe Security+ Defence, 2009. [22] S. Chan, I. Diakonikolas, R. A. Servedio, and X. Sun, Efficient density estimation via piecewise polynomial approximation, in STOC, 2014. [23] F. Chazal, D. Cohen-Steiner, and A. Lieutier, Normal cone approximation and offset shape isotopy, Computational Geometry: Theory and Applications, 42 (2009), pp. 566- 581. [24] F. Chazal, D. Cohen-Steiner, and A. Lieutier, A sampling theory for compact sets in Euclidean space, Discrete Computational Geometry, 41 (2009), pp. 461-479. [25] F. Chazal, D. Cohen-Steiner, and Q. Mérigot, Geometric inference for probability measures, Foundations of Computational Mathematics, 11 (2011), pp. 733-751. [26] F. Chazal and A. Lieutier, Weak feature size and persistent homology: computing homology of solids in rn from noisy data samples, Proceedings 21st Annual Symposium on Computational Geometry, (2005), pp. 255-262. [27] F. Chazal and A. Lieutier, Topology guaranteeing manifold reconstruction using distance function to noisy data, Proceedings of the 22nd Annual Symposium on Computational Geometry, (2006), pp. 112-118. [28] B. Chazelle, The Discrepancy Method, Cambridge, 2000. [29] B. Chazelle and J. Matousek, On linear-time deterministic algorithms for optimization problems in fixed dimensions, J. Algorithms, 21 (1996), pp. 579-597. [30] Y. Chen, M. Welling, and A. Smola, Super-samples from kernel hearding, in Conference on Uncertainty in Artificial Intellegence, 2010. [31] Y. Chen, M. Welling, and A. Smola, Super-samples from kernel herding, arXiv preprint arXiv:1203.3472, (2012). 105 [32] Y. Cheng, Mean shift, mode seeking, and clustering, IEEE transactions on pattern analysis and machine intelligence, 17 (1995), pp. 790-799. [33] K. L. Clarkson, Coresets, sparse greedy approximation, and the frank-wolfe algorithm, ACM Transactions on Algorithms (TALG), 6 (2010), p. 63. [34] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava., Space- and timeefficient deterministic algorithms for biased quantiles over data streams, in PODS, 2006. [35] G. Cormode, S. Muthukrishnan, K. Yi, and Q. Zhang, Optimal sampling from distributed streams, in PODS, 2010. [36] E. C. Cortés and C. Scott, Sparse approximation of a kernel mean, IEEE Transactions on Signal Processing, (2016). [37] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney, Sampling algorithms and coresets for ` p regression, SICOMP, 38 (2009), pp. 2060-2078. [38] M. S. de Lima and G. S. Atuncar, A bayesian method to estimate the optimal bandwidth for multivariate kernel estimator, Journal of Nonparametric Statistics, 23 (2011), pp. 137-148. [39] M. Desforges, P. Jacob, and J. Cooper, Applications of probability density estimation to the detection of abnormal conditions in engineering, Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 212 (1998), pp. 687-703. [40] L. Devroye and L. Györfi, Nonparametric Density Estimation: The L1 View, Wiley, 1984. [41] L. Devroye and G. Lugosi, Combinatorial Methods in Density Estimation, SpringerVerlag, 2001. [42] R. Duan, S. Pettie, and H.-H. Su, Scaling algorithms for approximate and exact maximum weight matching, tech. rep., arXiv:1112.0790, 2011. [43] D. P. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms, Cambridge, 2009. [44] T. Duong et al., ks: Kernel density estimation and kernel discriminant analysis for multivariate data in r, Journal of Statistical Software, 21 (2007), pp. 1-16. [45] T. Duong and M. L. Hazelton, Cross-validation bandwidth matrices for multivariate kernel density estimation, Scandinavian J. of Stat., 32 (2005), pp. 485-506. [46] H. Edelsbrunner, M. Facello, P. Fu, and J. Liang, Measuring proteins and voids in proteins, in Proceedings 28th Annual Hawaii International Conference on Systems Science, 1995. [47] H. Edelsbrunner, B. T. Fasy, and G. Rote, Add isotropic Gaussian kernels at own risk: More and more resiliant modes in higher dimensions, Proceedings 28th Annual Symposium on Computational Geometry, (2012), pp. 91-100. 106 [48] J. Edmonds, Paths, trees, and flowers, Canadian Journal of Mathematics, 17 (1965), pp. 449-467. [49] B. T. Fasy, J. Kim, F. Lecci, and C. Maria, Introduction to the R package TDA, tech. rep., arXiV:1411.1830, 2014. [50] A. Gangopadhyay and K. Cheung, Bayesian approach to choice of smoothing parameter in kernel density estimation, J. of Nonparam. Stat., 14 (2002), pp. 655-664. [51] M. Gao, C. Chen, S. Zhang, Z. Qian, D. Metaxas, and L. Axel, Segmenting the papillary muscles and the trabeculae from high resolution cardiac CT through restoration of topological handles, in Proceedings of the International Conference on Information Processing in Medical Imaging, 2013. [52] J. Glaunès, Transport par difféomorphismes de points, de mesures et de courants pour la comparaison de formes et l'anatomie numérique., PhD thesis, Université Paris 13, 2005. [53] T. F. Gonzalez, Clustering to minimize the maximum intercluster distance, Theoretical Computer Science, 38 (1985), pp. 293-306. [54] S. Govindarajan, P. K. Agarwal, and L. Arge, CRB-tree: An efficient indexing scheme for range-aggregate queries, in ICDT, 2003, pp. 143-157. [55] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh, Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total, in ICDE, 1996. [56] L. Greengard and J. Strain, The fast Gauss transform, Journal Scientific and Statistical Computing, 12 (1991). [57] M. Greenwald and S. Khanna, Space-efficient online computation of quantile summaries, in SIGMOD, 2001. [58] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, A kernel two-sample test, Journal of Machine Learning Research, 13 (2012), pp. 723-773. [59] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, A kernel two-sample test, Journal of Machine Learning Research, 13 (2012), pp. 723-773. [60] S. Guha, N. Koudas, and K. Shim, Approximation and streaming algorithms for histogram construction problems, ACM TODS, 31 (2006), pp. 396-438. [61] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi, Approximating multidimensional aggregate range queries over real attributes, in SIGMOD, 2000, pp. 463-474. [62] J. Habbema, J. Hermans, and K. van den Broek, A stepwise discrimination analysis program using density estimation, Proceedings in Computational Statistics, (1974). [63] P. Hall, J. Marron, and B. U. Park, Smoothed cross-validation, Prob. The. and Rel. Fields, 92 (1992), pp. 1-20. [64] P. Hall and M. P. Wand, Minimizing L1 distance in nonparametric density estimation, Journal of Multivariate Analysis, 26 (1988), pp. 59-88. 107 [65] S. Har-Peled, Geometric Approximation Algorithms, Chapter 2, American Mathematical Society, 2011. [66] S. Har-Peled, Geometric approximation algorithms, vol. 173, American mathematical society Boston, 2011. [67] S. Har-Peled, H. Kaplan, M. Sharir, and S. Smorodinksy, #-nets for halfspaces revisited, tech. rep., arXiv:1410.3154, 2014. [68] N. Harvey and S. Samadi, Near-optimal herding., in COLT, 2014, pp. 1165-1182. [69] D. Haussler and E. Welzl, epsilon-nets and simplex range queries., Disc. & Comp. Geom., 2 (1987), pp. 127-151. [70] M. Hein and O. Bousquet, Hilbertian metrics and positive definite kernels on probability measures, in Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005. [71] C.-T. Ho, R. Agrawal, N. Megiddo, and R. Srikant, Range queries in OLAP data cubes, in SIGMOD, 1997. [72] J. Horel, M. Splitt, L. Dunn, J. Pechmann, B. White, C. Ciliberti, S. Lazarus, J. Slemmer, D. Zaff, and J. Burks, Mesowest: Cooperative mesonets in the western united states, Bulletin of the American Meteorological Society, 83 (2002), pp. 211-225. [73] S. Hu, D. S. Poskitt, and X. Zhang, Bayesian adaptive bandwidth kernel density estimation of irregular multivariate distributions, CS&DA, 56 (2012), pp. 732-740. [74] Z. Huang, L. Wang, K. Yi, and Y. Liu, Sampling based algorithms for quantile computation in sensor networks, in SIGMOD, 2011, pp. 745-756. [75] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel, Optimal histograms with quality guarantees, in VLDB, 1998. [76] M. Jones and S. Sheather, Using non-stochastic terms to advantage in kernel-based estimation of integrated squared density derivatives, Statistics & Probability Letters, 11 (1991), pp. 511-514. [77] M. C. Jones, J. S. Marron, and S. J. Sheather, A brief survey of bandwidth selection for density esimation, American Statistical Association, 91 (1996), pp. 401-407. [78] S. Joshi, R. V. Kommaraji, J. M. Phillips, and S. Venkatasubramanian, Comparing distributions and shapes using the kernel distance, in Proceedings of the twenty-seventh annual symposium on Computational geometry, ACM, 2011, pp. 47-56. [79] J. Kiefer, Sequential minimax search for a maximum, Proc. Am. Mathematical Society, 4 (1953), pp. 502-506. [80] V. Kolmogorov, BLOSSOM V: A new implementation of a minimum cost perfect matching algorithm, Mathematical Programming, 1 (2009). [81] N. Koudas, S. Muthukrishnan, and D. Srivastava, Optimal histograms for hierarchical range queries, in PODS, 2000. 108 [82] K. Kulasekera and W. Padgett, Bayes bandwidth selection in kernel density estimation with censored data, Nonparametric Statistics, 18 (2006), pp. 129-143. [83] M. Kulldorff, A spatial scan statistic, Communications in Statistics-Theory and Methods, 26 (1997), pp. 1481-1496. [84] K. G. Larsen, On range searching in the group model and combinatorial discrepancy, in Proceedings of the 52nd IEEE Symposium on Foundations of Computer Science, 2011. [85] Y. Li, P. M. Long, and A. Srinivasan, Improved bounds on the sample complexity of learning, Journal of Computer and System Sciences, 62 (2001), pp. 516-527. [86] J. Liang, H. Edelsbrunner, P. Fu, P. V. Sudharkar, and S. Subramanian, Analytic shape computation of macromolecues: I. molecular area and volume through alpha shape, Proteins: Structure, Function, and Genetics, 33 (1998), pp. 1-17. [87] X. Lin, J. Xu, Q. Zhang, H. Lu, J. X. Yu, X. Zhou, and Y. Yuan, Approximate processing of massive continuous quantile queries over high-speed data streams, TKDE, 18 (2006), pp. 683-698. [88] G. S. Manku, S. Rajagopalan, and B. G. Lindsay, Approximate medians and other quantiles in one pass and with limited memory, in SIGMOD, 1998, pp. 426-435. [89] J. Marron and A. Tsybakov, Visual error criteria for qualitative smoothing, Journal of the American Statistical Association, 90 (1995), pp. 499-507. [90] J. Matoušek, Approximations and optimal geometric divide-and-conquer, in STOC, 1991. [91] J. Matoušek, Tight upper bounds for the discrepancy of halfspaces., Discrete & Computational Geometry, 13 (1995), pp. 593-601. [92] J. Matoušek, Geometric Discrepancy, Springer, 1999. [93] J. Matoušek, R. Seidel, and E. Welzl, How to net a lot with little: Small #-nets for disks and halfspaces, in Proceedings of the 6th Annual Symposium on Computational Geometry, 1990. [94] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf, Learning from distributions via support measure machines, in Proceedings of Advances in Neural Information Processing Systems, 2012, pp. 10-18. [95] A. Müller, Integral probability metrics and their generating classes of functions, Advances in Applied Probability, 29 (1997), pp. 429-443. [96] E. A. Nadaraya, On estimating regression, Theory of Probability and its Applications, 9 (1964), pp. 141-142. [97] R. K. Pace and R. Barry, Sparse spatial autoregressions, Statistics & Probability Letters, 33 (1997), pp. 291-297. [98] J. Pach and P. K. Agarwal, Combinatorial geometry., Wiley-Interscience Series in Discrete Mathematics and Optimization, Wiley, 1995. 109 [99] J. Pach and G. Tardos, Tight lower bounds for the size of epsilon-nets, Journal of American Mathematical Society, 26 (2013), pp. 645-658. [100] E. Parzen, On estimation of a probability density function and mode, Annals of Mathematical Statistics, 33 (1962), pp. 1065-1076. [101] E. Parzen, Probability density functionals and reproducing kernel hilbert spaces, in Proceedings of the Symposium on Time Series Analysis, vol. 196, Wiley, New York, 1963, pp. 155-169. [102] J. M. Phillips, Algorithms for #-approximations of terrains, in Automata, Languages and Programming, Springer, 2008, pp. 447-458. [103] J. M. Phillips, #-samples for kernels, in Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, 2013, pp. 1622-1632. [104] J. M. Phillips, Coresets and Sketches, CRC Press, 3rd ed., 2016. [105] J. M. Phillips and S. Venkatasubramanian, A gentle introduction to the kernel distance. arXiv:1103.1625, March 2011. [106] J. M. Phillips, B. Wang, and Y. Zheng, Geometric inference on kernel density estimates, arXiv preprint arXiv:1307.7760 (SoCG 2015), (2013). [107] J. M. Phillips and Y. Zheng, Subsampling in smoothed range spaces, in Proceedings of the 26th International Conference, on Algorithmic Learning Theory, ALT 2015, 2015, pp. 224-238. [108] D. Pollard, Emperical Processes: Theory and Applications, NSF-CBMS REgional Confernece Series in Probability and Statistics, 1990. [109] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita, Improved histograms for selectivity estimation of range predicates, in SIGMOD, 1996, pp. 294-305. [110] C. A. Price, O. Symonova, Y. Mileyko, T. Hilley, and J. W. Weitz, Leaf gui: Segmenting and analyzing the structure of leaf veins and areoles, Plant Physiology, 155 (2011), pp. 236-245. [111] E. Pyrga and S. Ray, New existence proofs #-nets, in Proceedings 24th Annual Symposium on Computational Geometry, 2008. [112] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006. [113] V. C. Raykar and R. Duraiswami, Fast optimal bandwidth selection for kernel density estimation, in SDM, 2006. [114] V. C. Raykar, R. Duraiswami, and L. H. Zhao, Fast computation of kernel estimators, J. of Computational and Graphical Statistics, 19 (2010), pp. 205-220. [115] R. Ricci, E. Eide, and C. Team, Introducing cloudlab: Scientific infrastructure for advancing cloud architectures and applications, ; login:: the magazine of USENIX & SAGE, 39 (2014), pp. 36-38. 110 [116] M. Rosenblatt et al., Remarks on some nonparametric estimates of a density function, The Annals of Mathematical Statistics, 27 (1956), pp. 832-837. [117] M. Rudemo, Empirical choice of histograms and kernel density estimators, Scandinavian Journal of Statistics, (1982), pp. 65-78. [118] S. R. Sain, K. A. Baggerly, and D. W. Scott, Cross-validation of multivariate densities, Journal of the American Statistical Association, 89 (1994), pp. 807-817. [119] B. Schölkopf and A. J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, 2002. [120] E. Schubert, A. Zimek, and H.-P. Kriegel, Generalized outlier detection with flexible kernel density estimates, in Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, 2014, pp. 542-550. [121] D. Scott, R. Tapia, and J. Thompson, Kernel density estimation revisited,, Nonlinear Analysis, Theory, Methods and Appplication, 1 (1977), pp. 339-372. [122] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, Wiley, 1992. [123] D. W. Scott and G. R. Terrell, Biased and unbiased cross-validation in density estimation, J. ASA, 82 (1987), pp. 1131-1146. [124] N. Sharma, P. Sharma, D. Irwin, and P. Shenoy, Predicting solar generation from weather forecasts using machine learning, in SmartGridComm, 2011. [125] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004. [126] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall/CRC, 1986. [127] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. G. Lanckriet, Hilbert space embeddings and metrics on probability measures, Journal of Machine Learning Research, 11 (2010), pp. 1517-1561. [128] S. Suri, C. D. Tóth, and Y. Zhou, Range counting over multidimensional data streams, Discrete and Computational Geometry, 36 (2006), pp. 633-655. [129] H. Takeda, S. Farsiu, and P. Milanfar, Kernel regression for image processing and reconstruction, Image Processing, IEEE Transactions on, 16 (2007), pp. 349-366. [130] G. R. Terrell, The maximal smoothing principle in density estimation, Journal of the American Statistical Association, 85 (1990), pp. 470-477. [131] F. Tom, Mining the quantified self: Personal knowledge discovery as a challenge for data science, Big Data, 3 (2016), pp. 249-266. [132] B. A. Turlach, Bandwidth selection in kernel density estimation: A review. Discussion paper 9317, Istitut de Statistique, UCL, Louvain-la-Neuve, Belgium. 111 [133] E. Ullman, A theory of location for cities, American Journal of Sociology, (1941), pp. 853-864. [134] V. Vapnik, Inductive principles of the search for empirical dependencies, in Proceedings 2nd Annual Workshop on a Computational Learning Theory, 1989. [135] V. Vapnik and A. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory of Probability and its Applications, 16 (1971), pp. 264-280. [136] K. R. Varadarajan, A divide-and-conquer algorithm for min-cost perfect matching in the plane, in Proceedings of the 39th IEEE Symposium on Foundations of Computer Science, 1998. [137] C. Villani, Topics in Optimal Transportation, American Mathematical Society, 2003. [138] J. S. Vitter, Random sampling with a reservoir, ACM Transactions on Mathematical Software (TOMS), 11 (1985), pp. 37-57. [139] J. S. Vitter, M. Wang, and B. Iyer, Data cube approximation and histograms via wavelets, in CIKM, 1998. [140] G. Wahba, Support vector machines, reproducing kernel Hilbert spaces, and randomization GACV, in Advances in Kernel Methods - Support Vector Learning, Bernhard Schölkopf and Alezander J. Smola and Christopher J. C. Burges and Rosanna Soentpiet, 1999, pp. 69-88. [141] M. P. Wand and M. C. Jones, Multivariate plug-in bandwidth selection, Computational Statistics, 9 (1994), pp. 97-116. [142] G. S. Watson, Smooth regression analysis, Indian Journal of Statistics, Series A, 26 (1964), pp. 359-372. [143] X. Wei and Y. Li, Theoretical analysis of a rigid coreset minimum enclosing ball algorithm for kernel regression estimation, in International Symposium on Neural Networks, 2008, pp. 741-752. [144] J. R. Wolberg, Expert Trading Systems: Modeling Financial Markets with Kernel Regression, Wiley, 2000. [145] C. Yang, R. Duraiswami, and L. S. Davis, Efficient kernel machines using the improved fast gauss transform, in NIPS, 2004. [146] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis, Improved fast Gauss transform and efficient kernel density estimation, in Proceedings of the 9th International Conference on Computer Vision, 2003, pp. 664-671. [147] T. Zhang, R. Ramakrishnan, and M. Livny, Fast density estimation using cf-kernel for very large databases, in KDD, 1999, pp. 312-316. [148] X. Zhang, M. L. King, and R. J. Hyndman, A bayesian approach to bandwidth selection for multivariate kernel density estimation, CS&DA, 50 (2006), pp. 3009-3031. 112 [149] Y. Zheng, J. Jestes, J. M. Phillips, and F. Li, Quality and efficiency in kernel density estimates for large data, in Proceedings of the ACM Conference on the Management of Data (SIGMOD), 2012. [150] Y. Zheng and J. M. Phillips, L infty error and bandwith selection for kernel density estimates of large data, in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015. [151] Y. Zheng and J. M. Phillips, Coresets for kernel regression, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, New York, NY, USA, 2017, ACM, pp. 645-654.
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6n05pch