Sublinear algorithms for massive data computations: streaming and coresets

Publication Type dissertation
School or College College of Engineering
Department Computing
Author Daruki, Samira
Title Sublinear algorithms for massive data computations: streaming and coresets
Date 2018
Description The contributions of this dissertation are centered around designing new algorithms in the general area of sublinear algorithms such as streaming, coresets and sublinear verification, with a special interest in problems arising from data analysis including data summarization, clustering, matrix problems and massive graphs. In the first part, we focus on summaries and coresets, which are among the main techniques for designing sublinear algorithms for massive data sets. We initiate the study of coresets for uncertain data and study coresets for various types of range counting queries on uncertain data. We focus mainly on the indecisive model of locational uncertainty since it comes up frequently in real-world applications when multiple readings of the same object are made. In this model, each uncertain point has a probability density describing its location, defined as $k$ distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by examining only this subset. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based techniques on axis-aligned range queries. In the second part, we focus on designing sublinear-space algorithms for approximate computations on massive graphs. In particular, we consider the graph MAX CUT and correlation clustering problems and develop sampling-based approaches to construct truly sublinear ($o(n)$-sized) coresets for graphs that have polynomial (i.e., $n^{\delta}$ for any $\delta >0$) average degree. Our technique is based on analyzing properties of random induced subprograms of the linear program formulations of the problems. We demonstrate this technique with two examples.
Firstly, we present a sublinear-sized coreset to approximate the value of the MAX CUT in a graph to a $(1+\epsilon)$ factor. To the best of our knowledge, all the known methods in this regime rely crucially on near-regularity assumptions. Secondly, we apply the same framework to construct a sublinear-sized coreset for correlation clustering. Our coreset construction also suggests 2-pass streaming algorithms for computing the MAX CUT and correlation clustering objective values, which are left as future work at the time of writing this dissertation. Finally, we focus on streaming verification algorithms as another model for designing sublinear algorithms. We give the first polylog space and sublinear (in number of edges) communication protocols for any streaming verification problem in graphs. We present efficient streaming interactive proofs that can verify maximum matching exactly. Our results cover all flavors of matchings (bipartite/nonbipartite and weighted). In addition, we also present streaming verifiers for approximate metric TSP and exact triangle counting, as well as for graph primitives such as the number of connected components, bipartiteness, minimum spanning tree and connectivity. In particular, these are the first results for weighted matchings and for metric TSP in any streaming verification model. Our streaming verifiers use only polylogarithmic space while exchanging only polylogarithmic communication with the prover in addition to the output size of the relevant solution. We also initiate a study of streaming interactive proofs (SIPs) for problems in data analysis and present efficient SIPs for some fundamental problems. We present protocols for clustering and shape fitting including minimum enclosing ball (MEB), width of a point set, $k$-centers and the $k$-slab problem.
We also present protocols for fundamental matrix analysis problems: we provide an improved protocol for rectangular matrix problems, which in turn can be used to verify $k$ (approximate) eigenvectors of an $n \times n$ integer matrix $A$. In general, our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space.
Type Text
Publisher University of Utah
Subject Computer science
Dissertation Name Doctor of Philosophy
Language eng
Rights Management (c) Samira Daruki
Format Medium application/pdf
ARK ark:/87278/s64f6s1g
Setname ir_etd
ID 1513334
Reference URL https://collections.lib.utah.edu/ark:/87278/s64f6s1g