| Title | New problems in exploring distributed data |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Computing |
| Author | Tang, Mingwang |
| Date | 2015-05 |
| Description | In the era of big data, many applications generate continuous online data from distributed locations, scattered devices, etc. Examples include data from social media, financial services, and sensor networks. Meanwhile, large volumes of data can be archived or stored offline in distributed locations for further data analysis. Challenges from data uncertainty, large-scale data size, and distributed data sources motivate us to revisit several classic problems for both online and offline data exploration. The problem of continuous threshold monitoring for distributed data is commonly encountered in many real-world applications. We study this problem for distributed probabilistic data. We show how to prune expensive threshold queries using various tail bounds and combine tail-bound techniques with adaptive algorithms for monitoring distributed deterministic data. We also show how to approximate threshold queries based on sampling techniques. Threshold monitoring problems can only tell whether a monitoring function is above or below a threshold constraint, but not how far away from it. This motivates us to study the problem of continuously tracking functions over distributed data. We first investigate the tracking problem on a chain topology. Then we show how to solve tracking problems in a distributed setting using solutions for the chain model. We study online tracking of the max function on "broom" tree and general tree topologies. Finally, we examine building scalable histograms for distributed probabilistic data. We show how to build approximate histograms based on a partition-and-merge principle on a centralized machine. Then, we show how to extend our solutions to distributed and parallel settings to further mitigate scalability bottlenecks and deal with distributed data. |
| Type | Text |
| Publisher | University of Utah |
| Subject | data synopsis; distributed; histogram; monitoring; tracking; uncertainty |
| Dissertation Institution | University of Utah |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | Copyright © Mingwang Tang 2015 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 27,170 bytes |
| Identifier | etd3/id/3829 |
| ARK | ark:/87278/s6768pm4 |
| DOI | https://doi.org/doi:10.26053/0H-7M8X-DYG0 |
| Setname | ir_etd |
| ID | 197380 |
| OCR Text | NEW PROBLEMS IN EXPLORING DISTRIBUTED DATA by Mingwang Tang A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computing School of Computing The University of Utah May 2015 Copyright © Mingwang Tang 2015 All Rights Reserved The University of Utah Graduate School STATEMENT OF DISSERTATION APPROVAL The dissertation of Mingwang Tang has been approved by the following supervisory committee members: Feifei Li, Chair 10/02/14 Date Approved Jeffrey M. Phillips, Member 10/02/14 Date Approved Suresh Venkatasubramanian, Member 10/02/14 Date Approved Mohamed F. Mokbel, Member 10/02/14 Date Approved Tingjian Ge, Member 10/02/14 Date Approved and by Ross Whitaker, Chair/Dean of the Department/College/School of Computing, and by David B. Kieda, Dean of The Graduate School. ABSTRACT In the era of big data, many applications generate continuous online data from distributed locations, scattered devices, etc. Examples include data from social media, financial services, and sensor networks. Meanwhile, large volumes of data can be archived or stored offline in distributed locations for further data analysis. Challenges from data uncertainty, large-scale data size, and distributed data sources motivate us to revisit several classic problems for both online and offline data exploration. The problem of continuous threshold monitoring for distributed data is commonly encountered in many real-world applications. We study this problem for distributed probabilistic data. We show how to prune expensive threshold queries using various tail bounds and combine tail-bound techniques with adaptive algorithms for monitoring distributed deterministic data. We also show how to approximate threshold queries based on sampling techniques.
Threshold monitoring problems can only tell whether a monitoring function is above or below a threshold constraint, but not how far away from it. This motivates us to study the problem of continuously tracking functions over distributed data. We first investigate the tracking problem on a chain topology. Then we show how to solve tracking problems in a distributed setting using solutions for the chain model. We study online tracking of the max function on "broom" tree and general tree topologies. Finally, we examine building scalable histograms for distributed probabilistic data. We show how to build approximate histograms based on a partition-and-merge principle on a centralized machine. Then, we show how to extend our solutions to distributed and parallel settings to further mitigate scalability bottlenecks and deal with distributed data.

I dedicate this thesis to my family, who gave me enduring and selfless love throughout my PhD study.

CONTENTS

ABSTRACT  iii
LIST OF FIGURES  viii
LIST OF TABLES  x
ACKNOWLEDGEMENTS  xi

CHAPTERS

1. INTRODUCTION  1
1.1 Motivation and Background  1
1.2 Dissertation Outline  4

2. THRESHOLD MONITORING FOR DISTRIBUTED PROBABILISTIC DATA  5
2.1 Introduction  5
2.2 Problem Formulation  7
2.3 Baseline Methods  8
2.3.1 Compute Pr[Y > γ] Exactly  8
2.3.2 Filtering by Markov Inequality  9
2.4 Improved Methods  10
2.4.1 Improved Bounds on Pr[Y > γ]  10
2.4.2 Improved Adaptive Threshold Monitoring  13
2.5 Sampling Methods to Estimate the Threshold  15
2.5.1 The Random Sampling Approach  16
2.5.2 Random Distributed ε-Sample  18
2.5.3 Deterministic Distributed ε-Sample  19
2.5.4 A Randomized Improvement  21
2.5.5 Practical Improvements  22
2.6 Extension  23
2.6.1 Weighted Constraint  23
2.6.2 Handling Multiple (γ, δ) Thresholds  23
2.7 Experiments  24
2.7.1 Datasets and Setup  24
2.7.2 Effect of γ  27
2.7.3 Effect of δ  28
2.7.4 Effect of g  29
2.7.5 Effect of n  30
2.7.6 Sampling Methods  31
2.7.7 Integrated Methods  32
2.8 Related Work  34
2.9 Conclusion  35

3. DISTRIBUTED ONLINE TRACKING  36
3.1 Introduction  36
3.1.1 Key Challenge  38
3.1.2 Our Contributions  39
3.2 Problem Formulation and Background  39
3.2.1 Performance Metric of an Online Algorithm  41
3.2.2 State-of-the-art Method  42
3.3 The Chain Case  44
3.3.1 Baseline Methods  44
3.3.2 Optimal Chain Online Tracking  46
3.4 The Broom Case  49
3.4.1 A Baseline Method  50
3.4.2 Improvement  50
3.4.3 The BROOMTRACK Method  55
3.5 The General Tree Case  57
3.6 Other Functions and Topologies  60
3.6.1 Other Functions for f  60
3.6.2 Other Topologies  62
3.7 Experiment  63
3.7.1 Datasets and Setup  63
3.7.2 Chain Model  65
3.7.3 Broom Model  65
3.7.4 General Tree Topology  68
3.7.5 Other Functions  70
3.8 Related Work  71
3.9 Conclusion  72

4. SCALABLE HISTOGRAMS ON LARGE PROBABILISTIC DATA  74
4.1 Introduction  74
4.1.1 Overview  75
4.2 Background and State of the Art  77
4.2.1 Uncertain Data Models  77
4.2.2 Histograms on Probabilistic Data  78
4.2.3 Efficient Computation of Bucket Error  80
4.3 Approximate Histograms  81
4.3.1 A Baseline Method  81
4.3.2 The PMERGE Method  82
4.3.2.1 Partition  82
4.3.2.2 Merge  82
4.3.2.3 Fast Computation of Bucket Error  83
4.3.2.4 Complexity Analysis  84
4.3.2.5 Approximation Quality  84
4.3.3 Recursive PMERGE  86
4.4 Distributed and Parallel PMERGE  89
4.4.1 The Partition Phase in the Value Model  89
4.4.2 The Partition Phase in the Tuple Model  91
4.4.3 Recursive PMERGE and Other Remarks  91
4.5 Parallel-PMERGE with Synopsis  91
4.5.1 Sampling Methods for the Value Model  92
4.5.1.1 The VS Method  92
4.5.2 Sketching Methods for the Tuple Model  93
4.5.2.1 The TS (Tuple Model Sketching) Method  93
4.6 Experiments  96
4.6.1 Datasets and Setup  96
4.6.2 Centralized Environment  97
4.6.2.1 Effect of m  97
4.6.2.2 Effect of n  98
4.6.2.3 Effect of B  98
4.6.2.4 Comparison with the Baseline  99
4.6.3 Distributed and Parallel Setting  100
4.6.3.1 Effect of Size of the Cluster  100
4.6.3.2 Scalability  102
4.6.4 Distributed and Parallel Synopsis  103
4.6.4.1 Comparing Effects of Synopsis in Both Models  103
4.7 Related Work  104
4.8 Conclusion  105

5. OTHER WORKS  106

6. CONCLUSION  108

REFERENCES  110

LIST OF FIGURES

1.1 Distributed streaming data in applications  1
1.2 Shipboard Automated Meteorological and Oceanographic System  2
2.1 (a) Attribute-level uncertain tuple model. (b) The flat model.  7
2.2 The Improved method  13
2.3 The Iadaptive method  15
2.4 The RS estimator  16
2.5 The MRS estimator  17
2.6 The RDεS method  18
2.7 The DDεS method  20
2.8 Distributions of E(Xi,t) for WD, WS, SS, and TEM, where i ∈ [1, g] and t ∈ [1, T]. (a) WD. (b) WS. (c) SS. (d) TEM.  25
2.9 Communication: vary γ. (a) Messages. (b) Bytes.  27
2.10 Response time: (a) vary γ. (b) vary δ.  28
2.11 Communication: vary δ. (a) Messages. (b) Bytes.  28
2.12 Communication: vary g. (a) Messages. (b) Bytes.  29
2.13 Response time: (a) vary g. (b) vary n.  29
2.14 Communication: vary n. (a) Messages. (b) Bytes.  30
2.15 Performance of the sampling methods: vary sample size per client. (a) Precision. (b) Recall. (c) Communication: bytes. (d) Response time.  32
2.16 Performance of the sampling methods: vary datasets. (a) Precision. (b) Recall. (c) Communication: bytes. (d) Response time.  33
2.17 Performance of all methods: vary datasets. (a) Communication: messages. (b) Communication: bytes. (c) Response time. (d) Precision and recall.  34
3.1 Track f(t) = f(f1(t), f2(t), …, fm(t)). (a) broom model. (b) general-tree.  40
3.2 Special cases: g(t) ∈ [f(t) − Δ, f(t) + Δ]. (a) chain topology. (b) centralized setting [87, 89].  40
3.3 Tree online tracking. (a) simple tree. (b) general tree.  57
3.4 Other topologies. (a) observer at relay node. (b) graph topology.  63
3.5 f1(t) for TEMP and WD, for t ∈ [1, 1000]. (a) TEMP. (b) WD.  64
3.6 Performance of chain tracking methods on TEMP. (a) vary Δ. (b) vary h. (c) vary N. (d) cost(method)/cost(offline).  66
3.7 Performance of broom tracking methods on TEMP. (a) vary m. (b) vary Δ. (c) vary h. (d) vary N.  67
3.8 General-tree: vary p. (a) TEMP. (b) WD.  68
3.9 General-tree: vary Δ. (a) TEMP. (b) WD.  69
3.10 General-tree: vary H. (a) TEMP. (b) WD.  69
3.11 General-tree: vary F. (a) TEMP. (b) WD.  70
3.12 Track sum on broom and general tree. (a) Broom. (b) General-tree.  71
3.13 Track median on broom and general tree. (a) Broom. (b) General-tree.  71
4.1 An example of PMERGE: n = 16, m = 4, B = 2.  84
4.2 Binary decomposition and local Q-AMS. (a) binary decomposition. (b) local Q-AMS.  94
4.3 Vary m on the tuple model. (a) m vs running time. (b) m vs approximation ratio.  98
4.4 Approximation ratio and running time: vary n. (a) Tuple model: running time. (b) Value model: running time. (c) Tuple model: approximation ratio. (d) Value model: approximation ratio.  99
4.5 Approximation ratio and running time: vary B. (a) Tuple model: running time. (b) Value model: running time. (c) Tuple model: approximation ratio. (d) Value model: approximation ratio.  100
4.6 Comparison against the baseline method. (a) Running time: WorldCup. (b) Approximation ratio: WorldCup. (c) Running time: SAMOS. (d) Approximation ratio: SAMOS.  101
4.7 Time: vary number of slave nodes. (a) Tuple model. (b) Value model.  101
4.8 Scalability of the parallel approximate methods. (a) Tuple model: vary n. (b) Value model: vary n. (c) Tuple model: vary B. (d) Value model: vary B.  102
4.9 Effects of using synopsis. (a) Communication. (b) Approximation ratio.  104

LIST OF TABLES

3.1 Input instance I1 and behavior of A.  53
3.2 A′ on input instance I1.  54
3.3 A on input instance I2.  54
3.4 A″ on input instance I2.  55
4.1 Example for tuple model  78
4.2 Example for value model  78

ACKNOWLEDGEMENTS

More than five years have passed since the first day I began my PhD study in the US. I am very lucky to have experienced two academic styles, at Florida State University (FSU), where I spent my first two years of PhD study, and at the University of Utah, where my PhD journey ends.
I would like to first thank the CS faculty at FSU, who helped me build a solid knowledge of the core courses in computer science. I would like to express my gratitude to my advisor Feifei Li, who gave me great guidance in my research. From our academic discussions, I have learned so much about doing research, for instance, how to mathematically formalize a problem definition and how to approach a problem from different views. He is a very forthright person and will point out my shortcomings to help me realize where I need to improve. I would never have been able to reach the close of this journey without his guidance, advice, and even criticism. I would like to thank Feifei for being my advisor and supporting my PhD studies. Also, I would like to thank Prof. Jeff Phillips, who broadened my views and research skills through our collaborations. I thank Prof. Suresh Venkatasubramanian for his insightful comments on my presentations in our data group meetings. I also want to thank Prof. Mohamed F. Mokbel and Prof. Tingjian Ge for serving on my supervisory committee. Finally, I thank NSF for funding my PhD research projects. CHAPTER 1 INTRODUCTION The goal of this dissertation research is to design, implement, and evaluate novel data exploration techniques on distributed data to support scalable data analytics and decision-making systems. To that end, we have studied three closely related problems in this thesis: (I) how to continuously monitor distributed probabilistic data against a probabilistic threshold; (II) how to continuously track function changes over distributed data; (III) how to build scalable histograms on large probabilistic data. 1.1 Motivation and Background Data exploration tasks aim to retrieve relevant information from a large dataset, which becomes more challenging in the era of big data. In many applications, large volumes of data are generated continuously from scattered devices or distributed sources.
Examples include data collected from network intrusion detection systems, measurements from large sensor networks, and locations and application data from location-based services, as shown in Figure 1.1. Therefore, exploring distributed data in an online fashion becomes critical for further data analytics and decision-making applications. A natural example is continuous distributed monitoring, where a function of distributed observations is continuously monitored against a user-specified threshold and an alarm is generated when the threshold constraint is violated. [Figure 1.1: Distributed streaming data in applications] In the distributed setting, it is always desirable, and sometimes even critical, to reduce the communication cost, for a number of reasons [3, 15-17, 26, 55, 56, 63, 86]. For instance, in sensor networks, transmitting messages consumes valuable on-board battery resources of sensor nodes, which in effect determines the lifetime of these networks. From the network infrastructure's point of view (e.g., an ISP's), transmitting the monitoring data to perform the distributed computation is impractical, as it would seriously affect network bandwidth. An emerging challenge for monitoring distributed data is uncertainty, which is inherently introduced when massive amounts of data are generated at distributed sources. For instance, measurements in a sensor network are inherently noisy; data integration systems produce fuzzy matches. A motivating example in our thesis is the Shipboard Automated Meteorological and Oceanographic System (SAMOS) project [70].
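To make the communication-cost concern concrete, here is a minimal sketch (not the dissertation's protocol; all names such as `Client`, `Coordinator`, and `slack` are illustrative assumptions) of a continuous distributed monitor in which each client reports a reading only when it drifts beyond a local slack, rather than at every time step:

```python
# Hypothetical sketch: clients report to a coordinator only when their local
# reading drifts by more than a per-client slack, instead of every time step.

class Client:
    def __init__(self, slack):
        self.slack = slack
        self.last_sent = None

    def observe(self, value):
        """Return the value to forward, or None to stay silent."""
        if self.last_sent is None or abs(value - self.last_sent) > self.slack:
            self.last_sent = value
            return value
        return None

class Coordinator:
    def __init__(self, threshold, num_clients):
        self.threshold = threshold
        self.latest = [0.0] * num_clients
        self.messages = 0

    def receive(self, client_id, value):
        self.messages += 1
        self.latest[client_id] = value

    def alarm(self):
        # Alarm when the average of last-known readings exceeds the threshold.
        return sum(self.latest) / len(self.latest) > self.threshold

coord = Coordinator(threshold=80.0, num_clients=2)
clients = [Client(slack=1.0) for _ in range(2)]
stream = [(0, 79.0), (1, 79.5), (0, 79.2), (1, 82.0)]  # (client_id, reading)
for cid, reading in stream:
    v = clients[cid].observe(reading)
    if v is not None:
        coord.receive(cid, v)
print(coord.messages, coord.alarm())
```

Here four observations trigger only three messages, and the coordinator still detects the average crossing 80; larger slacks trade accuracy for fewer messages, which is the tension the later chapters address rigorously.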
We have observed that in SAMOS: 1) meteorological data are observed from research vessels, ships, and towers, which are naturally distributed; 2) ambiguity, errors, imprecise readings, and uncertainty are present in the real-time data collected, due to hazardous conditions, coarse real-time measurement, and multiple readings for the same observation; 3) large amounts of data (e.g., wind speed, temperature, humidity, etc.) need to be processed in less than a minute, continuously. It is useful to represent data in SAMOS (as well as in the other applications previously discussed) using distributed probabilistic data. For instance, a common practice in SAMOS is for each ship/tower to buffer data for each interval (e.g., 5 minutes) and send one representative for the data in an interval. Clearly, modeling the data in a given interval using probabilistic data, such as a probability distribution function (pdf), is a viable and attractive solution, especially when we want to also account for the presence of uncertainty and errors in the raw readings. [Figure 1.2: Shipboard Automated Meteorological and Oceanographic System.] Existing techniques of threshold monitoring on deterministic distributed data [18, 48, 52, 74] cannot be directly applied to distributed probabilistic data as in the SAMOS system. This motivates us to investigate a novel threshold monitoring problem on distributed probabilistic data. We extend threshold queries to probabilistic data by considering a probabilistic-threshold semantics [10, 22, 77]. The goal of this part of our dissertation is to efficiently explore distributed data with uncertainty and produce an alarm, with high probability, when a true threshold crossing has happened. One design principle we have applied to tackle the computation and communication costs on distributed probabilistic data is utilizing tail-bound filters in the monitoring instance.
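One simple instance of such a tail-bound filter is Markov's inequality: for a nonnegative random variable Y, Pr[Y > gamma] <= E[Y] / gamma, so the server can sometimes rule out an alarm from the clients' expectations alone, without shipping full pdfs. The sketch below is illustrative only (the protocol and names are assumptions, not the dissertation's algorithm):

```python
# Sketch of a Markov-inequality filter: since Pr[Y > gamma] <= E[Y] / gamma
# for nonnegative Y = X1 + ... + Xg, the server can certify "no alarm"
# whenever E[Y] / gamma <= delta, using only each client's expectation.

def expectation(pdf):
    """E[X] for a discrete pdf given as (value, probability) pairs."""
    return sum(v * p for v, p in pdf)

def markov_filter(client_pdfs, gamma, delta):
    """Return 'pruned' if the Markov bound certifies Pr[Y > gamma] <= delta,
    else 'undecided' (a more expensive exact or sampled check would follow)."""
    ey = sum(expectation(pdf) for pdf in client_pdfs)
    return "pruned" if ey / gamma <= delta else "undecided"

pdfs = [[(10.0, 0.5), (20.0, 0.5)], [(5.0, 1.0)]]   # two clients, E[Y] = 20
print(markov_filter(pdfs, gamma=100.0, delta=0.3))   # 20/100 <= 0.3
print(markov_filter(pdfs, gamma=40.0, delta=0.3))    # 20/40 = 0.5 > 0.3
```

When the bound fires, only g expectations cross the network instead of g pdfs; when it is undecided, the heavier machinery of the later sections takes over.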
Then, techniques of threshold monitoring for distributed deterministic data can be used to further reduce communication cost. Online threshold monitoring can only tell whether the observed function of distributed data is above or below a threshold, but not how far it is from the threshold. It is useful to continuously track various functions of distributed data. For instance, SAMOS users may be interested in continuously tracking the maximum of the current temperature readings for a number of ships in a nearby region. Similar examples can easily be found in location-based services. Recent studies [87, 89] focused on the online tracking problem with only one observer and one tracker, which has limited applicability to multiple observers in distributed locations. To this end, we have studied the problem of tracking functions of distributed data (each data source can be described by an arbitrary function) continuously, in an online fashion. To achieve 100% accuracy for continuous online tracking of arbitrary functions, any change of an observed function would lead to a message to the tracker, which simply generates excessive communication. Similar to the strategies in [87, 89], we relax the requirement by allowing a maximum error of Δ for the function under tracking by the tracker. We first investigated the tracking problem on a chain topology, where the observer connects to the tracker through multiple relay nodes. Following that, we studied the tracking problem where distributed observers connect to the tracker through a "broom" tree model and a general-tree topology. At the back end, distributed data often converge in a data warehouse, where in-depth data exploration and analysis are possible. The challenge is the large data size and the uncertainty caused by data integration or distributed data sources, as we see in the SAMOS data. Data summary techniques, e.g., histograms, provide an effective venue for exploring large probabilistic data.
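As a toy illustration of the partition-and-merge principle that the abstract mentions for histogram construction: split the sequence into m partitions, summarize each, then greedily merge adjacent buckets until only B remain. The error metric (sum of squared deviations from the bucket mean) and all names here are assumptions for illustration, not the dissertation's exact algorithm:

```python
# Toy partition-and-merge: partition the data, then repeatedly merge the
# adjacent bucket pair whose merge increases total error the least.

def sse(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def partition_and_merge(data, m, B):
    step = len(data) // m
    buckets = [data[i:i + step] for i in range(0, len(data), step)]
    while len(buckets) > B:
        # Adjacent pair whose merge increases total error the least.
        i = min(range(len(buckets) - 1),
                key=lambda j: sse(buckets[j] + buckets[j + 1])
                              - sse(buckets[j]) - sse(buckets[j + 1]))
        buckets[i:i + 2] = [buckets[i] + buckets[i + 1]]
    return [(len(b), sum(b) / len(b)) for b in buckets]  # (width, mean)

hist = partition_and_merge([1, 1, 2, 2, 9, 9, 10, 10], m=4, B=2)
print(hist)
```

The appeal of the scheme is that the m partitions can be summarized independently, which is what makes distributed and parallel extensions natural.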
Recently, histograms on probabilistic data have been proposed to work with probabilistic datasets [12-14]. However, existing studies suffer from limited scalability and do not adapt to large-scale data sizes and distributed data sources. This motivates us to study scalable histograms on large probabilistic data. We address this problem by building scalable histograms on large probabilistic data using a partition-and-merge principle. We also extend our solutions to distributed and parallel settings to mitigate scalability bottlenecks and deal with distributed data. 1.2 Dissertation Outline The rest of this dissertation is organized as follows. In Chapter 2, we study the problem of threshold monitoring for distributed probabilistic data. One outstanding challenge is that answering such queries on probabilistic data is #P-complete. We first leverage tail-bound techniques to help assert threshold violations and combine them with techniques for threshold monitoring on distributed deterministic data to further reduce communication cost. When the tail bounds fail to make a decision, we propose effective sampling methods to approximate threshold queries. In Chapter 3, we study the problem of distributed online tracking. We first investigate optimal tracking on chain topologies, which gives us more insight for distributed online tracking on a general-tree topology. With large probabilistic data, building a succinct synopsis, e.g., histograms, becomes meaningful for data exploration and data management tasks. We study scalable histogram construction on probabilistic data in Chapter 4, in an effort to support large-scale datasets and distributed and parallel processing on large probabilistic data. Finally, we conclude and discuss some open problems in Chapter 6. CHAPTER 2 THRESHOLD MONITORING FOR DISTRIBUTED PROBABILISTIC DATA 2.1 Introduction When massive amounts of data are generated, uncertainty is inherently introduced at the same time.
For instance, data integration produces fuzzy matches [28, 77]; in measurements, e.g., sensor readings, data are inherently noisy and are better represented by a probability distribution than by a single deterministic value [10, 25, 37, 77]. In many of these applications, data are generated at multiple sources and collected from distributed, networked locations. Examples include distributed observing stations, large sensor fields, geographically separate scientific institutes/units, and many more [54, 83]. A concrete example is the SAMOS project, as mentioned in Chapter 1. Meanwhile, as numerous studies in managing and exploring distributed data have shown, a primary concern is monitoring the distributed data and generating an alarm when a user-specified constraint is violated. A particularly useful instance is the threshold-based constraint, which we refer to as the distributed threshold monitoring (DTM) problem and which has been extensively studied for distributed deterministic data [18, 48, 52, 74]. An application scenario is shown in Example 1. Example 1. Suppose each distributed site continuously captures temperature readings (one per system-defined time instance); the goal is to monitor them continuously and raise an alarm at the coordinator site whenever the average temperature from all sites exceeds 80 degrees at any time instance. Similar applications arise in exploring distributed probabilistic data. Distributed threshold monitoring on probabilistic data is a critical problem, for example in the SAMOS system. The most natural and popular way of extending threshold queries to probabilistic data is the probabilistic-threshold semantics [10, 22, 77], which introduces another threshold on the probability of the query answer, in addition to the threshold on the score value of the results. Consider the following example that extends Example 1: Example 2.
Suppose the readings at each site are now represented as probabilistic data (e.g., as we have just discussed for data in SAMOS); the goal is to monitor these readings continuously and raise an alarm at the coordinator site whenever the probability of the average temperature from all sites exceeding 80 degrees is above 70% at any time instance.

We refer to this as the distributed probabilistic threshold monitoring (DPTM) problem. This variant is a robust alternative to DTM, much as the median is a robust alternative to the mean: even if all sites report low-probability noise that skews their distributions, DPTM will only raise an alarm if a true threshold has been crossed, or if what may have been noise occurs with high enough probability that it cannot be ignored. As with its counterpart, the DTM problem, a paramount concern is to reduce the communication cost, measured by both the total number of messages and bytes communicated in the system. For example, on SAMOS, cutting down the communication cost would allow for the transmission of more accurate or diverse measurements. Due to the inherent difference in query processing between probabilistic and deterministic data, techniques developed for DTM are no longer directly applicable. This also brings up another challenge in DPTM: reducing the CPU cost, since query processing on probabilistic data is often computation-intensive, which is even worse for distributed probabilistic data [54]. We step up to these challenges and present a comprehensive study of the DPTM problem. Specifically: We formalize the DPTM problem in Section 2.2. We propose baseline methods in Section 2.3, which improve over the naive method of sending all tuples at each time instance. We design two efficient and effective monitoring methods in Section 2.4 that leverage moment generating functions and adaptive filters to significantly reduce the costs.
When an exact solution is not absolutely necessary, we introduce novel sampling-based methods in Section 2.5 to further reduce the communication and CPU costs. We extensively evaluate all proposed methods in Section 2.7 on large real data obtained from research vessels in the SAMOS project. The results show that our monitoring methods significantly outperform the baseline approach. They also indicate that our sampling method is very effective when it is acceptable to occasionally miss one or two alarms with very small probability. We discuss some useful extensions in Section 2.6, survey the related work in Section 2.8, and conclude in Section 2.9.

2.2 Problem Formulation

Sarma et al. [71] describe various models of uncertainty. We consider the attribute-level uncertain tuple model that has been used frequently in the literature and suits the applications for our problem well (e.g., data in SAMOS). Each tuple has one or more uncertain attributes. Every uncertain attribute has a pdf for its value distribution. Correlation among attributes in one tuple can be represented by a joint pdf. This model has many practical applications and is most suitable for measurements and readings [25, 50]. Without loss of generality, we assume that each tuple has only one uncertain attribute, score. Let Xi be the random variable for the score of tuple di, where Xi can have either a discrete or a continuous pdf with bounded size (see Figure 2.1(a)). Since each pdf is bounded, we assume that |Xi| ≤ n for all Xi's, for some value n, where |Xi| is the size of the pdf of Xi: the number of discrete values Xi may take for a discrete pdf, or the number of parameters describing Xi and its domain for a continuous pdf. Given g distributed clients {c1, ..., cg} and a centralized server H, we consider the flat model for the organization of distributed sites, as shown in Figure 2.1(b); e.g., SAMOS uses the flat model.
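The attribute-level model above can be made concrete with a small sketch. The class and method names below (DiscretePdf, mean, var, mgf) are ours, purely for illustration; they are not part of the dissertation's implementation. The mgf method computes the quantity Mi(λ) = Σ_{x ∈ Xi} e^{λx} Pr[Xi = x] that the Improved method of Section 2.4 requests from each client.

```python
import math

class DiscretePdf:
    """A discrete pdf for one uncertain 'score' attribute:
    values[j] occurs with probability probs[j] (probs sum to 1)."""
    def __init__(self, values, probs):
        assert abs(sum(probs) - 1.0) < 1e-9
        self.values, self.probs = values, probs

    def mean(self):                     # E(Xi)
        return sum(v * p for v, p in zip(self.values, self.probs))

    def var(self):                      # Var(Xi)
        m = self.mean()
        return sum(p * (v - m) ** 2 for v, p in zip(self.values, self.probs))

    def mgf(self, lam):                 # Mi(lam) = sum_x e^{lam*x} Pr[Xi = x]
        return sum(p * math.exp(lam * v) for v, p in zip(self.values, self.probs))
```

For example, a score that is 0 or 1 with equal probability has mean 0.5, variance 0.25, and Mi(λ) = (1 + e^λ)/2.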
At each time instance t, for t = 1, ..., T, client ci reports a tuple di,t with a score Xi,t. We assume that data from different sites are independent; similar assumptions were made in most distributed data monitoring or ranking studies [18, 39, 48, 52, 54, 62, 74].

[Figure 2.1: (a) Attribute-level uncertain tuple model. (b) The flat model.]

Without loss of generality, and for ease of explanation, we assume that Xi,t ∈ R+. Our techniques can be easily extended to handle the case when Xi,t may take negative values as well.

Definition 1 (DPTM). Given γ ∈ R+ and δ ∈ [0, 1), let Yt = Σ_{i=1}^g Xi,t, for t = 1, ..., T. The goal is to raise an alarm at H whenever, for any t ∈ [1, T], Pr[Yt > γ] > δ.

In our definition, DPTM monitors a sum constraint. Monitoring an average constraint is equivalent to this case, as is any other type of constraint that can be expressed as a linear combination of one or more sum constraints. As argued in Section 2.1, the goal is to minimize both the overall communication and computation costs at the end of all time instances. We measure the communication cost using both the total number of bytes transmitted and the total number of messages sent. Lastly, when the context is clear, we omit the subscript t from Yt and Xi,t.

2.3 Baseline Methods

At any time instance t, let X1, ..., Xg be the scores from c1 to cg and Y = Σ_{i=1}^g Xi. To monitor whether Pr[Y > γ] > δ, the naive method is to ask each client ci to send its score Xi to H, which is clearly very expensive.

2.3.1 Compute Pr[Y > γ] Exactly

The first challenge is to compute Pr[Y > γ] exactly at H. We differentiate two cases. When each Xi is represented by a discrete pdf, clearly, we can compute Y1,2 = X1 + X2 in O(n²) time by a nested loop over the possible values they may take. Next, we can compute Y1,2,3 = X1 + X2 + X3 = Y1,2 + X3 using Y1,2 and X3 in O(n³) time, since in the worst case the size of Y1,2 is O(n²).
We can recursively apply this idea to compute Y = Y1,...,g in O(n^g) time, and then check Pr[Y > γ] exactly. Note that in this approach, since we did not sort the values in the output (to reduce the cost), the discrete values in the output pdf of each step are no longer sorted. A better idea is to compute Y1,...,g/2 and Yg/2+1,...,g separately, which only takes O(n^{g/2}) time. Then, by using the cdf (cumulative distribution function) of Yg/2+1,...,g, we can compute Pr[Y > γ] as follows:

Pr[Y > γ] = Σ_{y ∈ Y1,...,g/2} Pr[Y1,...,g/2 = y] · Pr[Yg/2+1,...,g > γ − y].

Computing the cdf of Yg/2+1,...,g takes O(n^{g/2} log n^{g/2}) time in the worst case: as discussed above, the discrete values in Yg/2+1,...,g are not sorted. After that, finding Pr[Yg/2+1,...,g > γ − y] for any y takes only constant time. Hence, the summation step takes only O(n^{g/2}) time (the size of Y1,...,g/2 in the worst case), and the overall cost of computing Pr[Y > γ] exactly at H becomes O(n^{g/2} log n^{g/2}).

When some Xi's are represented by continuous pdfs, the above process no longer works. In this case, we leverage the characteristic functions of the Xi's to compute Y exactly. The characteristic function [6] of a random variable X is:

φX(θ) = E(e^{iθX}) = ∫_{−∞}^{+∞} e^{iθx} fX(x) dx, ∀θ ∈ R,

where i is the imaginary unit and fX(x) is the pdf of X. Let φi(θ) and φ(θ) be the characteristic functions of Xi and Y, respectively; a well-known result is that φ(θ) = Π_{i=1}^g φi(θ) [6]. Furthermore, by definition, φi(θ) and φ(θ) are the Fourier transforms of the pdfs of Xi and Y, respectively. Hence, an immediate algorithm for computing the pdf of Y is to compute the Fourier transforms of the pdfs of the Xi's, multiply them together to get φ(θ), and then apply an inverse Fourier transform to φ(θ) to obtain the pdf of Y. After that, we can easily check whether Pr[Y > γ] > δ. The cost of this algorithm depends on the cost of each Fourier transform, which in turn depends on the types of distributions being processed.
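Returning to the discrete case, the divide-in-half computation can be sketched as follows; this is a toy illustration with dict-based pdfs (the function names are ours), where a suffix-sum array over the sorted support plays the role of the cdf:

```python
import bisect
from functools import reduce

def convolve(px, py):
    """Convolve two discrete pdfs given as {value: prob} dicts."""
    out = {}
    for x, p in px.items():
        for y, q in py.items():
            out[x + y] = out.get(x + y, 0.0) + p * q
    return out

def pr_sum_exceeds(pdfs, gamma):
    """Exact Pr[X1 + ... + Xg > gamma], splitting the g clients into halves."""
    half = len(pdfs) // 2                 # assumes g >= 2 for this sketch
    left = reduce(convolve, pdfs[:half])
    right = reduce(convolve, pdfs[half:])
    # Sort right's support once and precompute suffix sums, so that
    # Pr[right > z] is a constant-time lookup afterwards.
    vals = sorted(right)
    suffix = [0.0] * (len(vals) + 1)
    for i in range(len(vals) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + right[vals[i]]
    total = 0.0
    for y, p in left.items():
        j = bisect.bisect_right(vals, gamma - y)   # first value > gamma - y
        total += p * suffix[j]                     # p * Pr[right > gamma - y]
    return total
```

With four clients whose scores are each 0 or 1 with probability 1/2, Pr[Y > 3] = Pr[Y = 4] = 1/16.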
Note that using this approach when all pdfs are discrete does not result in less running time than the method above: since the size of Y (the pdf describing Y) is O(n^g) in the worst case, this algorithm takes at least O(n^g) time in the worst case, even though we can leverage the fast Fourier transform in this situation. We denote the above algorithms as EXACTD and EXACTC, for the discrete and continuous cases, respectively.

2.3.2 Filtering by Markov Inequality

By the Markov inequality, we have Pr[Y > γ] ≤ E(Y)/γ. Given that E(Y) = Σ_{i=1}^g E(Xi), if each client ci only sends E(Xi), H can check whether E(Y) ≤ γδ; if so, no alarm should be raised for sure; otherwise, we can then ask for the Xi's and apply the exact algorithm. We dub this approach the Markov method.

We can improve this further. Since E(Y) = Σ_{i=1}^g E(Xi) and, by the Markov inequality, our goal is to monitor whether E(Y) ≤ γδ, we can leverage the adaptive thresholds algorithm for the DTM problem on deterministic data [48] to monitor whether Σ_{i=1}^g E(Xi) ≤ γδ continuously, which installs local filters at the clients and adaptively adjusts them. Specifically, γδ is treated as the global constraint; at each time instance, client ci can compute E(Xi) locally, which becomes a "deterministic score". Thus, the algorithm from [48] is applicable. Whenever it cannot assert an alarm at a time instance t, the clients transmit their Xi's to H and the server applies the exact algorithm (only for that instance). This helps reduce the communication cost, and we dub this improvement the Madaptive method.

2.4 Improved Methods

We now improve on these baseline techniques. We first replace the Markov inequality with the more involved, but more accurate, Chebyshev and Chernoff bounds (Improved). Then, we redesign Improved to leverage adaptive monitoring techniques designed for DTM (Iadaptive).
2.4.1 Improved Bounds on Pr[Y > γ]

We first leverage the general Chebyshev bound, Pr[|Y − E(Y)| ≥ a·√Var(Y)] ≤ 1/a² for any a ∈ R+, which gives us the following one-sided forms:

Pr[Y ≥ E(Y) + a] ≤ Var(Y)/(Var(Y) + a²), ∀a ∈ R+; (2.1)
Pr[Y ≤ E(Y) − a] ≤ Var(Y)/(Var(Y) + a²), ∀a ∈ R+. (2.2)

When γ > E(Y), setting a = γ − E(Y) in (2.1) leads to:

Pr[Y > γ] ≤ Pr[Y ≥ γ] ≤ Var(Y)/(Var(Y) + (γ − E(Y))²). (2.3)

As such, when γ > E(Y), if Var(Y)/(Var(Y) + (γ − E(Y))²) ≤ δ, we definitely do not have to raise an alarm. When γ < E(Y), we can set a = E(Y) − γ in (2.2) to get:

Pr[Y ≤ γ] ≤ Var(Y)/(Var(Y) + (E(Y) − γ)²). (2.4)

This implies that:

Pr[Y > γ] = 1 − Pr[Y ≤ γ] ≥ 1 − Var(Y)/(Var(Y) + (E(Y) − γ)²). (2.5)

Hence, when γ < E(Y), as long as 1 − Var(Y)/(Var(Y) + (E(Y) − γ)²) ≥ δ, we should surely raise an alarm.

Given these observations, at each time instance the clients send their E(Xi)'s and Var(Xi)'s to H, which computes E(Y) and Var(Y) locally (given that the Xi's are independent of each other, Var(Y) = Σ_{i=1}^g Var(Xi)). Depending on whether E(Y) < γ or E(Y) > γ, H uses (2.3) or (2.5) to decide whether or not to raise an alarm for this time instance. Nevertheless, this approach may still incur expensive communication and computation when E(Y) = γ, or when (2.3) ((2.5), resp.) does not hold with E(Y) < γ (E(Y) > γ, resp.). It is also limited in that H can only check either to raise an alarm or not to raise an alarm, but not both simultaneously, as E(Y) > γ and E(Y) < γ cannot hold at the same time.

We remedy these problems using the general Chernoff bound and the moment generating function [6]. For any random variable Y, suppose its moment generating function is given by M(λ) = E(e^{λY}) for any λ ∈ R; then:

Pr[Y ≥ a] ≤ e^{−λa} M(λ) for all λ > 0, ∀a ∈ R; (2.6)
Pr[Y ≤ a] ≤ e^{−λa} M(λ) for all λ < 0, ∀a ∈ R. (2.7)

Here, a can be any real value (positive or negative). Suppose the moment generating functions of Xi and Y are Mi(λ) and M(λ), respectively; then M(λ) = Π_{i=1}^g Mi(λ) [6].
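A sketch of how the server could combine these two families of checks (the helper names are ours; the Chernoff step takes the clients' reported Mi(λ) values as input, and an 'undecided' outcome falls through to the exact algorithms):

```python
import math

def cantelli_decide(ey, vary, gamma, delta):
    """One-sided Chebyshev (Cantelli) checks (2.3) and (2.5):
    returns 'no_alarm', 'alarm', or 'undecided'."""
    if gamma > ey and vary / (vary + (gamma - ey) ** 2) <= delta:
        return 'no_alarm'   # upper bound on Pr[Y > gamma] is already <= delta
    if gamma < ey and 1 - vary / (vary + (ey - gamma) ** 2) >= delta:
        return 'alarm'      # lower bound on Pr[Y > gamma] is already >= delta
    return 'undecided'

def chernoff_decide(mi_at_lam1, mi_at_lam2, lam1, lam2, gamma, delta):
    """Chernoff checks (2.8)/(2.9): mi_at_lam* hold the clients' Mi(lam)
    values, multiplied into M(lam) = prod_i Mi(lam); lam1 > 0, lam2 < 0."""
    m1 = math.prod(mi_at_lam1)
    m2 = math.prod(mi_at_lam2)
    if math.exp(-lam1 * gamma) * m1 <= delta:
        return 'no_alarm'
    if 1 - math.exp(-lam2 * gamma) * m2 > delta:
        return 'alarm'
    return 'undecided'      # fall back to EXACTD / EXACTC
```

For instance, with E(Y) = 10 and Var(Y) = 1, the Cantelli check certifies "no alarm" for γ = 20 and "alarm" for γ = 5 at δ = 0.5, without any pdf being shipped.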
Hence, when the check based on either (2.3) or (2.5) has failed, for any λ1 > 0 and λ2 < 0, the server requests each ci to calculate and send back Mi(λ1) and Mi(λ2). It computes M(λ1) and M(λ2) and, by setting a = γ in (2.6) and (2.7), checks whether:

Pr[Y > γ] ≤ Pr[Y ≥ γ] ≤ e^{−λ1γ} M(λ1) ≤ δ, and (2.8)
Pr[Y > γ] = 1 − Pr[Y ≤ γ] ≥ 1 − e^{−λ2γ} M(λ2) > δ. (2.9)

When (2.8) holds, it does not raise an alarm; when (2.9) holds, it raises an alarm; only when both have failed does it request the Xi's for the exact computation.

Calculating Mi(λ) at a client ci is easy. For many parametric continuous pdfs, closed-form expressions exist for their moment generating functions; for other continuous pdfs, one can use numeric methods to compute Mi(λ) to arbitrary precision. For discrete pdfs, Mi(λ) = Σ_{x ∈ Xi} e^{λx} Pr[Xi = x].

Another key issue is to figure out the optimal values for λ1 and λ2 in (2.8) and (2.9) to make the corresponding bounds as tight as possible, i.e., to minimize e^{−λ1γ}M(λ1) and e^{−λ2γ}M(λ2) in (2.8) and (2.9), respectively. The central limit theorem states that the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed [6]. For a normal distribution with mean μ and variance σ², the moment generating function is e^{μλ + σ²λ²/2} for any λ ∈ R. Hence, let Y′ = Y/g; then Y′ can be approximated well by a normal distribution, and we can approximate its moment generating function as:

M_{Y′}(λ) ≈ e^{λE(Y′) + λ²Var(Y′)/2}, ∀λ ∈ R. (2.10)

Note that Y = gY′; (2.8) and (2.10) imply that for any λ1 > 0:

Pr[Y ≥ γ] = Pr[Y′ ≥ γ/g] ≤ e^{−λ1γ/g} M_{Y′}(λ1)
 ≈ e^{−λ1γ/g} · e^{λ1E(Y)/g + λ1²Var(Y)/(2g²)}   (by (2.10))
 = e^{(λ1/g)(E(Y) − γ) + (λ1²/(2g²))Var(Y)}. (2.11)

Hence, we can approximate the optimal λ1 value for (2.8) by finding the λ1 value that minimizes the RHS of (2.11). Let f(λ1) be the RHS of (2.11) and take its derivative w.r.t.
λ1:

f′(λ1) = e^{(λ1/g)(E(Y) − γ) + (λ1²/(2g²))Var(Y)} · ((E(Y) − γ)/g + (Var(Y)/g²)·λ1).

Setting f′(λ1) = 0, we get λ1 = g(γ − E(Y))/Var(Y). Furthermore, we can show that the second-order derivative of f(λ1), f″(λ1), is always greater than 0 (we omit the details for brevity). That is, f(λ1) (hence the RHS of (2.11)) takes its minimal value when λ1 = g(γ − E(Y))/Var(Y). Using a similar analysis, we can derive the optimal λ2 value. However, a constraint is that λ1 > 0 and λ2 < 0. Given that, and observing that f(λ1) (resp. the corresponding function for λ2) is monotonically increasing when λ1 > g(γ − E(Y))/Var(Y) (resp. λ2 < g(γ − E(Y))/Var(Y)), let β > 0 be some small value; then setting

λ1 = g(γ − E(Y))/Var(Y), λ2 = −β,  if γ > Σ_{i=1}^g E(Xi);
λ1 = β, λ2 = g(γ − E(Y))/Var(Y),  if γ < Σ_{i=1}^g E(Xi);
λ1 = β, λ2 = −β,  otherwise; (2.12)

will help achieve tight bounds in (2.8) and (2.9). This yields the Improved method, shown in Figure 2.2.

2.4.2 Improved Adaptive Threshold Monitoring

The Improved method needs at least g messages per time instance; to reduce this, we again leverage the adaptive thresholds algorithm developed for DTM [48]. Consider (2.8) and (2.9): if we can continuously monitor whether

e^{−λ1γ} Π_{i=1}^g Mi(λ1) ≤ δ, or 1 − e^{−λ2γ} Π_{i=1}^g Mi(λ2) > δ (2.13)

efficiently, then whenever the first inequality in (2.13) holds at a time instance t, H knows for sure that Pr[Y > γ] ≤ δ at t and no alarm should be raised at this time instance; whenever

Algorithm Improved(c1, ..., cg, H)
1. for t = 1, ..., T
2.   let Xi = Xi,t and Y = Yt = Σ_{i=1}^g Xi;
3.   each ci computes E(Xi) and Var(Xi) locally, and sends them to H;
4.   H sets E(Y) = Σ E(Xi), Var(Y) = Σ Var(Xi);
5.   if (γ > E(Y) and Var(Y)/(Var(Y) + (γ − E(Y))²) ≤ δ)
6.     raise no alarm; continue to next time instance;
7.   if (γ < E(Y) and 1 − Var(Y)/(Var(Y) + (E(Y) − γ)²) ≥ δ)
8.     raise an alarm; continue to next time instance;
9.   H sets λ1 and λ2 according to (2.12);
10.  H broadcasts λ1, λ2 to all clients, and asks them to compute and send back the Mi(λ1)'s and Mi(λ2)'s;
11.
H sets M(λ1) = Π_i Mi(λ1), M(λ2) = Π_i Mi(λ2);
12.  if (e^{−λ1γ} M(λ1) ≤ δ)
13.    raise no alarm; continue to next time instance;
14.  if (1 − e^{−λ2γ} M(λ2) > δ)
15.    raise an alarm; continue to next time instance;
16.  H asks for the Xi's, applies EXACTD or EXACTC;

Figure 2.2: The Improved method.

the second inequality in (2.13) holds at t, H knows for sure that Pr[Y > γ] > δ at t and an alarm should be raised. Monitoring the first inequality in (2.13) is the same as monitoring whether

Σ_{i=1}^g ln Mi(λ1) ≤ ln δ + λ1γ. (2.14)

We can treat (ln δ + λ1γ) as the global constraint and, at time t, let Vi = ln Mi(λ1) be the local deterministic score at client ci; this becomes exactly the formulation of the DTM problem. We now apply the adaptive thresholds algorithm for constraint monitoring from [48] to monitor (2.14). We denote this monitoring instance as J1. At any time t, if J1 raises no alarm, H knows that no alarm should be raised at t, since by implication (2.14) holds, and hence Pr[Y > γ] ≤ δ. Monitoring the second inequality in (2.13) is to monitor whether

Σ_{i=1}^g ln Mi(λ2) ≤ ln(1 − δ) + λ2γ. (2.15)

By treating (ln(1 − δ) + λ2γ) as the global constraint and, at time t, letting Wi = ln Mi(λ2) be the local deterministic score at client ci, we again apply [48] to monitor (2.15). Denote this monitoring instance as J2. Contrasting J1 with J2: when J2 does not report an alarm at t, it means that (2.15) holds, which implies that Pr[Y > γ] > δ, so H needs to raise an alarm.

One choice is to let H run both J1 and J2. However, when Pr[Y > γ] deviates from δ considerably, one of them will almost always raise alarms, which results in a global poll and an adjustment of the local filters [48]. So the total communication cost will actually be higher than running just one. A critical challenge is deciding which instance to run. A simple and effective method is to make this decision periodically using recent observations of Pr[Y > γ] and δ. Suppose we set the period to k, and the current time instance is t.
For any i ∈ [t − k, t), let ei = 1 if Pr[Yi > γ] > δ and ei = 0 otherwise; let e = Σ_{i=t−k}^{t−1} ei. If e ≥ k/2, then in a majority of recent instances Pr[Yi > γ] > δ; hence (2.15) is more likely to hold, and J2 is most likely not going to raise alarms and is more efficient to run. If e < k/2, then in a majority of recent instances Pr[Yi > γ] ≤ δ; (2.14) is more likely to hold, and J1 is most likely not going to raise alarms and is more efficient to run.

Another question is how to set the λ1 and λ2 values in (2.14) and (2.15). Since they are derived directly from (2.13), which originates from (2.8) and (2.9), setting them as shown in (2.12) will likely result in tight bounds, and thus fewer violations of (2.14) and (2.15), making J1 and J2 efficient to run, respectively. However, this does require H to ask for the E(Xi)'s and Var(Xi)'s at every time instance, defeating the purpose of using the adaptive thresholds algorithm to reduce the number of messages. To remedy this, we let H reset the optimal λ1 and λ2 values for the two adaptive thresholds instances periodically, every k time instances, for a system parameter k. The complete algorithm, Iadaptive, is shown in Figure 2.3.

2.5 Sampling Methods to Estimate the Threshold

In any of the previous methods, when the algorithm fails to definitively indicate whether an alarm should be raised, Pr[Y > γ] is likely close to δ. If H

Algorithm Iadaptive(c1, ..., cg, H, k)
1. initialize (without starting) two adaptive thresholds instances J1, J2 [48]: J1 monitors whether Σ_i Vi ≤ ln δ + λ1γ, and J2 monitors whether Σ_i Wi ≤ ln(1 − δ) + λ2γ;
2. H sets λ1 to a small positive value, e = 0, starts J1;
3. for t = 1, ..., T
4.   let Xi = Xi,t, Y = Yt = Σ Xi;
5.   ci computes Vi = ln Mi(λ1), or Wi = ln Mi(λ2);
6.   if (J1 is running and raises no alarm)
7.     H raises no alarm; continue to line 11;
8.   if (J2 is running and raises no alarm)
9.     H raises an alarm; e = e + 1; continue to line 11;
10.
H asks for the Xi's, applies EXACTD or EXACTC, and sets e = e + 1 if an alarm is raised;
11.  if (t mod k == 0)
12.    stop the currently running instance Jx;
13.    each ci sends E(Xi) and Var(Xi) to H;
14.    reset λ1 in J1 and λ2 in J2 according to (2.12);
15.    if (e ≥ k/2) set x = 2, else set x = 1;
16.    H sets e = 0, starts Jx, broadcasts the setup information of Jx and the new λ1 and λ2 values;

Figure 2.3: The Iadaptive method.

needs to be sure that the (γ, δ) threshold is crossed, all of the Xi have to be retrieved and the exact algorithms in Section 2.3.1 applied. But in many situations this is expensive and impractical, due to both the communication and computation costs involved. Since uncertainties naturally exist in probabilistic data, it is very likely that users are willing to approximate the conditions under which the server raises the alarm, if approximation guarantees can be provided.

2.5.1 The Random Sampling Approach

A natural choice and standard approximation is to leverage random sampling. We first introduce the RS algorithm in Figure 2.4.

Lemma 1. The RS estimator satisfies E(p̂(γ)) = Pr[Y > γ] and Pr[|p̂(γ) − Pr[Y > γ]| < ε] > 3/4.

Proof. Let ε′ = ε/2; then κ in line 1 is 1/ε′². Clearly, by lines 2-4, S is a random sample of Y with size 1/ε′². Suppose Y's distribution is represented by a multiset P of elements P = {y1, ..., yN} for some imaginary, sufficiently large value N ∈ Z+. Let r(γ) be the number of elements in P that are larger than γ; then Pr[Y > γ] = r(γ)/N. Let p = 1/(ε′²N); we then define N i.i.d. random variables Z1, ..., ZN such that Pr[Zi = 1] = p and Pr[Zi = 0] = 1 − p. We associate Zi with yi ∈ P. Then, S can be viewed as being created by the following process: for each i ∈ [1, N], insert yi into S if Zi = 1. For any γ, s(γ) in line 5 is a random variable determined by the number of elements of P larger than γ (each sampled with probability p) in S. There are precisely r(γ) such elements in P, and we denote them as {y_{ℓ1}, ..., y_{ℓr(γ)}}, where y_{ℓi} ∈ P. This means
This means Algorithm RS (c1; : : : ; cg; t;H; ") 1. let Xi = Xi;t, Y = Yt = Pg i=1 Xi, S = ;, = 4="2; 2. for i = 1; : : : ; g 3. send random sample Si = fxi;1; : : : ; xi; g of Xi to H; 4. For any j 2 [1; ], H inserts Pg i=1 xi;j into S; 5. let s( ) be the number of elements in S greater than ; 6. return bp( ) = s( ) "2 4 ; Figure 2.4: The RS estimator 17 that: s( ) = Pr( ) i=1 Z`i . Since each Zi is a Bernoulli trial, s( ) is a Binomial distribution B(r( ); p). Immediately, E(s( )) = p r( ). Hence, E(bp( )) = E("02N s( ) N ) = 1 p p r( ) N = Pr[Y > ], and Var( s( ) p ) = 1 p2 Var(s( )) = 1 p2 r( )p(1 p) < r( ) p = r( )"02N ("0N)2: Also, E(s( )=p) = r( ). By Chebyshev's inequality: Pr[ s( ) p r( ) 2"0N] 1 4 , which implies that: Pr[ 1 N s( ) p r( ) 2"0] 1 4 . Given " = 2"0 and p = 1=("02N), s( ) pN = s( )"2 4 , we have Pr[ s( )"2 4 Pr[Y > ] "] 1 4 . Immediately, Pr[jbp( )Pr[Y > ]j < "] > 3 4 . We can boost up Pr[jbp( ) Pr[Y > ]j < "] to be arbitrarily close to 1 by the MRS (median RS) Algorithm in Figure 2.5. Theorem 1. MRS returns bpj( ) s.t. Pr[jbpj( ) Pr[Y > ]j < "] > 1 , for any "; 2 (0; 1); it uses 32 g "2 ln 1 bytes. Proof. By Lemma 1, each Ii outputs 1 with probability at least 3 4 in line 3 in Figure 2.5. Let h = 8 ln 1 ; by the common form of the Chernoff Bound [58], Pr[ Ph i=1 Ii < h 2 ] < e2h( 3 4 1 2 )2 = . Pr[ Ph i=1 Ii < h 2 ] is exactly the probability of less than half of Ii's being 0. Since Ij is the median in I (line 4), there is at least (1 ) probability that Ij = 1. By line 3, in this case, we must have jbpj( )Pr[Y > ]j < ". The communication in bytes is straightforward. Algorithm MRS (c1; : : : ; cg; t;H; "; ) 1. run 8 ln 1 independent instances RS (c1; : : : ; cg; t;H; "); 2. let bpi( ) be the ith RS's output for i 2 [1; 8 ln 1 ]; 3. set Ii be 1 if jbpi( ) Pr[Y > ]j < ", and 0 otherwise; 4. let Ij be the median of I = fI1; : : : ; I8 ln 1 g; 5. 
return p̂j(γ);

Figure 2.5: The MRS estimator.

Lastly, if the p̂j(γ) returned by MRS is greater than δ, H raises an alarm at t; otherwise, no alarm is raised. MRS approximates Pr[Y > γ] to within ε with probability at least (1 − φ), using O((g/ε²)·ln(1/φ)) bytes, for any ε, φ ∈ (0, 1).

2.5.2 Random Distributed ε-Sample

Instead of using the standard random sampling approach of Section 2.5.1, we can leverage more powerful machinery in our analysis to derive a new algorithm with the same guarantee w.r.t. a fixed pair of thresholds (γ, δ), but which is simpler to implement and works better in practice. Later, in Section 2.6, we also show that it can handle multiple pairs of thresholds simultaneously without incurring additional costs.

We can approximate the probability of raising an alarm by a Monte Carlo approach, where H asks each ci for a sample xi from Xi. It then computes the value y = Σ_{i=1}^g xi; this is a sample estimate from the distribution of Y, so that Pr[Y > γ] = Pr[y > γ]. Repeating this to amplify success gives the random distributed ε-sample (RDεS) algorithm in Figure 2.6.

Theorem 2. RDεS gives E(v/β) = Pr[Y > γ] and Pr[|v/β − Pr[Y > γ]| ≤ ε] ≥ 1 − φ, using O((g/ε²)·ln(1/φ)) bytes.

Proof. First, it is clear that in line 7, for any j ∈ [1, β], yj = Σ_{i=1}^g xi,j is a random sample drawn from the distribution of Y. Hence, E(v/β) = Pr[Y > γ]. We next leverage the concept of VC dimension [82]. Let P be a set of points, or

Algorithm RDεS (c1, ..., cg, H, t, ε, φ)
1. Xi = Xi,t, Y = Σ_{i=1}^g Xi, Si = ∅, v = 0, β = (1/ε²)·ln(1/φ);
2. for i = 1, ..., g
3.   for j = 1, ..., β
4.     ci selects some value xi,j from Xi, into Si, at random according to its underlying distribution;
5.   ci sends Si to H;
6. for j = 1, ..., β
7.   if (yj = Σ_{i=1}^g xi,j > γ) v = v + 1;
8. if (v/β > δ) H raises an alarm;
9. else H raises no alarm;

Figure 2.6: The RDεS method.

more generally a distribution. Let I be a family of subsets of P.
Let P have domain R and let I consist of ranges of the form of one-sided intervals (x, ∞) for any x ∈ R. The pair (P, I) is a range space, and we say a subset X ⊆ P is shattered by (P, I) if every subset Xs ⊆ X can be defined as I ∩ X for some I ∈ I. The size of the largest subset X shattered by (P, I) is the VC dimension ν of (P, I). For one-sided intervals I, the VC dimension of a range space (P, I) using any set P is ν = 1. An ε-sample for a range space (P, I) is a subset Q ⊆ P that approximates the density of P such that:

max_{I ∈ I} | |I ∩ P|/|P| − |I ∩ Q|/|Q| | ≤ ε. (2.16)

A classic result of Vapnik and Chervonenkis [82] shows that if (P, I) has VC dimension ν and Q is a random sample from P of size O((ν/ε²)·log(1/φ)), then Q is an ε-sample of (P, I) with probability at least 1 − φ. Every yj in line 7 can be viewed as a random point from P, the distribution of values of Y. The ranges we estimate are one-sided intervals ((γ, ∞) for any γ ∈ R), and they have VC dimension ν = 1. If we let β = O((1/ε²)·ln(1/φ)), RDεS gets exactly an ε-sample and guarantees that |v/β − Pr[Y > γ]| ≤ ε with probability at least 1 − φ.

2.5.3 Deterministic Distributed ε-Sample

The sample sizes in RDεS could be large, especially for small ε and φ values, which drives up the communication cost (measured in bytes). We introduce another sampling algorithm, the deterministic distributed ε-sample (DDεS) method, to address this problem; it is shown in Figure 2.7.

Let X̃i represent Si in the DDεS algorithm. Clearly, X̃i approximates Xi. Let Ỹ = Σ_{i=1}^g X̃i; i.e., for any u ∈ (1, ..., β)^g (as in lines 6-8), insert Σ_{i=1}^g x_{i,ui} into Ỹ. By the construction of DDεS, it is easy to see that:

Pr[Ỹ > γ] = v/β^g. (2.17)

To analyze its error, consider the distribution Y≠j = Σ_{i=1, i≠j}^g Xi. Note that Y = Y≠j + Xj. We can claim the following about the random variable Ỹj = Y≠j + X̃j:

Lemma 2. If X̃j is an ε-sample of (Xj, I), then |Pr[Ỹj > γ] − Pr[Y > γ]| ≤ ε with

Algorithm DDεS (c1, ..., cg, H, t, ε, δ)
1.
Xi = Xi,t, Y = Σ_{i=1}^g Xi, Si = ∅, v = 0;
2. ε′ = ε/g, β = 1/ε′;
3. for i = 1, ..., g
4.   ci selects β evenly spaced xi,j's from Xi into Si, s.t. Si = {xi,1, ..., xi,β} and ∫_{xi,j}^{xi,j+1} Pr[Xi = x] dx = ε′;
5.   ci sends Si to H;
6. let (1, ..., β)^g define a g-dimensional space where each dimension takes values {1, ..., β};
7. for each u ∈ (1, ..., β)^g   // u is a vector of g elements
8.   if (Σ_{i=1}^g x_{i,ui} > γ) v = v + 1;
9. if (v/β^g > δ) H raises an alarm;
10. else H raises no alarm;

Figure 2.7: The DDεS method.

probability 1.

Proof. The distribution of the random variable Ỹj has two components, Y≠j and X̃j. The first incurs no error; thus,

Pr[Ỹj > γ] = (1/|X̃j|) · Σ_{x ∈ X̃j} Pr[x + Y≠j > γ].

Each x ∈ X̃j shifts the distribution of the random variable Y≠j, so the part of that distribution that is greater than γ for xi ∈ X̃j will also be greater than γ for xi+1 ∈ X̃j (since xi+1 > xi by definition). Let yi = γ − xi denote the location in the distribution of Y≠j where xi causes y ∈ Y≠j to have Ỹj > γ. Now, y ∈ [yi, yi+1] has y + xl ≤ γ if l < i and y + xl > γ if l ≥ i. So y ∈ [yi, yi+1] only has error in Pr[y + x > γ] (where x is drawn from either Xj or X̃j) for x ∈ [xi, xi+1]. Otherwise, for x ∈ [xl, xl+1] with l < i we have Pr[y + x > γ] = 0, and for x ∈ [xl, xl+1] with l > i we have Pr[y + x > γ] = 1. Since for any i, ∫_{xi}^{xi+1} Pr[Xj = x] dx ≤ ε (because X̃j is an ε-sample of (Xj, I)), we observe that:

∫_{yi}^{yi+1} Pr[Y≠j = y] · (1/|X̃j|) · Σ_{x ∈ X̃j} | Pr[y + x > γ] − Pr[y + Xj > γ] | dy ≤ ε · ∫_{yi}^{yi+1} Pr[Y≠j = y] dy.

Thus, we use

Pr[Ỹj > γ] = ∫_y Pr[Y≠j = y] · (1/|X̃j|) · Σ_{x ∈ X̃j} Pr[y + x > γ] dy

to conclude that |Pr[Y > γ] − Pr[Ỹj > γ]| ≤ Σ_{i=0}^{|X̃j|} ε · ∫_{yi}^{yi+1} Pr[Y≠j = y] dy ≤ ε. This bounds the error on Y with Ỹj, where a single Xj is replaced with X̃j.
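The two sampling estimators of this section, the randomized draws of RDεS and the deterministic evenly spaced quantiles of DDεS, can be sketched side by side. This is a toy illustration with list-based pdfs; the function names are ours, and the DDεS sketch enumerates all β^g combinations, which is feasible only for tiny g (Section 2.5.5 discusses testing a random subset instead):

```python
import itertools
import math
import random

def rdes_estimate(clients, gamma, eps, phi, rng):
    """RDeS sketch: beta = (1/eps^2) ln(1/phi) random draws of y = sum_i x_i;
    returns the empirical tail v/beta. clients = [(values, probs), ...]."""
    beta = math.ceil(math.log(1 / phi) / eps ** 2)
    v = 0
    for _ in range(beta):
        y = sum(rng.choices(vals, weights=probs)[0] for vals, probs in clients)
        if y > gamma:
            v += 1
    return v / beta

def quantile_points(vals, probs, beta):
    """beta evenly spaced (in probability mass) points of a discrete pdf:
    point j sits at cumulative mass (j + 0.5)/beta -- a deterministic
    eps-sample for one-sided intervals with eps = 1/beta."""
    pts, acc, i = [], 0.0, 0
    for j in range(beta):
        t = (j + 0.5) / beta
        while acc + probs[i] < t:
            acc += probs[i]
            i += 1
        pts.append(vals[i])
    return pts

def ddes_estimate(clients, gamma, eps):
    """DDeS sketch: each client sends beta = g/eps quantile points; the
    server checks all beta^g index combinations and returns v/beta^g."""
    g = len(clients)
    beta = round(g / eps)
    samples = [quantile_points(v, p, beta) for v, p in clients]
    v = sum(1 for combo in itertools.product(*samples) if sum(combo) > gamma)
    return v / beta ** g
```

With two clients whose scores are 0 or 1 with equal probability and γ = 1, the true tail is Pr[Y > 1] = 0.25, and the deterministic estimator recovers it exactly at ε = 0.5.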
We can now define (Ỹj)l = Ỹj − Xl + X̃l = Σ_{i=1, i≠j,l}^g Xi + X̃j + X̃l, and then apply Lemma 2 to show that if X̃l is an ε-sample of (Xl, I), then |Pr[(Ỹj)l > γ] − Pr[Ỹj > γ]| ≤ ε. We can apply this lemma g times, always replacing one Xi with X̃i in the approximation of Y; the sum of the errors is then at most εg. This implies the following theorem.

Theorem 3. If each ci constructs X̃i as an (ε/g)-sample of (Xi, I), then for any γ, |Pr[Ỹ > γ] − Pr[Y > γ]| ≤ ε with probability 1.

Finally, by the definition of ε-samples on one-sided intervals (refer to (2.16) and the fact that in our case I consists of intervals (γ, ∞)), it is easy to see that:

Lemma 3. Using g/ε evenly spaced points, each Si in DDεS gives an X̃i that is an (ε/g)-sample of (Xi, I).

Combining this with (2.17), we have:

Corollary 1. DDεS gives |Pr[Ỹ > γ] − Pr[Y > γ]| ≤ ε with probability 1, in g²/ε bytes.

2.5.4 A Randomized Improvement

We can improve the analysis slightly by randomizing the construction of the Δ-samples for each Xi (points spaced Δ apart in probability mass). We choose xi,1 ∈ X̃i (the smallest point) at random, so that Pr[xi,1 = x] = (1/Δ)·Pr[Xi = x | x ≤ x*], where x* is defined so that ∫_{−∞}^{x*} Pr[Xi = x] dx = Δ. Then each xi,j still satisfies ∫_{xi,j}^{xi,j+1} Pr[Xi = x] dx = Δ. This keeps the points evenly spaced, but randomly shifts them.

Now we can improve Theorem 3 by modifying the result of Lemma 2. We can instead state that the error caused by X̃i is

Hi = (Pr[Ỹj > γ] − Pr[Y > γ]) ∈ [−Δ, Δ].

Because the random shift of X̃i places each xi,j ∈ X̃i with equal probability at each point it represents in Xi, for any I ∈ I we have that

E[ |I ∩ X̃i| / |X̃i| ] = E[ |I ∩ Xi| / |Xi| ],

and hence, for any γ, E[Pr[Ỹj > γ]] = E[Pr[Y > γ]]. Thus, E[Hi] = 0, and for all i, (max{Hi} − min{Hi})/2 ≤ Δ. Since the Hi are independent, we can apply a Chernoff-Hoeffding bound to the error on Ỹ:

Pr[|Pr[Ỹ > γ] − Pr[Y > γ]| ≥ ε] = Pr[|Σ_{i=1}^g Hi| ≥ ε] ≤ 2·exp(−2ε²/(g·(2Δ)²)) = 2·exp(−ε²/(2gΔ²)) ≤ φ,

when Δ ≤ ε/√(2g·ln(2/φ)). This implies that:

Theorem 4.
If each X̃i is of size (1/ε)·√(2g·ln(2/φ)) and is randomly shifted, then for any γ, Pr[|Pr[Ỹ > γ] − Pr[Y > γ]| < ε] > 1 − φ.

This gives a better bound when the acceptable failure probability φ satisfies 2·ln(2/φ) < g. Modifying DDεS according to Theorem 4 yields:

Corollary 2. The randomized variant of DDεS guarantees Pr[|Pr[Ỹ > γ] − Pr[Y > γ]| < ε] > 1 − φ, for any ε, φ, and γ, in (g/ε)·√(2g·ln(2/φ)) bytes.

2.5.5 Practical Improvements

Whenever a sample is required at any time t, for both the RDεS and DDεS algorithms, if the local sample size |Si| at t exceeds the size required to represent the distribution Xi, client ci simply forwards Xi to the server, and the server generates the sample for Xi itself. This simple optimization minimizes the communication cost.

For the DDεS algorithm (in both its basic version and the random-shift version), a drawback is that its computation cost might become expensive for larger sample sizes or a large number of clients. In particular, executing its lines 7-10 requires the calculation of β^g sums. In practice, however, we have observed that the DDεS algorithm can still give accurate estimations if we test only a small, randomly selected subset of the possible combinations of local samples instead of all β^g combinations; i.e., in line 7 we randomly select m ≪ β^g such u's, and in line 9 we test v/m instead.

2.6 Extension

2.6.1 Weighted Constraint

Suppose the user is interested in monitoring Y = Σ_{i=1}^g ai·Xi for some weights {a1, ..., ag}, ∀ai ∈ R+. All of our results can be easily extended to this case. The Improved and Iadaptive methods can be adapted based on the observations that: 1) E(Y) = Σ_{i=1}^g ai·E(Xi) and Var(Y) = Σ_{i=1}^g ai²·Var(Xi); and 2) M(λ) = Π_{i=1}^g Mi(ai·λ). The RDεS and DDεS algorithms can also be easily adapted: for any sample j, instead of checking whether Σ_{i=1}^g xi,j > γ, they check whether Σ_{i=1}^g ai·xi,j > γ, in lines 7 and 8 of Figures 2.6 and 2.7, respectively. The exact methods can also be extended easily.
The discrete case is trivial, and the continuous case leverages the observation that the characteristic function satisfies φ_Y(λ) = Π_{i=1}^g φ_{X_i}(a_i λ).

2.6.2 Handling Multiple (γ, δ) Thresholds

Another nice aspect of RDεS and DDεS is that after the server has gathered the samples S_i from all clients, if he wants to check another threshold pair (γ′, δ′), he already has sufficient information. He re-executes lines 6-9 of RDεS or lines 6-10 of DDεS with the new threshold pair (γ′, δ′). The estimate of Pr[Y > γ′] is again within ε, with probability at least 1 − φ for RDεS and with probability 1 for DDεS, i.e., the same error ε and failure probability φ (or 0) cover all possible pairs (γ, δ) simultaneously in RDεS (or DDεS). This is especially useful if there is a continuous set Γ of threshold pairs such that any violation of some (γ, δ) ∈ Γ should raise an alarm. RDεS and DDεS are then sufficient to check all of them, and are correct within ε with probability at least 1 − φ and 1, respectively, without additional costs.

This also means that RDεS delivers a stronger guarantee than the basic random sampling method in Section 2.5.1. For the basic random sampling approach, a second pair of thresholds (γ′, δ′) is a separate, but dependent, problem. We can also estimate Pr[Y > γ′] > δ′ with ε error and failure probability φ using the same sample as we used for estimating Pr[Y > γ] > δ. But now the probability that either of the thresholds has more than ε error is greater than φ. Using a union bound, a sample of size about O((1/ε²) log(1/(εφ))) from each client is needed to monitor 1/ε pairs of thresholds simultaneously, which is more than the sample size O((1/ε²) log(1/φ)) required by RDεS. Small additional samples are also required for aDDεS to monitor multiple pairs of thresholds simultaneously.

2.7 Experiments

All algorithms were implemented in C++. We used the GMP library when necessary in calculating the moment generating function M_i(λ).
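The server-side re-check described above can be made concrete with a short sketch. The function below replays the stored samples for any new pair (γ′, δ′): samples[i][j] is the j-th point received from client i, so each aligned column j is one sampled outcome of Y, and the optional weights cover the weighted constraint of Section 2.6.1. This mirrors the RDεS-style check; the function and variable names are ours, not the dissertation's pseudocode.

```python
def check_threshold(samples, gamma, delta, weights=None):
    """Estimate Pr[Y > gamma] from aligned per-client samples and
    compare the estimate against the probability threshold delta.
    samples[i][j]: j-th sampled value from client i; column j across
    all clients forms one sample of Y = sum_i weights[i] * X_i."""
    g = len(samples)
    n = len(samples[0])
    if weights is None:
        weights = [1.0] * g
    hits = sum(
        1 for j in range(n)
        if sum(weights[i] * samples[i][j] for i in range(g)) > gamma
    )
    estimate = hits / n
    return estimate > delta, estimate
```

Checking a second pair (γ′, δ′) is just another call to `check_threshold` on the same stored samples; no further communication with the clients is needed.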
We simulated the distributed clients and the server, and executed all experiments on a Linux machine with an Intel Xeon E5506 CPU at 2.13GHz and 6GB of memory. Since the flat model is used, server-to-client communication is broadcast and client-to-server communication is unicast. A server-to-client broadcast counts as one message, regardless of the number of clients. Every client-to-server transmission is one separate message, which may contain multiple values or a pdf. Score and probability values are both 4 bytes.

2.7.1 Datasets and Setup

We used real datasets from the SAMOS project [70]. Raw readings from the research vessel Wecoma were obtained, consisting of approximately 11.8 million records observed during a 9-month interval in 2010, from March to November. Each record consists of the current time and date, and the wind direction (WD), wind speed (WS), sound speed (SS), and temperature (TEM) measurements, which are observed roughly every second (sometimes more than once per second). The wind direction measures the directional degree of the wind. The wind speed and sound speed are measured in meters per second, and the temperature in degrees Celsius. We observed that some measurements were erroneous or missing, e.g., a temperature of 999 or -999 degrees Celsius.

Currently in SAMOS, to reduce communication and processing costs, records are grouped every σ consecutive seconds (the grouping interval) and then replaced by one record taking the average readings of these records on each measurement respectively, which obviously loses a lot of useful information. Instead, we derive pdfs (one per measurement) for the records in one grouping interval and assign these pdfs to an attribute-level probabilistic tuple. There are different ways to derive a pdf for a measurement attribute, for example, [24, 25, 50], which is not the focus of this work.
Without loss of generality, and to ease the presentation, we simply generate a discrete pdf based on the frequencies of distinct values for a given measurement attribute: the probability of a distinct value is proportional to its frequency over the total number of records in the current grouping interval.

The four measurements lead to four datasets WD, WS, SS, and TEM, each with one probabilistic attribute. We were unable to obtain additional datasets of large raw readings from other research vessels, since in most cases they did not keep them after reporting the average readings per grouping interval. As a result, we simulate the effect of having multiple distributed vessels by assigning each vessel tuples from a given dataset. Tuples are assigned in a round-robin fashion to preserve the temporal locality of the observed measurements.

Figure 2.8: Distributions of E(X_{i,t}) for WD, WS, SS, and TEM, where i ∈ [1, g] and t ∈ [1, T]. (a) WD. (b) WS. (c) SS. (d) TEM.

The default values of the key parameters are: σ = 300, g = 10, δ = 0.7, and γ is set, for a given dataset, to a value such that over all T time instances approximately 30% alarms are raised by an exact algorithm. The domains (in R) of WD, WS, SS, and TEM are [0, 359], [0, 58.58], [335.25, 355.9], and [5.88, 41.3], respectively. These datasets also give us quite different distributions, allowing us to investigate the different algorithms thoroughly. To illustrate this, we plot the distributions of E(X_{i,t}), where i ∈ [1, g] and t ∈ [1, T], in the default setup in Figure 2.8.
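The frequency-based pdf derivation just described can be sketched in a few lines; the interval's readings are assumed to be given as a list, and the function name is illustrative.

```python
from collections import Counter

def interval_pdf(readings):
    """Discrete pdf for one grouping interval: the probability of each
    distinct value is its frequency over the number of readings."""
    counts = Counter(readings)
    n = len(readings)
    return {value: c / n for value, c in counts.items()}
```

For instance, an interval with readings [10, 10, 20, 30] yields the pdf {10: 0.5, 20: 0.25, 30: 0.25}, which would then be stored as one attribute-level probabilistic tuple.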
E(X_{i,t}) also presents interesting (but quite different) temporal patterns and significant temporal changes in the 4 datasets, which is natural given that they represent the large, real raw readings of different measurements at sea over a long period. Due to space constraints, we omit these figures. That said, the default γ value is 230g, 17g, 343g, and 19g for WD, WS, SS, and TEM, respectively. X_{i,t} also has quite different sizes in the 4 datasets. Under the default setup, the average size of X_{i,t} is 41.15, 204.84, 20.5, and 20.98 for WD, WS, SS, and TEM, respectively (these obviously change when we vary σ). Under the default setup, T = 3932.

For each experiment, we vary one of the key parameters while keeping the others fixed at their default values. For any sampling method, the default sample size per client is λ = 30. In the Iadaptive method, k = 0.3T by default. For communication costs and running time, since T may vary, we report the average cost of one time instance, obtained by dividing the corresponding total cost by T. Note that we calculate the total running time as the server's running time plus the maximum running time of one client at each time instance. This ensures that the average running time reflects the expected response time at each round (since clients run in parallel at distributed sites).

When most X_{i,t} have large variances, sampling methods have the worst approximations. In our datasets, Var(X_{i,t}) in WD is consistently large (much larger than in the other datasets), which least favors our methods. WD also has a medium average distribution size and a wide range of values (which makes it the most interesting for a monitoring problem). Thus, we use WD as the default dataset. For our problem, the naive solution is to run EXACTD at every time instance, which is clearly much worse than the two baseline methods, Madaptive and Markov. Between the two, Madaptive is always better.
Hence, we only show the results from Madaptive as the competing baseline.

2.7.2 Effect of γ

Figure 2.9 shows the communication costs of Madaptive, Improved, and Iadaptive when we vary γ from 1500 to 3100. Both the number of messages and the number of bytes drop for all algorithms as γ increases, since the probabilistic tail bounds become tighter for larger γ values. Nevertheless, Figure 2.9(a) indicates that Iadaptive communicates the fewest messages, and Figure 2.9(b) shows that Improved sends the fewest bytes. Improved employs the most sophisticated combination of various lower and upper bounds (on both sides of E(Y)); thus it has the largest number of "certain" instances where retrieving the X_{i,t}'s can be avoided, which explains its best communication cost in bytes. Furthermore, it maintains low byte counts for all γ values (over the wide range we have tested), meaning that its pruning is effective on both sides of E(Y). However, Improved does require at least one, to a few, message(s) per client at every time instance, as shown in Figure 2.9(a). When reducing the number of messages is the top priority, Iadaptive remedies this problem. Figure 2.9(a) shows that in most cases it uses only half to one-third the number of messages of Madaptive and Improved. In fact, it sends less than one message per client per time instance in most cases.

Figure 2.9: Communication: vary γ. (a) Messages. (b) Bytes.

Figure 2.10: Response time: (a) vary γ. (b) vary δ.

Figure 2.10(a) shows the response time of these methods when γ varies. Clearly, all methods take less time as γ increases, since there are fewer instances where they need to call the (costly) EXACTD method. Improved and Iadaptive are much more efficient than Madaptive. The dominant cost in Madaptive and Improved is the calls to EXACTD, while the dominant cost in Iadaptive is the calculation of the moment generating functions at the clients. This explains why the response times of Madaptive and Improved improve at a faster pace than that of Iadaptive as γ increases: a larger γ mainly reduces the number of calls to EXACTD, but Iadaptive still needs to calculate the moment generating functions. Nevertheless, Iadaptive is still more efficient than Madaptive in all cases. When γ = 3100, Iadaptive takes less than 0.001 second, and Improved takes close to 0.0003 second.

2.7.3 Effect of δ

When δ changes from 0.5 to 0.9 in Figure 2.11, Madaptive benefits the most: both its messages and bytes decrease, since its global constraint depends linearly on δ, so increasing δ leads to a linearly increasing (looser) global constraint. Nevertheless, Iadaptive still uses far fewer messages and bytes than Madaptive, and Improved uses the fewest bytes, in all cases.

Figure 2.11: Communication: vary δ. (a) Messages. (b) Bytes.

In terms of response time, Figure 2.10(b) shows trends similar to those in Figure 2.10(a): Improved and Iadaptive are more efficient than Madaptive.

2.7.4 Effect of g

We next investigate the impact of the number of clients; Figure 2.12 shows the results on communication. Not surprisingly, we see a linear correlation between the number of messages and g in Figure 2.12(a), where Iadaptive consistently performs the best. Figure 2.12(b) shows that all methods send more bytes as g increases; nevertheless, both Improved and Iadaptive send many fewer bytes than Madaptive.
All methods take longer to respond on average in Figure 2.13(a) for larger g values, due to the increasing cost of executing EXACTD. However, the cost of Madaptive increases at a faster pace than the other methods, since it makes many more calls to EXACTD. On the other hand, both Improved and Iadaptive are highly efficient even though EXACTD becomes quite expensive for large g values, since they avoid calling EXACTD in most cases. Even when g = 20, both take less than 0.005 seconds to respond.

Figure 2.12: Communication: vary g. (a) Messages. (b) Bytes.

Figure 2.13: Response time: (a) vary g. (b) vary σ.

2.7.5 Effect of σ

When σ changes, Figure 2.14 shows the communication of the various methods. Figure 2.14(a) shows that Iadaptive sends fewer messages as σ increases, while the other two methods send more. Larger σ values lead to larger pdfs, i.e., more values in X_{i,t} but each taking a smaller probability, which makes the bounds based on the moment generating functions tighter. The other bounds, however, become looser, since X_{i,t} becomes relatively more uniform for larger pdfs. Hence Iadaptive, which relies only on the moment-generating-function bounds, performs better for larger σ values in terms of the number of messages, while the others degrade slowly. In terms of the number of bytes, all methods send more bytes for larger σ values, which is easy to explain: whenever a call to EXACTD is necessary, the X_{i,t}'s need to be communicated, and they become larger for larger σ values.
Nevertheless, both Iadaptive and Improved are still much more effective than Madaptive, e.g., even when σ = 900 (a 15-minute grouping interval), Improved only sends about 1000 bytes per time instance. Figure 2.13(b) shows that all methods take longer to respond, since EXACTD becomes more expensive as the pdf sizes increase. Improved and Iadaptive are clearly faster than Madaptive. When σ = 900, both still take less than 0.005 second to respond.

Figure 2.14: Communication: vary σ. (a) Messages. (b) Bytes.

2.7.6 Sampling Methods

The RDεS method offers a theoretical guarantee similar to (and even stronger than, see Section 2.6.2) that of the basic random sampling method in Section 2.5.1, and its performance in practice is also better. Thus, we focus on studying RDεS, DDεS, and its randomized improvement, denoted aDDεS. Note that we have incorporated the practical improvements introduced in Section 2.5.5; m = 2 for both DDεS and aDDεS (which achieves sufficient accuracy for both methods). In this set of experiments, we compare the sampling methods against the EXACTD method by running them over all T time instances. We use the precision and recall metrics to measure the approximation quality of the sampling methods. Here, precision and recall are calculated w.r.t. the set of true alarms among the T instances. For example, suppose there is a set A of 300 true alarms over T = 1000 time instances, and an approximate method raises a set B of 295 alarms out of the 1000 instances, with 5 false positives and 10 false negatives. Then its precision is 290/295 and its recall is 290/300. Figures 2.15(a) and 2.15(b) show that all sampling methods improve their precision and recall when the sample size per client increases.
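The precision/recall computation just described reduces to a set intersection over alarm time instances. The following sketch (with illustrative names) reproduces the worked example: 300 true alarms, 295 raised, 5 false positives, 10 false negatives.

```python
def precision_recall(true_alarms, raised_alarms):
    """Precision/recall of the raised alarm instants w.r.t. the true
    ones, following the standard definitions used in the experiments."""
    true_alarms, raised_alarms = set(true_alarms), set(raised_alarms)
    tp = len(true_alarms & raised_alarms)          # correctly raised
    precision = tp / len(raised_alarms) if raised_alarms else 1.0
    recall = tp / len(true_alarms) if true_alarms else 1.0
    return precision, recall
```

With 290 alarms in common, the call returns (290/295, 290/300), matching the example in the text.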
Theoretically, both DDεS and aDDεS should always have better precision and recall than RDεS for the same sample size. However, since we have incorporated the practical improvement into DDεS and aDDεS to cut down their computation cost, RDεS may perform better in some cases. Nevertheless, Figures 2.15(a) and 2.15(b) show that in practice, given the same sample size, DDεS achieves the best precision while aDDεS has the best recall, and aDDεS always outperforms RDεS. When λ = 30, they achieve a precision and recall close to or higher than 0.98. The sample size required in practice to achieve good accuracy is, for all sampling methods, clearly much smaller than what our theoretical analysis suggests. This is not surprising, since the theoretical analysis caters to worst cases that rarely exist in real datasets. In all remaining experiments, we use λ = 30 by default.

Figures 2.15(c) and 2.15(d) show that the sampling methods yield clear savings in both communication (bytes) and computation. They are especially useful in saving response time, being 1-2 orders of magnitude faster than EXACTD, and the gap is expected to be even larger for larger pdfs or more clients. Note that all sampling methods have the same communication cost given the same sample size (hence we only show one line for all of them in Figure 2.15(c)). Also, they result in the same number of messages as EXACTD.

Figure 2.15: Performance of the sampling methods: vary λ (sample size per client). (a) Precision. (b) Recall. (c) Communication: bytes. (d) Response time.
We have also tested the sampling methods on all 4 datasets under the default setup; the results are shown in Figure 2.16, and the trends are clearly similar to those in Figure 2.15. Note that WS has quite large pdfs; thus EXACTD becomes very expensive on this dataset in terms of both bytes communicated and running time, making the sampling methods even more valuable in these situations (several orders of magnitude more efficient than EXACTD).

Figure 2.16: Performance of the sampling methods: vary datasets. (a) Precision. (b) Recall. (c) Communication: bytes. (d) Response time.

2.7.7 Integrated Methods

Lastly, we integrate our sampling methods with Madaptive, Improved, and Iadaptive to derive the MadaptiveS, ImprovedS, and IadaptiveS methods, where at any time instance a call to EXACTD is replaced with a call to a sampling method. In particular, we use aDDεS as the sampling method, since it achieves the best trade-off between efficiency and accuracy, as shown in the last set of experiments. We tested these methods, along with their exact versions, on all datasets using the default setup. The results are shown in Figure 2.17. The trends are clear: 1) The approximate versions consistently outperform the corresponding exact versions in both communication and response time; 2) Our methods outperform the baseline methods, Madaptive and MadaptiveS, in all cases, by significant margins; 3) Iadaptive and IadaptiveS are the best exact and approximate methods, respectively, in saving the number of messages, and Improved and ImprovedS are the best methods in saving the number of bytes.
For example, Iadaptive and IadaptiveS use less than one message per client per time instance on all datasets; Improved and ImprovedS use less than 1000 and 100 bytes per time instance, respectively, on WS, which has an average pdf size of 204.84. 4) Iadaptive, IadaptiveS, Improved, and ImprovedS are efficient to run. In particular, IadaptiveS and ImprovedS are extremely fast, e.g., Figure 2.17(c) shows that they take less than 10⁻³ and 10⁻⁴ seconds to respond, respectively, on all datasets. 5) aDDεS is highly effective. Figure 2.17(d) shows that MadaptiveS, ImprovedS, and IadaptiveS have almost perfect precision and recall on all datasets (more than 0.996 in all cases). Note that their precision and recall are clearly better than when using sampling methods at every time instance: many alarms are already caught with certainty by Madaptive, Improved, and Iadaptive, so only a tiny fraction of undecided cases is left to the sampling methods.

Figure 2.17: Performance of all methods: vary datasets. (a) Communication: messages. (b) Communication: bytes. (c) Response time. (d) Precision and recall.

2.8 Related Work

To our knowledge, aggregate constraint monitoring on distributed data with uncertainty has not been explored before. That said, ranking and frequent-items queries were studied on distributed probabilistic data in [54, 83]. Monitoring centralized uncertain data for top-k and similarity queries was studied in [38, 49, 84].
On the other hand, due to their importance and numerous applications, constraint and function monitoring with thresholds on deterministic distributed data has been examined extensively, e.g., [18, 39, 48, 52, 62, 74]. In our study, we have leveraged the adaptive-thresholds algorithm for deterministic (sum) constraint monitoring from [48]. This choice is independent of the design of our adaptive algorithms for the DPTM problem: any adaptive algorithm for (sum) constraint monitoring on deterministic data can be used in our Iadaptive method. Our study is also related to aggregation queries in probabilistic data, e.g., [45, 46, 60, 69, 76, 81]. However, monitoring both score and probability thresholds on aggregate constraints continuously over distributed probabilistic data is clearly different from these studies. Probabilistic threshold queries in uncertain data are also relevant [11, 22, 66, 68], as they are likewise concerned with probability thresholds on the query results, but they mostly focus on one-shot query processing over centralized, offline probabilistic data. Lastly, the basic sampling method MRS in Section 2.5.1 can be viewed as a standard extension of the random sampling technique [58, 82]. The RDεS and DDεS methods are related to VC dimensions and ε-samples [82], as we have already pointed out. The design principle behind the RDεS method, i.e., using a Monte Carlo approach, has also been used for general query processing in probabilistic data (e.g., [33, 44, 66] and more in [77]). The DDεS and aDDεS methods are based on several intriguing insights into the distinct properties of our problem.

2.9 Conclusion

We studied the threshold monitoring problem over distributed probabilistic data. We focused on continuously monitoring a threshold constraint over the sum of distributed probabilistic data, and explored a number of novel methods that effectively and efficiently reduce both the communication and computation costs.
Extensive experiments demonstrate the excellent performance and significant savings achieved by our methods, compared to the baseline algorithms. Many interesting directions are open for future work. Examples include, but are not limited to, extending our study to the hierarchical model often used in sensor networks, continuously monitoring a function value (e.g., max, min, median) of distributed data, and handling the case where data from different sites are correlated. In the next chapter, we study the distributed online tracking problem, which explores continuous function tracking on a general-tree topology.

CHAPTER 3

DISTRIBUTED ONLINE TRACKING

3.1 Introduction

The increasing popularity of smart mobile devices and the fast growth in the deployment of large measurement networks generate massive distributed data continuously. Such data include, but are not limited to, values collected from smart phones and tablets [8], measurements from large sensor-based measurement networks [26, 55, 86], application data from location-based services (LBS) [73], and network data from large infrastructure networks. Tracking a user function over such distributed data continuously, in an online fashion, is a fundamental challenge, and a critical task in many practical applications. For example, it is common for users to continuously track the maximal (or minimal) temperature reading from a number of measurement stations. Similar examples can easily be found in location-based services and other distributed systems. This problem is also useful in so-called publish/subscribe systems [9, 27], where a subscriber (tracker) may register a function (also known as a query) with a publisher (observer). Data continuously arrive at the publisher, and the publisher needs to keep the subscriber informed about the value of her function f as f is continuously applied over the current data value.
When a subscriber's function of interest depends on data values from multiple publishers, it becomes a distributed tracking problem. It is always desirable, and sometimes even critical, to reduce the amount of communication in distributed systems and applications, for a number of reasons [3, 15-17, 26, 55, 56, 63, 86]. Many devices rely on an on-board battery and incur high power consumption when they communicate, e.g., sensors and smart phones; reducing the number of messages they send helps extend their battery life. Another reason is to save network bandwidth. From the user's point of view, less communication often leads to economic gains, e.g., most smart phones have a monthly budget on their data plan, and for nodes in remote areas of a large measurement network, communication via satellite comes with a high price tag. From the network infrastructure's point of view (e.g., an ISP such as Comcast), too much communication from any one application could significantly congest the network and slow down its performance (keeping in mind that many user applications may be running at the same time and sharing the available network bandwidth). To achieve 100% accuracy for continuous online tracking of arbitrary functions, the only solution is to ask all stations to always send their readings to a centralized coordinator (the tracker), from which various functions can be easily computed and then tracked. This baseline approach, unfortunately, generates excessive communication: every new reading from any station must be forwarded to the tracker to ensure the correctness of the output values of the function being tracked. The good news is that in many application scenarios exact tracking is often unnecessary: users are willing to trade accuracy for savings in communication.
In some applications, approximation is necessary not just for reducing communication but also because of policy constraints, e.g., privacy concerns in location-based services [4] and legal requirements. To formalize this accuracy-communication trade-off, we refer to a distributed site that continuously receives data from a data source as an observer, and to the centralized site that wants to track a function (or multiple functions) computed over data from multiple, distributed data sources as the tracker. Without loss of generality, we assume the tracker tracks only one function, f. Clearly, f's output is a function of time, denoted f(t) for a time instance t. More precisely, it is a function of multiple data values at time instance t, one from each observer. As discussed above, producing the exact value of f(t) continuously for all time instances is expensive. Thus, the tracker's goal is to maintain an approximation g(t), its best knowledge of f(t) at any time instance t, using a small amount of communication (accumulated so far). Focusing on functions that produce a one-dimensional output, we require that g(t) ∈ [f(t) − Δ, f(t) + Δ] for any t ∈ [0, t_now], for some user-defined error threshold Δ. Under this setup, when Δ = 0, g(t) always equals f(t) and the communication-expensive baseline exact solution is needed. In the other extreme, when Δ = +∞, g(t) can be an arbitrary value, and effectively no communication is needed at all. These two extremes illustrate the accuracy-communication trade-off enabled by this framework.

3.1.1 Key Challenge

It is important to note that our problem is a continuous online problem that requires a good approximation at every time instance.
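To make the Δ trade-off concrete, here is a naive dead-band observer sketch for the single-observer case. It is not the optimal algorithm of [87, 89] (which sends a cleverly chosen value to achieve O(log Δ) competitiveness); it merely maintains the invariant g(t) ∈ [f(t) − Δ, f(t) + Δ] while sending a message only when the observed value drifts by more than Δ. All names are ours.

```python
def track(stream, delta):
    """One-observer dead-band tracking: send f(t) to the tracker only
    when it differs from the last value sent by more than delta.
    Returns the tracker-side approximation g(t) and the message count."""
    last = None                  # last value sent to the tracker
    msgs = 0
    g = []                       # tracker's view g(t) over time
    for ft in stream:
        if last is None or abs(ft - last) > delta:
            last = ft            # one observer-to-tracker message
            msgs += 1
        g.append(last)
    return g, msgs
```

With Δ = 0 this degenerates to forwarding every changed reading (the baseline), and with very large Δ almost nothing is sent, mirroring the two extremes discussed above.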
This is different from many distributed tracking problems in the literature that use the popular distributed streaming model, where the goal is to produce an approximation of certain functions/properties computed over the union of the data stream elements seen so far by all observers, from the beginning of time until now. It is also different from much existing work on monitoring a function over distributed streams, where the tracker only needs to decide whether f(t) has exceeded a given (constant) threshold at any time instance t. When there is only one observer, our problem degenerates to a centralized, two-party setting (observer and tracker). This problem has only recently been studied, in [87, 89], which consider both one-dimensional and multidimensional online tracking in the centralized setting. They give an online algorithm with an O(log Δ) competitive ratio and show that this is optimal; in other words, any online algorithm for this problem must use at least a factor of Ω(log Δ) more communication than the offline optimal algorithm. Note, however, that this problem is different from the classic problem of two-party computation [85] in communication complexity. In the latter, two parties, Alice and Bob, hold values x and y, respectively, and the goal is to compute some function f(x, y) while communicating the minimum number of bits between them. In online tracking, only Alice (the observer) sees the input; Bob (the tracker) just wants to keep track of it. Furthermore, in communication complexity both inputs x and y are given in advance and the goal is to study the worst-case communication, whereas in online tracking the inputs arrive in an online fashion, and it is easy to see that a worst-case (total) communication bound for online tracking is meaningless, since the function f could change drastically at each time step.
For the same reasons, our problem, distributed online tracking, is also different from distributed multiparty computation.

3.1.2 Our Contributions

In this work, we extend the online tracking problem, which was only recently studied in [87, 89], to the distributed setting with common aggregation functions (e.g., MAX), and investigate principled methods with formal theoretical guarantees on their communication performance whenever possible. We design novel methods that achieve good communication costs in practice, and formally show that they have good approximation ratios. Our contributions are summarized below.

• We formalize the distributed online tracking problem in Section 3.2 and review the optimal online tracking method of [87, 89] for the centralized setting.

• We examine a special extension of the centralized setting with one observer but many relaying nodes, known as the chain case. We study the chain model in Section 3.3 and design a method with an O(log Δ) competitive ratio. We also show that our method achieves the optimal competitive ratio in this setting.

• We investigate the "broom" model in Section 3.4 by leveraging our results from the chain model; here there are m distributed observers at the leaf level and a single chain connecting them to the tracker. We design a novel method for the MAX function and show that it has a very good approximation ratio among the class of online algorithms for the broom model.

• We extend our results to the general-tree model in Section 3.5, which is an extension of the broom model. We again show that our method has a good approximation ratio among the class of online algorithms for the general-tree model.

• We discuss other functions and topologies in Section 3.6.

• We conduct extensive experiments in Section 4.6 to evaluate the effectiveness of our methods in practice. We used several real datasets, and the results confirm that our methods are indeed superior to the alternative and baseline methods.
3.2 Problem Formulation and Background

Formally, there are m observers {s1, ..., sm} at m distributed sites, and a tracker T. These m observers are connected to T using a network topology. We consider two common topologies in this work, the broom topology and the general-tree topology, as shown in Figure 3.1. Observers are always located at the leaves, and the tracker is always located at the root of the tree. Both topologies are constructed based on a chain topology, as shown in Figure 3.2(a), and the centralized setting studied in [87, 89] is a special case of the chain topology, as shown in Figure 3.2(b). A relay node does not directly observe a function (or, equivalently, data values) that contributes to the computation of f, but it can receive messages from its child (or preceding) node(s), and send messages to its parent (or succeeding) node. It is important to note that our general-tree topology already covers the case in which an intermediate relay node u is an observer at the same time, i.e., it also observes values (modeled by a function) that contribute to the computation of f. This is because we can always conceptually add an observer node s directly below (and connected to) such an intermediate node u. Let s report the data values that are observed by u; we can then view u as a pure relay node (while making no changes to all other connections to u that already exist). More details on this issue are presented in Section 3.6.

Figure 3.1: Track f(t) = f(f1(t), f2(t), ..., fm(t)); the tracker T outputs g(t) ∈ [f(t) − Δ, f(t) + Δ]. (a) The broom model: observers s1, ..., sm connect through h relay nodes to the tracker T. (b) The general-tree model: a maximum of h relay nodes on a path from any si to T.

Figure 3.2: Special cases, g(t) ∈ [f(t) − Δ, f(t) + Δ]. (a) The chain topology: one observer, h relay nodes, and the tracker. (b) The centralized setting of [87, 89]: one observer and the tracker.
That said, in practice, a relay node can model a router, a switch, a sensor node, a computer or computation node in a complex system (e.g., the Internet, a peer-to-peer network), a measurement station in a monitoring network, etc. Each observer's data value changes (arbitrarily) over time, and can be described by a function. We denote the function at the ith observer as fi, and its value at time instance t as fi(t). The tracker's objective is to continuously track a function f that is computed from the values of the functions at all observers at time instance t, i.e., its goal is to track f(t) = f(f1(t), f2(t), ..., fm(t)) continuously over all time instances. Since tracking f(t) exactly is expensive, an approximation g(t) is allowed at the tracker T, subject to the constraint that g(t) ∈ [f(t) − Δ, f(t) + Δ] for any time instance t ∈ [0, tnow]. Here Δ ∈ Z+ is a user-defined error threshold that defines the maximum allowed error in approximating f(t) with g(t). The goal is to find an online algorithm that satisfies this constraint while minimizing the communication cost. Note that depending on the dimensionality of the outputs of f(t), as well as f1(t), f2(t), ..., fm(t), we need to track either a one-dimensional value or a multidimensional value that changes over time. This work focuses on the one-dimensional case; in other words, we assume that f(t) and f1(t), ..., fm(t) all take values in a one-dimensional space.

3.2.1 Performance Metric of an Online Algorithm

There are different ways to formally analyze the performance of an online algorithm. For an online problem P (e.g., caching), let I be the set of all possible valid input instances, and A be the set of all valid online algorithms for solving P. Let offline denote the optimal offline algorithm for P. Given an input instance I ∈ I and an algorithm A ∈ A (or offline), we denote the cost of running algorithm A on I as cost(A, I). In our setting, the cost is the total number of messages sent in a topology.
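The correctness requirement above, that g(t) stays within [f(t) − Δ, f(t) + Δ] at every time instance, can be checked mechanically. The following minimal sketch (illustrative only; the helper name is ours, not from the dissertation) validates a tracker's output trace against this constraint:

```python
def valid_tracking(f_vals, g_vals, delta):
    """Return True iff g(t) stays within [f(t) - delta, f(t) + delta]
    at every time instance, i.e., the tracker's approximation is valid."""
    return all(abs(g - f) <= delta for f, g in zip(f_vals, g_vals))

# f oscillates within delta of 5, so a constant g = 5 remains valid.
print(valid_tracking([4, 5, 6, 4], [5, 5, 5, 5], 1))   # True
print(valid_tracking([0, 10], [0, 0], 2))              # False
```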
A widely used metric is the competitive ratio. Formally, for an algorithm A ∈ A, the competitive ratio of A [59], denoted cratio(A), is defined as:

    cratio(A) = max_{I ∈ I} cost(A, I) / cost(offline, I).

Another popular metric is to analyze the performance of an algorithm A compared to other algorithms in a class A of online algorithms. Formally, we can define the ratio of A on an input instance I as follows:

    ratio(A, I) = cost(A, I) / cost(A*_I, I),

where A*_I is the online algorithm from the class A that has the lowest cost on input I, i.e., A*_I = argmin_{A′ ∈ A} cost(A′, I). Lastly, we can quantify an algorithm A's performance by its worst-case ratio, i.e.,

    ratio(A) = max_{I ∈ I} ratio(A, I).

Note that the definitions of ratio(A, I) and ratio(A) are inspired by the classic work that motivated and defined the concept of "instance optimality" [29]. In fact, if ratio(A) is a constant, then A is an instance-optimal online algorithm. Clearly, for any online problem P and its online algorithm A, we always have cratio(A) ≥ ratio(A).

3.2.2 State-of-the-art Method

Prior work has studied the online tracking problem in the centralized, two-party setting [87, 89], as shown in Figure 3.2(b). It studied both one-dimensional and multidimensional tracking, defined by the dimensionality of the output value of the function f(t) at the observer. Since we focus on the one-dimensional case, we only review the one-dimensional tracking method from [87, 89] here. Finding a good online algorithm for this seemingly very simple setup turns out to be a challenging problem. Consider the simple case where the function takes integer values at each time step, i.e., f : Z → Z, and the tracker requires an absolute error of at most Δ.
The natural solution is to let the observer first communicate f(t0) to the tracker at the initial time instance t0; then, every time f(t) has changed by more than Δ since the last communication, the observer updates the tracker with the current value of f(t). However, this natural solution has an unbounded competitive ratio compared with the offline optimal method. Consider the case where f(t) starts at f(0) = 0 and then oscillates between 0 and 2Δ. The above algorithm communicates an infinite number of times, while the offline optimal solution only needs to send one message: g(0) = Δ. This example demonstrates the hardness of the online tracking problem. For functions of the form f : Z → Z, Yi and Zhang proposed the method in Algorithm 1 and showed the following results.

Theorem 5. (from [87, 89]) To track a function f : Z → Z within error Δ, any online algorithm has to send Ω(log Δ · OptHist) messages in the worst case, where OptHist is the number of messages needed by the optimal offline algorithm. Moreover, OPTTRACK is an O(log Δ)-competitive online algorithm for tracking any function f : Z → Z within Δ. Furthermore, if f takes values from the domain of reals (or any dense set), the competitive ratio of any online algorithm is unbounded.

Theorem 5 establishes the optimality of the OPTTRACK method: it shows that any online algorithm for centralized online tracking (between two nodes) has a competitive ratio of at least Ω(log Δ), and OPTTRACK's competitive ratio of O(log Δ) meets this lower bound. Note that the negative results on real and other dense domains do not rule out the application of OPTTRACK in practice in those cases. In practice, most functions (or data values for a sequence of inputs) have a fixed precision; e.g., any real number on a 64-bit machine can be described by an integer from an integer domain of size 2^64.
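The oscillation argument above is easy to reproduce. The sketch below (our own illustration; function and variable names are not from the dissertation) counts the messages that the natural change-by-more-than-Δ tracker sends on the oscillating input, for which the offline optimum needs a single message g(0) = Δ:

```python
def naive_messages(values, delta):
    """Naive online tracker: report f(t) whenever it has moved more than
    delta from the last reported value; return the number of messages."""
    msgs, last = 0, None
    for v in values:
        if last is None or abs(v - last) > delta:
            last = v
            msgs += 1
    return msgs

delta = 5
trace = [0 if i % 2 == 0 else 2 * delta for i in range(100)]
print(naive_messages(trace, delta))  # 100: one message per oscillation
# The offline optimum sends one message, g = delta, which is within delta
# of both 0 and 2*delta, so the competitive ratio is unbounded.
```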
To the best of our knowledge, and as pointed out in [87, 89], no prior work has studied the distributed online tracking problem as formalized earlier in this section.

Algorithm 1: OPTTRACK (Δ) (from [87, 89])
1  let S = [f(tnow) − Δ, f(tnow) + Δ] ∩ Z
2  while S ≠ ∅ do
3      let g(tnow) be the median of S;
4      send g(tnow) to tracker T; set tlast = tnow;
5      wait until |f(t) − g(tlast)| > Δ;
6      S ← S ∩ [f(t) − Δ, f(t) + Δ]

3.3 The Chain Case

We first examine a special case that bridges centralized and distributed online tracking. Considering the tree topologies in Figure 3.1, it is easy to observe that each observer is connected to the tracker via a single path with a number of relay nodes (if multiple paths exist, we simply consider the shortest path). Hence, online tracking in the chain topology shown in Figure 3.2(a) is a basic building block for the general distributed online tracking problem. We refer to this problem as chain online tracking. The centralized online tracking reviewed in Section 3.2.2 and shown in Figure 3.2(b) is a special case of chain online tracking with zero relay nodes.

3.3.1 Baseline Methods

For a chain topology with h relay nodes, a tempting solution is to distribute the error threshold equally over all relay nodes and apply (h + 1) independent instances of the OPTTRACK algorithm. Suppose we have h relay nodes {n1, ..., nh}, an observer s, and a tracker T. Let n0 = s and n_{h+1} = T. For every pair of nodes {n_{i−1}, n_i} with i ∈ [1, h + 1], we can view n_i as a tracker and its preceding node n_{i−1} as an observer, and require that n_i track n_{i−1}'s function within an error threshold of Δ/(h + 1). Let y_i(t) be the function at n_i for i ∈ [1, h + 1]; then y_i(t) is the output of running OPTTRACK with error threshold Δ/(h + 1), where n_{i−1} is the observer, y_{i−1}(t) is the function to be tracked, and n_i is the tracker. Since n0 = s and y0(t) = f(t), we have two facts: (1) y1(t) ∈ [f(t) − Δ/(h + 1), f(t) + Δ/(h + 1)] for any time instance t.
(2) y_i(t) ∈ [y_{i−1}(t) − Δ/(h + 1), y_{i−1}(t) + Δ/(h + 1)] for any i ∈ [2, h + 1] and any time instance t. Since the tracker T is simply node n_{h+1}, we have g(t) = y_{h+1}(t). Using the facts above, it is easy to verify that g(t) is always within f(t) ± Δ, as required. We denote this method as CHAINTRACKA (chain-track-average).

We can generalize CHAINTRACKA to derive other similar methods. Instead of distributing the error threshold uniformly along the chain, one can distribute a random error threshold Δ_i to node n_i for i = 1, ..., h + 1, subject to the constraint that Σ_{i=1}^{h+1} Δ_i = Δ. We denote this method as CHAINTRACKR (chain-track-random). By a similar argument, CHAINTRACKR also ensures that g(t) at T is always within f(t) ± Δ.

Unfortunately, these seemingly natural solutions do not perform well, even though they are intuitive extensions of the optimal online tracking method between two nodes to multiple nodes. Given any valid algorithm A for chain online tracking, let y_i(t) be the best knowledge of f(t) at node n_i at any time instance t, for i = 1, ..., h + 1. The first observation on a chain topology is given by the following lemma.

Lemma 4. For an algorithm A (either online or offline) that solves the chain online tracking problem, we must have y_i(t) ∈ [f(t) − Δ, f(t) + Δ] for any i ∈ [1, h + 1] in order to reduce communication while ensuring correctness. This holds for any t ∈ [0, tnow].

Proof. Consider any such algorithm A. The statement trivially holds for i = h + 1, since node n_{h+1} is the tracker T and g(t) = y_{h+1}(t); by the requirement of the chain online tracking problem, g(t) must be within [f(t) − Δ, f(t) + Δ] at any time instance t ∈ [0, tnow]. Next, we show that at any time instance t, y_i(t) must be within [f(t) − Δ, f(t) + Δ] for any i ∈ [1, h]. Assume to the contrary that at some time instance t ∈ [0, tnow], we have |y_i(t) − f(t)| = λ at node n_i with λ > Δ. Let ε_{i+1} be the tracking error threshold between n_i and n_{i+1} when running a tracking procedure between these nodes.
Then we must have |y_{i+1}(t) − y_i(t)| ≤ ε_{i+1}, and hence |y_{i+1}(t) − f(t)| ≤ λ + ε_{i+1}. Applying the same argument repeatedly at nodes n_{i+1}, n_{i+2}, ..., n_{h+1}, we see that at n_{h+1} it can be that |y_{h+1}(t) − f(t)| ≤ λ + ε_{i+1} + ... + ε_{h+1}. That said, if λ > Δ, this could lead to |y_{h+1}(t) − f(t)| = λ > Δ, because the best (minimal) tracking errors for the subsequent nodes n_{i+1}, ..., n_{h+1} are 0; they cannot be negative. Thus, the assumption that there exist a time t and a node n_i such that |y_i(t) − f(t)| = λ > Δ must be wrong.

Lemma 4 formalizes a very intuitive observation on a chain topology. This result helps us arrive at the following.

Lemma 5. Both CHAINTRACKA's and CHAINTRACKR's competitive ratios are +∞ for the chain online tracking problem.

Proof. We prove the case for CHAINTRACKA; the proof for CHAINTRACKR is similar and omitted for brevity. Consider a function f at the observer s (which is node n0) whose values always stay within Δ of a constant a; in other words, f(t) ∈ [a − Δ, a + Δ] for any time instance t. By the construction of CHAINTRACKA, we must have: (1) y_i(t) ∈ [f(t) − iΔ/(h + 1), f(t) + iΔ/(h + 1)] for any i ∈ [1, h]; (2) g(t) = y_{h+1}(t) ∈ [f(t) − Δ, f(t) + Δ]. Consider an adversary Alice who tries to realize the worst case for CHAINTRACKA. Suppose t0 is the initial time instance. Alice first sets f(t0) = a − Δ. It takes h + 1 messages for n_{h+1} to learn a valid value for g(t0) at time t0. By the facts above, y_i(t0) ∈ [a − Δ − iΔ/(h + 1), a − Δ + iΔ/(h + 1)] for any i ∈ [1, h]. Alice then sets f(t1) = a + Δ. Now y_i(t0) is more than Δ away from f(t1) for any i ∈ [1, h]. By Lemma 4, such y_i(t0)'s are not allowed; hence, a node n_i cannot simply set y_i(t1) = y_i(t0). Instead, every node n_i needs to receive an update message to produce a valid tracking value y_i(t1). This leads to h messages. Again, by the design of CHAINTRACKA, y_i(t1) ∈ [a + Δ − iΔ/(h + 1), a + Δ + iΔ/(h + 1)]. Alice then sets f(t2) = a − Δ; by a similar argument, this again triggers h messages.
She repeatedly alternates the subsequent values of f between a + Δ and a − Δ. CHAINTRACKA pays at least h messages for each t ∈ [t0, tnow], which leads to O(h · tnow) messages in total. However, the offline optimal algorithm on this problem instance only needs to set g(t0) = y_{h+1}(t0) = a at t0, which takes h + 1 messages, and keep all subsequent g(t_i) the same as g(t0). Hence, cratio(CHAINTRACKA) ≥ h · tnow / (h + 1), which goes to +∞ as tnow grows.

3.3.2 Optimal Chain Online Tracking

Recall that centralized, two-party online tracking (simply known as online tracking) is a special case of chain online tracking with no relay nodes, i.e., h = 0. The OPTTRACK method in Algorithm 1 achieves an O(log Δ) competitive ratio for the online tracking problem. Furthermore, it has been shown that Ω(log Δ) is a lower bound on the competitive ratio of any online algorithm for online tracking [87, 89]. Yet, when generalizing it to chain online tracking with either CHAINTRACKA or CHAINTRACKR, the competitive ratio suddenly becomes unbounded. This huge gap motivates us to explore other alternatives, which leads to the optimal chain online tracking method, CHAINTRACKO (chain-track-optimal). This algorithm is shown in Algorithm 2, and its construction is surprisingly simple: allocate the entire error threshold to the very first relay node! Basically, CHAINTRACKO ensures that y1(t) is always within f(t) ± Δ using the OPTTRACK method. Any other relay node n_i for i ∈ [2, h + 1] maintains y_i(t) = y_{i−1}(t) at all time instances t. The tracker T maintains g(t) = y_{h+1}(t) (recall that node n_{h+1} is the tracker node). In other words, g(t) = y_{h+1}(t) = y_h(t) = ... = y2(t) = y1(t) for any t.

Lemma 6. CHAINTRACKO's competitive ratio is O(log Δ) for chain online tracking with h relay nodes.

Proof. While running algorithm OPTTRACK between the observer s and node n1, we define a round as a time period [ts, te] such that S = [f(ts) − Δ, f(ts) + Δ] at ts and S = ∅ at te (see lines 1 and 2 in Algorithm 1).
By the proof of Theorem 5 in [87, 89], we know that OPTTRACK communicates O(log Δ) messages in a round. Thus, by the construction of Algorithm 2, CHAINTRACKO communicates O(h log Δ) messages in this round. For a round [ts, te], consider any time instance t ∈ [ts, te]. Lemma 4 means that y_i(t) ∈ [f(t) − Δ, f(t) + Δ] for any i ∈ [1, h + 1] in the offline optimal algorithm. Suppose node n_i receives no message in this round; then it must be the case that:

    y_i(x) ∈ ∩_{t = ts}^{x} [f(t) − Δ, f(t) + Δ] for any x ∈ [ts, te].    (3.1)

Consider the set S(x) at node n1 at time x, where S(x) denotes the set S in Algorithm 1 at time x. By the construction of Algorithm 1, S(x) = ∩_t [f(t) − Δ, f(t) + Δ] for a subset of the time instances t in [ts, x] (the intersection S ← S ∩ [f(t) − Δ, f(t) + Δ] is taken only when |f(t) − g(t)| > Δ). Clearly, we must have:

    ∩_{t = ts}^{x} [f(t) − Δ, f(t) + Δ] ⊆ S(x) for any x ∈ [ts, te].    (3.2)

By the end of the round, S(te) = ∅ by the definition of a round, which means that ∩_{t = ts}^{x} [f(t) − Δ, f(t) + Δ] has become ∅ at some time x ≤ te, by (3.2). But this means that (3.1) cannot hold. Hence, our assumption that node n_i receives no message in this round must be wrong. This argument applies to any node n_i for i ∈ [1, h + 1], which implies that an offline optimal algorithm must send at least h + 1 messages in this round. Thus, cratio(CHAINTRACKO) ≤ ((h + 1) · O(log Δ)) / (h + 1) = O(log Δ).

Algorithm 2: CHAINTRACKO (Δ, h)
1  Let the tracking output at a node n_i be y_i(t).
2  Run OPTTRACK (Δ) between observer s and the first relay node n1, with n1 acting as the tracker.
3  for any node n_i where i ∈ [1, h] do
4      Whenever y_i(t) has changed, send y_i(t) to node n_{i+1} and set y_{i+1}(t) = y_i(t).

CHAINTRACKO is optimal, in terms of its competitive ratio, among the class of online algorithms that solve the chain online tracking problem. Specifically, O(log Δ) is a lower bound on the competitive ratio of any online algorithm for chain online tracking.
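To make the median-based idea of Algorithm 1 concrete, here is a minimal single-stream sketch in Python (our own illustration, assuming integer-valued inputs; names and structure are ours, not from the dissertation). Under CHAINTRACKO, each guess reported by this procedure at n1 would simply be relayed unchanged along the remaining relay nodes to T:

```python
def opt_track(values, delta):
    """Median-based tracking in the spirit of Algorithm 1 (OPTTRACK):
    keep a candidate interval S of valid integer guesses, report its
    median, shrink S when the guess is invalidated, and start a new
    round when S empties. Returns the list of reported guesses
    (one message each)."""
    reports = []
    it = iter(values)
    v = next(it)
    while True:
        lo, hi = v - delta, v + delta        # S = [f - delta, f + delta]
        while lo <= hi:                      # while S is nonempty
            g = (lo + hi) // 2               # median of S
            reports.append(g)                # one message toward the tracker
            try:
                while abs(v - g) <= delta:   # wait until |f(t) - g| > delta
                    v = next(it)
            except StopIteration:
                return reports
            lo = max(lo, v - delta)          # S <- S ∩ [f(t)-delta, f(t)+delta]
            hi = min(hi, v + delta)
        # S is empty: start a new round at the current value

print(opt_track([0, 5, 0], 2))  # [0, 5, 0]: three rounds, one message each
```

Each invalidation removes the reported median and at least half of the remaining candidates from S, which is the source of the O(log Δ) messages per round.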
Let C-OPT(h) be the number of messages sent by the offline optimal algorithm for a chain online tracking problem with h relay nodes.

Lemma 7. Any online algorithm for chain online tracking with h relay nodes must send Ω(log Δ · C-OPT(h)) messages.

Proof. Suppose A is an online algorithm for chain online tracking with h relay nodes. The approximations of f(t) at the different nodes are y1(t), y2(t), ..., yh(t), y_{h+1}(t), respectively (note that n_{h+1} is the tracker T, and g(t) = y_{h+1}(t)). Consider an adversary Alice, who maintains an integer set R such that any y ∈ R at time instance t is a valid representative for the value of f(t), for t ∈ [ts, te]. Here a round is defined as a period [ts, te] such that R = [f(ts) − Δ, f(ts) + Δ] at ts and R = ∅ at te. Let d_i be the number of integers in R after the ith update sent out by algorithm A from the observer. Initially, R = [f(ts) − Δ, f(ts) + Δ] and d0 = 2Δ + 1. For each round [ts, te], we will show that there exists a sequence of f(t) values such that A has to send Ω(log Δ · C-OPT(h)) messages, while the optimal offline method sends only C-OPT(h) messages.

Consider a time instance t ∈ [ts, te] after the ith update sent by algorithm A. Let x be the median of R. Without loss of generality, suppose more than (h + 1)/2 of the values y1(t), y2(t), ..., y_{h+1}(t) are in the range [min(R), x]; let Y(t) be the set of such values and z = max(Y(t)). Alice sets f(t + 1) = z + Δ + 1. It is easy to verify that f(t + 1) − y_i(t) > Δ for any y_i(t) ∈ Y(t). Hence, any node n_i with y_i(t) ∈ Y(t) must receive an update at time (t + 1) to ensure that y_i(t + 1) ∈ [f(t + 1) − Δ, f(t + 1) + Δ], by Lemma 4. This means that A has to send at least (h + 1)/2 messages (the size of Y(t)) from t to t + 1. After the (i + 1)th update sent out by A from the observer, R has to be updated as R ← R ∩ [f(t + 1) − Δ, f(t + 1) + Δ], i.e., R = [z + 1, x + (d_i − 1)/2], so its size reduces by at most half at each iteration in which A sends out an update from the observer.
It is easy to see that d_{i+1} ≥ (d_i − 1)/2. By the same argument, we get a similar result if more than (h + 1)/2 of the values y1(t), ..., y_{h+1}(t) are in the range [x, max(R)]; in that case, Alice sets f(t + 1) = z − Δ − 1, where z = min(Y(t)). That said, it takes at least Ω(log Δ) iterations for R to shrink to constant size, since R's initial size is O(Δ) and its size reduces by at most half in each iteration. When R becomes empty, Alice starts a new round. By then, an offline optimal algorithm must have sent at least one update from the observer to the tracker (as g(t) must hold a value from R to represent f(t) for t ∈ [ts, te]), which takes at least h + 1 messages. Hence, in any such round [ts, te], with R = ∅ at te, C-OPT(h) = O(h), but A has to send at least Ω((h + 1)/2 · log Δ) messages, which completes the proof.

3.4 The Broom Case

A base case for distributed online tracking is the "broom" topology, as shown in Figure 3.1(a). A broom topology is an extension of the chain topology in which m observers (instead of only one) are connected to the first relay node. As before, we denote the ith relay node as n_i, and n1 is the first relay node, which connects directly to the m observers. In fact, a broom topology reduces to a chain topology when m = 1. Since there are m functions, one from each observer, an important distinction is that the function to be tracked is computed from these m functions. Specifically, the goal is to track f(t), where f(t) = f(f1(t), ..., fm(t)) for some function f, at T subject to an error threshold Δ. Clearly, the design of online tracking algorithms in this case has to depend on the function f. We focus on the max aggregate function in this work, and discuss other aggregate functions in Section 3.6. Hence, in subsequent discussions, f(t) = max(f1(t), ..., fm(t)) and g(t) must be in the range [f(t) − Δ, f(t) + Δ] at T, for any time instance t.
3.4.1 A Baseline Method

A baseline method is to have T track each function fi(t) within fi(t) ± Δ for i ∈ [1, m] using a function gi(t). The tracker computes g(t) = max(g1(t), ..., gm(t)) for any time instance t. For the ith function, this can be done by using the CHAINTRACKO method on a chain online tracking instance, where the chain is the path from observer si to tracker T. Given that gi(t) ∈ [fi(t) − Δ, fi(t) + Δ] for all i ∈ [1, m], it is trivial to show that g(t) ∈ [f(t) − Δ, f(t) + Δ]. We denote this approach as the m-CHAIN method.

3.4.2 Improvement

Recall that CHAINTRACKO allocates the entire error threshold to the first relay node n1 in its chain; all other relay nodes simply forward every update that arrives at n1 (from observer s) to T. Hence, in the m-CHAIN method, it is n1 that actually tracks g1(t), ..., gm(t), and n1 simply passes every update received for any gi(t) through the chain to T. This clearly generates excessive communication. In light of this, we consider a class Abroom of online algorithms for broom online tracking, as follows:

1. Every node u in a broom topology keeps a value yu(t), which represents u's knowledge of f(t) in the subtree rooted at u at time t. For a leaf node u (an observer), yu(t) is simply its function value fu(t).

2. Each leaf node u's function is tracked by its parent v within error Δ using gu(t), i.e., |gu(t) − fu(t)| ≤ Δ for every time instance t. Note that gu(t) does not need to equal fu(t). Specifically, a leaf u sends a new value gu(t) to its parent v only when |gu(tlast) − fu(t)| > Δ, where gu(tlast) is the previous update u sent to v.

Note that in both the broom and tree models, we do not analyze the competitive ratio (cratio) of their online algorithms. The reason is that in a broom or tree topology, since the offline optimal algorithm offline can see the entire input instance in advance, offline can "communicate" between leaf nodes for free; these are observers that are distributed in the online case.
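A small simulation makes the division of labor in this class concrete. The sketch below is our own illustration (function and variable names are ours, and the message accounting is an assumption chosen to match the m + h + 1 count used later in the text): it implements the leaf rule from item 2 for f = max, and lets n1 forward the maximum of the latest leaf reports toward the tracker only when that maximum changes.

```python
def broom_max_track(traces, delta, h):
    """Simulate the A_broom leaf protocol for f = max: each leaf re-reports
    its value when it drifts more than delta from its last report; the first
    relay n1 recomputes the max of the reports and pushes it toward the
    tracker (counted as h + 1 messages) only when the max changes.
    Returns (leaf_msgs, relay_msgs)."""
    m = len(traces)
    last_report = [None] * m         # g_u(t_last) for each leaf u
    last_max = None                  # last y_1 value sent toward the tracker
    leaf_msgs = relay_msgs = 0
    for t in range(len(traces[0])):
        for i in range(m):
            v = traces[i][t]
            if last_report[i] is None or abs(v - last_report[i]) > delta:
                last_report[i] = v   # leaf i sends an update to n1
                leaf_msgs += 1
        cur = max(last_report)       # n1's knowledge y_1(t)
        if cur != last_max:
            last_max = cur
            relay_msgs += h + 1      # pushed along the chain to T
    return leaf_msgs, relay_msgs

# Two observers over three time steps; only the initial values cost messages.
print(broom_max_track([[0, 0, 0], [5, 5, 5]], delta=1, h=2))  # (2, 3)
```

In contrast, the m-CHAIN baseline would relay every leaf update the full length of the chain, even when the update cannot change the maximum.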
As a result, there always exists an input instance on which the performance gap between an online algorithm and offline is infinitely large. Hence, in the following discussion, we analyze the performance of an online algorithm using the concept of ratio, as defined in Section 3.2.1, with respect to the class Abroom. In a broom topology, we use y_i(t) to denote yu(t) for the node u that is the ith relay node n_i.

Lemma 8. Any algorithm A ∈ Abroom must track the functions f1(t), ..., fm(t) with an error threshold that is exactly Δ at the first relay node n1 in order to minimize ratio(A).

Proof. Assume that the claim does not hold. This means there exists an algorithm A ∈ Abroom that allows node n1 to track at least one function fi(t) using an error threshold that is less than Δ. Without loss of generality, consider i = 1. In this case, we can show that ratio(A) = +∞ by constructing an input instance I as follows. In this instance I, fj(t) = −2Δ for j = 2, ..., m and any time instance t, while f1(t) = (−1)^(t mod 2) · Δ. In other words, f1(t) alternates between Δ and −Δ, and all other functions are set to the constant −2Δ. In this case, clearly, f(t) = max(f1(t), ..., fm(t)) = f1(t) for all time instances t. But since the error threshold allocated to n1 for f1 is less than Δ, A needs to send an infinite number of messages as t goes to +∞, no matter how it designs its online tracking algorithm between n1 and s1. In contrast, the optimal online algorithm for this particular instance only needs to send (m + h + 1) messages, at the first time instance. This algorithm A*_I sets g1(t1) = f1(t1) + Δ, and gi(t1) = fi(t1) for all i > 1, at the first time instance t1. It then asks each observer si to send an update to n1 if and only if, at a time instance t, fi(t) has changed by more than Δ from its last communicated value to n1. At any time instance, n1 sends y1(t) = max(g1(t), ..., gm(t)) to T, through the chain defined by n2, ..., nh, if y1(t) differs from the last value communicated from n1 to T.
It is easy to verify that this algorithm belongs to Abroom and works correctly on all possible instances. On the above input instance, at the first time instance t1 it sends g1(t1) = 0 and g2(t1) = ... = gm(t1) = −2Δ to n1 from observers s1, ..., sm (and n1 forwards only y1(t1) = 0 to T). It incurs no further communication at any subsequent time instance; hence, its communication cost is (m + h + 1). Thus, on this input instance I, ratio(A, I) = cost(A, I) / cost(A*_I, I) = +∞. As a result, ratio(A) = +∞. Note that the optimal algorithm for a particular input instance may perform badly on other input instances. The definition of ratio quantifies the worst-case difference between the performance of an online algorithm A and the best possible performance on a valid input instance. Lemma 8 implies tha
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6768pm4 |



