Exploring the landscape of clusterings

Exploring the landscape of clusterings

Publication Type	dissertation
School or College	College of Engineering
Department	Computing
Author	Raman, Parasaran
Title	Exploring the landscape of clusterings
Date	2013-12
Description	With the tremendous growth of data produced in the recent years, it is impossible to identify patterns or test hypotheses without reducing data size. Data mining is an area of science that extracts useful information from the data by discovering patterns and structures present in the data. In this dissertation, we will largely focus on clustering which is often the first step in any exploratory data mining task, where items that are similar to each other are grouped together, making downstream data analysis robust. Different clustering techniques have different strengths, and the resulting groupings provide different perspectives on the data. Due to the unsupervised nature i.e., the lack of domain experts who can label the data, validation of results is very difficult. While there are measures that compute "goodness" scores for clustering solutions as a whole, there are few methods that validate the assignment of individual data items to their clusters. To address these challenges we focus on developing a framework that can generate, compare, combine, and evaluate different solutions to make more robust and significant statements about the data. In the first part of this dissertation, we present fast and efficient techniques to generate and combine different clustering solutions. We build on some recent ideas on efficient representations of clusters of partitions to develop a well founded metric that is spatially aware to compare clusterings. With the ability to compare clusterings, we describe a heuristic to combine different solutions to produce a single high quality clustering. We also introduce a Markov chain Monte Carlo approach to sample different clusterings from the entire landscape to provide the users with a variety of choices. In the second part of this dissertation, we build certificates for individual data items and study their influence on effective data reduction. We present a geometric approach by defining regions of influence for data items and clusters and use this to develop adaptive sampling techniques to speedup machine learning algorithms. This dissertation is therefore a systematic approach to study the landscape of clusterings in an attempt to provide a better understanding of the data.
Type	Text
Publisher	University of Utah
Subject	Clustering; Data mining; Machine learning; Sampling
Dissertation Institution	University of Utah
Dissertation Name	Doctor of Philosophy
Language	eng
Rights Management	Copyright © Parasaran Raman 2013
Format Medium	application/pdf
Format Extent	2,294,192 bytes
Identifier	etd3/id/2699
ARK	ark:/87278/s6254sct
Setname	ir_etd
ID	196274
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6254sct

Back to Search Results