Using similarity in content and access patterns to improve space efficiency and performance in storage systems

Using similarity in content and access patterns to improve space efficiency and performance in storage systems

Publication Type	dissertation
School or College	College of Engineering
Department	Computing
Author	Lin, Xing
Title	Using similarity in content and access patterns to improve space efficiency and performance in storage systems
Date	2015-12
Description	In the past few years, we have seen a tremendous increase in digital data being generated. By 2011, storage vendors had shipped 905 PB of purpose-built backup appliances. By 2013, the number of objects stored in Amazon S3 had reached 2 trillion. Facebook had stored 20 PB of photos by 2010. All of these require an efficient storage solution. To improve space efficiency, compression and deduplication are being widely used. Compression works by identifying repeated strings and replacing them with more compact encodings while deduplication partitions data into fixed-size or variable-size chunks and removes duplicate blocks. While we have seen great improvements in space efficiency from these two approaches, there are still some limitations. First, traditional compressors are limited in their ability to detect redundancy across a large range since they search for redundant data in a fine-grain level (string level). For deduplication, metadata embedded in an input file changes more frequently, and this introduces more unnecessary unique chunks, leading to poor deduplication. Cloud storage systems suffer from unpredictable and inefficient performance because of interference among different types of workloads. This dissertation proposes techniques to improve the effectiveness of traditional compressors and deduplication in improving space efficiency, and a new IO scheduling algorithm to improve performance predictability and efficiency for cloud storage systems. The common idea is to utilize similarity. To improve the effectiveness of compression and deduplication, similarity in content is used to transform an input file into a compression- or deduplication-friendly format. We propose Migratory Compression, a generic data transformation that identifies similar data in a coarse-grain level (block level) and then groups similar blocks together. It can be used as a preprocessing stage for any traditional compressor. We find metadata have a huge impact in reducing the benefit of deduplication. To isolate the impact from metadata, we propose to separate metadata from data. Three approaches are presented for use cases with different constrains. For the commonly used tar format, we propose Migratory Tar: a data transformation and also a new tar format that deduplicates better. We also present a case study where we use deduplication to reduce storage consumption for storing disk images, while at the same time achieving high performance in image deployment. Finally, we apply the same principle of utilizing similarity in IO scheduling to prevent interference between random and sequential workloads, leading to efficient, consistent, and predictable performance for sequential workloads and a high disk utilization.
Type	Text
Publisher	University of Utah
Subject	cloud storage; compression; deduplication; storage system; testbed
Dissertation Institution	University of Utah
Dissertation Name	Doctor of Philosophy
Language	eng
Rights Management	Copyright © Xing Lin 2015
Format Medium	application/pdf
Format Extent	27,803 bytes
Identifier	etd3/id/4041
ARK	ark:/87278/s6zp7ffz
Setname	ir_etd
ID	197591
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6zp7ffz

Back to Search Results