Towards building efficient error detectors for improving system resilience

Title Towards building efficient error detectors for improving system resilience
Publication Type dissertation
School or College College of Engineering
Department Computing
Author Sharma, Vishal Chandra
Date 2017
Description Current scaling trends in transistor technology, which pursue larger component counts and better power efficiency, are making hardware increasingly less reliable. With extreme transistor miniaturization, it is becoming easier to flip a bit stored in memory elements built from these transistors. Soft errors, which cause transient bit-flips in memory elements when alpha particles and cosmic rays strike those elements, have become one of the major impediments to system resilience as we move towards exascale computing. Soft errors that escape the hardware layer may silently corrupt the runtime data of an application, causing silent data corruption in the output. Moreover, because soft errors are transient, it is notoriously hard to trace them back to their origins. Therefore, techniques to enhance system resilience hinge on the availability of efficient error detectors that have high detection rates, low false positive rates, and low computational overhead. It is equally important to have a flexible infrastructure capable of simulating realistic soft error models to support effective evaluation of newly developed error detectors. In this work, we present a set of techniques for efficiently detecting soft errors affecting control flow, data, and structured address computations in an application. We evaluate the efficacy of the proposed techniques on a collection of benchmarks through fault-injection-driven studies. To meet this evaluation requirement, we also introduce two new LLVM-based fault injectors, KULFI and VULFI, which target scalar and vector architectures, respectively. Through this work, we aim to contribute to the system resilience community by making our research tools (error detectors and fault injectors) publicly available.
Type Text
Publisher University of Utah
Subject Fault Injection; Fault Tolerance; Resilience; Soft Error Detectors; Computer science
Dissertation Name Doctor of Philosophy
Language eng
Rights Management (c) Vishal Chandra Sharma
Format application/pdf
Format Medium application/pdf
ARK ark:/87278/s6nk7kcw
Setname ir_etd
ID 1347745
Reference URL https://collections.lib.utah.edu/ark:/87278/s6nk7kcw