Page 113


Publication Type	technical report
School or College	College of Engineering
Department	Computing, School of
Creator	Lin, Frank Chung Huei
Title	Load balancing and fault tolerance in applicative systems
Date	1985-08
Description	Applicative systems are promising candidates to achieve high performance computing through aggregation of processors. This dissertation studies two important issues in building scalable applicative systems: load balancing and fault tolerance. A dynamic load balancing scheme is proposed for large scale applicative systems. The method is based on a demand-driven approach, the gradient model, which transfers excessive tasks to the nearest idle processor via a gradient surface. The gradient surface is established by the demands from idle processors. The algorithm is fully distributed and asynchronous. A global balance is achieved by successive refinements of many localized balances. The gradient model is independent of system topology and can easily accommodate heterogeneous multiprocessor systems. Simulations have shown that the gradient model performs reasonably well. The concept of functional checkpointing is proposed as the nucleus of a distributed recovery mechanism. This entails incrementally building a resilient structure as the evaluation of an applicative program proceeds. A simple rollback algorithm is suggested to regenerate the corrupted structure by the most effective functional checkpoints. Another algorithm, which attempts to recover all intermediate results, is also presented. The parent of a faulty task reproduces a functional twin of the failed task. The regenerated task inherits all offspring of the faulty task so that partial results can be salvaged.
Type	Text
Subject	computer architecture; load balancing; fault tolerance; computer science; applicative systems
Language	eng
Bibliographic Citation	Lin, FCH. (1985). Load balancing and fault tolerance in applicative systems. UUCS-85-118.
Series	University of Utah Computer Science Technical Report
Relation is Part of	ARPANET
Format Medium	application/pdf
Format Extent	52,468,022 bytes
File Name	Lin-Load_Balancing.pdf
Conversion Specifications	Original scanned with Kirtas 2400 and saved as 400 ppi uncompressed TIFF. PDF generated by Adobe Acrobat Pro X for CONTENTdm display
ARK	ark:/87278/s6ck0fp4
Setname	ir_computersa
ID	99648
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6ck0fp4

Page Metadata

Title	Page 113
Setname	ir_computersa
ID	99625
OCR Text	traversing overhead. A wrap-around list or threaded link tree [39] is also very difficult to implement in a dynamic evaluation environment like applicative systems.
3.4.6.2 Multiple faults.
Both the rollback and splicing recoveries use functional checkpoints to tolerate one or more hardware failures. Although a single node failure is assumed throughout the discussion, the rollback recovery is clearly not limited to tolerating a single node failure. In the rollback algorithm, the difference between multiple faults and a single fault is the placement of the recovery border in the evaluation graph.
The splicing recovery can handle some combinations of multiple faults gracefully. For example, multiple failures on different branches of a structure do not disturb the recovery algorithm at all; separate recoveries take place at different parts of the program in parallel. However, if both the parent and grandparent of a task fail simultaneously, the orphan task would be stranded. The resilient structure concept can be extended to include pointers to the great-grandparent and beyond to tolerate multiple failures on one branch of the graph. But the preferred alternative is to consider the simple rollback recovery as a fall-back algorithm for the splice recovery. In case the splice algorithm fails to retrieve some partial results, the system can still redo the whole branch of the program.
Blending both recovery methods is a straightforward exercise. Both algorithms can run harmoniously without any interference. A parent task may regenerate a step-child when the parent detects a failure of a child node (rollback recovery). When a partial result comes back from a grandchild, it is transported to the step-child (splice recovery). If multiple faults prevent any partial result from being retrieved, the step-child generated by the rollback continues independently.
3.4.6.3 Hardware redundancy.
In a hardware redundant fault tolerant system, several redundant machines execute an identical program on replicated data objects. The results produced by these machines are compared with each
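The blended recovery of Section 3.4.6.2 can be illustrated with a minimal, single-process Python sketch. It is not the dissertation's implementation: the names `Task`, `leaf`, `evaluate`, and `recover` are hypothetical, failure detection and message passing are omitted, and the parent's functional checkpoint is assumed to be simply a retained description of the failed child (its function and its subtasks). The parent regenerates a step-child from that checkpoint; any partial results delivered by orphaned grandchildren are spliced in, and whatever is missing is recomputed by rollback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    """A node of the evaluation graph (illustrative model only)."""
    name: str
    func: Callable[[List[int]], int]        # pure function of the child results
    children: List["Task"] = field(default_factory=list)


def leaf(name: str, value: int) -> Task:
    """A task with no children that yields a constant."""
    return Task(name, lambda _args, v=value: v)


def evaluate(task: Task, salvaged: Dict[str, int], recomputed: List[str]) -> int:
    """Evaluate `task`, reusing a salvaged partial result when one exists."""
    if task.name in salvaged:
        return salvaged[task.name]          # splice: reuse the grandchild's result
    recomputed.append(task.name)            # rollback: redo this part of the branch
    args = [evaluate(child, salvaged, recomputed) for child in task.children]
    return task.func(args)


def recover(checkpoint: Task, partial_results: Dict[str, int]) -> Tuple[int, List[str]]:
    """Parent-side recovery of a failed child.

    `checkpoint` is the parent's retained description of the failed child.
    `partial_results` maps grandchild names to results that made it back;
    it is empty when multiple faults prevented any result from arriving.
    """
    # The step-child is a functional twin that inherits the failed task's offspring.
    step_child = Task(checkpoint.name + "'", checkpoint.func, checkpoint.children)
    recomputed: List[str] = []
    value = evaluate(step_child, partial_results, recomputed)
    return value, recomputed


# Hypothetical example: task B = D + E fails after D has already reported 3.
failed_b = Task("B", sum, [leaf("D", 3), leaf("E", 4)])
print(recover(failed_b, {"D": 3}))   # splice: only E is recomputed
print(recover(failed_b, {}))         # fall-back: the whole branch is redone
```

In both calls the step-child produces the same value, because the tasks are pure functions; the only difference is how much of the branch has to be re-evaluated. That property is what allows the two recoveries to run together without interfering.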
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6ck0fp4/99625