Page 85


Publication Type	technical report
School or College	College of Engineering
Department	Computing, School of
Creator	Lin, Frank Chung Huei
Title	Load balancing and fault tolerance in applicative systems
Date	1985-08
Description	Applicative systems are promising candidates to achieve high performance computing through aggregation of processors. This dissertation studies two important issues in building scalable applicative systems: load balancing and fault tolerance. A dynamic load balancing scheme is proposed for large scale applicative systems. The method is based on a demand-driven approach, the gradient model, which transfers excess tasks to the nearest idle processor via a gradient surface. The gradient surface is established by the demands from idle processors. The algorithm is fully distributed and asynchronous. A global balance is achieved by successive refinements of many localized balances. The gradient model is independent of system topology and can easily accommodate heterogeneous multiprocessor systems. Simulations have shown that the gradient model performs reasonably well. The concept of functional checkpointing is proposed as the nucleus of a distributed recovery mechanism. This entails incrementally building a resilient structure as the evaluation of an applicative program proceeds. A simple rollback algorithm is suggested to regenerate the corrupted structure from the most effective functional checkpoints. Another algorithm, which attempts to recover all intermediate results, is also presented. The parent of a faulty task reproduces a functional twin of the failed task. The regenerated task inherits all offspring of the faulty task so that partial results can be salvaged.
Type	Text
Subject	computer architecture; load balancing; fault tolerance; computer science; applicative systems
Language	eng
Bibliographic Citation	Lin, FCH. (1985). Load balancing and fault tolerance in applicative systems. UUCS-85-118.
Series	University of Utah Computer Science Technical Report
Relation is Part of	ARPANET
Format Medium	application/pdf
Format Extent	52,468,022 bytes
File Name	Lin-Load_Balancing.pdf
Conversion Specifications	Original scanned with Kirtas 2400 and saved as 400 ppi uncompressed TIFF. PDF generated by Adobe Acrobat Pro X for CONTENTdm display
ARK	ark:/87278/s6ck0fp4
Setname	ir_computersa
ID	99648
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6ck0fp4

Page Metadata

Title	Page 85
Setname	ir_computersa
ID	99597
OCR Text	73

3.1.3 Predictable processor behavior

Behavior of a faulty node has significant impact on fault-tolerant techniques. A very liberal fault behavior model is the Byzantine generals problem [41, 59]: A faulty node may pretend that it is healthy. The failed processor may respond to inquiries from other nodes with erroneous but valid messages. The other extreme of modeling fault behavior is to require that malfunctioning processors inform the rest of a system as soon as an internal fault has been identified. The assumption is normally enforced by extensive hardware redundancy.

It is assumed that processor behaviors are "predictable." If a processor fails, it will not transmit any valid message. This assumption can be enforced by commanding a faulty node to keep silent and not to respond to any inquiry. Or, a faulty node may answer an inquiry with an invalid message.

Several techniques are available for a processor to determine node malfunctioning. Parity checking on the system bus or resident memory, illegal instruction trap, protection violation, or a subsystem breakdown may trigger the CPU to report a processor failure. Duplication of processors within a node, or passive node diagnosis [47], is also a common technique to build a self-checking node.

3.1.4 Reliable communication channel

The switching network is another possible source of system failures. Fault-tolerance of computer networks has been an active research subject in the parallel processing area. Most fault-tolerant networks [48] were proposed to provide redundant paths such that a healthy processor could easily reroute messages through an alternate communication channel.

In this study, it is assumed that a processor makes the best effort to communicate with a destination node. If the destination cannot be reached due to a network problem, the node is considered faulty. Problems with the interconnection network can be detected via coding or timeout mechanisms [5].
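
The two assumptions above (a failed processor never sends a valid message, and an unreachable destination is treated as faulty) can be illustrated with a minimal Python sketch. It is not taken from the report: the CRC-32 framing stands in for the unspecified "coding" mechanism, and the names `encode`, `is_valid`, and `node_is_faulty` are hypothetical. A reply of `None` represents an inquiry that timed out.

```python
import zlib
from typing import Optional


def encode(payload: bytes) -> bytes:
    """Append a CRC-32 checksum so a receiver can detect corruption (assumed framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_valid(message: bytes) -> bool:
    """A message counts as valid only if its trailing CRC-32 matches its payload."""
    if len(message) < 4:
        return False
    payload, checksum = message[:-4], message[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == checksum


def node_is_faulty(reply: Optional[bytes]) -> bool:
    """Classify a node from its reply to an inquiry.

    reply is None when the inquiry timed out, i.e. the node kept silent
    or could not be reached through the network.
    """
    return reply is None or not is_valid(reply)
```

Under this sketch a silent node, an unreachable node, and a node that answers with an invalid message are all classified the same way. A Byzantine node that sends an erroneous but well-formed message would pass the check, which is why the predictable-behavior assumption is needed.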
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6ck0fp4/99597