| Title | Area, power and performance optimization algorithms for elastic circuit control networks |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Electrical & Computer Engineering |
| Author | Kilada, Eliyah Wadie Ragi |
| Date | 2012-05 |
| Description | Elasticity is a design paradigm in which circuits can tolerate arbitrary latency/delay variations in their computation units as well as communication channels. Creating elastic (both synchronous and asynchronous) designs from clocked designs has potential benefits of increased modularity and robustness to variations. Several transformations have been suggested in the literature, and each of these requires a handshake control network (examples include synchronous elasticization and desynchronization). Elastic control network area and power overheads may become prohibitive. This dissertation investigates different optimization avenues to reduce these overheads without sacrificing control network performance. First, an algorithm and a tool, CNG, is introduced that generates a control network with a minimal total number of join and fork control steering units. Synchronous ELastic Flow (SELF) is a handshake protocol used over synchronous elastic designs. Compared to its standard eager implementation (which uses eager forks, EForks), lazy SELF can consume less power and area. However, it typically suffers from combinational cycles and can have inferior performance in some systems. Hence, lazy SELF has been rarely studied in the literature. This work formally and exhaustively investigates the specifications, different implementations, and verification of the lazy SELF protocol. Furthermore, several new and existing lazy designs are mapped to hybrid eager/lazy implementations that retain the performance advantage of the eager design but have the power and area advantages of lazy implementations, and are combinational-cycle free. This work also introduces a novel ultra simple fork (USFork) design. The USFork has two advantages over lazy forks: it is composed of simpler logic (just wires) and does not form combinational cycles. The conditions under which an EFork can be replaced by a USFork without any performance loss are formally derived. The last optimization avenue discussed in this dissertation is Elastic Buffer Controller (EBC) merging. In a typical synchronous elastic control network, some EBCs may activate their corresponding latches at similar schedules. This work provides a framework for finding and merging such controllers in any control network, including open networks (i.e., when the environment abstract is not available or is required to be flexible) as well as networks incorporating variable latency units. Replacing EForks with USForks under some equivalence conditions, as well as EBC merging, have been fully automated in a tool, HGEN. The impact of this work will help achieve elasticity at a reduced cost. It will broaden the class of circuits that can be elasticized with acceptable overhead (circuits that designers would otherwise find too expensive to elasticize). In a MiniMIPS processor case study, compared to a basic control network implementation, the optimization techniques of this dissertation cumulatively achieve reductions in the control network area, dynamic power, and leakage power of 73.2%, 68.6%, and 69.1%, respectively. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Control network optimization; elastic circuits; hybrid SELF implementation; latency-insensitive design; lazy SELF implementation; verification |
| Dissertation Institution | University of Utah |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | Copyright © Eliyah Wadie Ragi Kilada 2012 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 10,437,929 bytes |
| Identifier | us-etd3,84569 |
| Source | Original in Marriott Library Special Collections, QA3.5 2012 .K55 |
| ARK | ark:/87278/s6668tzz |
| DOI | https://doi.org/10.26053/0H-BR2F-HCG0 |
| Setname | ir_etd |
| ID | 194696 |
| OCR Text | AREA, POWER AND PERFORMANCE OPTIMIZATION ALGORITHMS FOR ELASTIC CIRCUIT CONTROL NETWORKS

by Eliyah Wadie Ragi Kilada

A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Department of Electrical and Computer Engineering
The University of Utah
May 2012

Copyright © Eliyah Wadie Ragi Kilada 2012. All Rights Reserved.

THE UNIVERSITY OF UTAH GRADUATE SCHOOL
STATEMENT OF DISSERTATION APPROVAL

The dissertation of Eliyah Wadie Ragi Kilada has been approved by the following supervisory committee members: Kenneth S. Stevens, Chair (date approved 01/12/2012); Chris J. Myers, Member (date approved 01/04/2012); Erik Brunvand, Member (date approved 01/12/2012); Priyank Kalla, Member (date approved 01/12/2012); and Michael Kishinevsky, Member (date approved 01/12/2012); and by Gianluca Lazzi, Chair of the Department of Electrical and Computer Engineering, and by Charles A. Wight, Dean of The Graduate School.

ABSTRACT

Elasticity is a design paradigm in which circuits can tolerate arbitrary latency/delay variations in their computation units as well as communication channels. Creating elastic (both synchronous and asynchronous) designs from clocked designs has potential benefits of increased modularity and robustness to variations. Several transformations have been suggested in the literature, and each of these requires a handshake control network (examples include synchronous elasticization and desynchronization). Elastic control network area and power overheads may become prohibitive. This dissertation investigates different optimization avenues to reduce these overheads without sacrificing control network performance. First, an algorithm and a tool, CNG, is introduced that generates a control network with a minimal total number of join and fork control steering units. Synchronous ELastic Flow (SELF) is a handshake protocol used over synchronous elastic designs. Compared to its standard eager implementation (which uses eager forks, EForks), lazy SELF can consume less power and area. However, it typically suffers from combinational cycles and can have inferior performance in some systems. Hence, lazy SELF has been rarely studied in the literature. This work formally and exhaustively investigates the specifications, different implementations, and verification of the lazy SELF protocol. Furthermore, several new and existing lazy designs are mapped to hybrid eager/lazy implementations that retain the performance advantage of the eager design but have the power and area advantages of lazy implementations, and are combinational-cycle free. This work also introduces a novel ultra simple fork (USFork) design. The USFork has two advantages over lazy forks: it is composed of simpler logic (just wires) and does not form combinational cycles. The conditions under which an EFork can be replaced by a USFork without any performance loss are formally derived. The last optimization avenue discussed in this dissertation is Elastic Buffer Controller (EBC) merging. In a typical synchronous elastic control network, some EBCs may activate their corresponding latches at similar schedules. This work provides a framework for finding and merging such controllers in any control network, including open networks (i.e., when the environment abstract is not available or is required to be flexible) as well as networks incorporating variable latency units.
Replacing EForks with USForks under some equivalence conditions, as well as EBC merging, have been fully automated in a tool, HGEN. The impact of this work will help achieve elasticity at a reduced cost. It will broaden the class of circuits that can be elasticized with acceptable overhead (circuits that designers would otherwise find too expensive to elasticize). In a MiniMIPS processor case study, compared to a basic control network implementation, the optimization techniques of this dissertation cumulatively achieve reductions in the control network area, dynamic power, and leakage power of 73.2%, 68.6%, and 69.1%, respectively.

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
ACKNOWLEDGEMENTS

CHAPTERS
1. INTRODUCTION
  1.1 Background And Motivations
    1.1.1 What Is Elasticity?
    1.1.2 Why Elasticity?
    1.1.3 Elasticization: Converting a Normally Clocked System into Elastic
  1.2 Elasticity Overhead
  1.3 List of Contributions
  1.4 This Dissertation Structure
2. SYNCHRONOUS ELASTICIZATION AND THE MINIMIPS CASE STUDY
  2.1 Synchronous Elastic Architectures
  2.2 MiniMIPS Case Study and Results
    2.2.1 Elasticizing the MiniMIPS
    2.2.2 Case Study Evaluation
    2.2.3 Optimization Avenues
3. CONTROL NETWORK GENERATOR FOR ELASTIC CIRCUITS
  3.1 Problem Definition
  3.2 The Algorithm
    3.2.1 Algorithm Overview
    3.2.2 Step I: Construct the Potential Terms
    3.2.3 Step II: Construct the Partial Solutions
    3.2.4 Step III: Collect Space Metrics and Remove Higher nAJ Partial Solutions
    3.2.5 Step IV: Divide, Refine the Search Space and Find an Optimum Solution
    3.2.6 OptSoln Check
  3.3 Results
    3.3.1 CNG Tool
    3.3.2 Case Study: The MiniMIPS
    3.3.3 Different PTermS Construction Methods
    3.3.4 CNG vs. Other Synthesis Tools/Flows
4. LAZY AND HYBRID SELF PROTOCOL IMPLEMENTATIONS
  4.1 SELF Channel Protocol Verification
  4.2 SELF Control Network Design
  4.3 Fork Components
    4.3.1 Lazy Fork
    4.3.2 Eager Fork
  4.4 Lazy Join
    4.4.1 Lazy Join Synthesis
    4.4.2 Lazy Join Verification
    4.4.3 Lazy Join Characterization
  4.5 Lazy SELF Networks
    4.5.1 Deadlock (D)
    4.5.2 Oscillation Due to Logical Instability (LI)
    4.5.3 Oscillation Due to Transient Instability (TI)
  4.6 Hybrid SELF Protocol
    4.6.1 Cycle Cutting
    4.6.2 Runtime Boosting
    4.6.3 Eager to Hybrid Conversion Flow
  4.7 MiniMIPS Case Study and Results
    4.7.1 Eager Versus Lazy SELF Implementations
    4.7.2 Eager Versus Hybrid SELF Implementations
5. UTILIZING THE ULTRA SIMPLE FORK AND CONTROLLER MERGING
  5.1 Eager to Ultra Simple Fork Transformation
    5.1.1 Eager SELF Protocol
    5.1.2 Eager Fork State Diagram
    5.1.3 Input Behavior Constraints
    5.1.4 Verification
    5.1.5 Multi-output-channel EForks
  5.2 Elastic Buffer Controller Merging
  5.3 Verification Models of Different Control Network Components
    5.3.1 n-Input Join
    5.3.2 n-Output Fork
    5.3.3 Elastic Buffer Controller
    5.3.4 SELF Input Channel
    5.3.5 SELF Output Channel
    5.3.6 Variable Latency Unit
  5.4 HGEN Tool
  5.5 Results
    5.5.1 The MiniMIPS Processor
    5.5.2 S382
6. CONCLUSION AND FUTURE WORK
  6.1 Future Work
    6.1.1 CNG
    6.1.2 HGEN

APPENDICES
A. HEURISTICS TO CUT CNG RUNTIME FOR BIG PROBLEMS
B. ELIMINATING NEGATIVE SLACK IN SYNCHRONOUS ELASTIC CONTROL NETWORKS

REFERENCES

LIST OF FIGURES
1.1 Sample read-modify-write memory structure.
1.2 Converting a clocked system into elastic.
2.1 An EB implementation.
2.2 SELF channel protocol.
2.3 An n-to-1 lazy join.
2.4 A 1-to-n EFork.
2.5 Block diagram of the ordinary clocked MiniMIPS.
2.6 Hand-optimized control network of the elastic clocked MiniMIPS.
2.7 Fabricated chips schmoo plots.
3.1 Two possible implementations of Example 3.1.
3.2 A sample control network of Example 3.2.
3.3 A solution graph for Example 3.2, solution of Eq. 3.2.
3.4 Rule I.
3.5 Rule II.
3.6 Rule V.
3.7 First and second iterations for Example 3.30 using Method IV.
3.8 First and second iterations for Example 3.31 using Method IV.
3.9 CNG-optimized control network of the elastic clocked MiniMIPS.
3.10 ProOverlap 5 1 example: CNG vs. DC.
3.11 ProOverlap 9 1 example: CNG vs. DC.
4.1 Vr1 of LF01.
4.2 A 1-to-n lazy fork (maps to LF00).
4.3 Lazy fork specifications (Vr1).
4.4 Lazy fork verification setup.
4.5 A 2-output LF01 implementation.
4.6 Lazy join specifications (Sl1).
4.7 Lazy join verification setup.
4.8 A 2-input LJ1111 implementation.
4.9 A 2-input LJ1011 implementation.
4.10 Sample fork-join combinations.
4.11 LF00 and LJ1111 combination.
4.12 LF00 and LJ0000 combination.
4.13 Vr1 (or Vr2) of the EFork and LFork under some constrained input behavior.
4.14 EFork-LFork performance equivalence verification setup.
4.15 A sample structure where the eager protocol will have a runtime advantage over lazy.
4.16 Stall patterns at the branches of FC in the presence of bubbles.
4.17 Hybrid implementation of FC.
5.1 A 2-output-channel EFork.
5.2 The EFork state diagram.
5.3 A 2-output-channel USFork.
5.4 Vr1 (same for Vr2) in states s0 to s2.
5.5 Sl in states s0 to s2.
5.6 EFork-USFork equivalence verification setup.
5.7 Eager to hybrid transformation of multi-output forks.
5.8 EBC merging.
5.9 Illustration of elastic control network input and output channels.
5.10 A variable latency unit and a controller.
5.11 Control network of the elastic clocked MiniMIPS with register file bubbles.
5.12 S382.
6.1 A chart of the MiniMIPS control network area in different synchronous elastic implementations.
6.2 A chart of the MiniMIPS control network dynamic power in different synchronous elastic implementations.
B.1 Combining concatenated n-input and m-input joins.
B.2 Steps of rolling back fork FAB.
B.3 Rolling back an n-output fork through an m-input join.
B.4 The proposed flow.
B.5 Control network of Example B.2.
B.6 Verification setup for rolling back a fork.
B.7 Control network of the synchronous elastic version of s298.

LIST OF TABLES
2.1 Clocked and eager elastic MiniMIPS chip results.
3.1 Terms and PSs of Example 3.2.
3.2 PTermSi of different PTermS construction methods.
3.3 Search space reduction (in terms of number of Solns) for different methods.
3.4 CNG cost vs. other synthesis tools/flows.
4.1 Mapping between published lazy forks and joins and those of this work.
4.2 CFr computation of LF00.
4.3 CFt computation of LF00.
4.4 Lazy fork-join combination characterization.
4.5 Time required (in terms of #cycles) by the lazy and eager protocols to finish the testbench program in [1].
4.6 Area, power, and runtime of the MiniMIPS control network using different hybrid (eager/lazy) SELF implementations.
4.7 Elasticity area and power overheads of an all-eager and a hybrid (eager/lazy) SELF implementation of the MiniMIPS processor.
5.1 The EFork state table.
5.2 Area, power, and runtime of the MiniMIPS control network using different hybrid (eager/ultra-simple) SELF implementations with and without EBC merging.
5.3 HGEN results for the elastic MiniMIPS control network.
5.4 HGEN results for the s382 benchmark.
5.5 HGEN results for other ISCAS benchmarks, in open network settings.
6.1 Summary of results for some of the different MiniMIPS control network implementations introduced in this dissertation.
6.2 Elasticity area and power overheads of different hybrid SELF implementations of the MiniMIPS processor.
A.1 CNG cost vs. other synthesis tools/flows using heuristics.
B.1 Iteration 1 for Example B.2.
B.2 Example B.2 results.
B.3 MiniMIPS results.
B.4 S298 results.

LIST OF ACRONYMS
DI: Delay Insensitive design
LI: Latency Insensitive design
SELF: Synchronous ELastic Flow protocol
MIPS: Microprocessor without Interlocked Pipeline Stages
EFork: Eager Fork
LFork: Lazy Fork
USFork: Ultra Simple Fork
HFork: Hybrid Fork
LJoin: Lazy Join
EB: Elastic Buffer
EBC: Elastic Buffer Controller
CNG: Control Network Generator tool
HGEN: Hybrid GENerator tool

ACKNOWLEDGEMENTS

I am grateful to have had the chance to do research at the University of Utah. Special thanks go to my advisor and mentor, Ken Stevens, for providing confidence, experience, direction, and funding throughout these past three years. Thanks to all my committee members for the fruitful discussions and technical input. Thanks to Chris Myers and Priyank Kalla for the classes they taught me and to Erik Brunvand for his continuous support for the 0.5 µm UoU library and the tool flow. Thanks to Michael Kishinevsky for sharing his expertise in synchronous elastic architectures, and for providing a generic verilog model for variable latency unit modules. Thanks to Alan Mishchenko for his support with the ABC tool, to Ganesh Gopalakrishnan for his help with the 6thSense tool license, and to Suresh Venkatasubramanian for his insights on the CNG complexity. Bennion Redd has been very helpful to me while working on Verigy's V93000 SoC tester. I would also like to thank Shomit Das for his help in the place, route, and layout of the 0.5 µm chips. Thanks to the stackoverflow.com community for the valuable help with LaTeX. Thanks to Lori Sather for her administrative help. This material is part of work supported by the National Science Foundation under Grant No. 0810408. Finally, I am thankful for my friends and relatives who made my time in Utah and Canada most fruitful and enjoyable, and for my parents who sowed the seeds of faith, discipline, and love of knowledge in my life.

CHAPTER 1
INTRODUCTION

The dissertation problem statement is to reduce the power and area overheads of elastic system control networks without compromising performance.

1.1 Background And Motivations

1.1.1 What Is Elasticity?

Elasticity is a design paradigm in which circuits can tolerate arbitrary latency/delay variations in their computation units as well as communication channels [2, 3]. Different levels of elasticity exist.
Delay-Insensitive (DI) designs function correctly whatever the delay of their gates or wires [4]. Thus, DI designs provide the highest degree of elasticity. However, the number of circuits that can be implemented using the DI methodology is limited [5]. This dissertation will focus on the synchronous implementation of elasticity (also known as latency insensitive (LI) design) [8, 9, 10, 11]. Some of the algorithms introduced in this work can also be extended to asynchronous elasticity with bundled data (which, for short, may be referred to later as just asynchronous elasticity or desynchronization) [4, 6, 7]. LI designs can tolerate a discrete number (of clock cycles) of computation and communication latency variations, while asynchronous elasticity can tolerate finer delays.

1.1.2 Why Elasticity?

Elastic design provides advantages much needed in the nanometer era. Without loss of generality, and for ease of explanation, most of the following advantages will be illustrated through synchronous elasticity. Since LI design provides a discrete form of the finer asynchronous elasticity [3], these advantages naturally extend to the asynchronous implementation as well.

1. Provides tolerance for long interconnect latency variations and easier technology migration. The International Technology Roadmap for Semiconductors (ITRS) reported in 2009 that chip-long communication cannot be done in a single clock cycle any more [12]. Hence, interconnect pipelining is becoming a necessity. Interconnect delays are affected by many factors that may not be accurately estimated before the final layout (e.g., physical distance, metal layer used, crosstalk, etc.) [13, 14]. They also do not scale as well as logic gates [15, 16, 17]. Hence, due to technology migration or extra place-and-route delays, it is very likely to have interconnects that suffer different latencies than estimated at earlier stages of the design. Unless the design implements some kind of latency insensitive technique, severe changes may be required in the system to accommodate the new latencies and, possibly, a number of iterations [9, 17, 12]. This increases the time-to-market of a product. On the other hand, LI designs tolerate variations of interconnect latencies by inserting any required number of empty pipeline stages (called bubbles). This essentially cuts an interconnect into segments that meet the target timing constraints. By the definition of LI design, inserting empty pipeline stages does not affect the system functionality.

2. Provides easier latency/throughput tradeoff exploration. For either ordinary clocked designs or LI, architectural analysis is required to compute and optimize the impact of inserting pipeline stages on the overall system performance [18, 19, 20, 21]. Nonetheless, the LI methodology allows for easier exploration of latency/throughput tradeoffs, since the computational blocks can be left untouched while inserting interconnect pipelines [22]. This also allows for easier exploration of new architectures [23, 24, 25].

3. Provides more modular design and easier IP reuse. IP reuse is a key consideration for increased productivity in the current technology [12]. The LI methodology facilitates IP assembly and reuse in complex SoCs. It can tolerate variable interconnect latencies among IPs without the need to change them.

4. Is a natural fit for variable latency designs/interfaces, increasing performance by targeting the more frequent faster cases rather than the worst case. Some applications require flexible interfaces that can tolerate variable latencies. Examples include interfaces to variable latency ALUs, memories, or networks on chip [26, 27, 28, 29]. By its definition, the LI methodology naturally fits these applications. In fact, it has been reported that applying flexible latency design to the critical block of one Intel® SoC (H.264 CABAC) can achieve a 35% performance advantage [30]. Variable latency design aims at targeting average performance rather than the worst case. In particular, instead of optimizing a circuit for all corner cases, variable latency design optimizes the fast paths in l1 clock cycles, and the slow paths in l2 cycles (with l2 > l1). The average throughput increases as the probability of the input patterns that require the longer latency decreases [31]. Though variable latency design comes at an area overhead, trying to achieve the same performance with static latency may lead to an even bigger design to meet the tight timing target.

5. Enables pipelining cyclic systems, a goal that cannot be achieved by the standard bypass and retiming of regular clocked systems [23]. To illustrate, consider the Read-Modify-Write (RMW) memory structure of Fig. 1.1 (Figure 1.1: Sample read-modify-write memory structure). The memory structure supports three different operations (ops): read (rd), write (wr), and read-modify-write (rmw). An example of an rmw operation is updating a specific memory location through a modify function fM (e.g., fM(mem[adr1]) = mem[adr1] + 1). For simplicity, assume the ops arrive at the memory interface with a maximum rate of 1 operation per clock. Bypass logic is designed around the memory to guarantee that every read operation from a memory location gets the most recent data written to that location (also referred to as memory access coherency). With regular bypass design, and if back-to-back rmw operations (on the same memory address) are allowed, the modify function fM cannot have a latency of more than 1 clock cycle (i.e., cannot be pipelined); otherwise the output of fM may be required for a following operation while fM is still being executed. Thus, the standard bypass and retiming of regular clocked designs cannot pipeline fM in this cyclic system. This is a typical observation that I also noticed while designing and verifying memory bypass logic during my internship at Cisco Systems®, Canada (Jan - Jul, 2011). On the other hand, LI design is able to pipeline cyclic systems through its natural capability to tolerate variable latency and to stall. For example, in LI design fM can be pipelined to take any number of clock cycles of latency (to decrease the clock period, for example). Whenever the output of fM is required while it is still executing, LI designs provide the natural ability to propagate a stall signal back through the system until fM finishes execution. Moreover, whenever fM is not required, LI design (with an early evaluation join [32], for example) provides the ability to ignore the fM output such that the system will operate unstalled (i.e., with its normal latencies). Solving this design problem with synchronous elasticity using an early evaluation join is illustrated in [23].

6. Saves dynamic power by activating stages only when necessary. LI design provides fine-grained (per pipeline stage) clock gating based on dynamic data flow [8]. In LI designs, a stage is only activated when it is processing valid data and its downstream is not stalled. This can reduce the system's dynamic power consumption. However, an offset to this power saving is the power overhead of the handshake control network.

7. Avoids distribution of long stall signals that can be on critical paths. LI design also provides an upstream stage-based stall propagation mechanism with no overhead on the clock frequency. This avoids distribution of long global stall signals that can be on critical paths and can limit scalability [8, 23].

8. Asynchronous elastic designs provide low electro-magnetic interference (EMI) [6].

9. Asynchronous elastic designs provide finer and dynamic tracking of Process, Voltage, and Temperature (PVT) variations, allowing for better typical-case performance rather than worst case. Asynchronous elastic circuits synchronize through handshake signals (request/acknowledge) rather than a global clock. Hence, while the clock period of synchronous designs (and, in turn, their performance) is limited by the worst-case conditions (of process, voltage, and temperature variations), asynchronous designs dynamically track the PVT variations, providing better typical performance. The authors of [7] reported that a desynchronized DLX processor in a 90 nm process has a performance degradation of 20% compared to a clocked one when both operate under worst-case conditions. However, the desynchronized processor runs faster than the synchronous one 90% of the time. They also reported a 13.44% area overhead.

1.1.3 Elasticization: Converting a Normally Clocked System into Elastic

Because of the above advantages, converting an ordinary clocked system into an elastic one (also referred to as elasticization) has been frequently studied in the literature. Carloni et al. [2] introduced the concept of patient processes as a theoretical model for latency insensitive design (aka synchronous elastic design). Informally, a module is a patient process if its behavior is defined based on the order of signal events rather than their exact latencies [2]. Since then, several approaches have been proposed to convert a clocked circuit into an elastic one (in both its synchronous and asynchronous flavors). In all these approaches the resultant elastic system and the ordinary clocked system are flow equivalent. Two signals are flow equivalent if they exhibit the same sequence of informative events (i.e., after dropping all the empty events). Similarly, two systems are flow equivalent if, given flow equivalent input sequences, their outputs are flow equivalent [2, 33, 34]. Before going further through the different elasticization schemes, it is useful to consider the elasticization example shown in Fig. 1.2 (Figure 1.2: Converting a clocked system into elastic; (a) normally clocked, (b) elastic). Fig. 1.2a shows a synchronous circuit composed of registers A, E, G, and F connected through combinational logic (CL). A typical first step in an elasticization scheme is to replace each flip-flop (or possibly a group of them) in the original clocked system with a synchronization element (possibly double latches) enabled through a corresponding controller. (LID-2ss and LID-1ss, mentioned later in the chapter, are slightly different; however, the main concepts of Fig. 1.2 still apply to them.) Following this step, data communications among registers are analyzed. For each register-to-register data communication there must be a corresponding elastic control channel (shown in dotted lines in Fig. 1.2b) to control the data flow between these two registers. A control channel is usually composed of two signals, one in the forward direction indicating the data validity and the other in the backward direction carrying the stall information. These two signals are typically referred to as Valid/Stall and Req/Ack in synchronous and asynchronous elasticity, respectively.
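The flow-equivalence definition above reduces to a simple operational check on signal traces: drop the empty (nonvalid) events and compare what remains. Below is a minimal sketch, assuming a trace is represented as a list of (valid, data) pairs, one per clock cycle; the representation and names are illustrative, not taken from the dissertation:

```python
def informative(trace):
    """Keep only the informative events: cycles whose valid bit is set."""
    return [data for valid, data in trace if valid]

def flow_equivalent(trace_a, trace_b):
    """Two signals are flow equivalent if they carry the same sequence of
    informative events, however the bubbles and stalls interleave."""
    return informative(trace_a) == informative(trace_b)

# Same data stream under two different latency patterns.
a = [(1, 0xA5), (0, None), (1, 0x3C)]
b = [(0, None), (1, 0xA5), (1, 0x3C), (0, None)]
assert flow_equivalent(a, b)
```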
A network of control channels is formed where channels are connected through join and fork components. A join component is used to join two or more input channels into one output channel. Similarly, a fork component is used to fork one input channel into two or more output channels (both are shown in Fig. 1.2b). Implementations of the latch controllers, joins, forks, and channel protocol depend on the elasticization method.

On the asynchronous side, desynchronization was proposed to convert a normally clocked circuit into an asynchronous one [6, 7]. Desynchronized designs are synchronized through the regular asynchronous Req and Ack handshake signals rather than a universal clock. Bundled data protocols are normally used; examples include 4-phase, 2-phase, or single rail [4, 35]. For each register-to-register communication, delay elements are inserted in the control path to match the critical data path delay between these two registers. Thus, the request signals are delayed long enough for the data signals to arrive. This guarantees each receiving latch is not activated before the data is ready at its input. Latch controller protocol design and implementation are crucial to achieve maximum concurrency among latch controllers; otherwise a performance penalty can occur. Hence, different handshake protocols and latch controllers have been studied in the literature [36, 34, 37, 6, 35]. The matched delay elements keep track of their corresponding data path delays under different process, voltage, and temperature variations. Thus, desynchronized designs operate at typical performance rather than the worst case (as their clocked counterparts do). Algorithms have been developed for testing desynchronized circuits [38, 39, 40].

In the synchronous domain, an initial implementation of the latency insensitive design theorem was published in [22, 17, 41]. The initial implementation wraps normally clocked sequential modules inside latency insensitive wrappers (called pearls and shells, respectively). Channel latencies can be adjusted through what are called relay stations. The protocol requires a receiver to keep the Stall (also referred to as Stop) signal asserted for two consecutive clock cycles to stall the sender. Hence, the implementation was later referred to as Latency Insensitive Design with two-stop-to-stall (LID-2ss) [42]. To avoid data overflow, each shell contains (bypassable) input queues for each input of the corresponding pearl. The queues buffer the data tokens during stall conditions and are implemented with standard edge-triggered FIFOs [42].

The Synchronous Interlocked Pipeline (SIP) technique was introduced with two major differences compared to LID-2ss [8]. First, a stall condition is simpler, indicated by asserting the Stall signal for only one clock cycle. Second, instead of implementing external queues, SIP splits the same flip-flops used in the original clocked system into master-slave latches of opposite polarity and with separate enables. Under normal operation, the two latches have a one clock cycle forward latency (the same as an edge-triggered flip-flop). Under stall conditions, the two latches together have the capacity to carry two different data tokens while the stall signal is being propagated upstream if necessary. Thus, the SIP controllers consume less area than their LID-2ss counterparts [42].
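The two-latch capacity argument is easy to see in a small behavioral model. The sketch below is illustrative only; the class name and cycle semantics are assumptions, not the dissertation's gate-level implementation. An elastic buffer holds zero, one, or two tokens, asserts Valid when it has a token, and asserts Stall upstream only when both latches are occupied:

```python
class ElasticBuffer:
    """Behavioral model of a two-latch elastic buffer (Empty/Half/Full)."""
    def __init__(self):
        self.tokens = []                      # at most two data tokens

    def cycle(self, v_in, data_in, s_in):
        """One clock cycle. v_in/data_in come from upstream, s_in from
        downstream. Returns (v_out, data_out, s_out)."""
        v_out = bool(self.tokens)             # Valid: a token is available
        data_out = self.tokens[0] if v_out else None
        s_out = len(self.tokens) == 2         # Full: stall the sender
        if v_out and not s_in:                # downstream accepted the token
            self.tokens.pop(0)
        if v_in and not s_out:                # upstream transferred a token
            self.tokens.append(data_in)
        return v_out, data_out, s_out
```

When downstream stalls, such a buffer absorbs one extra in-flight token before it must stall upstream, which is what lets a stall propagate one stage per cycle without losing data.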
The protocol used in SIP can, in principle, be used for arbitrary pipeline structures, including joins, forks, branches, and selects. However, the proposed implementation in [8] of the aligned (also referred to later as lazy) fork component can easily form combinational cycles when connected to join components in an arbitrary control network. The concept of a state-machine based nonaligned (also referred to later as eager) fork was introduced in [8] but not implemented. Because of their eagerness, eager forks can allow for shorter runtime compared to lazy forks. The authors of [9, 10], based on a similar implementation to [8], proposed an automatic procedure to convert an arbitrary clocked circuit into LI, namely, synchronous elasticization. The protocol name was coined as Synchronous ELastic Flow (SELF). They also implemented the eager fork. Eager forks constitute no combinational cycles when connected to joins, allowing synchronous elasticization of arbitrary clocked designs. Also, support for synchronous variable latency controllers was included in [9, 10]. Other significant latency insensitive protocols include Phased SELF (or pSELF) and LID-1ss. pSELF is a modified version of SELF that maps more easily to and from the asynchronous Req/Ack handshake protocol [26, 27]. LID-1ss was proposed as a modified version of LID-2ss with a stall condition indicated by asserting the Stall signal for only one clock cycle [42]. A framework for validating latency insensitive protocol families is given in [33].

Several enhancements to the original synchronous elasticity (with the SELF protocol) have since been reported. The regular join component waits for all its input channels to carry valid data before it passes the data token to the output. Early evaluation joins wait only for a required subset of inputs to be valid to start execution [32]. For correct operation, the early evaluation join must keep track of the inputs that were not required when they arrive later. This is done by sending an anti-token in the opposite direction on their control channels. When an anti-token meets a token on a control channel, they annihilate [32]. An example is a multiplexor where both the selection line and the selected input are valid while the nonselected input has not arrived yet. In such a case an early evaluation join will process the valid input, pass the data token to the output, and pass an anti-token to the nonrequired input. Early evaluation achieves a performance advantage over lazy evaluation when join inputs have different arrival latencies [43]. Several transformations that are well known in synchronous design to improve performance have been carried over to synchronous elastic circuits in correct-by-construction fashion. These include retiming, recycling, and speculation [44]. Nonetheless, other transformations that can also enhance performance are available only to elastic circuits. Examples include empty-FIFO (bubble) insertion, FIFO-capacity increase, anti-token insertion, and early evaluation [23].

1.2 Elasticity Overhead

Generating a control network is a necessary step in any of the elasticization approaches. The elastic control network area and power overheads may become prohibitive in some cases [3]. A desynchronized DLX processor in a 90 nm process is reported to have a 13.44% area overhead (over the normally clocked one), and noticeable power overhead [7].
The authors of [42] show that elasticizing a 32×32 6-stage-pipelined multiplier with three different synchronous elasticization techniques results in an area overhead ranging from 10% to 19%. Our measurements of a MiniMIPS processor fabricated in a 0.5 µm node show that synchronous elasticization with an eager SELF implementation results in area and dynamic power penalties of 29% and 13%, respectively [45]. Adding advanced features to synchronous elastic circuits (e.g., early evaluation and anti-token propagation) can pose an area versus controller performance tradeoff [32].

Elastic control networks reflect the register-to-register communications in the original clocked system. The network overhead may decrease with wider data paths. Nonetheless, the overhead is remarkable when a design has a communication complexity comparable to its computation complexity. Furthermore, elasticity can be applied at different levels of granularity [3]. A design may be divided into very few register groups, with every group enabled by only one elastic controller. However, finer granularity typically results in more robustness to variations and better performance, and is sometimes required to enjoy some of the elasticity advantages mentioned in Sec. 1.1.2 [7]. On the other hand, finer granularity typically comes at a higher elasticity cost in terms of area and power consumption. For all these reasons, this dissertation aims at achieving elasticity at a minimized cost. This will be done by minimizing the control network area and power overheads without sacrificing performance. The impact of this work will broaden the class of circuits that can be elasticized with acceptable overhead (circuits that designers would otherwise find too expensive to elasticize). The impact will also enable designers to deepen the level of elastic granularity in their designs to enjoy the full benefit of elasticity at a reasonable cost. Furthermore, all the algorithms in this dissertation (except the CNGT flow presented in Appendix B) have been automated and applied to various benchmarks, ensuring their suitability for tight time-to-market constraints.

1.3 List of Contributions

1. Elasticization and fabrication of a MiniMIPS processor case study in 0.5 µm technology. The MiniMIPS processor is an 8-bit subset of the MIPS (Microprocessor without Interlocked Pipeline Stages) designed by Hennessy [1, 46]. It has been elasticized using an all-eager implementation of the SELF protocol. No bubbles or variable latency units were used. The control network has been hand optimized. The 0.5 µm MiniMIPS represents a class of circuits in which the register-to-register communication complexity is comparable to the computation complexity. It thus provides a basic starting point to run the optimization algorithms introduced in this dissertation. The elasticization case study and results have been published in [45].

2. The Control Network Generator (CNG) algorithm and tool. The elastic control network can be constructed in many different ways. A direct approach is provided in [9, 3]. In that approach, for each register that is receiving data communications from multiple registers, one multi-input join is connected to this register controller's input. Similarly, for each register that is sending data communications to multiple registers, one multi-output fork is connected to this register controller's output. This approach, however, can be inefficient in terms of the total number of joins and forks used. Hence, this dissertation introduces CNG.
CNG is an algorithm (and a CAD tool) that generates a control network with the minimum total number of 2-input joins and 2-output forks. This can substantially reduce the power and area of the control network. CNG automatically generates the optimal network for both synchronous elasticization and desynchronization. Compared to the approach of [9], a MiniMIPS case study shows that a synchronous elastic implementation of the network generated by CNG saves 27.9%, 31.4%, and 28.5% of the control network area, dynamic power, and leakage power, respectively. CNG is published in [47] and an extended version in [48]. The PreCNG tool is also introduced. PreCNG takes an ISCAS benchmark and automatically finds and expresses the register-to-register communications in eqn and verilog formats, as well as another format that CNG accepts. The work also formalizes the problem of control network generation in a form that can be optimized by commercial synthesis tools. Results are compared.

3. Formal investigation of the specifications, different implementations, and verification of the lazy SELF protocol. The Synchronous ELastic Flow (SELF) protocol is a communication protocol in synchronous elastic designs [9]. An eager implementation of this protocol was reported in [9]. This implementation uses eager forks (EForks) that try to optimize the control network runtime at the expense of more area and power consumption. A lazy SELF implementation (i.e., one that uses normal or, so called, lazy forks (LForks)) consumes less area and power. However, the latter suffers from combinational cycles and inferior runtime in some systems. Therefore, lazy SELF has been rarely studied in the literature. To exploit its area and power advantages, this work formally and exhaustively investigates the specifications, different implementations, and verification of the lazy SELF protocol.

4. Hybrid (EFork-LFork) SELF implementation. To make use of the eager SELF runtime advantage and the lazy logic simplicity, this work introduces a novel hybrid implementation of the SELF protocol, where both eager and lazy forks are incorporated. The hybrid SELF implementation proposed in this dissertation uses eager forks only when needed for runtime optimization and combinational cycle cutting, and lazy forks otherwise. Conditions for replacing eager with lazy forks without runtime loss are formally derived. A MiniMIPS case study shows that, compared to an all-eager implementation, a hybrid (EFork-LFork) SELF saves 31.8%, 26.0%, and 30.8% in the control network area, dynamic power, and leakage power, respectively, without any performance loss. This and the previous contribution have been published in [49].

5. Introducing an Ultra Simple Fork (USFork) design and the hybrid (EFork-USFork) SELF implementation. To further extend the concept of the hybrid network, this work introduces a novel fork structure called the Ultra Simple Fork (USFork). The USFork has two advantages over the lazy fork: it has even simpler logic (just wires) and it forms no combinational cycles. This allows for even more area and power reduction in the control network. The conditions under which an EFork will be protocol equivalent to a USFork (and thus can be replaced) are formally derived. Compared to an all-eager implementation of the elastic MiniMIPS processor, the hybrid (EFork-USFork) implementation shows 36.9%, 31.3%, and 32.0% savings in the control network area, dynamic power, and leakage power, respectively.
6. Merging Elastic Buffer Controllers (EBCs) under some equivalence conditions verifiable in any synchronous elastic control network. In a typical synchronous elastic control network, some Elastic Buffer Controllers (EBCs) may activate their corresponding latches at similar schedules. This can allow for possible merging of these controllers into one controller that feeds them all (as much as the physical placement permits). A similar observation has been made by the authors of [50]. However, their algorithm requires both the control network and its environment to have static latencies. Hence, this dissertation introduces a framework for merging such controllers in any control network. That includes open networks (i.e., when the environment abstract is not available or is required to be flexible) as well as networks incorporating variable latency units. Compared to an all-eager implementation of the elastic MiniMIPS processor, the hybrid (EFork-USFork) implementation with merged EBCs shows 62.8%, 54.1%, and 56.9% savings in the control network area, dynamic power, and leakage power, respectively.

7. The Hybrid Network GENerator (HGEN) tool. HGEN incorporates the above two contributions. It takes an input verilog description of a control network. It runs IBM® 6thSense [51] as an embedded verification engine. HGEN produces a verilog description of a minimized version of the control network (i.e., EForks that are protocol equivalent to USForks are replaced, and, optionally, equivalent EBCs are merged). Though HGEN has been used in this dissertation to do the EFork to USFork conversion and EBC merging, its value is more than that. HGEN provides a framework in which any type of synchronous elastic network can be formally verified. Any future verification-based research or optimization can be readily integrated into the tool. HGEN and the above two contributions have been published in [52].

8. The CNGT transformation flow. CNG does not guarantee providing the minimum possible critical path delay in a control network. Normally this is not a problem, since the critical delay of the datapath is usually larger than that of the control network. Nonetheless, this work introduces a systematic flow (referred to as CNGT) of structural transformations of the synchronous elastic control network that reduces the network delay to meet tight timing constraints. CNGT verifies that the two versions of the control network (i.e., before and after the transformations) are functionally equivalent. The flow, in its current state, does not take wire delays into account.

1.4 This Dissertation Structure

Chapter 2 gives an overview of synchronous elasticity and the SELF protocol. It also introduces the MiniMIPS elasticization as a case study. Chapter 3 formalizes the problem of minimizing the total number of 2-input joins and 2-output forks in an elastic control network. It introduces the CNG theory, algorithm, and tool. Chapter 3 also compares the results of CNG to other possible flows using Synopsys® Design Compiler® (DC) [53] or Berkeley ABC [54] over ISCAS benchmarks and other case studies. Chapter 4 formally and exhaustively investigates the specifications and different implementations of the lazy SELF protocol. It also introduces a hybrid implementation of the SELF protocol where both eager and lazy forks are used. Chapter 5 introduces two techniques for further reducing the area and power overheads of synchronous elastic control networks, namely, utilizing the Ultra Simple Fork (USFork) and EBC merging. The two techniques have been integrated in an automatic tool, HGEN, based on 6thSense as an embedded verification engine.
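As a loose illustration of the EBC merging idea described in contribution 6 above, controllers whose latch-enable schedules coincide are candidates to share a single controller. The toy sketch below uses assumed names and a finite-trace view; the dissertation's framework instead proves controller equivalence formally, for any environment:

```python
from collections import defaultdict

def merge_candidates(enable_traces):
    """Group EBCs whose observed latch-enable schedules are identical;
    each group could, in principle, be driven by one shared controller."""
    groups = defaultdict(list)
    for ebc, trace in enable_traces.items():
        groups[tuple(trace)].append(ebc)
    return [sorted(g) for g in groups.values() if len(g) > 1]

# Hypothetical per-cycle enable traces for four controllers.
traces = {"EBC_A": [1, 0, 1, 1], "EBC_B": [1, 0, 1, 1],
          "EBC_C": [0, 1, 0, 1], "EBC_D": [1, 0, 1, 1]}
print(merge_candidates(traces))  # [['EBC_A', 'EBC_B', 'EBC_D']]
```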
Chapter 6 concludes the dissertation. Appendix A shows some preliminary heuristics for running CNG on big problems. Appendix B introduces the CNGT flow and transformations. CNGT aims at transforming a given synchronous elastic control network such that it meets tight timing constraints.

CHAPTER 2
SYNCHRONOUS ELASTICIZATION AND THE MINIMIPS CASE STUDY

Synchronous elasticization converts an ordinary clocked circuit into a Latency-Insensitive (LI) design [8, 9, 10]. The Synchronous Elastic Flow (SELF) is an LI protocol that can be used over synchronous elastic control network channels. This chapter gives an overview of synchronous elastic architectures, the SELF protocol, and the process of synchronous elasticization. The MiniMIPS elasticization is used as a case study. The chapter concludes with an investigation of the possible control network optimization avenues.

2.1 Synchronous Elastic Architectures

(Section 2.1 is a revised version of work originally published in [49].)

A synchronous elastic system replaces the flip-flops used as pipeline latches in a clocked system with Elastic Buffers (EBs). EBs serve the purpose of pipelining a design as well as acting as synchronization points that implement an LI protocol, also allowing the clocked pipeline to be stalled. Fig. 2.1 [9] shows a block diagram implementation of an EB (Figure 2.1: An EB implementation). An EB consists of a data plane (double latches) and a controller. It can be in the Empty (bubble), Half, or Full state depending on the number of data tokens its two latches are holding. A sample implementation of the EB controller can be found in [9]. EB controllers communicate through control channels. Each channel contains two control signals. Valid (V) travels in the same direction as the data and indicates the validity of the data coming from the transmitter. Stall (S) travels in the opposite direction and indicates that the receiver cannot store the current data. The SELF channel protocol is shown in Fig. 2.2 (Figure 2.2: SELF channel protocol). It defines three channel states:

1. Transfer (T): V & !S. The transmitter provides valid data and the receiver can accept it.
2. Idle (I): !V. The transmitter does not provide valid data. This dissertation identifies two Idle conditions: I0 (!V & !S), where the receiver can accept data, and I1 (!V & S), where the receiver cannot accept data.
3. Retry (R): V & S. The transmitter provides valid data, but the receiver cannot accept it. In the Retry state, the valid data must be maintained on the channel until it is stored by the receiver.

When the connection between EBs is not point-to-point, a control network is required to reflect the register-to-register communication in the original clocked circuit. The control network is composed of control channels connected through control steering units, namely, join and fork components. A join element combines two or more incoming control channels into one output control channel. A sample join design is shown in Fig. 2.3 (Figure 2.3: An n-to-1 lazy join) [8, 9]. A fork element copies one incoming control channel into two or more output control channels. An n-branch extension of the eager fork proposed in [9] is shown in Fig. 2.4 (Figure 2.4: A 1-to-n EFork). Hereafter the term control network is used to refer in aggregate to the joins, forks, and EB controllers in an elastic system.
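To make the join and fork behavior concrete, here are small behavioral sketches consistent with the descriptions above. They are illustrative assumptions, not the dissertation's gate-level designs: the join shown is one common lazy variant (the dissertation characterizes several), and the eager fork uses one pending flip-flop per branch so each branch can transfer as soon as its own receiver is ready:

```python
def lazy_join(v_ins, s_out):
    """Combinational lazy join (one common variant): the output is valid
    only when every input is valid; an input is stalled when the output
    stalls or a sibling input is not yet valid (so a valid input holds
    its data, per the Retry rule)."""
    v_out = all(v_ins)
    s_ins = [s_out or not all(v_ins[:i] + v_ins[i + 1:])
             for i in range(len(v_ins))]
    return v_out, s_ins

class EagerFork:
    """Eager fork sketch: per-branch state remembers which branches have
    already taken the current token, so no branch receives it twice."""
    def __init__(self, n):
        self.pend = [True] * n                 # branch still owes a copy

    def cycle(self, v_in, s_outs):
        v_outs = [v_in and p for p in self.pend]
        s_in = any(p and s for p, s in zip(self.pend, s_outs))
        if v_in and s_in:                      # root Retry: mark done branches
            self.pend = [p and s for p, s in zip(self.pend, s_outs)]
        else:                                  # root Transfer or Idle: reset
            self.pend = [True] * len(self.pend)
        return v_outs, s_in

fork = EagerFork(2)
print(fork.cycle(True, [False, True]))   # branch 0 transfers, root retries
print(fork.cycle(True, [False, False]))  # branch 1 transfers, root completes
```

Note how the lazy join's stall outputs depend combinationally on sibling valids; when such a join feeds a lazy (aligned) fork whose branch valids in turn depend on sibling stalls, a combinational cycle can arise, which is why eager forks, whose branch valids depend only on local state and the root valid, enable arbitrary network topologies.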
2.2 MiniMIPS Case Study and Results

MIPS (Microprocessor without Interlocked Pipeline Stages) is a 32-bit architecture with 32 registers, first designed by Hennessy [46]. The MiniMIPS is an 8-bit subset of MIPS, fully described in [1].

2.2.1 Elasticizing the MiniMIPS²

The MiniMIPS is used as a case study of elasticization. Fig. 2.5 shows a block diagram of the ordinary clocked MiniMIPS [55, 1]. The MiniMIPS has a total of 12 synchronization points (i.e., registers), shown as rectangles in Fig. 2.5: P (program counter), C (controller), I1, I2, I3, I4 (four instruction registers), A, B, and L (the two ALU input registers and its output register, respectively), M (memory data register), R (register file), and Mem (memory). To perform elasticization, each register is replaced by an elastic buffer (EB). Then, the register-to-register data communications in the MiniMIPS are analyzed. The registers pass data as follows:

- to A and B: R
- to R: C, I2, I3, L, M
- to C: C, I1
- to I1, I2, I3, I4: C, Mem
- to L: A, B, C, I4, P
- to M: Mem
- to Mem: B, C, L, P
- to P: A, B, C, I4, L, P

For each register-to-register data communication there must be a corresponding control channel to control the data flow of that communication. The resultant control network can be implemented in different ways. Fig. 2.6 shows a control network that has been hand-optimized to minimize the number of joins and forks used in the control network (to reduce area and power consumption).

²Section 2.2.1 is a revised version of work originally published in [45]. © 2010 IEEE. Reprinted with permission.

Figure 2.5: Block diagram of the ordinary clocked MiniMIPS.

From the control point of view, the register file (R) and memory (Mem) in a microprocessor can be treated as combinational units [9]. Hence, a separate EB for the register file (R) was not incorporated in Fig. 2.6. For the purpose of this case study, the memory (Mem) is off-chip. From the elastic control point of view, the MiniMIPS control signals (e.g., RegWrite, IRWrite, etc.; see Fig. 2.5) are considered part of the data plane, and they need their own corresponding control channels. The mapping between datapath signals in the clocked MiniMIPS (Fig. 2.5) and the control channels in the elastic MiniMIPS (Fig. 2.6) should be self-explanatory for most signals. RFWrite in Fig. 2.6 is the RegWrite control channel. RFWrite valid must be active if data is going to be written into the register file. Therefore, RFWrite valid has been ANDed with RegWrite inside the register file.

Figure 2.6: Hand-optimized control network of the elastic MiniMIPS.

Both the clocked and the elastic MiniMIPS have been synthesized, placed, routed, and fabricated in a 0.5 µm technology. The functionality of the fabricated processors has been verified on Verigy's V93000 SoC tester using the testbench in [1]. An eager implementation of the SELF protocol has been used, with the EFork and lazy join of Figures 2.4 and 2.3, respectively. Table 2.1 summarizes the chip measurements. It shows that elasticizing the MiniMIPS has area, dynamic power, and leakage power penalties of 29%, 13%, and 58.3%, respectively. For an accurate leakage power comparison, both designs were set to the same state (through a test vector) before measuring the average leakage supply current. Both MiniMIPS have been fabricated without the memory block; memory values have been programmed inside the tester. An assumption about the memory access time was made.
Since it affects the maximum operating frequency of both MiniMIPS designs in the same way, an arbitrary memory access time of zero was assumed. Schmoo plots for both the clocked and elastic MiniMIPS are shown in Fig. 2.7.

Table 2.1: Clocked and eager elastic MiniMIPS chip results. Measurements are done at 5 V and 30°C.

                        Clocked MiniMIPS     Eager Elastic MiniMIPS   Penalty
  Area (µm × µm)        1246.765 × 615.91    1284.1 × 771.54          29%
  Pdyn @ 80 MHz (mW)    330                  373                      13%
  Pleak (µW)            16.3                 25.8                     58.3%
  fmax (MHz)            91.7                 92.2                     -0.5%

2.2.2 Case Study Evaluation

It should be noted that the elastic MiniMIPS has functional features that the clocked design does not have. The clocked design cannot support flexible interface latencies, nor the addition of extra pipeline stages between registers. The fabricated MiniMIPS case study did not take advantage of these functional features. For example:

- The fabricated MiniMIPS (clocked and elastic) used an off-chip memory with static latency. If the memory latency is not static, the clocked design will have to implement some kind of latency insensitivity in the datapath to accommodate latency variations (e.g., a cache miss). A sample approach could be a finite state machine waiting for the memory data valid signal to assert, while stalling the processor or running no-operation (NOP) tasks. This, on the other hand, is handled naturally in the elastic MiniMIPS by means of the Valid and Stall control signals, without need for additional logic in the datapath. The overhead of adding some sort of latency insensitivity to the datapath of the normally clocked MiniMIPS should be taken into account in the comparison. The power saving due to stalling the processor (in the elastic version) rather than running NOPs (in the ordinary clocked one) should also be considered.
- The fabricated MiniMIPS (clocked and elastic) used a fixed-latency ALU. A similar argument to the above applies.
- The fabricated MiniMIPS (clocked and elastic) did not have long interconnects that had to be pipelined (i.e., no bubble insertion was needed). The synchronous elastic design naturally handles long interconnect latencies by inserting any number of empty pipeline stages (i.e., bubbles) to meet the target timing constraints. On the other hand, to handle the problem in the ordinary clocked version, severe changes in the design may be required and/or the system frequency may need to be slowed down.

Should elasticity be required (e.g., to accommodate variable latency interfaces, long interconnects, etc.), the presented MiniMIPS case study shows the cost of achieving this elasticity using the SELF protocol. The MiniMIPS is a relatively small design (8-bit datapath), and the overhead of elasticization may decrease with increasing word width. Nonetheless, the MiniMIPS represents a class of circuits in which the register-to-register communication complexity is comparable to the computation complexity. Thus, the control network area and power overheads are considerable.

Figure 2.7: Fabricated chips schmoo plots: (a) clocked MiniMIPS, (b) elastic MiniMIPS. Red boxes are for failed tests, while green are for passed ones.

Other examples from the literature include:

- A desynchronized DLX processor in a 90 nm process is reported to have a 13.44% area overhead (over the normally clocked one) and a noticeable power overhead [7].
- Elasticizing a 32×32 pipelined multiplier, for a pipeline depth ranging from 2 to 6 and with three different synchronous elasticization techniques, is reported to result in an area overhead ranging from as low as 5% to as much as 23% [42].

2.2.3 Optimization Avenues

1. Can the required register-to-register communication be achieved using fewer joins and forks? What is the minimum? - Chapter 3.
2. Eager forks incorporate one flip-flop for each branch, clocked every clock cycle. Thus, they are area and power expensive. Can the eager forks be replaced by lazy ones without sacrificing performance? - Chapter 4.
3. Are there any other fork structures that are cheaper in area and power than even lazy forks, do not form combinational cycles, and can substitute EForks without any performance loss? What are the replacement conditions? - Chapter 5.
4. Elastic buffer controllers are area and power expensive. Is it possible to merge some of the EBCs without any performance loss? - Chapter 5.

CHAPTER 3
CONTROL NETWORK GENERATOR FOR ELASTIC CIRCUITS¹

Creating latency-insensitive or asynchronous designs from clocked designs has potential benefits of increased modularity and robustness to variations. Several transformations have been suggested in the literature, and each of these requires a handshake control network (examples include synchronous elasticization and desynchronization). Numerous implementations of the control network are possible. This chapter reports on an algorithm that generates an optimum control network consisting of the minimum total number of 2-input join and 2-output fork control components. This can substantially reduce the area and power consumption of the control network. The algorithm has been implemented in a CAD tool, CNG. It has been applied to the MiniMIPS processor, showing a 14% reduction in the number of control steering units over the hand-optimized version of Fig. 2.6, and a 42.9% reduction over a network that would be implemented using the basic approach introduced in [9]. CNG is also compared with control network synthesis approaches using industrial-strength synthesis tools, e.g., Design Compiler® (DC) [53] from Synopsys® and ABC [54] from Berkeley. The tools were compared over many ISCAS-89 benchmarks as well as locally developed examples. In all complete benchmark runs in this chapter, DC and ABC produce a network with the same or a larger number of join (and fork) components than CNG. In s614, for example, ABC produces a network with 11.3% more joins than CNG (69 vs. 62). In s1238, DC produces a network with 10.9% more joins than CNG (51 vs. 46). Locally developed examples (in part based on observations seen in ISCAS benchmarks) favor CNG even more. In one of the developed examples, DC produces a network with up to 50% more join components than CNG, and ABC with 57% more joins than CNG.

¹This work has been submitted to the IEEE for possible publication [48]. Copyright may be transferred without notice.

3.1 Problem Definition

Example 3.1. Let I1, I2, X1, X2 be four registers in the original ordinary clocked design. Both registers I1 and I2 pass data to both registers X1 and X2. Find a control network implementation for the elastic version of this design.

Figures 3.1a and 3.1b are two example implementations of such a control network. The control network in Fig. 3.1b has one fewer join and one fewer fork component than the network of Fig. 3.1a. Things get more complicated when the number of registers and their corresponding communications increases. The short sketch below illustrates the join/fork tally for these two implementations.
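To illustrate the bookkeeping behind Example 3.1, the sketch below (illustrative Python, not the CNG implementation; node names are invented) counts 2-input joins and 2-output forks in a control network given as a fan-in map, using the convention, formalized later in Eqs. 3.3 and 3.4, that an n-input join costs (n-1) J2s and an n-output fork costs (n-1) F2s.

```python
# Count 2-input joins (J2) and 2-output forks (F2) in a control network
# represented as {vertex: list of predecessor vertices}.
def count_j2_f2(preds):
    fanout = {}
    nj2 = 0
    for v, ins in preds.items():
        nj2 += max(len(ins) - 1, 0)          # an n-input join = (n-1) J2s
        for u in ins:
            fanout[u] = fanout.get(u, 0) + 1
    nf2 = sum(max(n - 1, 0) for n in fanout.values())  # n-output fork = (n-1) F2s
    return nj2, nf2

# Fig. 3.1a (naive): X1 and X2 each join I1 and I2 directly.
naive = {"X1": ["I1", "I2"], "X2": ["I1", "I2"]}
# Fig. 3.1b (shared): join I1, I2 once into an internal term, then fork it.
shared = {"T12": ["I1", "I2"], "X1": ["T12"], "X2": ["T12"]}

print(count_j2_f2(naive))   # (2, 2)
print(count_j2_f2(shared))  # (1, 1)
```

The shared network saves one join and one fork, matching the difference between Figures 3.1a and 3.1b.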
Hence, the purpose of the proposed algorithm is as follows: given a set of required register-to-register communications, automatically generate a control network with the minimum total number of 2-input join and 2-output fork components. This section lists a number of definitions required to formalize the problem. Example 3.2 will be used as a running example throughout the chapter.

Example 3.2. Let A, B, C, D, E, F, G, X1, X2, X3, X4, X5 be twelve registers in the original ordinary clocked design. The following registers pass data to X1: B, C, G; to X2: A, B, C, G; to X3: A, B, C, D, E; to X4: A, B, D, E, F; and to X5: A, B, E, F. Find a control network implementation for the elastic version of this design that incorporates the minimum number of join and fork components.

A data-transmitting register, as well as a primary input, will be referred to as an input node (or INode). Similarly, a data-receiving register, as well as a primary output, will be referred to as an output node (or ONode). The set of all INodes and the set of all ONodes in the network are designated INodeS and ONodeS, respectively. In Example 3.2, $INodeS = \{A, B, C, D, E, F, G\}$ and $ONodeS = \{X_1, X_2, X_3, X_4, X_5\}$. Note that, in a typical system, a register both receives and transmits data. Hence, from the data communication perspective, its data-input interface is an ONode and its data-output interface is an INode.

Figure 3.1: Two possible implementations of Example 3.1. (a), (b).

Definition 3.3. Term: A set of one or more INodes. Constructing a Term typically means joining the control channels coming from its constituent INodes into one control channel. Each Term has a unique identifier, TermID. As an example, a Term that joins the control channels coming from B, D, and E is $\{B, D, E\}$ and, for simplicity, will be referred to as BDE. $|Term_1|$ designates the cardinality of $Term_1$. A Term that is associated with an input node (i.e., composed of only one INode) is called a Source. The set of all Source Terms is designated SourceS. Note that $|SourceS| = |INodeS|$.

Definition 3.4. Target: A Term that is associated with an output node. The Target of a certain ONode is the Term composed of all INodes that send data to that ONode. In Example 3.2, BCG is the Target Term associated with ONode X1. The set of all Target Terms is designated TargetS. Note that $|TargetS| = |ONodeS|$. The set of all Terms relevant to the problem is designated TermS. Formally,

$$TermS = \{Term_i \mid \exists\, Target_j \in TargetS : Term_i \subseteq Target_j\} \qquad (3.1)$$

Terms in TermS, or in any other Term set introduced later, are identified by their unique TermID rather than their INode set contents (see the Term definition in Def. 3.3). In general, every INode set will map to at most one TermID. However, an exception to this rule, without loss of generality, are the INode sets of Target Terms. This work assumes that Target Terms are terminal, in the sense that they cannot be used inside the control network to construct other Terms. If they need to be shared by other Terms, internal images that have the same INode set are used inside the network instead. Hence, the TermS set of Eq. 3.1 can contain both a Target and its internal image. An example among the Terms listed in Table 3.1 is the Target whose INode set is $\{B, C, G\}$ and whose TermID = 1. It has an internal image (i.e., a Term with the same INode set), which is the Term whose TermID = 8.

Definition 3.5. Partial Solution or PS: A set of Terms that could be used to implement another Term.
Formally, $PS_t$ (a set) is a partial solution of $Term_t$ iff $\bigcup_{i=1}^{|PS_t|} Term_i = Term_t \ \wedge\ \forall Term_i \in PS_t : TermID_i \neq TermID_t$, where $TermID_i$ and $TermID_t$ are the TermIDs of $Term_i$ and $Term_t$, respectively.

Table 3.1: Terms and PSs of Example 3.2. Term types are: Target (T), PTerm (P), and Source (S).

  TermID  Term     Type  PSID  PS            Initial nUsedMax  Initial nUsedMin
  1       BCG      T     1     {BCG}         0                 0
  2       ABCG     T     1     {BCG, A}      0                 0
                         2     {ABC, G}
  3       ABCDE    T     1     {ABDE, C}     0                 0
                         2     {ABC, D, E}
  4       ABDEF    T     1     {ABDE, F}     0                 0
                         2     {ABEF, D}
  5       ABEF     T     1     {ABEF}        0                 0
  6       ABDE     P     1     {ABE, D}      2                 0
  7       ABEF     P     1     {ABE, F}      2                 1
  8       BCG      P     1     {BC, G}       2                 1
  9       ABC      P     1     {BC, A}       2                 0
                         2     {AB, C}
  10      ABE      P     1     {AB, E}       2                 1
  11      BC       P     1     {B, C}        2                 1
  12      AB       P     1     {A, B}        2                 1
  13-19   A ... G  S, P

$PS_t$ represents one way of constructing $Term_t$. One Term can be constructed in multiple ways, and thus may have more than one PS. In Example 3.2, to construct $Term_t = ABCDE$, one possible PS is $\{ABC, D, E\}$. Another is $\{ABDE, C\}$. Note that, by definition, a Term cannot be used to implement itself. Also, Sources do not have PSs.

Definition 3.6. Solution or Soln: A vector of PSs, where TermIDs are used as indices (the first index is 1). If $Soln_1$ is a Solution, and $TermID_t$ is the TermID of $Term_t$, then $Soln_1[TermID_t]$ (or, for short, $Soln_1[Term_t]$) is the PS chosen to construct $Term_t$ in $Soln_1$. $Soln_1[Term_i] = \emptyset \Rightarrow Term_i \in SourceS$.²

In Example 3.2, the following is a possible Solution (Terms are sorted by their TermIDs of Table 3.1, and Source PSs are ignored):

$$Soln_1 = \langle \{BCG\}, \{BCG, A\}, \{ABDE, C\}, \{ABDE, F\}, \{ABEF\}, \{ABE, D\}, \{ABE, F\}, \{BC, G\}, \{AB, C\}, \{AB, E\}, \{B, C\}, \{A, B\} \rangle \qquad (3.2)$$

²Throughout this chapter, the $\Rightarrow$ symbol will be used to indicate implication, while $\to$ will indicate the domain and codomain of a function.

Hence, a Solution can be seen as a vector of the PS choices of the different Terms. For example, $Soln_1[2] = \{BCG, A\}$. This means the PS $\{BCG, A\}$ is used in $Soln_1$ to construct Term ABCG (whose TermID is 2). $Soln_1$ is depicted in Fig. 3.2. The set of all Solutions is designated SolnS.

Definition 3.7. nUsed: $nUsed[Term_i]_{Soln_1}$ defines how many times $Term_i$ is used to construct other useful Terms in Solution $Soln_1$. Formally, $nUsed[Term_i]_{Soln_1}$ is defined recursively as the number of Terms, $Term_t$, that satisfy the following two conditions:

1. $Term_i \in Soln_1[Term_t]$.
2. $nUsed[Term_t]_{Soln_1} > 0 \ \vee\ Term_t \in TargetS$.

By definition, $\forall Term_i \in TargetS : nUsed[Term_i] = 0$.

Definition 3.8. Useful Term: $Term_i$ is said to be useful in $Soln_1$ (or $Soln_1$ uses $Term_i$) if either of the following two conditions holds:

1. $Term_i \in TargetS$.
2. $nUsed[Term_i]_{Soln_1} > 0$.

Figure 3.2: A sample control network of Example 3.2.

The function $UsefulTermS(Soln_1) : SolnS \to 2^{TermS}$ is defined to return the useful Terms in a given Solution. Formally, $UsefulTermS(Soln_1) = UTermS$, where $UTermS = \{Term_i \in TermS \mid Term_i \text{ is useful in } Soln_1\}$. The suffix $Soln_1$ may be omitted from nUsed and other data structures and functions when the context is clear.

For Example 3.2 and $Soln_1$ of Eq. 3.2: Term ABE (TermID 10) is used to construct both Term ABDE (TermID 6) and Term ABEF (TermID 7). Hence, $nUsed[ABE]_{Soln_1} = 2$. Also, Term ABC (TermID 9) is not useful in $Soln_1$. Term AB (TermID 12) is used to construct both Term ABC (TermID 9) and Term ABE (TermID 10). However, since Term ABC is not useful in $Soln_1$, $nUsed[AB]_{Soln_1}$ is only 1.

Definition 3.9. Solution Graph or SG: An SG is a Directed Acyclic Graph (DAG) composed of the ordered pair $(V, A)$, where $V$ is the set of vertices and $A \subseteq V \times V$ the set of directed arcs.
Any Soln, $Soln_1$, can be represented by an SG, $SG_1$, such that:

- $V = \{TargetS, SourceS, ITermS\}$; for short, $V = \{T, S, I\}$, where $ITermS = \{Term_i \in TermS \mid Term_i \notin (SourceS \cup TargetS) \wedge Term_i \text{ is useful in } Soln_1\}$.
- $A = \{(v_i, v_j) \mid v_i, v_j \in V \wedge v_i \in Soln_1[v_j]\}$.

For Example 3.2 and $Soln_1$ of Eq. 3.2, $SG_1$ is shown in Fig. 3.3. Note that, from the definition of A above and the PS and Soln definitions (Definitions 3.5 and 3.6, respectively), $SG_1$ is acyclic (i.e., no possible sequence of arcs can start from and end at the same vertex).

Figure 3.3: A Solution graph for Example 3.2 Solution of Eq. 3.2.

The following functions are defined for each vertex $v_i \in V$:

- $A_{in}(v_i) : V \to 2^A$. For each $v_i$, $A_{in}(v_i)$ returns the set of arcs that end at $v_i$. Formally, $A_{in}(v_i) = \{a_j = (v_j, v_i) \mid a_j \in A\}$.
- Similarly, $A_{out}(v_i) : V \to 2^A$. For each $v_i$, $A_{out}(v_i)$ returns the set of arcs that start at $v_i$. Formally, $A_{out}(v_i) = \{a_j = (v_i, v_j) \mid a_j \in A\}$.
- $nJ_2(v_i) : V \to \mathbb{N}$. A function that returns the number of 2-input joins constructing the Term represented by vertex $v_i$ in the Solution represented by the graph. It is assumed in this work that an n-input join is implemented using $(n-1)$ J2s. Formally,

$$nJ_2(v_i) = \begin{cases} |A_{in}(v_i)| - 1 & |A_{in}(v_i)| \geq 1 \\ 0 & |A_{in}(v_i)| = 0 \end{cases} \qquad (3.3)$$

- Similarly, $nF_2(v_i) : V \to \mathbb{N}$. A function that returns the number of 2-output forks immediately branching from the Term represented by $v_i$. It is assumed in this work that an n-output fork is implemented using $(n-1)$ F2s. Formally,

$$nF_2(v_i) = \begin{cases} |A_{out}(v_i)| - 1 & |A_{out}(v_i)| \geq 1 \\ 0 & |A_{out}(v_i)| = 0 \end{cases} \qquad (3.4)$$

Definition 3.10. Cost: A function that returns the number of 2-input joins (J2s) required to implement a PS, a Term, or a Soln. Formally, let $PS_t$ be the PS of Term $Term_t$ in Soln $Soln_1$ (i.e., $Soln_1[Term_t] = PS_t$); then $Cost(Term_t)$ in $Soln_1$, $Cost(Term_t)_{Soln_1} : TermS \times SolnS \to \mathbb{N}$, is defined as follows:

$$Cost(Term_t)_{Soln_1} = |PS_t| - 1 + \sum_{i=1}^{|PS_t|} \frac{Cost(Term_i)_{Soln_1}}{nUsed[Term_i]_{Soln_1}} \qquad (3.5)$$

where $Term_i \in PS_t\ \forall i = 1, 2, \ldots, |PS_t|$. $Cost(Term_t)_{Soln_1}$ and $Cost(PS_t)_{Soln_1}$ will be used interchangeably (since $Soln_1[Term_t] = PS_t$).

Two factors contribute to $Cost(Term_t)$ in a Solution. The first is the number of J2s used to join the constituent Terms of $PS_t$; it is assumed in Eq. 3.5 that implementing an n-input join requires $(n-1)$ J2s. The other factor is the Cost of the constituent Terms themselves, taking into account how much these Terms are shared among other Terms in that Solution. The Term sharing information is provided by the nUsed vector. By definition, $\forall Term_i \in SourceS : Cost(Term_i) = 0$.

For Example 3.2 and $SG_1$ of Fig. 3.3, the PS chosen to construct Term ABE is $\{AB, E\}$, and $nUsed[AB] = 1$. Hence, $Cost(ABE) = 1 + Cost(AB)$. The PS chosen to construct Term AB is $\{A, B\}$, and hence $Cost(AB) = 1$. Therefore, $Cost(ABE)$ in $Soln_1$ is 2. Similarly, $Cost(ABDE) = 2$.

Similarly, the function $Cost(Soln_1) : SolnS \to \mathbb{N}$ is defined to return the total number of J2s used to construct all of TargetS in $Soln_1$. Formally,

$$Cost(Soln_1) = \sum_{i=1}^{|TargetS|} Cost(Target_i) \qquad (3.6)$$

where $Target_i \in TargetS\ \forall i = 1, 2, \ldots, |TargetS|$. For Example 3.2 and $Soln_1$ of Eq. 3.2 (or $SG_1$ of Fig. 3.3), five Targets exist, namely BCG, ABCG, ABCDE, ABDEF, and ABEF. The summation of the Costs of these Targets in $Soln_1$ (i.e., $Cost(Soln_1)$) is 9. (A sketch of this computation is given after the next definition.)

Definition 3.11. OptCost: The minimum Cost among all Solution Costs. Formally, $OptCost = \min_{i=1}^{|SolnS|} Cost(Soln_i)$. The Optimum Solution or OptSoln is defined to be a Solution such that $Cost(OptSoln) = OptCost$. An OptSoln may not be unique for a given problem, since multiple Solutions can have the same minimum Cost among all Solutions. Hence, OptSolnS is defined to be the set of all optimum Solutions.
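The following minimal sketch (illustrative only; the encoding of Soln1 by TermIDs follows Table 3.1, with Sources written as letters) reproduces the nUsed and Cost computations of Definitions 3.7 and 3.10 for Soln1 of Eq. 3.2, confirming Cost(Soln1) = 9.

```python
from fractions import Fraction

SOURCES = set("ABCDEFG")
TARGETS = {1, 2, 3, 4, 5}
# Soln1 of Eq. 3.2: chosen PS per TermID (6=ABDE, 7=ABEF image, 8=BCG image,
# 9=ABC (unused), 10=ABE, 11=BC, 12=AB).
soln1 = {1: [8], 2: [8, "A"], 3: [6, "C"], 4: [6, "F"], 5: [7],
         6: [10, "D"], 7: [10, "F"], 8: [11, "G"], 9: [12, "C"],
         10: [12, "E"], 11: ["B", "C"], 12: ["A", "B"]}

def n_used(term):
    """Def. 3.7: count consumers of `term` that are Targets or themselves useful."""
    return sum(1 for t, ps in soln1.items()
               if term in ps and (t in TARGETS or n_used(t) > 0))

def cost(term):
    """Eq. 3.5: J2s joining the PS, plus each child's Cost amortized by sharing."""
    if term in SOURCES:
        return 0
    ps = soln1[term]
    return len(ps) - 1 + sum(Fraction(cost(c), n_used(c)) for c in ps
                             if c not in SOURCES)

print(sum(cost(t) for t in TARGETS))  # Eq. 3.6 -> 9
```

Note that the amortized child costs are exact fractions; they sum back to whole J2 counts at the Targets, which is why the total comes out to the integer 9.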
Definition 3.12. Search Space or Space: A Space (designated $S_k$) is a set of Solutions. The (whole) search Space (designated $S_o$) is initialized with SolnS, and then refined throughout the algorithm until an OptSoln is found.

Definition 3.13. Cone(Term): $Cone(Term_t)_{Soln_1} : TermS \times SolnS \to 2^{TermS}$, a function that returns the set of all Terms (down to SourceS) used in implementing $Term_t$ in $Soln_1$. Formally, let $Soln_1[Term_t] = PS_t$; then:

$$Cone(Term_t)_{Soln_1} = PS_t \cup \bigcup_{i=1}^{|PS_t|} Cone(Term_i)_{Soln_1} \qquad (3.7)$$

where $Term_i \in PS_t\ \forall i = 1, 2, \ldots, |PS_t|$. By definition, $\forall Term_i \in SourceS : Cone(Term_i) = \emptyset$. For Example 3.2 and $SG_1$ of Fig. 3.3: $Cone(BCG) = \{BC, G, B, C\}$.

Similarly, let $PS'$ be a set of Terms (not necessarily a PS of any Term); then define $Cone(PS')_{Soln_1} : 2^{TermS} \times SolnS \to 2^{TermS}$ as follows:

$$Cone(PS')_{Soln_1} = PS' \cup \bigcup_{i=1}^{|PS'|} Cone(Term_i)_{Soln_1} \qquad (3.8)$$

where $Term_i \in PS'\ \forall i = 1, 2, \ldots, |PS'|$. Hence, if $Soln_1[Term_t] = PS_t$, then $Cone(Term_t)_{Soln_1}$ and $Cone(PS_t)_{Soln_1}$ will be used interchangeably.

Definition 3.14. Del operator, $Soln_1/D$: The Del operator ($/$), accompanied by a Del set $D \subseteq TermS$, is applied to a Solution. Applied to $Soln_1$, it effectively removes all the Terms in D from $Soln_1$. Formally,

$$Soln_1/D[Term_i] = \begin{cases} Soln_1[Term_i] & Term_i \notin D \\ \emptyset & Term_i \in D \end{cases} \qquad (3.9)$$

Applying $/D$ to the $Soln_1$ vector will also affect its associated data structures and functions (e.g., nUsed, Cost, and Cone). This will be denoted as, for example, $nUsed[Term_i]_{Soln_1/D}$. Some of the useful Terms in $Soln_1$ can become unused (i.e., their $nUsed_{Soln_1/D} = 0$), as can some of the Terms in their respective Cones. For Example 3.2 and $SG_1$ of Fig. 3.3, deleting Term BCG will decrease the nUsed of the following Terms by 1: BC (which becomes unused), G (which becomes unused), B, and C.

Definition 3.15. nAddedJoins or nAJ(Term): $nAJ(Term_i)_{Soln_1} : TermS \times SolnS \to \mathbb{N}$, a function that returns the number of J2s that exist in $Soln_1$ solely to construct $Term_i$ (i.e., the J2s that would otherwise not be used if $Term_i$ were deleted from $Soln_1$). Formally, let $Soln_1[Term_t] = PS_t$; then:

$$nAJ(Term_t)_{Soln_1} = u_t^{Soln_1} \left( |PS_t| - 1 + \sum_{i=1}^{|Cone(Term_t)|} s_i^{Soln_1/\{Term_t\}}\, nAJ_o(Term_i) \right) \qquad (3.10)$$

where $Term_i \in Cone(Term_t)\ \forall i = 1, 2, \ldots, |Cone(Term_t)|$, and:

$$nAJ_o(Term_i)_{Soln_1} = \begin{cases} |Soln_1[Term_i]| - 1 & Term_i \notin SourceS \\ 0 & Term_i \in SourceS \end{cases} \qquad (3.11)$$

$$u_t^{Soln_1} \text{ or } u[Term_t]_{Soln_1} = \begin{cases} 1 & Term_t \text{ is useful in } Soln_1 \\ 0 & Term_t \text{ is not useful in } Soln_1 \end{cases} \qquad (3.12)$$

$$s_i^{Soln_1/\{Term_t\}} \text{ or } s[Term_i]_{Soln_1/\{Term_t\}} = \begin{cases} 1 & nUsed[Term_i]_{Soln_1/\{Term_t\}} = 0 \\ 0 & nUsed[Term_i]_{Soln_1/\{Term_t\}} > 0 \end{cases} \qquad (3.13)$$

Unless otherwise specified, nAJ will be calculated for useful Terms only. Hence, $u[Term_t]$ (or, interchangeably, $u_t$) in Eq. 3.10 will frequently be omitted. Note the analogy between $nAJ_o$ of Eq. 3.11 and $nJ_2(v_i)$ of Eq. 3.3. If $Term_i \in Cone(Term_t)_{Soln_1}$, then $nAJ_o(Term_i)_{Soln_1}$ contributes to $nAJ(Term_t)_{Soln_1}$ only if $Term_i$ is constructed in $Soln_1$ for the sole purpose of constructing $Term_t$ (in other words, only if $Term_i$ would not be useful in $Soln_1$ if $Term_t$ were deleted from $Soln_1$). This information is provided through $s[Term_i]$ (or, interchangeably, $s_i$) defined in Eq. 3.13. $nAJ(Term_t)_{Soln_1}$ and $nAJ(PS_t)_{Soln_1}$ will be used interchangeably (since $Soln_1[Term_t] = PS_t$). As an example, let all the Terms used by $PS_t$ be already shared by other Terms in $Soln_1$. In this case, all that is added to the network to construct $PS_t$ are the J2s required to join its constituent Terms (i.e., $|PS_t| - 1$).
For Example 3.2 and $SG_1$ of Fig. 3.3, $nAJ_o(AB)_{Soln_1} = 1$ and $nUsed[AB]_{Soln_1/\{ABE\}} = 0$; therefore, $nAJ(ABE)_{Soln_1} = 2$. Although the Cost of ABDE is two, its nAJ is only one. The reason is that Term ABE, which is used to construct ABDE in $Soln_1$, is also used in the Solution to construct another Term (i.e., Term ABEF). Hence, to construct Term ABDE, the only J2 added to $Soln_1$ is the one that joins ABE with D.

3.2 The Algorithm

Lemma 3.1. Let $nJ_2$ and $nF_2$ be the total number of J2s and F2s in a network, respectively. Then, the following equality holds for any Solution $\in SolnS$ (i.e., whatever the PS choices of the different Terms):

$$nJ_2 - nF_2 = |SourceS| - |TargetS| \qquad (3.14)$$

Proof. Construct a Solution graph, $SG_1$, of a Solution $Soln_1$ (see Fig. 3.3, for example). Following Def. 3.9 of the SG, each arc starts at a vertex (i.e., a Term) and ends at a vertex (i.e., another Term); therefore, the following equation holds:

$$\sum_{i=1}^{|V|} |A_{in}(v_i)| = \sum_{i=1}^{|V|} |A_{out}(v_i)| \qquad (3.15)$$

By definition, $\forall v_i \in SourceS : |A_{in}(v_i)| = 0$, and $\forall v_i \in TargetS : |A_{out}(v_i)| = 0$. Hence, Eq. 3.15 reduces to:

$$\sum_{j=1}^{|I|+|T|} |A_{in}(v_j)| = \sum_{j=1}^{|I|+|S|} |A_{out}(v_j)| \qquad (3.16)$$

Since all $SG_1$ vertices represent useful Terms in $Soln_1$ (see Def. 3.8), and since by the definition of a Solution (Def. 3.6) all useful Terms must be implemented using other Terms (except SourceS), the following holds:

$$\forall v_i \in (ITermS \cup TargetS) : |A_{in}(v_i)| \geq 1 \qquad (3.17)$$
$$\forall v_i \in (ITermS \cup SourceS) : |A_{out}(v_i)| \geq 1 \qquad (3.18)$$

Hence, from Equations 3.3 and 3.4, Eq. 3.16 can be rewritten in terms of $nJ_2(v_j)$ and $nF_2(v_j)$ as follows:

$$\sum_{j=1}^{|I|+|T|} (nJ_2(v_j) + 1) = \sum_{j=1}^{|I|+|S|} (nF_2(v_j) + 1) \qquad (3.19)$$

The total numbers of 2-input joins and 2-output forks in $Soln_1$ (i.e., $nJ_2$ and $nF_2$, respectively) can be computed as follows:

$$nJ_2 = \sum_{j=1}^{|I|+|T|} nJ_2(v_j) \qquad (3.20)$$
$$nF_2 = \sum_{j=1}^{|I|+|S|} nF_2(v_j) \qquad (3.21)$$

Substituting Equations 3.20 and 3.21 into Eq. 3.19 concludes the proof.

Theorem 3.2. An algorithm that minimizes $nJ_2$ will also minimize $nF_2$, and also $nJ_2 + nF_2$.

Proof. The theorem follows directly from Lemma 3.1. In other words, for some required communications in a control network, since an OptSoln (Def. 3.11) utilizes the minimum number of J2s, it will also incorporate the minimum total number of J2s and F2s. (A numeric check of Eq. 3.14 on the Solution graph of Fig. 3.3 is sketched later in this overview.)

3.2.1 Algorithm Overview

Theorem 3.2 narrows down the problem to the following: construct the TargetS from the SourceS using a minimum total number of J2s (i.e., find an OptSoln). The proposed algorithm consists of four main steps, covered in the following four subsections. Step I finds the candidate Terms that can be used in an OptSoln. Then, for each of the candidate Terms, Step II finds the candidate PSs that may be used by an OptSoln. Step II uses a set of proven rules to identify (and exclude) PSs that are not needed to find an OptSoln. At this point, the search Space of the problem consists of all the remaining possible PS choices of all the candidate Terms. Step III collects statistics about the search Space. Metrics computed include the max/min possible usage (or sharing) of the remaining Terms in the search Space, from which the max/min possible nAJ value of each remaining PS can be computed. Based on these metrics, Step III eliminates expensive PSs from the search Space. This Space reduction in turn affects the Space metrics, which can lead to removing further expensive PSs. Hence, Step III prunes the search Space over a number of iterations until no further reduction is possible, at which point the algorithm moves to Step IV.
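As the numeric check promised after Theorem 3.2, the sketch below (illustrative only; vertex names with an "i" suffix denote internal images and "*" marks Targets, both conventions invented here) transcribes the Solution graph of Fig. 3.3 and confirms Eq. 3.14: $nJ_2 = 9$ and $nF_2 = 7$, whose difference equals $|SourceS| - |TargetS| = 7 - 5$.

```python
# SG1 of Fig. 3.3 as {vertex: predecessor Terms}, transcribed from Soln1 (Eq. 3.2).
sg1 = {"BCG*": ["BCGi"], "ABCG": ["BCGi", "A"], "ABCDE": ["ABDE", "C"],
       "ABDEF": ["ABDE", "F"], "ABEF*": ["ABEFi"],
       "ABDE": ["ABE", "D"], "ABEFi": ["ABE", "F"], "BCGi": ["BC", "G"],
       "ABE": ["AB", "E"], "BC": ["B", "C"], "AB": ["A", "B"]}
sources, targets = set("ABCDEFG"), {"BCG*", "ABCG", "ABCDE", "ABDEF", "ABEF*"}

nj2 = sum(len(ins) - 1 for ins in sg1.values())       # Eq. 3.20 -> 9
fanout = {}
for ins in sg1.values():
    for u in ins:
        fanout[u] = fanout.get(u, 0) + 1
nf2 = sum(n - 1 for n in fanout.values())             # Eq. 3.21 -> 7
assert nj2 - nf2 == len(sources) - len(targets)       # Eq. 3.14: 9 - 7 == 7 - 5
```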
Choosing a certain PS for a Term (and omitting the other PSs from the search Space) affects the max/min possible usage of the constituent Terms of these PSs. This, in turn, can affect the max/min possible nAJ value of other PSs that use these Terms, providing an opportunity to remove expensive PSs. Hence, Step IV makes use of this fact in case more than one Solution is still left in the search Space after Step III. Step IV splits the remaining search Space into multiple Spaces, each with mutually exclusive PS choices for some Terms (called STermS). It then updates each sub-Space's metrics based on the specific PS choices made for that sub-Space, allowing for further reduction. The splitting continues until there is only one Solution left in each sub-Space. The Cost of each Solution of each sub-Space is calculated and compared, and an OptSoln is returned.

3.2.2 Step I: Construct the Potential Terms

The first step in the algorithm is to determine which Terms could be used to construct the TargetS Terms, and to eliminate the rest.

Definition 3.16. Potential Terms or PTermS: A set of Terms from which an OptSoln can be constructed. Formally,

$$PTermS \cap TargetS = \emptyset \ \wedge\ \exists OptSoln_i \in OptSolnS : (PTermS \cup TargetS) \supseteq UsefulTermS(OptSoln_i) \qquad (3.22)$$

where the UsefulTermS function is defined in Def. 3.8.

Definition 3.17. Common Terms or CTermS:

$$CTermS = \{Term_c \in (TermS - TargetS) \mid Term_c = Target_i \cap Target_j \text{ for some } Target_i, Target_j \in TargetS,\ Target_i \neq Target_j\} \qquad (3.23)$$

The following are the different methods used to construct the potential Terms (PTermS):

3.2.2.1 Method I: All Subsets of All CTermS Terms

Define

$$PTermS1_o = \{Term_p \mid Term_p \subseteq Term_{c_i} \text{ for some } Term_{c_i} \in CTermS\} \qquad (3.24)$$
$$PTermS1 = PTermS1_o \cup SourceS \qquad (3.25)$$

Theorem 3.3. Potential Terms of Method I: PTermS1 satisfies Def. 3.16 of the potential Terms (i.e., $\exists OptSoln_i \in OptSolnS : PTermS1 \cup TargetS \supseteq UsefulTermS(OptSoln_i)$). Hence, an optimum Solution can be constructed using only Terms from PTermS1.

Proof. The proof relies on other theorems stated later in the text. The reader is advised to read the proof after finishing Sec. 3.2.4. Define the function FTargetS (read Father-TargetS): $(TermS - TargetS) \to 2^{TargetS}$, as follows:

$$FTargetS(Term_i) = \{Target_j \in TargetS \mid Term_i \subseteq Target_j\}$$

FTargetS(Term_i) returns the set of Targets in whose construction $Term_i$ can be used. Also, define the following Term set:

$$UnSharedTermS = \{Term_i \in (TermS - (TargetS \cup SourceS)) \mid |FTargetS(Term_i)| = 1\} \qquad (3.26)$$

From the TermS definition in Eq. 3.1, PTermS1 can be redefined as $PTermS1 = TermS - TargetS - UnSharedTermS$, and Theorem 3.3 can be restated as follows: an optimum Solution can be found without using the Terms in UnSharedTermS. The proof will be done by iteratively applying Theorem 3.15, Rule V. It is easy to show that each Term in UnSharedTermS can maximally be used by only one Target and zero or more other Terms from UnSharedTermS. Define $UnSharedTermS_1$ to be the Terms in UnSharedTermS which are maximally used once (i.e., by one Target and zero other Terms from UnSharedTermS). Formally,

$$UnSharedTermS_1 = \{Term_i \in UnSharedTermS \mid Term_i \subset Term_t \in TermS \Rightarrow Term_t \in TargetS\} \qquad (3.27)$$

Obviously, $\forall Term_i \in UnSharedTermS_1 : nUsedMax[Term_i] = 1$. Hence, by Theorem 3.15, Rule V, all Terms in $UnSharedTermS_1$ can be omitted from the search Space (i.e., an OptSoln can be found without using them).
Similarly, define $UnSharedTermS_2$ to be the Terms in UnSharedTermS which are maximally used by only one Target and one or more Terms from $UnSharedTermS_1$:

$$UnSharedTermS_2 = \{Term_i \in UnSharedTermS \mid Term_i \subset Term_t \in TermS \Rightarrow Term_t \in (TargetS \cup UnSharedTermS_1)\} \qquad (3.28)$$

Since the Terms in $UnSharedTermS_1$ are omitted from the search Space, $\forall Term_i \in UnSharedTermS_2 : nUsedMax[Term_i] = 1$. Hence, by Theorem 3.15, Rule V, all Terms in $UnSharedTermS_2$ can also be omitted from the search Space. The above iterations can be repeated until all Terms in UnSharedTermS are omitted from the search Space. Hence, an optimum Solution can be found without using any Term from UnSharedTermS. That concludes the proof.

Method I includes in PTermS1 all CTermS Terms as well as all their subsets. The number of potential Terms will thus increase quickly as the number and sizes of CTerms increase. This adversely affects the algorithm runtime. Hence, the following methods try to minimize the number of PTerms.

3.2.2.2 Method II: All Intersections and Differences of CTermS Terms

This method initially populates PTermS (referred to, in this method, as PTermS2) with CTermS. It then considers the intersection of, and the difference between, any two PTerms to be another PTerm (a short code sketch of this closure is given after Method III below). Formally, define $PTermS2_o$ to be the smallest set (in cardinality) that satisfies the following two conditions:

1. $PTermS2_o \supseteq CTermS$.
2. $\forall Term_{p_i}, Term_{p_j} \in PTermS2_o : Term_{p_i} - Term_{p_j} \in PTermS2_o \ \wedge\ Term_{p_i} \cap Term_{p_j} \in PTermS2_o$.

$$PTermS2 = PTermS2_o \cup SourceS \qquad (3.29)$$

It is easy to show that $PTermS2 \subseteq PTermS1$. A proof (or counterproof) that PTermS2 satisfies the definition of PTermS (Def. 3.16) could not be found. Hence, using Method II to construct PTermS, while it typically incorporates fewer Terms, is not proved (or disproved) to result in an optimum Solution for all problems. Nonetheless, for all the examples where Method I and Method II ran to completion, Method II provided optimum Solutions.

3.2.2.3 Method III: Target Division

This method gives a label to each Term $\in$ TermS. The label reflects whether, for each Target, all the INodes (or Sources) joined by this Term belong to that Target, only part of them, or none of them. It then groups Terms with similar labels together. The biggest Term (in cardinality) in each group is then included in PTermS3. Non-Source Terms that cannot be used for constructing more than one Target are excluded from PTermS3 (since an OptSoln can be found without using them, according to the proof of Theorem 3.3). Formally, the Label function ($L : (TermS - TargetS) \to \{0, 1, *\}^{|TargetS|}$) is defined as follows: $L(Term_t) = V_t$ such that

$$V_t[i] = \begin{cases} 1 & Term_t \cap Target_i = Term_t \\ 0 & Term_t \cap Target_i = \emptyset \\ * & \emptyset \subset Term_t \cap Target_i \subset Term_t \end{cases} \qquad (3.30)$$

$\forall i = 1, 2, \ldots, |TargetS|$. Also define $nL(Term_t) : (TermS - TargetS) \to \mathbb{N}$ to be the number of indices with $V_t[i] = 1$, $\forall i = 1, \ldots, |V_t|$, where $V_t = L(Term_t)$. Define:

$$PTermS3_o = \{Term_p \in (TermS - TargetS) \mid nL(Term_p) > 1 \ \wedge\ \forall Term_i \in (TermS - TargetS), Term_i \neq Term_p : L(Term_i) = L(Term_p) \Rightarrow Term_i \subseteq Term_p\} \qquad (3.31)$$

$$PTermS3 = PTermS3_o \cup SourceS \qquad (3.32)$$

It is easy to show that $PTermS3 \subseteq PTermS2$. However, similar to PTermS2, a proof (or counterproof) that PTermS3 satisfies the definition of PTermS (Def. 3.16) could not be found. Hence, using Method III to construct PTermS, while it typically incorporates fewer Terms, is not proved (or disproved) to result in an optimum Solution for all problems. Nonetheless, in all the examples where Method I and Method III ran to completion, Method III provided optimum Solutions.
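As an illustration of Method II (not taken from the CNG sources), the sketch below closes CTermS under pairwise intersection and difference until a fixed point, discarding empty results (which the formal definition leaves implicit), and derives CTermS for Example 3.2 from the Target sets.

```python
def pterms2_closure(cterms):
    """Close CTermS under pairwise intersection and difference (Method II)."""
    closed = set(cterms)
    changed = True
    while changed:
        changed = False
        for a in list(closed):
            for b in list(closed):
                for derived in (a & b, a - b):
                    if derived and derived not in closed:  # drop empty sets
                        closed.add(derived)
                        changed = True
    return closed

# Example 3.2: CTermS is the set of pairwise Target intersections (Def. 3.17),
# excluding the Targets themselves.
target_sets = [set("BCG"), set("ABCG"), set("ABCDE"), set("ABDEF"), set("ABEF")]
cterms = set()
for i, ti in enumerate(target_sets):
    for tj in target_sets[i + 1:]:
        c = frozenset(ti & tj)
        if c and set(c) not in target_sets:
            cterms.add(c)

print(sorted("".join(sorted(t)) for t in pterms2_closure(cterms)))
```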
3.2.2.4 Method IV: All CTermS Intersections

This method initially populates PTermS4 with CTermS. It then considers only the intersection of any two PTerms to be another PTerm. Formally, define $PTermS4_o$ to be the smallest set (in cardinality) that satisfies the following two conditions:

1. $PTermS4_o \supseteq CTermS$.
2. $\forall Term_{p_i}, Term_{p_j} \in PTermS4_o : Term_{p_i} \cap Term_{p_j} \in PTermS4_o$.

$$PTermS4 = PTermS4_o \cup SourceS \qquad (3.33)$$

It is easy to show that $PTermS4 \subseteq PTermS3$, and thus Method IV exhibits the shortest algorithm runtime among all four methods. Nonetheless, counterexamples showing that PTermS4 may not satisfy the definition of PTermS (Def. 3.16) do exist in some cases. Examples are explained in Sec. 3.2.6, which also provides some techniques to help check whether a Solution returned by the algorithm when using Method IV is indeed optimum, as well as possible correction techniques.

The number of potential Terms provided by Step I is, at worst, exponential. In particular,

$$|PTermS_i| \leq 2^{|SourceS|} - 1 \quad \forall i \in \{1, 2, 3, 4\}, \qquad |PTermS4_o| \leq \min\left(2^{|SourceS|} - 1,\ 2^{|TargetS|} - |TargetS| - 1\right) \qquad (3.34)$$

Nonetheless, in practice, the size of PTermS is much smaller (see Table 3.2). The actual size depends on the overlap between the different Target set contents.

3.2.3 Step II: Construct the Partial Solutions

The search Space (i.e., the set of possible Solutions) at this point consists of all combinations of all possible PS choices of all PTermS. This step aims at excluding PSs that are not needed in an OptSoln. A cost metric is thus needed to differentiate between several PSs of the same Term and to eliminate expensive PSs from the search Space. nAJ provides such a metric, as shown in the following theorems:

Theorem 3.4. Let $Soln_1$ and $Soln_2$ be two Solutions. Let also $Soln_1/\{Term_t\} = Soln_2/\{Term_t\}$ (i.e., $\forall i = 1, 2, \ldots, |TermS| \wedge i \neq t : Soln_1[Term_i] = Soln_2[Term_i]$), $Soln_1[Term_t] = PS_{t1}$, and $Soln_2[Term_t] = PS_{t2}$. Then, if $nAJ(PS_{t1})_{Soln_1} \geq nAJ(PS_{t2})_{Soln_2}$, then $Cost(Soln_1) \geq Cost(Soln_2)$. The strict and non-strict inequalities correspond respectively.

Proof. It follows from Def. 3.15 of nAJ that:

$$Cost(Soln_1) = Cost(Soln_1/\{Term_t\}) + nAJ(Term_t)_{Soln_1} \qquad (3.35)$$
$$Cost(Soln_2) = Cost(Soln_2/\{Term_t\}) + nAJ(Term_t)_{Soln_2} \qquad (3.36)$$

Since $Soln_1/\{Term_t\} = Soln_2/\{Term_t\}$, it follows that $Cost(Soln_1/\{Term_t\}) = Cost(Soln_2/\{Term_t\})$. This concludes the proof.

Corollary 3.5. Let $PS_1$ and $PS_2$ be two PSs of $Term_t$. Then, if for all possible combinations of the other Terms' PS choices $nAJ(PS_1) > nAJ(PS_2)$, then no OptSoln will use $PS_1$.

Corollary 3.6. Let $PS_1$ and $PS_2$ be two PSs of $Term_t$. Then, if for all possible combinations of the other Terms' PS choices $nAJ(PS_1) \geq nAJ(PS_2)$, then an OptSoln can be found that does not use $PS_1$.

The proofs of both Corollaries 3.5 and 3.6 follow from Theorem 3.4 as well as Def. 3.11 of OptSoln.

It is easy to show that the Cost function (Def. 3.10) cannot be used instead of nAJ in Theorem 3.4 to identify expensive PSs. In other words, let $Soln_1/\{Term_t\} = Soln_2/\{Term_t\}$, $Soln_1[Term_t] = PS_{t1}$, and $Soln_2[Term_t] = PS_{t2}$. Then, if $Cost(PS_{t1})_{Soln_1} \geq Cost(PS_{t2})_{Soln_2}$, the following inequality does not necessarily hold: $Cost(Soln_1) \geq Cost(Soln_2)$.

The following is a list of proven rules to be considered while constructing the PTermS PSs. The rules help identify and exclude PSs that are not needed while searching for an OptSoln. Lemma 3.7 will be useful in proving the rules.

Lemma 3.7. Use $s_i^{Soln_1}$ as in Eq. 3.13. Let $Term_1 \in Cone(Term_t)_{Soln_1}$. Then, if $s[Term_1]_{Soln_1/\{Term_t\}} = 0$, then $s[Term_i]_{Soln_1/\{Term_t\}} = 0\ \forall Term_i \in Cone(Term_1)_{Soln_1}$.

Proof.
By the definition of $s_i$ in Eq. 3.13, $s[Term_1]_{Soln_1/\{Term_t\}} = 0$ if $nUsed[Term_1]_{Soln_1/\{Term_t\}} > 0$. Hence, in the absence of $Term_t$ (i.e., in $Soln_1/\{Term_t\}$), $Term_1$ is still used at least once. From Def. 3.7 of nUsed and Def. 3.13 of Cone, it follows that all Terms $\in Cone(Term_1)_{Soln_1}$ will also still be used at least once in the absence of $Term_t$ (i.e., through $Term_1$). That concludes the proof.

Theorem 3.8. Rule I: Adding a wholly redundant Term to a PS always makes it more expensive (in terms of nAJ). Formally, let $Term_t, Term_1, Term_2 \in TermS$, $Term_2 \subset Term_1 \subseteq Term_t$. Let $PS_{t1}$ and $PS_{t2}$ be two PSs of $Term_t$. Let both $PS_{t1}$ and $PS_{t2}$ be the same except that $PS_{t1}$ contains $Term_1$, while $PS_{t2}$ contains $Term_1$ and $Term_2$. Then, an optimum Solution will not use $PS_{t2}$.

Proof. Let $Soln_1$ and $Soln_2$ be two Solutions such that $Soln_1/\{Term_t\} = Soln_2/\{Term_t\}$, $Soln_1[Term_t] = PS_{t1}$, and $Soln_2[Term_t] = PS_{t2}$. Let $PS_0$ be the maximal common subset of $PS_{t1}$ and $PS_{t2}$, with $|PS_0| = n_0 \geq 0$. Following the theorem text (see Fig. 3.4):

Figure 3.4: Rule I. (a) $PS_{t1}$; (b) $PS_{t2}$.

$$PS_{t1} = PS_0 \cup \{Term_1\}, \qquad PS_{t2} = PS_0 \cup \{Term_1, Term_2\} \qquad (3.37)$$

From Def. 3.15 of nAJ:

$$nAJ(PS_{t1})_{Soln_1} = C_1 + s_1^{Soln_1/\{Term_t\}}\, nAJ_o(Term_1)_{Soln_1} + \sum_{i=1}^{|Cone(Term_1) - Cone(PS_0)|} s_i^{Soln_1/\{Term_t\}}\, nAJ_o(Term_i)_{Soln_1} \qquad (3.38)$$

$$nAJ(PS_{t2})_{Soln_2} = C_2 + 1 + s_1^{Soln_2/\{Term_t\}}\, nAJ_o(Term_1)_{Soln_2} + s_2^{Soln_2/\{Term_t\}}\, nAJ_o(Term_2)_{Soln_2} + \sum_{i=1}^{|(Cone(Term_1) \cup Cone(Term_2)) - Cone(PS_0)|} s_i^{Soln_2/\{Term_t\}}\, nAJ_o(Term_i)_{Soln_2} \qquad (3.39)$$

where $C_l$ accounts for the contribution of $PS_0$ to $nAJ(PS_{tl})_{Soln_l}$ ($l \in \{1, 2\}$), as follows:

$$C_l = n_0 + \sum_{i=1}^{|Cone(PS_0)|} s_i^{Soln_l/\{Term_t\}}\, nAJ_o(Term_i)_{Soln_l} \qquad (3.40)$$

Since $Soln_1/\{Term_t\} = Soln_2/\{Term_t\}$, it follows that $C_1 = C_2$. Therefore, $nAJ(PS_{t2})_{Soln_2} - nAJ(PS_{t1})_{Soln_1} \geq 1$. The proof then follows from Corollary 3.5.

Consider Term ABCG. $PS_1 = \{A, BCG\}$ is always cheaper than $PS_2 = \{A, BCG, BC\}$. Hence, $PS_2$ should be excluded from the search Space.

Theorem 3.9. Rule II: Using a Term in a PS is always the same or cheaper (in terms of nAJ) than using all its constituent Terms. Formally, let $Term_t, Term_c, Term_{a1}, \ldots, Term_{an} \in TermS$, $Term_c \subseteq Term_t$, and $Term_c = \bigcup_{i=1}^{n} Term_{ai}$. Let $PS_{t1}$ and $PS_{t2}$ be two PSs of $Term_t$. Let both $PS_{t1}$ and $PS_{t2}$ be the same except that $PS_{t2}$ contains $Term_c$, while $PS_{t1}$ instead contains the Terms $Term_{a1}, \ldots, Term_{an}$. Then, an OptSoln can be found that does not use $PS_{t1}$.

Proof. Informally, the idea behind the theorem is: if $Term_t$ needs a set of Terms in its implementation, then it hurts nothing to join these Terms into one Term ($Term_c$) and use $Term_c$ instead. This is the same or cheaper than using the constituent Terms directly, since $Term_c$ may be used by other Terms, in which case its Cost will be shared. Formally, define $PS_{c1} = \{Term_{a1}, \ldots, Term_{an}\}$. Let $PS_0$ be the maximal common subset of $PS_{t1}$ and $PS_{t2}$, with $|PS_0| = n_0 \geq 0$. Following the theorem text:

$$PS_{t1} = PS_0 \cup PS_{c1}, \qquad PS_{t2} = PS_0 \cup \{Term_c\} \qquad (3.41)$$

The theorem can be proved by showing that for each $Soln_1$ where $Soln_1[Term_t] = PS_{t1}$, there exists another $Soln_2$ such that $Soln_2[Term_t] = PS_{t2}$ and $Cost(Soln_2) \leq Cost(Soln_1)$. To prove the latter, it is sufficient to prove the following: for each $Soln_1$ where $Soln_1[Term_t] = PS_{t1}$, there exists another $Soln_2$ such that $Soln_2/\{Term_t, Term_c\} = Soln_1/\{Term_t, Term_c\}$, $Soln_2[Term_t] = PS_{t2}$, and $Cost(Soln_2) \leq Cost(Soln_1)$. The proof hereafter will be concerned with the last statement. $Term_t$ and $Term_c$ may be referred to as $T_t$ and $T_c$ for brevity. Notice that the theorem does not specify a particular PS choice for $Term_c$.
Hence, in general, if there are k PSs for $Term_c$ in the search Space (call them $PS_{c1}, PS_{c2}, \ldots, PS_{ck}$), then define the following two sets of Solutions:

$$Soln1S = \{Soln_{1i} \mid Soln_{1i}[Term_t] = PS_{t1} \wedge Soln_{1i}[Term_c] = PS_{ci} \wedge Soln_{1i}/\{T_c\} = Soln_{1j}/\{T_c\}\ \forall Soln_{1i}, Soln_{1j} \in Soln1S\} \qquad (3.42)$$

$$Soln2S = \{Soln_{2i} \mid Soln_{2i}[Term_t] = PS_{t2} \wedge Soln_{2i}[Term_c] = PS_{ci} \wedge Soln_{2i}/\{T_c\} = Soln_{2j}/\{T_c\}\ \forall Soln_{2i}, Soln_{2j} \in Soln2S\} \qquad (3.43)$$

Note that, by definition,

$$Soln_i/\{Term_t, Term_c\} = Soln_j/\{Term_t, Term_c\}\ \forall Soln_i, Soln_j \in (Soln1S \cup Soln2S) \qquad (3.44)$$

For illustration, and without loss of generality, three particular $PS_{ci}$s are shown in Fig. 3.5 as used in Soln1S and Soln2S Solutions. Note that $PS_{c2} \cap PS_{c1} = \emptyset$ and $\emptyset \subset PS_{c3} \cap PS_{c1} \subset PS_{c1}$. The theorem can be proved (i.e., $PS_{t1}$ can be omitted from the search Space) if the following statement can be proved (for all Soln1S and Soln2S Solutions):

$$\exists Soln_{2i} \in Soln2S : Cost(Soln_{2i}) \leq \min_{j=1}^{|Soln1S|} Cost(Soln_{1j}) \qquad (3.45)$$

Informally, if a Solution exists that uses $PS_{t2}$ and whose Cost is the same as or lower than that of all Solutions that use $PS_{t1}$ instead, then $PS_{t1}$ can be omitted from the search Space. The claim is that $Soln_{21}$ satisfies the above condition. To prove it, extend Def. 3.15 of nAddedJoins to more than one Term (namely, $Term_t$ and $Term_c$); similar to Eq. 3.35, the following holds for any $Soln_i$:

$$Cost(Soln_i) = nAJ(Term_t, Term_c)_{Soln_i} + Cost(Soln_i/\{Term_t, Term_c\}) \qquad (3.46)$$

From Eq. 3.44, it follows that, to prove the statement of 3.45, it suffices to prove the following:

$$\exists Soln_{2i} \in Soln2S : nAJ(Term_t, Term_c)_{Soln_{2i}} \leq \min_{j=1}^{|Soln1S|} nAJ(Term_t, Term_c)_{Soln_{1j}} \qquad (3.47)$$

$nAJ(Term_t, Term_c)$ in the different Soln1S and Soln2S Solutions can be defined as follows (refer to Fig. 3.5):

$$nAJ(Term_t, Term_c)_{Soln_{1i}} = C + n - 1 + \sum_{j=1}^{|Cone(PS_{c1}) - Cone(PS_0)|} s_j^{Soln_{1i}/\{T_t, T_c\}}\, nAJ_o(Term_j) + u[Term_c]\,(|PS_{ci}| - 1) + u[Term_c] \sum_{j=1}^{|Cone(PS_{ci}) - (Cone(PS_{c1}) \cup Cone(PS_0))|} s_j^{Soln_{1i}/\{T_t, T_c\}}\, nAJ_o(Term_j) \qquad (3.48)$$

$$nAJ(Term_t, Term_c)_{Soln_{21}} = C + n - 1 + \sum_{j=1}^{|Cone(PS_{c1}) - Cone(PS_0)|} s_j^{Soln_{21}/\{T_t, T_c\}}\, nAJ_o(Term_j) \qquad (3.49)$$

where $u_c$ (or $u[Term_c]$), $s_j$ (or $s[Term_j]$), and C are defined as in Equations 3.12, 3.13, and 3.40, respectively.

Figure 3.5: Rule II. (a) $Soln_{11}$; (b) $Soln_{12}$; (c) $Soln_{13}$; (d) $Soln_{21}$; (e) $Soln_{22}$; (f) $Soln_{23}$.

It is clear from Equations 3.48 and 3.49 that $Soln_{21}$ indeed meets the existential condition of 3.47. In particular,

$$nAJ(Term_t, Term_c)_{Soln_{21}} \leq \min_{j=1}^{|Soln1S|} nAJ(Term_t, Term_c)_{Soln_{1j}} \qquad (3.50)$$

That concludes the proof. Note that in Equations 3.48 and 3.49, $u[Term_t]$ is implicitly set to one. In other words, $Term_t$ is, without loss of generality, assumed to be useful in all Soln1S and Soln2S Solutions. From Eq. 3.44 and Def. 3.8 of usefulness, it is clear that if $Term_t$ is useful in one Solution in $Soln1S \cup Soln2S$, then it is also useful in all of them. Proving the theorem in case $Term_t$ is not useful is trivial, since in that case $Term_t$ has no effect on the Cost of the Soln1S and Soln2S Solutions. In other words, $\forall Soln_{1i} \in Soln1S, Soln_{2i} \in Soln2S : Cost(Soln_{1i}) = Cost(Soln_{2i})$, which meets the existential condition of 3.45.

Consider Term ABCG. $PS_1 = \{A, BCG\}$ is always the same or cheaper than $PS_2 = \{A, BC, G\}$. Hence, $PS_2$ can be excluded from the search Space.

Theorem 3.10. Rule III: Using a Source in a PS is always the same or cheaper (in terms of nAJ) than using any other non-Source Term. Formally, let $Term_1, Term_2, Term_t \in TermS$, $Term_1 \in SourceS$, and $Term_2 \notin SourceS$. Let also $Term_1, Term_2 \subseteq Term_t$. Let $PS_{t1}$ and $PS_{t2}$ be two PSs of $Term_t$. Let both $PS_{t1}$ and $PS_{t2}$ be the same except that $PS_{t1}$ contains $Term_1$, while $PS_{t2}$ contains $Term_2$ instead.
Then, an OptSoln can be found that does not use $PS_{t2}$.

Proof. Let $Soln_1$ and $Soln_2$ be two Solutions such that $Soln_1/\{Term_t\} = Soln_2/\{Term_t\}$, $Soln_1[Term_t] = PS_{t1}$, and $Soln_2[Term_t] = PS_{t2}$. Let $PS_0$ be the maximal common subset of $PS_{t1}$ and $PS_{t2}$, with $|PS_0| = n_0 > 0$. Following the theorem text and Lemma 3.7:

$$PS_{t1} = PS_0 \cup \{Term_1\}, \qquad PS_{t2} = PS_0 \cup \{Term_2\} \qquad (3.51)$$

$$nAJ(PS_{t1})_{Soln_1} = C + s_1^{Soln_1/\{Term_t\}}\, nAJ_o(Term_1)_{Soln_1} + s_1^{Soln_1/\{Term_t\}} \sum_{i=1}^{|Cone(Term_1) - Cone(PS_0)|} s_i^{Soln_1/\{Term_t\}}\, nAJ_o(Term_i)_{Soln_1} \qquad (3.52)$$

$$nAJ(PS_{t2})_{Soln_2} = C + s_2^{Soln_2/\{Term_t\}}\, nAJ_o(Term_2)_{Soln_2} + s_2^{Soln_2/\{Term_t\}} \sum_{i=1}^{|Cone(Term_2) - Cone(PS_0)|} s_i^{Soln_2/\{Term_t\}}\, nAJ_o(Term_i)_{Soln_2} \qquad (3.53)$$

where C reflects the contribution of $PS_0$ to $nAJ(PS_{t1})_{Soln_1}$ (or, equivalently, to $nAJ(PS_{t2})_{Soln_2}$), computed as in Eq. 3.40. Since $Term_1$ is a Source, $nAJ_o(Term_1)_{Soln_1} = 0$ (Eq. 3.11). Also, $Cone(Term_1) - Cone(PS_0) = \emptyset$. Hence, $nAJ(PS_{t1})_{Soln_1} = C$, from which $nAJ(PS_{t2})_{Soln_2} \geq nAJ(PS_{t1})_{Soln_1}$. The proof then follows from Corollary 3.6.

Consider Term ABCG in Example 3.2. $PS_1 = \{BCG, A\}$ is always the same or cheaper than $PS_2 = \{BCG, AB\}$. Hence, $PS_2$ can be excluded from the search Space.

Definition 3.18. Target-image Term or TITerm: $Term_i$ is a TITerm if $(Term_i \in PTermS) \wedge (\exists Target_j \in TargetS : Target_j = Term_i)$. Also, define TITermS to be the set of all Target-image Terms. For Example 3.2 and $SG_1$ of Fig. 3.3: Term BCG (TermID 8) is a TITerm, since it is an image of Target BCG (TermID 1) associated with ONode X1.³

³TermIDs are listed in Table 3.1.

Theorem 3.11. Rule IV: Using a TITerm in a PS is always the same or cheaper (in terms of nAJ) than using any other non-TITerm. Formally, let $Term_1, Term_2, Term_t \in TermS$, $Term_1 \in TITermS$, and $Term_2 \notin TITermS$. Let also $Term_1, Term_2 \subseteq Term_t$. Let $PS_{t1}$ and $PS_{t2}$ be two PSs of $Term_t$. Let both $PS_{t1}$ and $PS_{t2}$ be the same except that $PS_{t1}$ contains $Term_1$, while $PS_{t2}$ contains $Term_2$ instead. Then, an OptSoln can be found that does not use $PS_{t2}$.

Proof. Let $Soln_1$ and $Soln_2$ be two Solutions such that $Soln_1/\{Term_t\} = Soln_2/\{Term_t\}$, $Soln_1[Term_t] = PS_{t1}$, and $Soln_2[Term_t] = PS_{t2}$. Following the theorem text, $nAJ(PS_{t1,2})$ can be expressed as in Equations 3.52 and 3.53 used in the proof of Theorem 3.10, respectively. Since $Term_1$ is a TITerm, then by Def. 3.18, $\exists Target_j \in TargetS : Target_j = Term_1$. Based on PS construction Rule II (i.e., Theorem 3.9), $Soln_1[Target_j] = Soln_2[Target_j] = \{Term_1\}$. Hence, $nUsed[Term_1]_{Soln_1} \geq 1$. It follows from the theorem text that $Target_j \neq Term_t$, and therefore $nUsed[Term_1]_{Soln_1/\{Term_t\}} \geq 1$. From the definition of $s_i$ in Eq. 3.13, it follows that $s_1^{Soln_1/\{Term_t\}} = 0$, and hence $nAJ(PS_{t1})_{Soln_1} = C$, from which $nAJ(PS_{t2})_{Soln_2} \geq nAJ(PS_{t1})_{Soln_1}$. The proof then follows from Corollary 3.6.

Definition 3.19. AddedCoverage (or, for short, ACov): $ACov(Term_i, PS_t) : TermS \times 2^{TermS} \to 2^{INodeS}$. A function that returns the letters (i.e., INodes) covered by $Term_i \in PS_t$ and not covered by any other Term in $PS_t$. Formally,

$$ACov(Term_i, PS_t) = Term_i - \bigcup_{j=1, j \neq i}^{|PS_t|} Term_j$$

Definition 3.20. Redundant PS: $PS_t$ is called a redundant PS if

$$\exists Term_i \in PS_t : |ACov(Term_i, PS_t)| = 0 \ \vee\ (|ACov(Term_i, PS_t)| = 1 \wedge Term_i \notin SourceS)$$

In such a case, $Term_i$ is called a redundant Term in $PS_t$.

Corollary 3.12. An OptSoln exists that does not use redundant PSs.

Proof. The proof follows directly from Rules I and III (i.e., Theorems 3.8 and 3.10, respectively).

Algorithm 1 takes all four rules into account while constructing the PSs. It takes five arguments:

- $Term_t$: the Term to be constructed.
- PSTermS: the contents (thus far) of the PS being constructed.
- Required: a subset of $Term_t$, consisting of the INodes that have not yet been covered in the current PS. Initially, Required consists of all the INodes in $Term_t$.
- RTermS (or Relevant Terms): a set of Terms from which a PS of $Term_t$ can be built. RTermS is initialized with $\{Term_i \in PTermS \mid Term_i \subset Term_t \wedge TermID_i \neq TermID_t\}$. By Def. 3.5 of a PS, a PTerm cannot be used to construct itself. Also, Targets cannot be used to construct any Term; nonetheless, Target-image Terms (Def. 3.18) can construct their corresponding Targets.
- ERTermS (or Essential Relevant Terms): a set initialized with $(SourceS \cup TITermS) \cap RTermS$.

Algorithm 1 runs (recursively) on each $Term_t \in (TargetS \cup PTermS)$. For each $Term_t$, it is initially called with Required = $Term_t$, PSTermS = $\emptyset$, and the appropriate RTermS and ERTermS. PS and PSTermS may be used interchangeably in the algorithm description.

Lines 1-15 check whether a single Source or a single TITerm exists that can cover all the letters (i.e., INodes) in Required. If this is the case, the Source or the TITerm is added to the current PSTermS, and the algorithm returns without further need to search for cheaper PSs (Rules III and IV). If there is no single Source or TITerm that can cover all the letters in Required, the algorithm tries to cover them using all possible non-redundant combinations of the Terms in RTermS. First, Lines 17-20 check whether a PS can indeed be found using the current set of RTermS. If yes, the first Term in RTermS (call it $RTerm_i$) is picked and removed from RTermS. Lines 23-27 check whether adding $RTerm_i$ to the current PS would cause any redundancy (see Def. 3.20 of redundant PSs). If it causes redundancy, the next RTerm is picked instead. If not, the algorithm will find all possible PSs in which $RTerm_i$ is used. To do that, the algorithm creates new Required1, PSTermS1, and RTermS1 structures that are modified copies of Required, PSTermS, and RTermS, respectively, based on the fact that $RTerm_i$ is used (Lines 28-30). If adding $RTerm_i$ to the current PS covers all the letters in Required (Line 31), then PSTermS1 is a complete PS. The PS is stored (Line 32) and the algorithm picks the next RTerm. If PSTermS1 is not yet complete (i.e., Required1 is not empty), the algorithm recursively calls FindPSs (Line 36). However, adding $RTerm_i$ to PSTermS1 typically renders redundant (Def. 3.20) some of the Terms in RTermS1. Hence, Line 35 filters out such redundant RTerms (and also applies Rule II) before recursively calling FindPSs.

As an upper bound, Algorithm 1 may have to visit all possible combinations of RTermS. Hence, its complexity is $O(2^{|RTermS|})$, and the number of PSs per $Term_t$ ($\notin SourceS$) is bounded by:

$$|PSS[Term_t]| \leq 2^{|RTermS(\text{of } Term_t)|} - 1, \qquad |RTermS(\text{of } Term_t)| \leq 2^{|Term_t|} - 1 \qquad (3.54)$$

Nonetheless, in practice, the algorithm is much faster than (and the number of PSs is much smaller than) exponential.

Algorithm 1 FindPSs(Termt, Required, RTermS, ERTermS, PSTermS)
1: RequiredIsCoveredByAnETerm = 0
2: for each ERTermi ∈ ERTermS do
3:   if ERTermi ⊇ Required then  // i.e., a Source or a TITerm can cover Required - Rules III, IV
4:     if adding ERTermi causes the PS to be redundant then  // Def. 3.20
5:       return
6:     end if
7:     RequiredIsCoveredByAnETerm = 1
8:     CoveringERTerm = ERTermi
9:   end if
10: end for
11: if RequiredIsCoveredByAnETerm then
12:   PSTermS = PSTermS ∪ {CoveringERTerm}
13:   AddThisPS(PSTermS, Termt)
14:   return
15: end if
16: while |RTermS| > 0 do
17:   RTermSUnion = ⋃ over all RTermi ∈ RTermS
18:   if RTermSUnion ⊉ Required then  // a PS cannot be constructed from the remaining RTermS
19:     return
20:   end if
21:   take and remove the first Term, RTermi, from RTermS
22:   ERTermS = ERTermS − {RTermi}
23:   ACov = RTermi ∩ Required
24:   if |ACov| > 1 ∨ (|ACov| = 1 ∧ |Required| = 1) then
25:     if adding RTermi causes the PS to be redundant then
26:       continue
27:     end if
28:     Required1 = Required − RTermi
29:     PSTermS1 = PSTermS ∪ {RTermi}
30:     RTermS1 = RTermS
31:     if |Required1| = 0 then  // i.e., all letters covered
32:       AddThisPS(PSTermS1, Termt)
33:       continue
34:     end if
35:     filter RTermS1 because of adding RTermi
36:     FindPSs(Termt, Required1, RTermS1, ERTermS, PSTermS1)
37:   end if
38: end while
39: return

This is because not all RTerm combinations are PSs. Also, applying Rules I, II, III, and IV, as well as the RTermSUnion check (Line 18), eliminates a substantial part of the RTerm combinations. Table 3.3 shows the reduction in the search Space due to applying the four rules of Step II for sample problems. For Example 3.2, the PSs computed by Algorithm 1 are listed in Table 3.1.

3.2.4 Step III: Collect Space Metrics and Remove Higher-nAJ Partial Solutions

Theorem 3.3 narrowed down the search Space by confining the number of candidate Terms. Furthermore, Theorems 3.4 through 3.12 reduced their possible corresponding PSs. At this point, the search Space of the problem consists of all the remaining possible PS choices of all the candidate Terms. This step aims at further pruning the search Space by computing upper and lower bound nAJ values for the different PSs and eliminating expensive PSs.

The value of $nAJ(PS_t)$ is Solution-dependent (e.g., a Solution that provides sharing to the constituent Terms of $PS_t$ will reduce its nAJ, and vice versa). Nonetheless, by calculating the maximum and minimum possible sharing (in any Solution in the search Space) of the $PS_t$ constituent Terms (called $nUsedMax[Term_i]$ and $nUsedMin[Term_i]$, respectively), the lower and upper bounds of $nAJ(PS_t)$ (called $nAJMin(PS_t)$ and $nAJMax(PS_t)$, respectively) can be computed. By comparing such bounds of different PSs, some PSs can be found too expensive and thus omitted from the search Space.

This step is iterative. Omitting some Term PSs can affect the max/min usage (sharing) of the Terms constituting these PSs. This, in turn, affects the nAJ lower/upper bounds of other PSs that use these Terms, allowing for further reduction. At the end of each iteration, more areas of the search Space can be eliminated. When the algorithm can make no more eliminations, it goes to the next step. The following are the definitions of the basic data structures and functions associated with the search Space (also referred to as metrics):

Definition 3.21. PSS: $PSS[Term_t]_{S_k}$ is the set of $Term_t$ PSs in the search Space $S_k$.

Definition 3.22. Usable Term: A Term is usable in a search Space if it is useful (Def. 3.8) in at least one Solution in that Space. Formally, $Term_i$ is usable in search Space $S_k$ if $\exists Soln_i \in S_k : Term_i \in UsefulTermS(Soln_i)$.
Formally, nUsedMax[Termi] Sk mjSakxj j=1 nUsed[Termi] Solnj8Solnj 2 Sk nUsedMax[Termi] Sk is recursively de ned as the number of Termts in the search Space, Sk, that satisfy the following two conditions: 1. 9PSt 2 PSS[Termt] Sk : Termi 2 PSt. 2. Termt is usable in Sk. Table 3.1 shows the initial values of nUsedMax of di erent Terms in Example 3.2. At the end of each iteration, some PSs are omitted from the search Space, and hence, the value of nUsedMax of some Terms will decrease. De nition 3.24. Essential Term or ETerm Termt is an essential Term in a search Space if it is useful in all that Space Solutions. Formally, Termi is an ETerm Sk if 8Solni 2 Sk : Termi 2 UsefulTermS(Solni) All Targets are ETerms in all Spaces. ETermS Sk is de ned to be the set of all ETerms in Space Sk. De nition 3.25. Essential Child or EChild Termi is said to be an essential child of Termt in search Space Sk i all the following conditions are satis ed: 1. 8PSt 2 PSS[Termt] Sk : Termi 2 PSt. 2. Termt is usable in Sk. Also, de ne EChildren[Termt] Sk to be all EChild Terms of Termt in search Space Sk. De nition 3.26. nUsedMin A vector of numbers where TermIDs are used as indices. nUsedMin[Termi] Sk provides a lower bound on the minimum possible sharing of Termi in any Solution in the search Space, Sk. Formally, nUsedMin[Termi] Sk jSkj min j=1 nUsed[Termi] Solnj8Solnj 2 Sk nUsedMin[Termi] Sk is recursively de ned as the number of Termts in the search Space, Sk, that satisfy the following two conditions: 49 1. Termi is an EChild of Termt in Sk. 2. Termt is an ETerm in Sk. The calculation of nUsedMin in a search Space starts by the fact that all Targets are essential Terms (ETerms) in any search Space. Propagation of essentiality then takes place. If Termt is an ETerm, then all its EChildren will also be ETerms (increasing their nUsedMin by 1). Table 3.1 shows the initial values of nUsedMin of di erent Terms in Example 3.2. At the end of each iteration, more PSs are omitted and more Terms become ETerms, and hence, their nUsedMin increase. De nition 3.27. nAJMax(PS) nAJMax(PSt) Sk is an upper bound on the maximum value of nAJ(PSt) in all Solutions of the search Space Sk. Formally, nAJMax(PSt) Sk maxjSkj j=1 nAJ(PSt) Solnj . nAJ(PSt) is maximized in a Solution when the Solution provides minimum sharing to the constituent Terms of PSt. Calculation of the exact maximum value of nAJ(PSt) in all Solutions of a given search Space can be computation expensive. On the other extreme, a very conservative approximation for the upper bound can be easily computed but will provide too little selectivity (i.e., to nd and omit expensive PSs). Between these two extremes, nAJMax(PSt) Sk can be computed as follows. Let PSt1 be a PS of Termt, then: nAJMax(PSt1) Sk =jPSt1j 1+ jPXSt1j i=1 si max;Sk nAJMaxo(Termi) Sk (3.55) nAJMaxo(PSt1) Sk =jPSt1j 1+ jPXSt1j i=1 nAJMaxo(Termi) Sk (3.56) nAJMaxo(Termt) Sk = PSS[Termt] Sk max i=1 nAJMaxo(PSti) Sk (3.57) where 50 si max;Sk = 8>< >: 1 nUsedMin[Termi] Sk=fTermtg = 0 0 nUsedMin[Termi] Sk=fTermtg > 0 (3.58) where nUsedMin[Termi] Sk=fTermtg = the number of Termf s (where Termf 6= Termt) that satisfy the following two conditions: 1. Termi is an EChild of Termf in Sk. 2. Termf is an ETerm in Sk. Note that the above de nition of nAJMax(PSt) Sk will provide a value that is the same or greater than the exact maximum value of nAJ(PSt) in all Solutions of Sk. De nition 3.28. nAJMin(PS) nAJMin(PSt) Sk is a lower bound on the minimum value of nAJ(PSt) in all Solutions of the search Space Sk. 
Definition 3.27. nAJMax(PS). nAJMax(PS_t)|_Sk is an upper bound on the maximum value of nAJ(PS_t) over all Solutions of the search Space Sk. Formally,

nAJMax(PS_t)|_Sk ≥ max_{j=1..|Sk|} nAJ(PS_t)|_Soln_j

nAJ(PS_t) is maximized in a Solution when the Solution provides minimum sharing to the constituent Terms of PS_t. Calculating the exact maximum value of nAJ(PS_t) over all Solutions of a given search Space can be computationally expensive. At the other extreme, a very conservative approximation of the upper bound can be computed easily but provides too little selectivity (i.e., ability to find and omit expensive PSs). Between these two extremes, nAJMax(PS_t)|_Sk can be computed as follows. Let PS_t1 be a PS of Term_t; then:

nAJMax(PS_t1)|_Sk = |PS_t1| − 1 + Σ_{i=1..|PS_t1|} s_{i,max}|_Sk · nAJMaxo(Term_i)|_Sk    (3.55)

nAJMaxo(PS_t1)|_Sk = |PS_t1| − 1 + Σ_{i=1..|PS_t1|} nAJMaxo(Term_i)|_Sk    (3.56)

nAJMaxo(Term_t)|_Sk = max_{i=1..|PSS[Term_t]|_Sk|} nAJMaxo(PS_ti)|_Sk    (3.57)

where

s_{i,max}|_Sk = 1 if nUsedMin[Term_i]|_(Sk/{Term_t}) = 0, and 0 if nUsedMin[Term_i]|_(Sk/{Term_t}) > 0    (3.58)

and where nUsedMin[Term_i]|_(Sk/{Term_t}) is the number of Term_f's (with Term_f ≠ Term_t) that satisfy the following two conditions:
1. Term_i is an EChild of Term_f in Sk.
2. Term_f is an ETerm in Sk.
Note that the above definition of nAJMax(PS_t)|_Sk yields a value greater than or equal to the exact maximum value of nAJ(PS_t) over all Solutions of Sk.

Definition 3.28. nAJMin(PS). nAJMin(PS_t)|_Sk is a lower bound on the minimum value of nAJ(PS_t) over all Solutions of the search Space Sk. Formally,

nAJMin(PS_t)|_Sk ≤ min_{j=1..|Sk|} nAJ(PS_t)|_Soln_j

nAJ(PS_t) is minimized in a Solution when the Solution provides maximum sharing to the constituent Terms of PS_t. nAJMin(PS_t)|_Sk can be computed as follows. Let PS_t1 be a PS of Term_t; then:

nAJMin(PS_t1)|_Sk = |PS_t1| − 1 + Σ_{i=1..|PS_t1|} s_{i,min}|_Sk · nAJMino(Term_i)|_Sk    (3.59)

nAJMino(PS_t1)|_Sk = |PS_t1| − 1    (3.60)

nAJMino(Term_t)|_Sk = min_{i=1..|PSS[Term_t]|_Sk|} nAJMino(PS_ti)|_Sk    (3.61)

where

s_{i,min}|_Sk = 1 if nUsedMax[Term_i]|_(Sk/{Term_t}) = 0, and 0 if nUsedMax[Term_i]|_(Sk/{Term_t}) > 0    (3.62)

and where nUsedMax[Term_i]|_(Sk/{Term_t}) is the number of Term_f's (with Term_f ≠ Term_t) that satisfy the following two conditions:
1. ∃PS_f ∈ PSS[Term_f]|_Sk : Term_i ∈ PS_f.
2. Term_f is usable in Sk.
Note that the above definition of nAJMin(PS_t)|_Sk yields a value less than or equal to the exact minimum value of nAJ(PS_t) over all Solutions of Sk.

Conditions that are more restrictive, yet easier to check, than those of Corollaries 3.5 and 3.6 are stated in the following corollaries:

Corollary 3.13. Let PS_1 and PS_2 be two PSs of Term_t in Space Sk. If nAJMin(PS_1) > nAJMax(PS_2), then no OptSoln will use PS_1.

Corollary 3.14. Let PS_1 and PS_2 be two PSs of Term_t in Space Sk. If nAJMin(PS_1) ≥ nAJMax(PS_2), then an OptSoln can be found that does not use PS_1.
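As a small illustration of the Corollary 3.14 test (the same comparison Algorithm 2 applies below in Lines 18 through 22), the following Python sketch prunes, among the PSs of one Term, any PS whose nAJMin meets or exceeds another PS's nAJMax. The bounds are made-up numbers, and the names are illustrative only.

def prune_pss(bounds):
    """bounds: dict PS-name -> (nAJMin, nAJMax); return surviving PS names."""
    keep = set(bounds)
    for ps1, (lo1, _) in bounds.items():
        # If some other PS is guaranteed no more expensive, ps1 is expendable.
        if any(ps2 != ps1 and lo1 >= hi2
               for ps2, (_, hi2) in bounds.items()):
            keep.discard(ps1)
    return keep

print(prune_pss({"PS_a": (3, 5), "PS_b": (1, 2), "PS_c": (3, 4)}))
# {'PS_b'}: nAJMin of PS_a and of PS_c (3) >= nAJMax of PS_b (2), so both drop.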
Theorem 3.15. Rule V. A Term that is used at most once in any Solution of a given search Space can be omitted from that search Space. Formally, if nUsedMax[Term_c]|_Sk = 1, then an OptSoln can be found without using Term_c.

Proof. The proof is a special case of Rule II (Theorem 3.9). Informally, the idea behind the theorem is that if Term_t is the only Term (remaining) in the search Space that may need a certain set of Terms in its implementation, then it saves nothing to join these Terms into one Term (Term_c) and use Term_c instead; it saves nothing because Term_c is not shared with any other Term. Formally, let Term_t be the only Term in Sk that may use Term_c (note that nUsedMax[Term_c]|_Sk = 1). Without loss of generality, define PS_t1 to represent the form of any PS of Term_t that uses Term_c, as follows:

PS_t1 = PS_0 ∪ {Term_c}    (3.63)

The theorem is proved if it can be shown that for each Soln_1 with Soln_1[Term_t] = PS_t1, there exists another Soln_2 such that Soln_2[Term_t] = PS_t2, where Term_c ∉ PS_t2 and Cost(Soln_2) = Cost(Soln_1). To prove the latter statement, it is sufficient to prove the following: for each Soln_1 with Soln_1[Term_t] = PS_t1, there exists another Soln_2 such that Soln_2/{Term_t} = Soln_1/{Term_t}, Soln_2[Term_t] = PS_t2 = PS_0 ∪ PS_ci (where Soln_1[Term_c] = PS_ci), and Cost(Soln_2) = Cost(Soln_1). The proof hereafter is concerned with this last statement. PS_t1 and PS_t2 are depicted in Fig. 3.6 (note that Term_c is not useful in Soln_2). From Def. 3.15 of nAJ:

nAJ(PS_t1)|_Soln_1 = C + s_c|_(Soln_1/{Term_t}) · nAJo(Term_c) + Σ_{i=1..|Cone(Term_c) − Cone(PS_0)|} s_i|_(Soln_1/{Term_t}) · nAJo(Term_i)    (3.64)

[Figure 3.6: Rule V. (a) PS_t1; (b) PS_t2.]

where C reflects the contribution of PS_0 to nAJ(PS_t1)|_Soln_1 (or, equivalently, to nAJ(PS_t2)|_Soln_2) and is computed as in Eq. 3.40. From Lemma 3.7, the definition of nAJo in Eq. 3.11, and Def. 3.13 of Cone, it follows that:

nAJ(PS_t1)|_Soln_1 = C + s_c|_(Soln_1/{Term_t}) · (|PS_ci| − 1) + s_c|_(Soln_1/{Term_t}) · Σ_{i=1..|Cone(PS_ci) − Cone(PS_0)|} s_i|_(Soln_1/{Term_t}) · nAJo(Term_i)    (3.65)

nAJ(PS_t2)|_Soln_2 = C + (|PS_ci| − 1) + Σ_{i=1..|Cone(PS_ci) − Cone(PS_0)|} s_i|_(Soln_2/{Term_t}) · nAJo(Term_i)    (3.66)

Since s_c|_(Soln_1/{Term_t}) = 1, it follows that nAJ(PS_t2)|_Soln_2 = nAJ(PS_t1)|_Soln_1. That concludes the proof.

Definition 3.29. Rule V Transformation. Let nUsedMax[Term_c]|_Sk = 1 and let Term_t be the only Term in Sk that may use Term_c. Define OldPSS ⊆ PSS[Term_t]|_Sk to be the set of all Term_t PSs remaining in Sk that use Term_c. Formally,

OldPSS = { PS_ti | PS_ti ∈ PSS[Term_t]|_Sk ∧ Term_c ∈ PS_ti }    (3.67)

Let PSS[Term_c]|_Sk = {PS_c1, ..., PS_cn}. Then the following transformation will be referred to as the Rule V transformation. Replace each PS_ti ∈ OldPSS with n PSs (PS_ti1, ..., PS_tin), where PS_tij is defined as follows: if

PS_ti = PS_0 ∪ {Term_c}

then

PS_tij = PS_0 ∪ PS_cj

The transformation has the potential of rendering many PS_tij's redundant (see Def. 3.20); such PSs are then omitted from the search Space. This, in turn, updates the nUsedMin and nUsedMax structures of the constituent Terms of these PSs. Hence, the transformation can affect the nAJMin and nAJMax of other PSs that use these Terms, allowing for more Space reduction using Corollary 3.14.
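A hypothetical Python rendering of the Rule V transformation on the same toy model follows; the redundancy filtering (Def. 3.20) and the subsequent metric updates are only noted in comments, and none of the names come from the actual tool.

def rule_v_transform(space, term_t, term_c):
    """Expand term_c inline: PS0 ∪ {term_c} becomes PS0 ∪ PScj for each PScj."""
    new_pss = []
    for ps in space[term_t]:
        if term_c in ps:
            ps0 = ps - {term_c}
            for ps_c in space[term_c]:        # every PS of term_c
                new_pss.append(frozenset(ps0 | ps_c))
        else:
            new_pss.append(ps)
    # Deduplicate; a full version would also drop redundant PSs (Def. 3.20),
    # retire term_c itself, and refresh the nUsedMin/nUsedMax metrics of the
    # affected constituent Terms.
    space[term_t] = list(set(new_pss))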
Algorithm 2 iteratively collects and updates the search Space metrics. It makes use of Corollaries 3.13 and 3.14 as well as Rule V (Theorem 3.15) and its transformation (Def. 3.29) to refine the search Space. It incorporates the following data structures:

- nAJMino/Maxo[Term_t]|_Sk: a vector that stores nAJMino/nAJMaxo of every Term_t in search Space Sk, respectively.
- PSnAJMin/Max[Term_t][PS_ti]|_Sk and PSnAJMino/Maxo[Term_t][PS_ti]|_Sk: two two-dimensional structures that store nAJMin/nAJMax and nAJMino/nAJMaxo of every PS_ti of every Term_t in search Space Sk, respectively.
- UT|_Sk: a set of Terms whose (or whose PSs') nAJMin(o)/nAJMax(o) values need to be updated. The Terms are ordered within the set by their cardinalities, from the largest to the smallest. UT is initialized with (TargetS ∪ PTermS − SourceS).
- UPSMin/Max[Term_t]|_Sk: sets of PSs of Term_t whose nAJMin(o)/nAJMax(o) values need to be updated, respectively. They are initialized with PSS[Term_t]|_Sk.
- PSR|_Sk: a set of PSs that are scheduled to be removed from the search Space Sk.

At this point, the current search Space consists of all the remaining possible PS choices of (TargetS ∪ PTermS). The suffix Sk is omitted in Algorithm 2, since it is implied that all data structures and functions are calculated for the current search Space.

Algorithm 2 Collect Space Metrics and Remove Higher nAJ Partial Solutions
1:  while |UT| ≥ 1 do
2:    Get and remove the last element in UT, Term_t
3:    if nUsedMax[Term_t] ≥ 1 then  // Term_t is usable
4:      if nUsedMax[Term_t] = 1 then
5:        Apply the Rule V transformation
6:        continue
7:      end if
8:      OldnAJMino/Maxo = nAJMino/Maxo[Term_t]
9:      OEChildren = EChildren[Term_t]
10:     for each PS_ti in UPSMin[Term_t] do
11:       Update PSnAJMin[Term_t][PS_ti] and PSnAJMino[Term_t][PS_ti]
12:       Remove PS_ti from UPSMin[Term_t]
13:     end for
14:     for each PS_ti in UPSMax[Term_t] do
15:       Update PSnAJMax[Term_t][PS_ti] and PSnAJMaxo[Term_t][PS_ti]
16:       Remove PS_ti from UPSMax[Term_t]
17:     end for
18:     for all PS_ti and PS_tj of Term_t do
19:       if PSnAJMin[Term_t][PS_ti] ≥ PSnAJMax[Term_t][PS_tj] then
20:         PSR.insert(PS_ti)
21:       end if
22:     end for
23:     if |PSR| ≥ 1 then  // Some PSs are to be removed
24:       Remove PSs And Update nUsedMax
25:       if nUsedMin[Term_t] ≥ 1 then  // ETerm
26:         NEChildren = EChildren[Term_t] − OEChildren
27:         Update nUsedMin Because Of NEChildren
28:       end if
29:     end if
30:     Calculate and store NewnAJMino/Maxo of Term_t
31:     Compare them with OldnAJMino/Maxo, respectively
32:     if NewnAJMaxo ≠ OldnAJMaxo then
33:       Determine which PSs (of other Terms) need their nAJMax updated
34:       Update UT and UPSMax accordingly
35:     end if
36:     if NewnAJMino ≠ OldnAJMino then
37:       Determine which PSs (of other Terms) need their nAJMin updated
38:       Update UT and UPSMin accordingly
39:     end if
40:   end if
41: end while
42: return

Algorithm 2 starts with UT initialized to (TargetS ∪ PTermS − SourceS). Line 2 picks the smallest Term in UT, Term_t. Lines 4 to 7 check whether Term_t is used only once in the search Space and, if so, apply the Rule V transformation. The procedure in Line 5 also updates UT and UPSMin/Max with the Terms and PSs (respectively) whose nAJ values need to be updated in a subsequent iteration because of the transformation. Lines 8 and 9 store the old values of Term_t's nAJMaxo, nAJMino, and EChildren before any update is made. Lines 10 through 13 (Lines 14 through 17) update nAJMin(o) (nAJMax(o)) of the PSs of Term_t specified in UPSMin[Term_t] (UPSMax[Term_t]), respectively. Lines 18 through 22 apply Corollary 3.14 to prune out expensive PSs; PSs to be removed are stored in PSR. The procedure of Line 24 propagates the effect of removing a PS, PS_t, of Term_t to the nUsedMax of some (or all) of PS_t's constituent Terms (and possibly their corresponding constituent Terms as well; see Def. 3.23 of nUsedMax). This, in turn, can affect the nAJMin(o) of other PSs that use these Terms. The affected Terms and PSs are added to UT and UPSMin, respectively, so that they are updated in a following iteration of the algorithm. Removing PSs from Term_t may not only affect the nUsedMax of the constituent Terms but may also add to EChildren[Term_t]. If Term_t is an ETerm and it gained new EChildren in this iteration, then its new EChildren will also become ETerms. This is handled in Lines 25 through 28 of Algorithm 2. The procedure of Line 27 propagates the effect of essentiality to the nUsedMin of the new EChildren of Term_t (and of their corresponding EChildren as well).
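As a rough illustration of the Line 24 procedure, the sketch below removes the scheduled PSs of Term_t and re-queues, in UT, the Terms whose bounds the removal may have staled; the data layout and helper names are assumptions for illustration, not the dissertation's implementation.

def remove_pss_and_update(space, term_t, psr, ut):
    # Drop the scheduled PSs of term_t (the PSR set of Algorithm 2).
    space[term_t] = [ps for ps in space[term_t] if ps not in psr]
    # Terms that appeared in a removed PS may have lost one potential user.
    affected = set().union(*psr) if psr else set()
    for term_i in affected:
        if not any(term_i in ps for ps in space[term_t]):
            # term_t no longer counts toward nUsedMax[term_i]; revisit every
            # Term that can still use term_i so its nAJMin bounds get refreshed.
            for term_f, pss in space.items():
                if any(term_i in ps for ps in pss):
                    ut.add(term_f)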
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6668tzz |



