| Title | COGENE: an automated design framework for domain-specific architectures |
| Publication Type | dissertation |
| School or College | College of Engineering |
| Department | Computing |
| Author | Ramani, Karthik |
| Date | 2012-12 |
| Description | The embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low-energy, high-performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICs) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, which range from application algorithms to architecture and, ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a Compiler to generate the execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. This process repeats automatically until the user-specified constraints are achieved, removing or reducing the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. 
The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This flexibility, however, significantly increases programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition, and wireless telephony. While CoGenE is well suited to applications that exhibit streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches. |
| Type | Text |
| Publisher | University of Utah |
| Subject | compilers; computer architecture; energy efficiency; low power; ray tracing; recognition |
| Dissertation Institution | University of Utah |
| Dissertation Name | Doctor of Philosophy |
| Language | eng |
| Rights Management | © Karthik Ramani |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 890,366 bytes |
| ARK | ark:/87278/s6v41916 |
| DOI | https://doi.org/10.26053/0H-ER44-6300 |
| Setname | ir_etd |
| ID | 195775 |
| OCR Text | COGENE: AN AUTOMATED DESIGN FRAMEWORK FOR DOMAIN-SPECIFIC ARCHITECTURES by Karthik Ramani
A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Computing, The University of Utah, December 2012. Copyright © Karthik Ramani 2012. All Rights Reserved.

ABSTRACT

The embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low-energy, high-performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICs) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, which range from application algorithms to architecture and, ultimately, circuit design. 
This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a Compiler to generate the execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. This process repeats automatically until the user-specified constraints are achieved, removing or reducing the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This flexibility, however, significantly increases programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition, and wireless telephony. While CoGenE is well suited to applications that exhibit streaming behavior, multithreaded applications like ray tracing present a different but important challenge. 
To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches.

To my wife, parents, little brother, aunt and uncle, and my grandparents

CONTENTS

ABSTRACT
LIST OF FIGURES
LIST OF TABLES

CHAPTERS

1. INTRODUCTION
   1.1 Applications
   1.2 Nature of Business and Environment
   1.3 Traditional Approaches and Drawbacks
   1.4 CoGenE: The Grand Goal
       1.4.1 Brief Overview of Framework
       1.4.2 Evaluation
   1.5 Dissertation Statement
   1.6 Road-map
2. RELATED WORK
   2.1 Applications
       2.1.1 Face Recognition
       2.1.2 Ray Tracing
   2.2 Compilers and Scheduling
   2.3 Embedded Architectures
       2.3.1 Architectural Support for Ray Tracing
       2.3.2 Design Space Exploration
3. FROM APPLICATIONS TO ARCHITECTURE
   3.1 Speech Recognition Overview
   3.2 Wireless Telephony Overview
   3.3 Face Recognition System Overview
       3.3.1 Preprocessing: Flesh Toning and Segmentation
       3.3.2 Viola-Jones Face Detection
       3.3.3 Holistic Face Recognition: PCA+LDA Algorithm
       3.3.4 Topology-based Face Recognition: EBGM Algorithm
   3.4 Workload Characterization
       3.4.1 Memory Characteristics
       3.4.2 IPC Saturation
   3.5 Architectural Implications
       3.5.1 DSA Memory Architecture
       3.5.2 Execution Back-end: "ASIC-like" Flows
4. DSA SYSTEM ARCHITECTURE
   4.1 DSA Evaluation for Face Recognition
5. THE COGENE COMPILER
   5.1 Trimaran to CoGenE
       5.1.1 Integer Linear Programming (ILP)
   5.2 CoGenE Compiler Flow
       5.2.1 Modulo Scheduling
       5.2.2 Interconnection Scheduling
       5.2.3 Postpass Scheduling
       5.2.4 Efficiency of Interconnect-aware Scheduling
6. THE COGENE SIMULATOR GENERATOR
   6.1 Simulation: Power and Energy Estimation
       6.1.1 Analytical Models
       6.1.2 RTL-based Empirical Models for Dynamic and Leakage Power
       6.1.3 Interconnect Power Models
   6.2 Evaluation Methodology
       6.2.1 Benchmarks
       6.2.2 Evaluation Metrics
7. SCA DESIGN EXPLORER
   7.1 DSE Using Stall Cycle Analysis (SCA)
   7.2 Associating Cost for Architectural Attributes
   7.3 Design Selection
   7.4 SCA Exploration Algorithm
8. EVALUATION
   8.1 Face Recognition Evaluation
       8.1.1 PCA/LDA vs EBGM
   8.2 SCA Results
       8.2.1 DSA for Embedded Face Recognition
       8.2.2 DSA for Embedded Speech Recognition
       8.2.3 DSA for Wireless Telephony
       8.2.4 Impact of Per Design Code Generation
       8.2.5 Sensitivity Analysis: SCA Robustness
9. RAY TRACING
   9.1 Importance of Ray Tracing
   9.2 Stream Filtering for Coherent Ray Tracing
       9.2.1 Core Concepts
       9.2.2 Coherence
       9.2.3 Application to Ray Tracing
           9.2.3.1 Traversal
           9.2.3.2 Intersection
           9.2.3.3 Shading
       9.2.4 Programming Model
   9.3 StreamRay Architecture Description
       9.3.1 The Ray Engine
       9.3.2 The Filter Engine
       9.3.3 Interconnect Subsystem
   9.4 Results
       9.4.1 Methodology
       9.4.2 Evaluation
           9.4.2.1 SIMD Utilization
           9.4.2.2 Rendering Performance
       9.4.3 StreamRay Efficiency
           9.4.3.1 Address Processing vs. Data Processing
           9.4.3.2 Partitioning Efficiency
           9.4.3.3 Frequency Scalability of Interconnect
           9.4.3.4 Supporting Alternative Ray Tracing Algorithms
10. CONCLUSIONS AND FUTURE WORK
    10.1 Contributions
        10.1.1 CoGenE
        10.1.2 "Interconnection-aware" Compilation
        10.1.3 Design Space Exploration
        10.1.4 Face Recognition Characterization
        10.1.5 The CoGenE Power Simulator
        10.1.6 CoGenE for Ray Tracing
    10.2 Future Work
        10.2.1 Code Splitting
        10.2.2 Integrated "Interconnect-Register" Scheduling
        10.2.3 Automatic Code Verification
        10.2.4 Emerging Application Domains
        10.2.5 Ray Tracing
REFERENCES

LIST OF FIGURES

1.1 Automation from applications to chips
3.1 Processing Kernels in a Face Recognition System
3.2 Execution profile for PCA/LDA face recognition system
3.3 Execution profile for EBGM face recognition system
3.4 L1 cache miss rates
3.5 L2 cache hit rates
4.1 Heterogeneous Multiprocessor Organization
4.2 Organization of the Recognition DSA
4.3 Functional Unit Architecture
4.4 Plots showing the potential for memory parallelism and 'ASIC-like' flows
5.1 Code Generation
8.1 Throughput comparisons for different configurations
8.2 Energy/input packet comparison
8.3 Energy-delay product comparison
8.4 SCA applied to face recognition
8.5 Energy-delay product comparisons for performance-energy designs
8.6 Throughput comparison for performance-energy designs
8.7 Energy comparisons for performance-energy designs
9.1 Traversal in a BVH with stream filtering. In each traversal step, inactive rays are filtered from the stream before it is forwarded to subsequent operations with the relevant BVH nodes.
9.2 Programming model for Stream Filtering. Programmable stream filters export an interface to generate output streams. Filter tests perform the necessary operations and return a mask indicating whether or not individual rays pass the test.
9.3 StreamRay: High-level view of the N-wide architecture
9.4 Ray architecture: The ray engine provides address computation capabilities and delivers data efficiently to the filter cores
9.5 Execution unit architecture: Execution units/comparators communicate with the register files through the program-controlled interconnect

LIST OF TABLES

3.1 Instructions per Cycle (IPC) for baseline alpha configuration with varying number of execution units (XUs)
3.2 Speedup/slowdown over real-time corresponding to 5 frames per second (real-time is scaled to 1)
5.1 Functional unit utilization rate and compilation time for the different face recognition kernels
6.1 Types of models available for the different structures within a DSA
6.2 Empirical Table for a FIFO
6.3 Benchmarks and Description
7.1 Design space and cost for each functional unit variable
8.1 Best configurations for different constraints, throughput, and energy comparisons for different targets
9.1 Comparing interconnect choices: Relative performance and area comparisons showcase the benefits of employing a nearest neighbor interconnection strategy
9.2 Architecture and rendering parameters
9.3 Characteristics of the test scenes: Scenes of varying geometric complexity are used to evaluate the potential role of stream filtering in interactive ray tracing. These scenes employ three different material shaders to capture a variety of important visual effects.
9.4 SIMD utilization (%) (T / I / S) for secondary rays: Stream filtering exploits any coherence available in a particular stream
9.5 Rendering performance: StreamRay delivers interactive frame rates for all scenes
9.6 Distribution of major operations as % of total: Here, the compute-related operations refer to those involving actual ray data; integer operations are subsumed by the load and store operations.
9.7 Isolating address and data computations: StreamRay delivers higher performance at reduced power dissipation over a traditional execution subsystem by placing integer execution units in AGUs
9.8 Frequency scalability: Rendering performance scales well when the interconnect delay is doubled for a 50% increase in operating frequency

CHAPTER 1

INTRODUCTION

Embedded systems have revolutionized the way we interact, perceive, and communicate information. The diverse characteristics of such devices have facilitated their deployment in many areas: inexpensive cellular phones for communication and mobile access to information, reliable pacemakers in the field of medicine, security cameras for surveillance purposes, etc. 
Recent advances have created a strong market desire for information fusion [76], a broad term that refers to a phenomenon in which many different technologies are combined to provide the user with a plethora of usage scenarios. For example, a phone may be used to seamlessly switch between different networks without affecting call continuity and clarity. In the future, a single device will be expected to support many different technologies. While designing such systems presents many new problems to system designers, the following challenges create significant roadblocks.

1.1 Applications

User demand for complex applications and easy-to-use interfaces drives the embedded application space. Providing natural human interfaces requires support for applications like face and speech recognition [55, 28, 47], real-time graphics [18], etc. For communication, a device needs to support a wide variety of cellular standards [42]. The algorithmic complexity of these applications is growing faster than Moore's law [76], but current embedded designs [75] are not flexible enough to adapt to these changes. Functional fusion, in which one device supports a diverse set of applications, is now a dominant market desire. The iPhone [74] is one example of such a device, providing applications that include the touch interface, audio playback, etc. In this case, the functionality for each application is provided by employing a dedicated intellectual property (IP) block on a system-on-chip (SOC) or an application-specific integrated circuit (ASIC). Such IP blocks are specialized circuits that are energy efficient and deliver high performance for one application or a domain of applications with similar computational characteristics. In the future, devices will have to be powerful enough to support an almost ubiquitous set of applications, including video gaming, gesture interfaces, mobile payments, social networking, and traditional desktop applications. 
Fusing hundreds of ASICs in a single device to support thousands of specialized applications will be practically infeasible for a variety of reasons, including fabrication costs, product yield, etc. However, at any given time, the user is likely to be using only a subset of the available applications. This dissertation explores the opportunity to add specialization for an application domain while preserving the flexibility to target a variety of applications within the domain. The arrival of heterogeneous computing systems like the IBM Cell [38], Intel Larrabee [90], and AMD Fusion [4] has blurred the design requirements in the embedded and desktop computing landscapes. Every new generation of devices is expected to provide an improved level of performance when compared to its predecessor. Power dissipation has also emerged as a first-order design constraint. For mobile devices in particular, energy dissipation must be contained within strict battery life requirements [75]. Unfortunately, battery capacities in mobile devices have been projected to improve at a meager 3-7% [76] every year. Given the exponentially increasing algorithmic complexity, this exacerbates the problem of delivering high performance, low energy, and increased flexibility.

1.2 Nature of Business and Environment

Every new fabrication process requires high initial costs [75, 45] for manufacturing a single chip. Further, millions of chips have to be sold in order to amortize the huge capital investments and to provide sustainable profits. The market need to support new applications every year mandates very short design times for architecting a new chip. In addition, designing an SOC requires large design teams with a variety of expertise, ranging from applications to architecture and circuit design. Thus, the business of embedded devices is governed by extremely short design cycles and economies of large scale [75, 45]. 
These two constraints are in direct conflict with the amount of time and resources involved in designing and verifying processors in current process technologies. This calls for design methodologies that are scalable and flexible enough to adapt to volatile markets. Depending upon the surroundings in which the system is deployed, various constraints have to be met. Data-center computing allows for sophisticated cooling techniques, and while power is a concern, performance is given higher priority. In contrast, small size is important for mobile devices. Easy-to-use interfaces and ergonomic style are necessary for cellular phones. The deployment environment thus introduces constraints that further complicate the design of computing systems.

1.3 Traditional Approaches and Drawbacks

Over the last two decades, ASICs were predominantly deployed for embedded computing systems due to their excellent energy-performance characteristics. This worked well for fixed-function devices, as ASICs provide a high level of functional specialization while being optimized for area, performance, and power dissipation. Supporting a plethora of complex applications in the future, however, would require lengthy design cycles for each ASIC [45]. In addition, changes in the application incur expensive redesign costs. Digital signal processors (DSPs) and general purpose processors (GPPs) trade off energy efficiency to provide flexibility in supporting many applications. They employ a general instruction set architecture (ISA) to support any sequence of operations in a program. The side effect is that they incur a high control and data access overhead to perform the actual computations. The cost of this generality is that they cannot meet the performance and energy requirements for certain applications like face and gesture recognition, cellular standards, and real-time graphics. These applications are characterized by intertwined sequential and parallel kernel phases. 
While GPPs can deliver good performance for control-intensive sequential code, they incur too much control overhead and power for compute-intensive kernels [77]. A good solution is to employ a heterogeneous multiprocessor in which the GPP executes the sequential code and the accelerator executes the various kernels. In devices like the iPhone [74], tens of ASICs perform the various kernel processing activities, although not all of the applications run at the same time. Hence, the goal of this study is to employ programmable accelerators to replace tens of ASICs. This dissertation argues that in a complex design space, automation is the key to satisfying the opposing design themes of high performance, low energy dissipation, flexibility, and short time to market. The resulting programmable accelerators are referred to as domain-specific architectures (DSAs). A DSA [61, 42, 77] is specialized to extract the parallelism within the various kernels of an application domain. For example, the face recognition domain includes all the processes involved in real-time face recognition, including flesh toning, segmentation, face detection, and face identification. A detailed characterization of this domain is performed in Chapter 3. The compute, control, and data access characteristics of all the kernels are analyzed to create a recognition DSA. The memory system of the DSA consists of hardware support for multiple loop contexts that are common in embedded applications. In addition, the hardware loop unit (HLU) and address generators provide sophisticated addressing modes that increase IPC, since they perform address calculations in parallel with operations performed in the execution units. In combination with multiple SRAM memories, this results in memory bandwidth high enough to feed the execution units. The program is horizontally microcoded: each bit in the program word directly corresponds to a binary value on a physical control wire. 
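To make the horizontally microcoded control idea concrete, the toy sketch below slices a control word into per-wire field values. The field names and widths are invented purely for illustration; the dissertation does not specify the actual control-word layout, which would be determined by the particular DSA configuration.

```python
# Toy sketch of a horizontally microcoded control word: every bit of the
# word drives one control wire directly, so "decoding" is just slicing
# bits.  Field names and widths below are hypothetical, for illustration.
FIELDS = [          # (name, width in bits), listed LSB-first
    ("xu0_op",   4),   # opcode select for execution unit 0
    ("xu1_op",   4),   # opcode select for execution unit 1
    ("agu_mode", 3),   # address-generator addressing mode
    ("sram_rd",  2),   # which SRAM bank drives the read port
    ("route",    5),   # interconnect routing select (XU chaining)
    ("hlu_next", 1),   # hardware loop unit: advance loop context
]

def decode(word: int) -> dict:
    """Slice a control word into per-wire field values."""
    signals, shift = {}, 0
    for name, width in FIELDS:
        signals[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return signals

def encode(**values) -> int:
    """Inverse of decode: pack field values into one control word."""
    word, shift = 0, 0
    for name, width in FIELDS:
        word |= (values.get(name, 0) & ((1 << width) - 1)) << shift
        shift += width
    return word

word = encode(xu0_op=0b1010, agu_mode=0b011, route=0b10001, hlu_next=1)
assert decode(word)["xu0_op"] == 0b1010
assert decode(word)["route"] == 0b10001
```

Because no opcode decoding sits between the word and the datapath, the schedule can chain execution units directly, which is what enables the "ASIC-like" computation flows described here.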
This very fine-grained VLIW approach was inspired by the RAW project [98]. A side effect of this microcode approach is ISA independence, which provides the flexibility to closely mimic the data flow and operations within the program while incurring minimal overhead for control flow. Multiple execution units can be chained together to provide "ASIC-like" computation flows, since data movement through the DSA's resources is program controlled rather than following the usual fetch, decode, and execute microarchitecture. The result is a programmable "ASIC-like" DSA whose energy-delay characteristics approach those of an application-specific integrated circuit (ASIC), while retaining most of the flexibility of more traditional programmable processors. The cost of this microcode approach is increased compiler complexity due to the need to schedule data movement, memory access, register allocation, and execution unit utilization on a cycle-by-cycle basis. Compile time is also problematic [65, 77], although the compile-rarely nature of these systems mitigates this issue. Another drawback is that understanding the application domain and designing the DSA requires significant time and resources. For example, the face recognition approach [77] involved man-months of characterization, manual code generation, and architecture design time. Finally, the design of programmable DSAs requires expertise in a variety of specialties, which range from application algorithms to architecture and, ultimately, circuit design. To solve these problems, this dissertation presents and explores CoGenE, a single unified framework that automates the design of DSAs for streaming application domains. The goal is to reduce capital costs, time, and resources significantly while meeting the often conflicting system design goals.

1.4 CoGenE: The Grand Goal

CoGenE, which stands for Compiler-simulator Generator-design Explorer, is a toolkit intended for use by application domain experts.
The automation flow is shown in Figure 1.1. The expert factors the application into sequential code that runs on the general purpose host and kernel code that runs on the DSA. In adherence to the stream model employed by the framework, the kernel code is modified manually to process data on a per-frame basis and represents streaming code in C. This code is fed as input to CoGenE.

[Figure 1.1. Automation from applications to chips]

The framework takes the suite that defines the domain and an initial architecture specification of our DSA, and generates a simulator and an executable binary for that architecture. The architecture is then simulated, and both energy and performance statistics are cataloged. Simulation and compiler data affecting area, energy, and performance are then combined. The architecture description is then modified to better satisfy user-specified constraints for any combination of area, power/energy, and performance, at which point the process repeats. Finally, the user is given a set of feasible design points that satisfy his/her requirements. Results (Chapter 8) demonstrate that this process works independently of the choice of the initial starting point and hence requires little or no architecture expertise from the application expert. The application expert is given a choice to refactor the code if an adequate solution is not found by CoGenE. In general, CoGenE removes or alleviates the need for compilation, circuit, and architecture expertise, and the error-prone process of designing a specialized accelerator for a given application suite.
It also evaluates many more design options than would be possible without similar automation. The result is improved design quality and a significant reduction in design time.

1.4.1 Brief Overview of Framework

CoGenE integrates three distinct activities that all contribute to the design process:

• The compiler, which generates the execution binary given an architectural specification,
• The simulator generator, which creates a cycle-accurate simulator for this template and collects statistics, and
• The design explorer, which explores the architectural design space to arrive at energy-performance-optimal designs.

The DSA approach employed in CoGenE chains multiple execution units to mimic the data computations within the application with very low overhead. While this delivers high performance, it requires simultaneous scheduling of data motion, function units, and memory accesses in both space and time. To solve this problem, the CoGenE compiler employs integer linear programming (ILP)-based interconnect-aware scheduling techniques to map the kernels to the DSA. The code optimization tactics are based on [77, 78, 36], which have shown that interconnect-aware scheduling improves performance and energy dissipation. After preliminary control and data flow analysis is performed, the innermost loop is identified and memory addressing is set up. After register assignment, ILP-based interconnection scheduling is done, followed by postpass scheduling to resolve conflicts. The resulting object code from the compiler is executed on the simulator. Performance statistics are collected from the resulting compilation and simulation schedules. Energy dissipation is estimated using high-level parameterizable power models. Power models employ analytical models from Wattch [13] for all predictable structures and empirical models similar to [81, 82] for complex structures like the ALU and interconnects. Area estimates are obtained from Synopsys MCL and Design Compiler.
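The scheduling problem the compiler solves can be illustrated in miniature: place each operation in a (cycle, unit) slot so that dependences are respected and no unit is double-booked in a cycle, minimizing the schedule length. The toy instance below is invented and solved by brute force rather than ILP, and it omits the interconnect constraints that the real scheduler also models.

```python
# Toy version of the core scheduling problem: assign each operation a cycle
# such that producers finish before consumers and no function unit executes
# two operations in the same cycle. Ops, deps, and unit bindings are invented.
from itertools import product

ops = ["load_a", "load_b", "mul", "add"]
deps = {"mul": ["load_a", "load_b"], "add": ["mul"]}   # consumer -> producers
units = {"load_a": "mem", "load_b": "mem", "mul": "alu", "add": "alu"}

def valid(schedule):
    for op, cyc in schedule.items():
        for producer in deps.get(op, []):
            if schedule[producer] >= cyc:      # producer must finish earlier
                return False
    used = set()
    for op, cyc in schedule.items():           # one op per unit per cycle
        slot = (units[op], cyc)
        if slot in used:
            return False
        used.add(slot)
    return True

# Brute-force search for a minimum-makespan schedule (ILP would scale better).
best = min(
    (dict(zip(ops, cycles)) for cycles in product(range(len(ops)), repeat=len(ops))),
    key=lambda s: max(s.values()) if valid(s) else float("inf"),
)
# Both loads share the "mem" unit, so the optimum is: loads in cycles 0 and 1,
# mul in cycle 2, add in cycle 3 (makespan 3).
```

An ILP formulation expresses the same constraints with 0/1 placement variables and lets a solver handle realistically sized kernels.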
CoGenE's exploration phase then analyzes these statistics to identify potential architectural options for performance or energy improvements. The simple iterative design space exploration algorithm (Chapter 7) is based on analyzing the source of performance problems that appear during compilation and simulation. Stall causes such as insufficient parallelism, routability problems, etc., all boil down to usage conflicts for various physical resources in the architecture. Adding the appropriate resources (a process called dilation) will improve performance but will also increase area and energy consumption. Appropriate removal of lightly used or unused resources (thinning) may reduce performance but will also reduce energy and area. Improper dilation will increase energy with no performance benefit, and improper thinning will significantly reduce performance with little energy benefit. During diagnosis, several options are investigated to remove the bottleneck (this term is used in the context of area and energy as well as the more common performance usage), and each option is assigned a cost. In addition to maximizing performance [32], the notion of cost attempts to optimize energy dissipation and compilation complexity. The least costly alternative is tried first. The process iterates and results in near-optimal designs for user-specified energy-performance constraints.

1.4.2 Evaluation

The effectiveness of CoGenE is evaluated as a case study for three important application domains: face recognition, speech recognition, and wireless telephony. These domains are fundamentally different in their access, control, and computational characteristics [77, 61, 42] and present a diverse embedded workload [55, 28]. The results demonstrate that CoGenE arrives at designs that are competitive with or better than previous best-effort manual designs and significantly better than what can be obtained on more conventional programmable platforms such as the XScale.
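The dilation side of this loop can be sketched as follows. The cost and performance models here are invented placeholders (stalls fall as the scarcest resource grows, energy grows with total resource count); the actual explorer derives its diagnosis from compilation and simulation statistics, and also considers thinning.

```python
# Hedged sketch of the iterative explorer. The "architecture" is a dict of
# resource counts, and evaluate() is a stand-in for compile-and-simulate.

def evaluate(arch):
    # Toy model: cycles are dominated by the scarcest resource; energy grows
    # with total resource count (a proxy for area).
    cycles = 1000 // min(arch.values())
    energy = 10 * sum(arch.values())
    return {"cycles": cycles, "energy": energy}

def explore(arch, max_cycles, max_energy, max_iters=20):
    for _ in range(max_iters):
        stats = evaluate(arch)
        if stats["cycles"] <= max_cycles and stats["energy"] <= max_energy:
            return arch                               # constraints satisfied
        # Dilation: grow the bottleneck (scarcest) resource first, the least
        # costly change that can remove the stall in this toy model.
        bottleneck = min(arch, key=arch.get)
        candidate = dict(arch, **{bottleneck: arch[bottleneck] + 1})
        if evaluate(candidate)["energy"] > max_energy:
            return arch                               # improper dilation
        arch = candidate
    return arch

design = explore({"alus": 1, "mem_ports": 2}, max_cycles=300, max_energy=200)
# With this toy model the explorer settles on {"alus": 4, "mem_ports": 4}.
```

The real algorithm additionally weighs each candidate change by a cost that covers energy, area, and compilation complexity before trying it.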
The CoGenE compiler generates efficient schedules for a variety of architectures within the DSA framework, and performance approaches that of the best manual schedules. A side effect is that automatic compilation removes the need to invest man-hours in manual code generation. The exploration process is independent of the choice of the initial architectural template, and results show that CoGenE always arrives at optimal energy-performance candidates in a very short time. Overall, this design tool can be employed by application experts to design energy-performance-optimal DSAs with little or no expertise in the area of embedded system design. DSAs designed for embedded applications demonstrate the robust nature of CoGenE for stream-oriented workloads. Workloads that are multithreaded by nature represent a different test for the framework. To evaluate the generality of CoGenE, this dissertation analyzes its capability on ray tracing, a multithreaded graphics application. Ray tracing was chosen due to its many applications in entertainment, science, and industry. In addition, designing an architecture for ray tracing has implications beyond embedded computing. To fit the CoGenE streaming model, stream filtering is employed. This approach [36] recasts the basic ray tracing algorithm as a series of filter operations that partition an arbitrarily sized group of rays into active and inactive subsets in order to exploit coherence and achieve speedups via SIMD processing. CoGenE is employed to design various constituent parts of StreamRay [80, 79], a novel multicore architecture that efficiently supports ray tracing. The architecture consists of two major subsystems: the ray engine, which performs address computations to form large data streams for SIMD processing, and the wide-SIMD filter engine, which performs the data and filter computations. CoGenE is employed to synthesize the filter engine and the interconnect subsystem.
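The filter operation at the core of stream filtering can be sketched in a few lines: partition an arbitrarily sized ray group by a predicate so that only the active subset proceeds to the next stage. The Ray fields and the predicate below are invented placeholders, not StreamRay's actual data layout.

```python
# Minimal sketch of stream filtering: split a ray stream into active and
# inactive subsets so the active subset can fill the SIMD lanes of the
# filter engine. Ray fields and the hit test are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Ray:
    origin: float
    direction: float

def stream_filter(rays, predicate):
    """Partition a ray stream by a filter predicate (e.g. 'hits this node')."""
    active = [r for r in rays if predicate(r)]
    inactive = [r for r in rays if not predicate(r)]
    return active, inactive

# Toy predicate: only rays pointing "forward" intersect the node.
rays = [Ray(0.0, d) for d in (-1.0, 0.5, 2.0, -0.25)]
active, inactive = stream_filter(rays, lambda r: r.direction > 0)
# Only the two active rays proceed to traversal/intersection, so the SIMD
# lanes operate on rays that all take the same path.
```

Repeating this split at each traversal step keeps lane utilization high even when the original group of rays is incoherent.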
The compiler also generates code for the filter engine. Results demonstrate that StreamRay improves performance significantly and delivers interactive frame rates of 15-32 frames/second (fps) for scenes of high geometric complexity.

1.5 Dissertation Statement

Given the rapidly evolving application space, automation is the key to achieving the opposing design themes of high performance, low energy dissipation, flexibility, and extremely short design time. This dissertation provides the following contributions in achieving these goals:

• Automation through CoGenE. A unified design framework that analyzes kernels in an application domain and automatically presents a set of energy-performance-optimal DSAs to the application expert, who may have little knowledge of architecture and circuit design. CoGenE also provides an optimizing compiler that automates code generation. By automating the process of workload characterization, compilation, and architectural design, CoGenE represents a new design methodology that delivers a significant improvement in system performance, power dissipation, resources, and design time.

• Workload Studies. During the initial stage of CoGenE development, the face recognition domain was completely characterized to design a recognition DSA. This is the first study that analyzes the computational requirements of many different face recognition algorithms.

• CoGenE for Ray Tracing. This dissertation presents StreamRay, a novel multicore architecture that efficiently supports ray tracing. The CoGenE compiler generates object code for the various ray tracing kernels. The CoGenE explorer was also employed to automatically synthesize the SIMD execution cores and the interconnection subsystem. Given the importance of this emerging application and the new challenges this domain presents to CoGenE, our results demonstrate the robustness of CoGenE in designing DSAs for a variety of compute-intensive applications. It also opens a novel area for future work.
1.6 Road-map

A survey of the background work and its limitations is performed in Chapter 2. Chapter 3 showcases our DSA design approach by systematically characterizing and analyzing the face recognition domain. Chapter 4 discusses the various features of our "ASIC-like" DSA methodology, followed by the compilation methodology in Chapter 5. Design space exploration is explained in Chapter 7. The evaluation infrastructure and results are discussed in Chapters 6 and 8, respectively. DSA design for ray tracing is presented in Chapter 9. Conclusions and future work are summarized in Chapter 10.

CHAPTER 2
RELATED WORK

Embedded designs have to achieve the often conflicting goals of high performance, low power, flexibility, and short design time. In recent years, contributions have been made to meet some or all of these goals. The CoGenE design methodology is compared and contrasted against various approaches in the literature to showcase the major differences. This also helps to highlight the novel capabilities provided by CoGenE.

2.1 Applications

2.1.1 Face Recognition

Gottumukkal [35] designed an FPGA-based face recognition (identification) architecture that identifies a person from a database. The difference is that this dissertation performs the study of a complete face recognition system: flesh toning, segmentation, face detection, and face identification. Mathew et al. [59] perform a detailed characterization of a feature recognition system based on the Eigenfaces algorithm. In contrast, to our knowledge, this is the first study that compares and contrasts the hardware needs of different recognition algorithms.

2.1.2 Ray Tracing

The use of ray packets to exploit SIMD processing was first introduced by Wald et al. [99]. The original implementation targets the x86 SSE extensions, which execute operations using a SIMD width of four, and consequently uses packets of four rays.
Later implementations use larger packet sizes of 4×4 rays [7], but these fixed-size packets are neither split nor reordered. Reshetov [84] has shown that even for narrow SIMD units, perfect specular reflection rays undergoing multiple bounces quickly lead to almost completely incoherent ray packets and 1/N SIMD efficiency. Thus, worst-case SIMD efficiency is not only a theoretical possibility, but has been demonstrated in current packet-based ray tracing algorithms [11]. Stream filtering in CoGenE maintains high efficiency when processing seemingly incoherent groups of rays, including secondary rays required for a number of important visual effects. Efficiency is achieved by adding hardware support for filtering divergent rays and by gathering a group of coherent rays for subsequent operation. The evaluation in Chapter 9 demonstrates that it is possible to achieve high SIMD utilization and 50% higher performance while delivering power savings of 12% per SIMD core compared to existing approaches [84]. Several recent research efforts have investigated the problem of coherence in secondary rays. Boulos et al. [11] describe packet assembly techniques that achieve the same level of performance (in terms of rays/second) for distribution ray tracing as for standard recursive ray tracing. Similarly, Mansson et al. [58] describe several coherence metrics for ray reordering to achieve frame rates of 3-5 frames per second (fps) with secondary rays. Instead of tracing rays in a depth-first manner, several works have investigated breadth-first ray traversal. Nakamaru and Ohno [66] describe one such algorithm designed to minimize accesses to scene data and maximize the number of rays processed at a time. Mahovsky and Wyvill [57] have explored breadth-first traversal of bounding volume hierarchies (BVHs) to render complex models with progressively compressed BVHs.
This approach, however, uses breadth-first traversal to amortize decompression cost and does not target either interactive performance or SIMD processing. CoGenE builds on these ideas to extract maximum coherence in arbitrarily sized groups of rays.

2.2 Compilers and Scheduling

Improving performance or power via VLIW techniques is a common theme in modern embedded systems [3], including mapping and instruction scheduling techniques [54, 93]. However, these efforts do not address low-level communication issues. CALiBeR reduces memory pressure in VLIW systems but cannot directly schedule activities to reduce register file communication at the cluster level [2]. Tiwari et al. have explored scheduling algorithms for less flexible architectures which split an application between a general purpose processor and an ASIC [95]. Eckstein and Krall focus on minimizing the cost of local variable access to reduce power consumption in DSP processors [29]. Park et al. [70] discuss a graph-based software pipelining technique for mapping loops on coarse-grain reconfigurable architectures. They have shown that performance optimization sacrifices several opportunities for energy reduction. They stress the need for compilation techniques that optimize energy consumption, and employ techniques that significantly reduce energy consumption while minimally degrading performance. High-performance compilation techniques have also been investigated: RAW [53], CGRAs [70], Imagine [85], and Merrimac [26]. The RAW machine has demonstrated the advantages of low-level scheduling of data movement and processing in function units spread over a two-dimensional space, and motivates CoGenE's fine-grained resource control approach. The main difference is that CoGenE's methodology also tries to minimize energy consumption as a first-order design constraint.
Mahlke's group has also developed automated techniques for identifying candidate code blocks for coprocessor acceleration and for generating customized instruction set extensions to control those processors [21, 70, 20, 105]. A similar approach by Pozzi also provides graph-based optimizations for microarchitectural constraints such as limited register ports [73]. The main differences between these efforts and CoGenE are the target application space and the approach of co-optimizing performance and energy consumption rather than just performance. Results in Chapter 8 show that targeting a DSA for multiple applications [21, 70, 73] consumes significantly higher energy (∼80%) than targeting a single application domain. Scheduling techniques for power-efficient embedded processors have achieved reasonably low power operation, but they have not achieved the energy-delay efficiency of our architecture [40].

2.3 Embedded Architectures

Recent approaches [17, 65, 70] have proposed the design of programmable processors or coarse-grained reconfigurable arrays for video processing or wireless algorithms for mobile devices. These devices work in various modes to alternately execute sequential code and the parallel kernels. The problem is that sequential and parallel codes exhibit different kinds of data access characteristics, and their execution time varies across different kernels within a domain. For rapidly evolving applications with stringent real-time requirements, these devices will be inefficient at extracting different kinds of parallelism and may incur frequent mode changes, thereby degrading application performance. The MOVE family of architectures explored the concept of transport triggering, where computation is done by transferring values to the operand registers of a function unit and starting an operation implicitly via a move targeting a trigger register associated with the function unit [41].
In this dissertation, this concept is used for data transfer between function units. Application-specific clusters are investigated in [52, 31]. These complementary scheduler approaches minimize inter- rather than intracluster communication and therefore are not able to optimize register utilization as described in this work. In some sense, the fine-grain horizontal microcode approach taken here can be viewed as a fine-grained extension of the VLIW concept. However, the addition of more sophisticated address generators, multiple address contexts per address generator, the removal of the register file, and the fine-grained steering of data are aspects of this work that are not evident in these other efforts. The energy overhead incurred by the width of the horizontal microcode can be minimized by employing instruction compression or caching techniques [42]. The other parallelism approach that is becoming increasingly popular is short-vector or SIMD data parallelism [67, 12]. These techniques have been shown to improve performance by up to an order of magnitude on DSP-style algorithms and even on some small speech processing codes [46]. CoGenE is capable of capitalizing on this form of data parallelism as well. From an energy-delay perspective, however, it was found that SIMD operation [42] does not generally have an advantage. Tensilica's Xtensa system [34], ARM's OptimoDE processor, Bluespec [9], and IBM's Cell processor are all current commercial approaches in the high-performance, energy-efficient embedded systems domain. The main difference is that in these approaches the user designs a custom VLIW machine by specifying a customized instruction set. In contrast, our ISA-independent approach closely mimics the data flow within the application and significantly reduces the control and access overhead. CoGenE is driven by an application suite, and our architecture provides a richer set of options than a traditional, more coarse-grained VLIW approach.
2.3.1 Architectural Support for Ray Tracing

Packet-based ray tracing has also been exploited successfully in special-purpose ray tracing hardware projects [89, 103]. We generalize packet-based ray tracing to process arbitrarily sized groups of rays efficiently in wide SIMD environments. While commercial implementations like the G80 [68] and the R770 [4] provide wider-than-four SIMD capability, these machines employ the execution core for address computations; these computations therefore interfere and compete with the actual data computations for resources, degrading performance. The Larrabee project [90] employs a many-core task-parallel architecture to support a variety of applications. In contrast, StreamRay extracts performance from ray tracing by efficiently isolating the core tasks of stream generation and stream processing to deliver high performance.

2.3.2 Design Space Exploration

Recent research has investigated exploration techniques [49, 1, 37, 105, 92] to automate the design of application-specialized processors or accelerators. Based on the type of architectures explored, these techniques can be classified into three relevant categories. First, [49, 92] have investigated analytical techniques for the automation of superscalar processors for SPEC or media application kernels. While [49] is fast, the algorithm was evaluated for one particular program phase, rather than all computationally intensive components. [62, 77] have shown that complex multimedia applications like face and speech recognition consist of multiple compute-intensive phases. This dissertation employs a multiple-context hardware loop unit to efficiently support these phases. While Silvano et al. [92] address this issue, their architectural analysis focuses on the memory system and not in great detail on the interconnect and execution units. The second class of architectures explored for automation is transparent accelerators [105] for embedded systems.
This study balances compilation and accelerator constraints and is similar to our approach. While their approach is based on instruction set customization, ours is tailored to extract the data flow patterns within the application. The third and final class of architectures, including our study, falls into the category of long word machines. The PICO design system [1] consists of a VLIW GPP and an optional nonprogrammable accelerator and tries to identify a cost-effective combination of the two. Our approach explores the design of a programmable DSA that satisfies the energy-performance-area constraints for the entire application domain. Crovella et al. [24] employed the idea of lost cycles analysis for predicting the source of overheads in a parallel program. They present a tool to analyze performance trade-offs among parallel implementations of a 2D FFT. The CoGenE design employs a similar approach to explore many design points in the architecture space for diverse application domains. Other studies [44] have explored machine learning-based modeling for design spaces, and this could potentially replace the simulator employed in our study. In contrast, CoGenE employs static information from the compiler and the integer linear programming scheduler in Chapter 5 to search the design space and arrive at optimal design points for varying constraints.

CHAPTER 3
FROM APPLICATIONS TO ARCHITECTURE

The effectiveness of the framework is evaluated for four different application domains: face recognition, speech recognition, wireless telephony, and ray tracing. The source code for the workloads was obtained from application software research groups in various universities [25, 23, 42, 62]. The applications were manually factored into sequential code and streaming compute-intensive kernels. Each of the C-based kernels was then modified to fit the stream processing model required as input to CoGenE. Mathew et al.
[62] performed a complete characterization of the speech recognition domain and contributed to the initial architectural methodology. Ibrahim et al. [42] characterized the wireless telephony domain. In both cases, manual effort was involved in generating object code for execution on the architecture. This dissertation performs additional detailed characterizations of the face recognition and ray tracing domains. In addition, this dissertation presents the design of the optimizing compiler and the explorer that automatically generates the design of the DSA. This chapter begins with a brief overview of speech recognition and wireless telephony. A detailed characterization of the face recognition domain is then performed to illustrate the salient features of the architectural methodology. The complete process incurred one to two years of design time for one application domain. While time consuming, designs for all three approaches converged to a common architectural approach. This design served as the starting point and led us to explore automation for the process.

3.1 Speech Recognition Overview

The speech recognition application consists of three phases that contribute to 99% of total execution time: preprocessing for feature vector generation, the hidden Markov model (HMM) phase, and the Gaussian (GAU) phase [60]. In preprocessing, sound is represented by Mel-cepstral vectors [60]. These vectors capture the spectrum of the sound and contain information about the phonemes in the sound. In addition, the vectors also capture the first and second derivatives, which contain information about how a phoneme was altered by preceding and succeeding phonemes. The GAU phase, also known as the acoustic model, associates probabilities with the input vectors to map them to a word or series of words in a language. For a given set of input vectors, multiple possible candidates may emerge, and they are passed to the final phase.
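The acoustic-model scoring performed by GAU can be illustrated with a diagonal-covariance Gaussian log-likelihood, the form used by most HMM-based recognizers; the dimensions and values below are invented, and a real recognizer scores higher-dimensional Mel-cepstral vectors against many mixture components per state.

```python
# Illustrative sketch of the core GAU computation: scoring a feature vector
# against a diagonal-covariance Gaussian. Values are invented for the example.
import math

def log_gaussian(x, mean, var):
    """Log-likelihood of vector x under a diagonal-covariance Gaussian."""
    score = 0.0
    for xi, mi, vi in zip(x, mean, var):
        score += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return score

x = [0.2, -0.1, 0.4]
mean = [0.0, 0.0, 0.5]
var = [1.0, 1.0, 1.0]
score = log_gaussian(x, mean, var)
# A vector closer to the mean always scores higher:
# log_gaussian(mean, mean, var) > score
```

Because the computation is a fixed-length multiply-accumulate loop over independent dimensions, it maps naturally onto the chained execution units of the DSA.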
The HMM phase, or the language model, employs a large table to associate context and meaning with a sequence of words. By interpreting context and meaning, it selects the most probable word sequence. Preprocessing converts the raw input signal into feature vectors and is dominated by floating point computations. Nevertheless, it only accounts for 1% of the total execution time. GAU and HMM represent Gaussian probability density evaluation and hidden Markov model evaluation, respectively. GAU occupies 57.5% and HMM consumes 41.5% of the execution time of the Sphinx 3.2 speech recognition system. Both Gaussian distributions and hidden Markov models are components of most mature speech recognizers [51, 106]. GAU computes how closely a 10 ms frame of speech matches a known Gaussian probability distribution. One input packet corresponds to evaluating a single acoustic model state over 10 frames of a speech signal. A real-time recognizer needs to process 600,000 invocations of the GAU algorithm every second. The HMM algorithm performs a Viterbi search over a hidden Markov model corresponding to one model state. One input packet to the HMM implementation consists of 32 five-state hidden Markov models. While the GAU algorithm is entirely floating point, the HMM algorithm is dominated by integer compare and select operations. Its average rate of invocation varies significantly with context, but to guarantee real-time performance, it is assumed in this research that all HMM models are evaluated.

3.2 Wireless Telephony Overview

Due to the existence of many different wireless communication protocols [75], the most important kernels from the signal processing and wireless communication domains are chosen to form a benchmark suite. The matrix multiply, vec_max, and dotp_sqr kernels are chosen from the signal processing domain. While vec_max selects the maximum among a 128-element vector, dotp_sqr computes the dot products V1·V1 and V1·V2 of two input vectors V1 and V2.
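For reference, the two signal-processing kernels just named can be sketched directly; the data below are illustrative (the benchmark itself uses 128-element vectors).

```python
# Hedged sketch of the vec_max and dotp_sqr kernels described above.

def vec_max(v):
    """Select the maximum element of a vector (a reduction kernel)."""
    m = v[0]
    for x in v[1:]:
        if x > m:
            m = x
    return m

def dotp_sqr(v1, v2):
    """Compute the dot products V1.V1 and V1.V2 in a single pass."""
    acc_11 = acc_12 = 0
    for a, b in zip(v1, v2):
        acc_11 += a * a     # V1.V1 term
        acc_12 += a * b     # V1.V2 term
    return acc_11, acc_12

v1, v2 = [1, 2, 3], [4, 5, 6]
# vec_max(v1) == 3; dotp_sqr(v1, v2) == (14, 32)
```

Both kernels are single loops of compares or multiply-accumulates, which is why they stress the DSA's memory bandwidth and execution unit chaining rather than its control logic.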
The other three applications were selected from the 3G wireless telephony standard [42]. T-FIR is a 16-tap transpose FIR filter. The Rake receiver [42] extracts signals from multipath aliasing effects, and the implementation involves four complex correlation fingers. The Turbo decoder [42] is a complex encoding application that exhibits superior error correction capabilities. This implementation contains two max-log-MAP modules, an interleaver, and a deinterleaver.

3.3 Face Recognition System Overview

The human face recognition problem is a complex task given the diverse range of facial features and skin tone variations. The importance of face recognition has motivated numerous algorithms [72, 19] and recognition accuracy evaluation efforts [72]. Face recognition can be viewed as two sequential phases: 1) face detection, which analyzes video or camera frames to produce a set of normalized skin-tone patches which likely contain a face, and 2) face identification, which compares the patches to a database of target faces to determine a probable match. Some of the face detection techniques are essentially generalized methods of object detection and can be adapted to perform other visual feature recognition tasks. For embedded systems, there is a natural bias towards using cheap, low-resolution cameras. Images may be poorly lit, contain occlusions, and may not contain frontal views. Figure 3.1 shows the major steps involved in face recognition. The input to the system is a stream of 320x200 pixel frames arriving at a rate of 5-10 frames per second. The stream is processed one frame at a time, and state is maintained to perform motion tracking. The process is a pipeline of kernels, and the goal is to process them in real time. Flesh toning looks for patches of skin-toned pixels. Segmentation looks for a patch that is big enough to contain a face and performs edge smoothing to create a patch.
To facilitate processing by the next stage, the patch is converted into a rectangle. Face detection looks for features in the patch which correspond to facial features such as eyes, ears, nose, etc. Eye location pinpoints the probable eye location candidates and normalizes the patch to meet the Face Recognition Technology (FERET) [72] normalization requirements. It also creates a boundary description for the patch. Face recognition then tries to match the probable facial patch to a face in the database. The goal is to minimize the number of false positives and negatives. The CSU face recognition group has analyzed a variety of face recognition algorithms and has evaluated their accuracy [25, 23]. Two algorithms were chosen due to their superior recognition accuracy and relatively high computational parallelism. The Principal Components Analysis/Linear Discriminant Analysis (PCA+LDA) algorithm recognizes faces by performing holistic image matching, while the Elastic Bunch Graph Matching (EBGM) algorithm compares known features (eyes, nose, etc.) of different faces. Because of the fundamental difference in the two algorithms, the execution, data access, and control flow patterns are diverse and together represent a diverse domain. The study in [62] employs the PCA technique.

[Figure 3.1. Processing Kernels in a Face Recognition System]

A brief description of the different components in a complete face recognition system is followed by a study of the execution profile of the system and its memory requirements. These techniques are also useful in general visual feature and gesture recognition systems.

3.3.1 Preprocessing: Flesh Toning and Segmentation

Skin colors are more tightly clustered in the HSV (Hue, Saturation, Value) or the NCC (Normalized Color Coordinates) color space than in the normally employed RGB encoding space.
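A minimal sketch of the NCC conversion, under the common definition r = R/(R+G+B), g = G/(R+G+B); whether the flesh-toning kernel uses exactly this form is an assumption.

```python
# Sketch of the NCC (normalized color coordinates) conversion, assuming the
# standard rg-chromaticity definition; the kernel's exact form may differ.

def rgb_to_ncc(r, g, b):
    s = r + g + b
    if s == 0:
        return 0.0, 0.0          # black pixel: no chromaticity information
    return r / s, g / s          # the third coordinate is redundant (sums to 1)

# Normalization discounts brightness, so a skin tone and a darker version of
# the same tone map to the same chromaticity, which is why skin clusters
# tightly in this space:
bright = rgb_to_ncc(200, 120, 80)
dark = rgb_to_ncc(100, 60, 40)
# bright == dark == (0.5, 0.3)
```

The per-pixel division is the kind of simple, data-parallel arithmetic that the flesh-toning kernel applies across the whole frame.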
Pixels are thus converted from RGB to the HSV and NCC color spaces. To improve accuracy, the consensus of two separate flesh-toning algorithms, based on the NCC and HSV color spaces respectively, is employed [59, 8, 94]. The output of this stage is a bit mask of the image marking where the pixel color is a viable flesh tone.

Image segmentation is the process of clumping individual pixels together into regions where the face might be found. Because the face detection mechanism requires rectangular regions for its operation, two simple morphological operators are applied: erosion and dilation. An erosion operator examines each pixel and blacks it out unless all its neighbors in a 3x3 pixel map are set [39]. This ensures that small occlusions are removed from subsequent consideration. Dilation then lights up a pixel if any of its neighbors in a 4x4 window are set.

3.3.2 Viola-Jones Face Detection

The face detector phase is based on the Viola-Jones approach, which is similar in purpose to the AdaBoost algorithm [87, 97]. The AdaBoost strategy is to employ a series of increasingly discriminating filters so that the weaker/faster filters look at larger amounts of data and the stronger/slower filters examine less data. Viola-Jones takes a similar approach, but rather than cascading filters, it uses multiple parallel weak filters to form a strong filter. Viola-Jones achieves a 15x speedup over the Rowley detector [86]. The Viola-Jones code is proprietary, but the algorithm was published and a version of it was developed at the University of British Columbia (UBC). The AdaBoost algorithm also provides statistical bounds on training and generalization errors. Common operations are sum or difference operations between pixels in adjacent rectangular regions. Face detection involves computing the weighted sum of the chosen rectangles and applying a threshold.
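The published Viola-Jones algorithm computes such rectangle sums in constant time using an integral image (summed-area table). The following sketch is illustrative only; function names are ours and are not taken from the UBC implementation:

```python
def integral_image(img):
    # ii[y][x] = sum of img[0..y][0..x] (inclusive 2D prefix sums)
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x, y, w, h):
    # Sum over the w x h rectangle with top-left corner (x, y): O(1),
    # using four lookups with guards at the image border.
    a = ii[y + h - 1][x + w - 1]
    b = ii[y - 1][x + w - 1] if y > 0 else 0
    c = ii[y + h - 1][x - 1] if x > 0 else 0
    d = ii[y - 1][x - 1] if y > 0 and x > 0 else 0
    return a - b - c + d

def two_rect_feature(ii, x, y, w, h):
    # Difference of two horizontally adjacent rectangles: a simple
    # Haar-like feature; a weak classifier thresholds this value.
    return rect_sum(ii, x, y, w, h) - rect_sum(ii, x + w, y, w, h)
```

Once the integral image is built in a single pass, every rectangle sum costs four memory reads regardless of rectangle size, which is what makes sweeping a 24x24 detector over every pixel tractable.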
A 24x24 detector is swept over every pixel in the image and the image is rescaled. A detection will be reported at several nearby pixel locations at one scale and at corresponding locations in nearby scales; a simple voting mechanism decides the final detection locations. In this approach, a detector with 100 different matching criteria is employed.

3.3.3 Holistic Face Recognition: PCA+LDA Algorithm

Our PCA-based face recognition algorithm is based on [104]. This algorithm was preferred over the Eigenfaces technique [59] due to its increased recognition accuracy in the original FERET study. In the first step, the face images are projected onto a feature space defined by the eigenvectors of a set of faces. The LDA algorithm is then employed to perform image classification. All the training images from the PCA subspace are grouped according to subject identity and basis vectors are computed for each subject. A test image is then projected onto the PCA+LDA subspace and two distance measures are calculated between each pair of images. The distance measures are then used to label the test image for comparison with known persons in the database.

3.3.4 Topology-based Face Recognition: EBGM Algorithm

The EBGM algorithm, originally developed by the USC/Bochum group [102], works on the premise that all human faces share a topological structure. Faces are represented as graphs, with nodes positioned at facial features such as the eyes and nose, and edges represented by distance vectors. Distances between the nodes are then used to identify faces. The computational complexity of the algorithm depends on the number of feature nodes to be compared. A re-implementation of the EBGM algorithm was provided by the CSU research group [10]. An advantage of EBGM is that it performed well in the original FERET studies on facial images that were not frontal views.
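As a rough illustration of this graph representation, the sketch below compares faces by their landmark edge vectors and matches by nearest neighbor. A real EBGM implementation also stores Gabor-wavelet "jets" at each node, which are omitted here, and the landmark and edge sets below are hypothetical:

```python
import math

# Hypothetical landmark graph; real EBGM graphs use many more nodes.
EDGES = [("left_eye", "right_eye"), ("left_eye", "nose"),
         ("right_eye", "nose"), ("nose", "mouth")]

def edge_vectors(face):
    # face: dict mapping landmark name -> (x, y) position in the image
    return {(a, b): (face[b][0] - face[a][0], face[b][1] - face[a][1])
            for a, b in EDGES}

def graph_distance(face_a, face_b):
    # Sum of Euclidean differences between corresponding edge vectors;
    # smaller means more similar face geometry. Using edge vectors rather
    # than raw node positions makes the measure translation invariant.
    va, vb = edge_vectors(face_a), edge_vectors(face_b)
    return sum(math.hypot(va[e][0] - vb[e][0], va[e][1] - vb[e][1])
               for e in EDGES)

def identify(probe, gallery):
    # Nearest-neighbor classification against a gallery {name: face_graph}.
    return min(gallery, key=lambda name: graph_distance(probe, gallery[name]))
```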
The output of eye location is normalized, smoothed, and rescaled in order to increase the efficiency of landmark localization in the face recognition step. The normalized image and the landmark locations are used to create face graphs for every image in the database. The final step in the algorithm is to produce a distance matrix for the images. Face identification is based on nearest-neighbor classification. In the original CSU implementation, real-time performance was not a goal. Hence, the version in this dissertation employs sufficient code motion and reordering to process the image information at a real-time frame rate.

3.4 Workload Characterization

Figures 3.2 and 3.3 show the relative execution profiles for the face recognition system with the PCA/LDA and the EBGM algorithms, respectively. The native profiling results were obtained using SGI SpeedShop on a 666 MHz R14K processor. The face detection kernel accounts for more than 50% of the total computation cycles and face identification consumes another 25%. This implies that detection and identification (PCA/LDA and EBGM) are the most time-intensive kernels and are, therefore, the key targets for acceleration.

Figure 3.2. Execution profile for PCA/LDA face recognition system (Viola face detection 57%, eye location 15%, PCA+LDA recognition 24%, preprocessing 4%)

Figure 3.3. Execution profile for EBGM face recognition system (Viola face detector 50%, eye location 14%, EBGM 33%, preprocessing 3%)

3.4.1 Memory Characteristics

Memory and execution characteristics studies are based on the SimpleScalar [15] simulation framework with architectural parameters chosen to model an out-of-order processor (1.7 GHz) similar to an Alpha 21264. The test configuration is a baseline machine with four integer and four floating point units in order to provide sufficient execution resources, a 2MB L2 cache, and a 600 MHz DRAM interface.
In addition, the size of the caches, the number of integer units, and the number of floating point units are varied for sensitivity analysis. Figure 3.4 shows the L1 data cache miss rates for four different configurations: i) the complete detection pipeline with PCA/LDA identification, ii) the complete detection pipeline with EBGM identification, iii) PCA/LDA face recognition without detection, and iv) EBGM recognition without detection. All the configurations achieve 99.4% hit rates in the instruction cache. We observe good cache locality for all configurations with a small 8KB data cache, which indicates that small self-managed SRAMs are likely to be a good fit for these codes.

Figure 3.4. L1 cache miss rates versus L1 cache size (8-64 KB) for PCA/LDA-based recognition, EBGM-based recognition, PCA/LDA alone, and EBGM alone

A 320x200 pixel color image is 188 KB in length while the corresponding grayscale version is about 64 KB. While the image will not directly fit in the L1 cache, the flesh toning kernel requires only one pass over every pixel and hence data can be accessed in a stream-based manner. This produces a 64 KB bitmap image that is processed in at most two passes in the segmentation phase. Good cache locality results because this phase accesses at most two rows at a time. The face detection and recognition kernels process even smaller windows (50x50 pixels, or 2.5 KB) of this data multiple times, and good cache locality is observed for the whole system.

Figure 3.5 shows the L2 cache (unified) hit rates for the same configurations. The L2 hit rates are computed as the number of hits in the L2 cache divided by the total number of hits for the application. The very low hit percentages suggest that an L2 cache would be prohibitive in terms of energy and area while providing minimal performance improvements.
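The working-set sizes quoted above follow directly from the frame dimensions; a quick arithmetic check (the 8KB L1 figure is the smallest cache in the sensitivity study):

```python
# Frame and working-set sizes implied by a 320x200 input frame.
W, H = 320, 200
color_kb = W * H * 3 / 1024    # 24-bit color frame: 187.5 -> "188 KB"
gray_kb = W * H / 1024         # 8-bit grayscale/bitmap: 62.5 -> "about 64 KB"
patch_kb = 50 * 50 / 1024      # detection/recognition window: ~2.44 -> "2.5 KB"
rows_kb = 2 * W / 1024         # two grayscale rows touched by segmentation
l1_kb = 8
# The full frame exceeds the L1, but the streamed rows and 50x50 patches fit:
assert patch_kb < l1_kb and rows_kb < l1_kb
```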
3.4.2 IPC Saturation

While the cache behavior of the domain seems to be a good match for embedded processors with limited cache resources, the performance numbers indicate otherwise. Table 3.1 shows the instructions committed per cycle (IPC) for the four configurations as the number of integer and floating point function units varies. It can be observed that adding more functional units does not provide a commensurate increase in performance. The configuration with 4 integer and 4 floating point units outperforms the one with 2+2 units by a marginal 5%. In addition, performance saturates beyond 6 units (3+3).

Figure 3.5. L2 cache hit rates versus L2 cache size (512-2048 KB) for PCA/LDA-based recognition, EBGM-based recognition, PCA/LDA alone, and EBGM alone

Table 3.1. Instructions per cycle (IPC) for the baseline Alpha configuration with a varying number of execution units (XUs)

  Num. XUs   PCA/LDA complete   EBGM complete   PCA/LDA alone   EBGM alone
  1+1        0.651              0.623           0.780           0.757
  2+2        0.703              0.683           0.830           0.793
  3+3        0.727              0.712           0.897           0.877
  4+4        0.729              0.720           0.905           0.890

Table 3.2 shows the speedup or slowdown of the four configurations over actual real time, corresponding to 5 frames per second. It can be observed that executing the complete face recognition application is at least 2 times slower than real time with fewer than 2+2 functional units. At best, the complete applications run 1.78 times slower than real time even with more resources. Executing the identification algorithms alone can achieve real-time performance with sufficient resources. The performance improvement comes at the cost of a significant increase in power dissipation. The power dissipated by an out-of-order core like the Alpha is likely in the tens of watts, which exceeds the power budgets available for embedded systems. This motivates the search for a non-GPP approach that provides real-time face recognition at power levels compatible with the embedded space.

Table 3.2. Speedup/slowdown over real time corresponding to 5 frames per second (real time is scaled to 1)

  Num. XUs   PCA/LDA complete   EBGM complete   PCA/LDA alone   EBGM alone
  1+1        2.310              2.560           1.530           1.610
  2+2        2.050              2.107           1.378           1.383
  3+3        1.800              1.870           1.040           1.160
  4+4        1.780              1.784           0.978           1.003

There are four reasons for the low performance. They are summarized below:

• The face recognition kernels commonly perform computations of the form Z[i] = Z[i-1] + Σ_{j=0}^{m} X[j]·Y[W[j]], which contain loop-carried dependencies.

• The problem is further exacerbated in multilevel loops, where such computations entail complex indirect accesses.

• A large number of loop variable accesses compete with the actual array data accesses, causing port saturation in the data cache. Since the ratio of array variable accesses to arithmetic operations is high, contention is a big issue.

• The slow real-time rate indicates that instruction throughput is low. Even when functional units are available, dependencies and memory contention significantly reduce the actual IPC.

3.5 Architectural Implications

Increasing the number of SRAM ports in the system can address the problem of port saturation. Given that an 8KB cache provides good locality in a conventional cache-based system and the L2 miss rate is high, this motivates the choice of self-managed SRAMs. Three distributed 8KB SRAMs (input, output, and scratch) were employed for the face recognition DSA. The input and output SRAMs can be double-buffered to allow simultaneous communication with the host and the execution cluster. The scratch SRAM is used for holding intermediate data. In addition, each SRAM is dual-ported to support the needs of the multiple execution units. The system mimics a distributed 24KB cache with 6 ports but does so more efficiently in terms of area, power, and latency.
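The first computation pattern listed above can be sketched as follows; the Z[i-1] term serializes the outer loop and Y[W[j]] is an indirect access, which together limit exploitable parallelism (array names as in the formula, loop bounds illustrative):

```python
# Sketch of the dominant kernel pattern:
#   Z[i] = Z[i-1] + sum_{j=0..m} X[j] * Y[W[j]]
def kernel(X, Y, W, n):
    m = len(X)
    Z = [0.0] * n
    for i in range(n):
        acc = Z[i - 1] if i > 0 else 0.0   # loop-carried dependence on Z[i-1]
        for j in range(m):
            acc += X[j] * Y[W[j]]          # indirect access through index array W
        Z[i] = acc
    return Z
```

Every iteration of the outer loop must wait for the previous one, and each Y access needs a prior load of W[j] to form its address, which is exactly the port-contention and dependence behavior described above.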
3.5.1 DSA Memory Architecture

As with most real-time applications, face recognition loops run for a fixed number of iterations and loop indices are used in data address calculations. The predominant data access pattern consists of 2D array and vector accesses. Extracting parallelism across multilevel nested loops requires complex addressing modes. A hardware loop unit (HLU) is a programmable hardware structure that provides support for multiple simultaneous loop contexts for efficient data access. The loop unit automatically updates the loop nest indices in the proper order and the implementation is similar to [62]. The Viola-Jones detection kernel requires a maximum of three simultaneous loop contexts; hence, the loop unit supports 3 contexts. Increasing the number of contexts further increases the area, complexity, and power dissipation while providing little performance improvement for the face recognition domain. In addition, the loop unit provides hardware support for modulo scheduling.

The problem of contention between address calculations and actual data computations is only partially solved with distributed memory. The use of programmable address generator units (AGUs) on each SRAM port allows multiple address calculations to be done in parallel with arithmetic operations, which improves IPC. Each AGU effectively services the needs of a particular pipeline. The AGUs use the index values provided by the loop unit to facilitate data delivery to the execution units. Overall, the memory system for the DSA consists of a loop unit, three distributed 8KB SRAMs with two ports each, and the associated AGUs.

3.5.2 Execution Back-end: "ASIC-like" Flows

In a traditional superscalar processor, instructions are fetched, decoded, issued, and retired. Function units receive operands from a register file and return results to the register file.
This represents a huge amount of overhead, which gets amortized over a relatively minuscule amount of computation work in the function unit. The challenge is to amortize the overhead over more work in order to increase performance and reduce power consumption. ASICs are complex computational pipelines that transform input data into results with almost no overhead, but they lack generality and flexibility. Our execution back-end mimics the ASIC approach while preserving programmability. The use of programmable multiplexers allows function units to be linked into 'ASIC-like' pipelines that persist as long as they are needed. The outputs of each MUX stage and each execution unit are registered, which allows value lifetime and value motion to be under program control. This removes the need for a large multiported register file, which saves significant power with no reduction in performance. Flexibility is preserved by providing the ability to specify interconnect routes via MUX configurations under program control.

The execution resources need to support a large number of floating point calculations in the face recognition kernels. In addition, integer arithmetic is required to support address calculations in cases where the AGUs cannot handle these duties autonomously. Our execution units comprise four floating point units and three integer functional units. As will be seen, this provides a good balance between performance and energy consumption. A SIMD approach delivers high data parallelism and reduces register file complexity by clustering the register file, thereby reducing port complexity. Our VLIW approach provides high instruction-level parallelism by performing memory operations and data computations simultaneously, albeit with a larger control overhead due to the width of the instruction word. Our execution back-end is less dependent on a centralized register file.
Moreover, the vast difference in the types of data and address computations performed in a cycle in the face recognition domain makes the SIMD approach less efficient. From performance and energy perspectives, a VLIW approach is more beneficial and is the choice for face recognition.

During the course of characterizing various application domains, a few trends emerged that motivated us to explore automation. First, most of these applications are characterized by streaming data. An input frame is read once, an output frame is written once, and little data is used for preserving state across frame boundaries. This led to the evolution of a unified memory design approach for the three domains. Second, energy-efficient execution dictates that the function units chain operations and maximize the compute-to-access ratio. Automation would save a significant amount of time (6-7 man-years for three domains) while allowing us to explore many candidate design choices.

CHAPTER 4
DSA SYSTEM ARCHITECTURE

At the system level, this dissertation employs a heterogeneous multiprocessor that comprises a general purpose processor (GPP) for sequential code and a DSA to accelerate the kernels. The host GPP can be an ARM or x86 CPU or a digital signal processor (DSP) core. The architecture, depicted in Figure 4.1, is an example of a decoupled access-execute architecture [69]. The host GPP handles general control and set-up duties and moves data to and from the DSA via double-buffered input and output SRAMs.

Figure 4.1. Heterogeneous Multiprocessor Organization (host processor, memory controller, and DMA on a DRAM bus; streaming coprocessor with u-code SRAM, input/output/scratch SRAMs, and an execution cluster)

The DSA is shown in Figure 4.2. The memory system includes an HLU, SRAMs, and address generator units (AGUs). Each HLU context stores the current value of the loop variables in a kernel's loop nest.
If multiple kernels will be concurrently active, then multiple contexts are necessary to avoid delays in reloading context data into the HLU. The loop variable values are used by the AGUs for generating addresses to support various addressing modes, including 2D array accesses for row and column walks, strided and strided-offset accesses, and complex patterns such as A[B[i]] [61].

The use of multiple SRAMs provides higher memory bandwidth. Each SRAM is role-specific in this stream-based DSA strategy, in which applications consume input frames to produce output data and state information for subsequent frame processing. Since the input SRAM is double-buffered, the host processor loads the next input frame while the DSA is processing the current frame. The output SRAM is similarly structured so the host processor can remove the previous frame's outputs while the DSA is generating the current frame's outputs. The scratch SRAM may be dual-ported, but in this case both ports would be used by the DSA in order to increase state data bandwidth. The HLU permits modulo scheduling [83] of loops whose loop counts are not known at compile time, and this capability reduces compilation complexity.

Figure 4.2. Organization of the Recognition DSA

The horizontal microcode approach allows the multiplexer-based interconnect to be configured under program control (Figure 4.3). This allows function units and their associated pipeline registers to be linked to create pipelines, which persist for as long as they are needed. This persistent pipeline characteristic is similar to the fixed yet inflexible pipelines found in application-specific integrated circuits (ASICs) and is a significant factor in the energy-delay efficiency of the approach. Value lifetime and motion are also under program control.
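As an illustration of these addressing modes, the sketch below generates the byte-address streams an AGU would produce from HLU loop indices. It models behavior only, not the hardware, and assumes 4-byte elements:

```python
# Behavioral sketches of AGU addressing modes (4-byte elements assumed).
def row_walk(base, rows, cols, elem=4):
    # 2D array access, row-major row walk
    for r in range(rows):
        for c in range(cols):
            yield base + (r * cols + c) * elem

def column_walk(base, rows, cols, elem=4):
    # 2D array access, column walk over the same row-major array
    for c in range(cols):
        for r in range(rows):
            yield base + (r * cols + c) * elem

def strided(base, count, stride, offset=0, elem=4):
    # Strided (and strided-offset) access
    for i in range(count):
        yield base + (offset + i * stride) * elem

def indirect(base_a, B, elem=4):
    # A[B[i]]: the AGU scales index values read from B into addresses of A
    for idx in B:
        yield base_a + idx * elem
```

In the DSA these index sequences come from the HLU's loop contexts rather than software loops, so address generation proceeds in parallel with the arithmetic in the execution cluster.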
The compiler-generated microcode controls data steering, clock gating (including pipeline registers), function unit utilization, and single-cycle reconfiguration of the address generators associated with the SRAM ports. A functional unit can be an integer or floating point execution unit or a register file. As with any highly parallel system, the interconnect subsystem is performance critical. Operating frequency can be increased by reducing individual multiplexer widths and/or adding additional multiplexer levels. The result is improved interconnect throughput at the cost of a slight increase in fall-through delay.

Figure 4.3. Functional Unit Architecture (left/right neighbor inputs and the cluster interconnect feed compiler-controlled 4x1 muxes; mux and execution unit outputs are latched in pipeline registers)

4.1 DSA Evaluation for Face Recognition

Figure 4.4 compares the IPC of the baseline Alpha machine with different DSA configurations:

• DSA with a perfect back-end, which implies no stalls due to communication or execution resources and shows the performance of the memory system,

• DSA with a perfect memory system, which indicates the performance of the interconnect and execution cluster back-end,

• Complete DSA configuration with the actual memory and back-end, with seven functional units and the register file, and

• Complete DSA configuration with eight functional units and no register file.

It can be observed that the DSA configuration with a perfect back-end provides as much as a 4.5x IPC improvement for face detection, and around a 10x IPC improvement for face identification (EBGM and PCA/LDA), over a general purpose processor with a traditional cache architecture. This shows that the memory system reduces port contention significantly and efficiently supports indirect addressing schemes. The configuration with perfect memory evaluates the cluster back-end.
When compared to the Alpha processor, this configuration provides a 3x improvement for face detection and a 6.7x improvement for face identification. The advantage comes from exploiting persistent pipeline flows, where scheduling data for a high computation-to-storage ratio sustains the high memory bandwidth inherent in the system. It also demonstrates the effectiveness of the pipeline registers for storing intermediate values.

The last two configurations in Figure 4.4 show the performance of the complete DSA with the actual memory and actual execution cluster. Here, the performance of the system with and without a register file is compared in order to evaluate the effectiveness of the register file. In addition, the register file is replaced by an integer functional unit to evaluate the performance trade-offs.

Figure 4.4. IPC of the Alpha baseline and the Arcface DSA configurations (perfect back-end, perfect memory, with RF, and no RF) for the Viola detection, PCA/LDA, and EBGM benchmarks, showing the potential of memory parallelism and 'ASIC-like' flows

The complete DSA provides as much as a 2.7x performance improvement for face detection and a 5.5x-5.8x improvement for the face identification kernels when compared to the Alpha. The execution cluster and memory system are well matched in terms of throughput. The combination of high memory parallelism and "ASIC-like" flows works well for the face recognition domain. Replacing the register file with an additional integer functional unit provides a marginal 3-4% performance improvement. The register file does ease the difficulty of compiler-based scheduling and is a more generally useful structure than another execution unit if the algorithms change in a substantial fashion. Comparing the complete model to the model with perfect memory shows a performance degradation of about 13-18%.
This is explained by the fact that the baseline system employs a cluster-wide interconnect for communication between the memory and the execution units. Due to contention in the global interconnect between data computation and data access, scheduling delays are introduced, leading to a performance degradation. Employing a hierarchical or separate interconnect would solve the problem, but at increased power cost. Given the performance goal of meeting real-time requirements, the power-conservative choice is chosen.

The fine-grained, horizontally microcoded nature of the DSA implies that the compiler is responsible for managing all of the physical resources at an equally fine-grained level. Managing different function units, multiple memories and their associated AGUs, and scheduling data flows through the interconnect is a complex task. The inherent programming complexity of the architecture makes hand coding a lengthy and error-prone process. Even though the architecture is capable of impressive performance at low power consumption levels, the effort is futile unless the scheduling task can be performed automatically by a compiler. The CoGenE compiler, which alleviates code generation time, is described in Chapter 5. This is followed by a discussion of the CoGenE design space explorer in Chapter 7.

CHAPTER 5
THE COGENE COMPILER

The architectural flexibility of the DSA allows it to be tailored to satisfy the performance and energy demands of the application as it evolves over many generations. This process of tailoring requires that the application expert communicate the workload requirements to the compiler designer and architect, who then agree on the final design of the chip. In a market with short time-to-market constraints, this process is prohibitive.
Ideally, the application expert would like to employ a tool that automatically designs the architecture and generates the associated software tools required to run the application on an architecture simulator, yielding an estimate of performance and energy consumption. This dissertation presents CoGenE, a tool that solves the above problem for the application expert while requiring minimal to no knowledge of the intricacies of compiler, architecture, or circuit design.

"ASIC-like" DSAs deliver high performance due to the ability to coschedule data computations and address computations in space and time on the programmable interconnect on a cycle-by-cycle basis. This improves the computation-to-access ratio, and energy dissipation is reduced as a result of minimized data movement. This dissertation proposes and employs a novel interconnect scheduling phase to produce optimized code for the DSA. The effectiveness of the CoGenE compiler in reducing code generation time while delivering high performance for the recognition domain is also discussed.

5.1 Trimaran to CoGenE

The Trimaran compiler (www.trimaran.org) was the starting point for the CoGenE (Compiler Generator Explorer) compiler development. Trimaran was chosen because it allows new back-end extensions and because its native machine model is VLIW [88]. Significant modifications were needed to transform Trimaran from a traditional cache-and-register architecture to meet the needs of the DSA's fine-grained, cache-less approach. The result is a compiler that takes streaming code written in C; code generation is parameterized by a machine description file that specifies the number of clusters, the number and type of functional units in each cluster, the number of levels of intercluster and intracluster interconnect, and the individual multiplexer configurations.
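The exact grammar of the machine description file is not shown here; the following hypothetical sketch merely illustrates the parameters listed above (field names are ours, not CoGenE's actual format), populated with the face recognition DSA's values from Chapter 3:

```python
# Hypothetical machine description; illustrative field names only.
machine_desc = {
    "clusters": 1,
    "functional_units": {          # per cluster
        "fpu": 4,                  # floating point units
        "alu": 3,                  # integer units
        "register_file": 1,
    },
    "interconnect": {
        "intracluster_levels": 1,
        "intercluster_levels": 0,
        "mux_width": 4,            # 4x1 compiler-controlled multiplexers
    },
    "srams": [
        {"name": "input",   "kb": 8, "ports": 2, "double_buffered": True},
        {"name": "output",  "kb": 8, "ports": 2, "double_buffered": True},
        {"name": "scratch", "kb": 8, "ports": 2, "double_buffered": False},
    ],
    "hlu_contexts": 3,
}
```

Because the explorer perturbs exactly these parameters between iterations, keeping them in a single declarative description is what lets the same compiler retarget itself to each candidate architecture.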
A new back-end code generator was developed that is capable of generating object code for the coprocessor architecture described by the architecture description file. The code generator includes a modified register allocator that performs allocation for multiple distributed register files rather than for a single register file. Since the compiler controls the programming of the multiplexers and the liveness of the pipeline registers, register allocation is inherently tightly coupled with interconnect scheduling. Hence, a separate interconnect scheduling process is performed after register allocation, and the scheduling scheme is based on integer linear programming (ILP) [64] techniques. Before delving into the scheduling details, an overview of ILP-based problem solving is provided.

5.1.1 Integer Linear Programming (ILP)

Computing an optimal solution for an ILP program is NP-complete [30]. Researchers at Saarland University have contributed significant advances in improving the efficiency of ILP techniques by reducing the process of enumeration [30]. Integer linear programming is the following optimization problem:

  min z_IP = c^T x,  x ∈ P_F ∩ Z^n

where P_F = {x | Ax ≤ b, x ∈ R^n_+}, c ∈ R^n, b ∈ R^m, and A ∈ R^{m×n}. Here z_IP is the objective function that is optimized subject to a set of constraints. The set P_F is called the feasible region, and it is integral if it is equal to the convex hull P_I of the integer points, P_I = conv({x | x ∈ P_F ∩ Z^n}). In this case, the optimal solution can be calculated in polynomial time, and hence any formulation of the ILP program should find equality constraints such that P_F is integral.

5.2 CoGenE Compiler Flow

The overall CoGenE flow is illustrated in Figure 5.1. The Trimaran loop detection analysis package is used to identify the loops and calculate the start and end conditions.
The standard Trimaran data flow packages are used to annotate the dependence graph with variable use and definition locations. Back substitution is then performed to reduce critical path length. After this stage, the number of loops and their characteristics are known.

Figure 5.1. Code generation flow (dependence graph, modulo scheduling, register assignment, interconnect scheduling, postpass scheduling, output code)

5.2.1 Modulo Scheduling

With information from the previous step, the innermost loop and the lowest bound on the initiation interval are computed, similar to the modulo scheduling approach [83]. If the bound is high enough to cause degradation, loop unrolling is performed to improve the results of scheduling. This is followed by a simple register assignment scheme where the pipeline registers hold the results.

5.2.2 Interconnection Scheduling

The main decision variables employed are x^k_{n,t}, where a value of 1 means that instruction n is executed in clock cycle t on execution unit k. The index k of the decision variables is relevant for instructions that can be executed on several different execution units. For all address calculations, the AGUs are paired to a unique execution unit. Let I denote the set of instructions from the input program, and let N(n) be the interval of control steps in which instruction n can be started without violating any data dependencies.

The scheduling polytope is composed of different types of constraints. The assignment constraint ensures that each instruction is executed exactly once by one execution resource. Let R(n) denote the set of execution unit types that instruction n can be assigned to:

  Σ_{k ∈ R(n)} Σ_{t ∈ N(n)} x^k_{n,t} = 1   ∀ n ∈ I

The precedence constraint models the data dependencies within the input program. The dependences can be further classified into two categories: weak dependences or antidependences (write after read), and strong dependences (read after write).
Write-after-write dependencies are not an issue in this architecture since write targets do not conflict. Weak dependencies within a group are allowed. Let w_{mn} represent the minimum number of cycles between the start of instruction m and the start of instruction n for the dependence to be respected. Then, for every clock cycle t:

  Σ_k Σ_{t_n ≤ t} x^k_{n,t_n} + Σ_k Σ_{t_m ≥ t-w_{mn}+1} x^k_{m,t_m} ≤ 1

The precedence constraints exclude any ordering of instructions in which data dependences are violated. Up to this point, the feasible region is integral, i.e., the solution can be calculated in polynomial time. Resource constraints are now added to the system; resource-constrained scheduling is NP-complete. Let R_k denote the number of execution units of type k available in the processor. The resource constraint prevents more than R_k instructions from being assigned to unit type k in any cycle. It should be noted that the resource constraints also implicitly include the constraints on the multiplexer at the output of the execution units. If U is the precalculated upper bound on the number of clock cycles for the input program, then:

  Σ_{n ∈ I : k ∈ R(n)} x^k_{n,t} ≤ R_k   ∀ k, 1 ≤ t ≤ U

Now every integer point satisfying the constraints corresponds to a feasible solution of the interconnect scheduling problem. The goal is to find a schedule of minimal length L, where L is defined by:

  Σ_k Σ_{t ∈ N(n)} t · x^k_{n,t} ≤ L   ∀ n ∈ I

The goal is to minimize the objective function L. So far, the objective function does not take into consideration instructions that take several clock cycles because of interconnect constraints; this could produce instruction slots with no instructions to be scheduled. The objective function minimizes the execution time as the primary constraint. The ILP model in the infrastructure is a solver that employs the simplex method, and a solution was efficiently obtained within minutes for most kernels.

5.2.3 Postpass Scheduling

A final pass is made over the code, and scheduling conflicts that can arise due to weak dependencies are resolved by distributing values to the register file.
In addition, resources that are not used are completely clock gated when their instruction slots are empty. For modulo scheduled loops, a check is made that the loop and address contexts are correctly programmed with the initiation interval.

5.2.4 Efficiency of Interconnect-aware Scheduling

The efficiency of interconnect-aware scheduling is estimated by comparing it against hand-coded schedules. One useful metric is the utilization rate, a measure of the total fraction of time for which all seven functional units in the DSA are employed. Table 5.1 shows around a 62-65% utilization rate for the PCA/LDA and EBGM face identification kernels. Overall, the compiled code achieves an average utilization rate of 60%. When compared to manual scheduling, which incurs man-months of optimization for each kernel, the compiler delivers close to 85% of the hand-tuned performance within minutes. The 15% disparity arises because weak dependencies introduce scheduling conflicts that delay the compiled code. Further, data transfers across functional units that are spatially farther apart incur longer delays. Addressing these issues would improve the scheduling algorithm; even so, the technique delivers a high utilization rate. The high utilization rates also demonstrate the effectiveness of interconnect-aware scheduling for delivering high instruction throughput. CoGenE incurs tens of seconds of compilation time for all kernels except EBGM, for which ILP solving takes hours to a few days to explore a few feasible schedules from a large scheduling space. When compared to man-months of manual code generation, the CoGenE compiler provides a significant reduction in design time. Further, CoGenE's modularity in compiling to many different architectural templates helps the application expert explore a variety of design options. Table 5.1.
Functional unit utilization rate and compilation time for the different face recognition kernels

Benchmarks   Utilization rate (Compilation)   Utilization rate (Manual)   Compilation time (seconds)
Flesh Tone   0.57                             0.74                        23
Erode        0.575                            0.675                       37
Dilate       0.570                            0.65                        40
Viola        0.69                             0.75                        60
PCA/LDA      0.62                             -                           49
EBGM         0.65                             -                           ~1000

CHAPTER 6

THE COGENE SIMULATOR GENERATOR

The CoGenE compiler generates executable binary for the various architectural configurations specified by the architecture template file. Along with the code generator, the framework also generates a cycle-accurate architectural simulator that can be used to collect program statistics. Performance estimation is similar to existing cycle-accurate simulators like SimpleScalar [16]. Power dissipation is also a first-order design constraint; hence, this dissertation presents an architectural power estimation framework [81] that combines two different models. We employ analytical models for regular and predictable structures like memories and FIFOs. Power dissipation for complicated structures like execution units and control logic depends greatly on the implementation; hence, we employ empirical models based on low-level RTL-based power models [81]. Interconnect power dissipation contributes a significant fraction of total chip power [56]; hence, an area cost is used to build models for interconnects based on the methodology described in [6, 100].

6.1 Simulation: Power and Energy Estimation

Early-stage power estimation for CPUs has been a popular research area in the academic community. Wattch [14], a power simulator, employs parameterizable analytical models of units like memory structures, the clock tree network, and execution units to estimate dynamic power dissipation in a CPU. Other models (SimplePower [96], TEM2P2EST [27]) employ empirical models based on known circuit implementations for better accuracy.
These models trade off ease and speed of simulation for estimation accuracy and/or scalability across process technologies. In addition, these models do not accurately model power dissipation for wires and interconnects. Given the unique design issues in DSAs, the market need for extremely tight design schedules, and the lack of accurate but flexible power models, the CoGenE simulator attempts to address these issues. In CMOS circuits, the dynamic power dissipation P_d is defined as

P_d = a C V^2 f,

where f and V are the frequency and voltage of operation of the circuit, respectively, C is the load capacitance, and a is the switching activity factor. The performance model estimates the activity factor for structures such as FIFOs, buses, and caches. For internal circuits, where the modeling is not accurate enough to calculate the activity factor, we assume toggle rate values, as recommended by the Wattch model [14, 6]. Leakage power, which also contributes a significant fraction of the total power dissipated in CMOS circuits today, is determined using analytical models for memory structures (similar to HotLeakage [107]) and empirical table-lookup models for all other circuits.

6.1.1 Analytical Models

These models are employed for parameterizable, regular structures, and we follow a methodology similar to Wattch [14] to build the models for the various structures. Each structure is broken into its constituent stages, and equivalent RC models are built for each stage. Finally, we add the capacitances of the stages and calculate the dynamic power for the structure. This type of modeling is relatively well understood and also models power for wires internal to a circuit. Table 6.1 shows the type of modeling available for each type of structure within a DSA. For structures represented by both models, the choice of model depends on the required level of accuracy, the speed of simulation, and the required level of detail.
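As a numerical sketch of the dynamic power equation above (the circuit constants below are illustrative placeholders, not calibrated model values):

```python
# Sketch: dynamic power P_d = a * C * V^2 * f for one modeled structure.
# The capacitance and activity below are illustrative, not calibrated.

def dynamic_power(activity, capacitance_f, voltage_v, freq_hz):
    """Dynamic power in watts from activity factor, load capacitance (F),
    supply voltage (V), and clock frequency (Hz)."""
    return activity * capacitance_f * voltage_v ** 2 * freq_hz

# e.g., a hypothetical stage: a = 0.5, C = 1 pF, V = 1.2 V, f = 400 MHz
p = dynamic_power(0.5, 1e-12, 1.2, 400e6)
print(f"{p * 1e3:.3f} mW")
```

The analytical models sum such per-stage contributions, with C taken from the equivalent RC model of each stage.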
6.1.2 RTL-based Empirical Models for Dynamic and Leakage Power

This type of modeling is employed for structures whose underlying implementation varies across units, and for structures where it is difficult to build parameterizable analytical models, such as control circuits, custom datapaths, and arithmetic units. Power dissipation for such structures is determined by the activity factor of the data and by the control signal that selects the type of operation performed in the structure. For example, in a FIFO circuit, the control signal (push, pop) determines the operation type, and the activity factor of the input data determines the switching activity in the circuit. Hence, power dissipation is computed for various activity and control values using commercial low-level power simulators, similar to Ramani [81], to form a table for the circuit. The table contains both the dynamic and leakage power for the circuit. For power estimation, we perform a table lookup based on the control signals and activity. Table 6.2 shows an example for a FIFO that is modeled empirically.

Table 6.1. Types of models available for the different structures within a DSA

Structure       Analytical Model    Empirical Model
Cache           dynamic, leakage    dynamic, leakage
FIFO            dynamic, leakage    dynamic, leakage
Register file   dynamic, leakage    dynamic, leakage
Bus             dynamic, leakage    -
Crossbar        dynamic, leakage    dynamic, leakage
Arbiter         dynamic, leakage    dynamic, leakage
HLU             -                   dynamic, leakage
AGU             -                   dynamic, leakage
Arithmetic      -                   dynamic, leakage
Data path       -                   dynamic, leakage

Table 6.2. Empirical table for a FIFO

Activity Factor   Control (push, pop)   Dynamic Power   Leakage Power
0.2               00                    0.105           0.118
1.0               00                    0.105           0.118
0.5               01                    1.610           0.122
...               ...                   ...             ...
0.7               11                    1.610           0.122

6.1.3 Interconnect Power Models

Power dissipation on interconnects contributes a major fraction of the total chip power [56].
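Returning briefly to the empirical tables of Section 6.1.2, the lookup can be sketched as a dictionary keyed by (activity, control), populated here with the Table 6.2 entries; snapping the measured activity to the nearest tabulated value is an assumed policy, not necessarily the one CoGenE uses:

```python
# Sketch of the empirical table lookup (Section 6.1.2) for a FIFO.
# Keys are (activity factor, control bits); values are
# (dynamic power, leakage power) as listed in Table 6.2.

FIFO_TABLE = {
    (0.2, "00"): (0.105, 0.118),
    (1.0, "00"): (0.105, 0.118),
    (0.5, "01"): (1.610, 0.122),
    (0.7, "11"): (1.610, 0.122),
}

def lookup_power(activity, control):
    """Return (dynamic, leakage) for the tabulated activity value
    nearest to the observed one, for the given control signal."""
    keys = [a for a, c in FIFO_TABLE if c == control]
    nearest = min(keys, key=lambda a: abs(a - activity))
    return FIFO_TABLE[(nearest, control)]

print(lookup_power(0.6, "01"))
```

At simulation time, the performance model supplies the observed activity and control values each cycle, and the table supplies the corresponding power numbers.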
An analytical model of a bus is used to estimate power dissipation for global and local buses, which contribute a significant fraction of the total power of a macroblock. For global buses, a methodology similar to [5] is used, with appropriately sized buffers and repeaters, interbuffer distances, etc., depending on the delay and power requirements. For the other interconnects, we employ analytical models similar to [100] for matrix-based crossbars and arbiters, and empirical models for a multiplexer-based crossbar.

6.2 Evaluation Methodology

CoGenE-compiled code running on the DSA is also compared to three other design options, all of which were normalized [91] to a 0.13μ process:

1. Software running on a 400 MHz Intel XScale processor, representing a highly energy-efficient embedded processor. The XScale does not have floating point instructions, so we make our comparisons against an idealized XScale in which all floating point operations are replaced by integer operations. The code is then run on an actual XScale processor, and performance and power consumption are measured.

2. Software running on a 2.4 GHz Intel Pentium 4 processor that can support the real-time requirements of the face recognition kernels.

3. A manually scheduled microcode implementation running on the simulated cluster architecture, representing the best performance point. Energy and performance numbers are calculated using Synopsys NanoSim, a commercial SPICE-level simulator, on a fully synthesized and back-annotated Verilog and Module Compiler-based implementation.

The results are then normalized to a 0.13μ process. Normalization was done by employing conservative constant field scaling [91]. The simulated model includes a full clock tree and worst-case wire loads, with wire parasitics assigned based on metal 1. Hence, these results are pessimistic, since in a fabricated design the long wires would be routed on larger metal layers.
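One simple reading of constant field scaling is sketched below, under textbook Dennard-scaling assumptions; this is an illustration of the idea, not the exact normalization procedure of [91]:

```python
# Sketch: first-order constant-field (Dennard) scaling by factor s > 1,
# e.g., s = 0.18 / 0.13 to move results from a 0.18u to a 0.13u process.
# Under ideal constant-field scaling, delay scales as 1/s and power as
# 1/s^2 (so energy per operation scales as 1/s^3). Textbook assumptions,
# not the dissertation's exact normalization.

def scale(delay_s, power_w, s):
    return delay_s / s, power_w / (s ** 2)

d, p = scale(10e-9, 1.0, 0.18 / 0.13)
print(d, p, d * p)  # scaled delay, power, and energy per operation
```

Labeling the scaling "conservative" suggests the actual normalization rounds these ideal factors toward the unscaled values, so the comparisons do not overstate the DSA's advantage.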
6.2.1 Benchmarks

Our benchmarks consist of seven kernels from the face recognition domain, three kernels from speech recognition, and six kernels from wireless telephony. The face recognition kernels constitute the different components of a complete face recognition application. To increase the robustness of the study, we employ two fundamentally different face recognition algorithms; the EBGM algorithm is more computationally intensive than the PCA/LDA recognition scheme. All the face recognition kernels were obtained from the CSU face recognition suite [25]. The speech recognition application consists of three phases that contribute 99% of the total execution time: preprocessing, HMM, and the Gaussian phase [61]. The kernels from the wireless domain include predominant operations like matrix multiplication, dot product evaluation, finding the maximum element in a vector, decoding operations like rake and turbo, and the FIR application. Finally, we employ three kernels from the ray tracing domain. A description of the benchmarks is provided in Table 6.3.

6.2.2 Evaluation Metrics

To effectively compare the performance of different architectures, we employ throughput, measured as the number of input frames processed per second. We employ the energy-delay product, as advocated by Horowitz [33], to compare the efficiency of different processors, since energy and delay for a given unit of work are conflicting constraints for the architect and circuit designer. We employ pruning ability and exploration time as metrics to evaluate the efficiency of design space exploration. Given the complete design space, the degree of pruning measures the reduction in the size of the exploration space. The total exploration time measures the time taken to arrive at optimal design points for various constraints. Table 6.3.
Benchmarks and Description

Benchmarks            Description
Face Recognition
Flesh Toning          Preprocessing for identifying skin-toned pixels
Erode                 First phase in image segmentation
Dilate                Second phase in image segmentation
Viola Detection       Identifies image locations likely to contain a face
Eye Location          Process of locating eye pixels in a face
EBGM recognition      Graph-based, computationally intensive matching
PCA/LDA recognition   Holistic face matching
Speech Recognition
Preprocessing         Normalization for further processing
HMM                   Hidden Markov Model for searching the language space
GAU                   Gaussian probability estimation for acoustic model evaluation
Wireless Telephony
Vecmax                Maximum of a 128-element vector
matmult               Matrix multiplication operation (integer)
dotp square           Square of the dot product of two vectors
Rake                  Receiving process in a wireless communication system
Turbo                 Decoding received encoded vectors
FIR                   Finite impulse response filtering
Ray Tracing
Traversal             Ray intersection with acceleration structure
Intersection          Ray intersection with primitive objects
Shading               Computing color and illumination of a pixel

CHAPTER 7

SCA DESIGN EXPLORER

DSA design space exploration (DSE) involves a number of choices in each of the three subsystems: memory, interconnect, and function units. The simplest choice set is the function unit subsystem, given that the choice is at the coarse granularity of integer unit, floating point unit, or register file; width and the number of registers and ports are also choices. The interconnect layer is composed of one or more multiplexer layers, and each multiplexer has a choice of widths. The memory subsystem is a bit more complex: SRAM choices involve width, size, and number of ports; AGUs perform affine address computations but can vary in number; and the HLU can have one or more contexts. The current design space choice options and costs are summarized in Table 7.1.
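To get a feel for why exhaustive search is impractical, the per-component option counts from Table 7.1 can be multiplied out; treating each choice as independent is an illustrative simplification (it ignores cross-constraints between components):

```python
# Sketch: rough cardinality of the Table 7.1 design space, assuming
# each component choice is independent (a simplification).

options = {
    "data width": 3,           # 16, 32, 64 bits
    "input SRAM": 7,           # 1..64 KB
    "output SRAM": 7,
    "scratch SRAM": 7,
    "ports": 3,                # 1, 2, 3
    "HLU contexts": 5,         # 1..5
    "AGUs per SRAM": 8,        # 1..8
    "RF size": 3,              # 8, 16, 32 entries
    "RF number": 6,            # 1..6
    "FU number": 8,            # 1..8
    "interconnect width": 4,   # 2..5
    "interconnect levels": 3,  # 1..3
}

space = 1
for n in options.values():
    space *= n

print(f"{space:,} raw configurations")  # hundreds of millions
```

Even this simplified count runs to hundreds of millions of configurations before the per-kernel compile-and-simulate cost is considered, which is why SCA prunes rather than enumerates.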
Dilation and thinning are obvious for all but the interconnect subsystem, where dilation means adding levels or widening one or more multiplexers; thinning reverses this choice. Increasing the number of levels increases the fall-through delay but may improve frequency, while widening a multiplexer increases the delay of that component and may reduce frequency. Given the number of design choices and the number of kernels within the application, the combined set of options may create a design space that is too large to examine exhaustively. To simplify the process, only one architectural feature is changed per iteration, and the choice is based on the lowest cost. Making too many changes at once can lead to a feedback loop in which the exploration algorithm gets stuck in a local minimum. The next question is which subsystem to change first. During the course of this work, it was found that the biggest performance problem typically lies in function unit starvation. We therefore choose to modify the memory system first, then the functional units, and then the interconnect, in order to balance the function unit requirements with the memory system capability. The process iterates to address additional imbalances across the subsystems.

Table 7.1. Design space and cost for each functional unit variable

Component                            Range                               Performance cost   Energy cost   Compiler cost   Total cost
Data width                           16, 32, 64 (bit)                    1                  1             1               3
SRAMs (input, output, and scratch)   1, 2, 4, 8, 16, 32, 64 (KB) each    0                  0             0               0
Ports (SRAMs and RF)                 1, 2, 3 each                        1                  1             1               3
Hardware loop unit contexts          1, 2, 3, 4, 5                       0                  1             0               1
AGUs                                 1-8 (increments of one) per SRAM    0                  0             1               1
Register file size                   8, 16, 32 entries                   1                  0             0               1
Register file number                 1, 2, 3, 4, 5, 6                    1                  1             0               2
Functional unit type                 integer, floating point             -                  -             -               -
Functional unit mix                  multiplier, adder, compare, etc.    -                  -             -               -
Functional unit number               1-8                                 0                  1             0               1
Interconnect width                   2-5                                 0                  1             1               2
Interconnect levels                  1-3                                 1                  1             1               3

7.1 DSE Using Stall Cycle Analysis (SCA)

SCA is a simple idea: whenever the compiler's instruction schedule is delayed due to resource contention, or whenever stalls occur in simulation, there must be a bottleneck culprit. These culprit points are logged, classified by culprit type, and quantified in terms of their impact. Examples of such logged statistics are the average functional unit utilization rate, register file utilization rate, contention rate in the interconnect subsystem, execution time, and energy dissipation. The major overheads that are detrimental to performance or energy are:

• Function unit starvation, due to the inability of the memory system to deliver data to the function units at the right time. The culprit may be too few AGUs, not enough HLU contexts, interconnect contention, SRAM port contention, or insufficient SRAM capacity indicated by a high SRAM miss rate.

• Insufficient hardware to support the available application parallelism. This bottleneck arises when there are more independent instructions than can be issued in a cycle. High function unit or interconnect contention helps identify the culprit.

• Underutilized function units, which may be caused by starvation or by having more units than are needed.

• Routability problems, which force values to be stored in either pipelined or centralized register files. High interconnect path contention identifies the interconnect culprit, which can be fixed by widening multiplexers or increasing the number of multiplexer levels.

7.2 Associating Cost with Architectural Attributes

Culprit solutions vary with culprit type and importance, and the cost of each dilation or thinning option guides the choice. To pick the best solution, a predetermined cost is associated with each resource choice.
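The culprit classification above can be sketched as a set of threshold rules over the logged statistics; the thresholds and field names here are hypothetical, not the ones CoGenE logs:

```python
# Sketch: map logged SCA statistics to a bottleneck culprit, following
# the overhead categories above. Thresholds are illustrative, not tuned.

def classify_culprit(stats, high=0.8, low=0.3):
    if stats["sram_miss_rate"] > low:
        return "memory: insufficient SRAM capacity"
    if stats["agu_contention"] > high or stats["hlu_contention"] > high:
        return "memory: too few AGUs / HLU contexts"
    if stats["fu_contention"] > high:
        return "function units: insufficient parallelism support"
    if stats["interconnect_contention"] > high:
        return "interconnect: routability, widen or add mux levels"
    if stats["fu_utilization"] < low:
        return "function units: over-provisioned, thin"
    return "balanced"

stats = {"sram_miss_rate": 0.05, "agu_contention": 0.2,
         "hlu_contention": 0.1, "fu_contention": 0.9,
         "interconnect_contention": 0.4, "fu_utilization": 0.7}
print(classify_culprit(stats))
```

The rule ordering mirrors the memory-first, function-units-next, interconnect-last priority that the exploration uses.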
Resource cost is based on the improvement in performance, energy, or compilation complexity that the resource provides. If increasing the size or number of a particular unit delivers a significant increase in performance, then the unit is assigned a low cost for performance, and vice versa. For energy or code generation complexity, a high cost is assigned if the unit significantly increases energy or compilation complexity. In this dissertation, a simplified Boolean cost model is used: 1 for high and 0 for low. In general, the system user may assign costs as any integer or floating point value. The total cost is the weighted sum of the performance, energy, and code generation costs. The SCA approach defines the best solution as the one with the lowest total cost. Although assigning various weights to the three metrics can lead to interesting search spaces, CoGenE restricts itself to equal weights in this study. Assigning cost is nontrivial in certain cases. For example, applications with multiple loop contexts benefit significantly in performance when an HLU is present. The HLU [62] automatically updates the loop indices for all the loop contexts and generates indices for the AGU to perform address calculations. The HLU contains its own stack, registers, and adder units. The addition of an HLU has the potential to deliver very high performance, but energy dissipation increases. It also provides hardware support for modulo scheduling of loops whose indices are not known at compile time, which reduces code generation complexity. Hence, the unit is assigned a low performance cost, a high energy cost, and a low code generation cost. Table 7.1 shows the costs for each of the architectural resources.

7.3 Design Selection

The importance of choosing the best initial design is significantly reduced given that the design space will be explored automatically.
A poor initial choice will result in more iterations, but the results of the process will be very similar. Hence, the starting point for design selection (DSEL) is: 1 KB single-ported input, scratch, and output SRAMs; one AGU per SRAM; a single-context HLU; one floating point unit; one integer unit; and a single-level interconnect using 4-wide multiplexers. For this configuration, high initial SRAM miss rates cause back-end starvation. Although different solutions in the memory subsystem (adding HLU contexts or AGUs, increasing ports) could be employed, the low-cost solution is to increase the size of the input and output SRAMs. After some number of iterations, a further increase in SRAM size may provide minimal performance improvement and an incommensurate energy increase; the choice then is to reduce the sizes for greater energy savings at minimal or no performance loss. SCA proceeds with the addition of AGUs before an increase in HLU contexts is chosen. Increasing the number of contexts arbitrarily increases area, complexity, and power dissipation while providing minimal performance improvement. The key to the HLU configuration choice is to provide what the AGUs need; hence, AGU dilation precedes HLU dilation. Since the initial design has only one floating point and one integer unit, high function unit contention will be observed for any nontrivial application suite. Function units with high utilization and contention are dilated by one of each type as needed. An increase in the number of functional units entails an increase in the length of the instruction word, which increases power dissipation in the instruction cache and interconnect. The register file use rate and interconnect utilization metrics are employed to increase or decrease the size and number of register files. Routability problems imply increasing multiplexer width.
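Putting the pieces above together, a one-change-per-iteration exploration loop might be sketched as follows; the `evaluate` and `meets_goals` functions stand in for the CoGenE compile-and-simulate step and the user's constraints, and are purely hypothetical:

```python
# Sketch: one-change-per-iteration SCA loop, visiting memory first,
# then function units, then interconnect, as described above.
# `evaluate` is a stand-in for the CoGenE compile/simulate step.

def explore(design, evaluate, meets_goals, max_iters=50):
    history = [dict(design)]
    for _ in range(max_iters):
        stats = evaluate(design)
        if meets_goals(stats):
            return design, history
        # Priority: memory starvation, then FU contention, then interconnect.
        if stats["fu_starvation"]:
            design["sram_kb"] = min(design["sram_kb"] * 2, 64)
        elif stats["fu_contention"]:
            design["fus"] += 1
        else:
            design["mux_width"] += 1
        history.append(dict(design))
    return design, history

# Toy evaluator: starvation until 8 KB SRAMs, contention until 3 FUs.
def evaluate(d):
    return {"fu_starvation": d["sram_kb"] < 8, "fu_contention": d["fus"] < 3}

def meets_goals(s):
    return not (s["fu_starvation"] or s["fu_contention"])

final, hist = explore({"sram_kb": 1, "fus": 1, "mux_width": 4}, evaluate, meets_goals)
print(final)
```

With the toy evaluator, the loop doubles the SRAMs three times and then adds function units twice, mirroring the memory-first trajectory the text describes for the face recognition starting point.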
If a frequency target is specified, then the multiplexer width will be constrained, and an additional interconnect level, with its associated pipeline register, will need to be investigated. This increases compilation complexity and may lead to infeasible schedules; in that case, the algorithm returns to the previous design point.

7.4 SCA Exploration Algorithm

In summary, the DSE steps are:

1. Collect program statistics during compilation and simulation.

2. If function unit starvation is evident, then optimize the memory subsystem. Modify the architecture description file and go to step 1.

3. If function unit contention is seen, then optimize the function unit selection. Modify the architecture description file and go to step 1.

4. If high interconnect contention is observed, then optimize the interconnect subsystem. Modify the architecture description file and go to step 1.

The process iterates until a set of near-optimal designs is found. In cases where the algorithm cannot provide a feasible design, the tool returns to the last iteration.

CHAPTER 8

EVALUATION

The design goal of the instruction scheduling algorithm is to provide real-time performance with minimum energy. To evaluate the throughput and energy control capabilities of CoGenE, it is compared against the performance of hand-scheduled code on a Pentium 4 (Figure 8.1). The result is then compared to an XScale-based implementation for energy consumption. Finally, a comparison of the two face recognition algorithms is presented.

Figure 8.1. Throughput comparisons for different configurations (Pentium 4, XScale, CoGenE, ASIC, best manual schedule) across the FL-Tone, Dilate, Erode, Viola, FIR, Eye location, PCA/LDA, and EBGM benchmarks

8.1 Face Recognition Evaluation

Figure 8.1 compares the throughput (number of input frames processed per second) for the different processors. The hand-coded implementation delivers the best throughput.
The CoGenE version delivers a throughput that is 1.65 times better than the Pentium 4 processor and 8.64 times better than the XScale processor. This underlines the fact that the CoGenE framework exploits the streaming nature of the face recognition kernels to deliver the throughput necessary to achieve real-time constraints. CoGenE achieves 85% of the throughput of manually scheduled code. Figures 8.2 and 8.3 show the energy consumption per input and the energy-delay product comparison for the different processors. The CoGenE compiler reduces energy consumption by 9.25x when compared to the low-power XScale processor. The energy advantage comes from the efficient decoupling between address and data computations provided by the loop unit and AGUs, and from minimizing communication overhead through the ASIC-like pipeline structures. The result is a DSA that performs face recognition at embedded energy budgets. It is noteworthy that the energy-delay product of the XScale processor is within 35% of the Pentium 4 processor, and that our approach provides an 80x improvement over the XScale. The improvements are

Figure 8.2. Energy/input packet comparison

Figure 8.3.
Energy-delay product comparison

consistent across all the applications in the domain. It is interesting that flesh toning accounts for less than 5% of the total execution time but consumes an incommensurate amount of the total energy. This is because the floating point parallelism in flesh toning exceeds the number of floating point units (four) available in the cluster. Intermediate results must therefore be saved to and retrieved from the register file, which is inefficient. The hand-scheduled code does a better job of vectorizing the code, which indicates that further scheduling improvements are possible. CoGenE does well on the image segmentation phase (erode and dilate kernels), and the architecture delivers two orders of magnitude better energy-delay product than the XScale. The Viola/Jones face detection algorithm is characterized by a recurrence that involves two adjacent image rows and an additional row for intermediate storage. The algorithm sweeps over the image by operating on a 24x24 window and then successively shifts by one pixel position. Pixel value lifetimes are therefore high. The architecture benefits as a result and reduces energy consumption by as much as 22x over the XScale. The CoGenE FIR version delivers two orders of magnitude energy-delay product improvement over the XScale processor and is only 24x worse than the ASIC implementation. This is partly because the ASIC possesses significantly more functional units than our architecture.

8.1.1 PCA/LDA vs EBGM

One of the goals of this study is to compare two fundamentally different face recognition algorithms and to identify the algorithm that is better suited for hardware implementation. The PCA/LDA algorithm is a holistic image comparison algorithm, as opposed to the EBGM algorithm. The EBGM algorithm requires an additional normalization step after face detection to increase its accuracy.
This adds computational complexity to the algorithm and contributes to the 9% performance advantage of the PCA/LDA algorithm. The PCA/LDA algorithm also has a 17% advantage in energy and a 30% advantage in energy-delay product. We then reduced the number of facial feature nodes in the EBGM algorithm in order to reduce complexity, but found that accuracy immediately fell to unacceptable levels. The conclusion is that the PCA/LDA algorithm is superior for our architecture and compilation approach.

8.2 SCA Results

Embedded designers typically attempt to design a DSA to meet a given performance and energy budget (Figure 8.4) and then optimize the area for the design. SCA is employed in a similar manner and searches the design space for a set of designs that meet the minimum performance and energy budgets. In the first case study, the impact of SCA is evaluated in designing an optimal domain-specific architecture for the face recognition domain. The seven benchmarks required for face recognition are fed as inputs to the framework for iterative exploration. We then discuss the design of optimal DSAs for speech recognition and wireless telephony. Each of the energy-delay-optimal DSAs is compared to the best manual designs from previous studies, and also to industrial design points wherever applicable, for performance and energy dissipation.

8.2.1 DSA for Embedded Face Recognition

Figure 8.4 shows the SCA design points for the seven kernels in the face recognition suite. It shows the throughput and energy dissipation for each of the design points, starting with the usual initial design point (1 INT, 1 FPU; 1 KB input, scratch, and output SRAMs; 1 AGU/SRAM; no HLU), whose throughput is about five times slower than the real-time performance target of 5 frames/sec. Minimum real-time performance is shown by a vertical line normalized to 1. All points
All points 52 1.50E-03 2.00E-03 2.50E-03 3.00E-03 3.50E-03 4.00E-03 4.50E-03 5.00E-03 5.50E-03 6.00E-03 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Throughput (compared to real time) Energy (mj/input) Best Manual Design Energy envelope Minimum Real time performance Quadrant of acceptable designs Figure 8.4. SCA applied to face recognition to the left of the line do not meet performance goal. SCA successively increases the size of the input and output SRAMS to 8 KB with up to 2 AGUs/SRAM. At this point, throughput starts to saturate and this is indicated by a low SRAM miss rate and very high utilization of AGU and interconnect resources. SCA then adds an HLU and successively increases the number of contexts to improve performance. Once memory optimization is complete, SCA dilates function unit resources and significant increases in performance is observed. A configuration of 3 INT + 3 FPU function units achieves the minimum required performance. All design points to the right of this configuration are checked against the energy requirements. The horizontal line indicates the energy budget and was set to be one order of magnitude better than the XScale. The feasible design quadrant contains designs that meet both energy and performance constraints and the user can then choose a particular design for fabrication. Our previous best manual design [61, 77] comprised: three 8 KB SRAMs, with 3 HLU contexts, and a 8 way VLIW (3 INT + 4 FPU + 1 register file)) and was shown to be 1.65 times faster than the minimum required real-time performance with a 10x energy benefit when compared to the XScale. Exploration also found this design point (Figure 8.5). The energy-delay plots demonstrate that the architecture was designed for close to optimal 53 2.00E-03 2.20E-03 2.40E-03 2.60E-03 2.80E-03 3.00E-03 3.20E-03 1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 Area Energy-Delay Product Best Manual Design Figure 8.5. 
Energy-delay product comparisons for performance-energy designs

energy-delay characteristics. Table 8.1 shows that SCA also found a configuration, with a 4 KB scratch SRAM and an additional integer unit, that has a 4% energy improvement and a marginal energy-delay product improvement over the manual design. Manual design is error prone and extremely time consuming. This case study indicates that similar or better results can be found rapidly by exploring additional design points. Due to rapid SCA, this ex
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6v41916 |



