| Title | Design and implementation of clocked open core protocol interfaces for intellectual property cores and on-chip network fabric |
| Publication Type | thesis |
| School or College | College of Engineering |
| Department | Electrical & Computer Engineering |
| Author | Gudla, Raghu Prasad |
| Date | 2011-05 |
| Description | This thesis designs, implements, and evaluates modular Open Core Protocol (OCP) interfaces for Intellectual Property (IP) cores and Network-on-Chip (NoC) that reduce System-on-Chip (SoC) design time and enable research on different architectural sequencing control methods. To utilize the NoC's design-time optimization feature at the boundaries, a standardized industry socket was required that can address the SoC's shorter time-to-market requirements, design issues, and the subsequent reuse of developed IP cores. OCP is an open industry-standard socket interface specification used in this research to enable IP core reusability across multiple SoC designs. This research designs and implements clocked OCP interfaces between IP cores and on-chip network fabric (NoC), in single- and multifrequency clocked domains. The NoC interfaces between IP cores and on-chip network fabric are implemented using the standard network interface structure. It consists of back-end and front-end submodules corresponding to customized interfaces to IP cores or network fabric and OCP Master and Slave entities, respectively. A generic domain interface (DI) protocol is designed which acts as the bridge between back-end and front-end submodules for synchronization and data flow control. Clocked OCP interfaces are synthesized, placed, and routed using IBM's 65nm process technology. The implemented designs are verified for OCP compliance using SOLV (Sonics OCP Library for Verification). Finally, this thesis reports performance metrics such as design target frequency of operation, latency, area, energy per transaction, and maximum bandwidth across the network-on-chip for single- and multifrequency clocked designs. |
| Type | Text |
| Publisher | University of Utah |
| Subject | Average power; Energy per transaction; Maximum bandwidth; Area and latency; Domain interface; Back-ends; Front-ends; IP cores; System-on-chip; Network-on-chip; Open core protocol; RTL development; Logic synthesis; APR and testing; Single-frequency clocked domain; Multifrequency clocked domain |
| Dissertation Institution | University of Utah |
| Dissertation Name | Master of Science |
| Language | eng |
| Rights Management | Copyright © Raghu Prasad Gudla 2011 |
| Format | application/pdf |
| Format Medium | application/pdf |
| Format Extent | 9,105,880 bytes |
| Identifier | us-etd3,14979 |
| Source | Original in Marriott Library Special Collections QA3.5 2011 .G83 |
| ARK | ark:/87278/s6kk9sh9 |
| DOI | https://doi.org/10.26053/0H-NK4J-H900 |
| Setname | ir_etd |
| ID | 194488 |
| OCR Text | DESIGN AND IMPLEMENTATION OF CLOCKED OPEN CORE PROTOCOL INTERFACES FOR INTELLECTUAL PROPERTY CORES AND ON-CHIP NETWORK FABRIC by Raghu Prasad Gudla A thesis submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Master of Science Department of Electrical and Computer Engineering The University of Utah May 2011 Copyright © Raghu Prasad Gudla 2011 All Rights Reserved The University of Utah Graduate School STATEMENT OF THESIS APPROVAL This thesis of Raghu Prasad Gudla has been approved by the following supervisory committee members: Kenneth S. Stevens, Chair 01/14/2011 Date Approved Alan L. Davis, Member 01/14/2011 Date Approved Erik L. Brunvand, Member 01/14/2011 Date Approved and by Gianluca Lazzi, Chair of the Department of Electrical and Computer Engineering, and by Charles A. Wight, Dean of the Graduate School. ABSTRACT This thesis designs, implements, and evaluates modular Open Core Protocol (OCP) interfaces for Intellectual Property (IP) cores and Network-on-Chip (NoC) that reduce System-on-Chip (SoC) design time and enable research on different architectural sequencing control methods. To utilize the NoC's design-time optimization feature at the boundaries, a standardized industry socket was required that can address the SoC's shorter time-to-market requirements, design issues, and the subsequent reuse of developed IP cores. OCP is an open industry-standard socket interface specification used in this research to enable IP core reusability across multiple SoC designs. This research designs and implements clocked OCP interfaces between IP cores and on-chip network fabric (NoC), in single- and multifrequency clocked domains. The NoC interfaces between IP cores and on-chip network fabric are implemented using the standard network interface structure. 
It consists of back-end and front-end submodules corresponding to customized interfaces to IP cores or network fabric and OCP Master and Slave entities, respectively. A generic domain interface (DI) protocol is designed which acts as the bridge between back-end and front-end submodules for synchronization and data flow control. Clocked OCP interfaces are synthesized, placed, and routed using IBM's 65nm process technology. The implemented designs are verified for OCP compliance using SOLV (Sonics OCP Library for Verification). Finally, this thesis reports performance metrics such as design target frequency of operation, latency, area, energy per transaction, and maximum bandwidth across the network-on-chip for single- and multifrequency clocked designs. CONTENTS ABSTRACT ... iii LIST OF FIGURES ... vii LIST OF TABLES ... ix ACKNOWLEDGEMENTS ... xi CHAPTERS 1. INTRODUCTION ... 1 1.1 Motivation ... 1 1.2 Related Work ... 3 1.2.1 Bus-centric Protocols ... 4 1.2.2 Core-centric Protocols ... 5 1.3 Objectives ... 8 1.4 Contributions ... 8 1.5 Thesis Overview 
... 9 2. BACKGROUND ... 10 2.1 Existing Architectural Designs with OCP ... 10 2.2 Building Standard NoC Interfaces Using OCP ... 11 2.3 OCP Description ... 12 2.3.1 Introduction to OCP ... 12 2.3.2 OCP Operation and Its Key Features ... 12 2.3.3 OCP Signals and Encoding ... 15 2.3.3.1 Dataflow Signals ... 15 2.3.3.2 Sideband Signals ... 19 2.3.3.3 Test Signals ... 19 2.3.4 OCP Signal Directions and Groups ... 20 2.4 Prior Relevant Research Work ... 21 2.5 Design Development, Implementation, and Testing ... 21 2.5.1 RTL Development Using Verilog HDL ... 23 2.5.2 Logic Synthesis and Automatic Place and Route ... 23 2.5.3 Design Testing Environment ... 24 2.5.4 Design Metrics Computation ... 25 2.6 Validation of OCP Compliant IP Cores ... 27 3. DESIGN DESCRIPTION AND SPECIFICATIONS ... 29 3.1 Design Description ... 29 3.1.1 Design Structure ... 
29 3.1.1.1 Modular Components Across Design ... 29 3.1.1.2 Synchronization and Buffering Placement ... 31 3.1.2 Network Interface Structure ... 32 3.1.2.1 Back-end and Front-end Interfaces ... 33 3.1.2.2 Domain Interface ... 33 3.1.3 Synchronization and Buffering ... 34 3.1.3.1 Asynchronous Clocked Pointer FIFOs ... 34 3.1.4 Network-on-Chip Using Synchronous Routers ... 36 3.1.4.1 Data Packets Switching Technique ... 36 3.1.4.2 Data Packetization and Depacketization ... 37 3.1.4.3 Request Packet Format ... 38 3.1.4.4 Response Packet Format ... 38 3.1.4.5 Data Flit Implementation ... 39 3.2 Design Specifications and Supporting Features ... 40 3.2.1 Proposed OCP Subset ... 40 3.2.2 Supporting Features ... 41 3.2.2.1 Modes of Operation ... 41 3.2.2.2 Burst Transactions ... 43 3.2.2.3 Tagging or Out-of-Order Response ... 43 4. DESIGN IMPLEMENTATION ... 44 4.1 Clocked OCP Design Implementation ... 44 4.1.1 Single-Frequency Clocked Domain Implementation ... 46 4.1.1.1 IP Cores Back-end Modules ... 
48 4.1.1.2 Domain Interface (DI) Module ... 48 4.1.1.3 On-Chip Network Fabric Back-end Modules ... 49 4.1.1.4 OCP Master and Slave Entity Modules ... 50 4.1.2 Multifrequency Clocked Domain Implementation ... 51 4.1.2.1 DI Module with FIFOs ... 52 4.2 Design Implementation Flow ... 54 4.2.1 RTL Source Codes Using Verilog HDL ... 55 4.2.2 Design Synthesis Using Synopsys Design Compiler ... 57 4.2.3 Design APR Using Cadence SoC Encounter ... 57 5. VALIDATION OF CLOCKED OCP COMPLIANT INTERFACES ... 59 5.1 OCP Checker Setup and Instantiation ... 60 6. DESIGN TESTING AND RESULTS ... 62 6.1 Testing Environment ... 62 6.2 Performance Models and Results ... 63 6.2.1 Performance Metrics ... 63 6.2.2 Performance Models ... 66 6.2.2.1 Model for Latency Cycles ... 66 6.2.2.2 Model for Energy per Transaction ... 68 6.2.2.3 Maximum Bandwidth ... 68 6.2.3 Simulation Results ... 69 6.2.3.1 Single-Frequency Clocked Domain ... 69 6.2.3.2 Multifrequency Clocked Domain ... 71 6.2.4 Design Evaluation ... 
81 6.2.4.1 Clocked Designs (SFCD and MFCD) ... 81 6.2.4.2 Network-On-Chip ... 88 6.2.5 Results Summary ... 89 6.2.5.1 NoC Performance ... 96 7. CONCLUSION AND FUTURE RESEARCH ... 97 7.1 Summary ... 97 7.2 Extensions and Future Research ... 99 7.2.1 Extensions and Improvements to OCP Design ... 100 7.2.2 Future Research ... 101 APPENDIX: OCP INTERFACE SIGNALS ... 102 REFERENCES ... 104 LIST OF FIGURES 1.1 An SoC Architecture ... 3 1.2 IP Cores with OCP Interfaces Using Hierarchical and Heterogeneous Interconnect System ... 7 2.1 IP Cores with Native OCP Interfaces ... 10 2.2 IP Cores with Clocked Wrapper OCP Interfaces ... 11 2.3 A Simple System with OCP between IP Cores and On-Chip Bus ... 13 2.4 IP Cores with Native OCP Interfaces Implementation ... 21 2.5 Design Implementation Flow ... 22 2.6 Design Testing Environment Using Verilog Test Bench ... 25 2.7 Pipelined CPU 32-Bit Address Format ... 25 2.8 PrimeTime (PX) Power Analysis RTL VCD Flow 
... 26 2.9 CoreCreator Tool for Validating OCP Compliant IP Cores ... 28 3.1 Design Structure with Customized Back-ends, Front-ends, and DI Module ... 30 3.2 Design Structure with Modular Components Across Design Path ... 31 3.3 Synchronization and Buffering Placement Across Design Path ... 32 3.4 Standard Network Interface Structure ... 32 3.5 Asynchronous Dual Clocked Pointer FIFOs ... 35 3.6 3-Port Synchronous Router Design ... 36 3.7 Request Packet Format ... 38 3.8 Response Packet Format ... 39 3.9 Single Flit Data Format ... 39 4.1 Clocked OCP Design Implementation with IP Cores and NoC ... 45 4.2 Single-Frequency Clocked Domain (SFCD) Implementation with Modular Components Across Design ... 47 4.3 On-Chip Network Back-ends Interfacing NoC ... 49 4.4 Transaction Phases between OCP Master and OCP Slave Entities ... 50 4.5 Multifrequency Clocked Domain (MFCD) Implementation Using Asynchronous Clocked FIFOs ... 52 4.6 DI Module with Asynchronous Clocked FIFOs ... 53 4.7 Design Implementation Flow with Technology Library Files ... 54 4.8 Design RTL Structure ... 56 4.9 Design Synthesis Structure ... 58 5.1 SOLV Components and Tool Flow ... 60 5.2 OCP Checker Code Snippet 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.1 Design Test Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 6.2 Performance Comparison Between NSP and SP Operating Modes in SFCD: Frequency v/s Average Power . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.3 Performance Comparison Between NSP and SP Operating Modes in SFCD: Frequency v/s Energy per Transaction . . . . . . . . . . . . . . . . . . . . 84 6.4 Performance Comparison Between NSP and SP Operating Modes in SFCD: Frequency v/s Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.5 Performance Comparison Between NSP and SP Operating Modes in SFCD: Frequency v/s Maximum Bandwidth . . . . . . . . . . . . . . . . . . . . . 85 6.6 Performance Comparison Between NSP and SP Operating Modes in MFCD: Frequency v/s Average Power . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.7 Performance Comparison Between NSP and SP Operating Modes in MFCD: Frequency v/s Energy per Transaction . . . . . . . . . . . . . . . . . . . 86 6.8 Performance Comparison Between NSP and SP Operating Modes in MFCD: Frequency v/s Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 6.9 Performance Comparison Between NSP and SP Operating Modes in MFCD: Frequency v/s Maximum Bandwidth . . . . . . . . . . . . . . . . . . . . 87 6.10 NoC Performance: Number of Routers v/s Latency . . . . . . . . . . . . . . . . 92 6.11 NoC Performance: Number of Routers v/s Energy per Transaction . . . 92 6.12 NoC Performance in SFCD: Frequency v/s NSP Mode Average Power . 93 6.13 NoC Performance in SFCD: Frequency v/s SP Mode Average Power . . 93 6.14 NoC Performance in SFCD: Frequency v/s Idle Mode Average Power . . 94 6.15 NoC Performance in MFCD: Frequency v/s NSP Mode Average Power 94 6.16 NoC Performance in MFCD: Frequency v/s SP Mode Average Power . . 95 6.17 NoC Performance in MFCD: Frequency v/s Idle Mode Average Power . 
95 A.1 OCP Dataflow Signals ... 102 A.2 OCP Sideband and Test Signals ... 103 LIST OF TABLES 2.1 OCP Basic Dataflow Signals ... 16 2.2 OCP Signal Groups ... 20 3.1 OCP Proposed Subset ... 42 6.1 Nonsplit Mode Test Bench ... 64 6.2 Split Mode Test Bench ... 65 6.3 SFCD: Individual and Total Area Reported from SoC Encounter in Case of 1-router NoC ... 70 6.4 SFCD: Reported Total Average Power from SoC Encounter in Case of 1-router NoC ... 71 6.5 SFCD: Reported Top1 Module Average Power from PrimeTime in Case of 1-router NoC ... 72 6.6 SFCD: Reported Top2 Module Average Power from PrimeTime in Case of 1-router NoC ... 73 6.7 SFCD: Reported NoC Module Average Power from PrimeTime in Case of 1-router NoC ... 74 6.8 SFCD: Total Average Power and Energy per Transaction Computation in Case of 1-router as NoC ... 75 6.9 MFCD: Individual and Total Area Reported from SoC Encounter in Case of 2-routers NoC 
76 6.10 MFCD: Reported Total Average Power from SoC Encounter in Case of 2-routers NoC ... 77 6.11 MFCD: Reported Top1 Module Average Power from PrimeTime in Case of 2-routers NoC ... 78 6.12 MFCD: Reported Top2 Module Average Power from PrimeTime in Case of 2-routers NoC ... 79 6.13 MFCD: Reported NoC Module Average Power from PrimeTime in Case of 2-routers NoC ... 80 6.14 MFCD: Total Average Power and Energy per Transaction Computation in Case of 2-routers as NoC ... 81 6.15 Performance Metrics Over a Range of Frequencies for Single-Frequency Clocked Domain and Multifrequency Clocked Domain ... 82 6.16 Performance Metrics Over a Range of Frequencies for Single-Frequency Clocked Domain and Multifrequency Clocked Domain ... 83 6.17 NoC Evaluation: Latency and Energy per Transaction for Varying Number of Routers ... 88 6.18 NoC Evaluation: NSP Mode Average Power for Varying Number of Routers ... 89 6.19 NoC Evaluation: SP Mode Average Power for Varying Number of Routers ... 90 6.20 NoC Evaluation: Idle Mode Average Power for Varying Number of Routers ... 91 ACKNOWLEDGEMENTS I would like to convey my gratitude to all who made this thesis possible. First and foremost, I want to thank my advisor, Dr. Kenneth Stevens, whose guidance and sound advice have enabled me to accomplish this research. 
Above all, I appreciate his unflinching support and his contributions of time and ideas in making my MS study experience productive and stimulating. Over this period, I have imbibed a few of his great qualities and will always look up to him as an excellent example of a professor. I gratefully acknowledge Dr. Al Davis and Dr. Erik Brunvand for their supervision and valuable input during the initial stages of the research. I owe my deepest gratitude to the University of Utah, which has contributed in all possible ways to my success. I thank the Open Core Protocol International Partnership (OCP-IP) organization and Sonics Corporation for providing OCP specifications, tools, and support. I would like to thank my advisor for supporting my research financially on behalf of the Semiconductor Research Corporation (SRC) and National Science Foundation (NSF). I am especially indebted to my research group colleagues Junbok You and Daniel Gebhardt for the realization of the design and their timely guidance throughout the research. I take this opportunity to thank my other research group colleagues, Krishnaji, Vikas, Shomit, Eliyah, and Yang Xu. I extend my thanks to all my friends in Salt Lake City and in India. Lastly, and most importantly, I wish to thank my parents and entire extended family for their love and encouragement. Of course, above all, I wholeheartedly thank God for ubiquitous support in my life. For my parents, who raised me, taught me, and loved me for what I am, I dedicate this thesis to them. Thank you. CHAPTER 1 INTRODUCTION 1.1 Motivation Current technology trends, process scaling, and end users' marked preference for the smaller geometries of deep submicron processes force a design style where multiple independent circuit implementations are integrated into a single System-on-Chip (SoC). However, contemporary SoC designs have their own share of issues and challenges. 
The major challenges faced by a design engineer include the ever-increasing complexity of modern SoC designs, reusability, time-to-market, communication between Intellectual Property (IP) cores, integration of IP cores from different clocked domains, and global clock distribution on a chip. The design of standard Network-on-Chip (NoC) interfaces to an SoC is pivotal in addressing design reusability, time-to-market, and integration of IP cores employing different clock domains (synchronous, elastic, and asynchronous). I became motivated to take up this research from knowledge gained through academic experience in Very Large Scale Integrated Circuit (VLSI) design, verification, and testing. This research is mainly targeted at providing an efficient solution to SoC design challenges by building standard NoC interfaces using an industry-standard socket interface, the Open Core Protocol (OCP). Also listed below are a few motivating factors from an industrial perspective in the realization and implementation of this project: a. Levels of device integration lead us to the SoC design style. SoC provides the platform for integrating different architectural cores such as microprocessor chips, application-specific integrated circuit (ASIC) chips, random access memory (RAM) chips, and peripherals on a single die. The major advantage of SoC design over custom design is its shorter time-to-market, but at the expense of performance and area. SoC designs help enable IP reusability when they utilize a standard communication interface. Employing a standard socket interface on an SoC enables reuse of good designs with minimal modification to IP cores [1]. This project targets the shorter time-to-market feature of SoCs to build standard interfaces between IP cores at the expense of power, performance, and area. b. 
NoC is the best method of communication on SoC designs. With the ever-growing complexity of SoC designs and the need for better performance, an efficient communication medium is needed. SoC designs can use a NoC or an on-chip bus as the communication medium between IP cores. A network-on-chip is a more efficient communication medium than a bus because of advantages like the following [2]: efficiency improvements in speed, bandwidth, area, and power consumption; support for concurrency (effective spatial reuse of resources); low latency; scalable bandwidth; and modularity. c. OCP, a standard socket interface to address IP core reusability. OCP is a standard core-centric protocol which addresses IP core reusability [3]. This not only allows independent IP core development without the IP core interconnect network but also allows IP core development in parallel with a system design, reducing design time, design risk, and manufacturing costs. d. The design implementation supports different IP core architectures and enables research on NoC. Building standard NoC interfaces using OCP between IP cores and on-chip network fabric not only supports different architectural designs but also gives a chance to research different NoC architectures by employing different clocking strategies. Commercially available IP cores with OCP are typically synchronous in nature, but IP cores with wrapper OCP interfaces enable us to have clocked, elastic, or asynchronous interfaces which will support different IP core architectures across multiple SoC designs. 1.2 Related Work Over the years, relentless advances in semiconductor fabrication and the continuous increase in the complexity of modern SoC designs have led to the integration of more IP blocks into a chip. Figure 1.1 shows a typical SoC architecture design [4]. Hundreds of IPs are integrated into an SoC, providing various functionalities including inter-IP core communications, networking, multimedia, storage, etc. 
An open and flexible standard interface protocol, such as a bus-centric or core-centric protocol, for IP cores and NoC is necessary to address design reusability, time-to-market, efficient on-chip intercommunication, and SoC integration and verification. Semiconductor IP core designers are striving to ensure their IP can be used in the widest possible range of applications, but integration of these IPs on an SoC is an issue. SoC integration of third-party IPs with different interface standards requires an external adapter or bridge to connect them to a standard protocol. Designing such adapters or bridges is not a difficult task, but verification is an issue due to standard translation and compliance checks. Figure 1.1. An SoC Architecture. 1.2.1 Bus-centric Protocols Bus-centric protocols define technology-independent standard bus protocol methodologies for easy integration of IPs within an SoC [5] [6]. Bus protocols are based upon a printed-circuit-board style of interconnect structure consisting of hierarchical wire bundles, and they are proving to be ineffective for communication in complex SoC designs. All bus protocols strictly define an interblock dataflow communication methodology. Also, bus protocols typically do not support sideband control (Reset, Control, Status, Error, and Interrupt signals) and test signals (Scan, JTAG signals), which creates a loss of features or performance on interfacing with another bus- or core-centric protocol. The Advanced Microcontroller Bus Architecture (AMBA) was developed by ARM Ltd. and is widely used as the on-chip communication bus standard in SoC designs [7]. AMBA is one solution to interface IP cores with each other on an SoC and also enables the development of multiprocessor designs with large numbers of controllers and peripherals. 
The AMBA bus provides design reusability by defining a common backbone for SoC modules using the AXI, AHB, APB, ASB, and ATB specifications, each targeted at meeting different requirements. CoreConnect is an IBM microprocessor bus architecture specification for interconnecting IP cores and custom logic in SoC designs [8]. CoreConnect has bridging capabilities similar to those of the AMBA bus architecture, allowing reuse of existing SoC cores (processor, system, and peripheral cores). The CoreConnect architecture provides three buses: a processor local bus (PLB) for connecting high-performance, low-latency peripherals; an on-chip peripheral bus (OPB) to connect the slower peripheral cores (reducing the traffic on the PLB); and a device control register bus (DCR) designed to transfer data between the CPU's general-purpose registers and the DCR slave logic's device control registers. Wishbone is an open-source interconnection bus architecture from Silicore Corporation intended to interface IP cores with each other on an SoC and provide reusability by creating a common data exchange protocol between IP cores [9]. It is a simple, compact, one-bus architecture for all applications. Because of its simplicity and flexibility, it is utilized in simple embedded controllers and high-performance systems. All three buses, AMBA, CoreConnect, and Wishbone, are fully synchronous in nature, using the rising edge of the clock to drive and sample all signals [10]. The differences are in the supporting features of the specification, depending on the choice of system buses used by the designer in the case of AMBA and CoreConnect. A designer might face problems integrating different interfaces of the interconnects. Bridges or adapters are required to build a complete system in the case of interconnect incompatibility. 
Wishbone connects all its IP cores to the same standard interface, and a system designer always has the flexibility to choose two Wishbone interfaces for implementation in a microcontroller core: one for high-speed, low-latency devices and one for low-speed, low-performance devices. Bus-centric protocols are targeted at a single, unique application, and the interface circuitry is defined for that particular application. Any change in the design application requires a redesign of the arbitration logic and interface circuitry for the new application. Incompatibility between the chosen interconnect system and the bus-centric native protocol requires multiple bus bridges for communication, and limits the maximum utilization of IP capabilities [11]. Also, whenever there are differences in data and address presentation sequences between an IP core's native bus and the target bus, the IP core's performance will likely suffer due to the bridge-on-a-bridge effect of having to correlate the signaling between the two disparate bus architectures. Also, one must accept the bridge's gate count, which is likely to be higher. Selecting a bus-centric protocol as an IP core's native interface will ultimately limit its reusability compared to a core-centric protocol. A socket can fulfill the reusability requirement across virtually any application, and the process can also be automated.

1.2.2 Core-centric Protocols

The solution to maximizing an IP core's reusability while still exploiting the advantages of proven industry-standard interfaces (sockets) is to adopt a well-specified core-centric protocol as the IP core's native interface [1] [6]. Sockets are universally accepted and are targeted to support virtually any application because of their design reusability and verification features.
Sockets also provide a complete interface specification between IP cores and on-chip interconnect systems, which enables designers to independently develop individual SoC cores, reducing overall SoC development time and effectively decreasing time-to-market. Basically, the individual IP cores are designed and developed simultaneously by decoupling them from the system in which they reside. The final SoC is developed in parallel, and the design time is reduced to that of the longest effort required by a single element design or the SoC integration. System designers also benefit from not having to consider other diverse core protocols and delivery styles. Use of a standard IP core interface eliminates having to adapt each core for every SoC integration, and instead allows the system designer to focus on system-level design issues. Also, since the IP cores are decoupled from the on-chip interconnect and from each other, it is easier to swap one core for another to meet changing requirements. For an IP core to be truly reusable, it must remain untouched as it moves from one system to another, and its interface must match the continually differing requirements of each system's interconnect. A standard IP core interface specification must be scalable and configurable to adapt to this wide range of requirements. It must also be able to capture the non-dataflow signals (such as reset, interrupt, error, and flow control signals) along with the dataflow signals. The following are two core-centric protocols: The Open Core Protocol (OCP) is an open-standard, bus-independent protocol provided by the Open Core Protocol International Partnership (OCP-IP). It meets all the core-centric requirements and is one of the protocols that unifies all intercore communications, including sideband control and test signals [3].
OCP defines a high-performance, complete interface socket between IP cores, facilitating design reuse and also reducing design time, design risk, and manufacturing costs for SoC designs. By adopting OCP, IP cores can be developed independent of interconnect topology and implementation. Figure 1.2 shows IP cores with OCP interfaces connected through different interconnect topologies [4]. OCP supports very high-performance data transfer models ranging from simple request-grants through burst, pipelined, and tagged transactions. OCP protocol compliance verification is one of the features distinguishing it from other protocols. OCP-IP provides not only the specification and its member-driven evolution, but also industrial-grade tools and services that ensure its members can rapidly confirm compliance and maximize their productivity.

Figure 1.2. IP Cores with OCP Interfaces Using Hierarchical and Heterogeneous Interconnect Systems

IP cores using native OCP interfaces can easily communicate with any bus architecture or on-chip network fabric through simple bridge structures. Even with bus bridges, IP cores can utilize their maximum capabilities using OCP. If the chosen interconnect system cannot interface directly to OCP, the IP developer can design and develop bus bridges (wrapper interfaces) for common bus architectures that a customer may choose. The Virtual Component Interface (VCI) is another core-centric protocol, provided by the Virtual Socket Interface Alliance (VSIA) group [12]. VCI is similar to OCP in capability and philosophy, but VCI supports only the dataflow aspects of core communications, whereas OCP is a superset of VCI supporting configurable sideband control signaling and test signals.
1.3 Objectives

This research will design and implement clocked OCP interfaces between IP cores and on-chip network fabrics for single- and multifrequency (Globally Asynchronous Locally Synchronous (GALS) architecture) domains, including the following:

- Design and implementation of customized back-ends to IP cores and NoC.
- Design and implementation of front-ends (OCP Master and Slave entities).
- Design and implementation of a generic Domain Interface (DI) module.
- Customize and build asynchronous dual-clocked First-In-First-Out memory structures (FIFOs) for the DI module used in GALS architectures.
- Determine the best placement of buffering and synchronization across the design path.
- Synthesize clocked designs with IBM's 65nm process technology using Synopsys Design Compiler (DC).
- Automatic Place and Route of clocked designs using Cadence SoC Encounter.
- Design validation and derivation of performance metrics for clocked designs in nonsplit and split operation modes.
- Perform power analysis on different clocked designs using Synopsys PrimeTime PX to derive energy-per-transaction values.
- Build parameterized computational models for estimating the performance of clocked design configurations.
- Performance evaluation of clocked designs and NoC.
- Performance comparison between single- and multifrequency clocked designs.
- Verification of OCP-compliant IP cores using the SOLV component.

1.4 Contributions

The following are the key contributions of this research:
i. Designs and implements industry-standard clocked OCP interfaces for the single-frequency clocked domain and GALS architectures.
ii. Adopts a novel modular architecture that provides high design reusability, meets SoC shorter time-to-market requirements, and simplifies design validation.
iii. The implemented OCP interfaces enable future research:
- Deriving elastic and asynchronous OCP implementations.
- Evaluating NoC performance from end-to-end.
- Research and comparison of different architectural sequencing control methods for IP cores and NoC (clocked, elastic, and asynchronous).

1.5 Thesis Overview

The rest of this thesis is organized into five chapters, which give a detailed description of the research. Chapter 2 presents prerequisite information on current architectural design issues and challenges, and their solution through building standard NoC interfaces using OCP. A brief description of OCP follows, including its operation, key features, and protocol signals. Chapter 3 briefly describes the project design and its specifications. An overview of the design structure is presented, including the modularization of components across the design and network interface structure, synchronization and buffering, and data packetization and depacketization mechanisms. Chapter 4 explains the design implementation in single- and multifrequency clocked domains. The design implementation includes a brief description of the steps involved in implementing the two clocked domains and the functionality of the IP core and NoC back-ends, OCP entities, and the DI module. Chapter 5 describes the verification methodology used to determine OCP compliance of IP cores. Chapter 6 explains the different testing strategies and performance metrics used for comparison. The maximum target frequency in single-frequency and multifrequency clocked domains is determined. Other performance metrics such as latency, energy per flit, area, and maximum bandwidth across the NoC are also tabulated. Different design parameters are varied to determine the design's maximum and minimum performance limits in the case of design expansion, and worst and best scenarios. Finally, Chapter 7 summarizes the research work, including the project scope, application, and areas of future research. The Appendix includes detailed information about the desynchronization and elasticization of clocked NoC interfaces and the complete set of OCP protocol signals.
CHAPTER 2

BACKGROUND

2.1 Existing Architectural Designs with OCP

The existing OCP interfaces for IP cores can be classified into two categories depending on how OCP is implemented with respect to the IP core's functionality and communication interface. If the OCP interface is integrated as part of the IP core, we refer to it as a native OCP interface; otherwise, we call it a wrapped OCP interface [3]. In the case of native OCP interfaces, the OCP interfaces are integrated as part of the IP core and are typically developed along with the new IP cores. Figure 2.1 illustrates IP cores with native OCP interfaces. In the case of wrapped OCP interfaces, a wrapper (bridge) OCP interface is placed around the IP core's native interface to communicate with the outside world. The additional wrapper OCP interface basically maps the signals and values of the existing IP core to OCP-compatible signals and values. Essentially, this wrapped interface is an exercise in protocol conversion between an IP core's existing protocol and OCP, as it must implement flow control and at times execute commands that are not part of the base IP core functionality. The conversion logic can range from a few gates to a very complicated interface.

Figure 2.1. IP Cores with Native OCP Interfaces

However, from the viewpoint of any IP core or on-chip network fabric that uses OCP to communicate with another IP core, the type of OCP interface (native or wrapped) does not matter; both simply use the OCP protocol.

2.2 Building Standard NoC Interfaces Using OCP

This research provides a solution for interfacing IP cores of different architectures on an SoC by building standard NoC interfaces using OCP. Current architectural designs which utilize OCP for interfacing with other IP cores are synchronous in nature, and typically OCP is integrated into the IP core (native OCP).
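As a hedged illustration of the wrapped style described above, the sketch below maps a hypothetical IP core's simple request/write native interface onto a small subset of OCP dataflow signals. The native-side port names and the mapping itself are illustrative assumptions, not taken from this thesis; only the MCmd encodings (IDLE=0, WR=1, RD=2) and the DVA response code follow the OCP convention.

```verilog
// Illustrative sketch only: a wrapper mapping a hypothetical native
// request interface onto a subset of OCP master signals. Everything
// on the core_* side is an assumption for the sake of the example.
module ocp_wrapper_sketch (
    input             clk,
    // Hypothetical native IP-core interface
    input             core_req,    // core has a transfer to issue
    input             core_write,  // 1 = write, 0 = read
    input      [31:0] core_addr,
    input      [31:0] core_wdata,
    output            core_done,
    output     [31:0] core_rdata,
    // OCP master subset
    output reg [2:0]  MCmd,
    output reg [31:0] MAddr,
    output reg [31:0] MData,
    input             SCmdAccept,
    input      [1:0]  SResp,
    input      [31:0] SData
);
    localparam IDLE = 3'b000, WR = 3'b001, RD = 3'b010;

    // Translate the native request into an OCP command; the command
    // must be held until the slave accepts it (SCmdAccept).
    always @(*) begin
        MCmd  = core_req ? (core_write ? WR : RD) : IDLE;
        MAddr = core_addr;
        MData = core_wdata;
    end

    // A DVA (data valid/accept) response completes the transfer.
    assign core_done  = (SResp == 2'b01);
    assign core_rdata = SData;
endmodule
```

In a real wrapper the conversion logic would also handle flow control, bursts, and any commands outside the base core's functionality, as noted above.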
As OCP is a clocked protocol, integration of OCP onto IP cores which employ other sequencing strategies, such as elastic protocols and asynchronous handshaking, has not been investigated. In order to explore elastic and asynchronous strategies, this project implements clocked wrapper OCP interfaces (OCP located outside the IP core). Figure 2.2 illustrates IP cores with clocked wrapper OCP interfaces. A significant portion of the research addresses the issue of simplifying the interfacing of IP cores of different architectures to OCP through the following:
a. Design and implement customized back-end interfaces to IP cores (such as a pipelined processor bus and synchronous memory) and on-chip network fabrics.
b. Investigate the possibility of a generic OCP wrapper interface protocol, called the domain interface (DI).

Figure 2.2. IP Cores with Clocked Wrapper OCP Interfaces

2.3 OCP Description

2.3.1 Introduction to OCP

The Open Core Protocol is an openly licensed, core-centric protocol standard which defines a high-performance, synchronous, bus-independent, configurable interface for communication between IP cores and NoC [3] [1]. It is an efficient point-to-point connection standard, and because of its configurability, scalability, and generality, it has been widely adopted in applications ranging from low-power to high-performance. It can be optimized to use only the features required for communicating between any two components, which saves chip area. It dramatically improves the reusability of IP cores independent of the architecture and design of the systems in which they are used, which leads directly to more predictable, productive SoC designs and also simplifies system verification and testing. OCP consists of an aggregation of signals that aims to unify the communication among IP blocks, reducing the system design time significantly.
OCP comprises a continuum of communication protocols that share a common definition across the whole system, which dramatically reduces the functional verification effort for any future release of the system.

2.3.2 OCP Operation and Its Key Features

OCP defines a point-to-point interface between two communicating entities such as IP cores and bus interface modules (bus wrappers). One entity acts as the master of the OCP instance, and the other as the slave. Only the master can present commands and is the controlling entity [3]. The slave responds to commands presented to it, either by accepting data from the master or by presenting requested data to the master. For two entities to communicate in a peer-to-peer fashion, there must be two OCP instances connecting them: one in which the first entity is the master, and one in which it is the slave. The characteristics of the IP core determine whether the core needs the master, slave, or both sides of OCP. The bus wrapper interface modules must act as the complementary side of OCP for each connected entity. Depending on the direction and type of a signal (e.g., MCmd, the master command), one can classify whether a module is a master or a slave. Figure 2.3 shows a simple system containing an on-chip bus and four IP core entities: one that is a system target, one that is a system initiator, and two that are both. Different OCP transactions are possible between system initiator and system target, with an optional response.

Figure 2.3. A Simple System with OCP between IP Cores and On-Chip Bus
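The complementary master and slave roles just described can be sketched as Verilog port lists over a minimal subset of the dataflow signals (the widths here are illustrative assumptions; in OCP they are configurable):

```verilog
// Minimal sketch of complementary OCP port directions: every signal
// the master drives, the connected slave samples, and vice versa.
module ocp_master_port (
    input         Clk,
    output [2:0]  MCmd,        // commands: only the master issues these
    output [31:0] MAddr,
    output [31:0] MData,
    input         SCmdAccept,  // slave flow control back to the master
    input  [1:0]  SResp,
    input  [31:0] SData
);
endmodule

module ocp_slave_port (
    input         Clk,
    input  [2:0]  MCmd,
    input  [31:0] MAddr,
    input  [31:0] MData,
    output        SCmdAccept,
    output [1:0]  SResp,
    output [31:0] SData
);
endmodule
```

Connecting the two port lists signal-for-signal forms one OCP instance; a peer-to-peer pair of entities would instantiate two such connections, one in each direction.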
In one OCP transaction type, the Master IP (system initiator) presents a request command without expecting a response from the Slave IP. Another type expects a response from the Slave IP along the same path. A third case starts like the first, but the Slave IP issues a response using its own OCP master interface to send a message back to the initiating IP. Each transfer across this system occurs as follows. A system initiator (as the OCP master) presents command, control, and possibly data to its connected slave. The interface module presents the request across the on-chip bus system. OCP does not specify the embedded bus functionality; instead, the interface designer converts the OCP request into an embedded bus transfer. The receiving bus interface module (as the OCP master) converts the embedded bus operation into a legal OCP command. The system target (OCP slave) receives the command and takes the requested action. Some of the key OCP features include the following [3]:
a. Point-to-Point Synchronous Interface: To simplify timing analysis, physical design, and general comprehension, OCP is composed of unidirectional signals driven with respect to, and sampled by, the rising edge of the OCP clock.
b. Operational Commands: There are two basic commands, Read and Write, and five command extensions: ReadExclusive, ReadLinked, WriteNonPost, WriteConditional, and Broadcast.
c. Configurable Dataflow Signals (Address, Data, Control): To increase transfer efficiency, OCP supports a configurable data width to allow multiple bytes to be transferred simultaneously. OCP supports both power-of-two and non-power-of-two word sizes (byte addressable, word aligned) and also supports transfers of less than a full word of data by providing byte enable information that specifies which octets are to be transferred.
d.
Configurable Sideband Signals (Interrupts and Errors): Different types of control signaling are required to coordinate data transfers (for instance, high-level flow control) or to signal system events (such as interrupts). Many devices also require the ability to notify the system of errors that may be unrelated to address/data transfers. OCP refers to all such communication as sideband (or out-of-band) signaling, since it is not directly related to the protocol state machines of the dataflow portion of OCP.
e. Pipelined Transfer Support: OCP allows pipelining of transfers, also known as split transactions, to improve bandwidth and latency characteristics. To support this feature, the return of read data and the provision of write data may be delayed after the presentation of the associated request.
f. Burst Transfer Support: To provide high transfer efficiency, burst support is essential for many IP cores. A burst is a set of transfers linked together into a transaction having a defined address sequence and number of transfers.
g. In-band Information: Using in-band signals, OCP can pass core-specific information along with the other information during transactions. In-band extensions exist for requests and responses, as well as for read and write data. A typical use of in-band extensions is to pass cacheable information or data parity.
h. Tagging or Out-of-Order Completion Support: To support out-of-order responses and to commit writes out of order, tagging is used, as long as the transactions' target addresses are different. Without tags, a slave must return responses in the order in which the requests were issued by the master. The tag links a response back to its original request.
i. Multithreading and Connections: Out-of-order request and response delivery can also be enabled using multiple threads. Concurrency is supported at the expense of having independent flow control for each thread, eliminating ordering restrictions for transactions on different threads.
The notion of a thread is a local concept between a master and a slave communicating over OCP. Thread information is passed from initiator to target using connection identifiers. Connection information helps to identify the initiator and to determine priorities or access permissions at the target.

2.3.3 OCP Signals and Encoding

OCP interfaces are synchronous, employing a single clock signal. Thus, all OCP signals are driven with respect to, and sampled by, the rising edge of the clock, except reset [3] [11]. OCP interface signals are divided into three categories: dataflow, sideband (error, interrupt, flag, control, and status), and test signals (scan, JTAG). With the exception of the clock, all OCP signals are unidirectional and point-to-point. The rising edge of the OCP clock signal is used to sample the other OCP signals and advance the state of the interface. The Clk and EnableClk signals are required inputs in both masters and slaves, and they are driven by a third entity (neither the master nor the slave). When the EnableClk signal is not present, the OCP clock is simply the Clk signal.

2.3.3.1 Dataflow Signals

Dataflow signals consist of a set of signals, some of which are used for data transfers, while others are configured to support any additional communication requirements between the master and slave components. Dataflow signals can be divided into the following categories:
a. Basic Signals: These include the clock, address, read and write data, transfer type, and handshaking/response signals between the master and the slave. Only the clock and transfer type (MCmd) signals are mandatory for an OCP interface; the remaining signals are optional. Table 2.1 lists the OCP basic signals. The widths of the address, read data, and write data are configurable, and are not limited to multiples of eight. The transfer type indicates the type of data transfer operation issued by a thread running on a master, and can be any one of the following:
i.
Idle: No operation is required to be performed.
ii. Read: Reads data from the addressed location in a slave.
iii. Write: Writes data to the addressed location in a slave.
iv. Broadcast: Writes data to the addressed location using the MData field, which may be mapped to more than one slave in a system-dependent way. Broadcast clears the reservations on any conflicting addresses set by other threads.
v. Exclusive Read: Reads from a location in a slave using the SData field and locks it, preventing other masters from writing to the location (exclusive access). The location is unlocked after a write to it from the original master that caused the lock to be set.
vi. Linked Read: Reads data from the addressed location in a slave using the SData field, and sets a reservation in a monitor for the corresponding thread, for the addressed location. Read or write requests from other masters to the reserved location are not blocked from proceeding, but may clear the reservation.
vii. Nonposted Write: Writes data to the addressed location in a slave using the MData field, unlocking the location if it was locked by an exclusive read, and clearing any reservations set by other threads.
viii. Conditional Write: Writes to the addressed location in a slave using the MData field only if a reservation is set for the corresponding thread; also clears all reservations on the location. If no reservation is present for the corresponding thread, no write is performed, no reservations are cleared, and a FAIL response is returned.

Table 2.1. OCP Basic Dataflow Signals

Name          Width         Driver  Function
Clk           1             varies  Clock input
EnableClk     1             varies  Enable OCP clock
MAddr         configurable  master  Transfer address
MCmd          3             master  Transfer command
MData         configurable  master  Write data
MDataValid    1             master  Write data valid
MRespAccept   1             master  Master accepts response
SCmdAccept    1             slave   Slave accepts transfer
SData         configurable  slave   Read data
SDataAccept   1             slave   Slave accepts write data
SResp         2             slave   Transfer response

b.
Simple Extensions: These include signals to indicate the address region (e.g., register or memory), byte enables for partial transfers, and core-specific configurable signals that send additional information with the transfer request, read data, write data, and the response from the slave. Configurable signals can transmit information about data byte parity, error correction code values, FIFO full or empty status, and cacheable storage attributes. Simple extension signals include the following:
- MAddrSpace: Specifies the address space; an extension of the MAddr basic signal used to indicate the address region of a transfer.
- MByteEn: Indicates which bytes of the OCP word are part of the current transfer.
- MDataByteEn: Indicates which bytes of the OCP word are part of the current write transfer.
- MDataInfo: Extra information sent with the write data.
c. Burst Extensions: These signals are used to support burst transfers, which allow the grouping of multiple transfers that have a defined address relationship. Bursts can either include addressing information for each successive command (which simplifies the requirements for address sequencing/burst count processing in the slave), or include addressing information only once for the entire burst. Burst extension signals include the following:
- MBurstLength: Indicates the number of transfers for a row of the burst; stays constant through the burst.
- MBurstPrecise: Indicates whether the precise length of the burst is known at the start of the burst.
- MBurstSeq: Indicates the sequence of addresses for requests in a burst (Incrementing, Custom or user-defined, Wrapped, Stream, Exclusive-OR, and Unknown).
- MBurstSingleReq: Indicates a burst with a single request but multiple data transfers.
- MDataLast: Indicates the last write data in a burst.
d.
Tag Extensions: These signals are used to assign tags (or IDs) to OCP transfers, to enable out-of-order responses and to indicate which transfers should be processed in order. Tag numbering begins at 0 and is sequential. The binary-encoded tag ID must carry a value less than the tags parameter. Tag extension signals include the following:
- MTagID: The request tag from the Master IP.
- MTagInOrder: When asserted, indicates that the current request cannot be reordered with respect to other requests.
- STagID: The response tag from the Slave IP.
- STagInOrder: When asserted, indicates that the current response cannot be reordered with respect to other responses.
e. Thread Extensions: These signals are used to assign IDs to threads in the master and slave, and allow a component to indicate which threads are busy and unable to accept any new requests or responses. Thread numbering begins at 0 and is sequential. The binary-encoded thread ID must carry a value less than the threads parameter. Thread extension signals include the following:
- MConnID: The connection identifier.
- MDataThreadID: The write data thread identifier.
- MThreadBusy: Indicates master thread busy.
- MThreadID: The request thread identifier.
- SThreadBusy: Indicates slave thread busy.
- SThreadID: The response thread identifier.

2.3.3.2 Sideband Signals

Sideband signals are optional OCP signals that are not part of the dataflow phases and can change independently of the request/response flow (though they are still synchronous to the rising edge of the clock). These signals are used to transmit control information such as interrupts, resets, errors, and other component-specific information such as core-specific flags.
They are also used to exchange status and control information between a component and the rest of the system using the Control, ControlBusy, ControlWr, Status, StatusBusy, and StatusRd signals. All sideband signals are optional except for reset (active low): either the MReset_n or the SReset_n signal must be present.

2.3.3.3 Test Signals

The OCP test signals are also optional, and support scan, clock control, and IEEE 1149.1 (JTAG). The scan interface signals include ScanCtrl, Scanin, and Scanout. Debug and test interface signals include TCK, TDI, TDO, TMS, and TRST_N.

2.3.4 OCP Signal Directions and Groups

Depending on whether a module instance acts as a master or a slave, the directions of the request/response/datahandshake signals are defined; control signal directions are defined depending on whether the module acts as a system or a core. The interface type of each module is defined by the OCP entity and the connected system. If a module acts as an OCP master and also as a system, it is designated a system master. Some OCP signals are grouped together according to which signals are active at the same time. Dataflow signals are classified into three groups: request, response, and datahandshake signals. The handshake and response signals are optional and can be configured depending on the IP core's communication requirements. Table 2.2 lists the OCP signal groups. Table 2.2.
OCP Signal Groups

Group          Signal           Condition
Request        MAddr            always
Request        MAddrSpace       always
Request        MBurstLength     always
Request        MBurstPrecise    always
Request        MBurstSeq        always
Request        MBurstSingleReq  always
Request        MByteEn          always
Request        MCmd             always
Request        MConnID          always
Request        MData*           datahandshake = 0
Request        MTagID           always
Request        MTagInOrder      always
Request        MThreadID        always
Response       SData            always
Response       SDataInfo        always
Response       SResp            always
Response       SRespInfo        always
Response       STagID           always
Response       STagInOrder      always
Response       SThreadID        always
DataHandshake  MData*           datahandshake = 1
DataHandshake  MDataByteEn      always
DataHandshake  MDataTagID       always
DataHandshake  MDataThreadID    always
DataHandshake  MDataValid       always

2.4 Prior Relevant Research Work

As part of my thesis study, I implemented a portion of OCP in the advanced VLSI course, which helped me learn a subset of OCP, the importance of OCP interfaces in SoC designs, and OCP's functionality. For the class project, I considered two IP cores (a CPU as Master and a memory system as Slave) with native OCP interfaces. The on-chip communication medium between the two IP cores was assumed to be a simple buffering unit with some latency in the path. Using a subset of the OCP signals (the dataflow signals), simple memory read and write transactions between the two IP cores were implemented. The modules were developed using VHDL. Figure 2.4 shows the block diagram used in this study.

2.5 Design Development, Implementation, and Testing

Building modern digital integrated circuits is a complex process and requires powerful Electronic Design Automation (EDA) and Computer Aided Design (CAD) tools for design development and implementation [13]. Figure 2.5 illustrates the steps required in this project's implementation [14]. The implementation flow follows the standard ASIC design steps, which include the specification, front-end, and back-end phases. The front-end phase comprises the development of schematics or register transfer level (RTL) Verilog code, and functionality testing of the synthesized netlist. This project develops RTL Verilog code in the front-end phase.
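As a hedged sketch of the kind of simple memory-transaction slave described in Section 2.4, the fragment below implements a small synchronous memory behind a subset of the OCP dataflow signals. The depth, widths, one-cycle latency, and the choice to acknowledge writes with a response are illustrative assumptions, not the thesis's actual design; only the MCmd (IDLE=0, WR=1, RD=2) and SResp (NULL=00, DVA=01) encodings follow the OCP convention.

```verilog
// Illustrative OCP slave sketch: a small synchronous memory that
// accepts Read/Write commands and returns a DVA response one cycle
// later. All sizing is arbitrary for the sketch.
module ocp_slave_mem_sketch (
    input             Clk,
    input      [2:0]  MCmd,
    input      [7:0]  MAddr,
    input      [31:0] MData,
    output            SCmdAccept,
    output reg [1:0]  SResp,
    output reg [31:0] SData
);
    localparam IDLE = 3'b000, WR = 3'b001, RD = 3'b010;
    localparam NULL_RESP = 2'b00, DVA = 2'b01;

    reg [31:0] mem [0:255];

    assign SCmdAccept = 1'b1;  // always ready in this simplified sketch

    always @(posedge Clk) begin
        SResp <= NULL_RESP;
        case (MCmd)
            WR: begin
                mem[MAddr] <= MData;
                SResp      <= DVA;  // note: OCP posted writes need no
                                    // response; acknowledged here only
                                    // to keep the sketch observable
            end
            RD: begin
                SData <= mem[MAddr];
                SResp <= DVA;
            end
            default: ;              // IDLE: no action, no response
        endcase
    end
endmodule
```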
The back-end phase comprises Automatic Place and Route (APR) of the physical design (the synthesized structural Verilog netlist) and functionality verification. Finally, the routed design is tested using a Verilog test bench in which different performance metrics are determined.

Figure 2.4. IP Cores with Native OCP Interfaces Implementation

Figure 2.5. Design Implementation Flow

2.5.1 RTL Development Using Verilog HDL

Verilog HDL is one of the most widely used hardware description languages (HDLs) in the development of modern ASICs. It is used to describe the design, verification, and implementation of digital circuits [15]. HDLs differ from other programming languages in describing signal propagation times and their dependencies (sensitivity). RTL employs a constrained HDL format using gate-level, transistor-level, and behavioral-level descriptions, and provides an efficient way of synthesizing any given design. Its synthesis productivity advantages have propelled HDL technology into a central role in digital design, compared to schematic methodologies, because of its ability to synthesize high-speed, low-power circuitry.
HDL also provides a simulation and debugging environment for validating a design's intended functionality. This project uses the ModelSim [16] and NC-Verilog [17] simulators for simulating RTL and synthesized netlists.

2.5.2 Logic Synthesis and Automatic Place and Route

Logic synthesis is the process of converting an RTL description of a design into an optimized gate-level representation (structural Verilog) using cells from a standard cell library. The generated structural Verilog is expected to provide the same functionality, with extracted delays assigned to each cell from the technology process library. Logic synthesis can generate a ddc file (binary database file for further processing), an sdf file (standard delay format file to back-annotate the simulations with extracted timings from cells), an sdc file (standard design constraint file for physical implementation of the circuit), and a pow file (reporting the required power for the design). This project uses IBM's 65nm process technology library and the Synopsys Design Compiler tool for synthesizing the design.

Place and Route is the process of converting a structural Verilog description into a physical circuit (layout with pins and metal routing). It involves floorplan synthesis depending on the aspect ratio and cell utilization, power grid and clock tree synthesis, placement of the standard cells, and routing the wiring connections between them. During placement of the standard cells, an abstract view of the cells is used, since it contains the physical information about the size and shape of the cells, their connection points (pins), and routing abstractions in those cells. This project uses Cadence SoC Encounter to automatically place and route the design. SoC Encounter requires the standard cell characterization file (Liberty format), cell abstract information (.lef file), structural Verilog, and sdc files from Synopsys Design Compiler.
Placing and routing the design involves floorplanning, power planning, placement of standard cells, pre- and postclock tree synthesis, and final routing. The final routed design is optimized, and its geometry and connectivity are verified. The Place and Route process generates a design exchange format (DEF) file (used to read the layout back into a layout editor), structural Verilog (the final placed and routed circuit), an sdf file (standard delay format file to back-annotate the simulations with extracted timings from cells and interconnect), a spef file (reporting the extracted parasitic information used in power analysis), and a pow report file (reporting the leakage power, internal power, and switching power of the design).

2.5.3 Design Testing Environment

This project uses an emulated pipelined RISC CPU as the Master IP core [18], a synchronous memory block (ROM and RAM) as the Slave IP core, and a NoC built from synchronous routers [19]. A Verilog test bench is used to drive the device under test (DUT) inputs and monitor the outputs. Figure 2.6 illustrates the design testing environment. The test bench generates the required traffic stimulus for testing the project design in both single- and multifrequency clocked domains. In the multifrequency clocked domain, the test bench generates three asynchronous clocks for the Master IP, Slave IP, and NoC, respectively. The test bench is customized to generate pipelined CPU traffic compatible with synchronous memory. Figure 2.7 shows the customized 32-bit CPU address format used in this project implementation. The 32-bit logical address generated by the Master IP consists of routing information, physical address, transaction type, operation mode, and burst data.

Figure 2.6.
Design Testing Environment Using Verilog Test Bench

Figure 2.7. Pipelined CPU 32-Bit Address Format (source routing information in bits 31:26; R/W, mode, and 7-bit burst information fields; 16-bit physical address in bits 15:0)

2.5.4 Design Metrics Computation

With technology advancement, power management is one of the major design challenges for deep submicron technologies, where timing, power, and signal integrity are interrelated. Power analysis is a crucial aspect of a design flow since it can affect packaging and cooling decisions, device battery life, and chip reliability. PrimeTime PX (PTPX) is an accurate power analysis tool that includes timing interdependencies for power dissipation and can be used at various stages in the design flow [20]. It also performs static timing analysis and signal integrity analysis.

PrimeTime computes the total power dissipated in a design from its dynamic and static power dissipation components. Dynamic power is dissipated during switching of the transistors and short circuits of the power rails. Static power dissipation, or leakage power, occurs when the devices are in steady state. PrimeTime power analysis can be done in a vector-free flow (independent of switching activity) or an RTL VCD flow (using switching activity from simulations). In this project, the more accurate RTL VCD flow methodology is chosen for power analysis. Figure 2.8 demonstrates the PrimeTime power analysis RTL VCD flow used in this project. The RTL VCD flow requires the following steps to be executed:
Figure 2.8. PrimeTime (PX) Power Analysis RTL VCD Flow

a. Create a VCD file from the ModelSim simulation of the final physically routed design.
b. Enable the power analysis setup and read in the technology library (tech.lib) file, the final routed structural Verilog file, the design constraints (sdc) file, and the parasitics (spef) file.
c. Read the VCD file into PrimeTime and specify the activity.
d. Convert the VCD file into a SAIF (switching activity interchange format) file and annotate it on nets and registers.
e. Propagate activity to nets/registers not annotated from the RTL VCD.
f. Calculate the power using PrimeTime timing data.
g. Report average power, timing constraints, and switching activity.

This project reports the energy required per transaction in the nonsplit and split modes of operation as one of the performance metrics. Energy per transaction is derived from the total average power and the total test time; energy is power integrated over time.

Total Energy = Average Power × Test Time    (2.1)

Energy/Transaction = Total Energy / No. of Transactions    (2.2)

For example, a design dissipating an average of 1 mW over a 100 µs test run uses 100 nJ in total; if the run contains 50 transactions, that is 2 nJ per transaction.

2.6 Validation of OCP Compliant IP Cores

The OCP-IP organization provides the CoreCreator II tool for its members to verify OCP compliant design implementations [21] [11]. Figure 2.9 shows the setup of the OCP CoreCreator tool. The CoreCreator tool automates the tasks of building, simulating, verifying, and packaging OCP compatible cores. It can be used with both traditional Verilog and VHDL test bench environments to create directed tests for OCP designs.
The OCP-IP organization provides Verification IP entities and SOLV (Sonics OCP Library for Verification) as debugging tools. The debugging tools include an OCP checker to ensure protocol compliance, a performance analyzer to measure system performance, and a disassembler, which helps to view the behavior of OCP traffic. This toolset is used in validating this project's design flow. One restriction of this work is that the tools are limited to testing and verifying clocked interface designs.

Figure 2.9. CoreCreator Tool for Validating OCP Compliant IP Cores

Once the different design interfaces are built, we can test the clocked architecture design implementation with the CoreCreator tool to verify OCP compliance. Using the CoreCreator tool, we can measure master/slave core metrics such as issue rate (throughput), maximum number of outstanding operations (pipelining support), and the effect of burst support on issue rate and unloaded latency for each operation (slave core only). Future work will translate the OCP compliant clocked design into elastic and asynchronous implementations. If this translation modifies the timing protocol but remains faithful to the OCP protocol behavior, then we assume these versions will retain their OCP compliance.

CHAPTER 3
DESIGN DESCRIPTION AND SPECIFICATIONS

3.1 Design Description

OCP is one of the viable core-centric solutions for addressing contemporary SoC design implementation requirements. Commercially existing native OCP interfaces (integrated into IP cores) are fully synchronous in nature, which limits an IP core's ability to interface with other clocking control methodologies.
In order to explore other clocking control methods, this research study designs and builds modular clocked wrapper OCP interfaces (OCP located outside the IP core) for existing IP cores. Figure 3.1 illustrates the project design structure. This research develops a new approach to increase modularity, improve reliability, and reduce the design time needed to interface different IPs to the OCP socket. It consists of splitting the design into common shared components and custom back-ends that are specific to each IP core. The common components consist of the OCP master and slave components and a domain interface (DI) module. The DI module is used to synchronize mutually asynchronous clocked domains and to control dataflow. These will be described in more detail in this chapter.

3.1.1 Design Structure

3.1.1.1 Modular Components Across the Design

This project will build clocked NoC interfaces which enable multifrequency clocked designs to utilize these interfaces for standard communication. Later, as an extension to this research work, asynchronous and elastic NoC interfaces will be derived from the clocked NoC interfaces. At a high level, this project designs and implements clocked OCP interfaces between IP cores and the on-chip network fabric for single- and multifrequency clocked domains.

Figure 3.1. Design Structure with Customized Back-ends, Front-ends and DI Module (B.E: back-ends, customized interfaces to IP cores and NoC; F.E: front-ends, OCP entities (Master and Slave); D.I: domain interface module)

Figure 3.1 illustrates the communication between two IP cores over an on-chip network fabric using customized back-end modules interfacing to the IP cores and NoC, common front-ends (OCP entities), and domain interface modules.
To provide efficient and modular NoC interfaces, this project will build customized components which can be reused. In Figure 3.2, the dotted regions 1 and 2 represent the same design. The only difference is that in the first dotted region the OCP master entity and DI module communicate with an IP core back-end interface, while in the second dotted region they communicate with a network fabric back-end interface. The key point to note is that a single design for the front-end modules and DI module will communicate with any IP core or network fabric back-end interface. This improves modularity and simplifies design validation. The design and implementation include the following:

a. Designing the customized back-ends to the IP cores and network fabric.
b. Designing the OCP master and slave entities (front-end interfaces) and the DI module.
c. Implementing the DI module with asynchronous FIFOs for GALS architectures.
d. Determining the proper placement of buffering and synchronizers across the design path (IP core-network fabric-IP core) to improve design performance.

Figure 3.2. Design Structure with Modular Components Across the Design Path

Also, depending on the IP core architectures, the interface between the OCP master and slave entities (front-end modules) can be varied based on design requirements. The design structure in Figure 3.2 enables the study of different clocking methodologies. For example, mutually asynchronous clocked IP cores can be interfaced with clocked on-chip network fabrics.
3.1.1.2 Synchronization and Buffering Placement

Communication between different architectural IP cores and the on-chip network fabric in a GALS SoC requires synchronizers and buffering to mitigate metastability and timing uncertainty [22]. Determining the proper placement of synchronizers and buffering to support multifrequency clocked domains is one of the major tasks in this study, since performance can be degraded or improved depending on their placement across the design path. Asynchronous dual-clocked pointer FIFOs are employed for synchronization and dataflow control. Synchronization and buffering schemes will not be employed across the OCP (master and slave) entities [23]. OCP is a point-to-point interface, and the data transfer between the two entities (master and slave) should use the same clock, since request-acknowledgments are done with mutual consent. This also simplifies validation of the OCP protocol. With this assumption, Figure 3.3 illustrates the possible locations where synchronizers and buffering can be employed for effective communication.

Figure 3.3. Synchronization and Buffering Placement Across the Design Path (dashed markers indicate possible locations for synchronizer and buffering placement)

3.1.2 Network Interface Structure

Figure 3.4 shows the structural view of the network interface, consisting of front-end and back-end submodules with a domain interface in between [24]. Typically, the network front-end modules can be implemented using a standard point-to-point protocol allowing IP core reusability across multiple SoC designs, and the back-end interfaces to IP cores are implemented using existing protocols such as AMBA AXI, IBM CoreConnect, and OpenCores Wishbone [10]. In this project, the back-end modules implement customized logic for the IP core and NoC to convert their native signals into the DI protocol.
The front-end modules are implemented using OCP. A generic DI will be designed which acts as a bridge between the back-end modules and the OCP entities, providing buffering and flow control across the design path.

Figure 3.4. Standard Network Interface Structure

3.1.2.1 Back-end and Front-end Interfaces

The back-end submodules are custom designs for each specific IP core and NoC that interface them to the domain interface protocol, which in turn communicates with the OCP entities (master or slave). All communication occurs using a compile-time configurable address, data, and control information format. Master IP core back-ends provide the functionality to generate source routing information from the IP core's logical address, which is used for sending and receiving packets to the correct destination over the on-chip network fabric. Network back-ends are customized to provide packetization and depacketization functionality. The Slave IP core end of the network back-end stores the received source routing bits, tagging, and transaction information when the OCP protocol requires a response from the slave back to the requesting Master IP core. This information is used to compose the response network packet.

The NoC front-end submodules are implemented using OCP and can act as either an OCP master or slave entity. The front-end modules are interfaced together to form the OCP point-to-point interface. The OCP entities are modular in nature to provide reusability across the design.

3.1.2.2 Domain Interface

The domain interface module is designed to synchronize transactions across different clock regimes and to control dataflow using buffering. It also pipelines the incoming data from the IP cores and maps the existing IP core signals and values to OCP compatible signals and values using registers (FIFOs).
The same DI will be reused across the path (IP core-NoC-IP core), and its buffering capacity is defined by the IP core communication requirements and the NoC bandwidth. The tradeoff of employing an intermediate DI protocol is that it may add some latency as part of the conversion between IP core signals and OCP compatible signals. This conversion usually does not add significant latency, but it does increase design modularity. In the multifrequency clocked domain case, the DI module adds significant latency in the request and response paths because it becomes the point of synchronization between clock domains. Synchronization also increases power dissipation and area, and reduces maximum bandwidth, due to the employment of FIFOs.

3.1.3 Synchronization and Buffering

Interfacing different architectural IP cores can introduce metastability and timing uncertainty during transactions between two mutually asynchronous clock domains [22]. Synchronization and buffering are employed to provide safe synchronization between the IP cores and the NoC. Buffering is not only used for supporting multifrequency clocked domains but also increases throughput and reduces overhead at peak times (traffic congestion). Determining the proper placement of synchronizers and buffering is one of the research goals in this project implementation, since performance can be degraded or improved depending on the placement across the design path (IP core-NoC-IP core).

3.1.3.1 Asynchronous Clocked Pointer FIFOs

Asynchronous FIFOs are employed to safely pass data between mutually asynchronous clocked domains. Data are written to a buffer (FIFO memory) from one clock domain and read out in another clock domain, where the two clock domains are asynchronous to each other [23] [25]. Figure 3.5 illustrates an asynchronous pointer FIFO design [23].
This design consists of FIFO memory, gray code pointers, two-flop synchronizer modules, and modules to generate stall signals into the two clock domains (the write full and read empty signals). The FIFO memory (buffer) is accessed by both clock domains; it can be instantiated using a synchronous dual-port RAM or created using a 2-D register array. Gray code pointers are used to generate the write pointer, the read pointer, and the write full and read empty signals depending on the access of the FIFO memory. The two-flop synchronizer modules synchronize the write and read clock domains by synchronizing the read pointer into the write clock domain (to generate the write full signal) and the write pointer into the read clock domain (to generate the read empty signal).

Figure 3.5. Asynchronous Dual Clocked Pointer FIFOs

Operation: The write pointer always points to the next word to be written in the FIFO memory. After a FIFO write operation, the write pointer is incremented to the next location to be written. Similarly, the read pointer always points to the current FIFO address to be read, and after the read operation, it is updated to point to the next location. On reset, both pointers are set to zero, which asserts the read empty signal high, as the read pointer points to invalid data (since the FIFO memory is empty and the empty flag is asserted). When the first data word is written to the FIFO memory, the write pointer increments to the next location and the empty flag is deasserted. The read pointer, which is still pointing to the first valid data location, reads out the data onto the output port (RDATA). If the receiver clock is fast and does not require two write clock cycles for synchronization and reading out data, both pointers may end up pointing to the same address, indicating that the FIFO is empty. A FIFO empty condition occurs on a reset operation and when the read pointer catches up to the write pointer.
A FIFO full condition (the write full signal is asserted) occurs when both the read and write pointers address the same location, but in this case the write pointer has wrapped around (being faster) and caught up to the read pointer. This design adds an extra bit to each pointer to distinguish between the full and empty conditions. When the write pointer reaches the maximum FIFO address, it increments the otherwise unused most significant bit (MSB) while resetting the rest of the bits to zero, indicating that the FIFO has wrapped once. The same occurs for the read pointer. The MSBs of the two pointers then determine whether the FIFO is full or empty.

3.1.4 Network-on-Chip Using Synchronous Routers

In this research, a NoC is employed as the communication medium between IP cores. A 3-port synchronous router designed by fellow research students is used as the NoC. Figure 3.6 illustrates the 3-port clocked elastic router architecture [19]. Each router consists of three switch and three merge modules. Each switch and merge module has buffering capability using one set of latches at the input and output ports. The switch module guides the incoming data to one of the outgoing ports, and the merge module arbitrates between two input requests to an output port. Routers used in this design employ simple source routing, single-flit packets, and low latency paths [19] [26]. Each packet consists of a header containing the source routing information and the data field. Packets are switched from the input port to one of the output ports through a simple demultiplexer using the most significant routing bit. The routing bits are rotated at each hop so that the next routing bit becomes the MSB and controls the switching for the output packet. In this design, routers are configured to support 6 source routing bits and 66 bits of data.
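The extra-pointer-bit full/empty scheme of Section 3.1.3.1 can be sketched in software. The following Python model is an illustrative sketch, not the thesis's Verilog: the depth, the function names, and the binary-pointer comparisons are assumptions for illustration, and the gray-code conversion indicates how the pointers would be made safe to synchronize across clock domains.

```python
# Illustrative model of a dual-clock pointer FIFO's full/empty logic.
# Pointers carry one extra MSB to record whether they have wrapped,
# which is the trick described above for telling full apart from empty.

ADDR_BITS = 3                            # 8-entry FIFO (example depth)
DEPTH = 1 << ADDR_BITS
PTR_MASK = (1 << (ADDR_BITS + 1)) - 1    # pointer width = ADDR_BITS + 1

def bin_to_gray(b):
    """Gray-coded pointers change one bit per increment, so a pointer
    sampled mid-transition by the other clock domain is still either
    the old or the new value, never a corrupt in-between value."""
    return b ^ (b >> 1)

def fifo_empty(wptr, rptr):
    # Empty: pointers identical, including the wrap (MSB) bit.
    return wptr == rptr

def fifo_full(wptr, rptr):
    # Full: same memory address but opposite wrap bits; the write
    # pointer has lapped the read pointer exactly once.
    return (wptr & PTR_MASK) == ((rptr ^ (1 << ADDR_BITS)) & PTR_MASK)

# After reset both pointers are zero, so the FIFO reports empty.
wptr = rptr = 0
assert fifo_empty(wptr, rptr)

# Writing DEPTH words without any reads flips the write pointer's
# wrap bit while the addresses match again: full, and not empty.
wptr = (wptr + DEPTH) & PTR_MASK
assert fifo_full(wptr, rptr) and not fifo_empty(wptr, rptr)
```

In the hardware design, the comparisons are performed on the gray-coded, two-flop-synchronized copies of the pointers; the binary comparison here captures the same full and empty conditions.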
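The source-routing behavior of these routers (switch on the most significant routing bit, then rotate) can also be sketched. This Python model is illustrative: the `route_hop` helper and the meaning assigned to each port-select bit are assumptions, while the 6 routing bits match the configuration stated above.

```python
NUM_ROUTE_BITS = 6   # as configured in this design

def route_hop(srb):
    """One router hop: the switch demultiplexes on the MSB of the
    source routing bits, then the bits are rotated left so the next
    router again sees its decision bit in the MSB position."""
    msb = (srb >> (NUM_ROUTE_BITS - 1)) & 1               # port select
    rotated = ((srb << 1) | msb) & ((1 << NUM_ROUTE_BITS) - 1)
    return msb, rotated

# Example: routing bits 0b101000 select output ports 1, 0, 1 over
# three hops (which physical port each bit value selects is an
# assumption for illustration; it depends on the router wiring).
srb = 0b101000
path = []
for _ in range(3):
    port, srb = route_hop(srb)
    path.append(port)
# path == [1, 0, 1]
```

Because the bits are rotated rather than shifted out, the full routing word survives the traversal, which is consistent with the same source routing information being reused for the response path.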
3.1.4.1 Data Packet Switching Technique

Employing on-chip networks in an SoC is motivated by their support for concurrent data transfers from the same resource, data restoration, and a platform (modular components) for building reusable IPs [27]. The choice of packet switching technique for the on-chip network fabric is important for good performance. Different switching techniques are available, such as store-and-forward, virtual-cut-through, and wormhole [24]. Each switching technique has different performance properties depending on the requirements and hardware resources. Store-and-forward is used in this design to route packets from one router to the next. Store-and-forward routing analyzes packets passing through and can therefore perform content-aware packet routing. Normally, in this technique, big packets introduce extra packet delay at every router stage and also require a significant amount of buffer space to store multiple packets at the same time. This design employs single-flit packets, which require minimal buffering space and do not introduce significant packet delay at each router.

Figure 3.6. 3-Port Synchronous Router Design

3.1.4.2 Data Packetization and Depacketization

Packet preparation is one of the key stages of a network interface architecture, since the latency associated with it can significantly impact overall communication latency [27] [24]. In this project, the network back-end modules act as wrapper logic for packetizing and depacketizing request and response data at the NoC boundaries. At the network fabric sender end, the back-end receives the contents from the DI module, prepares the packets, and dispatches them onto the NoC. At the network fabric receiver end, it receives the packets from the networking logic and presents the content to the DI module. Typically, packets transported on NoCs consist of a header, control information, payload, and tail [24].
The header contains the source and destination addresses. Control information contains the transaction type, tagging, and burst data. The tail contains error checking and correction code. In this design, the packet format is customized to contain a header, control information, and payload. The header consists of source routing bits, which are used for traversing request packets from the source to the destination address and later for back-traversing response packets from the destination to the source address. At the NoC back-ends, data packetization involves constructing the request/response packets containing the source routing bits, control information, and payload from the signals received from the DI module. Depacketizing the data at the network back-ends involves converting the received request/response packet data into DI compatible signals. In this project design, 72-bit packets are used.

3.1.4.3 Request Packet Format

The request packet is a 72-bit packet comprised of source routing bits (6 bits), control information (18 bits), and payload (48 bits), as shown in Figure 3.7. The source routing bits are used to guide request transaction packets to the correct destination across the on-chip network fabric; later, during the response phase, the same source routing information is used for traversing back to the source IP core. The control bits include information about the transaction type, mode of operation, request tag, burst data, and write data byte enables. The payload consists of a 16-bit address and 32 bits of write transfer data.

3.1.4.4 Response Packet Format

To keep the design modular, the response packet is constructed similarly to the request packet, at the cost of not utilizing all the available bits. Figure 3.8 shows the response packet format. Response packets also contain source routing bits, control bits, and payload, but the difference is in the control bits and the usage of the payload bits.
Only 5 of the 18 control bits are used, for the response type (2 bits) and response tag information (3 bits). The payload is only partially used, carrying 32 bits of read data; the remaining bits are zero filled.

Figure 3.7. Request Packet Format (6 source routing bits; 18 control bits carrying R/W, mode, MTagID, burst information, DByteEn, and DValid; 48-bit payload of a 16-bit address and 32 bits of write data)

Figure 3.8. Response Packet Format (6 source routing bits; 18 control bits of which only Resp and STagID are used; 48-bit payload carrying 32 bits of read data, remainder null)

3.1.4.5 Data Flit Implementation

A typical protocol layer architecture consists of the physical layer, link layer, routing layer, transport layer, and protocol layer [28][27]. The physical layer consists of the wiring and the transmitter and receiver hardware; data are transmitted in the form of phits (physical data units). The link layer is responsible for sending and receiving data from the physical layer in the form of flits (flow control units). The routing layer is responsible for generating the header message containing the source and destination addresses. The transport layer is optional and is not used in point-to-point connections. The protocol layer is responsible for sending and receiving packets on behalf of the device. In this project design, single 72-bit data flits are sent and received across the NoC. Figure 3.9 shows the single flit data format used in this project implementation.

Figure 3.9.
Single Flit Data Format

3.2 Design Specifications and Supporting Features

3.2.1 Proposed OCP Subset

In this project, an OCP subset is defined based on the OCP 2.2 specification to support the communication requirements of different IP cores. Most of the OCP signals are configurable and can be extended as requirements grow. Based on the defined OCP subset, the back-ends and DI modules are customized to handle the transaction data. The OCP subset comprises basic dataflow signals, simple extension signals, burst extension signals, and tag extension signals. The reset signal is also supported in this implementation. The following subset of OCP is included in this implementation:

a. Basic Signals (only MCmd and Clk are required; the rest are optional):
   Clk - Clock signal
   MCmd - Transfer command (IDLE, READ, and WRITE)
   SCmdAccept - Slave accepts transfer
   MAddr - Transfer address
   MData - Write data
   SDataAccept - Slave accepts write data
   MDataValid - Write data valid
   MRespAccept - Master accepts response
   SResp - Transfer response
   SData - Read data
b. Simple Extensions:
   MByteEn - Byte enables
   MDataByteEn - Write data byte enables
c. Burst Extensions:
   MBurstLength - Burst length of the current transaction
   MBurstPrecise - Whether the current burst is precise or not
   MBurstSeq - Address sequence of the current burst (user defined, INCR, WRAP)
d. Tag Extensions:
   MTagID - Request tag ID from the Master IP
   MTagInOrder - The current request cannot be reordered when this signal is asserted high
   STagID - Response tag ID from the Slave IP
   STagInOrder - The current response cannot be reordered when this signal is asserted high
e. Sideband Signals:
   Reset - Asynchronous reset signal

The OCP subset defined above is implemented to support memory read, write, and idle operations, including burst transactions with out-of-order responses, for single- and multifrequency clocked domains in both nonsplit and split modes. OCP supports configurable widths for address, data, bursting, and tagging bits.
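To illustrate the basic dataflow signals above, the following Python sketch models a single nonsplit read: the master drives MCmd and MAddr and must hold them until the slave asserts SCmdAccept, after which the slave returns SResp with SData. The IDLE/WR/RD and NULL/DVA encodings follow the usual OCP convention, but the `ocp_read` helper, its memory model, and the cycle counting are illustrative assumptions, not the thesis's RTL.

```python
# Minimal model of one nonsplit OCP read transaction (illustrative).
IDLE, WR, RD = 0, 1, 2   # MCmd encodings (OCP convention)
NULL, DVA = 0, 1         # SResp encodings: no response / data valid

def ocp_read(mem, maddr, accept_wait=1):
    """Request phase: the master holds MCmd=RD and MAddr stable until
    SCmdAccept (modeled as a cycle count).  Response phase: the slave
    returns SResp=DVA with SData.  Returns (cycles, sresp, sdata)."""
    cycles = 0
    scmdaccept = False
    while not scmdaccept:                 # request held until accepted
        cycles += 1
        scmdaccept = cycles > accept_wait
    sresp, sdata = DVA, mem.get(maddr, 0)
    cycles += 1                           # response occupies a cycle
    return cycles, sresp, sdata

memory = {0x10: 0xCAFE}
cycles, sresp, sdata = ocp_read(memory, 0x10, accept_wait=2)
# sresp == DVA and sdata == 0xCAFE; in nonsplit mode the master may
# not issue another MCmd until this response is taken (MRespAccept).
```

In split mode, by contrast, the master would be free to issue further requests while this response is still outstanding, which is where the tag extensions become necessary.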
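The MBurstSeq options in the subset (INCR and WRAP) can be illustrated with a small address generator. This is a hedged sketch: the word size, the helper name, and the naturally aligned wrap window are assumptions based on the common definitions of incrementing and wrapping bursts, not code from this design.

```python
def burst_addresses(start, burst_length, seq, word_bytes=4):
    """Generate the address of each transfer in a burst.
    INCR: consecutive word addresses.
    WRAP: consecutive word addresses that wrap within the naturally
    aligned window of burst_length words containing `start`."""
    if seq == "INCR":
        return [start + i * word_bytes for i in range(burst_length)]
    if seq == "WRAP":
        window = burst_length * word_bytes        # wrap boundary size
        base = start - (start % window)           # aligned window base
        return [base + (start - base + i * word_bytes) % window
                for i in range(burst_length)]
    raise ValueError("user-defined sequences supply their own addresses")

# A 4-beat WRAP burst starting at 0x108 stays inside its aligned
# 16-byte window instead of crossing it:
seq = burst_addresses(0x108, 4, "WRAP")
# seq == [0x108, 0x10C, 0x100, 0x104]
```

Because the address relation is predefined, only the starting address needs to be transferred; the slave can regenerate every beat's address, which is why bursts improve throughput as described in Section 3.2.2.2.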
Table 3.1 lists the OCP subset signals and their chosen widths, which depend on the IP cores' communication requirements. This project implements a subset of the OCP protocol; the complete set of OCP protocol signals is listed in the appendix.

3.2.2 Supporting Features

This project implementation supports the following features.

3.2.2.1 Modes of Operation

In the nonsplit (NSP) mode of operation, a request is initiated and waits for a response back from the Slave IP before another request is sent out. Pipelining of request transactions from the same IP core, or simultaneous requests from different IPs, is not possible in this mode of operation. In split (SP) mode, pipelining of request transactions from the same IP and simultaneous initiation of requests are possible. Due to the pipelining of requests in split mode, performance and throughput are increased and the maximum NoC bandwidth can be utilized. Split mode reduces overall latency, increases throughput, and lowers energy per transaction compared to nonsplit mode.

Table 3.1.
OCP Proposed Subset

Group                   Name           Width  Driver  Function
Basic Dataflow Signals  Clk            1      varies  Clock input
                        MCmd           3      master  Transfer command
                        SCmdAccept     1      slave   Slave accepts transfer
                        MAddr          32     master  Transfer address
                        MData          32     master  Write data
                        MDataValid     1      master  Write data valid
                        SDataAccept    1      slave   Slave accepts write data
                        SResp          2      slave   Transfer response
                        SData          32     slave   Read data
                        MRespAccept    1      master  Master accepts response
Simple Extensions       MByteEn        4      master  Byte enables
                        MDataByteEn    4      master  Write data byte enables
Burst Extensions        MBurstLength   8      master  Burst length
                        MBurstPrecise  1      master  Whether the burst transaction is precise
                        MBurstSeq      3      master  Type of address sequence
Tag Extensions          MTagID         8      master  Request tag from Master IP
                        MTagInOrder    1      master  In-order request transaction
                        STagID         8      slave   Response tag from Slave IP
                        STagInOrder    1      slave   In-order response transaction

3.2.2.2 Burst Transactions

A burst transaction is a set of linked transfers with a predefined address relation and number of transfers. Compared to single data phase transactions, burst transactions improve data throughput, since the address is transferred only during the initial bus grant, followed by chunks of data. Burst transactions are supported not only to increase throughput but also to reduce latency and the data activity factor across the network router links. Different burst implementations are possible depending on the burst length (size of the burst) and burst sequence (relation between the addresses). This design supports a maximum burst size of 8 (8 × word size), i.e., 256 bits of data. This implementation supports three types of address sequences (user defined, incremental, and wrapped), and data can be transferred during each clock cycle.

3.2.2.3 Tagging or Out-of-Order Response

Tags are used to support out-of-order responses by directly linking a Slave IP core response to the original request which triggered it.
By supporting out-of-order responses, the use of tags can in most cases improve overall system performance, since responses are not halted by dependencies on previous transactions. In this project implementation, a tag size of 8 is employed.

CHAPTER 4

DESIGN IMPLEMENTATION

4.1 Clocked OCP Design Implementation

The project design implementation includes building standard OCP interfaces to IP cores and the NoC in single- and multifrequency clocked domains. Figure 4.1 shows the clocked OCP design implementation with a signal-level description from Master to Slave IP cores. Both designs include customized back-end interfaces to the IP cores and NoC, front-ends (OCP entities), and a DI module. A standard ASIC design flow is used to develop the designs from RTL code to physical placement and routing. The developed modules are functionally tested at each phase of the implementation. An emulated RISC pipelined CPU and a synchronous memory (RAM/ROM) are employed as the Master IP (system initiator) and Slave IP (system target), respectively, and customized synchronous routers are used as the NoC for testing the functionality. Data transfer transactions range from simple memory read, write, and idle operations (nonsplit transactions), through pipelined requests (split transactions) with read/write bursts, to complex out-of-order operations (tagging). The following steps are executed in building OCP interfaces between the IP cores and the NoC in both single- and multifrequency clocked domains.

a. Single-Frequency Clocked Domain (SFCD) Implementation:
   i. Customized back-end interfaces to IP cores and NoC
   ii. A generic DI module
   iii. OCP Master and Slave entities (front-ends)
   iv. Integrating the built modules with the on-chip network fabric

b.
Multifrequency Clocked Domain (MFCD) Implementation: This implementation utilizes the back-end and front-end modules developed for the single-frequency clocked domain. The DI module design and implementation with FIFOs is pivotal in this implementation.

Figure 4.1. Clocked OCP Design Implementation with IP Cores and NoC

   i. Customize and build asynchronous dual clocked pointer FIFOs
   ii. A generic DI module with asynchronous FIFOs
   iii. Determining proper placement of buffering and synchronizing schemes at the IP core and network fabric boundaries:
        - Interface between back-end modules and the DI module
        - Interface between front-end modules (OCP entities) and the DI module
   iv.
Integrating the built modules with the on-chip network fabric

4.1.1 Single-Frequency Clocked Domain Implementation

Figure 4.2 illustrates the single-frequency clocked domain (SFCD) design with NoC interfaces over a single IP core-NoC-IP core path. The following steps list the dataflow transaction steps across the design in a single-frequency clocked domain.

a. A global clock and an asynchronous reset are supplied to the IP cores, the on-chip network fabric, and the NoC interfaces.
b. Communication between all modules is triggered on the positive edge of the clock.
c. On assertion of the asynchronous reset signal high, all current transaction data are cleared.
d. During the request and response phases, transaction data traverse and are guided from one module to the next in a pipelined-stage fashion, as described below.
   - The Master IP core's required communication signals are mapped to the customized master back-end signals.
   - The master back-end transfers the customized signals onto the DI stage, where they are pipelined at the OCP master entity.
   - The OCP master entity maps the customized signals into OCP-protocol-compatible signals and transfers them to the OCP slave entity.
   - The OCP slave entity remaps the OCP protocol signals into DI module signals, which are pipelined at the on-chip network fabric back-end.

Figure 4.2. Single-Frequency Clocked Domain (SFCD) Implementation with Modular Components Across Design. (B.E: back-ends, customized interfaces to IP core and NoC; F.E: front-ends, OCP entities (Master and Slave); D.I: domain interface module.)

   - The network back-end interface packetizes the data into the required packet structure format and presents it to the on-chip network fabric.
   - Data flits traverse the routers using source routing information and are presented at the other network fabric back-end.
   - The other network fabric back-end interface depacketizes the data, and the above process continues until the data are mapped to the required Slave IP core's existing signals.

e. A stall signal generated by any successive module traverses back to the source IP generating requests.
f. In nonsplit operation mode, the IP core master back-end module blocks the channel until the reception of a response for the nonsplit transaction.
g. In split mode, simultaneous or pipelined requests are possible, and the inclusion of tagging supports out-of-order responses from the Slave IP.
h. Burst read/write transactions are supported, with burst information (burst length and address sequence) passed from the Master IP to the Slave IP.

4.1.1.1 IP Cores Back-end Modules

Figure 4.2 shows the customized back-end interfaces to the IP cores (master back-end and slave back-end). The IP core back-ends are customized to interface with standard point-to-point protocols like OCP. The IP core back-end modules are primarily responsible for mapping legacy IP core signals to a defined set of DI module signals and vice versa. The signals mapped between the back-ends and the DI module include request, response, tagging information, and stall signals. The master back-end determines the data validity of each initiated request by validating the request transaction control signals from the IP core and the stall signals from the DI module. The master back-end is also responsible for generating source routing bits from the received logical address using customized memory-mapped registers and inserting them as part of the address that traverses to the NoC back-end. In this implementation, the master back-end is distinguished from the slave back-end by its handling of the operation modes (nonsplit and split) depending on the commands received from the Master IP.
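The source-routing lookup performed by the master back-end can be sketched as a range check against memory-mapped entries. The address ranges below follow the 8KB ROM and 8KB RAM used as the Slave IP in this project, but the route encodings and the table itself are illustrative assumptions, not the thesis's actual register contents:

```python
# Illustrative address map: each entry maps a logical address range
# to source-route bits for the 3-port routers. Route values are
# assumptions for the sketch, not the design's actual encoding.
ROUTE_TABLE = [
    (0x0000, 0x1FFF, 0b01),  # 8KB ROM region -> route bits
    (0x2000, 0x3FFF, 0b10),  # 8KB RAM region -> route bits
]

def lookup_route(addr):
    """Return the source-routing bits for a logical address."""
    for lo, hi, route in ROUTE_TABLE:
        if lo <= addr <= hi:
            return route
    raise ValueError("address outside mapped regions")
```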
In nonsplit mode, it does not allow the IP core to initiate another transaction or pipeline further requests until the response to the previous nonsplit transaction is received. The slave back-end enables the control signals (chip enable, read/write enable) of the Slave IP core (a synchronous memory) for the current transaction, depending on the valid requests received from the DI module. It is also responsible for communicating burst and tagging information and for synchronizing the response data from the Slave IP to the DI module.

4.1.1.2 Domain Interface (DI) Module

The domain interface module serves as a bridge between the IP core/NoC back-ends and the front-ends (OCP entities), following the standard network interface structure. The DI module handles the request and response data transfer flow between back-ends and front-ends. In this implementation, asynchronous FIFOs for synchronization and buffering are not required, since a global clock is supplied to all modules. The synthesized and physically routed DI module introduces combinational logic delay on the output ports.

4.1.1.3 On-Chip Network Fabric Back-end Modules

Each network fabric back-end module is responsible for data packetization and depacketization, using a single-flit implementation methodology onto the NoC with a store-and-forward packet switching technique. Figure 4.3 shows the on-chip network back-ends interfacing with the NoC. Packetizing the data involves assembling the source routing bits, control information, and payload into a single packet (flit) in the required transaction format. Depacketizing the data involves retrieving the required control and data signals of the DI module. Each data flit contains 72 bits of information and is transmitted to and received from the NoC with a DataValid (DValid) signal asserted high during the transaction, indicating valid data on the channel.
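Packetization into a single 72-bit flit can be sketched as bit-field packing. The field layout below (route, tag, control, 32-bit payload) is an illustrative assumption chosen to total 72 bits; the thesis's exact flit format is not reproduced here:

```python
# Assumed 72-bit flit layout (widths are illustrative):
#   [route:8 | tag:8 | ctrl:24 | payload:32]
FLIT_BITS = 72

def pack_flit(route, tag, ctrl, payload):
    # Assemble routing bits, control information, and payload
    # into a single integer-valued flit.
    assert route < (1 << 8) and tag < (1 << 8)
    assert ctrl < (1 << 24) and payload < (1 << 32)
    return (route << 64) | (tag << 56) | (ctrl << 32) | payload

def unpack_flit(flit):
    # Depacketize: recover the individual fields from the flit.
    assert flit < (1 << FLIT_BITS)
    return ((flit >> 64) & 0xFF, (flit >> 56) & 0xFF,
            (flit >> 32) & 0xFFFFFF, flit & 0xFFFFFFFF)
```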
For transactions where the stall signal is asserted high or the DataValid signal is asserted low during flit transmission/reception, the current data flits are treated as invalid and ignored at the NoC and network back-ends. The network fabric back-end at the Slave IP end is customized to store the source routing information in memory arrays so that response packets can be sent back to the correct destination address. During request transactions, the network back-end stores the required routing, tag bits, transaction type, and burst information bits in two-dimensional register arrays. On reception of a response from the Slave IP, the tag bits are compared with the stored information and the correct destination routing bits are retrieved. Once the destination routing bits are retrieved for the current response transaction, the allocated memory is cleared to store the routing bits of the next request transaction. For burst request transactions, the allocated memory is cleared only after the required number of burst responses has been received from the Slave IP. In this design, two-dimensional memory arrays, each with a capacity of 8 words, are used to maintain the traffic.

Figure 4.3. On-Chip Network Back-ends Interfacing NoC

4.1.1.4 OCP Master and Slave Entity Modules

OCP is a point-to-point synchronous interface with complementary master and slave entities communicating in a peer-to-peer fashion [11]. Each OCP entity can only communicate with its complementary side of OCP. The OCP master and slave entities are modularized for reusability across the design.
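The per-tag routing storage in the network back-end (Section 4.1.1.3) can be sketched as a small store keyed by tag bits. The class and method names are illustrative; only the 8-entry capacity and the clear-after-last-burst-response behavior follow the description above:

```python
class TagRouteStore:
    """Sketch of the Slave-side network back-end bookkeeping:
    routing and burst information for each request is held under
    its tag until the response (or last burst beat) returns it."""
    CAPACITY = 8  # this design uses 8-word memory arrays

    def __init__(self):
        self.entries = {}  # tag -> (route_bits, beats_remaining)

    def store_request(self, tag, route_bits, burst_length=1):
        assert len(self.entries) < self.CAPACITY
        self.entries[tag] = (route_bits, burst_length)

    def on_response(self, tag):
        # Retrieve the return route by tag; clear the slot only
        # after the final burst response, as described above.
        route, remaining = self.entries[tag]
        if remaining == 1:
            del self.entries[tag]
        else:
            self.entries[tag] = (route, remaining - 1)
        return route
```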
Depending on the requirements of a system, an IP module can have only an OCP Master, only an OCP Slave, or both for communicating with the external world. Figure 4.4 illustrates the transaction phases between the OCP entities. During the request phase, each transfer across the complementary entities starts with a request initiated by the OCP master, which presents a transfer command (MCmd), address, tag bits (MTagID), and other control information to the OCP slave. Depending on the availability of the Slave IP or bus to accept the request, the OCP slave entity acknowledges the OCP master with a slave-accepts-transfer signal (SCmdAccept). If the slave is ready to accept the request, it asserts SCmdAccept high, and the OCP master sends out the current transaction data in the datahandshake phase on the next cycle. After receiving the current transaction data, the OCP slave acknowledges the OCP master by asserting the SDataAccept signal high and maps the request data into DI-module-compatible signals. If the slave is busy, the OCP slave deasserts SCmdAccept low, which makes the OCP master assert and hold the stall signal back to the DI module high until the OCP slave is ready to accept the next request. During the response phase, the OCP slave presents the response data from the Slave IP/NoC for the requested transaction with a response signal (SResp), response tag bits (STagID), and a read data (SData) signal. On reception of a response, the OCP master acknowledges the OCP slave by asserting the MRespAccept signal high. This process continues. The datahandshake and response phases are optional between OCP entities.

Figure 4.4. Transaction Phases between OCP Master and OCP Slave Entities
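The request-to-datahandshake sequencing described above can be modeled cycle by cycle. This is a behavioral sketch of the handshake, not the signal-level RTL; the function name and trace representation are assumptions for illustration:

```python
def ocp_write_transfer(scmd_accept_trace):
    """Model the request -> datahandshake handshake for one write.

    scmd_accept_trace: per-cycle SCmdAccept values seen by the
    master while it holds MCmd/MAddr. Returns the cycle at which
    write data is sent with MDataValid (the datahandshake phase),
    or None if the slave never accepts and the master stalls.
    """
    for cycle, scmd_accept in enumerate(scmd_accept_trace):
        # Master holds the request until SCmdAccept goes high.
        if scmd_accept:
            # Data is presented on the next cycle.
            return cycle + 1
    return None  # slave busy: stall propagates back to the DI module
```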
4.1.2 Multifrequency Clocked Domain Implementation

Figure 4.5 illustrates the design of the multifrequency clocked domain (MFCD). The back-ends and front-ends from the single-frequency clocked domain are reused in the multifrequency clocked domain. In a multifrequency clocked domain, the IP cores and NoC can operate at different clock frequencies, resulting in mutually asynchronous clock domains. To avoid metastability and timing uncertainty when interfacing two mutually asynchronous clock domains, synchronization and buffering are implemented in the DI module. The DI module employs asynchronous dual clocked pointer FIFOs in the request and response paths to provide synchronization, buffering, and dataflow control. In this domain, the same dataflow transaction steps as in the single-frequency clocked domain are implemented, with the exception of the added synchronization and buffering in the DI module. FIFO placement across the design is one of the critical steps in this project implementation, since performance can be degraded or improved depending on the placement along the design path (IP core-NoC-IP core). Synchronization can be employed in any of the DI modules when crossing from one clock domain to another. Note that it is preferable to keep the IP core back-ends and front-ends (OCP entities) in the same clock domain, since this reduces the overhead of synchronizers and buffering. In Figure 4.5, the Master IP, Slave IP, and NoC are in mutually asynchronous clock domains.
In this scenario, it is advisable to employ synchronization and buffering in the DI modules closest to the on-chip network fabric back-ends. Employing asynchronous FIFOs only in the DI modules closest to the network fabric back-ends not only segregates the NoC from the IP cores but also provides an efficient way to reduce the overhead of synchronization and buffering on the IP core back-ends and front-ends.

Figure 4.5. Multifrequency Clocked Domain (MFCD) Implementation Using Asynchronous Clocked FIFOs. (In the figure, the front-ends (OCP entities) operate on the IP core clock frequencies, the NoC and network back-ends operate on the NoC clock, and the DI modules with FIFOs (DI-2 and DI-3) have dual clocks for writing and reading transactions simultaneously in the request and response paths.)

4.1.2.1 DI Module with FIFOs

Figure 4.6 shows a DI module with integrated asynchronous FIFOs. In Figure 4.5, DI modules DI-2 and DI-3 have integrated FIFOs. Module DI-2 has two mutually asynchronous clock signals as inputs, the write clock (WCLK) and the read clock (RCLK). In the request path, the request transaction data from the OCP slave entity are written into the DI module (FIFO memory) using WCLK, and the network fabric back-end reads the data out of the DI module (FIFO memory) using the read clock (RCLK). Similarly, module DI-3 uses its write clock to write request transaction data from the network fabric back-end, and the OCP master reads out the request data from the DI module (FIFO memory) using its read clock (RCLK). This process continues. In the response path, FIFOs are employed using the same mechanism as in the request path.
Figure 4.6. DI Module with Asynchronous Clocked FIFOs

Whenever the two pointers (write and read) catch up with each other, or on reset, an additional two clock cycles are required to synchronize signals between the clock domains. When the write pointer catches up with the read pointer, an extra two write clock cycles are required to synchronize the read pointer value into the write clock domain to guarantee a vacant slot for writing into the FIFO memory. Similarly, when the read pointer catches up with the write pointer, an extra two read clock cycles are required to synchronize the write pointer into the read clock domain to guarantee that valid data are read out of the FIFO memory. A synchronized write pointer is used to generate the rptr_empty (read empty) signal, and a synchronized read pointer is used to generate the wptr_full (write full) signal. Writing request transaction data into a FIFO is stalled or halted on assertion of the wptr_full signal high (which occurs when the write pointer catches up with the read pointer). When the DI module stall signal goes high on assertion of wptr_full, the OCP slave is not allowed to accept any more request transactions from its OCP master entity. The rptr_empty signal is asserted high when there are no more data to be read out (which occurs when the read pointer catches up with the write pointer). This design implementation uses 8 words of memory for data storage in the request and response paths of the DI modules that employ asynchronous FIFOs.
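The pointer scheme of the referenced asynchronous FIFO design papers [23] [25] compares Gray-coded pointers after two-flop synchronization; Gray coding is what makes a pointer safe to synchronize across clock domains, since only one bit changes per increment. A minimal sketch of the conversion and the full/empty conditions for an 8-word FIFO (pointers carry one extra wrap bit):

```python
def bin2gray(b):
    # Binary-to-Gray conversion: adjacent codes differ in one bit,
    # so a pointer sampled mid-transition is off by at most one slot.
    return b ^ (b >> 1)

DEPTH = 8  # 8-word FIFO memory, as used in this design

def fifo_empty(wptr, rptr):
    # Empty when the pointers (including the wrap bit) are equal.
    return wptr == rptr

def fifo_full(wptr, rptr):
    # Full when the write pointer has lapped the read pointer by
    # exactly DEPTH entries (pointers differ only in the wrap bit).
    return (wptr - rptr) % (2 * DEPTH) == DEPTH
```

In hardware the full/empty comparisons operate on the Gray-coded, synchronized pointers; the binary arithmetic above expresses the same conditions in a form that is easier to check.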
4.2 Design Implementation Flow

The standard ASIC design flow is used to build the clocked interfaces in single- and multifrequency clocked domains. Figure 4.7 illustrates the development flow and the required implementation files. In the implementation flow, each design phase is functionally verified with the integrated IP cores and NoC. During the logic synthesis and automatic place & route phases, the synthesized designs are tested with SDF back-annotation to verify that setup and hold time requirements are met. In this project implementation flow, the ModelSim simulator is used for testing the design functionality.

Figure 4.7. Design Implementation Flow with Technology Library Files. (The Verilog RTL modules are synthesized in Design Compiler using Tech.db and constraint Tcl files, producing a structural netlist and sdc, sdf, and ddc files; SoC Encounter then performs APR using Tech.lef and Tech.lib files, producing the APR structural netlist, def, sdf, and spef files, and a power report. Functionality is tested with sdf back-annotation after each stage.)

4.2.1 RTL Source Codes Using Verilog HDL

In this project implementation, the RTL code is developed using Verilog HDL. The project design is customized to construct modular components across the design. The project implementation includes macros to define the widths of the address, data, and other control signals. The widths are configurable by modifying the defined macro widths, with corresponding changes required in the back-ends. Figure 4.8 illustrates the RTL structure of this project design. The following Verilog modules are coded in this project implementation.

a. Test Bench:
   timescale.v : Timescale definition. Included in all files.
   defines.v : Macro module that defines configurable signal widths. Included in all files.
   test.v : Test bench.
   top.v : Design top layer.

b. Single-Frequency Clocked Domain:
   master_be.v : Customized back-end to the Master IP core.
   di.v : Domain interface module (combinational logic only).
   di_ocpm.v : OCP Master entity interfacing with the DI module.
   ocps_di.v : OCP Slave entity interfacing with the DI module.
   nw_be1.v : Customized network back-end (Master IP end) to the NoC.
   nw_be2.v : Customized network back-end (Slave IP end) to the NoC.
   slave_be.v : Customized back-end to the Slave IP core.

c. Multifrequency Clocked Domain: The modules developed for the SFCD are reused in the MFCD. In this implementation, the DI module is constructed with asynchronous FIFOs to handle synchronization and buffering. The FIFOs are developed by customizing the dual clocked pointer FIFOs proposed in the "Simulation and Synthesis Techniques for Asynchronous FIFO Design" papers [23] [25]. The developed FIFOs are used in both the request and response paths. The Verilog modules are as follows.

Figure 4.8. Design RTL Structure. (The figure shows the testbench hierarchy: test.v, with timescale.v and defines.v, instantiates top.v, which integrates top1.v, top2.v, noc.v, and the Slave IP modules memory.v and rom.v; the multifrequency di.v contains the FIFO submodules fifo.v, fifomem.v, sync_r2w.v, sync_w2r.v, wptr_full.v, and rptr_empty.v.)

   master_be.v : Customized back-end to the Master IP core.
   di.v : Domain interface module (with asynchronous clocked FIFOs).
   di_ocpm.v : OCP Master entity interfacing with the DI module.
   ocps_di.v : OCP Slave entity interfacing with the DI module.
   nw_be1.v : Customized network back-end (Master IP end) to the NoC.
   nw_be2.v : Customized network back-end (Slave IP end) to the NoC.
   slave_be.v : Customized back-end to the Slave IP core.
   fifo.v : Top-level FIFO module.
   fifomem.v : FIFO memory module (instantiated with 8 words of data).
   sync_r2w.v : Module to synchronize the read pointer into the write clock domain.
   sync_w2r.v : Module to synchronize the write pointer into the read clock domain.
   wptr_full.v : Module to generate the write pointer full signal.
   rptr_empty.v : Module to generate the read pointer empty signal.

d. Slave IP:
   memory.v : ROM (8KB) and RAM (8KB) for testing the Verilog code.
   rom.v : Description created by a converter from S-format.

e. On-Chip Network Fabric or NoC:
   router3_72b_HL_pself.v : 3-port router top-level module.
   sw_H_72b_pselfv.v : Router switch module.
   merge_L_72b_pself.v : Router merge module.
   ehb_L_72b_pself.v : Elastic half buffer module.

4.2.2 Design Synthesis Using Synopsys Design Compiler

The design is segregated into three blocks corresponding to the IP cores and NoC to explore their performance characteristics individually, as shown in Figure 4.9. The Top1 and Top2 modules include the customized back-ends to the IP cores, the OCP entities, and the DI modules. The NoC module includes the routers and the on-chip network fabric back-ends. The project design is synthesized using Synopsys Design Compiler with ARM's Artisan standard cell library in IBM's 65nm process technology. The University of Utah Tcl scripts for logic synthesis and automatic place & route are modified to accomplish the design synthesis. The generated structural Verilog netlists are integrated into one unit by manual wiring and functionally tested with SDF back-annotation.

4.2.3 Design APR Using Cadence SoC Encounter

Cadence SoC Encounter is used for automatic place and route of this project design. The individual top-level synthesized structural Verilog netlists (Top1, Top2, and NoC), along with the other required files shown in Figure 4.7, are imported into SoC Encounter for automatic place and route. The physically routed modules are integrated and functionally tested with SDF back-annotation.

Figure 4.9. Design Synthesis Structure. (Top1 and Top2 each contain an IP core back-end, DI modules, and OCP front-end entities; the NoC block sits between them, and all three are wrapped in a Top module.)
After the expected functionality is achieved on the final routed design, with setup and hold time requirements met, the performance metrics are captured.

CHAPTER 5

VALIDATION OF CLOCKED OCP COMPLIANT INTERFACES

Increasing design complexity, cost, and probability of errors on an SoC have raised the importance of verification for reducing design time and risk and ensuring rapid time-to-market. Ideally, in a standard ASIC design flow, a certain degree of verification is required at every stage to improve product quality. In industry, verification is more often performed only at the critical design implementation stages (RTL, logic synthesis, and physical implementation). In this research, the Sonics OCP Library for Verification (SOLV) package is used to validate the clocked interface designs for OCP compliance [29]. The SOLV package supports compliance checking for all released OCP-IP Open Core Protocol specifications. The SOLV package comprises three components: an OCP checker, a disassembler, and a performance analyzer. Figure 5.1 shows the SOLV components and tool flow. The SOLV package provides a SystemVerilog assertion (SVA) based checker which can be integrated into a Verilog testbench to validate protocol compliance. The checker captures the OCP interface signals during simulation on each OCP clock cycle and compares them against the OCP protocol requirements. An assertion-based property verification mechanism is employed to check signals and report protocol violations in the same clock cycle in which the assertion fails. The Sonics OCP checker dynamically validates OCP interfaces during simulation and generates OCP trace files for use by the postprocessing tools ocpdis (OCP disassembler) and ocpperf (OCP performance analyzer). During simulation, the OCP connection activity is logged into OCP trace files consisting of tables of hexadecimal values.
The OCP disassembler uses the OCP trace files to display OCP connection activity in a convenient report format. The OCP performance analyzer uses the OCP trace files to measure the performance of OCP basic transfers and burst transactions.

Figure 5.1. SOLV Components and Tool Flow. (The Sonics OCP checker monitors the OCP Master/Slave connection in the Verilog HDL testbench and writes <ocpName>.ocp trace files, which the OCP disassembler and OCP performance analyzer postprocess into <ocpName>.ocpdis, <ocpName>.ocptrn, and <ocpName>.ocpperf reports.)

5.1 OCP Checker Setup and Instantiation

Sonics provides SystemVerilog files for a limited set of simulators for the OCP checker. Software is also provided for the command line tools: the OCP disassembler and performance analyzer. SOLV supports the NC-Verilog, VCS, and MTI simulators. The Sonics OCP checker can be instantiated using a Verilog module containing two maps: one for the instance and protocol parameters and another for the ports. Each OCP connection has a unique set of instance and protocol configuration parameters, which are enabled depending on the supported OCP subset. Each connection also has a unique set of wires that connects to the SVA checker, and the checker instance port map is defined according to the connected signals. When an OCP signal is not specified, the checker reserves a one-bit-wide signal for it and uses the default values. Figure 5.2 shows a code snippet for an OCP checker instance and protocol parameter map, and a port map, as used in this design validation.

ocp2_sva_checker #(
  // Instance parameters
  .version ("ocp2.2-1.9"),
  .checkername ("coretb.ocp"),
  .name ("checker_port"),
  .trace_name ("master_ocp.ocp"),
  .ocpcheck_enable (1),
  .trace_enable (1),
  .max_idle_enable (1),
  // Protocol parameters
  .cmd (1), .addr (1), .addr_wdth (32),
  .read_enable (1), .write_enable (1),
  .datahandshake (1),
  .burstlength (1), .burstlength_wdth (3),
  .burstprecise (1), .burstseq (1),
  .broadcast_enable (0),
  .burstseq_dflt1_enable (1),
  .burstseq_incr_enable (1),
  .burstseq_wrap_enable (1),
  .byteen (1), .cmdaccept (1),
  .data_wdth (32), .mdata (1),
  .resp (1), .respaccept (1), .sdata (1),
  .dataaccept (1),
  .tags (3), .taginorder (1),
  .mreset (1), .sreset (1),
  .threads (0),
  .. ..)
// Port map
checker_port (
  .Clk_i (Clk), .MCmd_i (MCmd), .MAddr_i (MAddr), .MData_i (MData),
  .MDataValid_i (MDataValid), .MRespAccept_i (MRespAccept),
  .SCmdAccept_i (SCmdAccept), .SData_i (SData),
  .SDataAccept_i (SDataAccept), .SResp_i (SResp),
  .MByteEn_i (MByteEn),
  .MBurstLength_i (MBurstLength), .MBurstPrecise_i (MBurstPrecise),
  .MBurstSeq_i (1'b0),
  .MTagID_i (MTagID), .MTagInOrder_i (MTagInOrder),
  .STagID_i (STagID), .STagInOrder_i (STagInOrder),
  .MReset_ni (RST), .SReset_ni (RST),
  .. ..);

Figure 5.2. OCP Checker Code Snippet

CHAPTER 6

DESIGN TESTING AND RESULTS

6.1 Testing Environment

Figure 6.1 shows the test setup used for this design. The design uses an emulated RISC pipelined CPU as the Master IP core, a synchronous memory (RAM and ROM) as the Slave IP core, and customized 3-port synchronous routers as the NoC [18] [19]. A Verilog testbench generates the CPU emulation traffic to communicate with the Slave IP core through the network fabric. The network fabric contains synchronous routers connected back-to-back. Data packets traverse the routers and are transferred to the destination address using the source routing information provided by the processor emulator. The memory module (Slave IP) contains 8KB each of RAM and ROM, customized to interact with the Master IP using OCP.
Figure 6.1. Design Test Setup. (The testbench integrates the emulated CPU and memory IP cores with the Top module, which integrates the back-ends, front-ends, and DI modules built as Top1, NoC, and Top2.)

A Verilog HDL test bench (test.v) is coded to manually integrate the synthesized submodules (Top1, Top2, and NoC), containing the back-ends, front-ends, and DI modules, with the IP cores. The test bench creates the clock and generates other req
| Reference URL | https://collections.lib.utah.edu/ark:/87278/s6kk9sh9 |



