| Publication Type | journal article |
| School or College | College of Engineering |
| Department | Kahlert School of Computing |
| Creator | Richardson, William F. |
| Other Author | Brunvand, Erik L. |
| Title | The NSR processor prototype |
| Date | 1992 |
| Description | The NSR (Non-Synchronous RISC) processor is a general purpose processor structured as a collection of self-timed units that operate concurrently and communicate over bundled data channels in the style of micropipelines. These units correspond to standard synchronous pipeline stages such as Instruction Fetch, Instruction Decode, Execute, Memory Interface, and Register File, but each operates concurrently as a separate self-timed process. In addition to being internally self-timed, the units are decoupled through self-timed FIFO queues between each of the units, which allows a high degree of overlap in instruction execution. Branches, jumps, and memory accesses are also decoupled through the use of additional FIFO queues which can hide the execution latency of these instructions. The prototype implementation of the NSR has been constructed using Actel FPGAs (Field Programmable Gate Arrays). |
| Type | Text |
| Publisher | University of Utah |
| First Page | 1 |
| Last Page | 23 |
| Subject | self-timed systems; asynchronous systems; micropipelines; FPGAs; RISC processor; NSR |
| Language | eng |
| Bibliographic Citation | Richardson, W. F., & Brunvand, E. (1992). The NSR processor prototype. 1-23. UUCS-92-029. |
| Series | University of Utah Computer Science Technical Report |
| Relation is Part of | ARPANET |
| Rights Management | ©University of Utah |
| Format Medium | application/pdf |
| Format Extent | 5,183,189 bytes |
| Identifier | ir-main,16253 |
| ARK | ark:/87278/s6qv44q7 |
| Setname | ir_uspace |
| ID | 703901 |
| OCR Text | The NSR Processor Prototype

William F. Richardson and Erik Brunvand

UUCS-92-029

Department of Computer Science
University of Utah
Salt Lake City, UT 84112 USA

August 14, 1992

Abstract

The NSR (Non-Synchronous RISC) processor is a general purpose processor structured as a collection of self-timed units that operate concurrently and communicate over bundled data channels in the style of micropipelines. These units correspond to standard synchronous pipeline stages such as Instruction Fetch, Instruction Decode, Execute, Memory Interface, and Register File, but each operates concurrently as a separate self-timed process. In addition to being internally self-timed, the units are decoupled through self-timed FIFO queues between each of the units which allows a high degree of overlap in instruction execution. Branches, jumps, and memory accesses are also decoupled through the use of additional FIFO queues which can hide the execution latency of these instructions. The prototype implementation of the NSR has been constructed using Actel FPGAs (Field Programmable Gate Arrays).

This research was sponsored in part by NSF award MIP-9111793.

1 Introduction

As computer systems continue to grow in size and complexity, the challenges inherent simply in assembling the system pieces in a way that allows them to work together also grow. A major cause of the problems lies in the traditional synchronous design style in which all the system components are synchronized to a global clock signal. For example, simply distributing the clock signal throughout a large synchronous system can be a major source of complication. Clock skew is a serious concern in a large system, and is becoming significant even within a single chip.
At the chip level, more and more of the power budget is being used to distribute the clock signal, while designing the clock distribution network can take a significant portion of the design time. One solution is to use non-clocked asynchronous techniques or restricted versions of asynchrony known as self-timed [8].

1.1 Self-Timed Systems

Self-timed circuits are a subset of a broad class of asynchronous circuits. General asynchronous circuits do not use a global clock for synchronization, but instead rely on the behavior and arrangement of the circuits to keep the signals proceeding in the correct sequence. In general these circuits are very difficult to design and debug without some additional structure to help the designer deal with the complexity. Traditional clocked synchronous systems are an example of one particular structure applied to circuit design to facilitate design and debugging. Important signals are latched into various registers on a particular edge of a special clock signal. Between clock signals, information flows between the latches and must be stable at the input to the latches before the next clock signal. This structure allows the designer to rely on data values being asserted at a particular time in relation to this global clock signal.

Self-timed circuits apply a different type of structure to circuit design. Rather than let signals flow through the circuit whenever they are able, as with an unstructured asynchronous circuit, or require that the entire system be synchronized to a single global timing signal, as with clocked systems, self-timed circuits avoid clock-related timing problems by enforcing a simple communication protocol between circuit elements. This is quite different from traditional synchronous signaling conventions where signal events occur at specific times and may remain asserted for specific time intervals. In self-timed systems it is important only that the correct sequence of signals be maintained.
The timing of these signals is an issue of performance that can be handled separately.

1.2 Communication Protocol

Self-timed protocols are often defined in terms of a pair of signals that request an action, and acknowledge that the requested action has been completed. One module, the sender, sends a request event to another module, the receiver. Once the receiver has completed the requested action, it sends an acknowledge event back to the sender to complete the transaction.

Figure 1: A Bundled Data Interface

Figure 2: Two-Phase Bundled Transition Signaling

This procedure defines the operation of the modules, which follows the common idea of passing a token of some sort back and forth between two participants. Imagine that a single token is owned by the sending module. To issue a request event it passes that token to the receiver. When the receiver is finished with its processing it produces an acknowledge event by passing that token back to the sender. The sequence of events in this communication transaction is an alternating sequence of request and acknowledge events. The sequence of events in a communication transaction is called the protocol. In this case the protocol is simply for request and acknowledge to alternate, although in general a protocol may be much more complicated and involve many interface signals. Although self-timed circuits can be designed in a variety of ways, the circuits used to build the NSR processor use two-phase transition signalling for control and a bundled protocol for data paths. Two-phase transition signalling [8, 4] uses transitions on signal wires to communicate the request and acknowledge events described previously.
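As an illustration of the alternating request/acknowledge protocol, here is a minimal event-level sketch in Python. The class and wire names are hypothetical, chosen for the example; the real NSR channels are hardware wires, not objects.

```python
# Minimal sketch of two-phase bundled transition signaling.
# A "transition" is a toggle of a wire; only the toggle is meaningful,
# not the resulting level. Names here are illustrative, not from the NSR.

class BundledChannel:
    def __init__(self):
        self.req = False      # request wire level
        self.ack = False      # acknowledge wire level
        self.data = None      # bundled data wires

    def send(self, value):
        """Sender: data must be stable before the request transition."""
        assert self.req == self.ack, "previous transaction not acknowledged"
        self.data = value         # drive the data bundle first (bundling constraint)
        self.req = not self.req   # then signal a request event (toggle)

    def receive(self):
        """Receiver: consume data, then acknowledge with a toggle."""
        assert self.req != self.ack, "no pending request"
        value = self.data
        self.ack = not self.ack   # acknowledge event completes the transaction
        return value

ch = BundledChannel()
ch.send(0x1234)
print(hex(ch.receive()))   # first transaction: low-to-high transition on req
ch.send(0x5678)
print(hex(ch.receive()))   # second transaction: high-to-low, same meaning
```

Note that `send` drives the data before toggling `req`, mirroring the bundling constraint, and that the two transactions use opposite wire polarities with identical meaning.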
Only the transitions are meaningful; a transition from low to high is the same as a transition from high to low, and the particular state, high or low, of each wire is not important. A bundled data path uses a single set of control wires to indicate the validity of a bundle of data wires. This requires that the data bundle and the control wires be constructed such that the value on the data bundle is stable at the receiver before a signal appears on the control wire. This condition is similar to, but weaker than, the equipotential constraint [8]. Two modules connected with a bundled data path are shown in Figure 1, and a timing diagram showing the sequence of the signal transitions using two-phase transition signalling is shown in Figure 2.

Figure 3: A Micropipeline FIFO Buffer

A self-timed FIFO buffer has a particularly simple implementation using the two-phase bundled protocol. The circuit in Figure 3 is an example of a FIFO buffer of this type with processing between two of the stages. If the processing is not internally self-timed and able to generate a completion signal, a delay must be added that models the delay of the data through that logic, as shown in the figure. If no processing is present between the stages, as seen in the right two stages in the figure, the pipeline is a simple FIFO buffer. This type of FIFO is also known as a micropipeline [9].

2 NSR Architecture

The NSR (Non-Synchronous RISC¹) processor prototype is structured as a collection of self-timed units which operate concurrently and cooperate by communicating with other units using self-timed communication protocols. First-in first-out (FIFO) buffers play an extremely important role in the implementation of the NSR.
In fact, one way to look at the architecture of the NSR processor is as a large FIFO buffer that also modifies the data passing through it according to some rules. The overall architecture of the NSR is inspired by the synchronous WM [10] and PIPE [7] processors, which also use FIFO queues extensively. The units that make up the NSR processor correspond to standard synchronous pipeline stages, and consist of Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Register File (RF), and Memory Interface (MEM) as shown in Figure 4. Each unit operates concurrently as a separate self-timed process. In addition to being internally self-timed, the units are decoupled through self-timed FIFO queues between the units.

¹Because the current implementation has no explicit HALT instruction and no interrupt mechanism, NSR originally stood for "Nantucket Sleigh Ride."

| Encoding | Mnemonic | Action |
| 1111 -Rd- -Ra- -Rb- | STA Rd,Ra,Rb | Rd, AQ(Store) <- Ra + Rb |
| 1110 -Rd- -Ra- -Rb- | LDA Rd,Ra,Rb | Rd, AQ(Load) <- Ra + Rb |
| 1101 -Rd- -Ra- -Rb- | SJMP Rd,Ra,Rb | Rd, Jmp-Queue <- Ra + Rb |
| 1100 -Rd- -Ra- -Rb- | ADD Rd,Ra,Rb | Rd <- Ra + Rb |
| 1011 -Rd- -Ra- -Rb- | XNOR Rd,Ra,Rb | Rd <- Ra XNOR Rb |
| 1010 -Rd- -Ra- -Rb- | XOR Rd,Ra,Rb | Rd <- Ra XOR Rb |
| 1001 -Rd- -Ra- -Rb- | OR Rd,Ra,Rb | Rd <- Ra OR Rb |
| 1000 -Rd- -Ra- -Rb- | AND Rd,Ra,Rb | Rd <- Ra AND Rb |
| 0111 -Rd- -offset- | MVPC Rd,offset | Rd <- PC + offset |
| 0110 -Rd- 0100 -Rb- | SHRA Rd,Rb | Rd <- shift right arithmetic Rb |
| 0110 -Rd- 0010 -Rb- | SHRL Rd,Rb | Rd <- shift right logical Rb |
| 0110 -Rd- 0001 -Rb- | SHLL Rd,Rb | Rd <- shift left logical Rb |
| 0101 11xx -Ra- -Rb- | SNE Ra,Rb | CC-Queue <- (Ra != Rb) |
| 0101 10xx -Ra- -Rb- | SGE Ra,Rb | CC-Queue <- (Ra >= Rb) |
| 0101 01xx -Ra- -Rb- | SGT Ra,Rb | CC-Queue <- (Ra > Rb) |
| 0101 00xx -Ra- -Rb- | SEQ Ra,Rb | CC-Queue <- (Ra = Rb) |
| 0100 -Rd- -Ra- -Rb- | SUB Rd,Ra,Rb | Rd <- Ra - Rb |
| 0011 -Rd- -value- | MVIL Rd,value | Rd.h <- 00, Rd.l <- value |
| 0010 -Rd- -value- | MVIH Rd,value | Rd.h <- value, Rd.l <- 00 |
| 0001 -offset- | BCND offset | if CC-Queue then PC <- PC + offset |
| 0000 xxxxxxxxxxxx | JMP | PC <- Jmp-Queue |

Figure 5: NSR Instruction Set

2.2 Control Flow

All control flow decisions are made by the Instruction Fetch unit based on conditions set up in advance by the Execution unit. Conditional branch (BCND) instructions and jump (JMP) instructions are handled and consumed entirely by the IF unit and do not proceed any further through the NSR pipeline. The semantic convention used is that branches implement an offset relative to the program counter (PC) while jumps are made to a specific address.

BCND instructions are recognized by the Instruction Fetch unit and cause the program counter to either be incremented by one (branch not taken), or to be updated by adding a signed constant present in the opcode (branch taken). The decision to take the branch or not is made based on a condition code (CC) bit. This CC bit is computed in advance by the Execute unit and stored in a FIFO queue between the Execute unit and Instruction Fetch unit. Note that the arithmetic instructions do not set the condition bit. These CC bits are set only by the explicit condition code setting instructions. These instructions compare the values contained in a pair of registers and set the condition code based on the result of that comparison. The prototype NSR processor implements EQ, NEQ, GT, and GE comparisons. Each BCND instruction consumes one CC bit from the CC queue in order to make the branch decision. Thus, the CC bits generated by the Execute unit and used in the Instruction Fetch unit must obey a one-to-one producer-consumer relationship.
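The one-to-one producer-consumer relationship between compares and branches can be sketched with a plain FIFO. This is an illustrative software model only; a real BCND would stall on an empty queue rather than raise an error, and the function names are hypothetical.

```python
# Sketch of the producer-consumer relationship between condition-code
# setting instructions (Execute unit) and BCND (Instruction Fetch unit).
# Illustrative model; names and structure are not from the FPGA design.
from collections import deque

cc_queue = deque()               # FIFO between Execute and Instruction Fetch

def sne(ra, rb):                 # Execute unit: producer
    cc_queue.append(ra != rb)    # exactly one CC bit enqueued per compare

def bcnd(pc, offset):            # Instruction Fetch unit: consumer
    taken = cc_queue.popleft()   # consumes exactly one CC bit
    return pc + offset if taken else pc + 1

sne(3, 4)          # produce the CC bit well before the branch needs it
pc = bcnd(100, 8)  # branch taken, since 3 != 4
print(pc)          # 108
```

Because the queue is FIFO, issuing many instructions between the SNE and the BCND simply means the bit is already waiting when the branch executes, which is the latency-hiding effect the text describes.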
Jump instructions are also handled in the Instruction Fetch unit. In this case, the target address is computed by the Execute unit in advance by adding the contents of two registers with the SJMP instruction and sending the result to a FIFO queue. The Instruction Fetch stage, upon seeing a JMP instruction, dequeues an address from the Jmp-Queue and uses it to update the value of the PC. The jump addresses and JMP instructions must also obey the producer-consumer relationship. One easy way to halt the NSR processor in a deadlock is to issue a JMP instruction before any SJMP instruction, in which case the Instruction Fetch unit will wait forever for the jump address to show up in the queue.

The effect of the decoupling of the branch and jump instructions is similar to the common idea of delay slots. However, rather than using a fixed number of delay slots, the programmer is free to put any number of instructions between, for example, the SNE instruction and the BCND that uses the generated condition code. If many instructions are issued between these two then the condition code will be waiting when the BCND is executed and there will be no stalling of the pipeline and no delay. If, on the other hand, the SNE is followed directly by the BCND, then the Instruction Fetch stage will simply wait for the condition code to be produced before proceeding with the branch. Note that since all the stages are self-timed, no explicit control of the pipeline is required to implement this possible stall and no NO-OP instructions are required to fill the delay slots.

2.3 Memory Access

The memory address space consists of 65536 16-bit words, addressed sequentially from 0x0000 to 0xFFFF. For this prototype version of the NSR, the smallest (and indeed only) addressable memory element of the NSR is a 16-bit word. Memory access on the NSR is decoupled through FIFO queues.
There are, in fact, no standard load and store instructions in the NSR instruction set. Instead, memory addresses are computed and sent to the Memory Interface, which processes the requests and queues up the results. An LDA instruction is exactly like an ADD instruction with the result also sent to the Memory Interface as an address to load from. The result of an STA instruction is likewise considered an address in which to store data. The programmer transfers data between the NSR and memory by accessing register R1, a special register which is actually connected to queues to and from the memory. When the program reads from register R1 (R1 is the source register for some operation) the result is data from memory out of the Load Data Queue (LDQ), and when the program stores into register R1 (R1 is the destination register of some operation), that data gets queued up to be stored into memory through the Store Data Queue (SDQ). Neither operation takes place until the corresponding address has been placed into the Address Queue (AQ). The memory access queues are shown in Figure 6.

Figure 6: NSR Memory Queues

The Memory Interface uses the information in these queues to perform memory cycles. When a load address is at the head of the AQ, a read cycle is initiated and the resulting data are placed in the LDQ. When a store address is at the head of the AQ, and there are data at the head of the SDQ, a store cycle is initiated and those data are stored to memory. Because the memory operations are decoupled, several requests may be queued before they are needed. For example, by placing an LDA instruction several instructions in advance of the instruction that requires the memory contents, the memory access latency is hidden.
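The three memory queues and the LDA/STA semantics can be sketched as below. This is a hypothetical software model of the queue discipline only, not the FPGA implementation; the memory dictionary and helper names are invented for the example.

```python
# Sketch of the NSR's decoupled memory queues: LDA/STA enqueue an
# address plus a load/store flag into the AQ; the MEM unit later
# services the head of the AQ, filling the LDQ or draining the SDQ.
from collections import deque

memory = {0x0010: 42}                 # toy memory contents for the example
aq, ldq, sdq = deque(), deque(), deque()

def lda(ra, rb):                      # LDA: like ADD, result also queued as a load address
    aq.append(("load", ra + rb))

def sta(ra, rb):                      # STA: result queued as a store address
    aq.append(("store", ra + rb))

def mem_cycle():                      # one Memory Interface step
    op, addr = aq.popleft()
    if op == "load":
        ldq.append(memory[addr])      # read cycle fills the Load Data Queue
    else:
        memory[addr] = sdq.popleft()  # store needs data at the head of the SDQ

lda(0x0008, 0x0008)   # queue a load of address 0x0010 well in advance
mem_cycle()
print(ldq.popleft())  # reading "register R1" dequeues the loaded word: 42
```

Issuing the `lda` several instructions before the value is consumed corresponds exactly to the latency hiding described above: by the time R1 is read, the LDQ already holds the data.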
Again, this is similar to delayed loads, with the advantage that any number (including zero) of instructions may be executed between the initiation of the load and the use of the loaded data. As with control flow operations, no explicit control of the pipeline is needed to generate possible stall cycles.

Note that each time an instruction uses register R1 as a source, it dequeues one word from the LDQ. This means that a different value may be received each time R1 is accessed. For example, if two LDA instructions have been issued previously, then the instruction ADD r2,r1,r1 will add the two values loaded from memory and store the result in R2. In fact, if an address has also been queued with an STA instruction, the instruction ADD r1,r1,r1 will add two values from memory and store the result back to another memory location. Interleaved STA and LDA instructions may be used without concern. Although the LDQ and SDQ are independent, there is only one Address Queue. In addition to enqueuing the address, a bit is enqueued which indicates whether the address is for a write or read operation. By sharing the AQ, read-after-write hazards are avoided. However, the unwary programmer can easily deadlock the NSR processor by issuing an instruction that uses R1 as a source before queuing up an address using an LDA instruction. The processor will stop and wait for the result from memory that will never arrive. Note that it is a simple matter for compilers to avoid this problem.

| System Unit | Chips Used | Logic Modules | Utilization |
| Instruction Fetch | 1 Actel 1020A | 547 | 100% |
| Instruction Decode | 1 Actel 1010A | 287 | 97% |
| Execute | 1 Actel 1020A | 518 | 95% |
| Register File | 2 Actel 1020A | 538 each | 98% |
| Memory Interface | 2 Actel 1010A | 277 each | 94% |

Figure 7: NSR FPGA Implementation

Figure 8: FIFO Queue Lengths

3 Prototype Implementation

The separate functional units of the prototype NSR processor are each implemented using Actel FPGAs. The two-phase transition control modules and bundled data modules have been assembled from a library of macros designed to be used with the Actel parts [3, 2]. The individual units of the NSR are designed to behave as pipeline stages that also process the information that flows through them [5, 4]. These parts were designed and implemented by students in a graduate seminar on VLSI architecture using the Workview suite of schematic capture and simulation tools from ViewLogic. The resulting FPGAs have been assembled as a wire-wrapped prototype for testing and evaluation. The number of Actel FPGA chips used to implement each of the parts of the NSR and the utilization of those chips are shown in Figure 7. The NSR processor is connected to a standard PC clone to allow programs to be loaded into the NSR's memory and data to be retrieved to the PC for analysis.

The individual units of the NSR are connected with self-timed FIFO pipelines to provide a higher degree of overlap in instruction execution. The length of each FIFO is not a factor in ensuring correct operation, but may become significant in improving the throughput of the processor. The actual length of each FIFO was determined by the amount of space left over on each FPGA after the essential functions were implemented. The block diagram of the NSR functional units along with the length and location of the connecting FIFOs is shown in Figure 8.

3.1 Instruction Fetch

The Instruction Fetch unit is responsible for maintaining the program counter (PC), fetching instructions from memory, and passing executable instructions on to the next stage of the NSR to be decoded and executed.
As described earlier, the IF unit detects and handles all control flow instructions directly. The IF unit reads instructions from the NSR memory. There are then several actions which may be taken by the IF unit. If the opcode corresponds to a jump instruction, an address is taken from the Jmp-Queue and the program counter is loaded with that address. If the opcode is a conditional branch, a bit is taken from the CC-Queue, and if the bit indicates that a branch should occur, a 12-bit signed offset obtained from the opcode is added to the program counter. Otherwise, the program counter is incremented by one to fetch the next sequential instruction. The jump and branch instructions go no further through the NSR. All other opcodes are passed on unchanged to the Instruction Decode unit for further action. This action is similar to the concept of "squashing" instructions, often found in synchronous processors. However, the NSR does not convert the branch and jump instructions into NO-OPs, but instead removes them completely from the main processor pipeline.

There is one additional opcode which is recognized by the IF unit for special handling. Since the program counter is only stored in the IF unit, and all other units operate independently, there would normally be no way to obtain a "current" instruction address for use as a return address in a subroutine. To deal with this case, the MVPC instruction causes the IF unit to send the Program Counter value to the next stage. Before passing the PC value on, an 8-bit signed offset is extracted from the MVPC opcode and added to the current PC value. The MVPC instruction is passed on to the ID unit, and is followed immediately afterwards by the modified PC value. The current PC value held in the IF unit is then incremented normally.
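The IF unit's dispatch on the opcode can be sketched as a single step function. This is an illustrative model with hypothetical names; in it, JMP and BCND are consumed locally, MVPC forwards the modified PC value behind the opcode, and everything else passes through to the ID unit.

```python
# Sketch of the Instruction Fetch unit's control-flow handling.
# Simplified model; queue contents are pre-filled here, whereas the
# real unit would wait on the self-timed queues.
from collections import deque

jmp_queue = deque([0x0200])       # filled in advance by an SJMP
cc_queue = deque([True])          # filled in advance by a compare
to_decode = deque()               # instructions passed on to the ID unit

def fetch_step(pc, opcode, operand):
    if opcode == "JMP":           # consumed here; goes no further
        return jmp_queue.popleft()
    if opcode == "BCND":          # consumed here as well
        return pc + operand if cc_queue.popleft() else pc + 1
    if opcode == "MVPC":          # pass the opcode, then the modified PC value
        to_decode.append(("MVPC", None))
        to_decode.append(("PCVAL", pc + operand))
        return pc + 1
    to_decode.append((opcode, operand))   # everything else passes through
    return pc + 1

pc = fetch_step(0x0100, "BCND", 8)
print(hex(pc))                    # branch taken -> 0x108
pc = fetch_step(pc, "JMP", None)
print(hex(pc))                    # PC loaded from the Jmp-Queue -> 0x200
```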
The MVPC instruction does not alter the program counter value used to fetch the next instruction, but allows the programmer to obtain the modified address to be placed in a register for later use. It is the responsibility of the ID unit to recognize the MVPC instruction and handle the subsequent address accordingly.

The prototype board on which the NSR is built has only a single memory address space, but the NSR has separate logical paths for instructions and data. In order to share the access to the memory, a simple round-robin arbiter is used. The IF unit must take turns with the MEM unit when accessing the memory. In addition, due to pin restrictions, the interface to the instruction memory uses a 16-bit multiplexed bus, which carries both the address and the opcodes. The Instruction Fetch unit must provide the proper signals to fetch instructions from memory, and must obey the protocols established to ensure that the data memory unit can also access the system memory. Details on the memory arbitration will be found in section 3.6.

Figure 9: Register Usage Encoding

Figure 10: Execution Unit Operation Encoding

3.2 Instruction Decode

The ID unit has two responsibilities. It takes executable instructions from the IF unit and informs both the Execute unit and the Register File of what actions they must take. The Register File may provide two, one, or no register contents to the Execution unit for consumption. It must also be told whether to expect a result from the EX unit and to what register it should route that result. There are multistage FIFO queues between the ID, RF, and EX units.
It is not necessary that all instructions be synchronized, but each unit must know how many operands and results are needed, and what to do with them when they arrive. The source and destination information is encoded and sent to the Register File using the format shown in Figure 9. Since register R0 cannot be overwritten, it is often used as a destination when the result of an instruction does not need to be saved. The destination field is therefore all zeros when the result is not to be placed in a register. The source registers cannot be encoded in this way, since register R0 is a valid source, so they are encoded with a bit indicating their validity.

The EX unit knows how many operands each instruction requires, but it must be told what instruction to perform and where to send the results. The results may be sent to the Register File, to the Memory unit as an address or as data, placed in the Jmp-Queue, or simply discarded. It is also possible to route the results to more than one of these destinations. In the case of a MVPC instruction, the subsequent PC address is passed unchanged through the ID and EX units and routed to the Register File. There are sixteen possible opcodes in the NSR instruction set, so we can indicate the instruction to the EX unit by dedicating one bit of the 16-bit ID-EX communication path to each opcode class. There are two bits left over corresponding to the JMP and BCND instructions, which are never seen by the ID unit. These two bits are used to indicate whether the result is to be sent to the MEM unit (R1), or discarded (R0). This encoding is shown in Figure 10.

3.3 Execute Unit

The Execute unit is told what operation to perform by the ID unit, accepts the correct number of operands from either the Register File or, in the case of a MVPC instruction, from the ID unit, performs the operation, and routes the results to the correct places.
The actual operations are standard mathematical and logical operations (see Figure 5). Zero, one, or two operands are provided by the Register File as instructed by the ID unit, but the EX unit doesn't know or care from which registers the operands come. There is only a single 16-bit path for source operands, so if two operands are required, they are presented in sequence. This decision was made due to the limited pin count of the FPGAs. The ID unit also tells the EX unit how to route the result. The result may be presented to the RF unit to be written into a register, or it may be sent to the MEM unit as an address or as data. The result may also be discarded, as is common when the operation has a side-effect. This is often the case with the SJMP instruction, for example, which loads a value into the Jmp-Queue. With side-effects, it is possible that the results may be distributed to more than one destination. For example, the instruction STA r1,r2,r3 would add the contents of registers R2 and R3, and put the result in both the Store Address Queue (as a side effect of the STA) and the Store Data Queue (because the destination register is R1), and would thereby initiate a memory write operation.

3.4 Register File

There are sixteen 16-bit registers available for use, numbered R0 through R15. Any of these registers may be used as source or destination for any instruction. There are three classes of registers, however. Registers R0, R14, and R15 are hardwired to the constant values of zero, one, and negative one, respectively. Writing to these registers has no effect on their contents. Registers R2 through R13 are normal general purpose registers. In the original design, R14 and R15 were also general purpose registers, but due to space restrictions they were hardwired to constant values for the NSR prototype. Register R1 is not a register at all, but is actually the access point for the memory FIFO queues.
The Register File unit is implemented in two FPGAs because of space and pin limitations. Both FPGAs are identical, and each contains an eight-bit slice of each of the registers. There is a mode pin which is used to designate one of the parts as the Master section to ensure that both units will operate together.

Figure 11: Register File Unit Processes

Logically, the Register File consists of two parallel processes. The "source" process provides source operands to the EX unit, while the "result" process writes the EX unit results back to the appropriate registers. The two processes coordinate their actions with the use of a FIFO queue containing the register number of the destination and with a scoreboard bit for each register. The scoreboard bit ensures that a register which has already been named as a destination is not used as a source until the results are available.

The source process begins operation when the usage information is received from the ID unit (see section 3.2). Three operations are performed in sequence to provide two sources and a destination. If the first source is a valid register address then a mux selects the appropriate register contents and scoreboard bit for examination. If the scoreboard bit is set, the output transition signal is delayed until the scoreboard bit is cleared. When the scoreboard bit is cleared, the register contents are provided to the EX unit. However, if the source register is R1, then the scoreboard bit is not checked, the source request is sent directly to the MEM unit, and the process waits until the MEM results are provided, then passes them on to the EX unit. When the source has been acknowledged by the EX unit, the same process is repeated for the second source register. If either source register is not required, the requesting process is skipped.
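The scoreboard interlock between the two processes can be sketched with events. This is a cooperative-threading illustration of the blocking behavior only, with invented names; the real unit delays a transition signal in hardware rather than blocking a thread.

```python
# Sketch of the Register File scoreboard: a register named as a
# pending destination stalls readers until its result is written back.
import threading

regs = [0] * 16
scoreboard = [threading.Event() for _ in range(16)]
for ev in scoreboard:
    ev.set()                      # set = "no result pending", safe to read

def name_destination(r):          # source process, destination phase
    scoreboard[r].clear()         # mark a result as outstanding

def write_back(r, value):         # result process
    regs[r] = value
    scoreboard[r].set()           # clear the scoreboard: readers may proceed

def read_source(r):               # source process, operand phase
    scoreboard[r].wait()          # stall while a result is outstanding
    return regs[r]

name_destination(2)               # R2 is a pending destination
t = threading.Timer(0.01, write_back, args=(2, 99))
t.start()                         # the result arrives "later" from the EX unit
print(read_source(2))             # blocks until the write-back, then prints 99
```

Note how no central pipeline control appears anywhere: the stall falls out of the wait on the per-register scoreboard bit, just as the text describes for the self-timed hardware.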
When all sources have been provided and acknowledged, the destination is dealt with. If the destination is R0, no action is needed and the next register usage information is requested from the ID unit. Otherwise, the scoreboard bit for the destination register is set, unless it is already set, in which case the process waits until it is cleared before setting it. The destination register number is enqueued in a FIFO to await future results from the EX unit, and the next register usage information is requested. If an operand is requested from a register which is already being used for a destination (as indicated by the scoreboard bit), the operand is not provided until the results have been written back. FIFO queues buffering the ID usage information, the destination register, and the EX results help to hide variations in execution times.

The result process begins when a destination register address is placed in the destination FIFO. The process waits for results to be sent from the EX unit. When they arrive, the destination register number is popped off the destination FIFO, the results are written to the appropriate register, and the corresponding scoreboard bit is cleared. There is no direct correspondence between the number of source operands provided by the RF unit and the number of results written back. The ID unit coordinates the sourcing and disposition of data between the RF unit and the EX unit. In addition, results from the EX unit which are destined for R1 go directly to the MEM unit, without passing through the register file. Only when R1 is used as a source does data from memory pass through the RF unit, and even then the address must be sent from the EX unit directly to the MEM unit to initiate the read cycle. Figure 11 shows the structure of the Register File.

3.5 Memory Interface

The data memory (MEM) unit handles requests to read and write from the system memory.
This unit contains the three queues needed to handle memory interfaces (see figure 6). Since there is only one Address Queue, all memory accesses are sequential, allowing decoupled memory access to take place without the possibility of read-after-write errors. When a LDA or STA instruction is executed, an address is composed and then placed in the Address Queue (AQ), along with a bit indicating whether the address corresponds to a load or store operation.

Load addresses which reach the front of the AQ initiate a load operation. The address is placed on the external bus, and the appropriate control signals are generated to fetch the memory contents. The MEM unit must convert the two-phase transitions that the NSR uses into the four-phase level-sensitive signals expected by the system SRAM. A simple external delay line is used to provide the request/acknowledge handshaking expected by the NSR. When the memory contents arrive on the data bus following the read signal, they are enqueued in the Load Data Queue, the FIFO queue which is used by the register file to satisfy read requests from register R1.

Memory writes operate in a similar fashion. They are initiated when both a store address and a store data value reach the head of the Address Queue and the Store Data Queue, respectively. At that point, both the address and data are placed on the external busses, and the correct signals are generated to produce a write cycle on the system memory. Although the programmer uses register R1 as a destination in order to write to memory, the data to be written never actually passes through the register file, but instead comes directly from the EX unit. Recognizing that R1 is a destination and producing the correct routing signals is one of the functions of the ID unit.

The MEM unit is implemented in two FPGAs, consisting of both a master and a slave chip.
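As a software analogy, the in-order discipline imposed by the single Address Queue can be sketched as follows (the function and type names and the flat memory array are illustrative only, not the NSR hardware interface):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model of the MEM unit's Address Queue dispatch:
 * each entry carries an address and a load/store bit, and entries
 * are serviced strictly in order, so a load can never be reordered
 * ahead of an earlier store to the same address. */
typedef struct { unsigned addr; bool is_store; } aq_entry;

enum { MEMSIZE = 256 };

/* Drain the AQ in order.  Loads push into the Load Data Queue
 * (ldq); stores consume values from the Store Data Queue (sdq).
 * Returns the number of values pushed into ldq. */
int drain_aq(const aq_entry *aq, int n,
             const unsigned *sdq, unsigned *ldq, unsigned *mem) {
    int s = 0, l = 0;
    for (int i = 0; i < n; i++) {
        if (aq[i].is_store)
            mem[aq[i].addr % MEMSIZE] = sdq[s++];  /* write cycle */
        else
            ldq[l++] = mem[aq[i].addr % MEMSIZE];  /* read cycle */
    }
    return l;
}
```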
These chips are very similar, but the master coordinates the actions of the two, and handles the interface to the memory arbiter. Each chip contains eight bits of the 16-bit data queues, with the master chip also containing the Load/Store bit of the Address Queue.

3.6 Memory Arbitration

The IF unit and the MEM unit must take turns accessing the single system memory. To accomplish this, a round-robin arbiter is built into parts of both units. This arbiter passes a token between the two units, and only the unit which has the token is allowed to access the memory. Usually, the IF stage performs most of the memory accesses. Between each instruction fetch, the token is passed to the MEM unit. If there are no pending data reads or writes, the token is simply returned to the IF unit. If the MEM unit wishes to access memory, it keeps the token until one read or write operation has been completed.

The design of the arbiter is shown in Figure 12. The arbiter starts when a single transition occurs on the INIT line, thereby inserting a token into the loop. The token circulates until one of the processes requests it with a transition on its REQ-MEM-PROC line. This causes the token to be diverted until the memory process is finished. Notice that because of the way the Q-select module samples its probe input, the token may pass by twice before the request is recognized. The first pass samples the probe input, and the token is diverted on the second pass. This arbiter design is not very efficient, but it uses only two pins on each chip, and it is fair.

4 Performance and Evaluation

The Protozone™ prototype board produced by Stanford University contains memory, logic, and connections to communicate with a standard PC clone. The NSR occupies the development space on this board, using wirewrapping sockets for the FPGAs. The NSR prototype is shown in Figure 13. Debugging the NSR was remarkably simple.
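Before moving on, the round-robin token discipline of section 3.6 can be sketched as a simple alternation. This is a software analogy only (the hardware uses transition signalling and the Q-select module, not function calls), and all names here are illustrative:

```c
#include <assert.h>

/* Illustrative model of the round-robin arbiter: the token
 * alternates between the IF and MEM units; a unit may access
 * memory only while holding the token, and MEM keeps it just
 * long enough for one pending read or write. */
typedef enum { IF_UNIT, MEM_UNIT } holder_t;

/* One circulation step: the holder performs an access if it has
 * work, then the token passes to the other unit. */
holder_t step(holder_t token, int *mem_pending,
              int *if_accesses, int *mem_accesses) {
    if (token == IF_UNIT) {
        (*if_accesses)++;      /* IF fetches whenever it holds the token */
    } else if (*mem_pending > 0) {
        (*mem_pending)--;      /* MEM completes one read or write */
        (*mem_accesses)++;
    }
    return (token == IF_UNIT) ? MEM_UNIT : IF_UNIT;
}
```

Because the token strictly alternates, neither unit can starve the other, which is the fairness property claimed for the hardware arbiter.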
Each chip was thoroughly simulated with unit delays as part of the design process. Once the functionality was correct, the design was placed and routed on the Actel parts, the more realistic delays back-annotated, and the simulations were repeated. The main

Figure 13: The NSR Prototype

Another useful debugging aid is a bus monitor. Since the NSR can be stopped temporarily by holding up the request signals between chips, the data placed on the inter-chip buses can be examined. A driver and encoder for a set of four seven-segment LEDs was built into an FPGA, and a ribbon cable and plug was used to monitor the bus contents. This is very useful in determining whether the correct data was being transmitted between units. These switches and lights are useful in getting the NSR to communicate among its component parts.

To ensure correct operation, traces of the interchip buses can be made using a standard logic analyzer. Although the NSR uses transitions instead of levels to indicate when the data is valid, by using two channels of the logic analyzer on the same bus and triggering one on the rising edge of the request line and the other on the falling edge, a complete trace of the bus activity can be obtained.

4.1 Software tools

Communicating with the NSR is fairly simple. The protoboard memory is mapped directly into unused memory space on the PC, as byte addresses D000:0000 to D000:FFFF. This addresses only 64 Kbytes, but another 64 Kbytes is available by changing a bit on a specific I/O port. The NSR sees the two bytes with the same PC address as a single 16-bit word. Either the NSR or the PC can access the protoboard memory, but not both. Access is moderated by a toggle switch. Several simple programs have been written to transfer data between the NSR memory and the PC.
A load utility takes a simple text file describing the address and contents of the NSR memory and loads the memory with the appropriate values. An unload utility reverses the process.

4.1.1 Assembler

A simple assembler has been written to convert NSR assembly language instructions into data which can be loaded into the NSR memory and executed. The assembler allows for labels, symbols, data constants and relative offsets, but does not produce object files which can be linked with others. The output of the assembler is a text file containing the address and data of each affected NSR memory location. The NSR always begins execution at location 0x0000.

A nice ability of the assembler is to add some additional opcodes for instructions not implemented directly by the NSR. The NSR only tests for GT, GE, EQ, and NE conditions. The LE and LT conditions are implemented by the assembler as GE and GT tests, with the operands reversed. For example, the instruction SLE r2,r3 would be implemented as SGE r3,r2 instead. In the same way, the NOT r2,r3 instruction is assembled as XNOR r2,r0,r3.

One of the first programs to be run on the NSR was a simple test to generate Fibonacci numbers. The source is shown in Figure 14, which provides a good example of the assembly language for the NSR.

    ; this is a test to generate Fibonacci numbers
    text    .Equ  0x0100
    data    .Equ  0x0200      ; write results here
    limit   .Equ  22          ; only do 22 of 'em

    start:  seq   r0,r0       ; jump to main
            bcnd  main
            .ORG  text
    main:   mvih  r2,hi8(data)
            mvil  r3,lo8(data)
            or    r2,r2,r3    ; r2 points to output
            mvil  r3,0x1      ; we'll repeat r5 = r4 + r3
            mvil  r4,0x1      ; and shift 'em down.
            mvil  r10,lo8(limit)  ; just do a few
            xor   r8,r0,r0        ; init count register
            sta   r0,r2,r0        ; gonna write one
            or    r1,r3,r0        ; write r3
            add   r2,r2,r14       ; increment r2
            sta   r0,r2,r0        ; gonna write one
            or    r1,r4,r0        ; write r4
            add   r2,r2,r14       ; increment r2
    loop:   add   r5,r4,r3        ; r5 = r4 + r3
            sta   r0,r2,r0        ; gonna write one
            or    r1,r5,r0        ; write r5
            add   r2,r2,r14       ; increment r2
            or    r3,r4,r0        ; copy r4 to r3
            or    r4,r5,r0        ; copy r5 to r4
            add   r8,r8,r14       ; r8++
            sle   r8,r10          ; branch if r8 < r10
            bcnd  loop            ; take branch
            jmp                   ; die

Figure 14: Fibonacci Program Source

4.1.2 Simulator

To aid in the development of test programs for the NSR while the prototype was still being developed, a simulator was written. In addition to speeding program development, the simulator was also useful in suggesting changes to the instruction set. At first, the MVPC instruction did not add anything to the program counter value, requiring the programmer to dedicate a register to fixing up the PC value before using it. After writing a few programs and running them on the simulator, it was realized that this was awkward, and the modification to the instruction set was made.

The simulator is based on an implementation of the C-Threads library routines [6], and has been built on both Sun SPARC and Hewlett-Packard workstations. The simulator consists of C-Threads libraries which are machine-dependent, plus the actual NSR simulation code written in C. Separate threads are used for each functional unit of the NSR, with two threads required for the Register File. The functional units of the simulator communicate over pipelines implemented with semaphores. Although the simulator is accurate in that it produces the same results as the NSR when running the same programs, it does not attempt to model the performance of the NSR.
Some discussion is underway to determine whether modifying the simulator to do so would be worthwhile.

4.2 Speed

The speed of the NSR varies depending on the program it is running, but the best performance to date has fallen between 1.10 and 1.34 MIPS. This is relatively slow, but is not surprising since the Actel devices are relatively slow and all design decisions were made to minimize the number of gates and/or pins needed from the FPGAs. Speed optimizations were not considered for the prototype.

Performance was measured by placing a series of exactly 1000 instructions in a large loop and then executing that loop 65536 times, after which the NSR was deadlocked. Execution time was measured by placing an oscilloscope probe on the request signal between the IF and ID, and measuring the duration of the active signal with a stopwatch. Typical execution times were on the order of one minute, with repeatability within half a second, for an accuracy of within 1 percent. A variety of test programs have been run, with the results listed in Figure 16.

4.3 Power Consumption

To measure power consumption, the power supply traces on the prototype printed circuit board were cut and rerouted to isolate the NSR FPGAs from the rest of the protoboard. An ammeter was placed in series with the FPGA power supply, and current drain was measured while various programs were running, and also while the NSR was idle or deadlocked (Figure 15). All measurements were made with the LED bus display and driver removed, although it made very little difference.

    NSR State                      Current Drain
    CLR = 0, RESET = 0             45.3 mA
    Deadlocked on JMP, Arb ON      52.7 mA
    Deadlocked on JMP, Arb OFF     45.3 mA
    Running BCND loop              53.1 mA
    Running SJMP loop              40.8 mA

Figure 15: NSR Current Drain

As expected, the current drain was higher when data memory was accessed, since the MEM unit draws less current when idle.
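Returning to the speed figures of section 4.2, the MIPS numbers follow directly from the measurement procedure: 1000 instructions per loop times 65536 iterations, divided by the measured time. A small check (the function name is illustrative):

```c
#include <assert.h>

/* MIPS = (instructions per loop * loop count) / (seconds * 1e6).
 * The 52-second ADD loops of Figure 16 work out to roughly
 * 1.26 MIPS, close to the 1.27 MIPS listed; the fastest program,
 * JMP0.S at 49 seconds, gives about 1.34 MIPS as reported. */
double nsr_mips(double seconds) {
    const double insns = 1000.0 * 65536.0;   /* 65,536,000 instructions */
    return insns / (seconds * 1e6);
}
```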
Correspondingly, programs which only branched and did not use any registers used less current. A heavier drain was also noted when the operands of instructions had a larger number of ones in their binary representation. The standby current for a typical Actel Act-1 series FPGA should be around 3 mA, with a maximum of 10 mA [1], if all outputs are unloaded. The measured standby current for the seven FPGAs comprising the NSR was 45.3 mA. Surprisingly, the current drain was actually less while running some particular programs than when deadlocked. We currently have no explanation for this behavior, except to note that the perversity of the universe tends toward a maximum.

5 Conclusions and thoughts on Fred

Plans are being made for the development of a 32-bit self-timed processor which would incorporate several architectural changes and improvements when compared with the NSR. For obscure reasons, this processor will be called Fred. With Fred, we hope to develop a processor capable of acting as the main component of a standalone computer system. If time and resources permit, we would like to be able to build a Unix platform with it. Accordingly, there must be several architectural changes. We plan to provide for 8-, 16-, and 32-bit memory accesses, I/O ports, hardware and software interrupts, and increased parallelism with additional arithmetic or logical units. It might also be desirable to separate loads from stores, allowing out-of-order memory access if needed. Much time will need to be devoted to speed issues.

From a programming standpoint, there are many improvements to the instruction set which would be desirable in a 32-bit version of the NSR. These include adding a carry bit for multiple precision arithmetic, providing for immediate operands in several instructions, allowing for additional classes of instructions, and providing software exceptions.
The addition of a protected mode of operation for system security would be useful also. Of course, we may do something completely different.

    Program     Contents                                                                 Seconds  MIPS  milliAmps
    ADD0.S      add r9,r0,r0                                                                  52  1.27       63.0
    ADD1.S      add r9,r14,r15                                                                52  1.27       93.3
    ADD2.S      add r9,r0,r0; add r10,r0,r0; add r11,r0,r0; add r12,r0,r0                     52  1.27       65.3
    ADD3.S      add r0,r0,r0                                                                  52  1.27       57.4
    ADD4.S      add r0,r0,r15                                                                 52  1.27       95.7
    ADD5.S      add r9,r0,r15                                                                 52  1.27      102.2
    OR0.S       or r0,r0,r0                                                                   51  1.29       52.5
    OR1.S       or r0,r0,r15                                                                  51  1.29       68.3
    SEQ0.S      seq r0,r0; bcnd +1                                                            56  1.18       56.9
    SEQ1.S      seq r0,r15; bcnd +1                                                           56  1.18       45.0
    JMP0.S      r9 = PC; sjmp r9,r9,r8; jmp                                                   49  1.34       62.4
    MVPCJMP.S   mvpc r9,+3; sjmp r0,r0,r9; jmp                                                57  1.16       73.1
    LDA0.S      lda r0,r0,r0; or r0,r0,r1                                                     59  1.12       66.4
    STA1.S      sta r1,r0,r15                                                                 60  1.10      105.3
    STA2.S      sta r0,r0,r15; or r1,r0,r15                                                   55  1.20      106.4
    LDASTA2.S   lda r0,r0,r15; sta r1,r0,r1                                                   58  1.14      117.0
    LDASTA3.S   lda r0,r0,r15; or r9,r0,r1; sta r1,r0,r9                                      55  1.20      108.7
    MEM1.S      r9 = -1, r8 = 1; lda r10,r9,r8; lda r10,r9,r8; sta r11,r9,r8; add r1,r1,r1    54  1.22       98.2
    MEM1A.S     lda r10,r14,r15; lda r10,r14,r15; sta r11,r14,r15; add r1,r1,r1               55  1.20      100.0
    MEM0.S      r9 = r8 = 0; lda r10,r9,r8; lda r10,r9,r8; sta r11,r9,r8; add r1,r1,r1        54  1.22       75.0
    MEM0A.S     lda r10,r0,r0; lda r10,r0,r0; sta r11,r0,r0; add r1,r1,r1                     54  1.22       71.7

Figure 16: Performance and Current Drain

References

1. Actel Corporation. ACT Family Field Programmable Gate Array Databook, March 1991.
2. Erik Brunvand. A cell set for self-timed design using Actel FPGAs. Technical Report UUCS-91-013, University of Utah, 1991.
3. Erik Brunvand. Implementing self-timed systems with FPGAs. In W. R. Moore and W. Luk, editors, FPGAs, chapter 6.2, pages 312-323. Abingdon EE&CS Books, 1991.
4. Erik Brunvand.
Translating Concurrent Communicating Programs into Asynchronous Circuits. PhD thesis, Carnegie Mellon University, 1991. Available as Technical Report CMU-CS-91-198.
5. Erik Brunvand and Robert F. Sproull. Translating concurrent programs into delay-insensitive circuits. In ICCAD-89, pages 262-265. IEEE, November 1989.
6. Eric C. Cooper and Richard P. Draves. C Threads. Department of Computer Science, Carnegie Mellon University, September 1990.
7. J. R. Goodman, J. Hsieh, K. Liou, A. R. Pleszkun, P. B. Schechter, and H. C. Young. PIPE: A VLSI decoupled architecture. In 12th Annual International Symposium on Computer Architecture, pages 20-27. IEEE Computer Society, June 1985.
8. C. L. Seitz. System timing. In Mead and Conway, Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.
9. Ivan Sutherland. Micropipelines. CACM, 32(6), 1989.
10. Wm. A. Wulf. The WM computer architecture. Computer Architecture News, 16(1), March 1988.



