| OCR Text |
ditions, with the remaining switch capacity dissipated through operations at the source and destination nodes. The performance is summarized in Figure 7, which shows only the part of the curve that changes rapidly. Performance improves nearly linearly thereafter.
While there is a substantial difference in the speed with which the PNC can perform accesses to local memory compared with remote memory, this difference is small when compared with that for loosely-coupled architectures such as hypercubes. Benchmarks indicate that a local 16-bit fetch to a 68020 register takes about 1.35 microseconds and a remote fetch takes 6.3 microseconds*. Thus, while a penalty is paid for remote accesses, the ratio of times for remote and local accesses is not very large. Another way of viewing this situation is that local accesses are rather slow, but remote accesses are not very much slower. In particular, remote accesses are not slow enough that it is worthwhile to perform a context switch between processes on the source node while a process waits for a remote memory operation to complete. Instead, all 16-bit fetches are performed synchronously. It is important to note that instruction fetches (which are always local, although this need not be the case, architecturally) take as long as any other local fetches.
If contiguous words of memory are to be accessed, block transfers can be used to move data at the rate of about 19.4 milliseconds for a 64-kilobyte block**. Block transfers become more efficient than single-word transfers if the size of the data exceeds about 100 bytes. Block transfer requests take precedence over single-word transfers and, unlike single-word transfers, can proceed asynchronously with the operation of the CPU. These benchmark results are shown in Figure 8. Only the low end of the curve is shown; again, performance improves roughly linearly thereafter.
Many Butterfly systems are configured with multiple paths through the switch between each pair of nodes. A 16-node machine, for example, may be paired with
*These figures are the result of timing 4000 consecutive in-line 16-bit fetch instructions to a machine register. As with all subsequent benchmark figures presented here, these instructions were executed with interrupts disabled, and instruction-fetch time is included in the timing results.
**This figure was obtained by executing in-line block-transfer code with interrupts disabled. |
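The benchmark figures quoted above can be combined into a rough cost model. The sketch below is illustrative only: it takes the 1.35 and 6.3 microsecond fetch times, the 19.4 millisecond 64-kilobyte block time, and the roughly 100-byte break-even point from the text, and assumes (this is not stated in the source) that a block transfer costs a fixed startup time plus a constant per-byte time. All function names are hypothetical, and the startup cost it derives is an inference from that assumed linear model, not a measured value.

```python
# Figures quoted in the text (times in microseconds).
LOCAL_FETCH_US = 1.35        # local 16-bit fetch
REMOTE_FETCH_US = 6.3        # remote 16-bit fetch
BLOCK_BYTES = 64 * 1024      # 64-kilobyte block
BLOCK_TIME_US = 19.4e3       # 19.4 ms for that block
WORD_BYTES = 2               # single-word transfers are 16 bits
BREAK_EVEN_BYTES = 100       # "about 100 bytes" per the text


def remote_to_local_ratio():
    """Ratio of remote to local 16-bit fetch time."""
    return REMOTE_FETCH_US / LOCAL_FETCH_US


def block_us_per_byte():
    """Per-byte block transfer cost.

    ASSUMPTION: the 64 KB figure is dominated by per-byte cost, so any
    startup overhead included in it is ignored here.
    """
    return BLOCK_TIME_US / BLOCK_BYTES


def single_word_time_us(nbytes):
    """Time to move nbytes using synchronous remote 16-bit fetches."""
    words = -(-nbytes // WORD_BYTES)  # ceiling division
    return words * REMOTE_FETCH_US


def implied_block_startup_us():
    """Startup cost implied by the ~100-byte break-even point.

    ASSUMPTION: block time = startup + nbytes * per-byte cost, and the
    two methods cost the same at exactly BREAK_EVEN_BYTES.
    """
    return (single_word_time_us(BREAK_EVEN_BYTES)
            - BREAK_EVEN_BYTES * block_us_per_byte())


def block_time_us(nbytes):
    """Estimated block transfer time under the assumed linear model."""
    return implied_block_startup_us() + nbytes * block_us_per_byte()


if __name__ == "__main__":
    print(f"remote/local ratio: {remote_to_local_ratio():.2f}")
    print(f"block cost per byte: {block_us_per_byte():.3f} us")
    print(f"implied block startup: {implied_block_startup_us():.1f} us")
    for n in (50, 100, 1000):
        print(n, round(single_word_time_us(n), 1), round(block_time_us(n), 1))
```

Under these assumptions the remote-to-local ratio comes out a little under 5, and the block path moves a byte roughly ten times faster than repeated single-word fetches once its startup cost has been paid, which is consistent with the break-even point the text reports.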