Instruction Sequencing
- memory latency problem
- Wide Memories
- Remember from Chapter 3 that we could use multiple banks of memory to decrease memory access time.
- In our simple machine, what if we fetch 32 bits at a time?
- How about 64?
- Especially useful for instruction fetch.
- Not bad for sequential processing of an array.
- We do not improve the time to fetch a single instruction,
- But we do improve the instruction bandwidth (see the sketch below).
- But jumps are a problem.
- There are some problems, but this has been implemented in the IBM
RS/6000
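- A minimal back-of-the-envelope sketch in C, assuming a made-up 60 ns access time and a 16-bit instruction width (neither figure is from these notes): the latency of one fetch stays the same, but a wider memory returns more instructions per fetch, so bandwidth goes up.

    #include <stdio.h>

    int main(void)
    {
        const double access_ns  = 60.0;              /* assumed time for one memory access */
        const int    instr_bits = 16;                /* assumed instruction width */
        const int    widths[]   = { 16, 32, 64 };    /* memory widths to compare */

        for (int i = 0; i < 3; i++) {
            /* Instructions returned by a single access of this width. */
            int per_access = widths[i] / instr_bits;
            printf("width %2d bits: latency %.0f ns, bandwidth %.1f instr/us\n",
                   widths[i], access_ns, per_access * 1000.0 / access_ns);
        }
        return 0;
    }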
- Interleaving
- How about multiple independent banks of memory?
- Each bank holds 1/n of the words (bank selection is sketched below).
- This can work for both data and instructions
- In fact, how about data in one and instructions in another?
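- A minimal sketch in C of low-order interleaving, assuming 4 banks (an arbitrary example value): word address a lives in bank a mod n at index a div n, so sequential addresses land in different banks and their accesses can overlap.

    #include <stdio.h>

    int main(void)
    {
        const unsigned n_banks = 4;                  /* assumed number of banks */

        for (unsigned addr = 0; addr < 8; addr++) {
            unsigned bank  = addr % n_banks;         /* which bank holds this word */
            unsigned index = addr / n_banks;         /* the word's index inside that bank */
            printf("address %u -> bank %u, index %u\n", addr, bank, index);
        }
        return 0;
    }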
- Instruction Prefetching
- We have already discussed this one.
- Possibly partially decode instructions before they reach the CPU
- This will help with unconditional branches (see the sketch below).
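- A toy sketch in C of a one-entry prefetch buffer; the instruction encoding and program are invented purely for illustration. While the current instruction "executes", the prefetcher already fetches the next one, and if it recognizes an unconditional jump it fetches from the target instead of the fall-through address.

    #include <stdio.h>

    enum { OP_ADD, OP_JMP, OP_HALT };

    struct instr { int op; int target; };

    int main(void)
    {
        struct instr mem[] = {              /* tiny made-up program */
            { OP_ADD,  0 },                 /* 0 */
            { OP_JMP,  4 },                 /* 1: unconditional jump to 4 */
            { OP_ADD,  0 },                 /* 2: skipped */
            { OP_ADD,  0 },                 /* 3: skipped */
            { OP_HALT, 0 },                 /* 4 */
        };

        int pc = 0;
        struct instr prefetched = mem[pc];  /* buffer holds the next instruction */

        while (prefetched.op != OP_HALT) {
            struct instr cur = prefetched;

            /* Prefetcher peeks at the opcode: follow an unconditional
               jump, otherwise fall through to pc + 1.  This fetch would
               overlap with executing cur. */
            int next_pc = (cur.op == OP_JMP) ? cur.target : pc + 1;
            prefetched  = mem[next_pc];

            printf("executed instruction at %d\n", pc);
            pc = next_pc;
        }
        printf("halted at %d\n", pc);
        return 0;
    }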
- Cache
- Small memory, for instructions only, built into the processor
- Same speed as the processor
- Principle of locality - we will execute instructions close to
the last instruction we executed.
- This works with linear code
- But also with loops
- A cache hit occurs when an instruction we want is in the cache
- A cache miss occurs otherwise.
- Cache miss is expensive
- We must go to main memory to get the instruction
- This will cause wait states, or a processor stall
- Caches can be set up in a number of ways
- Fully Associative - the block's address (tag) and data are stored together, so a block can go in any cache slot
- Direct Mapped - each memory address maps to exactly one slot in the cache
- N-way set associative - each address can go to one of
N slots.
- Each of these has different performance characteristics; a direct-mapped lookup is sketched below.
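- A minimal sketch in C of a direct-mapped lookup, with arbitrary example sizes (64 lines of 16 bytes): the address splits into tag / index / offset, each address maps to exactly one line, and a hit means the stored tag matches. Note how the last reference misses again because 0x4000 evicted it (a conflict miss a set-associative cache could avoid).

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_BYTES 16u                    /* example block size */
    #define NUM_LINES  64u                    /* example number of cache lines */

    struct line { bool valid; uint32_t tag; };

    static struct line cache[NUM_LINES];

    /* Returns true on a hit; on a miss, installs the new tag (the trip to
       main memory is where the wait states / stall would occur). */
    static bool access_cache(uint32_t addr)
    {
        uint32_t index = (addr / LINE_BYTES) % NUM_LINES;   /* the one slot it can use */
        uint32_t tag   = (addr / LINE_BYTES) / NUM_LINES;

        if (cache[index].valid && cache[index].tag == tag)
            return true;                      /* hit: data is already on chip */

        cache[index].valid = true;            /* miss: fetch block, then keep it */
        cache[index].tag   = tag;
        return false;
    }

    int main(void)
    {
        uint32_t trace[] = { 0x0000, 0x0004, 0x0008, 0x4000, 0x0000 };
        for (int i = 0; i < 5; i++)
            printf("0x%04x -> %s\n", trace[i],
                   access_cache(trace[i]) ? "hit" : "miss");
        return 0;
    }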
- We also need to worry about when to change entries in the cache
- LRU - least recently used
- FIFO
- Random
- These differ in performance, overhead, and ease of implementation (an LRU sketch follows below).
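- A minimal sketch in C of LRU within a single N-way set (4 ways, an arbitrary example): each way carries a last-used timestamp, a hit refreshes it, and a miss replaces the way with the oldest timestamp.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4                            /* example associativity */

    struct way { bool valid; uint32_t tag; unsigned last_used; };

    static struct way set[WAYS];
    static unsigned now;                      /* logical time, bumped per access */

    static bool access_set(uint32_t tag)
    {
        int victim = 0;
        now++;
        for (int i = 0; i < WAYS; i++) {
            if (set[i].valid && set[i].tag == tag) {
                set[i].last_used = now;       /* hit: refresh its timestamp */
                return true;
            }
            if (!set[i].valid || set[i].last_used < set[victim].last_used)
                victim = i;                   /* remember the empty or least recently used way */
        }
        set[victim] = (struct way){ true, tag, now };   /* miss: replace the LRU way */
        return false;
    }

    int main(void)
    {
        uint32_t tags[] = { 1, 2, 3, 4, 1, 5, 2 };      /* example reference string */
        for (int i = 0; i < 7; i++)
            printf("tag %u -> %s\n", tags[i], access_set(tags[i]) ? "hit" : "miss");
        return 0;
    }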
- Data Caching
- Like instruction cache
- But we need to worry about writes (the two common policies are compared in the sketch below)
- Write Through - data is written to cache and to memory
- Write Back - data is written to cache only, and only written
to memory when the cache block is replaced.
- Again there are a number of considerations here, but we won't
worry about them too much. (Until next Fall)
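- A minimal sketch in C contrasting the two write policies on the same sequence of stores to one cached block (the counts are illustrative only): write-through sends every store to memory, while write-back only dirties the block and writes it out once, when the block is replaced.

    #include <stdio.h>
    #include <stdbool.h>

    int main(void)
    {
        const int stores_to_block = 10;   /* example: 10 stores hit the same block */

        /* Write-through: every store also goes to main memory. */
        int wt_mem_writes = stores_to_block;

        /* Write-back: stores only set a dirty bit; memory sees one write,
           when the dirty block is finally replaced. */
        bool dirty = false;
        int wb_mem_writes = 0;
        for (int i = 0; i < stores_to_block; i++)
            dirty = true;                 /* store hits in the cache, no memory traffic */
        if (dirty)
            wb_mem_writes++;              /* eviction writes the whole block back once */

        printf("write-through: %d memory writes\n", wt_mem_writes);
        printf("write-back:    %d memory write(s), at eviction\n", wb_mem_writes);
        return 0;
    }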
- Caches can be set up at different levels and speeds.
- The further from the processor, the bigger and slower the cache (a rough multi-level access-time estimate is sketched below).
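- A rough sketch in C of the average access time across two cache levels; all latencies and miss rates are assumed example numbers, not figures for the processors listed below.

    #include <stdio.h>

    int main(void)
    {
        double l1_time = 1.0,  l1_miss = 0.05;   /* small, fast, next to the CPU */
        double l2_time = 10.0, l2_miss = 0.20;   /* bigger and slower */
        double mem_time = 100.0;                 /* main memory, slower still */

        /* average = L1 hit time + L1 miss rate * (L2 time + L2 miss rate * memory time) */
        double avg = l1_time + l1_miss * (l2_time + l2_miss * mem_time);
        printf("average access time: %.2f cycles\n", avg);   /* 1 + 0.05 * 30 = 2.50 */
        return 0;
    }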
- Some processor information
- P4 - 1.5 to 2.4 GHz
- 512 KB cache
- 8-way set associative
- 20-stage pipeline
- UltraSparc III
- 700 to 900 MHz
- 32 KB instruction cache
- 64 KB data cache
- 8 MB external cache
- HP PA-8700
- 750 MHz
- 0.75 MB data cache
- 1.5 MB instruction cache
Machine        | Processor clock | SPECint 2000 | SPECfp 2000 |
Sun Blade 2050 | 900 MHz         | 537          | 610         |
HP 3700        | 750 MHz         | 568          | 604         |
Dell 340       | 2.2 GHz         | 790          | 811         |
- Newer (2003) cache information
- 3.06 GHz Xeon
- I don't completely trust the source for this one.
- I-cache: execution trace cache (12K micro-ops)
- L2: 512 KB, 8-way
- L1 D-cache: 8 KB, 4-way
- SPECint 2000: 1099
- SPECfp 2000: 1053
- 1 GHz Alpha
- 64 KB I-cache, 64 KB D-cache
- 8 MB L2
- SPECint 2000: 795
- SPECfp 2000: 1124
- 1.2 GHz UltraSparc III Cu
- 32 KB I-cache
- 64 KB D-cache
- 8 MB L2
- SPECint 2000: 642
- SPECfp 2000: 953