

## Lecture 10: Cache I

## **CS10014 Computer Organization**

Tsung Tai Yeh, Department of Computer Science, National Yang Ming Chiao Tung University



# Acknowledgements and Disclaimer

- Slides were developed with reference to
  - CS 61C at UC Berkeley
    - https://inst.eecs.berkeley.edu/~cs61c/sp23/
  - CS252 at ETHZ
    - https://safari.ethz.ch/digitaltechnik/spring2023
  - CIS510 at Upenn
    - https://www.cis.upenn.edu/~cis5710/spring2019/



## Outline

- Memory Hierarchy
- Memory Caching
- Cache Basics
- Direct-Mapped Cache
- Read Data in Direct-Mapped Cache
- Direct-Mapped Cache Hardware



## Types of Memory

### • Static RAM (SRAM)

- <u>6 or 8 transistors per bit</u>
- Two inverters (4 transistors) + transistors for reading/writing
- Optimized for speed (first) and density (second)
- Fast (sub-nanosecond latencies for small SRAM)
  - Latency grows with array size (roughly proportional to sqrt(number of bits))
- Mixes well with standard processor logic



## Types of Memory

### • Dynamic RAM (DRAM)

- <u>1 transistor + 1 capacitor per bit</u>
- Optimized for density (in terms of cost per bit)
- Slow (> 30 ns internal access, ~50 ns pin-to-pin)
- Different fabrication steps (does not mix well with logic)
- Nonvolatile storage: Magnetic disk, Flash RAM, Phase-change memory, ...



## Memory & Storage Technologies

- **Cost**: what can \$200 buy (2009)?
  - SRAM: 16MB
  - DRAM: 4,000MB (4GB), 250x cheaper than SRAM
  - Flash: 64,000MB (64GB), 16x cheaper than DRAM
  - Disk: 2,000,000MB (2TB), 32x cheaper than Flash (512x vs. DRAM)

### • Latency

- SRAM: < 1 to 2ns (on chip)
- DRAM: ~50ns, 100x or more slower than SRAM
- Flash: 75,000ns (75 microseconds), 1500x slower than DRAM
- Disk: 10,000,000ns (10ms), 133x slower than Flash (200,000x vs. DRAM)



## Memory & Storage Technologies

### • Bandwidth

- SRAM: 300GB/sec (e.g., 12-port 8-byte register file @ 3GHz)
- DRAM: ~25GB/s
- Flash: 0.25GB/s (250MB/s), 100x less than DRAM
- Disk: 0.1GB/s (100MB/s), 250x vs DRAM, sequential access only



- Problems in memories
  - Bigger is slower
    - Bigger -> takes longer to determine the location
  - Faster is more expensive
    - SRAM vs. DRAM vs. SSD vs. Disk vs. Tape
  - Higher bandwidth is more expensive
    - Need more banks, more ports, more channels, higher frequency or faster technology



• Why memory hierarchy?

- We want both fast and large
- But, we cannot achieve both with a single level of memory
- Idea: Have multiple levels of storage
  - Bigger and slower as the levels are farther from the processor
  - Ensure most of the data the processor needs is kept in the fast level















Kim & Mutlu, "Memory Systems," Computing Handbook, 2014 https://people.inf.ethz.ch/omutlu/pub/memory-systems-introduction\_computing-handbook14.pdf



## The Unit: Cache

### Cache: hardware managed

- Hardware automatically retrieves missing data
- Built from fast SRAM, usually on-chip today
- In contrast to off-chip, DRAM "main memory"

### Cache organization

- Speed vs. Capacity
- Miss classification





## Memory Locality

- Cache contains copies of data in memory being used
- Memory contains copies of data on disk being used
- Caches work on principles of temporal and spatial locality
  - <u>Temporal locality</u>: if we use it now, chances are we'll want to use it again soon
    - Data elements accessed <u>in loops</u> (same data elements are accessed multiple times)
  - <u>Spatial locality</u>: if we use a piece of memory, chances are we'll use the neighboring pieces soon
    - Data elements accessed <u>in arrays</u> (each time a different element, or just the next element, is accessed)



## Exploiting Locality: Memory Hierarchy



- Hierarchy of memory components
  - Upper components
    - Fast $\leftrightarrow$ Small $\leftrightarrow$ Expensive
  - Lower components
    - Slow $\leftrightarrow$ Big $\leftrightarrow$ Cheap
- Connected by "buses"
  - Which also have latency and bandwidth issues
- Most frequently accessed data in M1
  - M1 + next most frequently accessed in M2, etc.
  - Move data up-down hierarchy
- Optimize average access time
  - latency<sub>avg</sub>= latency<sub>hit</sub> + (%<sub>miss</sub>\* latency<sub>miss</sub>)
  - Attack each component



National Yang Ming Chiao Tung University Computer Architecture & System Lab

## Concrete Memory Hierarchy



- 0th level: Registers
- 1st level: Primary caches
  - Split instruction (I\$) and data (D\$)
  - Typically 8KB to 64KB each
- 2nd level: 2<sup>nd</sup> and 3<sup>rd</sup> level caches (L2, L3)
  - On-chip, typically made of SRAM
  - 2<sup>nd</sup> level cache typically ~256KB to 512KB
  - "Last level cache" typically 4MB to 16MB
- 3rd level: main memory
  - Made of DRAM ("Dynamic" RAM)
  - Typically 1GB to 4GB for desktops/laptops
    - Servers can have 100s of GB
- 4th level: disk (swap and files)
  - Uses magnetic disks or flash drives




### Evolution of Cache Hierarchies



Intel 486

Intel Core i7 (quad core)

Chips today are 30–70% cache by area



#### Cache



Kim & Mutlu, "Memory Systems," Computing Handbook, 2014 https://people.inf.ethz.ch/omutlu/pub/memory-systems-introduction\_computing-handbook14.pdf



### A key question

- How to <u>map chunks</u> of the main memory address space to blocks in the cache?
- Which location in cache can a given "main memory chunk" be placed in?






- Main memory logically divided into fixed-size chunks (blocks)
- Cache can house only a limited number of blocks
- Each block address maps to a potential location in the cache, determined by the index bits in the address, which are used to index into the tag and data stores

8-bit address fields: tag | index | byte in block

Cache access:

1) index into the tag and data stores with index bits in address
2) check valid bit in tag store
3) compare tag bits in address with the stored tag in tag store

- If the stored tag is valid and matches the tag of the block, then the block is in the cache (cache hit)
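The three access steps can be sketched in Python as follows. This is a minimal illustration, not a description of real hardware: the field widths (2-bit tag, 3-bit index, 3-bit byte-in-block for an 8-bit address) and the names `TagEntry`, `split`, and `is_hit` are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed field widths for an 8-bit address (not stated on this slide).
OFFSET_BITS = 3  # byte in block: 8-byte blocks
INDEX_BITS = 3   # 8 entries in the tag and data stores


@dataclass
class TagEntry:
    valid: bool = False
    tag: int = 0


def split(addr):
    """Split an address into (tag, index, byte-in-block)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def is_hit(tag_store, addr):
    tag, index, _ = split(addr)
    entry = tag_store[index]   # 1) index into the tag store
    if not entry.valid:        # 2) check the valid bit
        return False
    return entry.tag == tag    # 3) compare the stored tag with the address tag


tag_store = [TagEntry() for _ in range(1 << INDEX_BITS)]
tag_store[2] = TagEntry(valid=True, tag=0b01)

print(is_hit(tag_store, 0b01_010_110))  # True: entry 2 is valid and tags match
print(is_hit(tag_store, 0b11_010_110))  # False: same index, different tag
print(is_hit(tag_store, 0b01_011_000))  # False: entry 3 is not valid
```

In hardware the tag store and data store are read in parallel; the sequential `if` statements here only mirror the order in which the steps are listed.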



- Block (line): Unit of storage in the cache
  - Memory is logically divided into blocks that map to potential locations in the cache
- When reading memory, 3 things can happen
  - Cache HIT:
    - Cache block is valid and contains proper address, so read desired word
  - Cache MISS:
    - Nothing in cache in appropriate block, so fetch from memory
  - Cache miss, block replacement
    - Wrong data is in cache at appropriate block, so discard it and fetch desired data from memory







8-bit address

- Each block address maps to a potential location in the cache, determined by the index bits in the address
- Index
  - <u>Specifies the cache index</u> (which "row"/block of the cache we should look in)
- Offset
  - Once we've found correct block, <u>specifies which byte within the block</u> we want
- Tag
  - The remaining bits after offset and index are determined
  - These are used to distinguish between all the memory addresses that map to the same location







### Cache associativity

• One set can contain multiple cache blocks



Kim & Mutlu, "Memory Systems," Computing Handbook, 2014



## Logical Cache Organization

- Cache is a hardware hashtable
- The setup
  - 32-bit ISA  $\rightarrow$  4B words/addresses, 2<sup>32</sup> B address space
- Logical cache organization
  - 4KB, organized as 1K 4B blocks (aka lines)
  - Each block can hold a 4-byte word
- Physical cache implementation
  - 1K (1024 rows) by 4B **SRAM**
  - Called data array
  - 10-bit address input
  - 32-bit data input/output





## Looking Up A Block

- A byte-addressable main memory
  - 256 bytes, 8-byte blocks -> 32 blocks in memory
  - Assume cache: 64 bytes, 8 blocks








## Looking Up A Block

- Q: which 10 of the 32 address bits to use?
- A: bits [11:2]
  - 2 least significant (LS) bits [1:0] are the offset bits
    - Locate byte within word
    - Don't need these to locate word
  - Next 10 LS bits [11:2] are the index bits
    - These locate the word
    - Nothing says index must be these bits
    - But these work best in practice
    - Why? (think about it)





## Is this the block you're looking for?

- Each cache row corresponds to 2<sup>20</sup> blocks
  - How to know which if any is currently there?
  - Tag each cache word with remaining address bits [31:12]
- Build separate and parallel tag array
  - 1K by 21-bit SRAM
  - 20-bit tag + 1 valid bit
- Lookup algorithm
  - Read tag indicated by index bits [11:2]
  - If tag matches address bits [31:12] and valid bit is set: Hit → data is good
  - Else: Miss → data is garbage, wait...





## Is this the block you're looking for?

- Lookup address x000C14B8
  - Index = addr [11:2] = (addr >> 2) & x3FF = x12E
  - Tag = addr [31:12] = (addr >> 12) = x000C1
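The shifts and masks above can be checked with a few lines of Python. This is only a worked calculation for the 4KB, 4B-block organization described earlier; the variable names are illustrative.

```python
addr = 0x000C14B8

offset = addr & 0x3           # bits [1:0]: byte within the 4B word
index = (addr >> 2) & 0x3FF   # bits [11:2]: one of 1024 cache rows
tag = addr >> 12              # bits [31:12]: remaining 20 bits

print(hex(offset), hex(index), hex(tag))  # 0x0 0x12e 0xc1

# The three fields together reconstruct the original address.
assert (tag << 12) | (index << 2) | offset == addr
```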






## Handling a Cache Miss

- What if requested data isn't in the cache?
  - How does it get in there?
- Cache controller: finite state machine
  - Remembers miss address
  - Accesses next level of memory
  - Waits for response
  - Writes data/tag into proper locations
- Bringing a missing block into the cache is a cache fill



### Cache Misses and Pipeline Stalls



- I\$ and D\$ misses stall pipeline just like data hazards
  - Stall logic driven by miss signal
    - Cache "logically" re-evaluates hit/miss every cycle
    - Block is filled  $\rightarrow$  miss signal de-asserts  $\rightarrow$  pipeline restarts



### **Cache Misses**

### Types of Misses

- **Compulsory**: First time data is accessed
- **Capacity**: cache too small to hold all data of interest
- **Conflict**: data of interest maps to same location in cache
- Miss penalty: time it takes to retrieve a block from lower level of hierarchy



## **Cache Performance Equation**



- For a cache
  - Access: read or write to cache
  - Hit: desired data found in cache
  - Miss: desired data not found in cache
    - Must get from another component
    - No notion of "miss" in register file
  - Fill: action of placing data into cache
  - %<sub>miss</sub> (miss-rate): #misses / #accesses
  - t<sub>access</sub>: time to check cache. If hit, we're done.
  - t<sub>miss</sub>: time to read data into cache
- Performance metric: average access time

 $\mathbf{t}_{\text{avg}} = \mathbf{t}_{\text{access}} + (\%_{\text{miss}} * \mathbf{t}_{\text{miss}})$ 
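As a small numeric illustration of the equation, the sketch below evaluates t<sub>avg</sub> for two made-up configurations. All numbers and the function name `average_access_time` are hypothetical and chosen only to show how the terms interact.

```python
def average_access_time(t_access, miss_rate, t_miss):
    """t_avg = t_access + miss_rate * t_miss (all times in cycles)."""
    return t_access + miss_rate * t_miss


# Hypothetical small cache: fast to access, but misses more often.
small = average_access_time(t_access=1, miss_rate=0.05, t_miss=20)
# Hypothetical larger cache: fewer misses, but slower to access.
large = average_access_time(t_access=2, miss_rate=0.02, t_miss=20)

print(f"{small:.2f} cycles")  # 2.00 cycles
print(f"{large:.2f} cycles")  # 2.40 cycles
```

With these particular numbers the lower miss rate does not pay for the slower access, which is why both terms have to be considered together.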



## Cache Performance Equation

- Cache hit rate = (# hits) / (# hits + # misses) = (# hits) / (# accesses)
- Average memory access time (AMAT)
  - AMAT = (hit-rate \* hit-latency) + (miss-rate \* miss-latency)
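The AMAT form agrees with the earlier t<sub>avg</sub> equation under one assumption that the slides do not state explicitly: hit-latency equals t<sub>access</sub>, and miss-latency is the full time of a missing access, t<sub>access</sub> + t<sub>miss</sub>. With hit-rate = 1 − %<sub>miss</sub>:

$$
\begin{aligned}
\text{AMAT} &= (1 - \%_{\text{miss}})\, t_{\text{access}} + \%_{\text{miss}}\,(t_{\text{access}} + t_{\text{miss}}) \\
&= t_{\text{access}} + \%_{\text{miss}}\, t_{\text{miss}} = t_{\text{avg}}
\end{aligned}
$$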





## **CPI** Calculation with Cache Misses

#### • Parameters

- Simple pipeline with base CPI of 1
- Instruction mix: 30% loads/stores
- I\$: %<sub>miss</sub> = 2%, t<sub>miss</sub> = 10 cycles
- D\$:  $\%_{miss} = 10\%$ ,  $t_{miss} = 10$  cycles
- What is new CPI?
  - $CPI_{I\$} = \%_{missI\$} * t_{miss} = 0.02*10 \text{ cycles} = 0.2 \text{ cycle}$
  - $CPI_{D\$} = \%_{load/store} * \%_{missD\$} * t_{missD\$} = 0.3 * 0.1*10 \text{ cycles} = 0.3 \text{ cycle}$
  - $CPI_{new} = CPI + CPI_{I\$} + CPI_{D\$} = 1+0.2+0.3 = 1.5$
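The same calculation, written out in Python so the parameters can be varied. The variable names are illustrative; the numbers are the ones given above.

```python
base_cpi = 1.0
load_store_fraction = 0.30                      # 30% of instructions access the D$
icache_miss_rate, icache_miss_penalty = 0.02, 10
dcache_miss_rate, dcache_miss_penalty = 0.10, 10

# Every instruction is fetched, so the I$ term applies to all instructions.
cpi_icache = icache_miss_rate * icache_miss_penalty
# Only loads/stores access the D$.
cpi_dcache = load_store_fraction * dcache_miss_rate * dcache_miss_penalty
cpi_new = base_cpi + cpi_icache + cpi_dcache

print(f"{cpi_icache:.1f} {cpi_dcache:.1f} {cpi_new:.1f}")  # 0.2 0.3 1.5
```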



## Multi-Word Cache Blocks

- In most modern implementations, each cache block stores the data for more than one address (more than 1 byte).
- The number of bytes or words stored in each cache block is referred to as the **block size**.
- The entries in each block come from a contiguous set of addresses to exploit locality of reference, and to simplify indexing



## Cache Examples

- 4-bit addresses  $\rightarrow$  16B memory
  - Simpler cache diagrams than 32-bits
- 8B cache, 2B blocks
- Figure out number of sets: 4 (capacity / block-size)
- Figure out how address splits into offset/index/tag bits
  - Offset: least-significant $\log_2(\text{block-size}) = \log_2(2) = 1$ bit $\rightarrow$ 000**0**
  - Index: next $\log_2(\text{number-of-sets}) = \log_2(4) = 2$ bits $\rightarrow$ 0**00**0
  - Tag: rest = $4 - 1 - 2 = 1$ bit $\rightarrow$ **0**000
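The same bookkeeping can be written as a short Python helper. This is a sketch that assumes a direct-mapped cache with power-of-two sizes; the function name `address_fields` is made up for the example.

```python
def address_fields(address_bits, capacity_bytes, block_bytes):
    """Return (number of sets, offset bits, index bits, tag bits).

    Assumes a direct-mapped cache whose capacity and block size are powers of two.
    """
    num_sets = capacity_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = address_bits - offset_bits - index_bits
    return num_sets, offset_bits, index_bits, tag_bits


print(address_fields(4, 8, 2))  # (4, 1, 2, 1): the 8B cache with 2B blocks
print(address_fields(4, 8, 4))  # (2, 2, 1, 1): same capacity with 4B blocks
```

Doubling the block size at fixed capacity moves one bit from the index to the offset and leaves the tag width unchanged.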


#### 4-bit Address, 8B Cache, 2B Blocks

Main memory:

| Address | Data |
|---------|------|
| 0000    | A    |
| 0001    | B    |
| 0010    | C    |
| 0011    | D    |
| 0100    | E    |
| 0101    | F    |
| 0110    | G    |
| 0111    | H    |
| 1000    | I    |
| 1001    | J    |
| 1010    | K    |
| 1011    | L    |
| 1100    | M    |
| 1101    | N    |
| 1110    | P    |
| 1111    | Q    |

Cache (address = tag (1 bit), index (2 bits), offset (1 bit)):

| Set (index) | Tag | Data at offset 0 | Data at offset 1 |
|-------------|-----|------------------|------------------|
| 00          | 0   | A                | B                |
| 01          | 0   | C                | D                |
| 10          | 0   | E                | F                |
| 11          | 0   | G                | H                |






#### **Capacity and Performance**

- Simplest way to reduce %<sub>miss</sub>: increase capacity
  - + Miss rate decreases monotonically
    - "Working set": insns/data program is actively using
    - Diminishing returns
  - However  $t_{\scriptscriptstyle access}$  increases
    - Latency proportional to sqrt(capacity)
  - t<sub>avg</sub>?



Cache Capacity

Given capacity, manipulate %<sub>miss</sub> by changing organization



#### **Block Size**

- Given capacity, manipulate  $\ensuremath{\%_{\text{miss}}}$  by changing organization
- One option: increase block size
  - Exploit spatial locality
  - Notice index/offset bits change
  - Tag bits remain the same
- Ramifications
  - + Reduce  $\%_{miss}$  (up to a point)
  - + Reduce tag overhead (why?)
  - − Potentially useless data transfer
  - − Premature replacement of useful data

512\*512bit SRAM





## Block Size and Tag Overhead

- 4KB cache with 1024 4B blocks
  - 4B blocks  $\rightarrow$  2-bit offset, 1024 frames  $\rightarrow$  10-bit index
  - 32-bit address − 2-bit offset − 10-bit index = 20-bit tag
  - 20-bit tag / 32-bit block = 63% overhead
- 4KB cache with 512 8B blocks
  - 8B blocks  $\rightarrow$  3-bit offset, 512 frames  $\rightarrow$  9-bit index
  - 32-bit address − 3-bit offset − 9-bit index = 20-bit tag
  - 20-bit tag / 64-bit block = 31% overhead
  - Notice: tag size is same, but data size is twice as big
- A realistic example: 64KB cache with 64B blocks
  - 64B blocks  $\rightarrow$  6-bit offset, 1024 frames  $\rightarrow$  10-bit index, so 16-bit tag
  - 16-bit tag / 512-bit block = ~3% overhead
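The overhead figures can be recomputed with a short Python sketch. It assumes 32-bit addresses and a direct-mapped cache with power-of-two sizes, and it counts only tag bits (the valid bit is ignored); `tag_overhead` is a name made up for the example.

```python
def tag_overhead(address_bits, capacity_bytes, block_bytes):
    """Return (tag bits, tag bits / data bits per block) for a direct-mapped cache."""
    offset_bits = block_bytes.bit_length() - 1
    index_bits = (capacity_bytes // block_bytes).bit_length() - 1
    tag_bits = address_bits - offset_bits - index_bits
    return tag_bits, tag_bits / (8 * block_bytes)


for capacity, block in [(4 * 1024, 4), (4 * 1024, 8), (64 * 1024, 64)]:
    tag_bits, overhead = tag_overhead(32, capacity, block)
    print(f"{capacity // 1024}KB cache, {block}B blocks: "
          f"{tag_bits}-bit tag, overhead ratio {overhead}")
```

The exact ratios are 0.625, 0.3125, and 0.03125 before rounding to percentages.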

#### Note: Tags are not optional



#### 4-bit Address, 8B Cache, 4B Blocks






## Effect of Block Size on Miss Rate

- Two effects on miss rate
  - + Spatial prefetching (good)
    - For blocks with adjacent addresses
    - Turns miss/miss into miss/hit pairs
  - − Interference (bad)
    - For blocks with non-adjacent addresses (but in adjacent frames)
    - Turns hits into misses by disallowing simultaneous residence
    - Consider entire cache as one big block
- Both effects always present
  - Spatial prefetching dominates initially
    - Depends on size of the cache
  - Good block size is 32–256B
    - Program dependent





## Block Size and Miss Penalty

- Does increasing block size increase t<sub>miss</sub>?
  - Don't larger blocks take longer to read, transfer, and fill?
  - They do, but...
- t<sub>miss</sub> of an isolated miss is not affected
  - Critical Word First / Early Restart (CWF/ER)
  - Requested word fetched first, pipeline restarts immediately
  - Remaining words in block transferred/filled in the background
- t<sub>miss</sub>'es of a cluster of misses will suffer
  - Reads/transfers/fills of two misses can't happen at the same time
  - Latencies can start to pile up
  - This is a bandwidth problem



#### **Cache Conflicts**





#### • Direct-mapped cache

- A given main memory block can be placed in only one possible location in the cache
- Toy example: 256-byte memory, 64-byte cache, 8-byte blocks







FIGURE 5.5 A direct-mapped cache with eight entries showing the addresses of memory words between 0 and 31 that map to the same cache locations. Because there are eight words in the cache, an address X maps to the direct-mapped cache word X modulo 8. That is, the low-order  $log_2(8) = 3$  bits are used as the cache index. Thus, addresses  $00001_{two}$ ,  $01001_{two}$ ,  $10001_{two}$ , and  $11001_{two}$  all map to entry  $001_{two}$  of the cache, while addresses  $00101_{two}$ ,  $01101_{two}$ ,  $10101_{two}$ , and  $11101_{two}$  all map to entry  $101_{two}$  of the cache.



#### • In a direct-mapped cache

- Multiple memory addresses map to the same cache index. How do we tell which one is in there?
- What if we have a block size > 1 byte?
- Ans: divide memory address into three fields

| ttttttttttttt                               | iiiiiiiii                   | oooo                              |
|---------------------------------------------|-----------------------------|-----------------------------------|
| tag<br>to check<br>if have<br>correct block | index<br>to select<br>block | byte<br>offset<br>within<br>block |



- A byte-addressable main memory
  - 256 bytes, 8-byte blocks -> 32 blocks in memory
  - Assume cache: 64 bytes, 8 blocks
  - Direct-mapped: A block can go to only one location








#### • Direct-mapped cache

- Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time
- One index -> one entry
- Can lead to 0% hit rate if more than one block that maps to the same index is accessed in an interleaved manner
  - Assume addresses A and B have the same index bits but different tag bits
  - A, B, A, B, A, B, A, B ... -> conflict in the cache index
  - All accesses are conflict misses



#### Direct-Mapped Cache Example

- Suppose we have 8B of data in a direct-mapped cache with 2-byte blocks
- Determine the size of the tag, index, and offset fields if we are using a 32-bit architecture
  - Offset
    - Need to specify correct byte within a block
    - Block contains 2 bytes = 2<sup>1</sup> bytes
    - Need 1 bit to specify correct byte



## Direct-Mapped Cache Example

- Suppose we have 8B of data in a direct-mapped cache with 2-byte blocks
  - Index (index into an "array of blocks")
    - Need to specify correct block in cache
    - \# blocks/cache = (bytes/cache) / (bytes/block) = (2<sup>3</sup> bytes/cache) / (2<sup>1</sup> bytes/block) = 2<sup>2</sup> blocks/cache
    - Need 2 bits to specify this many blocks



#### Direct-Mapped Cache Example

- Suppose we have 8B of data in a direct-mapped cache with 2-byte blocks
  - Tag: use remaining bits as tag
  - Tag length = address length − offset bits − index bits = 32 − 1 − 2 = 29 bits
  - The tag is the leftmost 29 bits of the memory address



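- Ex. 16 KB of data, direct-mapped, 4-word blocks
- Read 4 addresses
  - 0x00000014
  - 0x0000001C
  - 0x00000034
  - 0x00008014

#### Memory

| Address  | Value of word |
|----------|---------------|
| ...      | ...           |
| 00000010 | a             |
| 00000014 | b             |
| 00000018 | c             |
| 0000001C | d             |
| ...      | ...           |
| 00000030 | e             |
| 00000034 | f             |
| 00000038 | g             |
| 0000003C | h             |
| ...      | ...           |
| 00008010 |               |
| 00008014 | j             |
| 00008018 |               |
| 0000801C |               |
| ...      | ...           |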



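• 4 addresses divided into tag, index, and byte offset fields (16B blocks → 4-bit offset, 1024 blocks → 10-bit index, remaining 18 bits → tag)

| Address    | Tag | Index | Offset |
|------------|-----|-------|--------|
| 0x00000014 | 0   | 1     | 0x4    |
| 0x0000001C | 0   | 1     | 0xC    |
| 0x00000034 | 0   | 3     | 0x4    |
| 0x00008014 | 2   | 1     | 0x4    |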



- 16 KB direct-mapped cache, 16B blocks
  - <u>Valid bit</u>: determines whether anything is stored in that row (when computer initially turned on, all entries invalid)





#### Read Data in Direct-Mapped Cache

- Read 0x00000014
  - No valid data
  - Load that data into cache, setting tag, valid
  - Read from cache at offset, return word b

- Read 0x00000034
  - Tag field 000000000000000000, index field 0000000011, offset 0100
  - No valid data
  - Load that cache block, return word f

Cache state when reading 0x00000034, before the fill:

| Index | Valid | Tag | 0xc-f | 0x8-b | 0x4-7 | 0x0-3 |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 0   | d     | c     | b     | a     |
| 2     | 0     |     |       |       |       |       |
| 3     | 0     |     |       |       |       |       |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   | ...   |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |

- Read 0x00008014
  - Tag field 000000000000000010, index field 0000000001, offset 0100
  - Read cache block 1, data is valid
  - Cache block 1 tag does not match (0 != 2)
  - Miss, so replace block 1 with new data & tag
  - Return word j

Cache state when reading 0x00008014, before the replacement:

| Index | Valid | Tag | 0xc-f | 0x8-b | 0x4-7 | 0x0-3 |
|-------|-------|-----|-------|-------|-------|-------|
| 0     | 0     |     |       |       |       |       |
| 1     | 1     | 0   | d     | c     | b     | a     |
| 2     | 0     |     |       |       |       |       |
| 3     | 1     | 0   | h     | g     | f     | e     |
| 4     | 0     |     |       |       |       |       |
| 5     | 0     |     |       |       |       |       |
| 6     | 0     |     |       |       |       |       |
| 7     | 0     |     |       |       |       |       |
| ...   | ...   |     |       |       |       |       |
| 1022  | 0     |     |       |       |       |       |
| 1023  | 0     |     |       |       |       |       |



## **Takeaway Questions**

- What is the cache status when reading?
  - Read address 0x00000030?
  - Read address 0x0000001C?







## **Takeaway Questions**

- 0x00000030: a <u>hit</u>
  - Index = 3, tag matches, offset = 0, value = <u>e</u>
- 0x0000001C: a <u>miss</u>
  - Index = 1, tag mismatch, so replace from memory, offset = 0xc, value = <u>d</u>
- Read values must = memory values whether or not cached
  - 0x00000030 = e
  - 0x0000001C = d
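The whole read sequence, including the two takeaway reads, can be replayed with a small tag-only simulation. This is a sketch assuming the 16KB direct-mapped cache with 16B blocks from the example; the function `run` and the outcome labels are invented for illustration, and the trace order (the four listed addresses followed by the two takeaway reads) is an assumption.

```python
OFFSET_BITS, INDEX_BITS = 4, 10   # 16B blocks; 16KB / 16B = 1024 entries


def run(trace):
    """Replay reads and report (address, index, tag, outcome) for each one."""
    valid = [False] * (1 << INDEX_BITS)
    tags = [0] * (1 << INDEX_BITS)
    results = []
    for addr in trace:
        index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
        tag = addr >> (OFFSET_BITS + INDEX_BITS)
        if valid[index] and tags[index] == tag:
            outcome = "hit"
        else:
            # Distinguish an empty entry from one holding a different block.
            outcome = "miss (replace)" if valid[index] else "miss (fill)"
            valid[index], tags[index] = True, tag
        results.append((hex(addr), index, tag, outcome))
    return results


trace = [0x14, 0x1C, 0x34, 0x8014, 0x30, 0x1C]
results = run(trace)
for row in results:
    print(row)
```

The last two rows correspond to the takeaway questions: 0x30 hits in block 3, while 0x1C misses because block 1 now holds tag 2 after the read of 0x8014.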

















    # RISC-V assembly code
          addi t0, zero, 5
    loop: beq  t0, zero, done
          lw   t1, 0x4(zero)
          lw   t2, 0xC(zero)
          lw   t3, 0x8(zero)
          addi t0, t0, -1
          j    loop
    done:

*Miss Rate = 3/15 = 20%*

The three misses are compulsory misses in the first iteration; the remaining accesses hit because of temporal locality.







- Increase block size
  - Block size, b = 4 words
  - C = 8 words, direct mapped (1 block per set)
  - Number of blocks, B = C/b = 8/4 = 2






#### **Direct-Mapped Cache Hardware**






          addi t0, zero, 5
    loop: beq  t0, zero, done
          lw   t1, 0x4(zero)
          lw   t2, 0xC(zero)
          lw   t3, 0x8(zero)
          addi t0, t0, -1
          j    loop
    done:

*Miss Rate = 1/15 = 6.67%*

Larger blocks reduce compulsory misses through spatial locality
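Both miss rates can be reproduced by replaying the loop's loads through a tag-only direct-mapped cache model. This is a sketch under stated assumptions: 4-byte words, a capacity of 8 words in both cases, and only the three `lw` data accesses per iteration are counted; `miss_rate` is a name made up for the example.

```python
def miss_rate(trace, capacity_words, block_words, word_bytes=4):
    """Return (misses, accesses) for a direct-mapped cache."""
    num_blocks = capacity_words // block_words
    block_bytes = block_words * word_bytes
    tags = [None] * num_blocks           # None marks an invalid entry
    misses = 0
    for addr in trace:
        block = addr // block_bytes
        index, tag = block % num_blocks, block // num_blocks
        if tags[index] != tag:
            misses += 1
            tags[index] = tag            # fill the block on a miss
    return misses, len(trace)


trace = [0x4, 0xC, 0x8] * 5              # three loads per iteration, five iterations

print(miss_rate(trace, capacity_words=8, block_words=1))  # (3, 15)
print(miss_rate(trace, capacity_words=8, block_words=4))  # (1, 15)
```

With 4-word blocks, the first load brings in addresses 0x0 to 0xF, so the other two loads hit even in the first iteration.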





#### Conclusion

- We would like to have the capacity of disk at the speed of the processor: unfortunately this is not feasible
- So we create a memory hierarchy:
  - each level closer to the processor holds the "most used" data from the next level farther away
  - exploits temporal & spatial locality
  - do the common case fast, worry less about the exceptions
- Locality of reference is a Big Idea