

# Lecture 13: Super-Scalar + Out-of-Order

## **CS10014 Computer Organization**

Tsung Tai Yeh, Department of Computer Science, National Yang Ming Chiao Tung University



# Acknowledgements and Disclaimer

- Slides were developed with reference to
  - CS 61C at UC Berkeley
    - https://inst.eecs.berkeley.edu/~cs61c/sp23/
  - CS252 at ETHZ
    - https://safari.ethz.ch/digitaltechnik/spring2023
  - CIS 5710 at UPenn
    - https://www.cis.upenn.edu/~cis5710/spring2024/



# Outline

- Super-scalar Processor
- Out-of-order Processor
  - Dynamic Scheduling in Hardware



#### Parallelism

- Previously: pipeline-level parallelism
  - Work on execute of one instruction in parallel with decode of next
- Next: instruction-level parallelism (ILP)
  - Execute multiple independent instructions fully in parallel
- Then:
  - Static & dynamic scheduling
    - Extract much more ILP
  - Data-level parallelism (DLP)
    - Single-instruction, multiple data (one insn., four 64-bit adds)
  - Thread-level parallelism (TLP)
    - Multiple software threads running on multiple cores



### In-Order Super-scalar Pipelines



- Idea of instruction-level parallelism
- Superscalar hardware issues
  - Bypassing and register file
  - Stall logic
  - Fetch
- "Superscalar" vs VLIW/EPIC



# Flynn Bottleneck

- "Flynn bottleneck"
  - single issue performance limit is CPI = IPC = 1
  - hazards + overhead  $\Rightarrow$  CPI >= 1 (IPC <= 1)
  - diminishing returns from superpipelining [Hrishikesh paper!]
- solution: issue multiple instructions per cycle

|       | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-------|---|---|---|---|---|---|---|
| inst0 | F | D | X | M | W |   |   |
| inst1 | F | D | X | M | W |   |   |
| inst2 |   | F | D | X | M | W |   |
| inst3 |   | F | D | X | M | W |   |

- 1st superscalar: IBM America $\rightarrow$ RS/6000 $\rightarrow$ POWER1



# Instruction-level Parallelism (ILP)

- But consider:
  - ADD r1, r2 -> r3
  - ADD r4, r5 -> r6
  - Why not execute them *at the same time*? (We can!)
- What about:
  - SUB r1, r2 -> r3
  - SUB r4, r3 -> r6

- In this case, *dependences* prevent parallel execution
- What about three (or more!) instructions at a time?
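The two instruction pairs above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `Insn` record (not part of the slides) that lists source and destination registers; it tests only the read-after-write case discussed here.

```python
from typing import NamedTuple, Tuple

class Insn(NamedTuple):
    op: str
    srcs: Tuple[str, ...]  # source registers
    dst: str               # destination register

def raw_dependent(older: Insn, younger: Insn) -> bool:
    """True if the younger instruction reads the register the older one writes."""
    return older.dst in younger.srcs

add1 = Insn("ADD", ("r1", "r2"), "r3")
add2 = Insn("ADD", ("r4", "r5"), "r6")
sub1 = Insn("SUB", ("r1", "r2"), "r3")
sub2 = Insn("SUB", ("r4", "r3"), "r6")

print(raw_dependent(add1, add2))  # False: the two ADDs can execute together
print(raw_dependent(sub1, sub2))  # True: the second SUB must wait for r3
```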



### Multiple-Issue (Super-scalar) Pipeline



- Overcome this limit using **multiple issue** 
  - Also called superscalar
  - Two instructions per stage at once, or three, or four, or eight...
  - "Instruction-Level Parallelism (ILP)" [Fisher, IEEE TC'81]
- Today, typically 4-6 wide (AMD, ARM, Intel)
  - AMD Zen 3 is 4-wide
  - Intel Golden Cove is 6-wide



#### 5-Stage Dual-Issue Pipeline



- what is involved in
  - fetching two instructions per cycle?
  - decoding two instructions per cycle?
  - executing two ALU operations per cycle?
  - accessing the data cache twice per cycle?
  - writing back two results per cycle?
- what about 4 or 8 instructions per cycle?



# How Much ILP is There?

- The compiler tries to "schedule" code to avoid stalls
  - Even for scalar machines (to fill load-use delay slot)
  - Even harder to schedule multiple-issue (superscalar)
- How much ILP is common?
  - Greatly depends on the application
    - Consider memory copy
    - Unroll loop, lots of independent operations
  - Other programs, less so
- Even given unbounded ILP, superscalar has implementation limits
  - IPC (or CPI) vs clock frequency trade-off
  - Given these challenges, what is reasonable today?
    - ~4 instructions per cycle maximum



#### Super-scalar Pipeline Diagrams - Ideal

| scalar         | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|----------------|---|---|---|---|---|---|---|---|---|----|----|----|
| lw 0(r1)→r2    | F | D | X | M | W |   |   |   |   |    |    |    |
| lw 4(r1)→r3    |   | F | D | X | M | W |   |   |   |    |    |    |
| lw 8(r1)→r4    |   |   | F | D | X | M | W |   |   |    |    |    |
| add r14,r15→r6 |   |   |   | F | D | X | M | W |   |    |    |    |
| add r12,r13→r7 |   |   |   |   | F | D | X | M | W |    |    |    |
| add r17,r16→r8 |   |   |   |   |   | F | D | X | M | W  |    |    |
| lw 0(r18)→r9   |   |   |   |   |   |   | F | D | X | M  | W  |    |

| 2-way superscalar | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|-------------------|---|---|---|---|---|---|---|---|---|----|----|----|
| lw 0(r1)→r2       | F | D | X | M | W |   |   |   |   |    |    |    |
| lw 4(r1)→r3       | F | D | X | M | W |   |   |   |   |    |    |    |
| lw 8(r1)→r4       |   | F | D | X | M | W |   |   |   |    |    |    |
| add r14,r15→r6    |   | F | D | X | M | W |   |   |   |    |    |    |
| add r12,r13→r7    |   |   | F | D | X | M | W |   |   |    |    |    |
| add r17,r16→r8    |   |   | F | D | X | M | W |   |   |    |    |    |
| lw 0(r18)→r9      |   |   |   | F | D | X | M | W |   |    |    |    |



### Super-scalar Stalls

- invariant: stalls propagate upstream to younger instructions
- what if older instruction in issue "pair" (inst0) stalls?
  - younger instruction (inst1) stalls too, cannot pass it
- what if younger instruction (inst1) stalls?
  - can older instruction from next group (inst2) move up?





#### Super-scalar Pipeline Diagrams - Realistic

| scalar       | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|--------------|---|---|---|---|---|---|---|---|---|----|----|----|
| lw 0(r1)→r2  | F | D | X | M | W |   |   |   |   |    |    |    |
| lw 4(r1)→r3  |   | F | D | X | M | W |   |   |   |    |    |    |
| lw 8(r1)→r4  |   |   | F | D | X | M | W |   |   |    |    |    |
| add r4,r5→r6 |   |   |   | F | D | * | X | M | W |    |    |    |
| add r2,r3→r7 |   |   |   |   | F | * | D | X | M | W  |    |    |
| add r7,r6→r8 |   |   |   |   |   |   | F | D | X | M  | W  |    |
| lw 4(r8)→r9  |   |   |   |   |   |   |   | F | D | X  | M  | W  |

| 2-way superscalar (rigid pipe) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|--------------------------------|---|---|---|---|---|---|---|---|---|----|----|----|
| lw 0(r1)→r2                    | F | D | X | M | W |   |   |   |   |    |    |    |
| lw 4(r1)→r3                    | F | D | X | M | W |   |   |   |   |    |    |    |
| lw 8(r1)→r4                    |   | F | D | X | M | W |   |   |   |    |    |    |
| add r4,r5→r6                   |   | F | D | * | * | X | M | W |   |    |    |    |
| add r2,r3→r7                   |   |   | F | * | * | D | X | M | W |    |    |    |
| add r7,r6→r8                   |   |   | F | * | * | * | D | X | M | W  |    |    |
| lw 4(r8)→r9                    |   |   |   | F | * | * | * | D | X | M  | W  |    |




# Super-scalar Challenges – Front End

- Superscalar instruction fetch
  - Modest: fetch multiple instructions per cycle
  - Aggressive: buffer instructions and/or predict multiple branches
- Superscalar instruction decode
  - Replicate decoders
- Superscalar instruction issue
  - Determine when instructions can proceed in parallel
  - More complex stall logic: order N<sup>2</sup> for an *N*-wide machine
  - Not all combinations of types of instructions possible
- Superscalar register read
  - Port for each register read (4-wide superscalar → 8 read "ports")
  - Each port needs its own set of address and data wires
    - Latency & area $\propto \#ports^2$



# Challenges of Super-scalar Fetch

- What is involved in fetching multiple instructions per cycle?
- In same cache block? no problem
  - 64-byte cache block is 16 instructions (~4 bytes per instruction)
  - Favors larger block size (independent of hit rate)
- What if next instruction is last instruction in a block?
  - Fetch only one instruction that cycle
  - Or, some processors may allow fetching from 2 consecutive blocks
- What about taken branches?
  - How many instructions can be fetched on average?
  - Average number of instructions per taken branch?
    - Assume: 20% branches, 50% taken  $\rightarrow$  ~10 instructions
- Consider a 5-instruction loop with a 4-issue processor
  - Without smarter fetch, ILP is limited to 2.5 (not 4, which is bad)
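The two numbers above follow from simple arithmetic. The sketch below assumes a simplified fetch model in which fetch always stops at the taken branch that ends the loop body; the function names are illustrative only.

```python
def insns_per_taken_branch(branch_frac: float, taken_frac: float) -> float:
    """Average number of instructions between taken branches."""
    return 1.0 / (branch_frac * taken_frac)

def loop_fetch_rate(loop_len: int, width: int) -> float:
    """Average instructions fetched per cycle for a loop of loop_len
    instructions on a width-wide fetch unit that cannot fetch past
    the loop's taken branch in the same cycle."""
    cycles = -(-loop_len // width)  # ceiling division
    return loop_len / cycles

print(insns_per_taken_branch(0.20, 0.50))  # about 10 instructions
print(loop_fetch_rate(5, 4))               # 2.5
```

With a 5-instruction loop, the 4-wide machine fetches 4 instructions in one cycle and only 1 in the next, which gives the 2.5 figure.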



#### Wide Fetch



what is involved in fetching multiple instructions per cycle?

- if instructions are sequential...
  - and on same cache line  $\Rightarrow$  nothing really
  - and on different cache lines  $\Rightarrow$  banked I\$ + combining network
- if instructions are not sequential...
  - more difficult
  - two serial I\$ accesses (access1⇒predict target⇒access2)? no
- note: embedded branches OK as long as predicted NT
  - serial access + prediction in parallel
  - if prediction is T, discard serial part after branch



#### **Trace Cache**

- problem: low fetch utilization on taken branches
  - only fetch up to taken branch, remaining fetch slots lost
- trace cache: combine branch predictor with I\$
  - [Weiser+Peleg'95, Rotenberg+Bennett+Smith'96]
  - stores dynamic instruction sequences
    - tag: initial PC + directions of embedded branches
    - fetch from trace, but make sure that branch directions were ok
    - typically backed by I\$ (in case of trace cache miss)
  - used in Pentium 4
    - actually a decoded (μop) trace cache



#### Trace Cache Example

- instruction cache with 2 instrs per cache block

I\$

| tag | data        |
|-----|-------------|
| PC0 | inst0,inst1 |
| PC2 | inst2,inst3 |
| PC4 | inst4,inst5 |

|                       | 1  | 2 | 3 | 4 | 5 | 6 | 7 |
|-----------------------|----|---|---|---|---|---|---|
| inst0 (beq r1, inst4) | F  | D | X | M | W |   |   |
| inst4                 | f* | F | D | X | M | W |   |
| inst5                 |    | F | D | X | M | W |   |
| inst6                 |    |   | F | D | X | M | W |

- trace cache with 2 instrs per cache block

T\$

| tag   | data        |
|-------|-------------|
| PC0:T | inst0,inst4 |
| PC2:- | inst2,inst3 |
| PC5:- | inst5,inst6 |

|                       | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|-----------------------|---|---|---|---|---|---|---|
| inst0 (beq r1, inst4) | F | D | X | M | W |   |   |
| inst4                 | F | D | X | M | W |   |   |
| inst5                 |   | F | D | X | M | W |   |
| inst6                 |   | F | D | X | M | W |   |



#### Wide Decode



what is involved in decoding N instructions per cycle?

- actually decoding instructions?
  - + easy if fixed length instructions (multiple decoders)
  - harder (but possible) if variable length
- reading input register values?
  - 2N register read ports (register file read latency ~2N)
  - + actually less than 2N, since most values come from bypasses
- what about the stall logic to enforce RAW dependences?



### N<sup>2</sup> Dependence Cross-Check

- remember stall logic for single issue pipeline
  - rs1(D) == rd(D/X) || rs1(D) == rd(X/M) || rs1(D) == rd(M/W)
  - same for rs2(D)
  - + full-bypassing reduces to rs1(D) == rd(D/X) && op(D/X) == LOAD
- doubling issue width (N) quadruples stall logic!
  - not only 2 instructions in D, but two instructions in every stage
  - (rs1(D<sub>1</sub>) == rd(D/X<sub>1</sub>) && op(D/X<sub>1</sub>) == LOAD) ||
  - (rs1(D<sub>1</sub>) == rd(D/X<sub>2</sub>) && op(D/X<sub>2</sub>) == LOAD)
  - repeat for rs1(D<sub>2</sub>), rs2(D<sub>1</sub>), rs2(D<sub>2</sub>)
  - also check dependence of 2nd instruction on 1st: rs1(D<sub>2</sub>) == rd(D<sub>1</sub>)
- "N<sup>2</sup> dependence cross-check"
  - for N-wide pipeline, stall (and bypass) circuits grow as N<sup>2</sup>
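To see the quadratic growth concretely, the sketch below counts register-name comparators under one assumed model: full bypassing, two source operands per instruction, a load-use check of every source in D against every destination in D/X, and an intra-group check of every source against each older instruction in the same group. The counting model is an illustration, not a figure taken from the slides.

```python
def count_comparators(n: int) -> dict:
    """Comparator count for an n-wide in-order pipeline (assumed model)."""
    load_use = 2 * n * n       # 2n sources in D x n destinations in D/X
    intra_group = n * (n - 1)  # 2 sources x n(n-1)/2 older/younger pairs
    return {"load_use": load_use,
            "intra_group": intra_group,
            "total": load_use + intra_group}

for width in (1, 2, 4, 8):
    print(width, count_comparators(width))
```

Going from 2-wide to 4-wide raises the total from 10 to 44 comparators in this model, roughly the 4x growth stated above.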



# What Checking Is Required?



- Plus checking for load-to-use stalls from prior *n* loads



#### Wide Execute



what is involved in executing N instructions per cycle?

- multiple execution units...N of every kind?
  - N ALUs? OK, ALUs are small
  - N FP dividers? no, FP dividers are huge (and fdiv is uncommon)
- typically have some mix (proportional to instruction mix)
  - RS/6000: 1 ALU/memory/branch + 1 FP
  - Pentium: 1 any + 1 ALU
  - Pentium II: 1 ALU/FP + 1 ALU + 1 load + 1 store + 1 branch
  - Alpha 21164: 1 ALU/FP/branch + 2 ALU + 1 load/store



#### N<sup>2</sup> Bypass



#### N<sup>2</sup> bypass network

- N+1 input muxes at each ALU input
- N<sup>2</sup> point-to-point connections
- Routing lengthens wires
- Heavy capacitive load
- And this is just one bypass stage (MX)!
  - There is also WX bypassing
  - Even more for deeper pipelines
- One of the big problems of superscalar
  - Why? On the critical path of single-cycle "bypass & execute" loop



# Not All N<sup>2</sup> Created Equal

- N<sup>2</sup> bypass vs. N<sup>2</sup> stall logic & dependence cross-check
  - Which is the bigger problem?
- N<sup>2</sup> bypass ... by far
  - 64-bit quantities (vs. 5-bit)
  - Multiple levels (MX, WX) of bypass (vs. 1 level of stall logic)
  - Must fit in one clock period with ALU (vs. not)
- Dependence cross-check not even 2nd biggest N<sup>2</sup> problem
  - Regfile is also an N<sup>2</sup> problem (think latency where N is #ports)
  - And also more serious than cross-check



# Mitigating N<sup>2</sup> Bypass & Register File



**Clustering**: mitigates N<sup>2</sup> bypass

- Group ALUs into K clusters
- Full bypassing within a cluster
- Limited bypassing between clusters
  - With 1 or 2 cycle delay
  - Can hurt IPC, but faster clock
- (N/K) + 1 inputs at each mux
- (N/K)<sup>2</sup> bypass paths in each cluster
- Steering: key to performance
  - Steer dependent insns to same cluster

**Cluster register file, too**

- Replicate a register file per cluster
- All register writes update all replicas
- Fewer read ports; only for cluster



# Mitigating N<sup>2</sup> RegFile with Clustering



- Clustering: split N-wide execution pipeline into K clusters
  - With centralized register file, 2N read ports and N write ports
- Clustered register file: extend clustering to register file
  - Replicate the register file (one replica per cluster)
  - Register file supplies register operands to just its cluster
  - All register writes go to all register files (keep them in sync)
  - Advantage: fewer read ports per register file!
    - K register files, each with 2N/K read ports and N write ports
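The clustering formulas from this slide and the previous one can be tabulated with a short helper. This is a sketch that assumes N is a multiple of K; the function and key names are illustrative.

```python
def clustered_costs(n: int, k: int) -> dict:
    """Per-cluster bypass and register-file costs for an n-wide machine
    split into k equal clusters (formulas as given on the slides)."""
    if n % k != 0:
        raise ValueError("this sketch assumes n is a multiple of k")
    per_cluster = n // k
    return {
        "mux_inputs": per_cluster + 1,                  # (N/K) + 1
        "bypass_paths_per_cluster": per_cluster ** 2,   # (N/K)^2
        "read_ports_per_regfile": 2 * per_cluster,      # 2N/K
        "write_ports_per_regfile": n,                   # all writes go to all replicas
    }

print(clustered_costs(4, 1))  # unclustered 4-wide
print(clustered_costs(4, 2))  # two 2-wide clusters
```

For a 4-wide machine, two clusters cut the read ports per register file from 8 to 4 and the bypass paths per cluster from 16 to 4, while the write ports stay at 4.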



### Wide Memory Access



what is involved in accessing memory for multiple instructions per cycle?

- multi-banked D\$
  - requires bank assignment and conflict-detection logic
- (rough) instruction mix: 20% loads, 15% stores
  - for width N, we need about 0.2\*N load ports, 0.15\*N store ports



### Wide Memory Access

- How to provide additional D\$ (D-cache) bandwidth?
  - Have already seen split I\$/D\$, but that gives you just one D\$ port
  - How to provide a second (maybe even a third) D\$ port?
- Option#1: multi-porting
  - + Most general solution, any two accesses per cycle
  - Lots of wires; expensive in latency, area (cost), and power
- Option#2: replication
  - Read from either replica, but writes update both replicas
    - Writing both ensures they have the same values
  - Multiplies read bandwidth only (writes must go to all replicas)
  - + General solution for loads, little latency penalty
  - Not a solution for stores; area and power penalty



### Wide Memory Access

- Option#3: banking (or interleaving)
  - Divide D\$ into banks (by address), one access per bank per cycle
  - Bank conflict: two accesses to the same bank -> one stalls
  - + No latency, area, power overheads
  - + One access per bank per cycle, assuming no conflicts
  - Complex stall logic -> address not known until execute stage
  - To support N accesses, need 2N+ banks to avoid frequent conflicts
- Which address bit(s) determine the bank?
  - Offset bits? Individual cache lines spread among different banks
    - + Fewer conflicts
    - Must replicate tags across banks, complex miss handling
  - Index bits? Banks contain complete cache lines
    - More conflicts
    - + Tags not replicated, simpler miss handling
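The two bank-selection choices can be compared with a small sketch. It assumes 64-byte cache lines (as in the fetch discussion), 8-byte words, and a power-of-two bank count; `bank_of` and `has_conflict` are hypothetical helper names.

```python
def bank_of(addr: int, n_banks: int, interleave: str = "index",
            line_bytes: int = 64, word_bytes: int = 8) -> int:
    """Pick the bank for a byte address."""
    if interleave == "index":
        # Low index bits: a whole cache line lives in one bank.
        return (addr // line_bytes) % n_banks
    # Offset bits: consecutive words of one line go to different banks.
    return (addr // word_bytes) % n_banks

def has_conflict(addrs, n_banks: int, interleave: str = "index") -> bool:
    """True if two accesses in the same cycle map to the same bank."""
    banks = [bank_of(a, n_banks, interleave) for a in addrs]
    return len(set(banks)) < len(banks)

same_line = [0x100, 0x108]  # two words in the same 64-byte line
print(has_conflict(same_line, 4, "index"))   # True
print(has_conflict(same_line, 4, "offset"))  # False
```

Two accesses to neighboring words of one line collide under index-bit banking but not under offset-bit banking, which is the "fewer conflicts" advantage listed above.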



#### Wide Writeback



what is involved in writing back multiple instructions per cycle?

- nothing too special, just another port on the register file
  - everything else is taken care of earlier in pipeline
- adding ports isn't free, though
  - increases area
  - increases access latency



National Yang Ming Chiao Tung University Computer Architecture & System Lab

# Very Long Instruction Word (VLIW)

- Hardware-centric multiple issue problems
  - Wide fetch+branch prediction, N<sup>2</sup> bypass, N<sup>2</sup> dependence checks
  - Hardware solutions have been proposed: clustering, trace cache
- Compiler-centric: very long insn word (VLIW)
  - Effectively, a 1-wide pipeline, but unit is an N-insn group
  - Compiler guarantees insns within a VLIW group are independent
    - If no independent insns, slots filled with nops
  - Group travels down pipeline as a unit
    - + Simplifies pipeline control (no rigid vs. fluid business)
    - + Cross-checks within a group unnecessary
    - Downstream cross-checks (maybe) still necessary
  - Typically "slotted": 1st insn must be ALU, 2nd mem, etc.
    - + Further simplification



# Very Long Instruction Word (VLIW)

VLIW: instructions that "encode" multiple operations.

The hardware executes the entire instruction "at once" on parallel functional units in the execute stage.

The "long instructions" encode the fact that the operations are <u>independent</u>. So, the hardware does not need to dynamically figure this out. This can save area and power and is now popular in embedded processors (e.g., Texas Instruments C6x series of DSPs).





# Very Long Instruction Word (VLIW)

- + Simpler instruction fetch
  - Fetch one bundle of instructions per cycle
- + Simpler dependence check logic
  - Compiler guarantees all instructions in bundle independent
- + Simpler branch prediction
  - Restrict to one branch per bundle
- By default, doesn't help bypasses or register file problems
- Compiler-visible clustering possible in VLIW
  - Each "lane" of VLIW has "local" registers (read/written by this lane)
  - A few "global" registers (R/W by any lane) are used to communicate between lanes



# Very Long Instruction Word (VLIW)

- Code density
  - Lots of "no-ops" in bundles
- Not compatible across machines of different widths
  - "Not compatible" could also mean programs would execute incorrectly
  - Or, "not compatible" can mean programs would execute slowly
- VLIW doesn't solve all problems
  - VLIW mainly targets dependence checking
    - Which isn't the worst N<sup>2</sup> problem in multiple-issue
  - Doesn't magically create ILP




# Multiple-Issue Implementations

- Statically-scheduled (in-order) superscalar
  - What we've talked about thus far
  - + Executes unmodified sequential programs
  - Hardware must figure out what can be done in parallel
  - E.g., Pentium (2-wide), UltraSPARC (4-wide), Alpha 21164 (4-wide)
- Very Long Instruction Word (VLIW)
  - Compiler identifies independent instructions, new ISA
  - + Hardware can be simple and perhaps lower power
  - E.g., Transmeta Crusoe (4-wide), most DSPs
  - Variant: Explicitly Parallel Instruction Computing (EPIC)
    - A bit more flexible encoding & some hardware to help compiler
    - E.g., Intel Itanium (6-wide)
- Dynamically-scheduled superscalar (next topic)
  - Hardware extracts more ILP by on-the-fly reordering
  - Intel Atom/Core/Xeon, AMD Opteron/Ryzen, some ARM A-series



#### Trends in Single-Processor Multiple Issue

|       | 486  | Pentium | PentiumII | Pentium4 | Itanium | ItaniumII | Core2 |
|-------|------|---------|-----------|----------|---------|-----------|-------|
| Year  | 1989 | 1993    | 1998      | 2001     | 2002    | 2004      | 2006  |
| Width | 1    | 2       | 3         | 3        | 3       | 6         | 4     |

- Issue width has saturated at 4-6 for high-performance cores
  - Canceled Alpha 21464 was 8-way issue
  - Not enough ILP to justify going to wider issue
  - Hardware or compiler *scheduling* needed to exploit 4-6 effectively
    - More on this in the next unit
- For high-performance *per watt* cores (say, smart phones)
  - Typically 2-wide superscalar (but increasing each generation)



# Multiple Issue Summary

- Multiple issue
  - Exploits insn level parallelism (ILP) beyond pipelining
  - Improves IPC, but perhaps at some clock & energy penalty
  - 4-6 way issue is about the peak issue width currently justifiable
    - Low-power implementations today typically 2-wide superscalar
- Problem spots
  - N<sup>2</sup> bypass & register file  $\rightarrow$  clustering
  - Fetch + branch prediction  $\rightarrow$  buffering, loop streaming, trace cache
  - N<sup>2</sup> dependency check  $\rightarrow$  VLIW/EPIC (but unclear how key this is)
- Implementations
  - Superscalar vs. VLIW/EPIC



# **Out-of-Order Processor**

- Dynamically-scheduled processors
  - Also called out-of-order processors
  - Hardware re-schedules insns...
  - ...within a sliding window of von Neumann insns
  - As with pipelining and superscalar, ISA unchanged
    - Same hardware/software interface, illusion of in-order
- Increases scheduling scope
  - Does loop unrolling transparently!
  - Uses branch prediction to "unroll" branches
- Examples:
  - first appeared in Pentium Pro (1995)
  - part of every smartphone/tablet/laptop/desktop/server chip




# In-Order Limitations

|                  | 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 | 10 | 11 | 12 |
|------------------|---|---|----|----|----|----|----|----|----|---|----|----|----|
| ld [r1] → r2     | F | D | X  | M1 | M2 | W  |    |    |    |   |    |    |    |
| add r2 + r3 → r4 | F | D | d* | d* | d* | X  | M1 | M2 | W  |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F | D  | d* | d* | d* | X  | M1 | M2 | W |    |    |    |
| ld [r7] → r4     |   | F | D  | p* | p* | p* | X  | M1 | M2 | W |    |    |    |

- In-order pipeline, three-cycle load-use penalty
  - 2-wide
- Why not the following?

|                      | 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 | 10 | 11 | 12 |
|----------------------|---|---|----|----|----|----|----|----|----|---|----|----|----|
| ld [r1] → r2         | F | D | X  | M1 | M2 | W  |    |    |    |   |    |    |    |
| add r2 + r3 → **r4** | F | D | d* | d* | d* | X  | M1 | M2 | W  |   |    |    |    |
| xor **r4** ^ r5 → r6 |   | F | D  | d* | d* | d* | X  | M1 | M2 | W |    |    |    |
| ld [r7] → **r4**     |   | F | D  | X  | M1 | M2 | W  |    |    |   |    |    |    |



# In-Order Limitations

|                  | 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 | 10 | 11 | 12 |
|------------------|---|---|----|----|----|----|----|----|----|---|----|----|----|
| ld [p1] → p2     | F | D | X  | M1 | M2 | W  |    |    |    |   |    |    |    |
| add p2 + p3 → p4 | F | D | d* | d* | d* | X  | M1 | M2 | W  |   |    |    |    |
| xor p4 ^ p5 → p6 |   | F | D  | d* | d* | d* | X  | M1 | M2 | W |    |    |    |
| ld [p7] → p8     |   | F | D  | p* | p* | p* | X  | M1 | M2 | W |    |    |    |

- In-order pipeline, three-cycle load-use penalty
  - 2-wide
- Why not the following:

|                      | 0 | 1 | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9 | 10 | 11 | 12 |
|----------------------|---|---|----|----|----|----|----|----|----|---|----|----|----|
| ld [p1] → p2         | F | D | X  | M1 | M2 | W  |    |    |    |   |    |    |    |
| add p2 + p3 → **p4** | F | D | d* | d* | d* | X  | M1 | M2 | W  |   |    |    |    |
| xor **p4** ^ p5 → p6 |   | F | D  | d* | d* | d* | X  | M1 | M2 | W |    |    |    |
| ld [p7] → **p8**     |   | F | D  | X  | M1 | M2 | W  |    |    |   |    |    |    |



## Out-of-Order Pipeline





# Out-of-Order Execution

- Also called "dynamic scheduling"
  - Done by the hardware on-the-fly during execution
- Looks at a "window" of instructions waiting to execute
  - Each cycle, picks the next ready instruction(s)
- Two steps to enable out-of-order execution:
  - Step #1: Register renaming (to avoid "false" dependencies)
  - Step #2: Dynamic scheduling (to enforce "true" dependencies)
- Key to understanding out-of-order execution:
  - Data dependencies



# **Dependence** Types

- RAW (Read After Write) = "true dependence" (true)
  - mul r0 \* r1 → **r2**
  - ...
  - add **r2** + r3 → r4
- WAW (Write After Write) = "output dependence" (false)
  - mul r0 \* r1 → **r2**
  - ...
  - add r1 + r3 → **r2**
- WAR (Write After Read) = "anti-dependence" (false)
  - mul r0 \* **r1** → r2
  - ...
  - add r3 + r4 → **r1**
- WAW & WAR are "false"; eliminate them via register renaming
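The three dependence types reduce to simple register-name comparisons. The sketch below assumes each instruction is represented as a hypothetical `(sources, destination)` pair and reuses the `mul`/`add` examples above.

```python
def classify(older, younger):
    """Return the set of dependence types from an older to a younger
    instruction; each instruction is a (sources, destination) pair."""
    o_srcs, o_dst = older
    y_srcs, y_dst = younger
    deps = set()
    if o_dst in y_srcs:
        deps.add("RAW")  # younger reads what older writes (true)
    if o_dst == y_dst:
        deps.add("WAW")  # both write the same register (false)
    if y_dst in o_srcs:
        deps.add("WAR")  # younger overwrites what older reads (false)
    return deps

mul = (("r0", "r1"), "r2")
print(classify(mul, (("r2", "r3"), "r4")))  # {'RAW'}
print(classify(mul, (("r1", "r3"), "r2")))  # {'WAW'}
print(classify(mul, (("r3", "r4"), "r1")))  # {'WAR'}
```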



# **Register Renaming**

- To eliminate register conflicts/hazards
- "Architected" vs "Physical" registers: a level of indirection
  - Names: r1, r2, r3
  - Locations: p1, p2, p3, p4, p5, p6, p7
  - Original mapping:  $r1 \rightarrow p1$ ,  $r2 \rightarrow p2$ ,  $r3 \rightarrow p3$ , p4-p7 are free



| Renamed insns | FreeList (before renaming the insn) |
|---------------|-------------------------------------|
| add p2,p3→p4  | p4,p5,p6,p7                         |
| sub p2,p4→p5  | p5,p6,p7                            |
| mul p2,p5→p6  | p6,p7                               |
| div p4,4→p7   | p7                                  |

- Renaming conceptually writes each physical register once
  - + Removes false dependences
  - + Leaves true dependences intact!



### Register Renaming Example

| original insns | renamed insns    |
|----------------|------------------|
| xor x3,x1,x2   | xor p1 ^ p2 → p6 |
| add x4,x3,x4   | add p6 + p4 → p7 |
| sub x3,x5,x2   | sub p5 - p2 → p8 |
| addi x1,x3,1   | addi p8 + 1 → p9 |

Map table (each arrow is a successive remapping):

| arch reg | phys reg     |
|----------|--------------|
| x1       | p1 → p9      |
| x2       | p2           |
| x3       | p3 → p6 → p8 |
| x4       | p4 → p7      |
| x5       | p5           |

Free list: p6, p7, p8, p9 are allocated in order; p10 remains free.



# Register Renaming Algorithm

- Two key data structures:
  - maptable[architectural\_reg] → physical\_reg
  - Free list: allocate & free registers
- Algorithm: at "decode" stage for each instruction:

```
insn.phys_input1 = maptable[insn.arch_input1]
insn.phys_input2 = maptable[insn.arch_input2]
insn.old_phys_output = maptable[insn.arch_output]
new_reg = new_phys_reg()
maptable[insn.arch_output] = new_reg
insn.phys_output = new_reg
```

- At "commit"
  - Once all older instructions have committed, free the old register: free\_phys\_reg(insn.old\_phys\_output)
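The renaming algorithm can be exercised on the four-instruction example from the previous slide. This is a minimal sketch: the `Renamer` class and its method names are illustrative, the free list is a simple FIFO, and running out of free registers (which would stall real hardware) is not handled.

```python
from collections import deque

class Renamer:
    def __init__(self, maptable, free_regs):
        self.maptable = dict(maptable)     # architectural -> physical
        self.free_list = deque(free_regs)  # free physical registers

    def rename(self, op, dst, srcs):
        """Rename one instruction at the decode stage."""
        phys_srcs = [self.maptable[s] for s in srcs]
        old_phys_dst = self.maptable[dst]    # freed when this insn commits
        new_reg = self.free_list.popleft()   # IndexError if empty (no stall modeled)
        self.maptable[dst] = new_reg
        return {"op": op, "srcs": phys_srcs, "dst": new_reg,
                "old_dst": old_phys_dst}

    def commit(self, renamed):
        """At commit, the previous mapping of the destination is free again."""
        self.free_list.append(renamed["old_dst"])

r = Renamer({f"x{i}": f"p{i}" for i in range(1, 6)},
            [f"p{i}" for i in range(6, 11)])
program = [("xor", "x3", ["x1", "x2"]),
           ("add", "x4", ["x3", "x4"]),
           ("sub", "x3", ["x5", "x2"]),
           ("addi", "x1", ["x3"])]        # the immediate is not renamed
renamed = [r.rename(op, dst, srcs) for op, dst, srcs in program]
for insn in renamed:
    print(insn["op"], insn["srcs"], "->", insn["dst"])
```

The output destinations are p6, p7, p8, p9, and the final map table has x1→p9, x3→p8, x4→p7, matching the example.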



# Dynamic Scheduling Overview



- Insns fetch/decode/rename into Instruction Buffer
  - Also called "instruction window" or "instruction scheduler"
- Insns (conceptually) check ready bits each cycle
  - Execute oldest "ready" instruction, set output as "ready"



# Dynamic Scheduling Algorithm

- Data structures:
  - Ready table[phys\_reg] → yes/no (part of "issue queue")
- Algorithm at "issue" stage (prior to read registers):
  - foreach instruction: if table[insn.phys\_input1] == ready && table[insn.phys\_input2] == ready, then insn is "ready"
  - select the oldest "ready" instruction
  - table[insn.phys\_output] = ready

- Multiple-cycle instructions? (such as loads)
  - For an insn with latency of N, set "ready" bit N-1 cycles in future
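A small simulation makes the effect of the ready table visible. The sketch below assumes a single-issue scheduler and a simplified timing model in which a dependent may issue `latency` cycles after its producer issues (the exact N-1 offset above depends on pipeline details that are not modeled). It uses the renamed load example from the "In-Order Limitations" slides with an assumed load latency of 3.

```python
def schedule(window, ready, latency, max_cycles=50):
    """Issue one instruction per cycle, oldest ready first.
    window: instructions in age order, each a dict with name/op/ins/out.
    ready: phys_reg -> bool. Returns a list of (cycle, name)."""
    window, ready = list(window), dict(ready)
    pending, issued, cycle = [], [], 0
    while window and cycle < max_cycles:
        # Outputs whose latency has elapsed become ready.
        for when, reg in pending:
            if when <= cycle:
                ready[reg] = True
        pending = [(w, r) for w, r in pending if w > cycle]
        for insn in window:                      # age order: oldest first
            if all(ready[r] for r in insn["ins"]):
                window.remove(insn)
                pending.append((cycle + latency[insn["op"]], insn["out"]))
                issued.append((cycle, insn["name"]))
                break
        cycle += 1
    return issued

window = [
    {"name": "ld1", "op": "ld",  "ins": ["p1"],       "out": "p2"},
    {"name": "add", "op": "add", "ins": ["p2", "p3"], "out": "p4"},
    {"name": "xor", "op": "xor", "ins": ["p4", "p5"], "out": "p6"},
    {"name": "ld2", "op": "ld",  "ins": ["p7"],       "out": "p8"},
]
ready = {f"p{i}": i % 2 == 1 for i in range(1, 9)}  # p1,p3,p5,p7 ready
order = schedule(window, ready, {"ld": 3, "add": 1, "xor": 1})
print(order)  # [(0, 'ld1'), (1, 'ld2'), (3, 'add'), (4, 'xor')]
```

The second load issues ahead of the stalled `add` and `xor`, which is the reordering an in-order pipeline cannot perform.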



# Dispatch

- Put renamed instructions into OoO structures
- Re-order buffer (ROB)
  - Holds instructions from Fetch through Commit
- Issue Queue
  - Central piece of scheduling logic
  - Holds instructions from Dispatch through Issue
  - Tracks ready inputs
    - Physical register names + ready bit
    - "AND" the bits to tell if ready





# **Dispatch Steps**

- Allocate Issue Queue (IQ) slot
  - Full? Stall
- Read ready bits of inputs
  - 1-bit per physical reg
- Clear ready bit of output in table
  - Instruction has not produced value yet
- Write data into Issue Queue (IQ) slot



- xor p1 ^ p2 → p6
- add p6 + p4 → p7
- sub p5 - p2 → p8
- addi p8 + 1 → p9

#### Issue Queue

| Insn | Inp1 | R | Inp2 | R | Dst | Bday |
|------|------|---|------|---|-----|------|
|      |      |   |      |   |     |      |
|      |      |   |      |   |     |      |
|      |      |   |      |   |     |      |
|      |      |   |      |   |     |      |

#### **Ready bits**






#### Issue Queue

| Insn | Inp1 | R | Inp2 | R | Dst | Bday |
|------|------|---|------|---|-----|------|
| xor  | p1   | y | p2   | y | p6  | 0    |
|      |      |   |      |   |     |      |
|      |      |   |      |   |     |      |
|      |      |   |      |   |     |      |

#### Ready bits







#### Ready bits

| Phys reg | Ready |
|----------|-------|
| p1       | y     |
| p2       | y     |
| p3       | y     |
| p4       | y     |
| p5       | y     |
| p6       | n     |
| p7       | n     |
| p8       | y     |
| p9       | y     |

#### Issue Queue

| Insn | Inp1 | R | Inp2 | R | Dst | Bday |
|------|------|---|------|---|-----|------|
| xor  | p1   | y | p2   | y | p6  | 0    |
| add  | p6   | n | p4   | y | p7  | 1    |
|      |      |   |      |   |     |      |
|      |      |   |      |   |     |      |




#### Issue Queue

| Insn | Inp1 | R | Inp2 | R | Dst | Bday |
|------|------|---|------|---|-----|------|
| xor  | p1   | y | p2   | y | p6  | 0    |
| add  | p6   | n | p4   | y | p7  | 1    |
| sub  | p5   | y | p2   | y | p8  | 2    |
|      |      |   |      |   |     |      |

#### **Ready bits**

| Phys reg | Ready |
|----------|-------|
| p1       | y     |
| p2       | y     |
| p3       | y     |
| p4       | y     |
| p5       | y     |
| p6       | n     |
| p7       | n     |
| p8       | n     |
| p9       | y     |




#### Issue Queue

| Insn | Inp1 | R | Inp2 | R | Dst | Bday |
|------|------|---|------|---|-----|------|
| xor  | p1   | y | p2   | y | p6  | 0    |
| add  | p6   | n | p4   | y | p7  | 1    |
| sub  | p5   | y | p2   | y | p8  | 2    |
| addi | p8   | n |      | y | p9  | 3    |

#### **Ready bits**

| p3 y<br>p4 y<br>p5 y<br>p6 n | p7 n<br>p8 n | _ |
|------------------------------|--------------|---|
| р3 у<br>р4 у                 |              |   |
| рЗ у                         | р5 у         |   |
|                              | р4 у         |   |
| р2 у                         | р3 у         |   |
| -                            | р2 у         |   |
| р1 у                         | р1 у         |   |



# Out-of-order Pipeline

- Execution (out-of-order) stages
- Select ready instructions
  - Send for execution
- Wakeup dependents





## Issue = Select + Wakeup

- Select oldest of "ready" instructions
  - "xor" is the oldest ready instruction below
  - "xor" and "sub" are the two oldest ready instructions below
  - May have resource constraints, e.g., can't do 2 loads

| Insn | Inp1 | R | Inp2 | R | Dst | Bday | Status |
|------|------|---|------|---|-----|------|--------|
| xor  | p1   | y | p2   | y | p6  | 0    | Ready! |
| add  | p6   | n | p4   | y | p7  | 1    |        |
| sub  | p5   | y | p2   | y | p8  | 2    | Ready! |
| addi | p8   | n |      | y | p9  | 3    |        |
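The select step above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the `IQEntry` class, the `select` function, and the `width` parameter are hypothetical names, and resource constraints (such as a limit on loads per cycle) are not modeled.

```python
from dataclasses import dataclass

@dataclass
class IQEntry:
    insn: str
    src1_ready: bool
    src2_ready: bool
    dst: str
    bday: int  # age stamp: a smaller value means an older instruction

def select(queue, width):
    """Pick up to `width` ready entries, oldest (smallest bday) first."""
    ready = [e for e in queue if e.src1_ready and e.src2_ready]
    return sorted(ready, key=lambda e: e.bday)[:width]

# Issue queue contents from the slide (addi has one register input,
# so its second input is treated as always ready).
queue = [
    IQEntry("xor", True, True, "p6", 0),
    IQEntry("add", False, True, "p7", 1),
    IQEntry("sub", True, True, "p8", 2),
    IQEntry("addi", False, True, "p9", 3),
]

one_wide = [e.insn for e in select(queue, 1)]
two_wide = [e.insn for e in select(queue, 2)]
print(one_wide, two_wide)
```

With a select width of 1 only `xor` is chosen; with a width of 2, `xor` and `sub` are chosen, matching the two cases listed on the slide.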



# Issue = Select + Wakeup

- Wakeup dependent instructions
  - Search for destination (Dst) in inputs & set "ready" bit
    - Implemented with a special memory array circuit called a Content Addressable Memory (CAM)
  - Also update ready-bit table for future instructions

| Insn | Inp1   | R | Inp2 | R | Dst    | Bday |
|------|--------|---|------|---|--------|------|
| xor  | p1     | y | p2   | y | **p6** | 0    |
| add  | **p6** | y | p4   | y | p7     | 1    |
| sub  | p5     | y | p2   | y | **p8** | 2    |
| addi | **p8** | y |      | y | p9     | 3    |

- For multi-cycle operations (loads, floating point)
  - Wakeup deferred a few cycles
  - Include checks to avoid structural hazards

Ready bits

| Reg    | Ready |
|--------|-------|
| p1     | y     |
| p2     | y     |
| p3     | y     |
| p4     | y     |
| p5     | y     |
| **p6** | y     |
| p7     | n     |
| **p8** | y     |
| p9     | n     |
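A minimal software sketch of wakeup, assuming single-cycle producers. The `wakeup` function and the dictionary layout are hypothetical; real hardware performs the tag comparison in parallel across all entries with a CAM, whereas this sketch simply loops.

```python
def wakeup(queue, ready_bits, issued_dsts):
    """Broadcast each issued destination tag and mark matching inputs ready."""
    for tag in issued_dsts:
        ready_bits[tag] = True            # ready-bit table, for future instructions
        for entry in queue:               # a CAM would compare all entries at once
            for src in entry["srcs"]:
                if src["reg"] == tag:
                    src["ready"] = True

# Waiting instructions after xor (dst p6) and sub (dst p8) were selected.
queue = [
    {"insn": "add", "dst": "p7",
     "srcs": [{"reg": "p6", "ready": False}, {"reg": "p4", "ready": True}]},
    {"insn": "addi", "dst": "p9",
     "srcs": [{"reg": "p8", "ready": False}]},
]
ready_bits = {f"p{i}": True for i in range(1, 6)}
ready_bits.update({"p6": False, "p7": False, "p8": False, "p9": False})

wakeup(queue, ready_bits, ["p6", "p8"])
now_ready = [e["insn"] for e in queue if all(s["ready"] for s in e["srcs"])]
print(now_ready)
```

After the broadcast, `add` and `addi` have all inputs ready, and the ready-bit table shows p6 and p8 as ready while p7 and p9 remain not ready.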



### Issue

- Select/Wakeup in one cycle
- Dependent instructions execute on back-to-back cycles
  - Next cycle: add/addi are ready:

| Insn | Inp1 | R | Inp2 | R | Dst | Bday |
|------|------|---|------|---|-----|------|
|      |      |   |      |   |     |      |
| add  | p6   | y | p4   | y | p7  | 1    |
|      |      |   |      |   |     |      |
| addi | p8   | y |      | y | p9  | 3    |

- Issued instructions are removed from issue queue
  - Free up space for subsequent instructions
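Putting select and wakeup together, the cycle-by-cycle behavior can be sketched as a small loop. This is an illustrative model only: `simulate_issue` is a hypothetical name, every instruction is assumed to have single-cycle latency, and no resource constraints are modeled.

```python
def simulate_issue(queue, ready_regs, width):
    """Return the instruction names issued in each cycle."""
    queue = list(queue)              # work on copies so the caller's data is untouched
    ready_regs = set(ready_regs)
    schedule = []
    while queue:
        ready = [e for e in queue if all(s in ready_regs for s in e["srcs"])]
        issued = sorted(ready, key=lambda e: e["bday"])[:width]
        if not issued:
            break                    # nothing is ready: stop rather than spin
        schedule.append([e["insn"] for e in issued])
        for e in issued:
            ready_regs.add(e["dst"])     # wakeup: consumers can be selected next cycle
            queue.remove(e)              # issued entries leave the issue queue
    return schedule

queue = [
    {"insn": "xor",  "srcs": ["p1", "p2"], "dst": "p6", "bday": 0},
    {"insn": "add",  "srcs": ["p6", "p4"], "dst": "p7", "bday": 1},
    {"insn": "sub",  "srcs": ["p5", "p2"], "dst": "p8", "bday": 2},
    {"insn": "addi", "srcs": ["p8"],       "dst": "p9", "bday": 3},
]
initially_ready = {"p1", "p2", "p3", "p4", "p5"}

two_wide = simulate_issue(queue, initially_ready, width=2)
one_wide = simulate_issue(queue, initially_ready, width=1)
print(two_wide)
print(one_wide)
```

With a width of 2, `xor` and `sub` issue together and `add` and `addi` issue on the very next cycle, which is the back-to-back behavior described above.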



# Re-order Buffer (ROB)

- ROB entry holds all info for recovery/commit
  - Holds all instructions, in order
  - Architectural register names, physical register names, insn type
  - Not removed until the very last step ("commit")
- Operation
  - Fetch: insert at tail (if full, stall)
  - Commit: remove from head (if not yet done, stall)
- Purpose: tracking for in-order commit
  - Maintain appearance of in-order execution
  - Needed to support:
    - Misprediction recovery
    - Freeing of physical registers
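The insert-at-tail, commit-from-head behavior can be sketched as a bounded FIFO. The `ReorderBuffer` class and its method names are hypothetical, and instructions are identified by unique names purely to keep the example short.

```python
from collections import deque

class ReorderBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()       # left end = head (oldest), right end = tail

    def insert(self, insn):
        """Insert at the tail; return False if full (the front end must stall)."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"insn": insn, "done": False})
        return True

    def mark_done(self, insn):
        """Record that an instruction finished executing (possibly out of order)."""
        for entry in self.entries:
            if entry["insn"] == insn:
                entry["done"] = True
                break

    def commit(self):
        """Remove the head only if it is done; return None to signal a stall."""
        if self.entries and self.entries[0]["done"]:
            return self.entries.popleft()["insn"]
        return None

rob = ReorderBuffer(capacity=3)
inserted = [rob.insert(name) for name in ["ld", "add", "xor", "sub"]]
rob.mark_done("xor")        # xor finishes before the older ld
stalled = rob.commit()      # head is ld, which is not done yet
rob.mark_done("ld")
first = rob.commit()        # ld can now leave
second = rob.commit()       # head is add, still not done
```

Even though `xor` finished first, nothing leaves the buffer until the older `ld` is done, which is what preserves the appearance of in-order execution.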



# Register Renaming Revisited

- Track (or "log") the "overwritten register" in ROB
  - Free this register at commit
  - Also used to restore the map table on "recovery"
    - Used for branch misprediction recovery
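A minimal sketch of renaming with the overwritten mapping logged in the ROB. The `rename` function and the data layout are hypothetical; the starting state (x1..x5 mapped to p1..p5, free list p6..p10) is an assumption chosen to match the commit example later in these slides.

```python
def rename(insn, map_table, free_list, rob):
    """Rename one instruction given as (op, arch_dst, arch_srcs)."""
    op, dst, srcs = insn
    phys_srcs = [map_table[s] for s in srcs]   # read sources before remapping dst
    overwritten = map_table[dst]               # the mapping this insn replaces
    new_dst = free_list.pop(0)                 # allocate a free physical register
    map_table[dst] = new_dst
    rob.append({"op": op, "arch_dst": dst, "phys_dst": new_dst,
                "overwritten": overwritten})   # logged for commit and recovery
    return op, new_dst, phys_srcs

# Assumed starting state: x1..x5 -> p1..p5, free list p6..p10.
map_table = {f"x{i}": f"p{i}" for i in range(1, 6)}
free_list = [f"p{i}" for i in range(6, 11)]
rob = []

program = [
    ("xor",  "x3", ["x1", "x2"]),
    ("add",  "x4", ["x3", "x4"]),
    ("sub",  "x3", ["x5", "x2"]),
    ("addi", "x1", ["x3"]),
]
renamed = [rename(insn, map_table, free_list, rob) for insn in program]
overwritten_log = [(e["arch_dst"], e["overwritten"]) for e in rob]
print(renamed)
print(overwritten_log)
```

The log records x3:p3, x4:p4, x3:p6, and x1:p1, which is the information needed both to free registers at commit and to undo the renaming on recovery.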



# Recovery

- Completely remove wrong path instructions
  - Flush from IQ
  - Remove from ROB
  - Restore map table to before misprediction
  - Free destination registers
- How to restore map table?
  - Option #1: log-based reverse renaming (on following slides)
    - Tracks the old mapping to allow it to be reversed
    - Done sequentially for each instruction (slow)
  - Option #2: checkpoint-based recovery
    - Checkpoint state of map table and free list each cycle
    - Faster recovery, but requires more state
  - Option #3: hybrid (checkpoint branches, unwind for others)
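Option #1 (log-based reverse renaming) can be sketched as walking the ROB from the youngest wrong-path instruction back toward the branch. The `recover` function and the data layout are hypothetical, and the starting state is an assumed snapshot taken after four wrong-path instructions were renamed.

```python
def recover(rob, map_table, free_list, num_wrong_path):
    """Undo the youngest `num_wrong_path` renamings, youngest first."""
    for _ in range(num_wrong_path):
        entry = rob.pop()                                    # remove from ROB tail
        map_table[entry["arch_dst"]] = entry["overwritten"]  # restore old mapping
        free_list.insert(0, entry["phys_dst"])               # free destination register

# Assumed state after renaming xor, add, sub, addi on the wrong path.
rob = [
    {"op": "xor",  "arch_dst": "x3", "phys_dst": "p6", "overwritten": "p3"},
    {"op": "add",  "arch_dst": "x4", "phys_dst": "p7", "overwritten": "p4"},
    {"op": "sub",  "arch_dst": "x3", "phys_dst": "p8", "overwritten": "p6"},
    {"op": "addi", "arch_dst": "x1", "phys_dst": "p9", "overwritten": "p1"},
]
map_table = {"x1": "p9", "x2": "p2", "x3": "p8", "x4": "p7", "x5": "p5"}
free_list = ["p10"]

recover(rob, map_table, free_list, num_wrong_path=4)
print(map_table)
print(free_list)
```

The order matters: x3 is written twice, so undoing `sub` first restores p6 and undoing `xor` afterwards restores p3. One step per instruction is why this option is slow compared with restoring a checkpoint.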



National Yang Ming Chiao Tung University Computer Architecture & System Lab

## Reg Renaming Recovery Example

beq is mispredicted, but other insns have already been renamed. How do we restore the map table & free list?

| original insns | renamed insns | overwritten |
|----------------|---------------|-------------|
| beq            | beq           |             |
| xor x3,x1,x2   | xor p6,p1,p2  | x3:p3       |
| add x4,x3,x4   | add p7,p6,p4  | x4:p4       |
| sub x3,x5,x2   | sub p8,p5,p2  | x3:p6       |
| addi x1,x3,1   | addi p9,p8,1  | x1:p1       |

The wrong-path insns are undone youngest first (struck-through entries are mappings removed during recovery):

| arch reg | phys reg                          |
|----------|-----------------------------------|
| x1       | <del>p9</del> p1                  |
| x2       | p2                                |
| x3       | <del>p8</del> <del>p6</del> p3    |
| x4       | <del>p7</del> p4                  |
| x5       | p5                                |

Free list after recovery: p6, p7, p8, p9, p10



# Commit

- At commit, an insn updates architected state
  - Commit is done in-order
  - Only when instructions are finished and there is no possibility of rollback
  - Ok to free overwritten register at this point



## Free over-written register

| insn             | overwritten register |
|------------------|----------------------|
| xor r1 ^ r2 → r3 | [p3]                 |
| add r3 + r4 → r4 | [p4]                 |
| sub r5 - r2 → r3 | [p6]                 |
| addi r3 + 1 → r1 | [p1]                 |

- p3 was r3 before xor
- p6 is r3 after xor
  - Anything older than xor should read p3
  - Anything younger than xor should read p6 (until another insn writes r3)
  - At commit of xor, no older instructions exist, so nothing can still read p3 and it is safe to free
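Freeing the overwritten register at commit can be sketched as follows. The `commit` function and the data layout are hypothetical, and the `done` flags are assumed values chosen to show both a successful commit and a stall.

```python
def commit(rob, free_list):
    """Commit the oldest instruction if it is done and free what it overwrote."""
    if not rob or not rob[0]["done"]:
        return None                          # head not finished: commit stalls
    entry = rob.pop(0)                       # remove from ROB head
    free_list.append(entry["overwritten"])   # no older reader can remain
    return entry["op"]

rob = [
    {"op": "xor",  "phys_dst": "p6", "overwritten": "p3", "done": True},
    {"op": "add",  "phys_dst": "p7", "overwritten": "p4", "done": True},
    {"op": "sub",  "phys_dst": "p8", "overwritten": "p6", "done": False},
    {"op": "addi", "phys_dst": "p9", "overwritten": "p1", "done": True},
]
free_list = ["p10"]

committed = [commit(rob, free_list) for _ in range(3)]
print(committed)
print(free_list)
```

Note that the register freed is the one the instruction overwrote (p3 for `xor`), not its own destination (p6), which younger instructions may still need to read.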




# Register Renaming Commit Example

Commit insns in order, freeing registers.

| original insns | renamed insns | overwritten |
|----------------|---------------|-------------|
| xor x3,x1,x2   | xor p6,p1,p2  | x3:p3       |
| add x4,x3,x4   | add p7,p6,p4  | x4:p4       |
| sub x3,x5,x2   | sub p8,p5,p2  | x3:p6       |
| addi x1,x3,1   | addi p9,p8,1  | x1:p1       |

| arch reg | phys reg |
|----------|----------|
| x1       | p9       |
| x2       | p2       |
| x3       | p8       |
| x4       | p7       |
| x5       | p5       |

Free list: p10



# Dynamic Scheduling Example

- The following slides walk through a detailed, concrete example
- It contains enough detail to be overwhelming
  - Try not to worry about the details
- Focus on the big picture:
  - Hardware can reorder instructions to extract instruction-level parallelism



# Dynamic Scheduling Example

|                  | 0 | 1  | 2  | 3  | 4  | 5     | 6     | 7     | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|----|----|-------|-------|-------|---|---|----|----|----|
| ld [p1] → p2     | F | Di | I  | RR | X  | $M_1$ | $M_2$ | W     | C |   |    |    |    |
| add p2 + p3 → p4 | F | Di |    |    |    | I     | RR    | X     | W | C |    |    |    |
| xor p4 ^ p5 → p6 |   | F  | Di |    |    |       | I     | RR    | X | W | C  |    |    |
| ld [p7] → p8     |   | F  | Di | I  | RR | X     | $M_1$ | $M_2$ | W |   | C  |    |    |

- How would this execution occur cycle-by-cycle?
- Execution latencies assumed in this example:
  - Loads have two-cycle load-to-use penalty
    - Three cycle total execution latency
  - All other instructions have single-cycle execution latency
- Issue queue holds all un-executed instructions
  - Holds ready/not-ready status
  - Faster than looking up in ready table each cycle
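The issue (I) cycles in the table above can be reproduced with a small calculation. This is a sketch under stated assumptions: an instruction issues no earlier than the cycle after dispatch, issue width is unlimited, registers not produced by a listed instruction are already ready, and all names (`LATENCY`, `ready_at`, and so on) are hypothetical.

```python
LATENCY = {"ld": 3}      # load: dependents can issue 3 cycles after the load issues
DEFAULT_LATENCY = 1      # all other instructions: dependents issue the next cycle

# (op, dst, srcs, dispatch cycle), with dispatch cycles read from the table.
insns = [
    ("ld",  "p2", ["p1"],       1),
    ("add", "p4", ["p2", "p3"], 1),
    ("xor", "p6", ["p4", "p5"], 2),
    ("ld",  "p8", ["p7"],       2),
]

ready_at = {}            # phys reg -> first cycle in which a consumer may issue
issue_cycles = []
for op, dst, srcs, dispatch in insns:
    earliest = dispatch + 1                      # assumption: issue after dispatch
    issue = max([earliest] + [ready_at.get(s, 0) for s in srcs])
    ready_at[dst] = issue + LATENCY.get(op, DEFAULT_LATENCY)
    issue_cycles.append(issue)

print(issue_cycles)
```

The second load issues in cycle 3, ahead of the older `add` (cycle 5) and `xor` (cycle 6), because it does not depend on the first load. This is the reordering that dynamic scheduling extracts.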



## Out-of-Order Pipeline – Cycle 0

|                                       | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---------------------------------------|---|---|---|---|---|---|---|---|---|---|----|----|----|
| ld [r1] → r2                          | F |   |   |   |   |   |   |   |   |   |    |    |    |
| add r2 + r3 → r4 | F |   |   |   |   |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   |   |   |   |   |   |   |   |   |   |    |    |    |
| ld [r7] → r4 |   |   |   |   |   |   |   |   |   |   |    |    |    |

**Map Table**

| Arch reg | Phys reg |
|----------|----------|
| r1       | p8       |
| r2       | p7       |
| r3       | p6       |
| r4       | p5       |
| r5       | p4       |
| r6       | p3       |
| r7       | p2       |
| r8       | p1       |

**Ready Table**

| Phys reg | Ready? |
|----------|--------|
| p1       | yes    |
| p2       | yes    |
| p3       | yes    |
| p4       | yes    |
| p5       | yes    |
| p6       | yes    |
| p7       | yes    |
| p8       | yes    |
| p9       |        |
| p10      |        |
| p11      |        |
| p12      |        |

**Reorder Buffer**

| Insn | To Free | Done? |
|------|---------|-------|
| ld   |         | no    |
| add  |         | no    |

**Issue Queue**

| Insn | Src1 | R? | Src2 | R? | Dest | Bdy |
|------|------|----|------|----|------|-----|
|      |      |    |      |    |      |     |

## Out-of-Order Pipeline – Cycle 1a

|                              | 0 | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------------------|---|----|---|---|---|---|---|---|---|---|----|----|----|
| ld [r1] → r2 | F | Di |   |   |   |   |   |   |   |   |    |    |    |
| add r2 + r3 → r4 | F |    |   |   |   |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   |    |   |   |   |   |   |   |   |   |    |    |    |
| ld [r7] → r4 |   |    |   |   |   |   |   |   |   |   |    |    |    |

**Map Table**

| Arch reg | Phys reg |
|----------|----------|
| r1       | p8       |
| r2       | p9       |
| r3       | p6       |
| r4       | p5       |
| r5       | p4       |
| r6       | p3       |
| r7       | p2       |
| r8       | p1       |

**Ready Table**

| Phys reg | Ready? |
|----------|--------|
| p1       | yes    |
| p2       | yes    |
| p3       | yes    |
| p4       | yes    |
| p5       | yes    |
| p6       | yes    |
| p7       | yes    |
| p8       | yes    |
| p9       | no     |
| p10      |        |
| p11      |        |
| p12      |        |

**Reorder Buffer**

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  |         | no    |

**Issue Queue**

| Insn | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|------|------|-----|------|-----|------|-----|
| ld   | p8   | yes |      | yes | p9   | 0   |

## Out-of-Order Pipeline – Cycle 1b

|                              | 0 | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------------------|---|----|---|---|---|---|---|---|---|---|----|----|----|
| ld [r1] → r2                 | F | Di |   |   |   |   |   |   |   |   |    |    |    |
| add r2 + r3 $\rightarrow$ r4 | F | Di |   |   |   |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6             |   |    |   |   |   |   |   |   |   |   |    |    |    |
| ld [r7] → r4                 |   |    |   |   |   |   |   |   |   |   |    |    |    |

**Map Table**

| Arch reg | Phys reg |
|----------|----------|
| r1       | p8       |
| r2       | p9       |
| r3       | p6       |
| r4       | p10      |
| r5       | p4       |
| r6       | p3       |
| r7       | p2       |
| r8       | p1       |

**Ready Table**

| Phys reg | Ready? |
|----------|--------|
| p1       | yes    |
| p2       | yes    |
| p3       | yes    |
| p4       | yes    |
| p5       | yes    |
| p6       | yes    |
| p7       | yes    |
| p8       | yes    |
| p9       | no     |
| p10      | no     |
| p11      |        |
| p12      |        |

**Reorder Buffer**

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |

**Issue Queue**

| Insn | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|------|------|-----|------|-----|------|-----|
| ld   | p8   | yes |      | yes | p9   | 0   |
| add  | p9   | no  | p6   | yes | p10  | 1   |



## Out-of-Order Pipeline – Cycle 1c

|                              | 0 | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------------------|---|----|---|---|---|---|---|---|---|---|----|----|----|
| ld [r1] → r2                 | F | Di |   |   |   |   |   |   |   |   |    |    |    |
| add r2 + r3 $\rightarrow$ r4 | F | Di |   |   |   |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6             |   | F  |   |   |   |   |   |   |   |   |    |    |    |
| ld [r7] → r4                 |   | F  |   |   |   |   |   |   |   |   |    |    |    |

**Map Table**

| Arch reg | Phys reg |
|----------|----------|
| r1       | p8       |
| r2       | p9       |
| r3       | p6       |
| r4       | p10      |
| r5       | p4       |
| r6       | p3       |
| r7       | p2       |
| r8       | p1       |

**Ready Table**

| Phys reg | Ready? |
|----------|--------|
| p1       | yes    |
| p2       | yes    |
| p3       | yes    |
| p4       | yes    |
| p5       | yes    |
| p6       | yes    |
| p7       | yes    |
| p8       | yes    |
| p9       | no     |
| p10      | no     |
| p11      |        |
| p12      |        |

**Reorder Buffer**

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  |         | no    |
| ld   |         | no    |

**Issue Queue**

| Insn | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|------|------|-----|------|-----|------|-----|
| ld   | p8   | yes |      | yes | p9   | 0   |
| add  | p9   | no  | p6   | yes | p10  | 1   |

## Out-of-Order Pipeline – Cycle 2a

|                              | 0 | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------------------|---|----|---|---|---|---|---|---|---|---|----|----|----|
| ld [r1] → r2 | F | Di | I |   |   |   |   |   |   |   |    |    |    |
| add r2 + r3 → r4 | F | Di |   |   |   |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F  |   |   |   |   |   |   |   |   |    |    |    |
| ld [r7] → r4 |   | F  |   |   |   |   |   |   |   |   |    |    |    |

**Map Table**

| Arch reg | Phys reg |
|----------|----------|
| r1       | p8       |
| r2       | p9       |
| r3       | p6       |
| r4       | p10      |
| r5       | p4       |
| r6       | p3       |
| r7       | p2       |
| r8       | p1       |

**Ready Table**

| Phys reg | Ready? |
|----------|--------|
| p1       | yes    |
| p2       | yes    |
| p3       | yes    |
| p4       | yes    |
| p5       | yes    |
| p6       | yes    |
| p7       | yes    |
| p8       | yes    |
| p9       | no     |
| p10      | no     |
| p11      |        |
| p12      |        |

**Reorder Buffer**

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  |         | no    |
| ld   |         | no    |

**Issue Queue**

| Insn | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|------|------|-----|------|-----|------|-----|
| ld   | p8   | yes |      | yes | p9   | 0   |
| add  | p9   | no  | p6   | yes | p10  | 1   |

## Out-of-Order Pipeline – Cycle 2b

|                  | 0 | 1  | 2  | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|---|---|---|---|---|---|---|----|----|----|
| ld [r1] → r2     | F | Di | I  |   |   |   |   |   |   |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |   |   |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |   |   |   |   |   |   |   |    |    |    |
| ld [r7] → r4     |   | F  |    |   |   |   |   |   |   |   |    |    |    |

**Map Table**

| Arch reg | Phys reg |
|----------|----------|
| r1       | p8       |
| r2       | p9       |
| r3       | p6       |
| r4       | p10      |
| r5       | p4       |
| r6       | p11      |
| r7       | p2       |
| r8       | p1       |

**Ready Table**

| Phys reg | Ready? |
|----------|--------|
| p1       | yes    |
| p2       | yes    |
| p3       | yes    |
| p4       | yes    |
| p5       | yes    |
| p6       | yes    |
| p7       | yes    |
| p8       | yes    |
| p9       | no     |
| p10      | no     |
| p11      | no     |
| p12      |        |

**Reorder Buffer**

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  | p3      | no    |
| ld   |         | no    |

**Issue Queue**

| Insn | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|------|------|-----|------|-----|------|-----|
| ld   | p8   | yes |      | yes | p9   | 0   |
| add  | p9   | no  | p6   | yes | p10  | 1   |
| xor  | p10  | no  | p4   | yes | p11  | 2   |



## Out-of-Order Pipeline – Cycle 2c

|                  | 0 | 1  | 2  | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|---|---|---|---|---|---|---|----|----|----|
| ld [r1] → r2     | F | Di | I  |   |   |   |   |   |   |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |   |   |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |   |   |   |   |   |   |   |    |    |    |
| ld [r7] → r4     |   | F  | Di |   |   |   |   |   |   |   |    |    |    |

**Map Table**

| Arch reg | Phys reg |
|----------|----------|
| r1       | p8       |
| r2       | p9       |
| r3       | p6       |
| r4       | p12      |
| r5       | p4       |
| r6       | p11      |
| r7       | p2       |
| r8       | p1       |

**Ready Table**

| Phys reg | Ready? |
|----------|--------|
| p1       | yes    |
| p2       | yes    |
| p3       | yes    |
| p4       | yes    |
| p5       | yes    |
| p6       | yes    |
| p7       | yes    |
| p8       | yes    |
| p9       | no     |
| p10      | no     |
| p11      | no     |
| p12      | no     |

**Reorder Buffer**

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  | p3      | no    |
| ld   | p10     | no    |

**Issue Queue**

| Insn | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|------|------|-----|------|-----|------|-----|
| ld   | p8   | yes |      | yes | p9   | 0   |
| add  | p9   | no  | p6   | yes | p10  | 1   |
| xor  | p10  | no  | p4   | yes | p11  | 2   |
| ld   | p2   | yes |      | yes | p12  | 3   |

Reorder Insn To Free Done?

75



# Out-of-Order Pipeline – Cycle 3

|                              | 0 | 1  | 2  | 3  | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------------------|---|----|----|----|---|---|---|---|---|---|----|----|----|
| ld [r1] → r2                 | F | Di | Ι  | RR |   |   |   |   |   |   |    |    |    |
| add r2 + r3 $\rightarrow$ r4 | F | Di |    |    |   |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6             |   | F  | Di |    |   |   |   |   |   |   |    |    |    |
| ld [r7] → r4                 |   | F  | Di | Ι  |   |   |   |   |   |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          | yes |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          | yes |
| r8        | p1  | p8          | yes |
|           |     | p9          | no  |
|           |     | p10         | no  |
|           |     | p11         | no  |
|           |     | p12         | no  |

Reorder Buffer

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  | p3      | no    |
| ld   | p10     | no    |

Issue Queue

| Insn          | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|---------------|------|-----|------|-----|------|-----|
| <del>ld</del> | p8   | yes |      | yes | p9   | 0   |
| add           | p9   | no  | p6   | yes | p10  | 1   |
| xor           | p10  | no  | p4   | yes | p11  | 2   |
| ld            | p2   | yes |      | yes | p12  | 3   |
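
The cycle-3 state shows why the second `ld` issues ahead of `add` and `xor`: it is the only waiting entry whose sources are all ready. A minimal sketch of that select step follows. It assumes the Bdy column is an age tag (smaller means older) and that one instruction is selected per cycle; `IQEntry` and `select` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class IQEntry:
    insn: str
    src1: Optional[str]  # physical source register, None if unused
    src2: Optional[str]
    dest: str
    bdy: int             # assumed age tag from the "Bdy" column; smaller = older


def select(queue: List[IQEntry], ready: Dict[str, bool], width: int = 1) -> List[IQEntry]:
    """Pick up to `width` entries whose sources are all ready, oldest first."""
    runnable = [
        e for e in queue
        if all(ready[s] for s in (e.src1, e.src2) if s is not None)
    ]
    return sorted(runnable, key=lambda e: e.bdy)[:width]


# Ready Table in cycle 3: p1..p8 are ready, p9..p12 are still pending.
ready = {f"p{i}": i <= 8 for i in range(1, 13)}

# Entries still waiting in cycle 3 (the first ld already issued in cycle 2).
queue = [
    IQEntry("add", "p9", "p6", "p10", 1),
    IQEntry("xor", "p10", "p4", "p11", 2),
    IQEntry("ld", "p2", None, "p12", 3),
]

issued = select(queue, ready)
print([(e.insn, e.bdy) for e in issued])  # [('ld', 3)]
```

The youngest instruction is picked even though older ones are waiting, which is the out-of-order behavior the diagram shows in cycle 3.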

# Out-of-Order Pipeline – Cycle 4

|                  | 0 | 1  | 2  | 3  | 4  | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|----|----|---|---|---|---|---|----|----|----|
| ld [r1] → r2     | F | Di | Ι  | RR | Х  |   |   |   |   |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |    |    |   |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |    |    |   |   |   |   |   |    |    |    |
| ld [r7] → r4     |   | F  | Di | Ι  | RR |   |   |   |   |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          | yes |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          | yes |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         | no  |
|           |     | p11         | no  |
|           |     | p12         | no  |

Reorder Buffer

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  | p3      | no    |
| ld   | p10     | no    |

Issue Queue

| Insn          | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|---------------|------|-----|------|-----|------|-----|
| <del>ld</del> | p8   | yes |      | yes | p9   | 0   |
| add           | p9   | yes | p6   | yes | p10  | 1   |
| xor           | p10  | no  | p4   | yes | p11  | 2   |
| <del>ld</del> | p2   | yes |      | yes | p12  | 3   |
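
In cycle 4 the Ready Table entry for p9 flips to yes, and with it the Src1 R? bit of `add`, while `xor` keeps waiting on p10. The wakeup step behind that change can be sketched as below. The helper names `wakeup` and `ready_bits` are hypothetical, and the sketch only models the ready bits, not when the wakeup is broadcast.

```python
from typing import Dict, List, Optional, Tuple

# (insn, src1, src2) for the two entries still waiting in cycle 4.
waiting: List[Tuple[str, Optional[str], Optional[str]]] = [
    ("add", "p9", "p6"),
    ("xor", "p10", "p4"),
]


def wakeup(ready: Dict[str, bool], dest: str) -> None:
    """Mark a destination physical register as ready."""
    ready[dest] = True


def ready_bits(ready: Dict[str, bool], src1: Optional[str], src2: Optional[str]) -> Tuple[bool, bool]:
    """Return the two R? bits of an entry; an unused source counts as ready."""
    return (src1 is None or ready[src1], src2 is None or ready[src2])


# Ready Table before the wakeup: p1..p8 ready, p9..p12 pending.
ready = {f"p{i}": i <= 8 for i in range(1, 13)}

wakeup(ready, "p9")  # the first ld announces its destination register

status = {insn: ready_bits(ready, s1, s2) for insn, s1, s2 in waiting}
print(status)  # {'add': (True, True), 'xor': (False, True)}
```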

# Out-of-Order Pipeline – Cycle 5a

|                              | 0 | 1  | 2  | 3  | 4  | 5     | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------------------|---|----|----|----|----|-------|---|---|---|---|----|----|----|
| ld [r1] → r2                 | F | Di | I  | RR | X  | $M_1$ |   |   |   |   |    |    |    |
| add r2 + r3 → r4             | F | Di |    |    |    | I     |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6             |   | F  | Di |    |    |       |   |   |   |   |    |    |    |
| ld [r7] → r4                 |   | F  | Di | I  | RR | X     |   |   |   |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          | yes |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          | yes |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         | yes |
|           |     | p11         | no  |
|           |     | p12         | no  |

Reorder Buffer

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  | p3      | no    |
| ld   | p10     | no    |

Issue Queue

| Insn          | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|---------------|------|-----|------|-----|------|-----|
| <del>ld</del> | p8   | yes |      | yes | p9   | 0   |
| add           | p9   | yes | p6   | yes | p10  | 1   |
| xor           | p10  | yes | p4   | yes | p11  | 2   |
| <del>ld</del> | p2   | yes |      | yes | p12  | 3   |



# Out-of-Order Pipeline – Cycle 5b

|                  | 0 | 1  | 2  | 3  | 4  | 5     | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|----|----|-------|---|---|---|---|----|----|----|
| ld [r1] → r2     | F | Di | Ι  | RR | Х  | $M_1$ |   |   |   |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |    |    | Ι     |   |   |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |    |    |       |   |   |   |   |    |    |    |
| ld [r7] → r4     |   | F  | Di | Ι  | RR | Х     |   |   |   |   |    |    |    |

Reorder Buffer

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  | p3      | no    |
| ld   | p10     | no    |

# Out-of-Order Pipeline – Cycle 6

|                  | 0 | 1  | 2  | 3  | 4  | 5     | 6     | 7 | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|----|----|-------|-------|---|---|---|----|----|----|
| ld [r1] → r2     | F | Di | I  | RR | X  | $M_1$ | $M_2$ |   |   |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |    |    | I     | RR    |   |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |    |    |       | I     |   |   |   |    |    |    |
| ld [r7] → r4     |   | F  | Di | I  | RR | X     | $M_1$ |   |   |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          | yes |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          | yes |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         | yes |
|           |     | p11         | yes |
|           |     | p12         | yes |

Reorder Buffer

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | no    |
| add  | p5      | no    |
| xor  | p3      | no    |
| ld   | p10     | no    |

Issue Queue

| Insn           | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|----------------|------|-----|------|-----|------|-----|
| <del>ld</del>  | p8   | yes |      | yes | p9   | 0   |
| <del>add</del> | p9   | yes | p6   | yes | p10  | 1   |
| xor            | p10  | yes | p4   | yes | p11  | 2   |
| <del>ld</del>  | p2   | yes |      | yes | p12  | 3   |



# Out-of-Order Pipeline – Cycle 7

|                              | 0 | 1  | 2  | 3  | 4  | 5     | 6              | 7     | 8 | 9 | 10 | 11 | 12 |
|------------------------------|---|----|----|----|----|-------|----------------|-------|---|---|----|----|----|
| ld [r1] → r2                 | F | Di | I  | RR | X  | $M_1$ | $M_2$ | W     |   |   |    |    |    |
| add r2 + r3 → r4             | F | Di |    |    |    | I     | RR    | X     |   |   |    |    |    |
| xor r4 ^ r5 → r6             |   | F  | Di |    |    |       | I     | RR    |   |   |    |    |    |
| ld [r7] → r4                 |   | F  | Di | I  | RR | X     | $M_1$ | $M_2$ |   |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          | yes |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          | yes |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         | yes |
|           |     | p11         | yes |
|           |     | p12         | yes |

Reorder Buffer

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | yes   |
| add  | p5      | no    |
| xor  | p3      | no    |
| ld   | p10     | no    |

Issue Queue

| Insn           | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|----------------|------|-----|------|-----|------|-----|
| <del>ld</del>  | p8   | yes |      | yes | p9   | 0   |
| <del>add</del> | p9   | yes | p6   | yes | p10  | 1   |
| <del>xor</del> | p10  | yes | p4   | yes | p11  | 2   |
| <del>ld</del>  | p2   | yes |      | yes | p12  | 3   |



# Out-of-Order Pipeline – Cycle 8a

|                  | 0 | 1  | 2  | 3  | 4  | 5     | 6     | 7     | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|----|----|-------|-------|-------|---|---|----|----|----|
| ld [r1] → r2     | F | Di | I  | RR | X  | $M_1$ | $M_2$ | W     | C |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |    |    | I     | RR    | X     |   |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |    |    |       | I     | RR    |   |   |    |    |    |
| ld [r7] → r4     |   | F  | Di | I  | RR | X     | $M_1$ | $M_2$ |   |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          | yes |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          |     |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         | yes |
|           |     | p11         | yes |
|           |     | p12         | yes |

Reorder Buffer

| Insn          | To Free | Done? |
|---------------|---------|-------|
| <del>ld</del> | p7      | yes   |
| add           | p5      | no    |
| xor           | p3      | no    |
| ld            | p10     | no    |

Issue Queue

| Insn           | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|----------------|------|-----|------|-----|------|-----|
| <del>ld</del>  | p8   | yes |      | yes | p9   | 0   |
| <del>add</del> | p9   | yes | p6   | yes | p10  | 1   |
| <del>xor</del> | p10  | yes | p4   | yes | p11  | 2   |
| <del>ld</del>  | p2   | yes |      | yes | p12  | 3   |



National Yang Ming Chiao Tung University Computer Architecture & System Lab
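
The `ld` that commits in cycle 8 releases p7, the register in its To Free column. A minimal sketch of how a rename step could produce that column and the Map Table shown on these slides follows. The starting mappings r2→p7, r4→p5, r6→p3 and the free list p9..p12 are inferred from the To Free and Dest columns, not stated on the slides, and `rename` is a hypothetical helper.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

Insn = Tuple[str, List[str], str]  # (opcode, source arch regs, dest arch reg)


def rename(insns: List[Insn], map_table: Dict[str, str], free_list: Deque[str]):
    """Rename in program order; return (ROB rows, issue-queue rows)."""
    rob, issue_queue = [], []
    for op, srcs, dest in insns:
        phys_srcs = [map_table[r] for r in srcs]  # read sources before remapping dest
        to_free = map_table[dest]                 # old mapping, released at commit
        new_reg = free_list.popleft()             # allocate a fresh physical register
        map_table[dest] = new_reg
        rob.append((op, to_free))
        issue_queue.append((op, phys_srcs, new_reg))
    return rob, issue_queue


# Assumed state before renaming (inferred from the To Free column).
map_table = {"r1": "p8", "r2": "p7", "r3": "p6", "r4": "p5",
             "r5": "p4", "r6": "p3", "r7": "p2", "r8": "p1"}
free_list = deque(["p9", "p10", "p11", "p12"])

program = [
    ("ld", ["r1"], "r2"),
    ("add", ["r2", "r3"], "r4"),
    ("xor", ["r4", "r5"], "r6"),
    ("ld", ["r7"], "r4"),
]

rob, issue_queue = rename(program, map_table, free_list)
print(rob)        # [('ld', 'p7'), ('add', 'p5'), ('xor', 'p3'), ('ld', 'p10')]
print(map_table)  # r2 -> p9, r4 -> p12, r6 -> p11, others unchanged
```

Note that the second `ld` frees p10, the register written by `add`, because both target r4.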

# Out-of-Order Pipeline – Cycle 8b

|                  | 0 | 1  | 2  | 3  | 4  | 5     | 6     | 7     | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|----|----|-------|-------|-------|---|---|----|----|----|
| ld [r1] → r2     | F | Di | I  | RR | X  | $M_1$ | $M_2$ | W     | C |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |    |    | I     | RR    | X     | W |   |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |    |    |       | I     | RR    | X |   |    |    |    |
| ld [r7] → r4     |   | F  | Di | I  | RR | X     | $M_1$ | $M_2$ | W |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          | yes |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          |     |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         | yes |
|           |     | p11         | yes |
|           |     | p12         | yes |

Reorder Buffer

| Insn          | To Free | Done? |
|---------------|---------|-------|
| <del>ld</del> | p7      | yes   |
| add           | p5      | yes   |
| xor           | p3      | no    |
| ld            | p10     | yes   |

Issue Queue

| Insn           | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|----------------|------|-----|------|-----|------|-----|
| <del>ld</del>  | p8   | yes |      | yes | p9   | 0   |
| <del>add</del> | p9   | yes | p6   | yes | p10  | 1   |
| <del>xor</del> | p10  | yes | p4   | yes | p11  | 2   |
| <del>ld</del>  | p2   | yes |      | yes | p12  | 3   |

# Out-of-Order Pipeline – Cycle 9a

|                  | 0 | 1  | 2  | 3  | 4  | 5     | 6     | 7     | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|----|----|-------|-------|-------|---|---|----|----|----|
| ld [r1] → r2     | F | Di | I  | RR | X  | $M_1$ | $M_2$ | W     | C |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |    |    | I     | RR    | X     | W | C |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |    |    |       | I     | RR    | X |   |    |    |    |
| ld [r7] → r4     |   | F  | Di | I  | RR | X     | $M_1$ | $M_2$ | W |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          |     |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          |     |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         | yes |
|           |     | p11         | yes |
|           |     | p12         | yes |

Reorder Buffer

| Insn           | To Free | Done? |
|----------------|---------|-------|
| <del>ld</del>  | p7      | yes   |
| <del>add</del> | p5      | yes   |
| xor            | p3      | no    |
| ld             | p10     | yes   |

Issue Queue

| Insn           | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|----------------|------|-----|------|-----|------|-----|
| <del>ld</del>  | p8   | yes |      | yes | p9   | 0   |
| <del>add</del> | p9   | yes | p6   | yes | p10  | 1   |
| <del>xor</del> | p10  | yes | p4   | yes | p11  | 2   |
| <del>ld</del>  | p2   | yes |      | yes | p12  | 3   |
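
Commit is in program order: `add` retires in cycle 9, but the second `ld`, although already done, has to wait behind `xor`. A minimal sketch of that rule is given below, starting from the Reorder Buffer contents after the first `ld` has committed. `ROBEntry` and `commit` are hypothetical names, and a commit width of 2 is an assumption chosen because the diagram later shows two instructions committing in the same cycle.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class ROBEntry:
    insn: str
    to_free: str  # physical register released when this entry commits
    done: bool


def commit(rob: Deque[ROBEntry], free_list: List[str], width: int = 2) -> List[str]:
    """Retire up to `width` entries from the head, stopping at the first not-done one."""
    committed = []
    while rob and rob[0].done and len(committed) < width:
        entry = rob.popleft()
        free_list.append(entry.to_free)
        committed.append(entry.insn)
    return committed


# Reorder Buffer after the first ld has committed and freed p7.
rob = deque([
    ROBEntry("add", "p5", True),
    ROBEntry("xor", "p3", False),
    ROBEntry("ld", "p10", True),
])
free_list = ["p7"]

first = commit(rob, free_list)   # only add retires; xor blocks the done ld
rob[0].done = True               # xor writes back
second = commit(rob, free_list)  # xor and ld retire together

print(first, second, free_list)
# ['add'] ['xor', 'ld'] ['p7', 'p5', 'p3', 'p10']
```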

# Out-of-Order Pipeline – Cycle 9b

|                              | 0 | 1  | 2  | 3  | 4  | 5     | 6              | 7              | 8  | 9 | 10 | 11 | 12 |
|------------------------------|---|----|----|----|----|-------|----------------|----------------|----|---|----|----|----|
| ld [r1] → r2                 | F | Di | I  | RR | X  | $M_1$ | $M_2$ | W     | C |   |    |    |    |
| add r2 + r3 → r4             | F | Di |    |    |    | I     | RR    | X     | W | C |    |    |    |
| xor r4 ^ r5 → r6             |   | F  | Di |    |    |       | I     | RR    | X | W |    |    |    |
| ld [r7] → r4                 |   | F  | Di | I  | RR | X     | $M_1$ | $M_2$ | W |   |    |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          | yes |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          |     |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          |     |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         | yes |
|           |     | p11         | yes |
|           |     | p12         | yes |

Reorder Buffer

| Insn           | To Free | Done? |
|----------------|---------|-------|
| <del>ld</del>  | p7      | yes   |
| <del>add</del> | p5      | yes   |
| xor            | p3      | yes   |
| ld             | p10     | yes   |

Issue Queue

| Insn           | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|----------------|------|-----|------|-----|------|-----|
| <del>ld</del>  | p8   | yes |      | yes | p9   | 0   |
| <del>add</del> | p9   | yes | p6   | yes | p10  | 1   |
| <del>xor</del> | p10  | yes | p4   | yes | p11  | 2   |
| <del>ld</del>  | p2   | yes |      | yes | p12  | 3   |



# Out-of-Order Pipeline – Cycle 10

|                  | 0 | 1  | 2  | 3  | 4  | 5     | 6     | 7     | 8 | 9 | 10 | 11 | 12 |
|------------------|---|----|----|----|----|-------|-------|-------|---|---|----|----|----|
| ld [r1] → r2     | F | Di | I  | RR | X  | $M_1$ | $M_2$ | W     | C |   |    |    |    |
| add r2 + r3 → r4 | F | Di |    |    |    | I     | RR    | X     | W | C |    |    |    |
| xor r4 ^ r5 → r6 |   | F  | Di |    |    |       | I     | RR    | X | W | C  |    |    |
| ld [r7] → r4     |   | F  | Di | I  | RR | X     | $M_1$ | $M_2$ | W |   | C  |    |    |

| Map Table |     | Ready Table |     |
|-----------|-----|-------------|-----|
| r1        | p8  | p1          | yes |
| r2        | p9  | p2          | yes |
| r3        | p6  | p3          |     |
| r4        | p12 | p4          | yes |
| r5        | p4  | p5          |     |
| r6        | p11 | p6          | yes |
| r7        | p2  | p7          |     |
| r8        | p1  | p8          | yes |
|           |     | p9          | yes |
|           |     | p10         |     |
|           |     | p11         | yes |
|           |     | p12         | yes |

Reorder Buffer

| Insn           | To Free | Done? |
|----------------|---------|-------|
| <del>ld</del>  | p7      | yes   |
| <del>add</del> | p5      | yes   |
| <del>xor</del> | p3      | yes   |
| <del>ld</del>  | p10     | yes   |

Issue Queue

| Insn           | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|----------------|------|-----|------|-----|------|-----|
| <del>ld</del>  | p8   | yes |      | yes | p9   | 0   |
| <del>add</del> | p9   | yes | p6   | yes | p10  | 1   |
| <del>xor</del> | p10  | yes | p4   | yes | p11  | 2   |
| <del>ld</del>  | p2   | yes |      | yes | p12  | 3   |
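
The issue, writeback, and commit cycles in the diagram can be reproduced with a small timing model. This is a sketch under assumptions read off the diagram rather than stated rules: an instruction issues no earlier than one cycle after dispatch (Di), a dependent of a load can issue 3 cycles after the load issues, a dependent of an ALU operation 1 cycle after, and commit happens in program order no earlier than the cycle after writeback. The names `PROGRAM`, `WAKE`, `TO_WB`, and `schedule` are hypothetical, and select-width conflicts are not modeled.

```python
from typing import Dict, List, Tuple

# (name, dispatch cycle, producers it waits on, kind)
PROGRAM: List[Tuple[str, int, List[str], str]] = [
    ("ld1", 1, [], "load"),      # ld [r1] -> r2
    ("add", 1, ["ld1"], "alu"),  # add r2 + r3 -> r4
    ("xor", 2, ["add"], "alu"),  # xor r4 ^ r5 -> r6
    ("ld2", 2, [], "load"),      # ld [r7] -> r4
]

# Assumed distances, inferred from the diagram.
WAKE = {"load": 3, "alu": 1}   # producer issue -> earliest dependent issue
TO_WB = {"load": 5, "alu": 3}  # I RR X M1 M2 W  /  I RR X W


def schedule(program):
    """Return per-instruction issue, writeback, and commit cycles."""
    kind_of = {name: kind for name, _, _, kind in program}
    issue: Dict[str, int] = {}
    wb: Dict[str, int] = {}
    commit: Dict[str, int] = {}
    prev_commit = 0
    for name, dispatch, producers, kind in program:
        earliest = dispatch + 1
        for p in producers:
            earliest = max(earliest, issue[p] + WAKE[kind_of[p]])
        issue[name] = earliest
        wb[name] = earliest + TO_WB[kind]
        # In-order commit: never before the previous instruction commits.
        commit[name] = max(wb[name] + 1, prev_commit)
        prev_commit = commit[name]
    return issue, wb, commit


issue, wb, commit = schedule(PROGRAM)
print(issue)   # {'ld1': 2, 'add': 5, 'xor': 6, 'ld2': 3}
print(wb)      # {'ld1': 7, 'add': 8, 'xor': 9, 'ld2': 8}
print(commit)  # {'ld1': 8, 'add': 9, 'xor': 10, 'ld2': 10}
```

The second load finishes writeback in cycle 8 but commits in cycle 10, which matches the empty cycle-9 slot in its row of the diagram.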



#### Out-of-Order Pipeline – Done!

|                              | 0 | 1  | 2  | 3  | 4  | 5     | 6     | 7     | 8 | 9 | 10 | 11 | 12 |
|------------------------------|---|----|----|----|----|-------|-------|-------|---|---|----|----|----|
| ld [r1] → r2                 | F | Di | I  | RR | X  | $M_1$ | $M_2$ | W     | C |   |    |    |    |
| add r2 + r3 → r4             | F | Di |    |    |    | I     | RR    | X     | W | C |    |    |    |
| xor r4 ^ r5 → r6             |   | F  | Di |    |    |       | I     | RR    | X | W | C  |    |    |
| ld [r7] → r4                 |   | F  | Di | I  | RR | X     | $M_1$ | $M_2$ | W |   | C  |    |    |
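
The commit (C) column above follows from in-order commit: an instruction commits no earlier than the cycle after its writeback, and never before the instruction ahead of it. A minimal sketch of that rule, assuming a commit width of 2 per cycle (an assumption, suggested by the two commits shown in cycle 10); the function name `commit_cycles` is hypothetical:

```python
def commit_cycles(writeback, width=2):
    """Return the commit cycle of each instruction, in program order.

    writeback: cycle in which each instruction reaches W (program order).
    width:     assumed maximum number of commits per cycle.
    """
    commits = []
    for w in writeback:
        cycle = w + 1                      # earliest: the cycle after writeback
        if commits:
            cycle = max(cycle, commits[-1])  # never before the older instruction
        while commits.count(cycle) >= width:
            cycle += 1                     # commit slots for this cycle are full
        commits.append(cycle)
    return commits


# W cycles read from the table: ld=7, add=8, xor=9, ld=8
print(commit_cycles([7, 8, 9, 8]))  # [8, 9, 10, 10]
```

The second `ld` finishes writeback in cycle 8 but waits until cycle 10, because the older `xor` cannot commit before then.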

Map Table:

| Reg | Maps to |
|-----|---------|
| r1  | p8      |
| r2  | p9      |
| r3  | p6      |
| r4  | p12     |
| r5  | p4      |
| r6  | p11     |
| r7  | p2      |
| r8  | p1      |

Ready Table:

| Phys. reg | Ready? |
|-----------|--------|
| p1        | yes    |
| p2        | yes    |
| p3        |        |
| p4        | yes    |
| p5        |        |
| p6        | yes    |
| p7        |        |
| p8        | yes    |
| p9        | yes    |
| p10       |        |
| p11       | yes    |
| p12       | yes    |

Reorder Buffer:

| Insn | To Free | Done? |
|------|---------|-------|
| ld   | p7      | yes   |
| add  | p5      | yes   |
| xor  | p3      | yes   |
| ld   | p10     | yes   |

Issue Queue (entries are struck through on the slide):

| Insn | Src1 | R?  | Src2 | R?  | Dest | Bdy |
|------|------|-----|------|-----|------|-----|
| ld   | p8   | yes |      | yes | p9   | 0   |
| add  | p9   | yes | p6   | yes | p10  | 1   |
| xor  | p10  | yes | p4   | yes | p11  | 2   |
| ld   | p2   | yes |      | yes | p12  | 3   |
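
The final Map Table and the Reorder Buffer's "To Free" column can be reproduced by renaming the four instructions in program order. The sketch below is illustrative only: the initial map table and the free list (`p9` to `p12`) are assumptions reconstructed from the "To Free" entries, and the helper `rename` is a hypothetical name.

```python
def rename(insns, map_table, free_list):
    """Rename instructions in program order.

    Each ROB entry is (op, physical sources, new dest, old dest to free at commit).
    map_table and free_list are updated in place.
    """
    rob = []
    for op, srcs, dest in insns:
        phys_srcs = [map_table[r] for r in srcs]  # read current mappings first
        old = map_table[dest]                     # freed when this insn commits
        new = free_list.pop(0)                    # allocate a fresh physical reg
        map_table[dest] = new
        rob.append((op, phys_srcs, new, old))
    return rob


# Assumed starting state (reconstructed, not given explicitly on this slide)
map_table = {"r1": "p8", "r2": "p7", "r3": "p6", "r4": "p5",
             "r5": "p4", "r6": "p3", "r7": "p2", "r8": "p1"}
free_list = ["p9", "p10", "p11", "p12"]

insns = [("ld",  ["r1"],       "r2"),
         ("add", ["r2", "r3"], "r4"),
         ("xor", ["r4", "r5"], "r6"),
         ("ld",  ["r7"],       "r4")]

rob = rename(insns, map_table, free_list)
print([entry[3] for entry in rob])     # ['p7', 'p5', 'p3', 'p10']
print(map_table["r2"], map_table["r4"], map_table["r6"])  # p9 p12 p11
```

Note that the second `ld` overwrites `r4`, so the register it frees at commit is `p10`, the one allocated to `add` two instructions earlier.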



# Conclusion

- OoO execution is everywhere
- Apple Cyclone core
  - part of A7 chip, launched in iPhone 5S in 2013
  - issue 6 insns per cycle
  - 192-entry ROB
  - 4 integer ALUs
  - 2 Load/Store units
  - 14-19 cycle branch misprediction penalty
  - https://www.anandtech.com/show/7910/apples-cyclone-microarchitecture-detailed