Tutoring Case: CMPEN 431 OOO

CMPEN 431 OOO Superscalar.1 Sampson Fall 2019 PSU

CMPEN 431 Computer Architecture, Fall 2019
Jack Sampson (www.cse.psu.edu/~sampson)
[Slides adapted from work by Mary Jane Irwin, in turn adapted from Computer Organization and Design, Revised 4th Edition, Patterson & Hennessy, © 2011, Morgan Kaufmann & 5th Edition, Patterson & Hennessy, © 2014, MK. With additional thanks/credits to Amir Roth, Milo Martin, CIS/UPenn]

Dynamically Scheduled SuperScalar (OOO) Processors, Ch.4F

Review: Multiple Instruction Issue Possibilities
❑ Fetch and issue more than one instruction in a cycle
  1. Statically-scheduled (in-order)
     Very Long Instruction Word (VLIW), e.g., TransMeta (4-wide)
     - Compiler figures out what can be done in parallel, so the hardware can be dumb and low power
     - Compiler must group parallel instr's, requires new binaries
     SuperScalar, e.g., Pentium (2-wide), ARM Cortex-A8 (2-wide)
     - Hardware figures out what can be done in parallel
     - Executes unmodified sequential programs
     Explicitly Parallel Instruction Computing (EPIC), e.g., Intel Itanium (6-wide)
     - A compromise: compiler does some, hardware does the rest
  2. Dynamically-scheduled (out-of-order) SuperScalar
     Hardware dynamically determines what can be done in parallel (can extract much more ILP with OOO processing)
     E.g., Intel Pentium Pro/II/III (3-wide), Core i7 (4 cores, 4-wide, SMT2), IBM Power5 (5-wide), Power8 (12 cores, 8-wide, SMT8)

The Impact of Data Dependence on Scheduling
❑ RAW
  When more than one applies, RAW dominates:
      add  $t0,$t1,$t2
      addi $t0,$t0,1
  Must be respected: no way to avoid sequential execution
❑ WAR/WAW on registers
  Two different things can happen when using the same name depending on instruction ordering
  Can be eliminated by register renaming (saw this with VLIW/SSA)
❑ WAR/WAW on memory
  Can't (practically) rename memory and don't know if there is an actual dependency until the effective address is known (in Exec)
  Need to use something other than register renaming

Control Dependence and Instruction Scheduling
❑ Using branch prediction we may end up executing instructions that should not have been executed (i.e., the prediction is incorrect), thereby violating the control dependencies
  But, as long as we don't change the visible machine state, it is still okay (we just used some energy doing work that has to be thrown away)
❑ The key is having a way to execute beyond (several) predicted branches without changing the visible machine state until you know for sure that the branch prediction was correct

Exception Dependence
❑ We also have to provide for precise interrupts, i.e., those synchronous to program (instruction) execution, to support virtual memory (TLB and/or page faults) and deal with undefined instructions, arithmetic overflow, etc.
❑ We also have to preserve exception (interrupt) behavior: any changes in instruction execution order must not change the order in which exceptions are raised, or cause new exceptions to be raised
  Example:
          beq $t0,$t1,L1
          lw  $t2,0($s1)
      L1:
  Can there be a problem with executing lw before beq?

Dynamic Scheduling in HW
❑ Key question: Can implementable hardware detect and orchestrate, at runtime, the necessary dependences in order to produce an instruction execution schedule that obeys all three of the {data, control, exception} logical ordering constraints expressed in the original program binary and encoded in ISA semantics, but that better reflects the naturally occurring instruction level parallelism of the original program dependence graph?
❑ Short answer: YES! (But it requires some new HW mechanisms)
❑ Long answer: The next several dozen slides ☺

Dynamic OOO Datapaths: Historical Precursors
❑ Scoreboarding: CDC 6600 (Thornton), first pub. in 1964
  Used centralized hazard detection logic (scoreboard) to support OOO execution. Instr's were stalled when their FU was busy, for RAW dependencies, and for WAW and WAR dependencies
❑ Tomasulo: IBM 360/91 (Tomasulo), first pub. in 1967
  Used distributed hazard detection logic (reservation stations feeding each FU) to support OOO execution with register renaming that eliminated WAW and WAR dependencies; distributed results from FUs to reservation stations on a Common Data Bus (potential bottleneck)
  Wrote results to the register file and memory when instr's completed, possibly out-of-order, so it could not support precise interrupts or speculative execution (e.g., branch speculation)
  http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo.htm

Dynamic OOO Datapaths in Microprocessors
❑ HPS: (Hwu, Patt, Shebanow), first publication in 1985
  Used a register alias table and distributed node alias tables that fed each FU (essentially reservation stations) to support OOO execution with register renaming; distributed results from FUs to reservation stations on multiple distribution buses (one per FU)
  Supported precise interrupts and speculative execution with a checkpoint repair mechanism
❑ RUU: (Sohi), first publication in 1987
  Uses a centralized Register Update Unit (RUU) that 1) receives new instr's from decode, 2) renames registers, 3) monitors the (single) result bus to resolve dependencies, 4) determines when instr's are ready to issue (send for execution), and 5) holds completed instr's until they can commit
  Supports precise interrupts and speculative execution with in-order commit out of the RUU
  Basis of SimpleScalar's datapath architecture

Basic OOO Instruction Flow Overview
  1. Fetch (in program order): Fetch multiple sequential instructions in parallel from the IM (I$)
  2. Decode, Rename, & Dispatch (in program order): In parallel, decode all of the instr's just fetched, rename the architected registers (ArchitectedRegFile (ARF)) with rename registers (PhysicalRegFile (PRF)), and schedule renamed instr's for execution by dispatching them to the IQ (Instruction Queue) and the ROB (ReOrder Buffer) (combined in the RUU in SimpleScalar)
     Loads and stores are dispatched as two (micro)instr's: one to the IQ to compute the addr and one to the LSQ (LoadStoreQueue) for the memory operation
  3. Issue (Out Of Order, OOO): When an instr in the IQ has all of its source data and the FU (Functional Unit) it needs is free, it is issued for execution
  4. Writeback (Out Of Order, OOO): When the dst value has been computed it is written back to the PRF, and the IQ, ROB, and LSQ are updated; the instr completes execution
     Stores DO NOT write to cache at this stage, and the ARF is NOT updated
  5. Commit (in program order): Only commit the instr's result data to the state locations (i.e., update DM (D$), ARF) when it is the oldest completed instr in the ROB
     Exceptions delayed until this point
     Branch misprediction cleanup potentially delayed to this point; can be performed more aggressively at WB
     No matter how wide commit is, cannot skip over uncompleted instructions

Out-of-Order Pipeline
  [Figure: pipeline diagram]
  In-order front end: Fetch, Decode, Rename, Dispatch
  Buffers of renamed instructions (IQ, ROB, LSQ)
  Out-of-order execution: Issue, Rename-Reg-Read, Execute, Writeback
  In-order commit: Commit
  Callouts: "Entirely new in-order pipeline stages"; "Similar to, but not the same as 'Writeback'"; "Dataflow based execution engine operating on PHYSICAL register state, mapping to/from ARCHITECTURAL state"

Basic OoO Instruction Flow Overview
  1. Fetch  2. Decode  3. Rename  4. Dispatch  5. Issue  6. Writeback  7. Commit
  [Figure labels: In Original Program Order; In Dataflow Order; In Original Program Order; Isolation via buffering (twice, between the regions); Speculative Control; Semi-Speculative Control; Non-speculative Control]

Our Code Example
      lp(0): lw   $t0,0($s1)    #cache miss, assume 3 cycle stall
             addu $t0,$t0,$s2
             sw   $t0,0($s1)
             sub  $t0,$s1,$s2   #provides WAW hazard
             addi $s1,$s1,-4
             bne  $s1,$0,lp     #predict taken (and is)
      lp(1): lw   $t0,0($s1)    #cache hit (from here on)
             addu $t0,$t0,$s2
             sw   $t0,0($s1)
             sub  $t0,$s1,$s2
             addi $s1,$s1,-4
             bne  $s1,$0,lp
      lp(3): …
  [Figure: arrows on the listing mark the RAW, WAR, and WAW dependences]

Code Dependency Observations
❑ Lots of both true and false dependencies
❑ sub instr independent of other instr's (has no true dependencies)
  So can execute in parallel with another instr
  Are there others?
❑ Registers re-used
  Just as in static SS, the register names get in the way
  How can the hardware get around this?
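The observations above can be checked mechanically. The following Python sketch is not part of the original slides: it classifies the register dependences inside one iteration of the loop, ignoring memory and cross-iteration dependences. The tuple encoding of each instruction and the names `LOOP_BODY` and `register_dependences` are assumptions made purely for illustration.

```python
# One loop iteration, hand-encoded as (text, destination register, source registers).
# This encoding is an illustrative assumption, not a real assembler format.
LOOP_BODY = [
    ("lw   $t0,0($s1)",  "$t0", ["$s1"]),
    ("addu $t0,$t0,$s2", "$t0", ["$t0", "$s2"]),
    ("sw   $t0,0($s1)",  None,  ["$t0", "$s1"]),
    ("sub  $t0,$s1,$s2", "$t0", ["$s1", "$s2"]),
    ("addi $s1,$s1,-4",  "$s1", ["$s1"]),
    ("bne  $s1,$0,lp",   None,  ["$s1", "$0"]),
]


def register_dependences(instrs):
    """Return (kind, earlier, later, register) for each register dependence.

    A pair is reported only if no instruction strictly between the two
    writes the register in question.
    """
    deps = []
    for j, (_, dst_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            _, dst_i, srcs_i = instrs[i]
            written_between = {d for _, d, _ in instrs[i + 1:j] if d is not None}
            if dst_i is not None and dst_i not in written_between:
                if dst_i in srcs_j:
                    deps.append(("RAW", i, j, dst_i))  # true dependence
                if dst_i == dst_j:
                    deps.append(("WAW", i, j, dst_i))  # output dependence
            if dst_j is not None and dst_j in srcs_i and dst_j not in written_between:
                deps.append(("WAR", i, j, dst_j))      # anti dependence
    return deps


deps = register_dependences(LOOP_BODY)
for kind, i, j, reg in deps:
    print(f"{kind} on {reg}: {LOOP_BODY[i][0]!r} -> {LOOP_BODY[j][0]!r}")

# Instructions that never consume a RAW-dependent value inside this iteration.
no_true_deps = [text.split()[0] for k, (text, _, _) in enumerate(LOOP_BODY)
                if not any(kind == "RAW" and j == k for kind, _, j, _ in deps)]
print("no true dependence within the iteration:", no_true_deps)
```

Under this encoding sub is never a RAW consumer, which agrees with the observation above; lw and addi also qualify within a single iteration, although across iterations they do depend on the previous addi.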
Register Renaming
❑ Can use register renaming to eliminate (WAW, WAR) (register) data dependencies: conceptually write each register once
  + Removes false dependences (WAW and WAR)
  + Leaves true dependences (RAW) intact
❑ "Architected" vs "Physical" registers
  Architected (ISA) register names: $t0, $s1, $s2, etc.
  Physical register names: p1, p2, p3, p4, p5, p6, p7
❑ Need two hardware structures to enable renaming
  A Map Table showing, for each architected register, the physical register that is currently "impersonating" it
  A Free List of physical registers not currently in use

Renaming Example: Initial State
  Map Table: $s1 → p1, $s2 → p2, $t0 → p3
  Free List: p4 p5 p6 p7 p8
      lw   $t0,0($s1)
      addu $t0,$t0,$s2
      sw   $t0,0($s1)
      sub  $t0,$s1,$s2
      addi $s1,$s1,-4
      bne  $s1,$0,lp

Renaming Example: lw Renaming
  Map Table: $s1 → p1, $s2 → p2, $t0 → p4 (was p3)
  Allocated from the Free List: p4
  Renamed instr [Over-written Reg]: lw p4,0(p1) [p3]

Which Register to Free at Commit?
❑ The over-written (physical) register can be freed at Commit (i.e., added back to the Free List)
  It can be safely freed at commit for two reasons. Rename and commit are both in-order, so any older instruction using that register must have already committed. Any NEWER instruction would use this instruction's destination register (or an even newer name) instead
  The destination register CANNOT be freed at commit because subsequent dependent instructions may still exist in the pipeline
❑ We also need to keep track of the over-written (physical) register so that it can be restored in the Map Table on recovery from mis-predicted branches and from exceptions

Renaming Example: addu Renaming
  Map Table: $s1 → p1, $s2 → p2, $t0 → p5 (was p4)
  Allocated from the Free List: p5
  Renamed so far [Over-written Reg]:
      lw   p4,0(p1)   [p3]
      addu p5,p4,p2   [p4]

Renaming Example: sw Renaming
  Map Table unchanged (sw has no destination register)
  Renamed so far: ..., sw p5,0(p1)

Renaming Example: sub Renaming
  Map Table: $s1 → p1, $s2 → p2, $t0 → p6 (was p5)
  Allocated from the Free List: p6
  Renamed so far: ..., sub p6,p1,p2 [p5]

Renaming Example: addi Renaming
  Map Table: $s1 → p7 (was p1), $s2 → p2, $t0 → p6
  Allocated from the Free List: p7
  Renamed so far: ..., addi p7,p1,-4 [p1]

Renaming Example: bne Renaming
  Map Table unchanged (bne has no destination register)
  Renamed so far: ..., bne p7,p0,lp

Our Code Example After Renaming
      lp(0): lw   p4,0(p1)    #[p3];cache miss, 3 cycle stall
             addu p5,p4,p2    #[p4]
             sw   p5,0(p1)
             sub  p6,p1,p2    #[p5]
             addi p7,p1,-4    #[p1]
             bne  p7,p0,lp    #predict taken (and is)
      lp(1): lw   p8,0(p7)    #[p6];cache hit
             addu p9,p8,p2    #[p8]
             sw   p9,0(p7)
             sub  p10,p7,p2   #[p9]
             addi p11,p7,-4   #[p7]
             bne  p11,p0,lp
      lp(3): …
  [Figure: arrows mark the remaining dependences. RAW: still present; WAR: none; WAW: none]
❑ As promised, renaming eliminated false data dependencies (WAW, WAR) and left true data dependencies (RAW) intact

Out-of-Order Pipeline Progress
  [Figure: the Out-of-Order Pipeline diagram again (in-order front end, out-of-order execution, in-order commit)]
❑ Have completed Fetch, Decode, Rename (in program order) and are ready to Dispatch
  Instr's now have unique register names, so they can now be put into the OOO execution structures

Dispatch
❑ Renamed instructions are placed into OoO hardware data structures
  1. Issue Queue (IQ) (SimpleScalar's RUU)
     Central piece of scheduling logic holding un-executed instr's
     Accessible as both a RAM and a CAM (Content Addr Memory)
  2. Re-order buffer (ROB) (SimpleScalar's RUU)
     Holds all instructions (in order) until Commit time
     Keeps track of the over-written registers so they can be returned to the free list and to support recovery from mispredicted branches and exceptions
  3. Load-Store Queue (LSQ)
     Loads and stores are dispatched in two parts: one going to the IQ for effective address calculation and the other to the LSQ for loads and stores going to the DM
     Stores are not sent to the DM until Commit time; what about loads?
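The renaming walk-through above can be replayed in a few lines of Python. This is a minimal sketch under stated assumptions, not SimpleScalar's or any real processor's implementation: the tuple encoding, the function name `rename`, and the fixed mapping of `$0` to `p0` are all introduced here for illustration only.

```python
from collections import deque

# One loop iteration as (opcode, destination register, source registers).
LOOP_BODY = [
    ("lw",   "$t0", ["$s1"]),
    ("addu", "$t0", ["$t0", "$s2"]),
    ("sw",   None,  ["$t0", "$s1"]),
    ("sub",  "$t0", ["$s1", "$s2"]),
    ("addi", "$s1", ["$s1"]),
    ("bne",  None,  ["$s1", "$0"]),
]


def rename(instrs, map_table, free_list):
    """Rename in program order; return (op, phys_dst, phys_srcs, overwritten)."""
    renamed = []
    for op, dst, srcs in instrs:
        # Sources are looked up before the destination is remapped, so an
        # instruction such as addu $t0,$t0,$s2 reads the OLD name of $t0.
        phys_srcs = [map_table[s] for s in srcs]
        phys_dst = overwritten = None
        if dst is not None:
            overwritten = map_table[dst]      # remembered for commit and recovery
            phys_dst = free_list.popleft()    # allocate a fresh physical register
            map_table[dst] = phys_dst
        renamed.append((op, phys_dst, phys_srcs, overwritten))
    return renamed


# Initial state from the slides; "$0": "p0" is an assumed fixed mapping.
map_table = {"$s1": "p1", "$s2": "p2", "$t0": "p3", "$0": "p0"}
free_list = deque(["p4", "p5", "p6", "p7", "p8"])

renamed = rename(LOOP_BODY, map_table, free_list)
for op, dst, srcs, old in renamed:
    print(op, dst, srcs, f"[{old}]" if old else "")
```

Running the sketch on the first iteration yields the same physical names and over-written registers as the slides, e.g. `addi` gets p7 and over-writes p1. A real design would also have to stall when the free list is empty, which this sketch does not model.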
Issue Queue (IQ)
❑ Holds un-executed instructions
  Instruction op and instruction "age"
❑ Tracks status of source inputs (ready, not ready)
  Physical (renamed) source register names + a ready bit for each source operand
  - AND the ready bits to tell if the instruction is ready to issue (send for execution)
❑ Physical (renamed) destination register
  IQ entry fields: Instr, Src1, R, Src2, R, Dst, Age (Ready? = AND of the two R bits)

Aside: Content Addressable Memories (CAMs)
❑ Storage hardware that is addressed by its content. Typical applications include ROB source tag field comparison logic, cache tags, and (highly-associative) TLBs
  [Figure: CAM with a Match Field and a Data Field per word, Search Data input, Hit and Match Data outputs]
  Hardware compares the Search Data to the Match Field entries for each word in the CAM in parallel!
  On a match the Hit bit is set and the Data Field for that entry is output to Match Data on a read, or the Match Data is written into the Data Field on a write
  If no match occurs, the Hit bit is reset
  CAMs can be designed to accommodate multiple hits
❑ A storage structure can have ports of both types (RAM & CAM)

Dispatch Steps
❑ Allocate IQ (and ROB) slot
  Full? Stall
  Not full? Find an empty slot in the IQ
❑ Read ready bits of inputs (source registers) from a Ready Table
  Ready Table: 1 bit per physical register indicating whether or not that physical register value has been produced
❑ Clear ready bit of output (destination register) in Ready Table
  Instruction has not produced value yet
❑ Write instruction data in the allocated IQ slot
❑ Recall that lw and sw go into both the IQ (for computing the effective address) and the LSQ (which interfaces with the DM)

Dispatch Example, lw Dispatch
  Renamed code being dispatched:
      lp(0): lw   p4,0(p1)   #[p3]
             addu p5,p4,p2   #[p4]
             sw   p5,0(p1)
             sub  p6,p1,p2   #[p5]
             addi p7,p1,-4   #[p1]
             bne  p7,p0,lp
  Ready Table before dispatch: p0 through p9 all y
  Issue Queue:
      Instr  Src1  R  Src2  R  Dst  Age
      lw     0+p1  y  -     y  p4   0
  Ready Table update: p4 y → n

Dispatch Example, addu Dispatch
  Issue Queue adds:
      addu   p4    n  p2    y  p5   1
  Ready Table update: p5 y → n

Dispatch Example, sw Dispatch
  Issue Queue adds:
      sw     p5    n  0+p1  y  -    2
  Ready Table unchanged (sw has no destination register)

Dispatch Example, sub Dispatch
  Issue Queue adds:
      sub    p1    y  p2    y  p6   3
  Ready Table update: p6 y → n

Dispatch Example, addi Dispatch
  Issue Queue adds:
      addi   p1    y  -4    y  p7   4
  Ready Table update: p7 y → n

Dispatch Example, bne Dispatch
  Issue Queue adds:
      bne    p7    n  p0    y  -    5
  Ready Table unchanged: p4 n, p5 n, p6 n, p7 n; all others y

Out-of-Order Pipeline Progress
  [Figure: the Out-of-Order Pipeline diagram again]
  Instr's now have unique (physical) register names
  Dispatched instr's are now in the OoO IQ

Out-of-Order Execution Pipeline Stages
❑ Execution (out-of-order) stages
  Issue, Rename-Reg-Read, Execute, Rename-Writeback
❑ Issue
  1. Select ready instructions
     Send (Issue) them for execution
  2. Wakeup dependent instructions in the IQ
❑ OoO execution pipeline has the necessary forwarding hardware and multiple FU's of different types (some of them with multiple pipeline stages)
❑ Remember, read and writeback are from/to the physical (rename) RF

Issue = Select + Wakeup
❑ Select N oldest, ready instr's to send for execution (checking for structural hazards (e.g., FUs))
  Assume lw has already been issued to memory and its 3 cycle cache miss is still pending
  sub and addi are the two oldest ready instr's
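The Select step just described can be sketched as follows. This is an illustrative Python model rather than real scheduler logic: `IQEntry` and `select` are hypothetical names, the IQ contents are copied from the dispatch example, and structural hazards are ignored by assuming a free FU exists for every selected instruction.

```python
from dataclasses import dataclass


@dataclass
class IQEntry:
    instr: str
    src_ready: tuple      # one ready bit per source operand
    age: int              # smaller age = older instruction
    issued: bool = False


def select(iq, width):
    """Pick up to `width` oldest entries whose source operands are all ready."""
    ready = [e for e in iq if not e.issued and all(e.src_ready)]
    chosen = sorted(ready, key=lambda e: e.age)[:width]
    for entry in chosen:
        entry.issued = True   # assumes a free FU exists for each chosen instr
    return chosen


# IQ state from the example: lw already issued, its cache miss still pending.
iq = [
    IQEntry("lw",   (True, True),  0, issued=True),
    IQEntry("addu", (False, True), 1),   # waits for p4 (lw result)
    IQEntry("sw",   (False, True), 2),   # waits for p5 (addu result)
    IQEntry("sub",  (True, True),  3),
    IQEntry("addi", (True, True),  4),
    IQEntry("bne",  (False, True), 5),   # waits for p7 (addi result)
]

issued_now = [e.instr for e in select(iq, width=2)]
print(issued_now)
```

With a select width of 2 the sketch picks sub and addi, matching the example. Calling `select` again without a wakeup in between picks nothing, because every remaining entry still has an unready source.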
  Issue Queue (IQ):
      Instr  Src1  R  Src2  R  Dst  Age
      lw     0+p1  y  -     y  p4   0    (Issued)
      addu   p4    n  p2    y  p5   1
      sw     p5    n  0+p1  y  -    2
      sub    p1    y  p2    y  p6   3    (Ready!)
      addi   p1    y  -4    y  p7   4    (Ready!)
      bne    p7    n  p0    y  -    5
  Ready Table: p4 n, p5 n, p6 n, p7 n; all others y

Issue = Select + Wakeup
❑ Wakeup dependent instr's
  CAM search for dst addr in Src1 and Src2 and set ready bit (R) on match
  Update Ready Table for Dispatch of future instr's
  In the example: associative search for p6 and p7 (the dst registers of sub and addi); bne's Src1 (p7) ready bit is set to y, and p6 and p7 become y in the Ready Table

Next Issue = Select + Wakeup
❑ Select and Wakeup done in one cycle; sub and addi have been issued for execution (and removed from the IQ)
❑ lw has just completed and p4 is now ready
❑ So, which instr's will be issued next?
  IQ:
      Instr  Src1  R      Src2  R  Dst  Age
      addu   p4    n → y  p2    y  p5   1    (Ready!)
      sw     p5    n      0+p1  y  -    2
      bne    p7    y      p0    y  -    5    (Ready!)
  Associative search for p4 sets addu's Src1 ready bit; Ready Table: p4 n → y, p5 n, p6 y, p7 y

Aside: (Rename) Register Read
❑ When do instructions read the physical register file?
  Obviously cannot be done at Decode (not renamed yet)
  1. Option #1: after Issue (Select), right before Execute
     Read physical (renamed) register
     Or get value via forwarding (based on physical register name)
     Pentium 4, MIPS R10k
     ❑ Physical register file may be large
       Could be a multi-cycle read
  2. Option #2: as part of Dispatch, keep the data values (if known) in the IQ (along with the PReg addr for the Issue (Wakeup) associative search)
     Means bigger IQ entries (+32b or 64b per source value)
     Pentium Pro, Core 2, Core i7 (implemented as FU Reservation Stations rather than as a centralized IQ)

Out-of-order Pipeline: The Detailed View
  [Figure: detailed pipeline diagram. Stages: Fetch, Decode, Rename/Dispatch (in-order front end); Issue/PRegRead, Execute, WriteBack (out-of-order execution); Commit (in-order commit). Labeled structures: I$, ITLB, BHT, BTB, Map Table, Free List, Ready Table, Issue Queue (IQ) with Empty Slot and Ready Instr's, PRegFile, Load Store Queue (LSQ), D$, DTLB, ReOrder Buffer (ROB) with Head and Tail, ARegFile]

Re-Order Buffer (ROB)
❑ All instructions Commit in order
  At commit, write the physical register value to the ISA register and free the overwritten physical register (add it back to the Free List); for store instr's, write the data in the LSQ to the D$ (more on this soon); and free the LSQ and ROB entries for reuse
❑ Two other purposes
  To support recovery from branch misprediction and to support precise (synchronous) interrupts
  - Flush the ROB, IQ, and LSQ, restore the Map Table and Ready Table to their state before the misprediction/interrupt, and free the physical registers (update Free List). This is wasted time and wasted power, which is why accurate branch prediction is so important for OOO datapaths (not as bad for interrupts since they are relatively infrequent)
  - On a mispredicted branch at the ROB head, update the BHT and BTB and restart the pipeline at the branch (with the correct prediction this time)
  - On an interrupt of the instruction at the ROB head, service the interrupt and restart the pipeline at the interrupting instr

Re-Order Buffer (ROB) Data
❑ ROB entry has to keep track of all of the info needed for Commit & Recover
  Physical (renamed) dst register addr and its architectural (ISA) equivalent (so can update ARegFile on completion)
  Overwritten physical register name (for release and recovery)
  Instruction address (PC) and type (in particular store, branch)
  A way of determining when the instr completes execution
  Exception (interrupt) and branch outcome information
❑ On Dispatch: insert at tail
  Full? Stall
❑ Commit: remove from head
  Instr at head not completed? No instr to commit this cycle
  Multiple instr's at head completed? Commit multiple instr's this cycle (if there is hardware to support it)
  http://www.ecs.umass.edu/ece/koren/architecture/ROB/rob_simulator.htm

Speculation in OoO Machines
❑ Speculation allows execution of future instr's that (may) depend on the speculated instruction
  Speculate on the outcome of a conditional branch (branch prediction); just don't commit until the branch outcome is known
  Speculate that a store (for which the address is unknown) that precedes a load does not refer to the same address, allowing the younger load to be executed before the older store (load speculation); don't commit the load until the speculation has cleared
❑ Must have hardware mechanisms for
  Checking to see if the guess was correct
  Recovering from incorrect speculation: only commit out of the ROB when sure the speculation is correct
❑ Ignore and/or buffer exceptions created by speculatively executed instructions until it is clear that they should really occur (i.e., not allowed to change the machine state until commit time)

Commit
  ROB (Head at the oldest entry, Tail at the newest end):
      PC    Instr  PReg  ReReg  AReg  S/B
      XXX0  lw     p4    p3     $t0   -
      XXX1  addu   p5    p4     $t0   -
      XXX2  sw     -     -      -     S
      XXX3  sub    p6    p5     $t0   -
      XXX4  addi   p7    p1     $s1   -
      XXX5  bne    -     -      -     B
  The C? (completed) column shows three entries marked y; which rows they belong to is not recoverable from the extracted slide
❑ Commit: instr takes on its architected state
  In-order, so only when the instr is finished (C?) and at the Head of the ROB
  Copy the data from the physical dst register (PReg) to the ISA (architected) dst register (AReg)
  Free the overwritten physical register (ReReg)

Freeing the Overwritten ReReg
  (Renamed code for lp(0) and lp(1) as listed under "Our Code Example After Renaming")
❑ When lw commits put p3 back on the free list
  When does p4 go back on the free list?
❑ What if the first bne is found to be mis-predicted when it gets to the head of the ROB?
  Need to restore the map table to its state after the first addi and restart at bne
  Map Table (from 1st to 2nd bne): $s1: p7 → p11; $s2: p2; $t0: p6 → p8 → p9 → p10

What if first bne is mispredicted?
❑ State at misprediction
  ROB contains the 1st bne and all of the instr's after it that have been dispatched (Tail to Head): bne(1), addi(1), sub(1), sw(1), addu(1), lw(1), bne(0)
  Map Table: $s1 → p11, $s2 → p2, $t0 → p10
  Free List: p12, p13, p14, p15 …
❑ Cleaned up state
  Flush ROB, IQ, LSQ
  Restore Map Table: $s1 → p7, $s2 → p2, $t0 → p6
  Update Free List: p8, p9, p10, p11 …
  Restore Ready Table
  Fix BHT, BTB
  Restart dispatch at bne(0)

Load Store Queue (LSQ)
❑ Loads and stores are dispatched to the IQ, to the LSQ (the interface to the DM), and to the ROB
  When ready, loads and stores are issued (for effective address calculation) and their IQ entries are released
  When the effective address or store source has been calculated, it is compared to find the matching EAddr / Src entries in the LSQ
  LSQ:
      Instr   Src  R  EAddr  R  Dst  Age
      lw (1)  -    y  0+p1   y  p4   0    (Issued)
      sw (1)  p5   n  0+p1   y  -    2
      lw (2)  -    y  0+p7   n  p8   6
      sw (2)  p9   n  0+p7   n  -    8
  Associative search for 0+p7 sets the EAddr ready bits of lw (2) and sw (2) to y; lw (2) is then Ready!
  Associative search for p5 sets the Src ready bit of sw (1) to y; sw (1) is then Ready!

Memory Location Data Dependencies
❑ RAW, WAR and WAW memory data dependencies
  Memory storage conflicts are less frequent since memory locations are not used (and reused) in the same way that registers are
❑ Stores are committed to the DM from the LSQ in program order at commit time (when they are at the head of the ROB); since stores commit in order there are no WAW hazards. There are also no WAR hazards since there are no older loads (they have already been committed).
      sw $t0,0($s1)
      lw $t1,0($s1)     RAW, true dependence (cannot reorder, what to do?)

      lw $t0,0($s1)
      sw $t1,0($s1)     WAR, anti-dependence (write commit in order fixes, lw will have already been committed)

      sw $t0,0($s1)
      sw $t1,0($s1)     WAW, memory output dependence (write commit in order fixes)

Loads from Memory
❑ When an issued load instr completes execution, the load data is written to the PRegFile, the Ready Table is updated, and the load's destination register addr is compared (associatively) to see if it matches the Src addr's of instr's in the IQ and the Src addr's in the LSQ; the load's LSQ entry is released
❑ Note that the oldest load is "issued" for execution out of the LSQ to the DM
  If there is an EAddr match with another (younger) load, that younger load may not need to be executed since the current load may load in the data the younger load needs
  However, if there is an intervening store between the issuing load and the younger load, all with the same effective address, then the store has the data the younger load needs (store->load forwarding)

Load Bypassing
❑ For better performance younger loads can bypass (be issued before) older loads and stores in the LSQ under certain conditions
  Loads bypassing stores
  - Ready loads can bypass previous (older) stores as long as their effective addresses are known and different (so there is no RAW hazard)
  Loads bypassing loads
  - Ready (EAddr has been calculated) loads in the LSQ can bypass previous (older) unready loads
  - What if they are to the same EAddr? Who cares, no harm done.
  [Figure: program order → candidate reordering]
      sw $t0,0($s1); lw $t1,0($s2)   →   lw $t1,0($s2); sw $t0,0($s1)
      sw $t0,0($s1); lw $t1,0(???)   →   lw $t1,0(???); sw $t0,0($s1)
      lw $t0,0(???); lw $t1,0($s2)   →   lw $t1,0($s2); lw $t0,0(???)

Load Bypassing with Load Forwarding
❑ Load forwarding: when a load's data is supplied directly from an older store in the LSQ
  The most recent older matching LSQ store data value is supplied to the load (beware! there could be more than one matching store)
  [Chart: Speedup (0 to 3) for total order, load byp, and load fwd; series Low, HM, High. From Johnson, 1992]
❑ Load bypassing gives 19% speedup improvement (for a 4-way OOO datapath)
❑ Load forwarding gives an additional 4% speedup improvement

Stores to Memory
❑ Stores are held in the LSQ until the store is ready to commit (in program order, when the store is at the head of the ROB); on Commit the LSQ and ROB entries are released
❑ In addition to the associative search for matching EAddr's, when a PReg becomes ready that address is compared (associatively) with the LSQ's Src field (stores' data value PReg addresses)
  If there is also an effective addr match (EAddr) in the LSQ with a load and the store's data is ready, then the store can provide the load's dst data if the store is the most recent store older than the load (again, store->load forwarding)

OoO Scheduling Scope (Exposing More ILP)
❑ Scheduling scope = OOO window size
  Larger = better
  1. Constrained by the number of physical registers (PRegFile)
     - ROB roughly limited by the number of physical registers
     - Big register file = expensive (area) and slow
  2. Constrained by size of Issue Queue
     - Limits number of un-executed instructions
     - CAMs = can't make too big (power + area)
  3. Constrained by size of Load+Store Queue
     - Limits number of loads/stores
     - CAMs = can't make too big (power + area)
❑ Usefulness of large window: limited by branch prediction
  95% branch prediction accuracy still means a misprediction on 1 in 20 branches, or 1 in 100 instr's (which corresponds to roughly one branch every 5 instr's)

ILP in a "Perfect" Dynamic SS Datapath
❑ The perfect dynamic SS datapath has
  An infinite number of rename registers that eliminates all WAR, WAW data hazards
  Infinite IQ, LSQ, and ROB (so never full)
  No (fetch, decode, dispatch, issue, FU, buses, ports) limit on the number of instr's that can begin execution simultaneously (as long as RAW (true) data hazards are not present)
  Perfect branch prediction
  Perfect caches
  Loads can be moved before stores as long as there are no RAW data hazards
  All FU's have a 1 cycle latency
  [Chart: IPC per benchmark. gcc 55, espresso 63, li 18, fpppp 75, doduc 119, tomcatv 150. From H&P, 2003]

Effect of IQ size on ILP
  [Chart: IPC (0 to 160) for gcc, espresso, li, fpppp, doduc, tomcatv at IQ sizes Infinite, 2K, 512, 128, 32, 8, 4. From H&P, 2003]
❑ Instruction window (IQ): the set of instructions that are examined simultaneously for execution

Effect of Finite Rename Registers
  [Chart: IPC (0 to 60) for gcc, espresso, li, fpppp, doduc, tomcatv with Infinite, 256, 128, 64, 32, and no rename registers. From H&P, 2003]
❑ On a processor with an IQ of 2K, a maximum 64-way issue capability, and a tournament branch predictor with 8K entries

Effect of Realistic Branch Prediction on ILP
❑ On a processor with an IQ of 2K and maximum 64-way issue capability
  [Chart: IPC (0 to 60) for gcc, espresso, li, fpppp, doduc, tomcatv with Perfect, Tournament, Standard 2-bit, Static, and no branch prediction. From H&P, 2003]

Summary: Dynamic (OoO) Scheduling
❑ Dynamic scheduling
  Totally in the hardware; compiler can help (e.g., loop unrolling)
❑ Fetch many instr's into instruction window
  Use branch prediction to speculate past (multiple) branches
  Flush pipeline queues on branch misprediction
❑ Rename to avoid false dependencies
❑ Execute instructions as soon as possible
  Register dependencies are known
  Handling memory dependencies is more tricky
❑ Commit instr's in order
  If anything strange happens before commit, just flush the pipeline queues
❑ Current machines: 100+ instruction scheduling window

Out Of Order: Top 5 Things to Know
  1. Register renaming
     How to perform it and how to recover it
  2. Issue/Select
     Wakeup: CAM
     Choose N oldest ready instructions
  3. Stores
     Write at commit
     Forward to loads via LSQ
  4. Loads
     Possibility for load bypassing and load forwarding
  5. Commit
     Precise state maintained in the ROB
     How/when physical registers are freed

Power Costs of OoO Execution
❑ Complexity of dynamic scheduling and recovering from mis-speculation requires more power
❑ Multiple simpler cores may be better (power-wise)
  Power*Delay product may be a better measure
      Microprocessor  Year  Clock Rate  Pipeline Stages  Issue width  Out-of-order/Speculation  Cores  Power
      i486            1989  25MHz       5                1            No                        1      5W
      Pentium         1993  66MHz       5                2            No                        1      10W
      Pentium Pro     1997  200MHz      10               3            Yes                       1      29W
      P4 Willamette   2001  2000MHz     22               3            Yes                       1      75W
      P4 Prescott     2004  3600MHz     31               3            Yes                       1      103W
      Core            2006  2930MHz     14               4            Yes                       2      75W
      Nehalem         2010  3300MHz     14               4            Yes                       4      87W
      Ivy Bridge      2012  3400MHz     14               4            Yes                       8      77W

An Example: Intel's OoO Processors
❑ Intel's Tick-Tock technology/processor model
  A Tick processor is the "current" design fabbed at a new technology node (feature size)
  A Tock processor is a new microprocessor architecture design fabbed at the current technology node
      Processor     Tech node  Tick/Tock  Date
      Nehalem       45nm       Tock       4Q 2008
      Westmere      32nm       Tick       1Q 2010
      Sandy Bridge  32nm       Tock       1Q 2011
      Ivy Bridge    22nm       Tick       3Q 2011
      Haswell       22nm       Tock       2Q 2013
      Broadwell     14nm       Tick       4Q 2014
      Skylake       14nm       Tock       3Q 2015
❑ Skylake is the fifth Tock since Intel instituted its Tick-Tock model
  https://en.wikipedia.org/wiki/Intel_Tick-Tock

Some Typical "Scope" Queue Sizes
                          Nehalem                    Sandy Bridge               Haswell
      Instr Decode Queue  28 per thread / 2 threads  28 per thread / 2 threads  56 total for 2 threads
      ROB                 128 uops                   168 uops                   192 uops
      Res Station (IQ)    36 uops                    56 uops                    60 uops
      Integer Rename RF   -                          160 registers              168 registers
      FP Rename RF        -                          144 registers              168 registers
      Load Buffers        48 entries                 64 entries                 72 entries
      Store Buffers       32 entries                 36 entries                 42 entries
❑ All are x86 architectures, so x86 CISC instructions are decoded into (several) RISC microinstructions (uops)
❑ All three machines are SMT (2 threads): stay tuned

❑ 4-wide fetch/decode
❑ 2-way SMT
❑ Decoders convert x86 to 4 uops / clock
❑ Instr Decode Queue holds uops from 2 threads, dynamically partitioned
❑ (Red is what is changed over Sandy Bridge)

❑ ROB holds uops from 2 threads, dynamically partitioned
❑ IQ work done by the unified Reservation Station
❑ 8 execution ports (only 6 on Sandy Bridge)

Haswell's Cache Architecture
❑ All caches have 64B blocks; L1s and L2 private, L3 shared
      Metric       Nehalem        Sandy Bridge         Haswell
      L1 I$        32KiB, 4-way   32KiB, 8-way         32KiB, 8-way
      L1 D$        32KiB, 8-way   32KiB, 8-way         32KiB, 8-way
        Ld-to-use  4 cycles       4 cycles             4 cycles
        Ld bdwdth  16B/cycle      32B/cycle (banked)   64B/cycle
        St bdwdth  16B/cycle      16B/cycle            32B/cycle
      UL2          256KiB, 8-way  256KiB, 8-way        256KiB, 8-way
        Ld-to-use  10 cycles      11 cycles            11 cycles
        Bdwdth L1  32B/cycle      32B/cycle            64B/cycle
      L1 iTLB      128, 4-way     128, 4-way           128, 4-way
      L1 dTLB      64, 4-way      64, 4-way            64, 4-way
      L2 uTLB      512, 4-way     512, 4-way           1024, 8-way

Cortex A8 versus Intel i7
      Processor             ARM A8                  Intel Core i7 920
      Market                Personal Mobile Device  Server, cloud
      Thermal design power  2 Watts                 130 Watts
      Clock rate            1 GHz                   2.66 GHz
      Cores/Chip            1                       4
      Floating point?       No                      Yes
      Multiple issue?
Yes Yes Peak instructions/clock cycle 2 4 Pipeline stages 14 14 Pipeline schedule Static in-order Dynamic out-of-order with speculation Branch prediction 2-level 2-level 1st level caches/core 32 KiB I, 32 KiB D 32 KiB I, 32 KiB D 2nd level caches/core 128-1024 KiB 256 KiB 3rd level caches (shared) – 2- 8 MiB CMPEN 431 OOO Superscalar.83 Sampson Fall 2019 PSU Core i7 Pipeline ❑ 4-wide fetch/decode ❑ 2-way SMT ❑ Register alias table – Map Table ❑ Retirement register file – ARF ❑ IQ work done by the unified Reservation Station ❑ Branch misprediction costs 17 cycles CMPEN 431 OOO Superscalar.84 Sampson Fall 2019 PSU Core i7 Performance CMPEN 431 OOO Superscalar.85 Sampson Fall 2019 PSU ARM Cortex A8 Performance (from 4.E) ❑ Ideal CPI is 0.5. For the median case (gcc), 80% of the stalls are due to pipeline hazards, 20% to memory stalls CMPEN 431 OOO Superscalar.86 Sampson Fall 2019 PSU Core i7 Branch Speculation Performance CMPEN 431 OOO Superscalar.87 Sampson Fall 2019 PSU Review: Multithreaded Implementations ❑ MT trades (single-thread) latency for throughput Sharing the datapath degrades the latency of individual threads, but improves the aggregate latency of both threads And it improves utilization of the datapath hardware ❑ Main questions: thread scheduling policy and pipeline partitioning When to switch from one thread to another? How exactly do threads share the pipelined datapath itself? 
❑ Choices depend on what kind of latencies you want to tolerate and how much single thread performance you are willing to sacrifice
  Coarse-grain multithreading (CGMT)
  Fine-grain multithreading (FGMT)
  Simultaneous multithreading (SMT)

Vertical and Horizontal Under-Utilization
❑ FGMT reduces vertical under-utilization
  Loss of all slots in an issue cycle
❑ Does not help with horizontal under-utilization
  Loss of some slots in an issue cycle (in a static SS)
[Figure: issue slots over time for Static SS, FGMT, and SMT]

Simultaneous MultiThreading (SMT)
❑ What can issue instr’s from multiple threads in one cycle?
  Same thing that issues instr’s from multiple parts of same program…
  …out-of-order execution!!
❑ Simultaneous multithreading (SMT): OoO + FGMT
  Aka (by Intel) “hyper-threading”
  Once instr’s are renamed, issuer doesn’t care which thread they come from (well, for non-loads at least)
  Some examples
    – IBM Power5: 4-way, 2 threads; IBM Power7: 4-way, 4 threads
    – Intel Pentium4: 3-way, 2 threads; Intel Core i7: 4-way, 2 threads
    – AMD Bulldozer: 4-way, 2 threads
    – Alpha 21464: 8-way issue, 4 threads (canceled)
    – Notice a pattern? #threads (T) * 2 = #issue width (N)

SMT Resource Partitioning
❑ Each thread must have its own persistent hard state structures
  Per-thread PC (thread scheduler)
  Map Table
  ARegFile
❑ No-state (combinational) structures (e.g., ALU) can be dynamically shared
❑ As with FGMT, TLBs, caches, bpred tables (BHT, BTB) are already dynamically partitioned (persistent soft state) so can be shared
  Some structures, e.g., TLBs, will need thread ids
  Some ordered “soft” state structures (e.g., RAS) will have to be replicated

SMT Out-of-order Pipeline
[Figure: SMT out-of-order pipeline. Stages: Fetch, Decode, Rename/Dispatch, Issue/PReg Read, Execute, Write Back, Commit; the front end and commit are in-order, execution is out-of-order. Structures shown: thread scheduler, I$, ITLB, BHT, BTB, one Map Table and one ARegFile per thread, Free List, Ready Table, IQ (ready instr’s), PRegFile, ROB (head and tail), LSQ, D$, DTLB]

SMT Resource Partitioning, con’t
Transient state structures will need to be partitioned
❑ Execution pipeline latches shared as with FGMT
❑ Free List, PRegFile, Ready Table, and IQ entries can be partitioned (shared) at the fine grain (entry) level
  Physically unordered and so fine-grain sharing is possible
  Probably want a bigger PRegFile and IQ
    – # physical registers = (#threads * #arch-regs) + #in-flight instr’s
    – # Map Table entries = (#threads * #arch-regs)
❑ How are physically ordered structures (ROB, LSQ) shared?
    – Fine-grain sharing (as with IQ) would entangle commit (and squash on branch misprediction, interrupts)
  Allowing threads to commit independently is important, so …

Static vs Dynamic ROB & LSQ Partitioning
❑ Static partitioning (basically one per thread)
  T equal-sized contiguous partitions in the ROB and LSQ
    – T is the number of threads
    – Essentially equivalent to having a ROB and LSQ for each thread
  Could have sub-optimal utilization (fragmentation) as some ROBs could fill up while others are almost empty
  But no starvation (unlike dynamic partitioning)
❑ Dynamic partitioning
  #partitions > T, available partitions assigned on a need basis
  Better utilization
  Possible starvation (one thread grabs most/all the partitions, so other threads are “starved”)
  Couple with a fetch policy that gives a preference to threads with the fewest in-flight instr’s
❑ Both need a larger ROB and LSQ

Multithreading Speed-Ups on the Core i7
❑ Speed-up on PARSEC: 1.31 avg
❑ Energy efficiency improvements: 1.07 avg

Multithreading vs Multicore
❑ If you wanted to run multiple threads, would you build
  A multicore: multiple separate pipelines?
  A multithreaded processor: a single larger pipeline?
❑ Both will get you throughput on multiple threads
  Each core of a multicore will be simpler, possibly with a faster clock
    – Multicore is mainly a TLP (thread-level parallelism) engine
  SMT will get you better performance (IPC) on a single thread
    – SMT is basically an ILP engine that converts TLP to ILP
❑ Do both
  Intel’s Sandy (Ivy) Bridge and Haswell, IBM’s Power7 & 8
  4 to 8 OOO 4-way cores, each of which supports 2 to 4 threads (SMT)
  Private L1 and (normally) L2 caches, shared L3 cache
  3+ GHz clock rate

Evolution of Pipelined, SS Processors

Microprocessor | Year | Clock Rate | # Pipe Stages | Issue Width | OOO? | Cores/Chip | Power
Intel 486 | 1989 | 25 MHz | 5 | 1 | No | 1 | 5 W
Intel Pentium | 1993 | 66 MHz | 5 | 2 | No | 1 | 10 W
Intel Pentium Pro | 1997 | 200 MHz | 10 | 3 | Yes | 1 | 29 W
Intel Pentium 4 Willamette | 2001 | 2000 MHz | 22 | 3 | Yes | 1 | 75 W
Intel Pentium 4 Prescott | 2004 | 3600 MHz | 31 | 3 | Yes | 1 (2) | 103 W
Intel Core | 2006 | 2930 MHz | 14 | 4 | Yes | 2 | 75 W
Intel Core i7 | 2008 | 2930 MHz | 14 | 4 | Yes | 4 (2) | 95 W
Sun USPARC III | 2003 | 1950 MHz | 14 | 4 | No | 1 | 90 W
Sun T1 (Niagara) | 2005 | 1200 MHz | 6 | 1 | No | 8 | 70 W

SimpleScalar Structure
❑ sim-outorder: supports out-of-order execution (with in-order commit) with a Register Update Unit (RUU)
  Uses an RUU for register renaming and to hold the results of pending instructions (our IQ). The RUU also retires (i.e., commits) completed instructions (so our ROB) in program order to the RegFile
  Uses an LSQ for store instructions not ready to commit and load instructions waiting for access to the D$
  Loads are satisfied either by the memory or by an earlier store value residing in the LSQ if their addresses match
    – Loads are issued to the memory system only when addresses of all previous (older) loads and stores are known

SimpleScalar Pipeline Stage Functions
FETCH (in order), ruu_fetch(): fetch multiple instr’s
DECODE, RENAME & DISPATCH (in order), ruu_dispatch(): decode/rename and dispatch instr’s
ISSUE & EXECUTE (out of order), ruu_issue() and lsq_refresh(): wait for source operands to be Ready and FU free, schedule Result Bus and execute instr’s
WRITE BACK RESULT (out of order), ruu_writeback(): copy Result Bus data to matching waiting sources
COMMIT (in order), ruu_commit(): write dst contents to RegFile or Data Memory

SimpleScalar Pipeline
❑ ruu_fetch(): fetches instr’s from one I$ line, puts them in the fetch queue, probes the cache line predictor to determine the next I$ line to access in the next cycle
    – fetch:ifqsize: fetch width (default is 4)
    – fetch:speed: ratio of the front end speed to the execution core (fetch:speed times as many instructions fetched as decoded per cycle)
    – fetch:mplat: branch misprediction latency (default is 3)
❑ ruu_dispatch(): decodes instr’s in the fetch queue, puts them in the dispatch (scheduler) queue, enters and links instr’s into the RUU and the LSQ, splits memory access instructions into two separate instr’s (one to compute the effective addr and one to access the memory), notes branch mispredictions
    – decode:width: decode width (default is 4)

SimpleScalar Pipeline, con’t
❑ ruu_issue() and lsq_refresh(): locate and mark the instr’s ready to be executed by tracking register and memory dependencies; ready loads are issued to the D$ unless there are earlier stores in the LSQ with unresolved addr’s; store values with matching addr’s are forwarded to ready loads
    – issue:width: maximum issue width (default is 4)
    – ruu:size: RUU capacity in instr’s (default is 16, min is 2)
    – lsq:size: LSQ capacity in instr’s (default is 8, min is 2)
  They also handle instr execution: collect all the ready instr’s from the scheduler queue (up to the issue width), check FU availability, check access port availability, and schedule writeback events based on FU latency (hardcoded in fu_config[])
    – res:ialu | imult | memport | fpalu | fpmult: number of FU’s (default is 4 | 1 | 2 | 4 | 1)

SimpleScalar Pipeline, con’t
❑ ruu_writeback(): determines completed instr’s, does data forwarding to dependent waiting instr’s, detects branch misprediction and, on a misprediction, rolls the machine state back to the checkpoint and discards erroneously issued instructions
❑ ruu_commit(): commits results in order (values copied from RUU to RegFile or from LSQ to D$) and frees the RUU/LSQ entries of committed instr’s; keeps retiring instructions at the head of the RUU that are ready to commit until the head instr is one that is not ready
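
The head-of-queue commit rule described for ruu_commit() can be illustrated with a minimal Python sketch. This is not SimpleScalar's code: `RobEntry`, `commit_ready`, and the `commit_width` limit are hypothetical names and assumptions introduced only for illustration. The point it shows is that a completed instruction cannot retire while an older, unfinished instruction sits at the head.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RobEntry:
    """One in-flight instruction (hypothetical record, not SimpleScalar's RUU entry)."""
    name: str
    done: bool = False  # set once the instruction has written back its result


def commit_ready(rob: deque, commit_width: int) -> list:
    """Retire completed instructions from the head, in program order.

    Stops at the first head entry that is not done, or after commit_width
    retirements (the width limit is an assumption of this sketch).
    """
    retired = []
    while rob and len(retired) < commit_width and rob[0].done:
        retired.append(rob.popleft().name)
    return retired


if __name__ == "__main__":
    rob = deque([
        RobEntry("add", done=True),
        RobEntry("lw", done=False),   # still waiting on the D$
        RobEntry("sub", done=True),   # finished out of order, but younger than lw
    ])
    print(commit_ready(rob, commit_width=4))  # only add retires; lw blocks sub
    rob[0].done = True                        # the load completes
    print(commit_ready(rob, commit_width=4))  # now lw and sub retire together
```

Keeping the queue in program order and only ever popping from the head is what makes the committed state precise: everything older than the head has retired, and nothing younger has.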
