CS641 Class 25 Pipelining

hw8 is available

Tiny Computer in Verilog: see handout

Note the programmer’s view of the CPU instruction set just involves AC and memory locations:

LOAD addr   load from memory addr to AC

STORE add store from AC to memory addr

ADD addr   add contents of memory at addr to AC  <----non MIPS-like instruction

JUMP addr

 

RISC CPUs only access memory in load and store instructions

But here we have only one programmer-visible register, so hard to add two numbers in registers, the RISC way.

You’ll fix this in hw8 by replacing ADD with ADDI imm, enough to do the little program.

 

hw8 has some little extensions of the tiny computer, and asks you to think about pipelining it.

MIPS Pipeline Stages

1) IFetch: Fetch Instruction, Increment PC

2) Decode Instruction, Read Registers

3) Execute:
  Mem-ref:          Calculate Address
  Arith-log: Perform Operation

4) Memory:
  Load:   Read Data from Memory
  Store:  Write Data to Memory

5) Write Back: Write Data to Register

 

°         What makes it easy

          all instructions are the same length

          just a few instruction formats

          memory operands appear only in loads and stores

°         What makes it hard?

          structural hazards:   suppose we had only one memory

          control hazards:  need to worry about branch instructions

          data hazards:  an instruction depends on a previous instruction

°         We’ll build a simple pipeline and look at these issues

°          what really makes it hard:

          exception handling

          trying to improve performance with

          out-of-order execution, etc.

Pipeline Control

·         Pass control signals along just like the data

·         Recall idea of saving register value across stages

·         Control based on instruction (also need funct at ALU)—now sorted out by when (what stage) they are needed later

 

 

 

 

 

 

 

 


from pg. 360

 

 

Picture of bits save across stages to control things when relevant (pg. 361)

Longest save: Write-back control lines

Pipeline Stalls: if something stops one stage, all stages stop.

°         Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

          Structural hazards: HW cannot support this combination of instructions

          Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard

          Data hazards: Instruction depends on result of prior instruction still in the pipeline

Problem with starting next instruction before first is finished

          dependencies that “go backward in time” are data hazards

          Look at example (pg. 364)

         

          Here, 4 data dependencies, of which 2 are data hazards

          Clear that two left-hand ones are going back in time, so are data hazards

          And right-hand one is not

          Not so clear on the almost vertical one, but note that it tilts slightly forward in time, because the decode stage is split into two parts, the first part for write-back and the second part for read-register, so this does work.

          This means the register-write is not using the main clock edge, but a separate rising edge timed in the middle of the phase, a sort-of crypto-cycle.

Software Solution for Data Hazards

          Have compiler guarantee no hazards

          Where do we insert the “nops” ?

                sub         $2, $1, $3
                and        $12, $2, $5
                or            $13, $6, $2
                add        $14, $2, $2
                sw          $15, 100($2)

          Answer: 2 nops after the sub

          Problem:  this really slows us down!

Forwarding, a hardware solution

°         Use temporary results, don’t wait for them to be written

o   ALU forwarding (use ALU result before it’s written, from its spot in the pipeline register

 

(pg. 367)

°         Load word can still cause a pipeline stall

Specifically when an instruction tries to read a register following a load instruction that writes to the same register.

(pg. 372)

°         Thus, we need a hazard detection unit to “stall” the load instruction

°         We can stall the pipeline by keeping an instruction in the same stage

°        

°         (pg. 374)

°          

Branch Hazards: important to understand for software performance

Here, assume the branch is taken, but the CPU doesn’t expect that, goes on to start following instructions.

With our old control circuitry, the CPU finds out about the branch decision in the EX step (recall the Zero output of the ALU for reporting a zero result, here of a subtraction). Two following instructions are already in the pipeline at this point.

pg. 376

          need to add hardware for flushing instructions if we are wrong about guessing which way the branch will go.

          But first use 2 optimizations to attack the problem--

          Optimization #1:

          move asynchronous comparator up to Stage 2

          as soon as instruction is decoded (Opcode identifies is as a branch), immediately make a decision and set the value of the PC (if necessary)

          Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed

          Side Note: This means that branches are idle in Stages 3, 4 and 5.

          Optimization #2: Redefine branches

          Old definition: if we take the branch, none of the instructions after the branch get executed by accident

          New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

So, with both of these in place, we don’t need to flush instructions: the branch delay slot works like a nop. That’s the MIPS way, and this trick is used in lots of other RISC CPUs as well.

          Notes on Branch-Delay Slot

          Worst-Case Scenario: can always put a no-op in the branch-delay slot

          Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program

          re-ordering instructions is a common method of speeding up programs

          compiler must be very smart in order to find instructions to do this

          Jumps also have a delay slot…

          Downside: greatly confuse programmers who are doing low-level debugging,