CS641 Class 25 Pipelining

hw8 is available

Tiny Computer in Verilog: see handout

Note the programmer’s view of the CPU instruction set just involves AC and memory locations:

LOAD addr load from memory addr to AC

STORE add store from AC to memory addr

ADD addr add contents of memory at addr to AC <----non MIPS-like instruction

JUMP addr

RISC CPUs only access memory in load and store instructions

But here we have only one programmer-visible register, so hard to add two numbers in registers, the RISC way.

You’ll fix this in hw8 by replacing ADD with ADDI imm, enough to do the little program.

hw8 has some little extensions of the tiny computer, and asks you to think about pipelining it.

MIPS Pipeline Stages

1) IFetch: Fetch Instruction, Increment PC

2) Decode Instruction, Read Registers

3) Execute:
Mem-ref: Calculate Address
Arith-log: Perform Operation

4) Memory:
Load: Read Data from Memory
Store: Write Data to Memory

5) Write Back: Write Data to Register

° What makes it easy

• all instructions are the same length

• just a few instruction formats

• memory operands appear only in loads and stores

° What makes it hard?

• structural hazards: suppose we had only one memory

• control hazards: need to worry about branch instructions

• data hazards: an instruction depends on a previous instruction

° We’ll build a simple pipeline and look at these issues

° what really makes it hard:

• exception handling

• trying to improve performance with

• out-of-order execution, etc.

Pipeline Control

· Pass control signals along just like the data

· Recall idea of saving register value across stages

· Control based on instruction (also need funct at ALU)—now sorted out by when (what stage) they are needed later

from pg. 360

Picture of bits save across stages to control things when relevant (pg. 361)

Longest save: Write-back control lines

Pipeline Stalls: if something stops one stage, all stages stop.

° Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle

• Structural hazards: HW cannot support this combination of instructions

• Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard

• Data hazards: Instruction depends on result of prior instruction still in the pipeline

Problem with starting next instruction before first is finished

• dependencies that “go backward in time” are data hazards

• Look at example (pg. 364)

•

• Here, 4 data dependencies, of which 2 are data hazards

• Clear that two left-hand ones are going back in time, so are data hazards

• And right-hand one is not

• Not so clear on the almost vertical one, but note that it tilts slightly forward in time, because the decode stage is split into two parts, the first part for write-back and the second part for read-register, so this does work.

• This means the register-write is not using the main clock edge, but a separate rising edge timed in the middle of the phase, a sort-of crypto-cycle.

Software Solution for Data Hazards

• Have compiler guarantee no hazards

•          Where do we insert the “nops” ?

                sub         $2, $1, $3
                and        $12, $2, $5
                or            $13, $6, $2
                add        $14, $2, $2
                sw          $15, 100($2)

• Answer: 2 nops after the sub

• Problem: this really slows us down!

Forwarding, a hardware solution

° Use temporary results, don’t wait for them to be written

o ALU forwarding (use ALU result before it’s written, from its spot in the pipeline register

(pg. 367)

° Load word can still cause a pipeline stall

Specifically when an instruction tries to read a register following a load instruction that writes to the same register.

(pg. 372)

° Thus, we need a hazard detection unit to “stall” the load instruction

° We can stall the pipeline by keeping an instruction in the same stage

° (pg. 374)

Branch Hazards: important to understand for software performance

Here, assume the branch is taken, but the CPU doesn’t expect that, goes on to start following instructions.

With our old control circuitry, the CPU finds out about the branch decision in the EX step (recall the Zero output of the ALU for reporting a zero result, here of a subtraction). Two following instructions are already in the pipeline at this point.

pg. 376

• need to add hardware for flushing instructions if we are wrong about guessing which way the branch will go.

• But first use 2 optimizations to attack the problem--

• Optimization #1:

• move asynchronous comparator up to Stage 2

• as soon as instruction is decoded (Opcode identifies is as a branch), immediately make a decision and set the value of the PC (if necessary)

• Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed

• Side Note: This means that branches are idle in Stages 3, 4 and 5.

• Optimization #2: Redefine branches

• Old definition: if we take the branch, none of the instructions after the branch get executed by accident

• New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

So, with both of these in place, we don’t need to flush instructions: the branch delay slot works like a nop. That’s the MIPS way, and this trick is used in lots of other RISC CPUs as well.

• Notes on Branch-Delay Slot

• Worst-Case Scenario: can always put a no-op in the branch-delay slot

• Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program

• re-ordering instructions is a common method of speeding up programs

• compiler must be very smart in order to find instructions to do this

• Jumps also have a delay slot…

• Downside: greatly confuse programmers who are doing low-level debugging,