Situations that prevent the next instruction from executing in its designated clock cycle. They cause the pipeline to stall.
3 types.
Structural hazard
When 2 different instructions require the same hardware resource(s) simultaneously.
Occurs primarily in special purpose functional units that are less frequently used (such as floating point divide or other complex long running instructions). Not a major performance factor, assuming programmers are aware of the lower throughput of these instructions.
One occurrence is when both instruction fetch and data access need the same memory. Can be avoided with a Harvard architecture (separate instruction and data memories).
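The shared-memory case above can be sketched as a small check. This is a minimal illustration, assuming an ideal 5-stage pipeline (IF, ID, EX, MEM, WB) with no stalls and a single memory port; the function name `memory_conflicts` and the boolean encoding of instructions are hypothetical, chosen only for this example.

```python
def memory_conflicts(instructions):
    """Return the cycles in which IF and MEM both need the single shared memory.

    `instructions` is a list of booleans: True if the instruction at that
    position is a load/store (it accesses data memory in its MEM stage).
    Assumption: ideal 5-stage pipeline with no stalls, so instruction k
    (0-based) is in IF at cycle k + 1 and in MEM at cycle k + 4.
    """
    conflicts = []
    for k, is_mem_op in enumerate(instructions):
        mem_cycle = k + 4
        # The instruction fetched in cycle c has index c - 1.
        fetch_index = mem_cycle - 1
        if is_mem_op and fetch_index < len(instructions):
            conflicts.append(mem_cycle)
    return conflicts


# A load followed by four other instructions: the load's MEM stage
# overlaps the fetch of the instruction three places behind it.
print(memory_conflicts([True, False, False, False, False]))
```

With separate instruction and data memories, the two accesses go to different ports and the list of conflicting cycles would be empty by construction.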
Data hazard
When an instruction depends on the result of another instruction that overlaps with it in the pipeline.
3 types.
Suppose that, when unpipelined, instruction i runs before instruction j and both use the same register.
- Read after write (RAW)
  j must read only after i writes. Resolved by stalling, unless one of the techniques below applies.
- Write after read (WAR)
  j must write only after i reads. Impossible in a 5-stage pipeline. Occurs when instructions are reordered. Can be solved by renaming registers.
- Write after write (WAW)
  j must write only after i writes. Impossible in a 5-stage pipeline. Occurs when instructions are reordered.
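The three cases can be told apart by comparing the registers each instruction reads and writes. The sketch below assumes an earlier instruction `i` and a later instruction `j`, each with at most one destination register; the `Instr` record and the `hazards` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one optional destination, any sources."""
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


def hazards(i: Instr, j: Instr) -> Set[str]:
    """Classify the data hazards between earlier `i` and later `j`."""
    found = set()
    if i.dest is not None and i.dest in j.srcs:
        found.add("RAW")  # j reads what i writes
    if j.dest is not None and j.dest in i.srcs:
        found.add("WAR")  # j writes what i reads
    if i.dest is not None and i.dest == j.dest:
        found.add("WAW")  # both write the same register
    return found


add = Instr("x1", ("x2", "x3"))               # add x1, x2, x3
print(hazards(add, Instr("x4", ("x1", "x5"))))  # sub x4, x1, x5 reads x1
```

This only classifies the dependence. Whether it causes a stall depends on the pipeline: in an in-order 5-stage pipeline only the RAW case matters, as noted above.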
Register Half-Cycle Timing
A solution to data hazards: registers are written only in the first half of a clock cycle and read only in the second half, so an instruction in ID can read a value that WB writes in the same cycle.
Forwarding
Aka. bypassing or short-circuiting. An alternative solution to data hazards. Sends a value directly from the pipeline stage that produces it (EX/MEM) to the stage that needs it, instead of waiting for the register write. This avoids many data-hazard stalls and keeps the pipeline moving.
Solves most RAW hazards, but not all. Cannot fully hide load-use hazards: a loaded value is available only after the MEM stage, so an instruction that uses it immediately after the load still stalls for 1 cycle.
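A rough way to see the effect of forwarding is to count stall cycles for a single RAW dependence. The sketch assumes a classic 5-stage pipeline in which ALU results are ready at the end of EX, loaded values at the end of MEM, and (without forwarding) a register written in WB can be read in ID during the same cycle. The function `raw_stalls` and its parameters are hypothetical.

```python
def raw_stalls(distance, producer_is_load=False, forwarding=True):
    """Stall cycles for a consumer `distance` instructions after its producer.

    distance = 1 means the consumer immediately follows the producer.
    Assumptions (classic 5-stage pipeline):
      - without forwarding, the consumer's ID must coincide with or follow
        the producer's WB, which needs a distance of at least 3;
      - with forwarding, an ALU result can feed the very next instruction,
        while a loaded value needs a distance of at least 2.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if forwarding:
        needed = 2 if producer_is_load else 1
    else:
        needed = 3
    return max(0, needed - distance)


# Back-to-back dependent instructions under the three models.
print(raw_stalls(1, forwarding=False))
print(raw_stalls(1))
print(raw_stalls(1, producer_is_load=True))
```

The load case is the one forwarding cannot remove; placing an independent instruction between the load and its consumer brings the stall count to zero in this model.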
Control hazard
Aka. branch hazard. The next instruction depends on the outcome of a branch/jump that is not known yet. Cannot fetch the correct next instruction until the branch resolves.
Causes more performance loss than data hazards. 4 solutions are commonly used.
Taken branch
When a branch changes the PC to its target address. Otherwise, it is an untaken branch.
Freeze
Simplest scheme to handle branch hazards. Holds execution in the pipeline until the branch destination is known. Waits for 2 cycles per branch.
Implementation is simple in terms of hardware and software. Branch penalty is constant, cannot be lessened with software optimization.
Used when no prediction is made.
Speculatively fetches next instruction as it normally would. Discards the fetched instructions if they are not needed. Kind of similar to freeze.
Branch penalty is constant, cannot be lessened with software optimization.
Used when a misprediction is made (with any branch prediction policy).
Assume untaken
Treat all branches as untaken. Fetches the next sequential instruction. No branch penalty for untaken branches. If the branch is taken, the penalty is 1 or 2 cycles, depending on the stage in which the branch resolves (1 if in ID, 2 if in EX).
Processor state must remain unchanged until the actual branch outcome is known. If the prediction is wrong, the pipeline is flushed and fetching restarts at the branch target.
More performant. Slightly more complex.
Assume taken
Treat all branches as taken. Requires an early adder and immediate decoder in ID. No branch penalty for taken branches. If the branch is not taken, the branch penalty is 1 or 2 cycles.
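The trade-off between freezing and the two static assumptions can be estimated with a simple CPI model. This is a sketch under stated assumptions: a base CPI of 1, a fixed misprediction penalty, and the penalty rules as described in these notes (freeze always pays, assume untaken pays on taken branches, assume taken pays on untaken branches). The function name, the scheme labels, and the example numbers are hypothetical.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty, scheme):
    """Estimate CPI for one branch-handling scheme.

    branch_fraction: fraction of instructions that are branches
    taken_fraction:  fraction of those branches that are taken
    penalty:         stall cycles paid when the scheme guesses wrong
    scheme:          "freeze", "assume_untaken", or "assume_taken"
    """
    if scheme == "freeze":
        stall_per_branch = penalty                        # every branch waits
    elif scheme == "assume_untaken":
        stall_per_branch = taken_fraction * penalty       # wrong when taken
    elif scheme == "assume_taken":
        stall_per_branch = (1 - taken_fraction) * penalty  # wrong when untaken
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return 1 + branch_fraction * stall_per_branch


# Illustrative workload: 20% branches, 60% of them taken, 2-cycle penalty.
for scheme in ("freeze", "assume_untaken", "assume_taken"):
    print(scheme, effective_cpi(0.2, 0.6, 2, scheme))
```

In this model the better static assumption is simply the one that matches the majority outcome of the workload's branches.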
Delayed branch
Branch outcome is only known after the EX step. The delayed branch technique runs the instruction(s) that follow the branch, before the outcome is known, whether the branch is taken or not. Mostly a single-instruction delay is used. If the branch penalty is longer, other techniques are used. The compiler is responsible for filling the delay slot with a useful instruction, or a nop.
Useful for short and simple pipelines. Implementation becomes too complex when dynamic branch prediction is used. Heavily used in early RISC processors (such as MIPS and SPARC); not anymore.