# EECS 583 – Class 12: Superblock Scheduling, Software Pipelining Intro

University of Michigan

October 17, 2018

# Announcements + Reading Material

- Project discussion meetings
  - » No class next week (Oct 22 & 24)
  - » Each group meets 15 mins with Ze and me
  - » Sign up today in class; a signup sheet is on my door (4633 BBB) if you miss class or can't decide on a timeslot
  - » Be prompt and show up a few minutes early, since meetings are back-to-back
- Project proposals
  - » Due Wednesday, Oct 31, 11:59pm
  - » 1 paragraph summary of what you plan to work on
    - Topic, approach, objective
    - 1-2 references
  - » Email to me and Ze, cc your group members
- Today's class reading
  - "Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops", B. Rau, MICRO-27, 1994, pp. 63-74.
- Next next Monday's reading
  - » "Code Generation Schema for Modulo Scheduled Loops", B. Rau, M. Schlansker, and P. Tirumalai, MICRO-25, Dec. 1992.

# Homework Problem From Last Time - Answer

Machine: 2 issue, 1 memory port, 1 ALU. Memory port = 2 cycles, pipelined. ALU = 1 cycle.

1. Calculate height-based priorities
2. Schedule using the <u>Operation</u> scheduler

[Figure: dependence graph of ops 1-10, each annotated with Estart,Lstart; ops marked "m" use the memory port]

| Op | Estart | Lstart | Priority |
|----|--------|--------|----------|
| 1m | 0      | 1      | 6        |
| 2m | 0      | 0      | 7        |
| 3  | 2      | 3      | 4        |
| 4m | 2      | 2      | 5        |
| 5  | 3      | 5      | 2        |
| 6  | 3      | 4      | 3        |
| 7  | 4      | 4      | 3        |
| 8  | 5      | 5      | 2        |
| 9m | 0      | 4      | 3        |
| 10 | 6      | 6      | 1        |

RU_map and Schedule:

| Time | ALU | MEM | Placed |
|------|-----|-----|--------|
| 0    |     | X   | 2      |
| 1    |     | X   | 1      |
| 2    |     | X   | 4      |
| 3    | X   | X   | 3, 9   |
| 4    | X   |     | 6      |
| 5    | X   |     | 7      |
| 6    | X   |     | 5      |
| 7    | X   |     | 8      |
| 8    | X   |     | 10     |


### Generalize Beyond a Basic Block

- Superblock
  - » Single entry
  - » Multiple exits (side exits)
  - » No side entries
- Schedule just like a BB
  - » Priority calculation needs to change
  - » Dealing with control deps



#### Lstart in a Superblock

- Not a single Lstart any more
  - » 1 per exit branch (Lstart is a vector!)
  - » Exit branches have probabilities

[Figure: superblock example listing Estart, Lstart0, and Lstart1 for ops 1-6]



# **Operation Priority in a Superblock**

- Priority – Dependence height and speculative yield
  - » Height from op to exit \* probability of exit
  - » Sum up across all exits in the superblock

$$\text{Priority}(op) = \sum_{i} \text{Prob}_i \cdot \big(\text{MAX\_Lstart} - \text{Lstart}_i(op) + 1\big)$$

The sum runs over the exits $i$ for which op has a valid late time.
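The priority formula can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name, the example superblock, and its Lstart values are hypothetical, and MAX_Lstart is assumed to be the largest valid Lstart in the superblock.

```python
def superblock_priorities(lstart, exit_prob):
    """Speculative-yield priority for every op in a superblock.

    lstart[op][i] is the Lstart of op with respect to exit i, or None when
    op has no valid late time for that exit.
    exit_prob[i] is the probability that exit i is taken.
    """
    # Assumption: MAX_Lstart is the largest valid Lstart in the superblock.
    max_lstart = max(t for times in lstart.values() for t in times if t is not None)
    priorities = {}
    for op, times in lstart.items():
        priorities[op] = sum(
            prob * (max_lstart - t + 1)
            for prob, t in zip(exit_prob, times)
            if t is not None
        )
    return priorities


# Hypothetical superblock: side exit br0 (taken 25% of the time), final exit br1.
lstart = {
    "a":   [0, 2],      # feeds both exits
    "b":   [None, 4],   # only needed by the final exit
    "br0": [1, 4],
    "br1": [None, 5],
}
priorities = superblock_priorities(lstart, exit_prob=[0.25, 0.75])
```

With a single exit of probability 1, the sum collapses to the height-based priority used for a basic block.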



### Dependences in a Superblock



- Data dependences shown; all are reg flow except $1 \rightarrow 6$, which is reg anti
- Dependences define precedence ordering of operations to ensure correct execution semantics
- What about control dependences?
- Control dependences define precedence of ops with respect to branches

#### Conservative Approach to Control Dependences



- Make branches barriers, nothing moves above or below branches
- Schedule each BB in SB separately
- Sequential schedules
- Whole purpose of a superblock is lost

#### Upward Code Motion Across Branches

- Restriction 1a (register op)
  - » The destination of op is not in liveout(br)
  - » Otherwise the moved op would wrongly kill a live value
- Restriction 1b (memory op)
  - » Op does not modify memory
  - » Actually live memory is what matters, but that is often too hard to determine
- Restriction 2
  - » Op must not cause an exception that may terminate program execution when br is taken
  - » After the move, op is executed more often than it is supposed to be (<u>speculated</u>)
  - » Page faults and cache misses are OK
- Insert a control dep when either restriction is violated
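The restrictions on upward motion translate directly into a predicate that decides whether a branch-to-op control dependence is needed. The sketch below is illustrative only: the `Op` record and its field names are hypothetical, and a real compiler would derive them from its IR and liveness analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    dests: frozenset = frozenset()   # registers written by the op
    writes_memory: bool = False      # True for stores
    may_trap: bool = False           # can raise a program-terminating exception


def needs_control_dep(op, branch_liveout):
    """Return True if op must stay below the branch (upward motion is illegal)."""
    violates_r1a = bool(op.dests & branch_liveout)  # would kill a live value
    violates_r1b = op.writes_memory                 # conservative memory rule
    violates_r2 = op.may_trap                       # speculation could trap
    return violates_r1a or violates_r1b or violates_r2


liveout = frozenset({"r1"})                          # liveout(br), hypothetical
add = Op("add", dests=frozenset({"r2"}))             # free to move up
mov = Op("mov", dests=frozenset({"r1"}))             # violates R1a
store = Op("st", writes_memory=True)                 # violates R1b
load = Op("ld", dests=frozenset({"r3"}), may_trap=True)  # violates R2
```

Only `add` can be hoisted above the branch here; the other three each get a control dependence edge from the branch.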



#### Downward Code Motion Across Branches

- Restriction 1 (liveness)
  - » If no compensation code
    - Same restriction as before, destination of op is not liveout
  - » Else, no restrictions
    - Duplicate operation along both directions of branch if destination is liveout
- Restriction 2 (speculation)
  - » Not applicable, downward motion is not speculation
- Again, insert control dep when the restrictions are violated
- Part of the philosophy of superblocks is no compensation code insertion, hence R1 is enforced!



#### Add Control Dependences to a Superblock



#### **Class Problem**



Draw the dependence graph

### **Relaxing Code Motion Restrictions**

- Upward code motion is generally more effective
  - Speculate that an op is useful (just like an out-of-order processor with branch pred)
  - » Start ops early, hide latency, overlap execution, more parallelism
- Removing restriction 1
  - » For register ops use register renaming
  - » Could rename memory too, but generally not worth it
- Removing restriction 2
  - » Need hardware support (aka <u>speculation models</u>)
    - Some ops don't cause exceptions
    - Ignore exceptions
    - Delay exceptions



R1: y is not in liveout(1)

R2: op 2 will never cause an exception when op 1 is taken

# **Restricted Speculation Model**



### General Speculation Model

- 2 types of exceptions
  - Program terminating (traps)
    - Div by 0, illegal address
  - » Fixable (normal and handled at run time)
    - Page fault, TLB miss
- General speculation
  - Processor provides nontrapping versions of all operations (div, load, etc)
  - Return some bogus value (0) when error occurs
  - » R2 is completely ignored, only R1 limits speculation
  - » Speculative ops converted into non-trapping version
  - Fixable exceptions handled as usual for non-trapping ops



# Programming Implications of General Spec

- Correct program
  - » No problem at all
  - Exceptions will only result when branch is taken
  - Results of excepting speculative operation(s) will not be used for anything useful (R1 guarantees this!)
- Program debugging
  - Non-trapping ops make this almost impossible
  - Disable general speculation during program debug phase



#### **Class Problem**



- Starting with the graph assuming restricted speculation, what edges can be removed if general speculation support is provided?
- With more renaming, what dependences could be removed?

# Sentinel Speculation Model

- Ignoring all speculative exceptions is painful
  - » Debugging issue (is a program ever fully correct?)
- Also, handling of all fixable exceptions for speculative ops can be slow
  - » Extra page faults
- Sentinel speculation
  - » Mark speculative ops (opcode bit)
  - Exceptions for speculative ops are noted, but not handled immediately (return garbage value)
  - Check for exception conditions in the "home block" of speculative potentially excepting ops



# Delaying Speculative Exceptions

- 3 things needed
  - » Record exceptions
  - » Check for exceptions
  - » Regenerate exception
    - Re-execute ops including dependent ops
    - Terminate execution or process exception
- Recording them
  - Extend every register with an extra bit
    - Exception tag (or NAT bit)
    - Reg data is garbage when set
    - Bit is set when either
      - Speculative op causes exception
      - Speculative op has a NAT'd source operand (exception propagation)
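Recording and propagating exceptions through the extra register bit can be modeled as below. This is a toy software model under stated assumptions, not a description of any particular ISA: the `Reg` type and the `spec_div` / `spec_add` helpers are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False   # exception tag; value is garbage when set


def spec_div(a, b):
    """Speculative divide: never traps, records the exception in the NAT bit."""
    if a.nat or b.nat:          # exception propagation from a source operand
        return Reg(0, True)
    if b.value == 0:            # the op itself would have excepted
        return Reg(0, True)
    return Reg(a.value // b.value)


def spec_add(a, b):
    """Speculative add: cannot except itself, but propagates a set NAT bit."""
    if a.nat or b.nat:
        return Reg(0, True)
    return Reg(a.value + b.value)


q = spec_div(Reg(8), Reg(0))    # divide by zero: NAT set, no trap
r = spec_add(q, Reg(1))         # dependent op inherits the NAT bit
ok = spec_add(Reg(2), Reg(3))   # clean operands: normal result
```

Because the tag flows down the dependence chain, testing only the last register in the chain is enough to detect an exception anywhere along it.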



# Delaying Speculative Exceptions (2)

- Check for exceptions
  - Test NAT bit of appropriate register (last register in dependence chain) in home block
  - » Explicit checks
    - Insert new operation to check NAT
  - » Implicit checks
    - Non-speculative use of register automatically serves as NAT check
- Regenerate exception
  - » Figure out the exact cause
  - » Handle if possible
  - » A check whose NAT condition holds branches to "recovery code"
  - Compiler generates the recovery code specific to each check



# Delaying Speculative Exceptions (3)

- Recovery code consists of the chain of operations starting with a potentially excepting speculative op up to its corresponding check
- In recovery code, the exception condition will be regenerated as the excepting op is re-executed with the same inputs
- If the exception can be handled, it is, all dependent ops are re-executed, and execution is returned to the point after the check
- If the exception is a program error, execution is terminated in the recovery code

Main code:

- `2': y = *x`
- `3': z = y + 4`
- `1: branch x == 0`
- `branch NAT(z) fixup`
- `done:`
- `4: *w = z`

Recovery code:

- `fixup:`
- `2": y = *x`
- `3": z = y + 4`
- `jump done`
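The example on this slide can be simulated to see all three outcomes: exception dropped on the side exit, fixable exception repaired in recovery code, and program error raised in recovery code. The sketch below is a hypothetical model; `Machine`, `paged_out`, and `run` are illustrative names, and a "page fault" is modeled simply as a word that is valid but not yet resident.

```python
class ProgramError(Exception):
    """A program-terminating exception (e.g. illegal address)."""


class Machine:
    def __init__(self, memory, paged_out=None):
        self.memory = dict(memory)              # resident words: address -> value
        self.paged_out = dict(paged_out or {})  # valid but non-resident words

    def load(self, addr):
        """Non-speculative load: handles fixable faults, raises on real errors."""
        if addr in self.paged_out:              # fixable: page the word in
            self.memory[addr] = self.paged_out.pop(addr)
        if addr not in self.memory:
            raise ProgramError(f"illegal address {addr}")
        return self.memory[addr]

    def spec_load(self, addr):
        """Speculative load: never raises; returns (value, nat)."""
        if addr in self.memory:
            return self.memory[addr], False
        return 0, True                          # exception recorded in the NAT bit


def spec_add(src, imm):
    value, nat = src
    return (0, True) if nat else (value + imm, False)  # NAT propagates


def run(machine, x, w):
    y = machine.spec_load(x)       # 2': y = *x  (speculated above the branch)
    z = spec_add(y, 4)             # 3': z = y + 4
    if x == 0:                     # 1: branch x == 0
        return "side exit"         # deferred exception is never checked
    if z[1]:                       # branch NAT(z) fixup
        y_val = machine.load(x)    # 2": re-execute; regenerates the exception
        z = (y_val + 4, False)     # 3": re-execute the dependent op
    machine.memory[w] = z[0]       # 4: *w = z
    return "fall through"
```

Calling `run(Machine({100: 7}), 100, 200)` takes the normal path, `x == 0` leaves through the side exit without ever raising, a paged-out address is repaired in the recovery path, and an unmapped nonzero address raises `ProgramError` from the re-executed load.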

# Implicit vs Explicit Checks

- Explicit
  - » Essentially just a conditional branch
  - » Nothing special needs to be added to the processor
  - » Problems
    - Code size
    - Checks take valuable resources
- Implicit
  - » Use existing instructions as checks
  - » Removes problems of explicit checks
  - » However, how do you specify the address of the recovery block, and how is control transferred there?
  - » Hardware table
    - Indexed by PC
    - Indicates where to go when NAT is set
- IA-64 uses explicit checks

### Homework Problem



1. Move ops 5, 6, 8 as far up in the SB as possible assuming sentinel speculation support and register renaming
2. Insert the necessary checks and recovery code (assume ld, st, and div can cause exceptions)

# Change Focus to Scheduling Loops



#### Basic Approach – List Schedule the Loop Body



Total time = 6 \* n

#### Unroll Then Schedule Larger Body



Total time = 7 \* n/2

# Problems With Unrolling

- Code bloat
  - » Typical unroll is 4-16x
  - » Use profile statistics to only unroll "important" loops
  - » But still, code grows fast
- Barrier across unrolled bodies
  - » I.e., for unroll 2, can only overlap iterations 1 and 2, 3 and 4, ...
- Does this mean unrolling is bad?
  - » No, in some settings it's very useful
    - Low trip count
    - Lots of branches in the loop body
  - » But, in other settings, there is room for improvement

# **Overlap Iterations Using Pipelining**



# A Software Pipeline



# Creating Software Pipelines

- Lots of software pipelining techniques out there
- Modulo scheduling
  - » Most widely adopted
  - » Practical to implement, yields good results
- Conceptual strategy
  - » Unroll the loop completely
  - » Then, schedule the code completely with 2 constraints
    - All iteration bodies have identical schedules
    - Each iteration is scheduled to start some fixed number of cycles later than the previous iteration
  - » <u>Initiation Interval</u> (II) = fixed delay between the start of successive iterations
  - » Given the 2 constraints, the unrolled schedule is repetitive (kernel) except the portion at the beginning (prologue) and end (epilogue)
    - Kernel can be re-rolled to yield a new loop

# Creating Software Pipelines (2)

- Create a schedule for 1 iteration of the loop such that when the same schedule is repeated at intervals of II cycles
  - » No intra-iteration dependence is violated
  - » No inter-iteration dependence is violated
  - » No resource conflict arises between operations in the same or distinct iterations
- We will start out assuming Itanium-style hardware support, then remove it later
  - » Rotating registers
  - » Predicates
  - » Software pipeline loop branch
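The three conditions on a one-iteration schedule can be sketched as a checker. This is a simplified illustration, not the iterative modulo scheduling algorithm itself: it assumes every op occupies one unit of one resource for a single cycle, and that each dependence carries a latency and an iteration distance (0 for intra-iteration, 1 or more for inter-iteration). All names and the example loop are hypothetical.

```python
from collections import Counter


def is_valid_modulo_schedule(times, resource_of, units, deps, ii):
    """Check a one-iteration schedule that is re-issued every ii cycles.

    times:       op -> issue cycle within one iteration
    resource_of: op -> resource name
    units:       resource name -> number of available units
    deps:        list of (src, dst, latency, distance)
    """
    # Dependences: dst of the iteration `distance` later issues at
    # times[dst] + distance * ii, which must not precede src's result.
    for src, dst, latency, distance in deps:
        if times[dst] + distance * ii < times[src] + latency:
            return False
    # Resources: ops whose issue cycles are equal mod ii collide in steady state.
    usage = Counter((resource_of[op], t % ii) for op, t in times.items())
    return all(count <= units[res] for (res, _), count in usage.items())


# Hypothetical loop body: load, add, store, and an address increment.
times = {"ld": 0, "inc": 1, "add": 2, "st": 3}
resource_of = {"ld": "MEM", "st": "MEM", "add": "ALU", "inc": "ALU"}
units = {"MEM": 1, "ALU": 1}
deps = [
    ("ld", "add", 2, 0),    # intra-iteration, load latency 2
    ("add", "st", 1, 0),    # intra-iteration
    ("inc", "ld", 1, 1),    # inter-iteration: next load needs the new address
    ("inc", "inc", 1, 1),   # inter-iteration recurrence
]
valid_at_2 = is_valid_modulo_schedule(times, resource_of, units, deps, ii=2)
valid_at_1 = is_valid_modulo_schedule(times, resource_of, units, deps, ii=1)
```

In this example the schedule is legal at II = 2 but not at II = 1, where the load and store would both need the single memory port every cycle.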

# Terminology



<u>Initiation Interval</u> (II) = fixed delay between the start of successive iterations

Each iteration can be divided into <u>stages</u> consisting of II cycles each

Number of stages in 1 iteration is termed the <u>stage count (SC)</u>

Takes SC-1 stages ((SC-1) \* II cycles) to fill/drain the pipe
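The terminology can be tied together with a small calculation. The sketch below assumes the usual whole-stage accounting, in which n iterations take (n + SC - 1) \* II cycles; the function names and the example numbers (a 6-cycle iteration schedule, II = 2, 100 iterations) are hypothetical.

```python
def stage_count(schedule_length, ii):
    """Number of II-cycle stages needed to cover one iteration's schedule."""
    return -(-schedule_length // ii)   # ceiling division


def pipelined_cycles(n, schedule_length, ii):
    """Cycles for n iterations, counting fill and drain in whole stages."""
    return (n + stage_count(schedule_length, ii) - 1) * ii


sc = stage_count(6, 2)                  # 6-cycle iteration, II = 2
total = pipelined_cycles(100, 6, 2)     # 100 iterations, overlapped
sequential = 6 * 100                    # same loop with no overlap
```

For large n the cost per iteration approaches II cycles, which is why the scheduler tries to minimize II rather than the length of a single iteration's schedule.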

To Be Continued ...