# gpucc: An Open-Source GPGPU Compiler

Authors: Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Wang, Robert Hundt

Presenters: Bryce Messmann, Rohit Kandula

Date: 11 December 2019

## Background

- The two dominant software platforms for GPUs are CUDA (by NVIDIA) and OpenCL.
- CUDA is widely used, but it is proprietary!
- Open-source compilers exist for CUDA, but they are not as powerful as NVIDIA's proprietary compiler (which is itself limited).

As a result, there is almost no research on compiler optimizations for CUDA, which is a large bottleneck for GPU compiler research.

#### gpucc

- gpucc is "a fully functional, open-source, high performance, CUDA-compatible toolchain, based on LLVM and Clang"
- Targets CUDA, and includes many general and CUDA-specific optimizations
- Compared to NVIDIA's proprietary nvcc compiler:
  - Significantly faster compile time
  - On-par runtime performance



## **IR-Level Optimizations**

- gpucc includes many optimizations for runtime performance
  - Standard optimizations (e.g. -O3 in LLVM) are intended for CPUs
  - The optimizations here target GPUs, and the CUDA platform specifically
- All optimizations are done within LLVM

## Optimization 1/6: Loop Unrolling / Function Inlining

- Jumps are expensive on GPUs
  - Multiple "small" simultaneous threads (SIMD execution)
  - Little or no out-of-order execution
  - Function calls pass arguments in memory, which adds latency
- gpucc performs more aggressive unrolling and inlining
  - Reduces number of jumps, improving performance
  - Also promotes stack variables to registers (due to constant propagation / SROA)
- Manual options: `#pragma unroll` and `__forceinline__`
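The link between unrolling and register promotion can be sketched in plain host-side C++. This is only an illustration under assumptions: the function names `scaled_sum_rolled` and `scaled_sum_unrolled` are hypothetical, and the second function is written by hand to show the shape of code that full unrolling plus SROA can expose, not actual gpucc output.

```cpp
#include <cassert>

// Rolled form: tmp[] is indexed by a loop variable, so while the loops stay
// rolled the array must be addressable and cannot simply live in registers.
inline float scaled_sum_rolled(const float* in) {
    assert(in != nullptr);
    float tmp[4];
    for (int i = 0; i < 4; ++i) tmp[i] = in[i] * 2.0f;
    float s = 0.0f;
    for (int i = 0; i < 4; ++i) s += tmp[i];
    return s;
}

// Hand-written equivalent of what full unrolling exposes: every index is a
// constant, so the array can be split into four independent scalars (SROA)
// and no loop branches remain.
inline float scaled_sum_unrolled(const float* in) {
    assert(in != nullptr);
    const float t0 = in[0] * 2.0f;
    const float t1 = in[1] * 2.0f;
    const float t2 = in[2] * 2.0f;
    const float t3 = in[3] * 2.0f;
    // Same association order as the rolled loop, so results match exactly.
    return (((0.0f + t0) + t1) + t2) + t3;
}
```

Both functions compute the same value; the second has no jumps and no addressable stack array.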

## **Optimization 2/6: Inferring Memory Spaces**

- gpucc can "propagate" memory spaces from initial definition to users
- This knowledge allows for performance optimizations (e.g. ld.shared instruction instead of generic ld instruction)
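The idea of propagating a memory space from a definition to its users can be sketched with a toy model. Everything here is an assumption made for illustration: the `Space` enum, the `Value` struct, and `infer_spaces` are hypothetical names, and the model only follows single-base derived pointers (casts, address arithmetic) listed in definition-before-use order. A real pass also has to handle merges such as phi nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Space { Generic, Global, Shared, Local };

// Toy IR value: a pointer with an optional base it was derived from.
struct Value {
    Space space;  // Generic means "not known yet"
    int base;     // index of the pointer this one is derived from, or -1
};

// One forward pass: a derived pointer with an unknown space inherits the
// space of its base. Values are assumed to be in def-before-use order.
inline void infer_spaces(std::vector<Value>& vals) {
    for (std::size_t i = 0; i < vals.size(); ++i) {
        Value& v = vals[i];
        if (v.base < 0 || v.space != Space::Generic) continue;
        const std::size_t base = static_cast<std::size_t>(v.base);
        assert(base < i);  // definitions come before their users
        v.space = vals[base].space;
    }
}
```

Once a load's address is known to be in the shared space, the code generator is free to pick the space-specific instruction instead of the generic one.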





## Optimization 3/6: Memory-Space Alias Analysis

- Two pointers to different memory spaces won't access the same memory
- gpucc can detect this! (using memory-space inference)
- This improves dead-store elimination
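The aliasing rule above is small enough to write down directly. The sketch below is a simplified model, not gpucc's implementation: the `Space` enum and the function names are hypothetical, and a generic pointer is conservatively assumed to alias everything.

```cpp
#include <cassert>

enum class Space { Generic, Global, Shared, Local };

// Two pointers can only be proven disjoint when both spaces are known
// (non-generic) and different; otherwise stay conservative.
constexpr bool may_alias(Space a, Space b) {
    return a == Space::Generic || b == Space::Generic || a == b;
}

// For the sequence "store to p; load from q; store to p", the first store
// is dead only if the load from q cannot observe it.
constexpr bool first_store_is_dead(Space p, Space q) {
    return !may_alias(p, q);
}

// Small usage check with runtime assertions.
inline void check_alias_examples() {
    assert(!may_alias(Space::Shared, Space::Global));
    assert(may_alias(Space::Generic, Space::Shared));
    assert(first_store_is_dead(Space::Shared, Space::Global));
}
```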

## Optimization 4/6: Bypassing 64-Bit Division

- NVIDIA GPUs don't have a "divide" unit
- They perform division using a long sequence of simpler instructions
  - ~20 instructions for 32-bit division
  - ~70 instructions for 64-bit division
- Most division can be done in 32 bits (divisor and dividend are small)
- gpucc inserts a runtime check and uses 32-bit division whenever both operands fit in 32 bits
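The bypass amounts to a fast path guarded by a cheap bit test. The following is a minimal sketch for unsigned operands; `div_bypass` is a hypothetical name, and the compiler performs the equivalent rewrite on IR rather than on source code.

```cpp
#include <cassert>
#include <cstdint>

// 64-bit unsigned division with a 32-bit fast path.
inline std::uint64_t div_bypass(std::uint64_t a, std::uint64_t b) {
    assert(b != 0);
    // If no bit above bit 31 is set in either operand, both fit in 32 bits.
    if (((a | b) >> 32) == 0) {
        // Fast path: the much shorter 32-bit division sequence.
        return static_cast<std::uint32_t>(a) / static_cast<std::uint32_t>(b);
    }
    // Slow path: full 64-bit division.
    return a / b;
}
```

The check costs an OR, a shift, and a branch, which is small compared with the difference between the two division sequences.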

## Optimization 5/6: Straight-Line Scalar Optimizations

- Eliminate partial redundancy:
  - (b+1)\*n -> b\*n + n
- Useful for array accesses in unrolled loops
  - Matrix multiplication
  - Dot product
  - Backpropagation
- Adopted by other backends in LLVM:
  - AMDGPU
  - PowerPC
  - ARM64

```
#pragma unroll
for (long x = 0; x < 3; ++x) {
  #pragma unroll
  for (long y = 0; y < 3; ++y) {
    float *p = &a[(c + y) + (b + x) * n];
    ... // load from p
  }
}
// Loads a 3x3 submatrix at indices (b,c) in the array a.

// After unrolling:
p0 = &a[c + b*n];
p1 = &a[c+1+b*n];
p2 = &a[c+2+b*n];
p3 = &a[c + (b+1)*n];
p4 = &a[c+1+(b+1)*n];
p5 = &a[c+2+(b+1)*n];
p6 = &a[c + (b+2)*n];
p7 = &a[c+1+(b+2)*n];
p8 = &a[c+2+(b+2)*n];
```

- Inefficiencies:
  - Partial redundancy:
    - b\*n, (b+1)\*n, (b+2)\*n
    - => b\*n, b\*n+n, b\*n+n+n
  - Doesn't use var+immOff addressing:
    - var -> stored in a register
    - immOff -> 32-bit immediate
    - c+2+(b+2)\*n => (c+(b+2)\*n)+2
- Three optimizations in this class:
  - Pointer Arithmetic Reassociation: maps loads to var+immOff addressing
  - Straight-Line Strength Reduction and Global Reassociation: eliminate partial redundancy



## Optimization 5.1/6: Pointer Arithmetic Reassociation

- PAR tries to extract an additive integer constant from a pointer address expression
  - variable part + constant offset
- The NVPTX code generator then folds the result into reg+immOff addressing using simple pattern matching
- PAR also enables better CSE
  - &a[c+1+b\*n] -> &a[c+b\*n] + 1 -> &p0[1] -> ld.f32 [%rd1+4]
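The rewrite in the last bullet can be checked with ordinary pointer arithmetic. This is a host-side sketch with hypothetical function names; the indices are assumed to be in bounds, and the 4-byte offset in the comment assumes a 4-byte `float`.

```cpp
#include <cassert>

// Address as written after unrolling: the constant 1 is buried in the index.
inline const float* addr_original(const float* a, long b, long c, long n) {
    assert(a != nullptr);
    return &a[c + 1 + b * n];
}

// After PAR: variable part first, constant offset last.
inline const float* addr_reassociated(const float* a, long b, long c, long n) {
    assert(a != nullptr);
    const float* p0 = &a[c + b * n];  // variable part, shared with p0 via CSE
    return &p0[1];                    // constant part: one element, i.e. +4 bytes
}
```

Because the variable part is identical for p0, p1, and p2, it is computed once and the remaining constants become immediate offsets on the load.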



| After unrolling | After PAR+CSE |
|-----------------|---------------|
| p0 = &a[c + b\*n]; | p0 = &a[c + b\*n]; |
| p1 = &a[c+1+b\*n]; | p1 = &p0[1]; |
| p2 = &a[c+2+b\*n]; | p2 = &p0[2]; |
| p3 = &a[c + (b+1)\*n]; | p3 = &a[c + (b+1)\*n]; |
| p4 = &a[c+1+(b+1)\*n]; | p4 = &p3[1]; |
| p5 = &a[c+2+(b+1)\*n]; | p5 = &p3[2]; |
| p6 = &a[c + (b+2)\*n]; | p6 = &a[c + (b+2)\*n]; |
| p7 = &a[c+1+(b+2)\*n]; | p7 = &p6[1]; |
| p8 = &a[c+2+(b+2)\*n]; | p8 = &p6[2]; |

#### Optimization 5.2/6: Straight-Line Strength Reduction

Figure: strength reduction forms and replacements.

- Works on dominator paths instead of loops
- b and s are integer variables; C0 and C1 are integer constants
- SLSR identifies candidates of the same form and replaces each one with an expression based on an earlier candidate

- C1-C0 is often -1, 1, or a fixed stride
- Increases dependencies and could hurt ILP
  - NVIDIA K40 has no out-of-order execution and has 2 integer units
  - Not much of a problem in practice, because the scheduling window is small
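For the running example, the multiplication form (b+C)\*n with C = 0, 1, 2 can be sketched as follows. The struct and function names are hypothetical, and the second function is a hand-written picture of the result, not compiler output. It also makes the dependency chain mentioned above visible: each offset now depends on the previous one.

```cpp
#include <cassert>

struct Offsets { long x0, x1, x2; };

// Before SLSR: three independent multiplications.
inline Offsets offsets_naive(long b, long n) {
    return { b * n, (b + 1) * n, (b + 2) * n };
}

// After SLSR: one multiplication; each later candidate is the previous one
// plus (C1 - C0) * n, which is just n because C1 - C0 == 1 here.
inline Offsets offsets_slsr(long b, long n) {
    const long x0 = b * n;
    const long x1 = x0 + n;  // (b + 1) * n
    const long x2 = x1 + n;  // (b + 2) * n
    assert(x2 - x0 == 2 * n);
    return { x0, x1, x2 };
}
```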

| After PAR+CSE | After SLSR |
|---------------|------------|
| p0 = &a[c+b\*n];<br>p1 = &p0[1];<br>p2 = &p0[2]; | x0 = b\*n;<br>p0 = &a[c+x0];<br>p1 = &p0[1];<br>p2 = &p0[2]; |
| p3 = &a[c+(b+1)\*n];<br>p4 = &p3[1];<br>p5 = &p3[2]; | x1 = x0+n;<br>p3 = &a[c+x1];<br>p4 = &p3[1];<br>p5 = &p3[2]; |
| p6 = &a[c+(b+2)\*n];<br>p7 = &p6[1];<br>p8 = &p6[2]; | x2 = x1+n;<br>p6 = &a[c+x2];<br>p7 = &p6[1];<br>p8 = &p6[2]; |

## **Optimization 5.3/6: Global Reassociation**

- Reorders commutative operations for better redundancy elimination
- Similar to
  - Enhanced Scalar Replacement, but with linear time complexity
  - LLVM's Reassociate pass, but "global"



- Pruning guarantees linear time
- Safe due to pre-order traversal
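A small illustration of reordering commutative operations to expose a redundancy is given below. The expression, the `Sums` struct, and the function names are assumptions chosen for this sketch, not an example taken from the slides. Unsigned arithmetic is used so that the reordering is valid for all inputs, including wraparound.

```cpp
#include <cassert>
#include <cstdint>

struct Sums { std::uint64_t t, u; };

// Before: u shares no subexpression with the earlier value t.
inline Sums sums_original(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const std::uint64_t t = a + c;
    const std::uint64_t u = (a + b) + c;  // two additions
    return {t, u};
}

// After: (a + b) + c is reordered to (a + c) + b, so the already computed
// t = a + c is reused and u needs only one addition.
inline Sums sums_reassociated(std::uint64_t a, std::uint64_t b, std::uint64_t c) {
    const std::uint64_t t = a + c;
    const std::uint64_t u = t + b;
    assert(u == (a + b) + c);
    return {t, u};
}
```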

## **Optimization 6/6: Speculative Execution**

- Straight-line scalar optimizations (SLSO) do not work on non-dominating instructions
- Solution: hoist side-effect-free instructions out of conditional basic blocks
- Increases dominance and the likelihood that SLSO applies
- Also promotes predication
  - Makes conditional basic blocks smaller, which in turn triggers predicated execution

| (a) original | (b) speculative execution | (c) straight-line optimizations |
|--------------|---------------------------|---------------------------------|
| <b>if</b> (b)<br>&nbsp;&nbsp;u = a[i];<br><b>if</b> (c)<br>&nbsp;&nbsp;v = a[i+j]; | p = &a[i];<br><b>if</b> (b)<br>&nbsp;&nbsp;u = \*p;<br>q = &a[i+j];<br><b>if</b> (c)<br>&nbsp;&nbsp;v = \*q; | p = &a[i];<br><b>if</b> (b)<br>&nbsp;&nbsp;u = \*p;<br>q = &p[j];<br><b>if</b> (c)<br>&nbsp;&nbsp;v = \*q; |
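Columns (a) and (c) can be written as two host-side functions to confirm that they load the same values. This is a sketch with hypothetical names (`Loaded`, `load_original`, `load_hoisted`); the indices are assumed to be in bounds, since C++ restricts pointer arithmetic more than the IR-level address computations that the compiler hoists.

```cpp
#include <cassert>

struct Loaded { float u, v; };

// (a) original: each address is computed inside its own conditional block.
inline Loaded load_original(const float* a, long i, long j, bool b, bool c) {
    assert(a != nullptr);
    Loaded r{0.0f, 0.0f};
    if (b) r.u = a[i];
    if (c) r.v = a[i + j];
    return r;
}

// (c) after speculative execution and straight-line optimizations: the
// address computations are hoisted (only the loads stay conditional), and
// q is rewritten in terms of p because p now dominates it.
inline Loaded load_hoisted(const float* a, long i, long j, bool b, bool c) {
    assert(a != nullptr);
    Loaded r{0.0f, 0.0f};
    const float* p = &a[i];  // hoisted, no memory access
    if (b) r.u = *p;
    const float* q = &p[j];  // was &a[i + j]
    if (c) r.v = *q;
    return r;
}
```

The conditional blocks shrink to a single load each, which is the shape that lends itself to predicated execution.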

## Evaluation

#### Benchmarks:

- **Rodinia**: based on various "real-world" applications (data mining, medical imaging, etc.)
- **SHOC**: scientific computing benchmarks
- **Tensor**: benchmarks using the Tensor module in Eigen 3.0 (C++ linear algebra library)

#### Metrics:

- Runtime performance comparison
- Compilation time
- Effects of optimizations

### Performance on End-to-End Benchmarks



### Performance on Open-Source Benchmarks



### **Effects of Optimizations**



- **IU**: inline and unroll
- **MSI**: memory space inference
- **SL**: straight-line scalar optimizations
- **AA**: memory-space alias analysis
- **bypass**: bypassing 64-bit divides

| Optimization | Benchmark | Speedup |
|--------------|-----------|---------|
| IU           | ic1       | ~10x    |
| MSI          | ic1       | ~3x     |
| bypass       | ic2       | 50%     |
| SL           | ic1       | 28.1%   |

## Conclusion

- gpucc
  - Open Source
  - High performance
  - CUDA compiler
- Generated code is faster than or on par with nvcc's
- Faster or comparable compilation times
- Now part of clang/LLVM
- Research Opportunities:
  - Frontend: more language extensions? Other languages?
  - Backend: more architectures (e.g., CUDA on AMDGPU)? Target SASS?
  - Debugging: better profiling? Static analysis?
  - Optimizer: more optimizations?

# Questions?