

# ML Compiler on Heterogeneous Computer Architecture

Tsung Tai Yeh

Department of Computer Science

National Yang-Ming Chiao Tung University

## Acknowledgements and Disclaimer

- Slides were developed with reference to CS 15-779, Advanced Topics in Machine Learning Systems (LLM Edition), CMU, 2025
- AI System: https://github.com/Infrasys-AI/AISystem

### **Outline**

- Heterogeneous Computer Architecture
  - CPU+GPU
  - CPU+ASIC
- ML Compiler
  - MLIR
  - IREE
- MegaKernel + Mirage on GPU
  - Domain-Specific Language

# What is heterogeneous SoC?

- Heterogeneous computer architecture
  - A chip contains a CPU and multiple specialized functional units



| Chip                         | Tesla FSD Chip                                                 | Qualcomm Snapdragon 865 (Galaxy S20, March 6, 2020)       |
|------------------------------|----------------------------------------------------------------|-----------------------------------------------------------|
| Technology node              | Samsung 14 nm process                                          | TSMC's advanced 7 nm (N7P)                                |
| CPU                          | 3x (4-core) Cortex-A72                                         | 4x Cortex-A77, 4x Cortex-A55 (4 high power, 4 low power)  |
| GPU                          | Custom GPU, 0.6 TFLOPS @ 1 GHz                                 | Adreno 650, 1.25 TFLOPS @ ~700 MHz                        |
| NPU (AI accelerator)         | 2x Tesla NPU, each 37 TOPS (total 74 TOPS)                     | Hexagon 698 @ 15 TOPS                                     |
| Memory (cache)               | 2x 32 MB SRAM for NPUs                                         | 1 MB L2, 4 MB L3, and 3 MB system-wide cache              |
| Memory (RAM)                 | 8 GB LPDDR4X, 2x 64-bit, bandwidth 111 GB/s                    | 16 GB LPDDR5, 4x 16-bit, bandwidth 71.30 GB/s             |
| ISP (image signal processor) | 24-bit? 1 billion pixels per second                            | Spectra 480, dual 14-bit CV-ISP, 2 Gpixel/s, H.265 (HEVC) |
| Secure processing unit       | "Security system"; verifies that code has been signed by Tesla | Qualcomm SPU230, EAL4+ certified                          |

## Why heterogeneous computer architecture?



General-purpose processors are no longer getting much faster or more power-efficient because of the slowdown of Moore's Law and Dennard scaling

# Why heterogeneous computer architecture?



## **Evolution of Computer Architecture**



## Hetero Computer Architecture (CPU + GPU)

- Step 1: CPU sends data from its host memory to device memory
- Step 2: CPU asks GPU to begin the execution
- Step 3: GPU sends results back to the CPU
- What are the advantages of using this heterogeneous architecture?



## Hetero Computer Architecture (CPU + ASIC)

- Two types of heterogeneous computer architecture
  - Discrete CPU+ASIC (separate DRAM)
  - Integrated CPU+ASIC (shared DRAM)



## Hetero Computer Architecture (CPU+ASIC)

- Post-Moore era and dark silicon
  - The number of on-chip accelerators is rising
  - Applications will only use a subset of processors/accelerators at a time
  - Such a heterogeneous architecture is compatible with dark silicon



- 2010 Apple A4: 65 nm TSMC, 53 mm², 4 accelerators
- 2014 Apple A8: 20 nm TSMC, 89 mm², 28 accelerators
- 2019 Apple A12: 7 nm TSMC, 83 mm², 42 accelerators

## Hetero Computer Architecture (GPU)

A GPU includes FP units, SFUs (Special Function Units), Ray Tracing (RT) Cores, and Tensor Cores





- Program Compilation
  - Programming model?
  - Data/Kernel mapping/partition?
  - Concurrent execution?






- Hardware
  - Packaging on Chiplet
  - Network-on-Chip (NoC)
    - Photonic Integrated Circuit





Trade-off between performance and flexibility



## **Takeaway Questions**

- How can the performance of a processor be improved?
  - (A) Increase the size of cache
  - (B) Add specialized engines in the processor
  - (C) Utilize high bandwidth memory (HBM)
- What are the benefits of heterogeneous computer architecture?
  - (A) Improve energy efficiency of the processor
  - (B) Facilitate parallel computing
  - (C) Reduce memory access latency

## Computer Language Stacks



## **Compiler Basics**



## LLVM Compiler Architecture



## LLVM Compiler Architecture



## What is an AI Compiler?

Translate the operators of ML models to hardware



## What is an AI Compiler?



# AI Compiler: Stage I

## ML Model graph

- Static model graph: Python -> ONNX
- Graph rewrite/Optimizer

#### Performance

- Op kernel libraries (cuDNN, CMSIS-NN ...)
- Further performance improvement through op scheduling, tiling, and fusion



## AI Compiler: Stage II

## ML Model graph

- Transforms PyTorch expression into IR
- Optimizes Tensor IR

#### Performance

- Operator lowering
- Inter-op optimization
- Static/dynamic graphs
- Does not rely only on customized op libraries



## AI Compiler Frontend

#### Front-end compilation

- Goal
  - Parse model graphs from different AI system frameworks
  - Transforms model graphs into IR

#### Tasks

- Input format of ML models (TensorFlow, PyTorch, ONNX, ...)
- Transformation: transform the model into a unified representation
  - TVM Relay, PyTorch ATen (TorchScript)
- High-level IR/Graph IR
  - Hardware independent
  - Operator/Tensor expression

## AI Compiler Frontend

- Front-end compilation
  - Tasks
    - Computational Graph Optimizations
      - Algebraic simplification
      - Operator Fusion
      - Operator Sinking
      - Static memory planning
      - Tensor Layout transformation

## AI Compiler High-Level IR

- Layer-level IR
  - Express ML model structure as a calculation graph
  - High-level abstraction
  - Optimization
    - DSE, operator fusion, ...
  - Cross-platform



## Graph IR

- Express ML model as a computation graph
- Tensor
- Operator
- Dependency






- Tensor
  - Shape [2, 3, 4, 5]
  - Layout [N, C, H, W] or [N, H, W, C]
  - Type [int, float, string, ...]
- Operator
  - Algebra operator
  - Pre-defined operators

| Add   | Log       | While     |
|-------|-----------|-----------|
| Sub   | MatMul    | Merge     |
| Mul   | Conv      | BroadCast |
| Div   | BatchNorm | Reduce    |
| Relu  | Loss      | Map       |
| Floor | Sigmoid   |           |

- Directed Acyclic Graph (DAG)
  - Operator, Tensor, control flow (For/While), dependency
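The graph IR elements above (tensors, operators, dependencies) can be sketched as a small data structure. This is a minimal illustration, not any framework's actual IR: the class names `Tensor` and `Operator` and the helper functions `build_graph` and `schedule` are hypothetical, and control flow is left out.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass
class Tensor:
    name: str
    shape: tuple
    dtype: str = "float32"


@dataclass
class Operator:
    name: str
    op_type: str   # e.g. "MatMul", "Add", "Relu"
    inputs: list   # names of the input tensors
    output: Tensor


def build_graph():
    # y = Relu(MatMul(x, w) + b); operators are listed out of order on purpose.
    return [
        Operator("relu", "Relu", ["t1"], Tensor("y", (2, 4))),
        Operator("mm", "MatMul", ["x", "w"], Tensor("t0", (2, 4))),
        Operator("add", "Add", ["t0", "b"], Tensor("t1", (2, 4))),
    ]


def schedule(ops):
    # An operator depends on whichever operator produces one of its inputs.
    producer = {op.output.name: op.name for op in ops}
    deps = {
        op.name: {producer[t] for t in op.inputs if t in producer}
        for op in ops
    }
    # Because the graph is a DAG, a topological order is a valid execution order.
    return list(TopologicalSorter(deps).static_order())


print(schedule(build_graph()))
```

Graph inputs such as `x`, `w`, and `b` have no producer, so they impose no ordering constraint.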




- Static Computational Graph
  - An AI system framework (e.g. TensorFlow) parses the API calls used to describe the ML model
  - Fixed before execution
  - Use static data structure to describe model graph topology

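```
# Assumes the MindSpore framework, whose nn.Cell / construct API this snippet uses
from mindspore import nn


class Network(nn.Cell):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        # One Dense (fully connected) layer followed by ReLU
        self.dense_relu_sequential = nn.SequentialCell(
            nn.Dense(28*28, 512),
            nn.ReLU())

    def construct(self, x):
        # Forward pass: flatten the input, then apply Dense + ReLU
        x = self.flatten(x)
        logits = self.dense_relu_sequential(x)
        return logits
```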



- Dynamic Computational Graph
  - Built on-the-fly as operations are performed
  - Define-by-run offers greater flexibility
    - Good for handling complex and variable-structured data
      - Time-series data: audio
      - Graph data: social networks
      - Multi-modal data: combinations of different variable-structured data types

## Operator fusion



- A is called twice → A is called once (buffer A's output)
- Three kernel calls (A, B, C) → reuse the intermediate data buffer

- Operator fusion
  - Reduce memory reads/writes of intermediate tensors
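A minimal sketch of the effect, using plain Python lists in place of device tensors. The three element-wise "kernels" (scale, add, ReLU) and the function names are hypothetical stand-ins chosen for illustration.

```python
def unfused(xs):
    # Three separate "kernels"; each one writes a full intermediate tensor
    # that the next kernel has to read back.
    a = [x * 2.0 for x in xs]       # kernel A
    b = [v + 1.0 for v in a]        # kernel B
    c = [max(v, 0.0) for v in b]    # kernel C (ReLU)
    return c


def fused(xs):
    # One "kernel": the intermediate values stay in a local variable
    # (a register, in GPU terms) and are never written to memory.
    out = []
    for x in xs:
        v = x * 2.0
        v = v + 1.0
        out.append(max(v, 0.0))
    return out


print(unfused([-1.0, 0.5]), fused([-1.0, 0.5]))
```

Both versions compute the same result; the fused one skips the two intermediate buffers `a` and `b`.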



- How to fuse operators?
  - TVM dominator tree (In a DAG)
  - Dominator
    - Node X dominates node Y iff all paths from the entry to Y go through X.
    - Node A dominates node C (A dom C)
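The dominator definition above can be turned into a short computation. This is a sketch under two assumptions: the graph is a DAG with a single entry node, and every other node is reachable from it. The function name `dominators` and the diamond-shaped example graph are illustrative only and are not taken from TVM.

```python
from graphlib import TopologicalSorter


def dominators(preds, entry):
    """preds maps each node to the set of its direct predecessors."""
    dom = {}
    # In a DAG, visiting nodes in topological order means every predecessor
    # is finished before the node itself, so a single pass is enough.
    for node in TopologicalSorter(preds).static_order():
        if node == entry:
            dom[node] = {entry}
        else:
            # X dominates node iff X dominates all of its predecessors
            # (or X is the node itself).
            common = set.intersection(*(dom[p] for p in preds[node]))
            dom[node] = common | {node}
    return dom


# Diamond-shaped DAG: A -> B, A -> C, B -> D, C -> D
preds = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
dom = dominators(preds, "A")
print(dom["C"], dom["D"])
```

Here A dominates C, and D is dominated only by A and itself, because D can be reached through either B or C.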

#### CFG (Control Flow Graph)



- How to fuse operators?
  - TVM dominator tree (In a DAG)

#### **Dominator Tree:**





- The purpose of the dominator tree
  - Check the path from each node to its dominator node
  - Fuse nodes that do not affect the rest of the nodes
  - How to create a dominator tree?
    - Create a DFS tree based on the DAG
    - Create the DOM (dominator) tree
    - Examine a group of nodes to check whether multiple nodes can be fused



- Rules of operator fusion
  - Injective (one-to-one map): Add, pointwise
  - Reduction: sum/max/min
  - Complex-out-fusable: conv2D
  - Opaque (cannot be fused): sort
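The four categories can drive a simple greedy grouping pass. The sketch below is a deliberately simplified illustration: it handles only a linear chain of operators, and the table `PATTERN` and the functions `can_append` and `fuse_chain` are hypothetical names, not TVM's API. The assumed rules are that injective ops fuse with each other, a reduction may absorb the injective ops feeding it, a complex-out-fusable op may absorb the element-wise ops on its output, and opaque ops stay alone.

```python
PATTERN = {
    "add": "injective", "relu": "injective", "exp": "injective",
    "sum": "reduction",
    "conv2d": "complex-out-fusable",
    "sort": "opaque",
}


def can_append(group_kind, op_kind):
    if "opaque" in (group_kind, op_kind):
        return False
    if group_kind == "injective":
        return op_kind in ("injective", "reduction")
    if group_kind == "complex-out-fusable":
        return op_kind == "injective"
    return False  # a group that ends in a reduction is closed


def fuse_chain(ops):
    groups = []  # each entry is [kind, [op names]]
    for op in ops:
        kind = PATTERN[op]
        if groups and can_append(groups[-1][0], kind):
            groups[-1][1].append(op)
            if kind == "reduction":
                groups[-1][0] = "reduction"
        else:
            groups.append([kind, [op]])
    return [names for _, names in groups]


print(fuse_chain(["conv2d", "add", "relu", "sort", "exp", "sum", "add"]))
```

Each inner list in the result corresponds to one fused kernel.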



- Data layout alignment
  - Unaligned tensor data will increase the memory transactions
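A rough way to see the cost is to count how many fixed-size memory segments a contiguous access touches. The 32-byte segment size below is an assumption made for illustration (real hardware granularities vary), and `transactions` is a hypothetical helper.

```python
def transactions(start_byte, num_bytes, segment=32):
    """Count the `segment`-byte memory segments touched by one contiguous access."""
    first = start_byte // segment
    last = (start_byte + num_bytes - 1) // segment
    return last - first + 1


# Reading 128 bytes (e.g. 32 float32 values):
print(transactions(0, 128))   # access starts on a segment boundary
print(transactions(4, 128))   # same amount of data, shifted by one float
```

The shifted access straddles one extra segment, so the same amount of useful data costs one more memory transaction.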



- Data layout (N, C, H, W)
  - N: batch; C: channels; H: height; W: width
  - NCHW: arranges data of the same channel in consecutive memory
  - Good for the computations of GPU (data parallel)




- Data layout (N, H, W, C)
  - NHWC: arranges the data at the same location in different channels in consecutive memory, which suits e.g. Conv1x1
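The two layouts differ only in how a logical index (n, c, h, w) maps to a flat memory offset. A minimal sketch, with hypothetical helper names and a small example shape chosen for illustration:

```python
def offset_nchw(n, c, h, w, C, H, W):
    # W varies fastest, then H, then C: one channel plane is contiguous.
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    # C varies fastest: all channels of one pixel are contiguous.
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 2, 2
# Distance in memory between channel 0 and channel 1 of the same pixel:
stride_c_nchw = offset_nchw(0, 1, 0, 0, C, H, W) - offset_nchw(0, 0, 0, 0, C, H, W)
stride_c_nhwc = offset_nhwc(0, 1, 0, 0, C, H, W) - offset_nhwc(0, 0, 0, 0, C, H, W)
print(stride_c_nchw, stride_c_nhwc)
```

In NCHW the next channel of a pixel is a whole H×W plane away, while in NHWC it is the adjacent element, which is why a 1x1 convolution (a reduction over channels at each pixel) reads contiguous memory in NHWC.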





- Data layout (N, C, H, W)
  - PyTorch on NPU/GPU uses NCHW data layout
  - TensorFlow uses NHWC data layout



- Memory optimization
  - Attention memory usage for a deep Transformer (64 layers and 4 heads), stored vs. recomputed during the backward pass
  - BERT (hidden size 768) needs 73 GB of memory when the batch size is 64

| Data type                                   | Stored | Recomputed |
|---------------------------------------------|--------|------------|
| 1024 text tokens (several paragraphs)       | 1.0 GB | 16 MB      |
| 32×32×3 pixels (CIFAR-10 image)             | 9.6 GB | 151 MB     |
| 64×64×3 pixels (Imagenet 64 image)          | 154 GB | 2.4 GB     |
| 24,000 samples (~2 seconds of 12 kHz audio) | 590 GB | 9.2 GB     |
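The table entries appear consistent with a simple estimate: one attention matrix of size (sequence length)² per head, stored for every layer or kept for only one layer at a time when recomputed. The sketch below assumes 4 bytes per element and decimal gigabytes; both are assumptions, since the slide does not state them, and `attention_matrix_bytes` is a hypothetical helper.

```python
def attention_matrix_bytes(seq_len, layers=64, heads=4, bytes_per_elem=4):
    # One seq_len x seq_len attention matrix per head.
    per_layer = heads * seq_len * seq_len * bytes_per_elem
    stored = layers * per_layer   # keep every layer's matrices for the backward pass
    recomputed = per_layer        # keep only the layer currently being processed
    return stored, recomputed


for name, n in [("1024 text tokens", 1024), ("CIFAR-10 image", 32 * 32 * 3),
                ("ImageNet 64 image", 64 * 64 * 3), ("2 s of 12 kHz audio", 24000)]:
    stored, recomputed = attention_matrix_bytes(n)
    print(f"{name}: stored {stored / 1e9:.1f} GB, recomputed {recomputed / 1e6:.0f} MB")
```

Under these assumptions the audio row gives about 590 GB stored and 9.2 GB recomputed, matching the table.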

- Memory optimization
  - Static memory allocation
    - Parameters, constant, output
    - Allocate memory in the model initialization stage
  - Dynamic memory allocation
    - Output tensor, workspace tensor (intermediate tensor)
    - Allocate memory (dynamic: varying batch size, static: fixed batch size)

- Memory optimization
  - In-place operation: overwrite the input buffer when the next operator is an element-wise operator
  - Memory sharing: two operators can share a buffer when their sizes are the same and there is no data dependency between them
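Memory sharing can be sketched as a greedy planner that reuses a buffer once the tensor occupying it is dead. This is an illustration under simplifying assumptions (tensors are listed in order of first use, and only buffers of exactly the same size are shared); `plan_buffers` and the example lifetimes are hypothetical.

```python
def plan_buffers(tensors):
    """tensors: list of (name, size, first_use_step, last_use_step),
    sorted by first_use_step. Returns (name -> buffer id, buffer count)."""
    free = {}        # size -> list of buffer ids that can be reused
    active = []      # (last_use_step, size, buffer id) of live tensors
    assignment = {}
    next_id = 0
    for name, size, first, last in tensors:
        # Release buffers whose tensors are no longer needed at this step.
        still_live = []
        for end, sz, buf in active:
            if end < first:
                free.setdefault(sz, []).append(buf)
            else:
                still_live.append((end, sz, buf))
        active = still_live
        # Reuse a free buffer of the same size, otherwise allocate a new one.
        if free.get(size):
            buf = free[size].pop()
        else:
            buf = next_id
            next_id += 1
        assignment[name] = buf
        active.append((last, size, buf))
    return assignment, next_id


tensors = [("t0", 1024, 0, 1), ("t1", 1024, 1, 2),
           ("t2", 1024, 2, 3), ("t3", 512, 3, 4)]
assignment, num_buffers = plan_buffers(tensors)
print(assignment, num_buffers)
```

Here `t2` reuses the buffer of `t0`, so four tensors need only three buffers.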



#### AI Compiler Low-Level IR

#### Low-level IR

- Describes the computation of an ML model in a more <u>fine-grained representation</u> than the high-level IR
- Enables target-dependent optimization
- Halide-based IR
  - Separation of computation and schedule
  - Choose the best schedule for a specific target platform
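The separation of computation and schedule can be imitated in plain Python: one function states what each output element is, and interchangeable generators decide the order in which elements are visited. This is only an analogy to the Halide idea, with hypothetical names throughout.

```python
def compute(a, b, i, j):
    # The algorithm: what each output element is.
    return a[i][j] + b[i][j]


def schedule_row_major(rows, cols):
    for i in range(rows):
        for j in range(cols):
            yield i, j


def schedule_tiled(rows, cols, tile=2):
    # Same iteration space, visited tile by tile for better locality.
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    yield i, j


def run(a, b, schedule):
    rows, cols = len(a), len(a[0])
    out = [[0] * cols for _ in range(rows)]
    for i, j in schedule(rows, cols):
        out[i][j] = compute(a, b, i, j)
    return out


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(run(a, b, schedule_row_major) == run(a, b, schedule_tiled))
```

Changing the schedule changes the visiting order but never the result, which is what lets a compiler search over schedules for a given target.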



- Back-end compilation
  - Goal
    - Transform ML graph to specific hardware
    - Code generation: LLVM/CUDA/OpenCL ...
  - Tasks
    - Hardware Specific Optimization
      - Memory allocation
      - Parallelization
    - Scheduling
      - Auto Scheduling: polyhedral, Halide

Hardware-specific optimizations



- Hardware intrinsic mapping
  - Transform a certain set of low-level IR to kernels
  - TVM extensible tensorization
    - Declare the behavior of each hardware intrinsic and the lowering rule for intrinsic mapping
    - Enable the compiler backend to <u>apply optimized micro-kernels to a specific pattern of operations</u>



- Memory allocation and fetching
  - E.g. the GPU memory <u>hierarchy requires efficient memory allocation and fetching techniques for improving data locality</u>
  - TVM memory scope
    - Tag a compute stage as shared or thread-local
      - Shared: generates code with shared memory allocation
      - Properly insert memory barrier



- Memory latency hiding
  - Reordering the execution pipeline
  - In TPU-like accelerators with decoupled access-execute (DAE)
    - Backend needs to perform scheduling and fine-grained sync to produce the correct and efficient code
  - TVM virtual threading schedule primitive
    - Virtually parallelized threads
    - Barriers + operations = a single instruction stream



- Loop oriented optimization
  - Loop fusion
    - Fuse loops with the same boundaries for better data reuse
  - Sliding window
    - Compute values when needed
    - Store them for data reuse until they are no longer required
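Both ideas can be shown on a one-dimensional example. The functions below are hypothetical illustrations: `fused_loop` merges two loops with identical bounds, and `sliding_sum` keeps only the last `width` produced values in a small buffer instead of materializing the whole intermediate array.

```python
from collections import deque


def two_loops(xs):
    squares = [x * x for x in xs]       # loop 1 writes an intermediate array
    return [s + 1 for s in squares]     # loop 2 reads it back


def fused_loop(xs):
    # Same bounds, so both bodies run in one loop with no intermediate array.
    return [x * x + 1 for x in xs]


def sliding_sum(xs, produce, width=3):
    window = deque(maxlen=width)        # holds only the values still needed
    out = []
    for x in xs:
        window.append(produce(x))       # compute when needed; the oldest value drops out
        if len(window) == width:
            out.append(sum(window))
    return out


print(fused_loop([1, 2, 3]), sliding_sum([1, 2, 3, 4], lambda x: x * x))
```

Each produced value is computed once and reused by up to `width` consecutive outputs before it leaves the window.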



- Parallelization
  - Halide uses a <u>schedule primitive called parallel</u>
    - Specify the parallelized dimension of the loops
  - The nested polyhedral model detects hierarchical parallelism across levels of tiling and striding



- Back-end compilation
  - Tasks
    - Auto-tuning
      - Parameterization, cost model
    - Using kernel libraries
      - NVIDIA cuDNN/TensorRT, AMD MIOpen
    - Low-level IR/ Operator IR
      - Halide IR
    - Compilation scheme
      - Just-In-Time (JIT), Ahead-Of-Time (AOT)

## **Takeaway Questions**

- What are the jobs of an AI compiler?
  - (A) Handle tensor memory allocation
  - (B) Reorder the execution of the DL operators
  - (C) Generate assembly codes
- How does an AI compiler improve data reuse in local memory?
  - (A) Use the NCHW data layout
  - (B) Operator fusion
  - (C) Operator lowering

- Most high-level languages have their own AST
- The compilation process for ML graphs is fragmented
- MLIR allows developers to <u>use a unified codebase/framework</u> to do their optimizations and <u>develop some optimizations for multiple inputs</u>



- MLIR's input
  - applications, compilers, C program, etc.
- Within MLIR
  - Implement multiple Dialects for distinct inputs
  - Use Dialect to deal with tensors



- Once we have an optimal IR
  - MLIR can lower it onto the backends such as LLVM for CPU ...
  - If the target hardware is an FPGA or TPU, vendor tools are needed for the final compilation



- MLIR Compiler Infrastructure
  - A set of optimization, code conversion, and code generation pipelines



#### MLIR Dialect

- One way to express IR from other specific IRs
- Every IR can be transformed into the corresponding MLIR dialect
- Each programming language's dialect (Tensor dialect, HLO dialect, LLVM IR dialect) inherits from mlir::Dialect
- AST (Abstract Syntax Tree)



#### MLIR Dialect

- DRR (Declarative Rewrite Rule): transforms between different dialects
- ODS (Operation Definition Specification): defines operations
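In MLIR itself, rewrite rules and operation definitions are written in TableGen, not Python. The sketch below only mimics the idea of a pattern rewrite on a tiny expression tree, using the rule transpose(transpose(x)) → x. The classes `Value` and `Op` and the function `simplify` are hypothetical and are not part of any MLIR API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    name: str


@dataclass(frozen=True)
class Op:
    name: str        # operation name, e.g. "toy.transpose"
    operand: object  # a Value or another Op (single-operand ops only)


def simplify(node):
    """Apply the rule transpose(transpose(x)) -> x bottom-up."""
    if isinstance(node, Value):
        return node
    inner = simplify(node.operand)
    if (node.name == "toy.transpose"
            and isinstance(inner, Op)
            and inner.name == "toy.transpose"):
        return inner.operand   # the two transposes cancel out
    return Op(node.name, inner)


x = Value("x")
double = Op("toy.transpose", Op("toy.transpose", x))
triple = Op("toy.transpose", double)
print(simplify(double), simplify(triple))
```

A rule like this pairs a source pattern (two nested transposes) with a result pattern (the original value), which is the shape a declarative rewrite takes.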



- 'acc' Dialect
- 'affine' Dialect
- 'amdgpu' Dialect
- 'amx' Dialect
- 'arith' Dialect
- 'arm\_neon' Dialect
- 'arm\_sve' Dialect
- 'ArmSME' Dialect

- MLIR Operation
  - Output: %t\_tensor
  - Operation: toy.transpose
  - Input: %tensor
  - Transforms `tensor<2x3xf64>` into `tensor<3x2xf64>`
  - The location of the transpose is "example/file/path", line 12, column 1

Operations (excerpt from the `gpu` dialect):

- gpu.all\_reduce (gpu::AllReduceOp)
- gpu.alloc (gpu::AllocOp)
- gpu.barrier (gpu::BarrierOp)
- gpu.binary (gpu::BinaryOp)
- gpu.block\_dim (gpu::BlockDimOp)
- gpu.block\_id (gpu::BlockIdOp)
- gpu.cluster\_block\_id (gpu::ClusterBlockIdOp)
`%t_tensor = "toy.transpose"(%tensor) {inplace = true} : (tensor<2x3xf64>) -> tensor<3x2xf64> loc("example/file/path":12:1)`

Simple Matmul Kernel

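```
import numpy as np

M = 2     # Rows in arg0
K = 2816  # Columns in arg0, rows in arg1
N = 1280  # Columns in arg1

# Illustrative inputs (not in the original slide): f16 operands, f32 result
arg0 = np.ones((M, K), dtype=np.float16)
arg1 = np.ones((K, N), dtype=np.float16)
result = np.zeros((M, N), dtype=np.float32)

# Matrix multiplication with f16 -> f32 promotion
for i in range(M):
    for j in range(N):
        acc = np.float32(0.0)  # float32 accumulator
        for k in range(K):
            a = np.float32(arg0[i][k])  # f16 -> f32
            b = np.float32(arg1[k][j])  # f16 -> f32
            acc += a * b
        result[i][j] = acc  # store result as float32
```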

#### matmul.mlir

```
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
func.func @matmul(%arg0: tensor<2x2816xf16>, %arg1: tensor<2816x1280xf16>) -> tensor<2x1280xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = tensor.empty() : tensor<2x1280xf32>
  %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<2x1280xf32>) -> tensor<2x1280xf32>
  %2 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "reduction"]} ins(%arg0, %arg1 : tensor<2x2816xf16>, tensor<2816x1280xf16>) outs(%1 : tensor<2x1280xf32>) {
  ^bb0(%in: f16, %in_0: f16, %out: f32):
    %3 = arith.extf %in : f16 to f32
    %4 = arith.extf %in_0 : f16 to f32
    %5 = arith.mulf %3, %4 : f32
    %6 = arith.addf %out, %5 : f32
    linalg.yield %6 : f32
  } -> tensor<2x1280xf32>
  return %2 : tensor<2x1280xf32>
}
```

#### IREE – Intermediate Representation Execution Environment

#### IREE

- An MLIR-based compiler for ML programs
- Takes ML workloads from various frontends (PyTorch, ...) and executes them on different backends (x86, Arm, NVIDIA GPUs, AMD GPUs, ...)




#### IREE – Intermediate Representation Execution Environment

#### IREE Compiler Design



## Mega-Kernel on the GPU

A GPU includes multiple specialized engines (CUDA cores, Tensor Cores, ...)





# Existing Kernel-Per-Operator Approach



#### Limitations

**No Inter-Layer Pipelining** 

Kernel barriers prevent interlayer pipelining



#### Limitations

#### No Overlapping

Coarse-grained dependency prevents comp. & comm. overlap



# Kernel-Per-Operator vs. Mega-Kernel



# Key Challenges of Mega-Kernel

1. How to manage dependencies?
   - Challenge: no kernel barriers in a mega-kernel
   - Approach: task graph
2. How to handle dynamism?
   - Challenge: continuous batching, prefill/decode, paged/radix attention, speculative decoding
   - Approach: in-kernel parallel runtime
3. How to optimize performance?
   - Challenge: existing compilers target individual kernels
   - Approach: Mirage Superoptimizer
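The task-graph idea can be sketched as a scheduler that tracks a dependency counter per task and releases a task as soon as its own prerequisites finish, with no global barrier between "layers". This is a single-threaded simulation written for illustration; the function `run_task_graph` and the task names are hypothetical and do not reflect the actual in-kernel runtime.

```python
from collections import deque


def run_task_graph(tasks, deps):
    """tasks: name -> callable; deps: name -> set of prerequisite task names."""
    remaining = {t: len(deps.get(t, ())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t, prerequisites in deps.items():
        for p in prerequisites:
            dependents[p].append(t)
    ready = deque(sorted(t for t, n in remaining.items() if n == 0))
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                      # a worker would execute the task here
        order.append(t)
        for d in dependents[t]:         # signal the tasks waiting on this one
            remaining[d] -= 1
            if remaining[d] == 0:
                ready.append(d)
    return order


log = []
tasks = {name: (lambda n=name: log.append(n)) for name in
         ["matmul_tile0", "matmul_tile1", "allreduce_tile0", "allreduce_tile1"]}
deps = {"allreduce_tile0": {"matmul_tile0"}, "allreduce_tile1": {"matmul_tile1"}}
order = run_task_graph(tasks, deps)
print(order)
```

`allreduce_tile0` becomes ready as soon as `matmul_tile0` finishes, regardless of `matmul_tile1`; this fine-grained dependency is what allows computation and communication to overlap.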

## Mirage: A SuperOptimizer for ML

Can we represent FlashAttention as a graph optimization?

Is it possible to implement FlashAttention as a combination of matmul, exp, add,



### Mirage: A SuperOptimizer for ML

 Key idea: automatically generate highly-optimized GPU kernels for DNNs



- Less engineering effort: thousands of lines of CUDA code → a few lines of Python code in Mirage
- Better performance: outperforms existing systems by 1.1-3.5x
- Faster adaptation: day-0 support for new models; no manual effort

#### Hierarchical Graph Representation



#### Example: RMSNorm & MatMul in LLMs

 Existing systems launch two kernels since Y does not fit in shared memory



#### µGraph for RMSNorm & MatMul

 Existing systems launch two kernels since Y does not fit in shared memory
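One algebraic observation that makes a single-kernel version plausible is that the division by the row's RMS commutes with the matrix multiplication, so the normalized tensor Y never has to be materialized. The sketch below checks this identity numerically for one input row. It is an illustration under stated assumptions: the usual RMSNorm definition with a weight vector `g` and a small `eps`, and hypothetical function names. It is not a claim about the exact µGraph Mirage produces.

```python
import math


def matvec(v, W):
    return [sum(v[k] * W[k][j] for k in range(len(v))) for j in range(len(W[0]))]


def rms(x, eps=1e-6):
    return math.sqrt(sum(v * v for v in x) / len(x) + eps)


def two_kernels(x, g, W):
    r = rms(x)
    y = [xi * gi / r for xi, gi in zip(x, g)]   # kernel 1: RMSNorm writes Y
    return matvec(y, W)                          # kernel 2: MatMul reads Y back


def one_kernel(x, g, W):
    # The matmul and the sum of squares both read x directly;
    # the division is applied once at the end.
    acc = matvec([xi * gi for xi, gi in zip(x, g)], W)
    r = rms(x)
    return [a / r for a in acc]


x = [1.0, 2.0, 3.0]
g = [0.5, 1.0, 1.5]
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(two_kernels(x, g, W), one_kernel(x, g, W))
```

Both functions agree up to floating-point rounding, while the second never stores the intermediate `y`.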



## High-Performance µGraph

- Key challenges in discovering high-performance µGraphs
  - How to generate candidate µGraphs?
  - How to verify their correctness?
  - Mirage system



### Hardware-Customized µGraphs

 Find µGraphs similar to expert-written implementations for attention on NVIDIA A100 GPU



### Neural Processing Unit (NPU)

- Intel NPU
  - Hardware Acceleration Blocks
    - Handle GEMM, CONV, ...
  - Streaming Hybrid Architecture Vector Engines (SHAVE)
    - Perform parallel computing for general needs
  - DMA Engines
    - Move data between DRAM and the software-managed cache


### Heterogeneity on ASIC

- NPU
  - MAC engine + Specialized engines
- CPU on AI PC
  - AMD Ryzen AI Pro 300 (CPU+NPU+GPU)





### **Takeaway Questions**

- What are the jobs of MLIR?
  - (A) Operator definition
  - (B) Operator lowering
  - (C) Instruction selection
- What are the benefits of MegaKernel?
  - (A) Overlapping GPU specialized engine execution
  - (B) GPU register reuse
  - (C) Decrease the kernel launch overhead

### Future of AI Compiler

- Future of AI compiler
  - In model inference
    - Ahead-of-Time (AoT) compilation
  - In model training
    - Just-in-Time (JIT) compilation
- The form of IR
  - Need one IR that can support diverse programming languages and ML frameworks
  - Good for cross-platform

### Future of AI Compiler

- Auto-parallelization
  - Automatically execute ML models through different parallelization approaches
  - Distributed computing (Model training)
  - Parallel computing in one chip
- Auto Code/kernel generation
  - Not only Domain-Specific Language (DSL)
  - Match diverse hardware platforms