You and Your Research How To Become A Hacker 编程的智慧 An open letter to those who want to start programming Teach Yourself Programming in Ten Years

紀錄一般知識，盡量不針對特定編譯器或硬體。可以舉例。
課程:
- CS143: Compilers
- 15-745 Optimizing Compilers for Modern Architectures

如何能深刻了解編譯器?

Language Implementation Patterns
- 前端，直譯器相關。
Compilers: Principles, Techniques, and Tools
- 優化相關。前端不推薦。
Engineering a Compiler, Second Edition - 入門
Advanced Compiler Design and Implementation - 進階
寫一個編譯器或是參與現有編譯器的開發。

有用連結:

What LLVM can do for you

$ wget http://llvm.org/devmtg/2012-04-12/Slides/Workshops/memo.cc
$ wget http://llvm.org/devmtg/2012-04-12/Slides/Workshops/examples.tbz2

編譯流程

Compiler
- 源代碼 (Source Code) → 語彙分析 (Lexical Analysis) → 語法分析 (Syntax Analysis) → 語意分析 (Semantic Analysis) → 生成中間表示式 (IR Generation) → 優化中間表示式 (IR Optimization) → 生成目標代碼 (Code Generation) → 優化目標代碼 (Optimization) → 目標代碼 (Machine Code)

前端

Language Implementation Patterns
Lex & Yacc
- 《自制编程语言》相关资料
- win_sjis
  - 重點著重在如何建立 AST。前端會遍歷 AST 多次，每次遍歷求得的資訊會記在 AST 節點上, 供下一次遍歷使用。
怎样写一个解释器 - 王垠
Parsing
- The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out at the semantic analysis (contextual analysis) step.
What is the difference between “keyword” and “reserved word”?
- Reserved word
  - In a computer language, a reserved word (also known as a reserved identifier) is a word that cannot be used as an identifier, such as a name of a variable or function or a label – it is "reserved from use". This is a syntactic definition, and a reserved word may have no meaning. This is a syntactic definition, and a reserved word may have no meaning. A closely related and often conflated notion is a keyword which is a word with special meaning; this is a semantic definition.
Why is the difference between an expression and a statement
- Expression
- Statement
- 表達式 (expression) 和述句 (statement) 主要差別在於前者產生値，而後者產生動作。後者可藉由前者產生的值，進行賦値或是流程控制。

JTC1/SC22/WG14 - C
- 閱讀標準，理解用詞。
- SDCC
JTC1/SC22/WG21 - C++

Data type
- Primitive data type
  - 基本型別。下述衍生型別基本上都是以基本型別建構而成。
- Composite data type
  - 複合型別。
- Reference type
  - 引用類型。Java 裡面除基本型別 (Primitive data type) 之外，皆為引用類型。
Dangling else
- else 如何與 if 配對。

產生中間語言

請見 Three-Address Code IR。

對於中間語言常用的格式，Static single assignment form (SSA)，只需要知道底下幾點:

每個變量只被賦值一次 (類似函數式編程中 immutable 的概念)。
一般程序轉換成 SSA form 時，會針對控制流插入 Phi Node。
- Phi Node 在 LLVM 中被實現為一條 LLVM IR (What exactly PHI instruction does and how to use it in LLVM)。
SSA form 簡化後續優化所需實現。

優化中間語言

請見 IR Optimization、Global Optimization 和 Global Optimization, Part II。

生成目標代碼

Code Generation 包含以下步驟:

可以參考 LLVM 的流程 The high-level design of the code generator。

Introduction to Compiler Design: optimization and backend issues

指令選擇

Instruction selection 將 IR 轉換成 machine code，或是低階的 IR (接近目標機器碼)，但仍使用 pseudo register，留待之後的 register allocation 將目標機器的暫存器分配給 pseudo register。Instruction selection 一般是將 machine code 模板貼至 IR tree 上，稱為 tiling。一般來說，會將 IR tree 轉成 DAG (directed acyclic graph)。Tree 和 DAG 不同之處在於，節點在 tree 中只有一個父節點，而在 DAG 中可以有多個父節點。DAG 這個特性用來將 common subexpression 暴露出來。若一個節點為 common subexpression，在 tree 中會出現多次，而在 DAG 只會出現一次。DAG 有助於後續的 instruction selection 和 register allocation。

見 Compilers: Principles, Techniques, and Tools 第 8.9 節: Instruction Selection by Tree Rewriting。

Instruction Selection
Code-generator generators: generating the instruction selector
- Optimal code selection in DAGs
More Instruction Selection
Compilers: Principles, Techniques, and Tools - 8.9

LLVM 使用的是 SelectionDAG based instruction selector。利用 tblgen 生成 instruction selector。

Scheduling expression DAGs for minimal register need
Optimal Integrated Code Generation for Clustered VLIW Architectures
Sethi–Ullman algorithm
- Sethi-Ullman Scheduling
- lib/CodeGen/SelectionDAG/ScheduleDAGRRList.cpp (SelectionDAG Scheduling and Formation Phase)

暫存器配置

兩個述句 (statement) 之間的點被稱為 program point。一個變數被 define，代表該變數被寫入值; 一個變數被 use，代表該變數的值被讀出。針對變數 a 而言，

---------- program point
  a = 1; // define
----------
  b = a; // use
----------

一個變數被稱為 live，代表該變數其值在將來會被使用到，亦即在某個時間點被讀出。一個變數的 live range 是一個 program point 的集合，該變數在這些 program ponit 為 live。一個變數的 live interval 是一個最小區間，該區間包含該變數所有的 live range。live interval 相較於 live range 較不精確，但是其 register allocation 較為簡單。Register Allocation 第 64 頁演示的 linear scan 算法就是針對各個變數的 live interval 配置暫存器。編譯器在做 register allocation 時，會視情況針對某變數做 live range splitting，其目的是希望在 live range splitting 之後，能減少各變數之間 live range 重疊的情況。如果兩個變數的 live range 有所重疊，則代表這兩個變數無法分配到同一個暫存器，這會增加暫存器使用量。

Register allocation 包含兩個步驟: 首先，register allocation 釐清哪些變數將可存入暫存器; 接著，register assignment 將物理暫存器分配給那些變數。如果途中發現暫存器不敷使用，register allocation 演算法會選擇將某些暫存器存入內存，這個動作稱之為 spill。register allocation 演算法目標之一就是要降低 spill 的次數。一般來說，register allocation 演算法會選擇把擁有較長 live interval 的變數 spill 到內存。這是因為如果將暫存器分配給該變數，則該暫存器將有較長時間不能再被使用。所以，一般會將暫存器優先分配給 live interval 較短的變數。有時候，如果某暫存器其中的值可以在之後簡單的計算出來，算法可以選擇 spill 該暫存器，且之後不須從內存將其值讀出，只需要重新計算即可，這稱之為 rematerialization。

相較於 linear scan 是利用 live interval 此一較不精確的描述來分配暫存器，Register Allocation 第 109 頁演示的 graph coloring 算法利用 live range 分配暫存器。每個變數代表一個節點。如果變數 A 和變數 B 的 live range 有所重疊，則節點 A 和 B 之間有連線。針對各個變數的 live range 所建立的圖稱為 register interference graph (RIG)。此算法目的是要替 RIG 上的節點著色 (代表暫存器)，有連線的節點必須著以不同的顏色。

根據 register allocation 的範圍，可分為 local register allocation 和 global register allocation。前者針對一個 basic block 分配暫存器給各個變數，後者則針對數個 basic block。

LLVM 3.0 預計以 greedy register allocation 取代現行的 linear scan 算法。我想你現在具有足夠知識開始閱讀文章 Greedy Register Allocation in LLVM 3.0 了。 LLVM 3.0 有兩個 register allocator，分別是 basic allocator 和 greedy allocator。兩者都是根據變數的 live range 分配暫存器。它們都有兩個重要的資料結構: priority queue 和 live interval unions。Priority queue 以變數的 live range 的 spill weight 由高到低排序，將暫存器優先配給 spill 代價較高的變數。Live interval unions 存放的是已經分配到暫存器的 live range，目的是用來檢查某暫存器分配給那些 live range。底下略為描述 basic allocator 和 greedy allocator 的運作方式。

Basic allocator: 由 priority queue 中挑出 spill 代價最高的 live range，在此稱之為 live range A，試圖配置暫存器給它。如果成功，將配置結果更新到 live interval unions; 否則，則 spill live range A。這裡不需要檢查 live interval unions 選擇 spill 其它 live range，因為我們優先配置暫存器給 spill 代價較高的 live range。一般是透過插入 load/store 指令 spill 某個 live range，這會造出新的 live range (在 load 跟 store 之間)。這一個因 spill 造出的 live range 會塞入 priority queue 等待下一次分配，其 spill weight 設為無限大。等到下一次為該 spill weight 為無限大的 live range A 分配暫存器時，basic allocator 會選擇一暫存器，並檢視 live interval unions 看該暫存器分配給那些 live range，亦即與那些 live range 有 interfering。Basic allocator spill 那些倒楣的 live range，再將該暫存器配給該 live range A。
Greedy allocator: 與 basic allocator 的差別在，當沒有適當的暫存器可供分配給 live range A，greedy allocator 會由 live interval unions 中挑出 spill weight 較小的 live range B 並 spill 它 (稱為 evict)。Live range B 會被塞回 priority queue 等待下一次分配。注意! 在 basic allocator 中，不會挑 live interval unions 中 spill weight 較小的 live range 做 spill。此外，geedy allocator 會在 spill live range 之前先試著做 live range splitting，將 live range 分成較小塊的 live range，再將它們塞回 priority queue。

指令排程

請見 Code Optimization。

Instruction scheduling 和 register allocation 的順序如何調整及其優缺點請見 The phase order of Instruction Scheduling。

instruction scheduling 目的在於提高 ILP。
instruction scheduling 在 register allocation 之前的優點在於 ILP 能最大化。因為在 register allocation 之前，使用的是虛擬暫存器 (數量無限)，多條指令可以並行; 但這也是缺點，ILP 最大化之後，會對 register allocation 造成壓力，多條指令執行可能會使用超過物理暫存器數量的暫存器，register allocation 必須做 spill。
instruction scheduling 在 register allocation 之後的優缺點和上面剛好相反。
LLVM 在 register allocation 前後都會做 instruction scheduling。

Optimization

程序分析-原理部分(全)
- Dominator/Post-dominator
- Control Dependence

Intermediate Optimizations
[LLVMdev] Switch instruction lowering
- Multiway branch
  - Improving Switch Lowering for The LLVM Compiler System
  - A Superoptimizer Analysis of Multiway Branch Code Generation

Link Time Optimization

Link-time optimization

Exceptio Handling

目前所見的例外處理其實作上基本遵循 Itanium C++ ABI: Exception Handling。它實現所謂的 zero-cost exception handling，也就是例外處理不會造成程式正常執行流程效能上的影響。¹⁾

personality function/routine 是語言執行時期函示庫中的函式，用來作為 system unwind library 和 language-specific exception handling semantics 的介面。這是因為不同的程式語言對於例外處理有不同的行為，personality function 接受例外的內文，包含其型別和值。

LLVM

LLVM 3.0 針對例外處理主要提供以下 LLVM IR/Intrinsics 支援。前端在處理程式語言中的例外處理時，可以將其對映到 LLVM 中用來支援例外處理的指令，再交由之後的優化器和後端處理。底下將會以 C++ 中例外處理的 try/catch/throw 當作對映。

invoke: 針對在 try block 中的函式呼叫，改用 invoke 而非 call。如果在一個會丟出例外的函式中呼叫另一個函式，同樣也是用 invoke。
landingpad: 例外發生之後，會跳轉到一個特殊的基本塊。在這個基本塊中，landingpad 用來準備之後例外處理需要的資料。
resume
llvm.eh.typeid.for

以 The new LLVM exception handling scheme 說明 LLVM 如何實現例外處理。

try {
   ...
   MayThrowSomething();
   AnotherFunctionCall();
   ...
} catch (int i) {
  DoSomethingWithInt(i);
} catch (class A a) {
  DoSomethingWithA(a);
}

其 try block 對映的 LLVM IR 為，

invoke void @_Z17MayThrowSomethingv() to label %cont unwind label %lpad

我們可以看到在 try block 中的函式呼叫改用 invoke，其分支目標有二: label %cont 和 unwind label %lpad，分別為正常執行流程和例外處理流程。

lpad:
  %info = landingpad { i8*, i32 } personality @__gxx_personality_v0 catch @_ZTIi catch @_ZTl1A
  %except = extractvalue { i8*, i32 } %info, 0
  %selector = extractvalue { i8*, i32 } %info, 1
  %typeid = call i32 @llvm.eh.typeid.for(@_ZTIi)
  %match = icmp eq i32 %selector, %typeid
  br i1 %match, label %run_catch, label %try_next

landingpad 是用來準備例外資訊給後面的例外處理程序使用。landing pad (lpad) 中文意思是停機坪，這代表當例外發生時，執行流程會跳轉到這裡，再轉發到其它地方。%info 是該例外的資訊，{ i8*, i32 } 是其型別。其中 i8* 是 exception pointer，i32 是 selector。personality @__gxx_personality_v0 代表目前處理的是 C\+\+ 的例外，它知道如何將該例外與 catch clause 作 C\+\+ 型別比較。catch @_ZTIi 和 catch @_ZTl1A 則是列出所有的 catch clause，@_ZTIi 和 @_ZTl1A 是語言特定的全域變數，這裡用來代表 catch clause 的 C\+\+ 型別資訊。

術語

Clobbering²⁾
- clobbered register 代表該條指令執行時，其內容會被修改的暫存器。
Register scavenging ³⁾
- 在 register allocation 之後的 pass，如果需要新的 register，會做 register scavenging 看附近有沒有暫時沒被用到的 register。如果都沒有，就只能做 spill。
Stack slot
- how does an optimizing c++ compiler reuse stack slots of a function?
Rematerialization
- 將所需的值重新計算，避免對內存的存取。可以看做是 spill 的反向動作。

Sibling call
Partial evaluation
- Futamura projections⁴⁾

Fixed-point arithmetic
- Introduction: Real numbers in Real world
- A Fixed-Point Introduction by Example
- (二進制) 小數點前後的位數固定。一般用 Qm.f 表示，其中 m 表示整數位，f 表示小數位，另外隱藏一位有號位 (signed bit)。
- 固定點運算可以先將小數點右移至底，將其當作一般整數運算，再將小數點左移。例如: 兩個 Q7.24 相乘 (Q7.24 加上有號位，總共 32 位)。
```
- widen the i32s to i56s
- multiply
- shift right 24
- truncate to i32
```
- Saturation arithmetic
  - 飽和運算。運算結果有最大最小值，高於最大值以最大值計，低於最小值以最小值計，前述行為被形容為飽和。
- 有理數/分數 (rational number/fractional number)
- 實數 (real number)

How to understand volatile and non-volatile registers?
- Caller/Callee Saved Registers
- caller-saved register 是調用方在調用函式之前需要保存的暫存器，因為這些暫存器在調用函式之後，其內容會被揮發 (破壞) 掉，故又稱 volatile register。
- callee-saved register 是被調用方在使用之前和之後需要保存和回復的暫存器。因為這些暫存器在調用函式之後，其內容會不變，故又稱 non-volatile register。

硬件平台相關

VLIW
- 非常依賴編譯器的軟流水優化。
Hardware Loop
- Architecture-oriented C optimization, part 1: DSP features
  - 一般 loop 在尾部透過 compare and branch 決定是否要繼續執行 loop。branch 指令會打斷 pipeline。
  - 增加硬件紀錄 loop counter 和 loop start/end。
- Porting LLVM to a Next Generation DSP
  - 一般 DSP 非常依賴 ILP，透過 SMT，VLIW 和 SIMD 達成。
Delay Slot
- Branch delay slot
  - Having Fun with Branch Delay Slots
  - 在 branch 指令之後的 pipeline stage 填入不論 branch taken 或是 not taken 都會執行的指令。屬於編譯器軟流水優化。
Literal pool⁵⁾
- ARM 相關。
- Literal pools
- The trouble with literal pools

代碼分析相關

Control flow graph
- 以下的程序分析都發生在 control flow graph。
Dominator
- 分析在 control flow graph 中，節點之間的關係。如果所有從 entry node 出發到 node n 的路徑，都需要經過 node d，則稱 d dominate n。每個 node 都 dominate 自己。
  - 1 dominate 3。所有從 1 出發到 3 的路徑，都需要經過 1。
  - 1/2 strictly dominate 3。1/2 dominate 3 且 1/2 不等於 3。
  - immediate dominator of 3 is 2。2 strictly dominate 3，且 2 沒有 strictly dominate 其它 strictly dominate 3 的 node; 相反的，1 strictly dominate 3，但是 1 strictly dominate 2，而 2 又 strictly dominate 3。
    - 單純從字面上理解 immediate dominator 應該也可以。
  - dominance frontier of n
    - frontier 是前沿。dominance frontier of n，是一個集合，集合中的節點，其 immediate predecessor 被 n 所 dominate，但節點本身不被 n 所 dominate。
    - 用來決定在 CFG 哪些地方設置 SSA 所需的 Phi node。
      - 一個節點 n，假設引用到變量 v。如果 n 屬於兩個 dominance frontier 的交集，且這兩個 dominance frontier 各自都定義變量 v，則在節點 n 之前插入 Phi node。
  - dominator tree
    - tree 中每一個節點 n 的子節點 m，immediate dominator of m is n。
Data-flow analysis
- 分析程序在各個不同的點，屬性的變化。屬性的變化，可以從 CFG 由前往後，或是後往前推導。
  - dead code elimination 所需的 live variable analysis 屬於 backward analysis。
    - Live variable analysis

代碼優化相關

附上 LLVM 相關代碼。

Type-Based Alias Analysis (TBAA)
- Type-Based Alias Analysis
- 假設不同型別的指針所指向的內存空間不重疊，並據此做優化。
Tail call
- 可以用迭代實現的遞迴。請見 Procedures and the Processes They Generate。
Dead code elimination (DCE)
- 消除未被使用的表達式。
- DCE.cpp
Dead store elimination (DSE)
- 如果一個變量被賦值之後，其值未被使用，則為 dead store。
- DeadStoreElimination.cpp
Global value numbering (GVN)
- 為變量和表達式編號。兩個變量或表達式具有相同編號，代表其中之一可以被消除。和 CSE 是互補的優化。
- Introduction to load elimination in the GVN pass
- GVN.cpp
Common subexpression elimination (CSE)
- 將表達式中重複的部分提出，避免重複計算。
Loop-invariant code motion (LICM)
- 將迴圈運算中，不會改變的變數提升至迴圈外，避免重複計算。
- LICM.cpp

Scalar Replacement of Aggregates (SRoA)

How LLVM Optimizes a Function

如果 aggregate ⁶⁾ (如: struct) 中的 field 沒有 alias，則將該 field 當成 scalar 處理，以利後續優化。底下是經過 LLVM opt 優化後的 IR。

// sroa.c
typedef struct {
  int x;
  int y;
} A;
 
int func() {
  A a = { .x = 1, .y = 2 };
  // int a.x.sroa = a.x;
  // int a.y.sroa = a.y;
  return a.x + a.y;
}

; clang -O0 -emit-llvm -S sroa.c -o sroa_O0.ll
 
define i32 @func() #0 {
entry:
  %a = alloca %struct.A, align 4
  %0 = bitcast %struct.A* %a to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* bitcast (%struct.A* @func.a to i8*), i64 8, i32 4, i1 false)
  %x = getelementptr inbounds %struct.A, %struct.A* %a, i32 0, i32 0
  %1 = load i32, i32* %x, align 4
  %y = getelementptr inbounds %struct.A, %struct.A* %a, i32 0, i32 1 
  %2 = load i32, i32* %y, align 4
  %add = add nsw i32 %1, %2
  ret i32 %add
}

; opt -sroa sroa_O0.ll -S -o sroa.ll
 
define i32 @func() #0 {
entry:
  %a.sroa.0.0.copyload = load i32, i32* getelementptr inbounds (%struct.A, %struct.A* @func.a, i64 0, i32 0), align 4
  %a.sroa.2.0.copyload = load i32, i32* getelementptr inbounds (%struct.A, %struct.A* @func.a, i64 0, i32 1), align 4
  %add = add nsw i32 %a.sroa.0.0.copyload, %a.sroa.2.0.copyload
  ret i32 %add
}

The new intraprocedural Scalar Replacement of Aggregates
SROA.cpp

ScalarEvolution and Loop Optimization
- SCEV (Scalar Evolution)
  - 15.5 Scalar evolutions
  - 將隨著每個迭代有序變化的變量 (induction variable) 以 SCEV 表示。
- Loop Invariant Computation and Code Motion
  - 15.1 Loop representation
  - back edge (d → a)，a dominate d。又叫 latch。
  - natural loop of a back edge。包含 back edge 頭尾兩個節點，構成迴圈的最小集合。
  - LLVM 會先做 LoopSimplify.cpp 將 natural loop 正規化，方便後續處理。
    - 插入 pre-header。確保外部進入 loop 只有一個入口點，即 pre-header。
    - 插入 exit-block。確保離開 loop 只有一個出口點，即 exit-block。
    - 確保只有一條 back edge。
  - 以 LICM 為例，會把 loop invariant 從 loop body 提升至 pre-header。
Vectorization
- Auto-Vectorization in LLVM
restrict type qualifier
- Restrict-qualified pointers in LLVM