目录

ARM7 和 ARMv7 是不一樣的東西。前者是微架構 (micro architecture) 或稱 family,後者指的是指令集架構 (instruction set architecture) 或稱 architecture。Cortex 屬微架構,實作 ARMv7 指令集。Cortex A15 加入虛擬化支援。

ARM 中所稱的 byte,halfword 和 word 分別為 8,16 和 32 位。

針對自修改代碼,ARM 需要做額外的工作1)

處理器

暫存器

以 application level view 來看,ARM 的 register 有 16 個,分別是 R0 - R12,SP (Stack Pointer),LR (Link Register) 和 PC (Program Counter)。這 16 個暫存器,根據 Security Extensions 是否有被實作,從 31 或 33 個暫存器中選出來。選擇方式是根據當前處理器所處的模式來決定。對於某些暫存器,ARM 會提供額外的拷貝。這些擁有額外拷貝的暫存器被稱為 banked (shadow/private) register。Registers and Processor Modes 中的 "Register 8 to register 12 are general purpose registers, but they have shadow registers which come into use when you switch to FIQ mode." 代表當處理器為 FIQ (Fast interrupt request) 模式下,R8 - R12 會從 shadow register (它們的拷貝) 中選用,以此來避免破壞 user mode 中 R8 -R12 的值。

系統控制協處理器

CP15 又稱系統控制協處理器 (system control coprocessor),其中有 c0 到 c15 暫存器,暫存器中的位可用來做不同的設置。透過指令 mrc 或是 mcr 讀寫 CP15 裡面的暫存器,mrc 是將 CP15 (c) 的暫存器讀至通用暫存器 (r); mcr 則反之。CP15 上某些暫存器實際上有多個物理暫存器,必須透過 opcode_2 指定要存取哪一個。例如,CP15:c0 有 MIDR (Main ID Register)、CTR (Cache Type Register)、TCMTR (TCM Type Register) 等等。

The system control coprocessor appears as a set of 32-bit registers that you can write to and read from.

mcr 將 CP15 (c) 的暫存器讀至通用暫存器 (r),其指令格式為:

MCR<c> <coproc>,<opc1>,<Rt>,<CRn>,<CRm>{,<opc2>}

系統模式

浮點和向量指令

#include <stdio.h>
#include "arm_neon.h"
 
void print_uint8 (uint8x16_t data, char* name) {
  int i;
  static uint8_t p[16];
 
  vst1q_u8 (p, data);
  printf ("%s = ", name);
  for (i = 0; i < 16; i++) {
    printf ("%02d ", p[i]);
  }
  printf ("\n");
}
 
int main() {
  const uint8_t uint8_data1[] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}; // 8 x 16 = 128 bit
  const uint8_t uint8_data2[] = {16,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1};
 
  uint8x16_t data_a; // uint8x16_t 是 NEON 自訂的型別,代表 8 x 16 的 vector register (每個元素 8 bit,共 16 個元素)。
  data_a = vld1q_u8(uint8_data1); // vld1q_u8 將準備好的資料讀進 vector register。vld1q_u8 中的 q 代表 vector register 為 128 bit。
 
  uint8x16_t data_b;
  data_b = vld1q_u8(uint8_data2);
 
  uint8x16_t data_c;
  print_uint8(data_a, "data_a");
  print_uint8(data_b, "data_b");
 
  data_c = vaddq_u8(data_a, data_b); // 將兩個 8 x 16 的 vector register 相加。
 
  print_uint8(data_c, "data_c");
 
  return 0;
}
# http://www.mentor.com/embedded-software/sourcery-tools/sourcery-codebench/lite-edition
$ arm-none-linux-gnueabi-gcc \
  -mfloat-abi=softfp -mfpu=neon \
  -mcpu=cortex-a8 -ftree-vectorize \
  -ffast-math -static \
  neon.c -o neon 
  1. Programmer Model
    1. register file of 16 Q (for Quadword) vectors, sixteen 128-bit vector registers, named q0 to q1.
    2. register file of 16 Q (for Quadword) vectors can also be seen as thirty-two 64-bit D (Doubleword) vectors, named d0 to d31
  2. Syntax
    1. All NEON instructions, even loads and stores, begin by "V".
    2. Instructions can have one (or more) letter just after the V which acts as a modifier.
      1. Q means the instruction saturates
      2. R that it rounds
      3. H that it halves
    3. all instructions need to take a suffix telling the individual size and type of the elements being operated on
      1. from .u8 (unsigned byte) to .f32 (single-precision floating-point)

big.LITTLE

The other way to support big.LITTLE systems is to have all CPUs, both big and LITTLE, visible in a multiprocessor configuration. This approach offers greater flexibility, but also poses special challenges for the Linux kernel. For example, the scheduler assumes that CPUs are interchangeable, which is anything but the case on big.LITTLE systems.

原子指令

其它

軟體相關

範例

系統呼叫

ARM 將系統呼叫視為例外。swi 是舊式系統呼叫,系統呼叫號為 swi 的參數; svc 是新式系統呼叫,系統呼叫號存放在 r7 暫存器3)

IN:
0x40010820:  e1a0c007      mov  ip, r7
0x40010824:  e3a0707a      mov  r7, #122        ; 0x7a
0x40010828:  ef000000      svc  0x00000000      ; 系統號為 0x7a。

OP:
 ---- 0x40010820
 mov_i32 tmp5,r7
 mov_i32 r12,tmp5

 ---- 0x40010824
 movi_i32 tmp5,$0x7a
 mov_i32 r7,tmp5

 ---- 0x40010828
 movi_i32 pc,$0x4001082c
 movi_i32 tmp5,$0x2
 movi_i64 tmp6,$exception
 call tmp6,$0x0,$0,tmp5

----------------
IN:
0xffff0008:  e59ff410      ldr  pc, [pc, #1040] ; 0xffffffffffff0420。 swi/svc vecotr 位在 0x08,除非 high vector 被設置,否則 0x08 會被置換成 0xffff0008。查表跳轉至該系統號的處理函式。

----------------
IN:
0xc0022e60:  e24dd048      sub  sp, sp, #72     ; 0x48
0xc0022e64:  e88d1fff      stm  sp, {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, sl, fp, ip}
0xc0022e68:  e28d803c      add  r8, sp, #60     ; 0x3c
0xc0022e6c:  e9486000      stmdb        r8, {sp, lr}^    ; 將暫存器 sp 和 lr 其值寫至 r8 所指內存位址。
0xc0022e70:  e14f8000      mrs  r8, SPSR                 ; 將 spsr 賦值給暫存器。方向: r (register) <- s (spsr)。
0xc0022e74:  e58de03c      str  lr, [sp, #60]
0xc0022e78:  e58d8040      str  r8, [sp, #64]
0xc0022e7c:  e58d0044      str  r0, [sp, #68]
0xc0022e80:  e3a0b000      mov  fp, #0  ; 0x0
0xc0022e84:  e3180020      tst  r8, #32 ; 0x20
0xc0022e88:  13a0a000      movne        sl, #0  ; 0x0
0xc0022e8c:  051ea004      ldreq        sl, [lr, #-4]
0xc0022e90:  e59fc0a8      ldr  ip, [pc, #168]  ; 0xffffffffc0022f40
0xc0022e94:  e59cc000      ldr  ip, [ip]
0xc0022e98:  ee01cf10      mcr  15, 0, ip, cr1, cr0, {0} ; 將暫存器值寫入協處理器。方向: c (coprocessor) <- r (register)。

----------------
IN:
0xc0022e9c:  e321f013      msr  CPSR_c, #19     ; 0x13。設置 cpsr 的 control bit 為 19。方向: s (status register) <- r (register)。

內存

ARM 將內存分為底下幾種 (7.2. Memory types):

上述類似又可分為共享 (Shared) 和非共享 (Non-shared),依照該內存空間是否可被多個處理器存取。

位址轉換

ARM 的位址轉換涉及底下三種位址:

禁用 MMU 的情況下,VA 等同於 PA,不經過轉換直接以該位址存取內存。啟用 MMU 的情況下,視欲存取的 VA 是否超過 32M 而有所不同,公式如下:

    if (address < 0x02000000)
        address += env->cp15.c13_fcse;

位址轉換過程中的 MVA,是為了加快上下文切換的速度。請見 ARM 学习笔记(四) 快速上下文切换(FCSE)技术 一文。一般來說,上下文切換必須將頁表指針指向新進程的頁表,這代表虛擬位址到物理位址的映射有所改變,需要將 TLB 和快取的內容清空,之後再將新進程的映射和資料載入 TLB 和快取。快速上下文切换 (Fast Context Switch Extension,FCSE) 就是用來避免前述開銷。FCSE 的原理如下:

通常情况下,如果两个进程占用的虚拟地址空间有重叠,系统在这两个进程之间进行切换时,必须进行虚拟地址到物理地址的重映射。而虚拟地址到物理地址的重映射涉及到重建 MMU 中的页表,而且快取及 TLB 中的内容都必须使无效(通过设置协处理器寄存器的相关位)。这些操作将带类巨大的系统开销,一方面重建 MMU 和使无效快取及 TLB 的内容需要很大的开销,另一方面重建快取和 TLB 内容也需要很大的开销。

FCSE 的引入避免了这种系统开销。它位于 CPU 和 MMU 之间。如果两个进程使用了同样的虚拟地址空间,则对 CPU 而言,两个进程使用了同样的虚拟地址空间;快速上下文切换机构对各进程的虚拟地址进行变换,这样的系统中除了 CPU 之外的部分看到的是经过快速上下文切换机构变换的虚拟地址 (MVA)。快速上下文切换机构将个进程的虚拟地址空间变换成不同的虚拟地址空间。这样在进行进程间切换时就不需要进行虚拟地址到物理地址的重映射。因為快取和 TLB 看到的是不同的 MVA。

ARM 系统中,4GB 的虚拟空间被分成了 128 个进程空间块,每一个进程空间块大小为 32MB。每个进程空间块中可以包含一个进程,该进程可以使用虚拟地址空间0x0~0x01FFFFFF, 这个地址范围也就是 CPU 看到的进程的虚拟空间。系统 128 个进程空间块的编号 0~127, 标号为 X 的进程空间块中的进程实际使用的虚拟地址空间为(X * 0x02000000)到(X * 0x02000000 + 0x01FFFFFF),这个地址空间是系统中除了 CPU 之外的其他部分看到的该进程所占用的虚拟地址空间,亦即 MVA。

系统中,每个进程都使用虚拟地址空间 0x0~0x01FFFFFF, 当进程访问本进程的指令和数据时,它产生虚拟地址 (VA) 的高7位为 0;快速上下文切换机构用该进程的进程标示符 (CP15:c13) 代替 VA 的高 7 位,从而得到变换后的虚拟地址 MVA,这个 MVA 在该进程对应的进程空间块内。

当 VA 的高 7 位不全是 0 时,MVA = VA。这种 VA 是本进程用于访问别的进程中的数据和指令的虚拟地址,注意这时被访问的进程标识符不能为 0。

權限控制

針對具備有 MMU 的 ARM 平台來說,可啟用虛擬內存。ARM 有兩層頁表。內存存取權限控制主要由域 (domain) 決定,再由 access permission (AP) 決定。透過 CP15:c1 啟用/禁用 MMU,CP15:c2 存放 TTB (Translation Table Base) 也就是頁表的位址,CP15:c2 有 TTBR0 和 TTBR1,分別存放用戶態和內核態的頁表基址 (內存物理位址)。ARM 可將內存劃分至多 16 個域,每個域可以設置權限,透過設置 CP15:c3,每個域各佔 2 位。CP15:c5 存放頁缺失原因。CP15:c6 存放造成頁缺失的虛擬位址。ARMARM 中的 B3.12.1 Organization of the CP15 registers in a VMSA implementation 會以圖示列出 CP15 中與虛擬內存相關的暫存器及其意義。在 ARM 中,除了以頁作為內存分配單位外,還可以較大的段 (section) 做為分配單位。可以針對頁中的子頁 (subpage) 設置存取權限。

The translation properties associated with each translation table entry include:

Memory access permission control

This controls whether a program has access to a memory area. The possible settings are no access, read-only access, or read/write access. In addition, there is control of whether code can be executed from the memory area. If a processor attempts an access that is not permitted, a memory abort is signaled to the processor. The permitted level of access can be affected by:

  • whether the program is running in User mode or a privileged mode
  • the use of domains.

Memory region attributes

These describe the properties of a memory region. The top-level attribute, the Memory type, is one of Strongly-ordered, Device, or Normal. Device and Normal memory regions have additional attributes, see Summary of ARMv7 memory attributes on page A3-25.

Virtual-to-physical address mapping

The VMSA regards the address of an explicit data access or an instruction fetch as a Virtual Address (VA). The MMU maps this address onto the required Physical Address (PA). VA to PA address mapping can be used to manage the allocation of physical memory in many ways. For example:

  • to allocate memory to different processes with potentially conflicting address maps
  • to enable an application with a sparse address map to use a contiguous region of physical memory.

ARM MMU工作原理剖析一文搭配範例,對 MMU 有相當清楚的描述。參與權限檢查的有底下幾個元素:

域 (domain) 主要用來加速上下文切換,另一種方式是 FCSE。The ARM Architecture 一文中提及:

One major concern associated with memory protection is the cost of address space switching. On ARM a context switch requires switching page tables. The complete cost of page table switch includes the cost of flushing page tables, purging TLBs and caches and then refilling them. Two mechanisms were introduced to enable operating system designers eliminate this cost in some cases. The first mechanism is protection domains. Every virtual memory page or section belongs to one of sixteen protection domains. At any point in time, the running process can be either a manager of a domain, which means that it can access all pages belonging to this domain bypassing access permissions, a client of the domain, which means that is can access pages belonging to the domain according to their page table access permission bits, or can have no access to the domain at all. In some situations, it is possible to do context switch by simply changing domain access permissions, which means simply writing a new value to the domain access register of coprocessor 15.

頁表項可以有底下控制位 (6.5.2. Access permissions6.6.1. C and B bit, and type extension field encodings):

注意! QEMU 是一種無快取的 ARM 實現,不需考慮 C 和 B 位 4)

  1. disas_cp15_insn (target-arm/translate.c)。
    /* Disassemble system coprocessor (cp15) instruction.  Return nonzero if
       instruction is not defined.  */
    static int disas_cp15_insn(CPUARMState *env, DisasContext *s, uint32_t insn)
    {
        ... 略 ...
     
        tmp2 = tcg_const_i32(insn);
        if (insn & ARM_CP_RW_BIT) {
            tmp = tcg_temp_new_i32();
            gen_helper_get_cp15(tmp, cpu_env, tmp2);
            /* If the destination register is r15 then sets condition codes.  */
            if (rd != 15)
                store_reg(s, rd, tmp);
            else
                tcg_temp_free_i32(tmp);
        } else {
            tmp = load_reg(s, rd);
            gen_helper_set_cp15(cpu_env, tmp2, tmp);
            tcg_temp_free_i32(tmp);
            /* Normally we would always end the TB here, but Linux
             * arch/arm/mach-pxa/sleep.S expects two instructions following
             * an MMU enable to execute from cache.  Imitate this behaviour.  */
            if (!arm_feature(env, ARM_FEATURE_XSCALE) ||
                    (insn & 0x0fff0fff) != 0x0e010f10)
                gen_lookup_tb(s);
        }
        tcg_temp_free_i32(tmp2);
        return 0;
    }
  2. helper_set_cp15 (target-arm/helper.c) 設置 CP15:c2 指向頁表。
    void HELPER(set_cp15)(CPUARMState *env, uint32_t insn, uint32_t val)
    {
        int op1;
        int op2;
        int crm;
     
        op1 = (insn >> 21) & 7;
        op2 = (insn >> 5) & 7;
        crm = insn & 0xf;
        switch ((insn >> 16) & 0xf) {
     
        ... 略 ...
     
        case 2: /* MMU Page table control / MPU cache control.  */
            if (arm_feature(env, ARM_FEATURE_MPU)) {
                switch (op2) {
                case 0:
                    env->cp15.c2_data = val;
                    break;
                case 1:
                    env->cp15.c2_insn = val;
                    break;
                default:
                    goto bad_reg;
                }
            } else {
                switch (op2) {
                case 0:
                    env->cp15.c2_base0 = val;
                    break;
                case 1:
                    env->cp15.c2_base1 = val;
                    break;
                case 2:
                    val &= 7;
                    env->cp15.c2_control = val;
                    env->cp15.c2_mask = ~(((uint32_t)0xffffffffu) >> val);
                    env->cp15.c2_base_mask = ~((uint32_t)0x3fffu >> val);
                    break;
                default:
                    goto bad_reg;
                }
            }
            break;
     
            ... 略 ...
     
        }
        return;
    bad_reg:
        /* ??? For debugging only.  Should raise illegal instruction exception.  */
        cpu_abort(env, "Unimplemented cp15 register write (c%d, c%d, {%d, %d})\n",
                  (insn >> 16) & 0xf, crm, op1, op2);
    }        
  1. cpu_arm_handle_mmu_fault (target-arm/helper.c)。
    int cpu_arm_handle_mmu_fault (CPUARMState *env, target_ulong address,
                                  int access_type, int mmu_idx)
    {
        uint32_t phys_addr;
        target_ulong page_size;
        int prot;
        int ret, is_user;
     
        is_user = mmu_idx == MMU_USER_IDX;
        ret = get_phys_addr(env, address, access_type, is_user, &phys_addr, &prot,
                            &page_size);
        if (ret == 0) {
            /* Map a single [sub]page.  */
            phys_addr &= ~(uint32_t)0x3ff;
            address &= ~(uint32_t)0x3ff;
            tlb_set_page (env, address, phys_addr, prot, mmu_idx, page_size);
            return 0;
        }
     
        if (access_type == 2) {
            env->cp15.c5_insn = ret;
            env->cp15.c6_insn = address;
            env->exception_index = EXCP_PREFETCH_ABORT;
        } else {
            env->cp15.c5_data = ret;
            if (access_type == 1 && arm_feature(env, ARM_FEATURE_V6))
                env->cp15.c5_data |= (1 << 11);
            env->cp15.c6_data = address;
            env->exception_index = EXCP_DATA_ABORT;
        }
        return 1;
    }
  2. get_phys_addr (target-arm/helper.c)。
    static inline int get_phys_addr(CPUARMState *env, uint32_t address,
                                    int access_type, int is_user,
                                    uint32_t *phys_ptr, int *prot,
                                    target_ulong *page_size)
    {
        /* Fast Context Switch Extension.  */
        if (address < 0x02000000)
            address += env->cp15.c13_fcse;
     
        if ((env->cp15.c1_sys & 1) == 0) {
            /* MMU/MPU disabled.  */
            *phys_ptr = address;
            *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
            *page_size = TARGET_PAGE_SIZE;
            return 0;
        } else if (arm_feature(env, ARM_FEATURE_MPU)) {
            *page_size = TARGET_PAGE_SIZE;
            return get_phys_addr_mpu(env, address, access_type, is_user, phys_ptr,
                                     prot);
        } else if (env->cp15.c1_sys & (1 << 23)) {
            return get_phys_addr_v6(env, address, access_type, is_user, phys_ptr,
                                    prot, page_size);
        } else {
            return get_phys_addr_v5(env, address, access_type, is_user, phys_ptr,
                                    prot, page_size);
        }
    }
  3. get_phys_addr_v6 (target-arm/helper.c)。ARM1176JZ-S 6.11.2 ARMv6 page table translation subpage AP bits disabled 分別描述第一和第二層頁表項的格式。
    static int get_phys_addr_v6(CPUARMState *env, uint32_t address, int access_type,
                                int is_user, uint32_t *phys_ptr, int *prot,
                                target_ulong *page_size)
    {
        int code;
        uint32_t table;
        uint32_t desc;
        uint32_t xn;
        int type;
        int ap;
        int domain;
        int domain_prot;
        uint32_t phys_addr;
     
        /* Pagetable walk.  */
        /* Lookup l1 descriptor.  */
        table = get_level1_table_address(env, address);
        desc = ldl_phys(table);
        type = (desc & 3);
        if (type == 0) {
            /* Section translation fault.  */
            code = 5;
            domain = 0;
            goto do_fault;
        } else if (type == 2 && (desc & (1 << 18))) {
            /* Supersection.  */
            domain = 0;
        } else {
            /* Section or page.  */
            domain = (desc >> 5) & 0x0f; // 取 [8:5] 位。
        }
        domain_prot = (env->cp15.c3 >> (domain * 2)) & 3;
        if (domain_prot == 0 || domain_prot == 2) {
            if (type == 2)
                code = 9; /* Section domain fault.  */
            else
                code = 11; /* Page domain fault.  */
            goto do_fault;
        }
        if (type == 2) {
            if (desc & (1 << 18)) {
                /* Supersection.  */
                phys_addr = (desc & 0xff000000) | (address & 0x00ffffff);
                *page_size = 0x1000000;
            } else {
                /* Section.  */
                phys_addr = (desc & 0xfff00000) | (address & 0x000fffff);
                *page_size = 0x100000;
            }
            ap = ((desc >> 10) & 3) | ((desc >> 13) & 4);
            xn = desc & (1 << 4);
            code = 13;
        } else {
            /* Lookup l2 entry.  */
            table = (desc & 0xfffffc00) | ((address >> 10) & 0x3fc);
            desc = ldl_phys(table);
            ap = ((desc >> 4) & 3) | ((desc >> 7) & 4);
            switch (desc & 3) {
            case 0: /* Page translation fault.  */
                code = 7;
                goto do_fault;
            case 1: /* 64k page.  */
                phys_addr = (desc & 0xffff0000) | (address & 0xffff);
                xn = desc & (1 << 15);
                *page_size = 0x10000;
                break;
            case 2: case 3: /* 4k page.  */
                phys_addr = (desc & 0xfffff000) | (address & 0xfff);
                xn = desc & 1;
                *page_size = 0x1000;
                break;
            default:
                /* Never happens, but compiler isn't smart enough to tell.  */
                abort();
            }
            code = 15;
        }
        if (domain_prot == 3) {
            *prot = PAGE_READ | PAGE_WRITE | PAGE_EXEC;
        } else {
            if (xn && access_type == 2)
                goto do_fault;
     
            /* The simplified model uses AP[0] as an access control bit.  */
            if ((env->cp15.c1_sys & (1 << 29)) && (ap & 1) == 0) {
                /* Access flag fault.  */
                code = (code == 15) ? 6 : 3;
                goto do_fault;
            }
            *prot = check_ap(env, ap, domain_prot, access_type, is_user);
            if (!*prot) {
                /* Access permission fault.  */
                goto do_fault;
            }
            if (!xn) {
                *prot |= PAGE_EXEC;
            }
        }
        *phys_ptr = phys_addr;
        return 0;
    do_fault:
        return code | (domain << 4);
    }
    • Translation fault 代表該虛擬位址並無映射,等同 x86 的 P 位清為零。

外設

Semihosting

CMSIS

其它

以 softfp 編譯的執行檔無法在 hardfp 的平台上運行 5)

-mfloat-abi=name

Specifies which floating-point ABI to use. Permissible values are: `soft', `softfp' and `hard'.

Specifying `soft' causes GCC to generate output containing library calls for floating-point operations. `softfp' allows the generation of code using hardware floating-point instructions, but still uses the soft-float calling conventions. `hard' allows generation of floating-point instructions and uses FPU-specific calling conventions.

The default depends on the specific target configuration. Note that the hard-float and soft-float ABIs are not link-compatible; you must compile your entire program with the same ABI, and link with a compatible set of libraries. http://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html

術語

文章

系統軟體

GCC

其它

虛擬化

支援 Coretx-A15 的板子皆有支援虛擬化7)

匯編語言

編程優化

要寫出有效率的 C 代碼,必須要知道編譯器於優化上有何侷限,目標處理器架構上的限制,以及特定編譯器本身的限制。

參考書籍

外部連結