<note>
  * [[http://www.oldlinux.org/download/clk011c-3.0.pdf|Linux内核完全注释]]
    * https://github.com/yuanxinyu/Linux-0.11
  * [[http://www.tldp.org/HOWTO/Unix-and-Internet-Fundamentals-HOWTO/|The Unix and Internet Fundamentals HOWTO]]
</note>

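====== Loading ELF Images ======
GNU splits the work of supporting dynamically linked ELF images. Loading and starting the ELF image is done in the Linux kernel, while dynamic linking itself is implemented in user space by a helper program called the "interpreter". The kernel is also responsible for loading and starting the interpreter.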

  * [[http://www.longene.org/techdoc/0328130001224576708.html|漫谈兼容内核之八: ELF映像的装入(一)]]
  * [[http://www.longene.org/techdoc/0750005001224576724.html|漫谈兼容内核之九: ELF映像的装入(二)]]
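
As a rough illustration of the split described above: the kernel finds out which interpreter to load from the ''PT_INTERP'' program header of the ELF image. The sketch below is a user-space approximation, not the kernel's loader. ''is_elf_image'' and ''elf64_interp'' are hypothetical helper names, and only 64-bit images in the host's byte order are handled.

<code c>
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if buf starts with the ELF magic "\177ELF", otherwise 0. */
int is_elf_image(const unsigned char *buf, size_t len)
{
    return len >= SELFMAG && memcmp(buf, ELFMAG, SELFMAG) == 0;
}

/*
 * Returns a pointer to the interpreter path stored in a 64-bit ELF image
 * held in buf, or NULL if the image is malformed or has no PT_INTERP
 * program header (for example, a statically linked program).
 */
const char *elf64_interp(const unsigned char *buf, size_t len)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    uint64_t off;
    unsigned i;

    if (len < sizeof eh || !is_elf_image(buf, len) ||
        buf[EI_CLASS] != ELFCLASS64)
        return NULL;
    memcpy(&eh, buf, sizeof eh);
    if (eh.e_phentsize != sizeof ph || eh.e_phoff > len)
        return NULL;

    for (i = 0; i < eh.e_phnum; i++) {
        off = eh.e_phoff + (uint64_t)i * sizeof ph;
        if (off > len || len - off < sizeof ph)
            return NULL;
        memcpy(&ph, buf + off, sizeof ph);
        if (ph.p_type != PT_INTERP)
            continue;
        /* The path must lie inside the image and end with a NUL byte. */
        if (ph.p_filesz == 0 || ph.p_offset >= len ||
            ph.p_filesz > len - ph.p_offset)
            return NULL;
        if (buf[ph.p_offset + ph.p_filesz - 1] != '\0')
            return NULL;
        return (const char *)(buf + ph.p_offset);
    }
    return NULL;
}
</code>

On a typical dynamically linked x86-64 binary the returned string is the path of the dynamic linker; a statically linked binary yields ''NULL''.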
====== System Calls ======
  * [[http://seclab.cs.sunysb.edu/sekar/papers/syscallclassif.htm|Classification and Grouping of Linux System Calls]]
  * Anomaly Detection Based on System Call Classification
<blockquote>
Generally, systems provide a library or API that sits between normal programs and the operating system. On Unix-like systems, that API is usually part of an implementation of the C library (libc), such as glibc, that provides wrapper functions for the system calls, often named the same as the system calls that they call. On Windows NT, that API is part of the Native API, in the ntdll.dll library; this is an undocumented API used by implementations of the regular Windows API and directly used by some system programs on Windows.
</blockquote>
  * [[wp>System call]]

  - ''syscall_table_32.S'' defines the table of system call function pointers.<code asm>
ENTRY(sys_call_table)
  .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */
  .long sys_exit
  .long ptregs_fork
  .long sys_read
  .long sys_write
  /* ... omitted ... */
</code>
  - ''entry_32.S'' defines the assembly entry point (handler) for system calls. Unlike FreeBSD, which passes system call arguments on the stack, Linux passes them in registers.<code asm>
  # system call handler stub
ENTRY(system_call)
  RING0_INT_FRAME     # can't unwind into user space anyway
  pushl %eax      # save orig_eax
  CFI_ADJUST_CFA_OFFSET 4
  SAVE_ALL            # save the registers; the system call arguments arrive in them
  GET_THREAD_INFO(%ebp)
          # system call tracing in operation / emulation
  testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
  jnz syscall_trace_entry
  cmpl $(nr_syscalls), %eax
  jae syscall_badsys
syscall_call:
  call *sys_call_table(,%eax,4)
  movl %eax,PT_EAX(%esp)    # store the return value
syscall_exit:
  LOCKDEP_SYS_EXIT
  DISABLE_INTERRUPTS(CLBR_ANY)  # make sure we don't miss an interrupt
          # setting need_resched or sigpending
          # between sampling and the iret
  TRACE_IRQS_OFF
  movl TI_flags(%ebp), %ecx
  testl $_TIF_ALLWORK_MASK, %ecx  # current->work
  jne syscall_exit_work
</code>
  - Control then jumps through ''sys_call_table'' to the corresponding function.<code c>
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
  struct file *file;
  ssize_t ret = -EBADF;
  int fput_needed;

  file = fget_light(fd, &fput_needed);
  if (file) {
    loff_t pos = file_pos_read(file);
    ret = vfs_read(file, buf, count, &pos);
    file_pos_write(file, pos);
    fput_light(file, fput_needed);
  }

  return ret;
}
</code>
  * [[http://osinside.net/syscall/system_call_table.htm|System call table]]
  * [[http://swaywang.blogspot.com/2011/10/system-call-system-call-process-oslinux.html|System Call]]
  * [[http://rtlab.cs.nthu.edu.tw/course/Trace_OS_Slides/Linux/System%20Call/2009/2009_04_24_System%20Call.ppt|Linux System Call]]
  * [[http://stackoverflow.com/questions/9340876/where-the-system-call-function-sys-getpid-is-located-in-the-linux-kernel|Where the system call function “sys_getpid” is located in the linux kernel?]]
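
The dispatch step in ''entry_32.S'' (range-check the number in ''%eax'', then make an indirect call through ''sys_call_table'') can be sketched in plain C. This is a user-space model under stated assumptions: the ''toy_'' names and the two entries in the table are invented for illustration and do not exist in the kernel.

<code c>
#include <errno.h>
#include <stddef.h>

/* Every toy handler takes three register-sized arguments. */
typedef long (*syscall_fn)(long, long, long);

static long toy_exit(long code, long b, long c) { (void)b; (void)c; return code; }
static long toy_add(long a, long b, long c)     { (void)c; return a + b; }

/* Plays the role of sys_call_table: the index is the system call number. */
static const syscall_fn toy_call_table[] = { toy_exit, toy_add };
#define TOY_NR_SYSCALLS (sizeof toy_call_table / sizeof toy_call_table[0])

long toy_dispatch(unsigned long nr, long a, long b, long c)
{
    /* cmpl $(nr_syscalls), %eax ; jae syscall_badsys (unsigned compare) */
    if (nr >= TOY_NR_SYSCALLS)
        return -ENOSYS;
    /* call *sys_call_table(,%eax,4) */
    return toy_call_table[nr](a, b, c);
}
</code>

Because the comparison is unsigned, a negative number passed by user space is rejected by the same check as a number that is too large.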

  * [[http://www.ibm.com/developerworks/linux/library/l-system-calls/|Kernel command using Linux system calls]]

  * [[http://www.bianceng.cn/OS/Linux/201111/31272.htm|Linux中断处理之时钟中断（一）]]
  * [[http://www.bianceng.cn/OS/Linux/201111/31273.htm|Linux中断处理之时钟中断（二）]]
  * [[http://www.linuxforum.net/forum/showthreaded.php?Cat=&Board=linuxK&Number=666125&page=&view=&sb=&o=|setup_IO_APIC末尾处，check_timer是啥意思啊？]]
  * [[http://www.bianceng.cn/OS/Linux/201109/29130.htm|Linux内核源代码的目录结构]]
  * [[http://www.bianceng.cn/OS/Linux/201109/29133.htm|Linux操作系统的内核初始化过程详解]]
  * [[http://kerneltrap.org/node/2450|Feature: High Memory In The Linux Kernel]]
  * [[http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory|Anatomy of a Program in Memory]]
  * [[http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory|How The Kernel Manages Your Memory]]
  * [[http://tldp.org/LDP/khg/HyperNews/get/memory/linuxmm.html|Linux Memory Management Overview]]
  * [[http://www.formosaos.url.tw/linux/kinit0.html|I386 KERNEL HEAD]]
    * 3G (c0000000) 

  * [[http://loda.hala01.com/2011/12/linux-kernel-排程機制介紹/|Linux Kernel 排程機制介紹]] 

When the timer raises an interrupt, timer_interrupt is called to handle it.
  * [[wp>Revolution OS]]
    * [[http://linux99sun.pixnet.net/blog/post/23501300|Revolution OS （作業系統革命）]]
  * [[wp>Kernel panic]]
  * [[wp>Linux kernel oops]]
  * [[http://www.av8n.com/computer/htm/kernel-lockup.htm|Debugging Linux Kernel Lockup / Panic / Oops]]
  * [[http://penberg.blogspot.com/2010/07/linux-kernel-oops-debugging.html|Linux kernel OOPS debugging]]

  * [[wp>Linux startup process]]
    * [[https://lkml.org/lkml/2006/4/20/304|Re: Which process is associated with process ID 0 (swapper)]]
  * [[http://tldp.org/LDP/khg/HyperNews/get/devices/addrxlate.html|Translating Addresses in Kernel Space]] 

  * [[http://www.yolinux.com/TUTORIALS/LinuxTutorialInitProcess.html|Linux Init Process / PC Boot Procedure]]
  * [[http://www.cp.su.ac.th/~sirak/517325/Linux%20Booting%20Procedure.ppt|Linux Booting Procedure]]
  * [[wp>Reset vector]]
  * [[http://stackoverflow.com/questions/5300527/do-normal-x86-or-amd-pcs-run-startup-bios-code-directly-from-rom-or-do-they-cop|Do normal x86 or AMD PCs run startup/BIOS code directly from ROM, or do they copy it first to RAM?]]
  * [[http://superuser.com/questions/200556/difference-between-shutdown-power-off-and-restart-reboot|Difference between shutdown ( power off ) and restart ( reboot )]]
  * [[wp>Reboot (computing)]]

  * [[http://lwn.net/Articles/10465/|Native POSIX Thread Library 0.1 released]]
    * [[wp>Futex]]
    * [[http://blog.csdn.net/Javadino/article/details/2891385|[Pthread] Linux中的线程同步机制(一) -- Futex]]
    * [[http://blog.csdn.net/Javadino/article/details/2891388|[Pthread] Linux中的线程同步机制(二) -- In Glibc]]
    * [[http://blog.csdn.net/Javadino/article/details/2891399|[Pthread] Linux中的线程同步机制(三) -- Practice]]

  * [[http://www.jollen.org/EmbeddedLinux/|Embedded Linux 專欄]]
===== Adding a System Call =====
<note>The steps below use kernel 3.5 as the example.</note>
  - Build the kernel.<code bash>
$ wget http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.5.3.tar.bz2
$ tar xvf linux-3.5.3.tar.bz2; cd linux-3.5.3
$ wget http://people.cs.nctu.edu.tw/~chenwj/source/config-qemu-x86
$ mv config-qemu-x86 .config
$ make ARCH=i386
$ qemu-system-i386 -kernel arch/x86/boot/bzImage -hda disk-x86.raw -append "root=/dev/sda" -vnc :3
</code>
  - Edit ''arch/x86/syscalls/syscall_32.tbl''.<code diff>
diff -ruN linux-3.5.3/arch/x86/syscalls/syscall_32.tbl linux-3.5.3.new/arch/x86/syscalls/syscall_32.tbl
--- linux-3.5.3/arch/x86/syscalls/syscall_32.tbl        2012-08-26 10:32:13.000000000 +0800
+++ linux-3.5.3.new/arch/x86/syscalls/syscall_32.tbl    2012-08-28 14:17:16.098536453 +0800
@@ -356,3 +356,4 @@
 347    i386    process_vm_readv        sys_process_vm_readv            compat_sys_process_vm_readv
 348    i386    process_vm_writev       sys_process_vm_writev           compat_sys_process_vm_writev
 349    i386    kcmp                    sys_kcmp
+350    i386    helloworld              sys_helloworld
</code>
  - Edit ''include/linux/syscalls.h''.<code diff>
diff -ruN linux-3.5.3/include/linux/syscalls.h linux-3.5.3.new/include/linux/syscalls.h
--- linux-3.5.3/include/linux/syscalls.h        2012-08-26 10:32:13.000000000 +0800
+++ linux-3.5.3.new/include/linux/syscalls.h    2012-08-28 14:19:47.171854560 +0800
@@ -860,4 +860,6 @@

 asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
                         unsigned long idx1, unsigned long idx2);
+
+asmlinkage long sys_helloworld(void);
 #endif
</code>
  - Add a new file that implements the new system call.<code diff>
diff -ruN linux-3.5.3/arch/x86/kernel/helloworld.c linux-3.5.3.new/arch/x86/kernel/helloworld.c
--- linux-3.5.3/arch/x86/kernel/helloworld.c    1970-01-01 08:00:00.000000000 +0800
+++ linux-3.5.3.new/arch/x86/kernel/helloworld.c        2012-08-28 14:30:34.139545617 +0800
@@ -0,0 +1,8 @@
+#include <linux/linkage.h>
+#include <linux/kernel.h>
+
+asmlinkage long sys_helloworld(void)
+{
+    printk("hello world from linux kernel!\n");
+    return 0;
+}
</code>
  - Modify ''arch/x86/kernel/Makefile''.<code diff>
diff -ruN linux-3.5.3/arch/x86/kernel/Makefile linux-3.5.3.new/arch/x86/kernel/Makefile
--- linux-3.5.3/arch/x86/kernel/Makefile        2012-08-26 10:32:13.000000000 +0800
+++ linux-3.5.3.new/arch/x86/kernel/Makefile    2012-08-28 14:34:55.763928000 +0800
@@ -34,6 +34,7 @@
 obj-y                  += tsc.o io_delay.o rtc.o
 obj-y                  += pci-iommu_table.o
 obj-y                  += resource.o
+obj-y                  += helloworld.o

 obj-y                          += process.o
 obj-y                          += i387.o xsave.o
</code>
  - Run a test program from user space.<code c>
#include <unistd.h>
#include <sys/syscall.h>

#define NR_SYSCALL 350

int main()
{
    return syscall(NR_SYSCALL);
}
</code>
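
On a kernel that was not rebuilt with the patch, the call fails with ''ENOSYS'', and the test program above only reports this through its exit status. A small wrapper makes the failure explicit. This is a minimal sketch; ''try_syscall'' is a hypothetical helper name, not part of any library.

<code c>
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Invokes system call number nr without arguments.
 * Returns the call's result on success, or -errno on failure
 * (for example -ENOSYS when the running kernel lacks the call).
 */
long try_syscall(long nr)
{
    long ret;

    errno = 0;
    ret = syscall(nr);
    if (ret == -1 && errno != 0)
        return -errno;
    return ret;
}
</code>

With the patched kernel from the steps above, ''try_syscall(350)'' should return 0 and the message should show up in ''dmesg''.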

  * [[http://stackoverflow.com/questions/9977968/adding-a-new-system-call-in-linux-kernel-3-3|Adding a new system call in Linux kernel 3.3]][(http://ppc52776.blogspot.tw/2012/08/adding-new-system-call-in-linux-kernel.html)]
  * [[http://nycrenee.wordpress.com/2007/04/23/add-a-system-call-on-linux-kernel-2611/|Add A System Call On Linux Kernel 2.6.11]]
  * [[http://www.csee.umbc.edu/courses/undergraduate/CMSC421/fall02/burt/projects/howto_add_systemcall.html|Adding A System Call]]
  * [[http://tldp.org/HOWTO/html_single/Implement-Sys-Call-Linux-2.6-i386/|Implementing a System Call on Linux 2.6 for i386]]
====== Exceptions ======
  - ''trap_init'' in ''traps.c'' sets up the exception entry points.<code c>
void __init trap_init(void)
{
  set_intr_gate(0, &divide_error);
  set_intr_gate_ist(1, &debug, DEBUG_STACK);
  set_intr_gate_ist(2, &nmi, NMI_STACK);

  /* ... omitted ... */
}
</code>
  - ''entry_32.S'' contains the entry points of the exception handlers.<code asm>
ENTRY(divide_error)
  RING0_INT_FRAME
  pushl $0      # no error code
  CFI_ADJUST_CFA_OFFSET 4
  pushl $do_divide_error
  CFI_ADJUST_CFA_OFFSET 4
  jmp error_code
  CFI_ENDPROC
END(divide_error)
</code>
  - Execution finally reaches the corresponding function in ''traps.c''. This differs slightly from FreeBSD, which handles all traps in the single trap function in ''/usr/src/sys/i386/i386/trap.c''.<code c>
#define DO_ERROR_INFO(trapnr, signr, str, name, sicode, siaddr)   \
dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
{                 \
  siginfo_t info;             \
  info.si_signo = signr;            \
  info.si_errno = 0;            \
  info.si_code = sicode;            \
  info.si_addr = (void __user *)siaddr;       \
  if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, signr)  \
              == NOTIFY_STOP) \
    return;             \
  conditional_sti(regs);            \
  do_trap(trapnr, signr, str, regs, error_code, &info);   \
}

DO_ERROR_INFO(0, SIGFPE, "divide error", divide_error, FPE_INTDIV, regs->ip)
</code>

  * [[http://timetobleed.com/a-few-things-you-didnt-know-about-signals-in-linux-part-1/|A Few Things You Didn’t Know about Signals in Linux Part 1]]
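
From user space, the effect of ''DO_ERROR_INFO(0, SIGFPE, ..., FPE_INTDIV, regs->ip)'' is that an integer division by zero arrives as ''SIGFPE'' with ''si_code'' set to ''FPE_INTDIV''. The sketch below observes this. It assumes an x86 machine: integer division by zero is undefined behaviour in C and is used here only because x86 raises the divide-error exception for it; other architectures may not trap at all. ''divide_fault_code'' and ''on_sigfpe'' are hypothetical names.

<code c>
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf fpe_resume;
static volatile sig_atomic_t fpe_code;

/* Records the si_code filled in by the kernel and skips the faulting division. */
static void on_sigfpe(int signo, siginfo_t *info, void *ctx)
{
    (void)signo;
    (void)ctx;
    fpe_code = info->si_code;
    siglongjmp(fpe_resume, 1);
}

/*
 * Returns the si_code delivered with SIGFPE (FPE_INTDIV on x86),
 * 0 if the division did not trap, or -1 if the handler could not be set.
 */
int divide_fault_code(void)
{
    struct sigaction sa, old;
    volatile int zero = 0;   /* volatile: keep the division at run time */
    volatile int sink = 0;
    int code;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigfpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, &old) != 0)
        return -1;

    fpe_code = 0;
    if (sigsetjmp(fpe_resume, 1) == 0)
        sink = 1 / zero;     /* assumption: traps on x86 (divide error, vector 0) */
    code = fpe_code;

    sigaction(SIGFPE, &old, NULL);
    (void)sink;
    return code;
}
</code>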
====== Processes ======
  * [[http://lxr.free-electrons.com/source/include/linux/sched.h#L1228|struct task_struct (include/linux/sched.h)]]
====== Virtual Memory ======
Note: <color red>the kernel always accesses variables through virtual addresses</color>; when a physical address is needed, the ''\_\_pa'' macro subtracts a fixed offset to obtain it. For the data structures related to virtual memory, see [[http://www.makelinux.net/books/ulk3/understandlk-CHP-9-SECT-2|9.2. The Memory Descriptor]] and [[http://www.makelinux.net/books/ulk3/understandlk-CHP-9-SECT-3|9.3. Memory Regions]]. See also page 31, and section 2.3 of Linux 內核源代碼情景分析. Linux currently abstracts the page table structure into four levels: pgd, pud, pmd and pte. ''pgd_t'', ''pud_t'', ''pmd_t'' and ''pte_t'' each represent an entry at the respective level. The entries are looked up with [[http://lxr.free-electrons.com/source/arch/x86/include/asm/pgtable.h#L586|pgd_offset]], [[http://lxr.free-electrons.com/source/arch/x86/include/asm/pgtable.h#L556|pud_offset]], [[http://lxr.free-electrons.com/source/arch/x86/include/asm/pgtable.h#L511|pmd_offset]] and [[http://lxr.free-electrons.com/source/arch/x86/include/asm/pgtable_64.h#L142|pte_offset_map]] respectively. When such an entry is empty, [[http://lxr.free-electrons.com/source/include/linux/mm.h#L1179|pud_alloc]], [[http://lxr.free-electrons.com/source/include/linux/mm.h#L1185|pmd_alloc]] and [[http://lxr.free-electrons.com/source/arch/x86/mm/pgtable.c#L23|pte_alloc_one]] are called to allocate the pud, pmd and pte. The result is then combined with permission bits to form the entry, for example with [[http://lxr.free-electrons.com/source/arch/x86/mm/pgtable.c#L23|mk_pmd]] or [[http://lxr.free-electrons.com/source/arch/x86/include/asm/pgtable.h#L457|mk_pte]].
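
To make the four-level walk and the ''\_\_pa'' offset concrete, here is a minimal arithmetic sketch. It assumes x86-64 with 4 KB pages (9 index bits per level above a 12-bit page offset) for the split, and the classic 32-bit x86 layout with the kernel mapped at ''0xC0000000'' for the address conversion. The ''toy_'' names and ''split_vaddr'' are hypothetical; the kernel's own macros are more involved.

<code c>
#include <stdint.h>

/* Shifts for x86-64 with 4 KB pages. */
#define TOY_PAGE_SHIFT   12
#define TOY_PMD_SHIFT    21
#define TOY_PUD_SHIFT    30
#define TOY_PGDIR_SHIFT  39
#define TOY_INDEX_MASK   0x1ffu          /* 512 entries per table */

struct pt_indices {
    unsigned pgd, pud, pmd, pte;         /* index into each level's table */
    unsigned offset;                     /* byte offset inside the page */
};

/* Splits a virtual address the way pgd_offset ... pte_offset_map index it. */
struct pt_indices split_vaddr(uint64_t va)
{
    struct pt_indices ix;

    ix.pgd    = (unsigned)(va >> TOY_PGDIR_SHIFT) & TOY_INDEX_MASK;
    ix.pud    = (unsigned)(va >> TOY_PUD_SHIFT)   & TOY_INDEX_MASK;
    ix.pmd    = (unsigned)(va >> TOY_PMD_SHIFT)   & TOY_INDEX_MASK;
    ix.pte    = (unsigned)(va >> TOY_PAGE_SHIFT)  & TOY_INDEX_MASK;
    ix.offset = (unsigned)(va & ((1u << TOY_PAGE_SHIFT) - 1));
    return ix;
}

/* Assumed kernel base for 32-bit x86 with the 3G/1G split. */
#define TOY_PAGE_OFFSET 0xC0000000u

/* Toy version of __pa for directly mapped kernel addresses. */
uint32_t toy_pa(uint32_t kernel_vaddr)
{
    return kernel_vaddr - TOY_PAGE_OFFSET;
}
</code>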

  * [[http://lxr.free-electrons.com/source/include/linux/mm_types.h#L299|struct mm_struct (include/linux/mm_types.h)]]
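    * Describes the whole virtual address space of a process.<code c>
struct task_struct {

        /* ... omitted ... */

        struct mm_struct *mm, *active_mm;

        /* ... omitted ... */
};

struct mm_struct {
        // Points to the first VMA of the process; the remaining VMAs can be walked through vm_next.
        struct vm_area_struct * mmap;           /* list of VMAs */

        /* ... omitted ... */
};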
</code> 
  * [[http://lxr.free-electrons.com/source/include/linux/mm_types.h#L211|struct vm_area_struct (include/linux/mm_types.h)]]
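    * Linux divides a process's virtual address space into several areas. Roughly speaking, each segment of an ELF file corresponds to one VMA; see chapter 6 (loading executables and processes) of [[http://www.waterlike.com.tw/bookdata.asp?NO=TP3C09A004|程序員的自我修養]].<code c>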
struct vm_area_struct {
        struct mm_struct * vm_mm;       /* The address space we belong to. */
        unsigned long vm_start;         /* Our start address within vm_mm. */
        unsigned long vm_end;           /* The first byte after our end address
                                           within vm_mm. */

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next, *vm_prev;

        /* ... omitted ... */
};
</code>
    * [[http://www.jollen.org/blog/2007/01/linux_virtual_memory_areas_vma.html|Linux 的 Virtual Memory Areas（VMA）：基本概念介紹]]
    * [[http://www.jollen.org/blog/2007/01/process_vma.html|Linux 的 Virtual Memory Areas（VMA）：Process 與 VMA 整體觀念]]
    * [[http://www.jollen.org/blog/2007/03/mmap_vma.html|小談 mmap() 與 VMA]]
    * [[http://blog.csdn.net/liujun01203/article/details/5862940|vm_area_struct 结构]]
  * [[http://lxr.free-electrons.com/source/include/linux/mmzone.h#L668|struct pglist_data (include/linux/mmzone.h)]]
    * The physical memory attached to each processor is called a node.
  * [[http://lxr.free-electrons.com/source/include/linux/mmzone.h#L329|struct zone (include/linux/mmzone.h)]]
    * Each node is divided into several zones.
  * [[http://lxr.free-electrons.com/source/include/linux/mm_types.h#L41|struct page (include/linux/mm_types.h)]]
    * The data structure that describes a physical page. A zone contains many physical pages.
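
The ''mmap'' / ''vm_next'' list in ''struct mm_struct'' and ''struct vm_area_struct'' can be walked as follows. This is a simplified user-space sketch: ''struct toy_vma'' keeps only the three fields shown above, the ''toy_'' functions are hypothetical, and the real kernel lookup also uses a tree and a cache rather than a plain list walk.

<code c>
#include <stddef.h>

struct toy_vma {
    unsigned long vm_start;      /* first address inside the area */
    unsigned long vm_end;        /* first address after the area */
    struct toy_vma *vm_next;     /* next area, sorted by address */
};

/* Returns the first VMA whose vm_end lies above addr, or NULL if none does. */
struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
    struct toy_vma *vma;

    for (vma = head; vma != NULL; vma = vma->vm_next)
        if (addr < vma->vm_end)
            return vma;
    return NULL;
}

/* addr is mapped only if the VMA found actually starts at or below it. */
int toy_addr_is_mapped(struct toy_vma *head, unsigned long addr)
{
    struct toy_vma *vma = toy_find_vma(head, addr);

    return vma != NULL && vma->vm_start <= addr;
}
</code>

Note that the lookup can return an area that lies entirely above ''addr'', so the caller has to compare against ''vm_start'' as well.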

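[[http://www.kernel.org/doc/gorman/html/understand/understand009.html|Chapter 6 Physical Page Allocation]]. Linux divides the system's physical memory into nodes, each represented by a ''pg_data_t'' and associated with one processor; this mainly serves NUMA. Each node is divided into zones: ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM. For how Linux describes physical memory, see [[http://www.kernel.org/doc/gorman/html/understand/understand005.html|Chapter 2 Describing Physical Memory]]. On 32-bit x86, ZONE_DMA is the first 16 MB of physical memory and is reserved for peripherals; ZONE_NORMAL is physical memory from 16 MB to 896 MB, which the kernel maps into the high end of the virtual address space, that is, into kernel space; ZONE_HIGHMEM is the rest of physical memory. Each zone has three watermarks, pages_high, pages_low and pages_min, which track how many free pages the zone has left. kswapd is woken when the number of free pages drops to pages_low and reclaims pages until the number is back up to pages_high. If reclaim cannot keep up with consumption and the free pages drop to pages_min, the allocator does the same work as kswapd while it allocates, reclaiming pages synchronously.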
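
The zone boundaries and the watermark rules above can be written down as two small functions. This is a sketch under the assumptions stated in the paragraph (32-bit x86 boundaries at 16 MB and 896 MB); the ''toy_'' types and functions are hypothetical and only model the decision, not the allocator itself.

<code c>
#define TOY_MB (1024ULL * 1024ULL)

enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM };

/* Maps a physical address to its zone on 32-bit x86. */
enum toy_zone toy_zone_of(unsigned long long phys_addr)
{
    if (phys_addr < 16 * TOY_MB)
        return TOY_ZONE_DMA;
    if (phys_addr < 896 * TOY_MB)
        return TOY_ZONE_NORMAL;
    return TOY_ZONE_HIGHMEM;
}

struct toy_watermarks {
    unsigned long pages_min, pages_low, pages_high;
};

enum toy_reclaim { TOY_NO_RECLAIM, TOY_WAKE_KSWAPD, TOY_DIRECT_RECLAIM };

/* Decides what an allocation should trigger, given the zone's free pages. */
enum toy_reclaim toy_reclaim_action(const struct toy_watermarks *wm,
                                    unsigned long free_pages)
{
    if (free_pages <= wm->pages_min)
        return TOY_DIRECT_RECLAIM;   /* allocator reclaims synchronously */
    if (free_pages <= wm->pages_low)
        return TOY_WAKE_KSWAPD;      /* background reclaim up to pages_high */
    return TOY_NO_RECLAIM;
}
</code>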

[[http://www.kernel.org/doc/gorman/html/understand/understand011.html|Chapter 8 Slab Allocator]] serves requests for small blocks of memory, to reduce internal fragmentation.

<blockquote>
When a User Mode process asks for dynamic memory, <color red>it doesn't get additional page frames; instead, it gets the right to use a new range of linear addresses, which become part of its address space.</color> This interval is called a "memory region."
</blockquote>

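To quickly find which page table entries point to a given physical page, Linux uses reverse mapping; see [[http://www.makelinux.net/books/ulk3/understandlk-CHP-17-SECT-2|17.2. Reverse Mapping]]. Keeping the list of referencing page table entries directly in ''struct page'' is not a good approach. Instead, the kernel currently maintains a reverse link from the physical page to its VMAs, and from each VMA it walks the owning process's page tables through ''pgd_t * pgd'' in ''struct mm_struct''.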

  * [[wp>Slab allocation]]
  * [[http://www.ibm.com/developerworks/cn/linux/l-cn-slub/|Linux SLUB 分配器详解]]
  * [[http://www.mystone7.com/2012/08/13/process_4g_process/|进程访问4G空间]]
  * [[http://download.polytechnic.edu.na/pub4/download.sourceforge.net/pub/sourceforge/r/re/readmemmanag/linux_mem_manage_2.6.36.2_summary.pdf]]
  * [[http://www.unixresources.net/linux/clf/linuxK/archive/00/00/44/85/448501.html|有关Linux下3层页表和2层页表的实现问题(PMD的跳过)]]
  * [[http://lwn.net/Articles/106177/|Four-level page tables]]
===== Terminology =====
  * Get Free Page (GFP)
  * SLOB (Simple List Of Blocks)
  * PAT (Page Attribute Table)
    * Related to [[x86]].
===== Page Faults =====
Goal: understand how the kernel frees physical pages.

  - ''do_page_fault (arch/x86/mm/fault.c)''.<code c>
/*
 * This routine handles page faults.  It determines the address,
 * and the problem, and then passes it off to one of the appropriate
 * routines.
 */
dotraplinkage void __kprobes
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
        /* ... omitted ... */

        /* Get the faulting address: */
        address = read_cr2();

good_area:
        // Allocate a physical page.
        fault = handle_mm_fault(mm, vma, address, flags);

        /* ... omitted ... */
}
</code>
  - ''handle_mm_fault (mm/memory.c)''.<code c>
int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
                unsigned long address, unsigned int flags)
{
        /* ... omitted ... */

        pte = pte_offset_map(pmd, address); // entry in the last-level page table

        // Allocate a physical page.
        return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
</code>
  - ''handle_pte_fault (mm/memory.c)''.<code c>
int handle_pte_fault(struct mm_struct *mm,
                     struct vm_area_struct *vma, unsigned long address,
                     pte_t *pte, pmd_t *pmd, unsigned int flags)
{
        /* ... omitted ... */

        entry = *pte;
        if (!pte_present(entry)) {
                // No physical page has been allocated for this address yet.
                if (pte_none(entry)) {
                        // The page is backed by a file.
                        if (vma->vm_ops) {
                                if (likely(vma->vm_ops->fault))
                                        return do_linear_fault(mm, vma, address,
                                                pte, pmd, flags, entry);
                        }
                        return do_anonymous_page(mm, vma, address,
                                                 pte, pmd, flags);
                }
                if (pte_file(entry))
                        return do_nonlinear_fault(mm, vma, address,
                                        pte, pmd, flags, entry);
                return do_swap_page(mm, vma, address,
                                        pte, pmd, flags, entry);
        }

        /* ... omitted ... */
}
</code>
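
The branch structure of ''handle_pte_fault'' for a non-present entry boils down to a small decision table. The sketch below models only that decision. The ''toy_'' names are hypothetical, and the three flags stand in for ''pte_present'', ''pte_none'' and ''pte_file''.

<code c>
enum toy_fault_path {
    TOY_PTE_PRESENT,        /* handled by the part of the function omitted above */
    TOY_LINEAR_FAULT,       /* do_linear_fault: first touch of a file-backed page */
    TOY_ANONYMOUS_PAGE,     /* do_anonymous_page: first touch of anonymous memory */
    TOY_NONLINEAR_FAULT,    /* do_nonlinear_fault: nonlinear file mapping */
    TOY_SWAP_PAGE           /* do_swap_page: page was swapped out */
};

struct toy_pte_state {
    int present;            /* models pte_present(entry) */
    int none;               /* models pte_none(entry) */
    int file;               /* models pte_file(entry) */
};

/* vma_has_fault_op models "vma->vm_ops && vma->vm_ops->fault". */
enum toy_fault_path toy_pte_fault_path(struct toy_pte_state pte,
                                       int vma_has_fault_op)
{
    if (pte.present)
        return TOY_PTE_PRESENT;
    if (pte.none)
        return vma_has_fault_op ? TOY_LINEAR_FAULT : TOY_ANONYMOUS_PAGE;
    if (pte.file)
        return TOY_NONLINEAR_FAULT;
    return TOY_SWAP_PAGE;
}
</code>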
===== External Links =====
  * [[http://fanqiang.chinaunix.net/a1/b1/20010901/1305001220.html|读核日记(七) --linux的内存管理机制(1)]]
  * [[http://www.makelinux.net/ldd3/chp-15-sect-1|15.1. Memory Management in Linux]]

  * [[http://linux-mm.org/|LinuxMM]]
  * [[http://www.kernel.org/doc/gorman/html/understand/index.html|Understanding the Linux Virtual Memory Manager]]
    * [[http://ptgmedia.pearsoncmg.com/images/0131453483/downloads/gorman_book.pdf|Understanding the Linux® Virtual Memory Manager]]
  * [[http://www.rohitab.com/discuss/topic/31139-tutorial-paging-memory-mapping-with-a-recursive-page-directory/|Paging: Memory Mapping With A Recursive Page Directory]]

<blockquote>
We need a way to read from and write to page tables. <color red>This means accessing them by a virtual address</color>, since we're using paging.
</blockquote>

<blockquote>
During initialization of the virtual memory manager, the last PDE in the page directory is set to the physical address of the page directory itself.
</blockquote>

  * [[http://jpsix.pixnet.net/blog/post/29962727-%5Blinux%5D-linux-kernel-memory-allocation|[Linux] Linux Kernel Memory Allocation]]
  * [[http://www.linuxatemyram.com/|Help! Linux ate my RAM!]]
  * [[http://blog.linux.org.tw/~jserv/archives/001461.html|探索 Linux Memory Model (上)]]
  * [[http://blog.linux.org.tw/~jserv/archives/001463.html|探索 Linux Memory Model (下)]]
  * [[http://140.120.7.20/LinuxRef/mmLinux/Linux_memory_model.html|Explore the Linux memory model]]
  * [[http://www.ibm.com/developerworks/cn/linux/l-kernel-shared-memory/index.html?ca=drs-|Linux Kernel Shared Memory 剖析]]
  * [[http://lwn.net/Articles/423584/|Transparent huge pages in 2.6.38]]

  * [[http://histemiss.blog.163.com/blog/static/30487860201251803955826/]]
===== Process Memory Usage =====
<code bash>
# For all sorts of complicated reasons, treat the output below only as a rough guide.
$ ps u -p `pidof a.out`
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
chenwj   31952  0.0  0.0   3884   472 pts/42   S+   14:51   0:00 ./a.out
# Look at the writeable/private figure on the last line.
$ pmap -d `pidof a.out`
32280:   ./a.out
Address           Kbytes Mode  Offset           Device    Mapping
0000000000400000       4 r-x-- 0000000000000000 000:0000f a.out
0000000000600000       4 r---- 0000000000000000 000:0000f a.out
0000000000601000       4 rw--- 0000000000001000 000:0000f a.out
00000000020d5000     132 rw--- 0000000000000000 000:00000   [ anon ]
00007fa2458be000    1412 r-x-- 0000000000000000 0fe:00000 libc-2.12.2.so
00007fa245a1f000    2048 ----- 0000000000161000 0fe:00000 libc-2.12.2.so
00007fa245c1f000      16 r---- 0000000000161000 0fe:00000 libc-2.12.2.so
00007fa245c23000       4 rw--- 0000000000165000 0fe:00000 libc-2.12.2.so
00007fa245c24000      20 rw--- 0000000000000000 000:00000   [ anon ]
00007fa245c29000     120 r-x-- 0000000000000000 0fe:00000 ld-2.12.2.so
00007fa245e2f000      12 rw--- 0000000000000000 000:00000   [ anon ]
00007fa245e44000       8 rw--- 0000000000000000 000:00000   [ anon ]
00007fa245e46000       4 r---- 000000000001d000 0fe:00000 ld-2.12.2.so
00007fa245e47000       4 rw--- 000000000001e000 0fe:00000 ld-2.12.2.so
00007fa245e48000       4 rw--- 0000000000000000 000:00000   [ anon ]
00007fff29b9a000      84 rw--- 0000000000000000 000:00000   [ stack ]
00007fff29bff000       4 r-x-- 0000000000000000 000:00000   [ anon ]
ffffffffff600000       4 r-x-- 0000000000000000 000:00000   [ anon ]
mapped: 3888K    writeable/private: 272K    shared: 0K
</code>
  * RSS (Resident Set Size)
    * The amount of physical memory the process uses.
  * VSZ (Virtual Size)
    * The amount of virtual memory the process uses.
  * [[http://unix.stackexchange.com/questions/35129/need-explanation-on-resident-set-size-virtual-size|Need explanation on Resident Set Size/Virtual Size]]
  * [[http://unix.stackexchange.com/questions/18841/measuring-ram-usage-of-a-program|Measuring RAM usage of a program]]
  * [[http://stackoverflow.com/questions/131303/linux-how-to-measure-actual-memory-usage-of-an-application-or-process|Linux: How to measure actual memory usage of an application or process?]]
  * [[http://virtualthreads.blogspot.tw/2006/02/understanding-memory-usage-on-linux.html|Understanding memory usage on Linux]]
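
A process can also read its own VSZ and RSS directly. The sketch below takes them from the first two fields of ''/proc/self/statm'', which are counted in pages, and converts them to kilobytes as ''ps'' reports them. ''read_self_mem_kb'' is a hypothetical helper, and the same caveat applies: the numbers are a rough guide.

<code c>
#include <stdio.h>
#include <unistd.h>

/*
 * Stores the calling process's virtual size and resident set size, in KB.
 * Returns 0 on success, -1 if /proc/self/statm cannot be read or parsed.
 */
int read_self_mem_kb(long *vsz_kb, long *rss_kb)
{
    long size_pages, resident_pages;
    long page_kb = sysconf(_SC_PAGESIZE) / 1024;
    FILE *f = fopen("/proc/self/statm", "r");

    if (f == NULL)
        return -1;
    /* Field 1: total program size; field 2: resident pages. */
    if (fscanf(f, "%ld %ld", &size_pages, &resident_pages) != 2) {
        fclose(f);
        return -1;
    }
    fclose(f);

    *vsz_kb = size_pages * page_kb;
    *rss_kb = resident_pages * page_kb;
    return 0;
}
</code>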
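===== Dumping Physical Memory =====
''/dev/mem'' exposes the contents of the machine's <color red>physical memory</color>. Typically a range of physical memory is mapped into the current process's virtual address space with ''mmap''. Reads and writes to that virtual range are then reads and writes to the corresponding physical memory, so the process can access physical memory directly. This is most often used for MMIO. ''/dev/mem'' is byte-addressed, and its offsets are physical addresses [(http://people.cs.nctu.edu.tw/~chenwj/log/UNIX/nico103-2012-08-29.txt)]. Mapping ''/dev/mem'' with ''mmap'' is subject to restrictions; see [[http://stackoverflow.com/questions/11891979/accessing-mmaped-dev-mem|accessing mmaped /dev/mem?]][(http://lwn.net/Articles/267427/)]. 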
<code c>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>

int main()
{
    // Through map_base the physical memory can be read one byte, halfword or word at a time,
    // depending on whether map_base is cast to unsigned char *, unsigned short * or unsigned long *.
    // Note: map_base is the virtual address that maps the start of the physical range being read.
    void *map_base;
    unsigned long virt_addr; // Virtual address that maps the physical address.
    unsigned char val;       // The byte stored at that physical address.
    int i, fd;

    fd = open("/dev/mem", O_RDWR|O_SYNC);
    if (fd == -1)
        abort();

    // Map 0xff bytes of /dev/mem, starting at offset (physical address) 0x20000,
    // into the virtual address space of the current process.
    map_base = mmap(NULL, 0xff, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0x20000);

    // mmap reports failure with MAP_FAILED, not NULL.
    if (map_base == MAP_FAILED)
        abort();

    for (i = 0; i < 0xff; ++i)
    {
        // Read the current contents of physical memory 0x20000 - 0x200fe.
        virt_addr = (unsigned long)((unsigned char *)map_base + i);
        val = *((unsigned char *)map_base + i);
        printf("virt_addr: 0x%08lx val: 0x%x\t\t", virt_addr, val);

        // Write a new value, then read physical memory 0x20000 - 0x200fe back.
        *((unsigned char *)map_base + i) = i;
        val = *((unsigned char *)map_base + i);
        printf("virt_addr: 0x%08lx val: 0x%x\n", virt_addr, val);
    }

    munmap(map_base, 0xff);

    close(fd);

    return 0;
}
</code>
  * [[http://superuser.com/questions/71389/what-is-dev-mem|What is /dev/mem?]]
    * [[http://stackoverflow.com/questions/6134984/access-permissions-of-dev-mem|Access permissions of /dev/mem]] 
    * [[http://blog.csdn.net/wlp600/article/details/6893636|/dev/mem]]
    * [[http://blog.csdn.net/zhanglei4214/article/details/6653568|利用mmap /dev/mem 读写Linux内存]]
    * [[http://blog.linux.org.tw/~jserv/archives/001342.html|X server 的 low-level 觀點]]
  * [[http://www.arm.linux.org.uk/mailinglists/faq.php#f8|How can I access /dev/mem ? - How can I map memory in user space?]]
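
One practical detail when mapping ''/dev/mem'': the offset passed to ''mmap'' must be a multiple of the page size, so an arbitrary physical address has to be rounded down and the remainder added back to the returned pointer. The sketch below does only that arithmetic. ''struct toy_mem_window'' and ''toy_window_for'' are hypothetical names, and ''page_size'' is assumed to be a power of two such as the value of ''sysconf(_SC_PAGESIZE)''.

<code c>
struct toy_mem_window {
    unsigned long map_offset;   /* page-aligned offset to pass to mmap */
    unsigned long delta;        /* where phys_addr sits inside the mapping */
    unsigned long map_len;      /* mapping length, rounded up to whole pages */
};

/* Computes the mmap parameters that cover [phys_addr, phys_addr + len). */
struct toy_mem_window toy_window_for(unsigned long phys_addr,
                                     unsigned long len,
                                     unsigned long page_size)
{
    struct toy_mem_window w;
    unsigned long mask = page_size - 1;

    w.map_offset = phys_addr & ~mask;
    w.delta = phys_addr - w.map_offset;
    w.map_len = (w.delta + len + mask) & ~mask;
    return w;
}
</code>

After mapping with ''map_offset'' and ''map_len'', the byte at ''phys_addr'' is found at ''(unsigned char *)map_base + delta''.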
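===== Miscellaneous =====
The goal of [[http://stackoverflow.com/questions/12256261/modifying-current-process-pte-through-dev-mem|Modifying current process' pte through /dev/mem?]] is to make a virtual address above 4G and one below 4G map to the same physical page. The current approach uses ''/dev/mem'' together with the current process's CR3 to modify the page table entry of the virtual address below 4G, so that it points to the physical page mapped by the virtual address above 4G. In principle, count and mapcount in the ''struct page'' of every physical page involved should be updated accordingly. One failure case is a physical page that is now mapped by several page table entries but whose mapcount was never updated to match; when the process exits and the kernel reclaims its pages, it finds that the mapcount is wrong. For page reclaim, see [[http://hi.baidu.com/_kouu/item/3590d5f2f9d48cb431c199d9|linux 页面回收浅析]] and [[http://www.ibm.com/developerworks/cn/linux/l-cn-pagerecycle/index.html|Linux 2.6 中的页面回收与反向映射]]. rss_stat, which counts the physical memory used by the process, needs to be updated as well.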
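
The mapcount problem can be reproduced with a toy model. It is only an illustration of the bookkeeping, not kernel code: ''struct toy_page'', ''struct toy_pte'' and the ''toy_'' functions are hypothetical, and ''mapcount'' here simply counts how many entries point at the page.

<code c>
#include <stddef.h>

struct toy_page { int mapcount; };           /* number of PTEs mapping the page */
struct toy_pte  { struct toy_page *page; };

/* Normal path: installing a mapping also updates the page's mapcount. */
void toy_map(struct toy_pte *pte, struct toy_page *page)
{
    pte->page = page;
    page->mapcount++;
}

/* What patching the entry through /dev/mem amounts to: no accounting at all. */
void toy_patch_pte(struct toy_pte *pte, struct toy_page *page)
{
    pte->page = page;
}

/*
 * Teardown at process exit: clear the entry and drop one mapping.
 * Returns 1 when the mapcount goes negative, the condition that makes
 * the kernel report a bad page map.
 */
int toy_unmap(struct toy_pte *pte)
{
    struct toy_page *page = pte->page;

    pte->page = NULL;
    page->mapcount--;
    return page->mapcount < 0;
}
</code>

If the entry for the low address is patched to point at the page of the high address, teardown decrements that page twice and never touches the page the low address originally mapped, so one mapcount ends at -1 and the other stays at 1.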
==== Approach 1 ====
<code>
BUG: Bad page map in process mmap  pte:8000000007eb2067 pmd:07acb067
page:ffffea00001fac80 count:0 mapcount:-1 mapping:          (null) index:0x101b7b
page flags: 0x4000000000000014(referenced|dirty)
addr:0000000101b7b000 vm_flags:00100073 anon_vma:ffff880007ab0708 mapping:          (null) index:101b7b
Pid: 609, comm: mmap Tainted: G    B        3.5.3 #7
Call Trace:
 [<ffffffff8107abcc>] ? print_bad_pte+0x1d2/0x1ea
 [<ffffffff8107bf18>] ? unmap_single_vma+0x3a0/0x56d
 [<ffffffff8107c745>] ? unmap_vmas+0x2c/0x46
 [<ffffffff8108106b>] ? exit_mmap+0x6e/0xdd
 [<ffffffff8101cc4f>] ? do_page_fault+0x30f/0x348
 [<ffffffff81020ce6>] ? mmput+0x20/0xb4
 [<ffffffff810256ae>] ? exit_mm+0x105/0x110
 [<ffffffff8103bb6c>] ? hrtimer_try_to_cancel+0x67/0x70
 [<ffffffff81026b59>] ? do_exit+0x211/0x711
 [<ffffffff810272e0>] ? do_group_exit+0x76/0xa0
 [<ffffffff8102731c>] ? sys_exit_group+0x12/0x19
 [<ffffffff812f3662>] ? system_call_fastpath+0x16/0x1b
BUG: Bad rss-counter state mm:ffff880007a496c0 idx:0 val:-1
BUG: Bad rss-counter state mm:ffff880007a496c0 idx:1 val:1
</code>

  system_call_fastpath -> sys_exit_group -> do_group_exit -> do_exit
    -> exit_mm -> mmput (kernel/fork.c) -> exit_mmap (mm/mmap.c)
    -> unmap_vmas (mm/memory.c) -> unmap_single_vma -> print_bad_pte

  - When the process exits, ''mmput (kernel/fork.c)'' is called to reclaim the physical pages the process used and to tear down its page tables. See [[http://liuw.72pines.com/867|Linux内核释放页表的过程]].<code c>
void mmput(struct mm_struct *mm)
{
        might_sleep();

        if (atomic_dec_and_test(&mm->mm_users)) {
                uprobe_clear_state(mm);
                exit_aio(mm);
                ksm_exit(mm);
                khugepaged_exit(mm); /* must run before exit_mmap */
                exit_mmap(mm); /* error msg 1 */
                set_mm_exe_file(mm, NULL);
                if (!list_empty(&mm->mmlist)) {
                        spin_lock(&mmlist_lock);
                        list_del(&mm->mmlist);
                        spin_unlock(&mmlist_lock);
                }
                if (mm->binfmt)
                        module_put(mm->binfmt->module);
                mmdrop(mm); /* error msg 2 */ 
        }
}
</code>
  - ''exit_mmap (mm/mmap.c)'' releases the process's VMAs in order. ''9.3.5.3. The unmap_region() function'' in [[http://www.makelinux.net/books/ulk3/?u=understandlk-CHP-9-SECT-3|9.3. Memory Regions]] is a useful reference.<code c>
void exit_mmap(struct mm_struct *mm)
{
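        // Holds platform-specific state that tlb_remove_page uses when reclaiming physical pages.
        struct mmu_gather tlb;

        /* ... omitted ... */

        vma = mm->mmap;

        lru_add_drain();
        flush_cache_mm(mm);
        // Initialize the mmu_gather; a third argument of 1 means the whole address space is being destroyed.
        tlb_gather_mmu(&tlb, mm, 1);

        // Release the physical pages covered by every VMA.
        unmap_vmas(&tlb, vma, 0, -1);

        // Free the page tables.
        free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);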
        tlb_finish_mmu(&tlb, 0, -1);
       
        while (vma) {
                if (vma->vm_flags & VM_ACCOUNT)
                        nr_accounted += vma_pages(vma);
                vma = remove_vma(vma);
        }

}
</code>
  - ''unmap_vmas (mm/memory.c)'' reclaims the physical pages.<code c>
void unmap_vmas(struct mmu_gather *tlb,
                struct vm_area_struct *vma, unsigned long start_addr,
                unsigned long end_addr)
{
        struct mm_struct *mm = vma->vm_mm;

        mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
        for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
                unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
        mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
}
</code>
    * ''unmap_single_vma (mm/memory.c)''.<code c>
static void unmap_single_vma(struct mmu_gather *tlb,
                struct vm_area_struct *vma, unsigned long start_addr,
                unsigned long end_addr,
                struct zap_details *details)
{
        unsigned long start = max(vma->vm_start, start_addr);
        unsigned long end;

        if (start >= vma->vm_end)
                return;
        end = min(vma->vm_end, end_addr);
        if (end <= vma->vm_start)
                return;

        if (vma->vm_file)
                uprobe_munmap(vma, start, end);

        if (unlikely(is_pfn_mapping(vma)))
                untrack_pfn_vma(vma, 0, 0);

        if (start != end) {
                if (unlikely(is_vm_hugetlb_page(vma))) {
                        /*
                         * It is undesirable to test vma->vm_file as it
                         * should be non-null for valid hugetlb area.
                         * However, vm_file will be NULL in the error
                         * cleanup path of do_mmap_pgoff. When
                         * hugetlbfs ->mmap method fails,
                         * do_mmap_pgoff() nullifies vma->vm_file
                         * before calling this function to clean up.
                         * Since no pte has actually been setup, it is
                         * safe to do nothing in this case.
                         */
                        if (vma->vm_file)
                                unmap_hugepage_range(vma, start, end, NULL);
                } else
                        unmap_page_range(tlb, vma, start, end, details);
        }
}
</code>
   * ''unmap_page_range (mm/memory.c)'' releases the pud, pmd and pte levels in turn.
      * unmap_page_range -> zap_pud_range -> zap_pmd_range -> zap_pte_range <code c>
static void unmap_page_range(struct mmu_gather *tlb,
                             struct vm_area_struct *vma,
                             unsigned long addr, unsigned long end,
                             struct zap_details *details)
{
        pgd_t *pgd;
        unsigned long next;

        if (details && !details->check_mapping && !details->nonlinear_vma)
                details = NULL;

        BUG_ON(addr >= end);
        mem_cgroup_uncharge_start();
        tlb_start_vma(tlb, vma);
        pgd = pgd_offset(vma->vm_mm, addr);
        do {
                next = pgd_addr_end(addr, end);
                if (pgd_none_or_clear_bad(pgd))
                        continue;
                next = zap_pud_range(tlb, vma, pgd, addr, next, details);
        } while (pgd++, addr = next, addr != end);
        tlb_end_vma(tlb, vma);
        mem_cgroup_uncharge_end();
}
</code>
      * ''zap_pte_range (mm/memory.c)'' clears each pte in the range.<code c>
static unsigned long zap_pte_range(struct mmu_gather *tlb,
                                struct vm_area_struct *vma, pmd_t *pmd,
                                unsigned long addr, unsigned long end,
                                struct zap_details *details)
{
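        /* ... omitted ... */

        do {
                /* ... omitted ... */

                if (pte_present(ptent)) {
                        struct page *page;

                        // Look up the struct page behind this pte.
                        page = vm_normal_page(vma, addr, ptent);

                        // Clear the pte and keep its old value.
                        ptent = ptep_get_and_clear_full(mm, addr, pte,
                                                        tlb->fullmm);

                        // Record that this pte's TLB entry has to be flushed.
                        tlb_remove_tlb_entry(tlb, pte, addr);

                        // Update the RSS counters (anonymous or file-backed pages).
                        if (PageAnon(page))
                                rss[MM_ANONPAGES]--;
                        else {
                                if (pte_dirty(ptent))
                                        set_page_dirty(page);
                                if (pte_young(ptent) &&
                                    likely(!VM_SequentialReadHint(vma)))
                                        mark_page_accessed(page);
                                rss[MM_FILEPAGES]--;
                        }

                        // Remove the reverse mapping from the physical page to this pte.
                        page_remove_rmap(page);
                        // The number of ptes mapping a physical page should never drop below zero.
                        // If the condition below holds, the count went negative and the pte is reported as bad.
                        if (unlikely(page_mapcount(page) < 0))
                                print_bad_pte(vma, addr, ptent, page);

                        /* ... omitted ... */
                }

                /* ... omitted ... */
        } while (pte++, addr += PAGE_SIZE, addr != end);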

        add_mm_rss_vec(mm, rss);
        arch_leave_lazy_mmu_mode();
        pte_unmap_unlock(start_pte, ptl);

        if (force_flush) {
                force_flush = 0;
                tlb_flush_mmu(tlb); // Flush the TLB and free the gathered physical pages.
                if (addr != end)
                        goto again;
        }

        return addr;
}
</code>
    * [[http://bbs.chinaunix.net/thread-3558002-1-1.html|Questions about the details of copy-on-write]]
  - ''free_pgtables (mm/memory.c)''.<code c>
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
                unsigned long floor, unsigned long ceiling)
{
        while (vma) {
                struct vm_area_struct *next = vma->vm_next;
                unsigned long addr = vma->vm_start;

                /*
                 * Hide vma from rmap and truncate_pagecache before freeing
                 * pgtables
                 */
                unlink_anon_vmas(vma);
                unlink_file_vma(vma);

                if (is_vm_hugetlb_page(vma)) {
                        hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
                                floor, next? next->vm_start: ceiling);
                } else {
                        /*
                         * Optimization: gather nearby vmas into one call down
                         */
                        while (next && next->vm_start <= vma->vm_end + PMD_SIZE
                               && !is_vm_hugetlb_page(next)) {
                                vma = next;
                                next = vma->vm_next;
                                unlink_anon_vmas(vma);
                                unlink_file_vma(vma);
                        }
                        free_pgd_range(tlb, addr, vma->vm_end,
                                floor, next? next->vm_start: ceiling);
                }
                vma = next;
        }
}
</code>
    * ''free_pgd_range (mm/memory.c)'' <code c>
void free_pgd_range(struct mmu_gather *tlb,
                        unsigned long addr, unsigned long end,
                        unsigned long floor, unsigned long ceiling)
{
        /* ... omitted ... */

        pgd = pgd_offset(tlb->mm, addr);
        do {
                next = pgd_addr_end(addr, end);
                if (pgd_none_or_clear_bad(pgd))
                        continue;
                free_pud_range(tlb, pgd, addr, next, floor, ceiling);
        } while (pgd++, addr = next, addr != end);
}
</code>
    * ''free_pud_range'' <code c>
        pud = pud_offset(pgd, addr);
        do {
                next = pud_addr_end(addr, end);
                if (pud_none_or_clear_bad(pud))
                        continue;
                free_pmd_range(tlb, pud, addr, next, floor, ceiling);
        } while (pud++, addr = next, addr != end);

</code>
    * ''free_pmd_range (mm/memory.c)'' <code c>
static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
                                unsigned long addr, unsigned long end,
                                unsigned long floor, unsigned long ceiling)
{
        /* ... omitted ... */

        pmd = pmd_offset(pud, addr);
        do {
                next = pmd_addr_end(addr, end);
                if (pmd_none_or_clear_bad(pmd))
                        continue;
                free_pte_range(tlb, pmd, addr);
        } while (pmd++, addr = next, addr != end);


        /* ... omitted ... */
}
</code>
    * ''free_pte_range (mm/memory.c)''<code c>
/*
 * Note: this doesn't free the actual pages themselves. That
 * has been handled earlier when unmapping all the memory regions.
 */
static void free_pte_range(struct mmu_gather *tlb, pmd_t *pmd,
                           unsigned long addr)
{
        pgtable_t token = pmd_pgtable(*pmd);
        pmd_clear(pmd);
        pte_free_tlb(tlb, token, addr);
        tlb->mm->nr_ptes--;
}
</code>
  - ''mmput'' calls ''mmdrop'', which ends up in ''__mmdrop (kernel/fork.c)''.<code c>
void __mmdrop(struct mm_struct *mm)
{
        BUG_ON(mm == &init_mm);
        mm_free_pgd(mm);
        destroy_context(mm);
        mmu_notifier_mm_destroy(mm);
        check_mm(mm);
        free_mm(mm);
}
</code>
  - ''check_mm (kernel/fork.c)''.<code c>
static void check_mm(struct mm_struct *mm)
{
        int i;

        for (i = 0; i < NR_MM_COUNTERS; i++) {
                long x = atomic_long_read(&mm->rss_stat.count[i]);

                // Equivalent to if (x): the body runs when x != 0. unlikely() hints to the compiler that the branch is rarely taken.
                if (unlikely(x))
                        printk(KERN_ALERT "BUG: Bad rss-counter state "
                                          "mm:%p idx:%d val:%ld\n", mm, i, x);
        }
}
</code>
    * ''mm_struct (include/linux/mm_types.h)''.<code c>
struct mm_rss_stat {
        atomic_long_t count[NR_MM_COUNTERS];
};

struct mm_struct {
        struct vm_area_struct * mmap;           /* list of VMAs */
        struct rb_root mm_rb;
        struct vm_area_struct * mmap_cache;     /* last find_vma result */

        /* ... omitted ... */

        struct mm_rss_stat rss_stat;

        /* ... omitted ... */
};
</code>
  * [[http://www.kernel.org/doc/gorman/html/understand/understand009.html|Chapter 6 Physical Page Allocation]]
  * [[http://stackoverflow.com/questions/4069245/how-does-physical-pages-are-allocated-and-freed-during-the-malloc-and-free-call|How does physical pages are allocated and freed during the malloc and free call?]]
  * [[http://linux-mm.org/LinuxMMFAQ|What are the design internals behind zap_xxx_range() APIs?]]
  * [[http://histemiss.blog.163.com/blog/static/30487860201251803955826/|mmap.c]]
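
The walkers shown above (''unmap_page_range'', ''free_pgd_range'', ''free_pud_range'', ''free_pmd_range'') all share one loop shape: compute ''next'' as the end of the current entry clamped to ''end'', handle ''[addr, next)'', then advance until ''addr == end''. The following is a minimal user-space sketch of that idiom. All ''demo_'' names and the 4 MiB entry size are assumptions made for illustration; only the loop structure and the clamping expression are modeled on the kernel code.

<code c>
#include <assert.h>
#include <stdint.h>

/* Hypothetical span of one top-level entry (4 MiB, chosen only for this demo). */
#define DEMO_PGDIR_SHIFT 22
#define DEMO_PGDIR_SIZE  ((uint64_t)1 << DEMO_PGDIR_SHIFT)
#define DEMO_PGDIR_MASK  (~(DEMO_PGDIR_SIZE - 1))

/* Same idea as pgd_addr_end(): end of the entry containing addr, clamped to end. */
static uint64_t demo_addr_end(uint64_t addr, uint64_t end)
{
        uint64_t boundary = (addr + DEMO_PGDIR_SIZE) & DEMO_PGDIR_MASK;

        /* The "- 1" keeps the comparison correct if boundary wraps around to 0. */
        return (boundary - 1 < end - 1) ? boundary : end;
}

/* Walk [addr, end) and count how many top-level entries the range touches. */
static unsigned demo_count_entries(uint64_t addr, uint64_t end)
{
        uint64_t next;
        unsigned visited = 0;

        assert(addr < end);     /* mirrors BUG_ON(addr >= end) */
        do {
                next = demo_addr_end(addr, end);
                visited++;      /* a real walker would descend one level here */
        } while (addr = next, addr != end);

        return visited;
}
</code>

For example, ''demo_count_entries(0x3ff000, 0x900000)'' visits three entries, because the range crosses the 4 MiB boundaries at ''0x400000'' and ''0x800000''.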
==== Method 2 ====
<code>
# cat /proc/mtrr
reg00: base=0x0e0000000 ( 3584MB), size=  512MB, count=1: uncachable
# ./mmap
malloc vaddr: 0x00000001009a2010 val: 3
pte: 0x0000000007ee5000
mmap:624 map pfn RAM range req uncached-minus for [mem 0x07ee5000-0x07ee5fff], got write-back
malloc vaddr: 0x00000001009a2010 val: 10
pte: 0x8000000007ee5267
</code>
  - ''reserve_pfn_range (arch/x86/mm/pat.c)''.<code c>
static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
                                int strict_prot)
{
    /* ... omitted ... */

        if (is_ram) {
                if (!pat_enabled)
                        return 0;

                flags = lookup_memtype(paddr);
                if (want_flags != flags) {
                        printk(KERN_WARNING "%s:%d map pfn RAM range req %s for [mem %#010Lx-%#010Lx], got %s\n",
                                current->comm, current->pid,
                                cattr_name(want_flags),
                                (unsigned long long)paddr,
                                (unsigned long long)(paddr + size - 1),
                                cattr_name(flags));
                        *vma_prot = __pgprot((pgprot_val(*vma_prot) &
                                              (~_PAGE_CACHE_MASK)) |
                                             flags);
                }
                return 0;
        }

    /* ... omitted ... */
}
</code>
  * [[http://lwn.net/Articles/278994/|Documentation/x86/pat.txt]]
  * Documentation/x86/mtrr.txt
    * [[http://www.meduna.org/txt_mtrr_en.html|Speeding up the graphics on Pentium Pro / Pentium II computers]]
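
The final ''pte:'' value in the output above can be read with a few bit masks, and the expression that ''reserve_pfn_range'' uses to rewrite ''*vma_prot'' is the same mask-and-or pattern. The sketch below is a user-space illustration under stated assumptions: the bit positions are those of an x86-64 4 KiB page table entry, the ''demo_'' and ''DEMO_'' names are hypothetical, and ''DEMO_CACHE_MASK'' covering only PWT and PCD is an assumption about what ''_PAGE_CACHE_MASK'' contains in this kernel version.

<code c>
#include <assert.h>
#include <stdint.h>

/* x86-64 page table entry bits (4 KiB pages). */
#define DEMO_PTE_PRESENT   (UINT64_C(1) << 0)
#define DEMO_PTE_PWT       (UINT64_C(1) << 3)   /* page-level write-through */
#define DEMO_PTE_PCD       (UINT64_C(1) << 4)   /* page-level cache disable */
#define DEMO_PTE_PAT       (UINT64_C(1) << 7)   /* PAT index bit 2 */
#define DEMO_PTE_NX        (UINT64_C(1) << 63)  /* no-execute */
#define DEMO_PTE_ADDR_MASK UINT64_C(0x000ffffffffff000)

/* Assumption: cache attribute bits rewritten by the kernel are PWT and PCD. */
#define DEMO_CACHE_MASK    (DEMO_PTE_PWT | DEMO_PTE_PCD)

/* Physical address of the page frame the entry points to. */
static uint64_t demo_pte_phys_addr(uint64_t pte)
{
        return pte & DEMO_PTE_ADDR_MASK;
}

/* Index into the PAT: PAT bit is bit 2, PCD bit 1, PWT bit 0. */
static unsigned demo_pte_pat_index(uint64_t pte)
{
        return ((pte & DEMO_PTE_PAT) ? 4u : 0u) |
               ((pte & DEMO_PTE_PCD) ? 2u : 0u) |
               ((pte & DEMO_PTE_PWT) ? 1u : 0u);
}

/* Same shape as (pgprot_val(*vma_prot) & ~_PAGE_CACHE_MASK) | flags. */
static uint64_t demo_set_cache_bits(uint64_t prot, uint64_t flags)
{
        assert((flags & ~DEMO_CACHE_MASK) == 0);
        return (prot & ~DEMO_CACHE_MASK) | flags;
}
</code>

Applied to ''0x8000000007ee5267'', ''demo_pte_phys_addr'' yields ''0x07ee5000'' and ''demo_pte_pat_index'' yields 0. Entry 0 of the PAT is write-back at power-on, which is consistent with the ''got write-back'' message in the log.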
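==== Method 3 ====
Modify the page table entries directly from inside the kernel. The manually allocated pages must be released when the process exits.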

  * [[http://marc.info/?l=linux-mm&m=134736185025322&w=2|What else need to be done if we allocate phys page manually?]]
  * [[http://stackoverflow.com/questions/12419229/how-can-i-make-kernel-reclaim-phys-page-i-allocate-automatically|How can I make kernel reclaim phys page I allocate automatically?]]


====== Locks ======
<blockquote>
RCU supports concurrency between a single updater and multiple readers.
</blockquote>

  * [[wp>Read-copy-update|Read-copy-update (RCU)]]
    * [[http://www.ibm.com/developerworks/cn/linux/l-rcu/|RCU: a new locking mechanism in the Linux 2.6 kernel]]
    * [[http://nano-chicken.blogspot.tw/2011/02/linux-modules14-read-copy-update.html|Linux Modules（14）- Read Copy Update]]
    * [[http://www2.rdrop.com/~paulmck/RCU/whatisRCU.html|What is RCU, Really?]]
    * [[http://lwn.net/Articles/262464/|What is RCU, Fundamentally?]]
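
The read, copy, update steps behind the name can be sketched in user space with C11 atomics. This is only an illustration of the publishing pattern for a single updater, not the kernel implementation: the ''demo_'' names are hypothetical, and the grace period that RCU waits for before reclaiming the old copy is left out entirely, so the caller decides when freeing the old copy is safe.

<code c>
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct demo_cfg {
        int a;
        int b;
};

/* Pointer to the current version; readers only ever load it. */
static _Atomic(struct demo_cfg *) demo_current;

/* Reader: take a snapshot pointer and read through it, without any lock. */
static int demo_read_sum(void)
{
        struct demo_cfg *c =
                atomic_load_explicit(&demo_current, memory_order_acquire);

        return c ? c->a + c->b : 0;
}

/*
 * Single updater: copy the current version, modify the copy, publish it.
 * Returns the old version; it may only be freed once no reader can still
 * hold a pointer to it (the grace period, not modeled here).
 */
static struct demo_cfg *demo_update_a(int a)
{
        struct demo_cfg *old =
                atomic_load_explicit(&demo_current, memory_order_relaxed);
        struct demo_cfg *fresh = malloc(sizeof *fresh);

        assert(fresh != NULL);
        if (old != NULL) {
                *fresh = *old;          /* copy */
        } else {
                fresh->a = 0;
                fresh->b = 0;
        }
        fresh->a = a;                   /* update the private copy */

        /* Publish: readers see either the old or the new version, never a mix. */
        atomic_store_explicit(&demo_current, fresh, memory_order_release);
        return old;
}
</code>

Readers never block and never see a half-updated structure; the cost is moved to the updater, which has to allocate a copy and defer freeing the old one.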
====== Tracepoints ======
  * ''arch/x86/kvm/trace.h'' lists the tracepoints to be generated.<code c>
TRACE_EVENT(kvm_entry,
  TP_PROTO(unsigned int vcpu_id),
  TP_ARGS(vcpu_id),

  TP_STRUCT__entry(
    __field(  unsigned int, vcpu_id   )
  ),

  TP_fast_assign(
    __entry->vcpu_id  = vcpu_id;
  ),

  TP_printk("vcpu %u", __entry->vcpu_id)
);
</code>
    * [[http://lwn.net/Articles/379903/|Using the TRACE_EVENT() macro (Part 1)]]
  * The tracepoint is then inserted in ''vcpu_enter_guest'' (''arch/x86/kvm/x86.c'').<code c>
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
  /* ... omitted ... */

  trace_kvm_entry(vcpu->vcpu_id);
  kvm_x86_ops->run(vcpu);

  /* ... omitted ... */
}
</code>

  * [[http://lwn.net/Articles/410200/|trace-cmd: A front-end for Ftrace]]
  * [[http://renren.it/a/caozuoxitong/Linux/20110604/87950.html|ftrace and its front-end tool trace-cmd (a powerful tool for understanding Linux in depth)]]
  * [[http://blog.csdn.net/arethe/article/details/6293505|[Kernel documentation] Using Linux kernel tracepoints]]
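
The ''trace_kvm_entry()'' call used in ''vcpu_enter_guest'' is not written by hand; it is generated from the ''TRACE_EVENT(kvm_entry, ...)'' declaration. The sketch below is a much simplified user-space analogue of that idea, assuming a single argument and a global text buffer. ''DEMO_TRACE_EVENT'' and every ''demo_'' name are hypothetical, and the real macro does considerably more (ring buffer records, per-event structures, registration).

<code c>
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the trace buffer: holds the most recent formatted event. */
static char demo_trace_buf[128];

/*
 * Generate an on/off switch and a demo_trace_<name>() function for one event.
 * The generated function returns the number of characters written,
 * or 0 when the event is disabled.
 */
#define DEMO_TRACE_EVENT(name, type, arg, fmt)                          \
        static bool demo_trace_##name##_enabled;                        \
        static int demo_trace_##name(type arg)                          \
        {                                                               \
                int n;                                                  \
                                                                        \
                if (!demo_trace_##name##_enabled)                       \
                        return 0;                                       \
                n = snprintf(demo_trace_buf, sizeof demo_trace_buf,     \
                             #name ": " fmt, arg);                      \
                assert(n >= 0 && (size_t)n < sizeof demo_trace_buf);    \
                return n;                                               \
        }

/* Declares demo_trace_kvm_entry(unsigned int vcpu_id). */
DEMO_TRACE_EVENT(kvm_entry, unsigned int, vcpu_id, "vcpu %u")
</code>

With the event enabled, ''demo_trace_kvm_entry(3)'' leaves ''kvm_entry: vcpu 3'' in the buffer; with it disabled, the call returns immediately, which is why a tracepoint left in a hot path can stay cheap.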
====== Drivers and Modules ======
  * [[http://oss.org.cn/kernel-book/ldd3/index.html|Linux Device Drivers, 3rd Edition]]
  * [[http://tldp.org/LDP/lkmpg/2.4/html/book1.htm|The Linux Kernel Module Programming Guide]]
====== Miscellaneous ======
  * [[http://www.ibm.com/developerworks/linux/library/l-kernel-memory-access/|User space memory access from the Linux kernel]]
  * [[http://www.jollen.org/blog/2006/12/linux_device_driver_io_3.html|Linux device driver I/O, #3: "I/O" between kernel space and user space]]
  * [[http://www.lslnet.com/linux/f/docs1/i06/big5136257.htm|copy_from_user and copy_to_user]]
  * [[http://nano-chicken.blogspot.tw/2009/11/linux-modulesi.html|Linux Kernel (1) - Introduction to Linux Modules]]
  * [[http://stackoverflow.com/questions/2264384/how-do-i-use-ioctl-to-manipulate-my-kernel-module|How do I use ioctl() to manipulate my kernel module?]]

  * [[http://stackoverflow.com/questions/9389688/in-what-context-kernel-thread-runs-in-linux|In what context Kernel Thread runs in Linux?]]
  * [[http://stackoverflow.com/questions/2430943/linux-kernel-threads-scheduler|Linux Kernel Threads - scheduler]]
  * [[http://stackoverflow.com/questions/6235897/no-address-space-for-linux-kernel-threads|No address space for Linux Kernel threads]]
===== Bottom Half =====
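While handling an interrupt, the kernel disables interrupts so that the running interrupt handler is not disturbed. The time spent with interrupts disabled should be as short as possible, because the system cannot respond to external interrupts during that window and responsiveness suffers. Interrupt handling is therefore split into a top half and a bottom half: the top half runs with interrupts disabled and is kept short; the bottom half runs later with interrupts enabled.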

<blockquote>
Interrupts can come anytime, when the kernel may want to finish something else it was trying to do. The kernel's goal is therefore to get the interrupt out of the way as soon as possible and defer as much processing as it can. For instance, suppose a block of data has arrived on a network line. When the hardware interrupts the kernel, it could simply mark the presence of data, give the processor back to whatever was running before, and do the rest of the processing later (such as moving the data into a buffer where its recipient process can find it, and then restarting the process). <color red>The activities that the kernel needs to perform in response to an interrupt are thus divided into a critical urgent part that the kernel executes right away and a deferrable part that is left for later.</color>

<cite>http://www.makelinux.net/books/ulk3/understandlk-CHP-4-SECT-1</cite>
</blockquote>

  * [[http://tldp.org/LDP/tlk/kernel/kernel.html|11.1  Bottom Half Handling]]
  * [[http://www.csie.nctu.edu.tw/~tcwu/doc/Linux/Kernel/chapter11/chapter11.htm|11.1 Bottom Half Handling (deferred task processing)]]
  * [[http://www.lslnet.com/linux/f/docs1/i03/big5120943.htm|Bottom Half]]
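
The split between the urgent part and the deferrable part can be modeled in user space with a small queue: the top half only records that data arrived, and the bottom half does the actual processing later. This is a single-threaded simulation under stated assumptions, not kernel code; the ''demo_'' names, the queue length and the "processing" (summing the values) are all hypothetical.

<code c>
#include <assert.h>
#include <stddef.h>

#define DEMO_QUEUE_LEN 8

static int demo_queue[DEMO_QUEUE_LEN];
static size_t demo_head;        /* next slot to read */
static size_t demo_tail;        /* next slot to write */
static long demo_processed_sum; /* result of the deferred work */

/*
 * Top half: would run with interrupts disabled, so it does the minimum,
 * namely queueing the data. Returns -1 if the queue is full.
 */
static int demo_top_half(int data)
{
        if (demo_tail - demo_head == DEMO_QUEUE_LEN)
                return -1;
        demo_queue[demo_tail++ % DEMO_QUEUE_LEN] = data;
        return 0;
}

/*
 * Bottom half: would run later with interrupts enabled and drains the queue.
 * Returns the number of items processed.
 */
static unsigned demo_bottom_half(void)
{
        unsigned handled = 0;

        assert(demo_tail - demo_head <= DEMO_QUEUE_LEN);
        while (demo_head != demo_tail) {
                demo_processed_sum += demo_queue[demo_head++ % DEMO_QUEUE_LEN];
                handled++;
        }
        return handled;
}
</code>

Calling ''demo_top_half'' three times followed by one ''demo_bottom_half'' processes all three items in a single deferred pass, which is the behavior the quoted passage describes for the network example.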
===== Modules =====
  * [[http://blog.wu-boy.com/2010/06/linux-kernel-driver-%E6%92%B0%E5%AF%AB%E7%B0%A1%E5%96%AE-hello-world-module-part-1/|[Linux Kernel] Writing a simple Hello, World module (part 1)]]
  * [[http://tldp.org/HOWTO/Module-HOWTO/|Linux Loadable Kernel Module HOWTO]]
===== MMIO =====
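A range of the physical address space is mapped to device registers and device memory, and the page tables map virtual addresses to this range just as they do for ordinary RAM. As described in [[http://duartes.org/gustavo/blog/post/motherboard-chipsets-memory-map|Motherboard Chipsets and the Memory Map]], the northbridge decides whether a physical address issued by the CPU is routed to RAM or to a device.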

  * [[http://stackoverflow.com/questions/9654504/memory-mapped-io-how-is-it-done|Memory mapped IO - how is it done?]]
  * [[http://stackoverflow.com/questions/9115129/in-x86-platform-does-the-dma-operation-mean-to-move-data-between-mmio-addr-spac|In X86 Platform, does the DMA operation mean to move data between MMIO addr space and system memory addr space?]]
  * [[http://stackoverflow.com/questions/4355117/mmio-pio-info-for-linux|MMIO/PIO Info for Linux]]
  * [[wp>Memory-mapped I/O]]
  * [[http://www.makelinux.net/ldd3/chp-15|Chapter 15. Memory Mapping and DMA]]
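  * [[http://tw.myblog.yahoo.com/max-e/article?mid=690&next=689&l=f&fid=5|PC tips: 4 GB of RAM installed but only 3.8 GB detected, what is going on?]]
  * [[http://blog.csdn.net/better0332/article/details/4748669|Understanding the limits of the 4 GB address space: the MMIO memory-mapping problem]]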
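
From the software side, accessing a memory-mapped register looks like an ordinary load or store through a ''volatile'' pointer. The sketch below shows that access pattern against a register block that lives in normal memory, so it can run in user space. The register layout, the ''DEMO_CTRL_ENABLE'' bit and all ''demo_'' names are hypothetical; a real driver would obtain the pointer by mapping the device's physical address range and would use the kernel's own accessors.

<code c>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register block of an imaginary device. */
struct demo_dev_regs {
        uint32_t data;
        uint32_t status;
        uint32_t control;
};

#define DEMO_CTRL_ENABLE (UINT32_C(1) << 0)

/* volatile forces the compiler to perform every access, in program order. */
static inline uint32_t demo_readl(const volatile uint32_t *reg)
{
        return *reg;
}

static inline void demo_writel(volatile uint32_t *reg, uint32_t value)
{
        *reg = value;
}

/* Read-modify-write: set the enable bit and leave the other bits untouched. */
static void demo_dev_enable(volatile struct demo_dev_regs *regs)
{
        uint32_t ctrl;

        assert(regs != NULL);
        ctrl = demo_readl(&regs->control);
        demo_writel(&regs->control, ctrl | DEMO_CTRL_ENABLE);
}
</code>

On real hardware the same store would leave the CPU as a physical address inside the device's range and be routed to the device instead of RAM.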
====== External Links ======
  * [[http://www.makelinux.net/books/ulk3/main|Understanding the Linux Kernel]]
  * [[http://kernelnewbies.org/|Linux Kernel Newbies]]
  * [[http://lxr.linux.no/|Linux Cross Referencer]]
  * [[http://vger.kernel.org/|VGER.KERNEL.ORG]]
  * [[http://www.mulix.org/lectures/kernel_oopsing/kernel_oopsing.pdf|Linux Kernel Debugging]]
  * [[http://blog.csdn.net/fudan_abc|fudan_abc's Linux kernel column]]
  * [[http://www.longene.org/|Linux compatible kernel (Longene)]]
  * [[http://www.kerneltravel.net/|Linux Kernel Travel]]
  * [[http://www.jamesmolloy.co.uk/index.html|home of James Molloy]]
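  * [[http://www.books.com.tw/exep/prod/booksfile.php?item=0010516379|深入探索 Linux 核心架構 (An In-Depth Exploration of Linux Kernel Architecture)]]
  * [[http://www.welan.com.tw/5356|Linux 內核源代碼情景分析 (Linux Kernel Source Code Scenario Analysis, volumes 1 and 2)]]
    * [[http://www.longene.org/forum/viewtopic.php?f=5&t=29|Linux Kernel Source Code Scenario Analysis (non-scanned edition) download]]
    * [[http://www.cis.nctu.edu.tw/~is92065/book/Linux内核情景分析.pdf|Linux Kernel Scenario Analysis]]
    * [[http://blog.csdn.net/superkiss2|superkiss2's column]]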