====== 使用 ======
- 檢查 CPU 是否支援虛擬化 (Intel 為 ''vmx'' 旗標,AMD 則是 ''svm'')。
$ cat /proc/cpuinfo | grep vmx
$ lsmod | grep kvm
kvm_intel 38370 0
kvm 176424 1 kvm_intel
- 確認機器是否啟用 EPT。N 代表 EPT 未啟用 (硬體不支援,或 ''kvm_intel'' 載入時被停用)。
$ cat /sys/module/kvm_intel/parameters/ept
N
* [[http://www.mail-archive.com/kvm@vger.kernel.org/msg39010.html|Disable EPT]]
- 運行虛擬機。注意! 虛擬機啟動時只會看到 QEMU 進程 (KVM 是內核模組,不會以獨立進程出現)。QEMU 在開啟 KVM 的情況下,直接將客戶機代碼運行在宿主機的 CPU 上;其餘部分諸如客戶機物理內存分配或 IO 模擬,均與不開啟 KVM 的情況相同。
$ qemu-system-i386 -enable-kvm linux-0.2.img -vnc 0.0.0.0:1
- 安裝 ''trace-cmd''。需要內核支援 [[http://lwn.net/Articles/115405/|Debugfs]] 和安裝 [[http://www.swig.org/|SWIG]]。
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/trace-cmd.git trace-cmd
$ cd trace-cmd
$ make prefix=$INSTALL; make prefix=$INSTALL install
$ export PATH=$INSTALL/bin:$PATH
# 可能因為 NFS 的關係,必須在 /tmp 運行 trace-cmd。
$ trace-cmd record -b 20000 -e kvm
$ qemu-system-i386 -enable-kvm linux-0.2.img -vnc 0.0.0.0:1
# trace-cmd 會在當前目錄底下建立 trace.dat。
$ trace-cmd report
* 輸出內容類似底下這樣。
CPU 1 is empty
CPU 3 is empty
CPU 4 is empty
cpus=8
# /sys/kernel/debug/tracing/trace
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
qemu-system-i38-14653 [000] 7751.718118: kvm_fpu: load
qemu-system-i38-14653 [000] 7751.718122: kvm_entry: vcpu 0
qemu-system-i38-14653 [000] 7751.718125: kvm_exit: [FAILED TO PARSE] exit_reason=0 guest_rip=0xfff0
qemu-system-i38-14653 [000] 7751.718126: kvm_page_fault: address ffff0 error_code 1d
qemu-system-i38-14653 [000] 7751.718141: kvm_entry: vcpu 0
qemu-system-i38-14653 [000] 7751.718142: kvm_exit: [FAILED TO PARSE] exit_reason=0 guest_rip=0xc701
qemu-system-i38-14653 [000] 7751.718143: kvm_page_fault: address feffd066 error_code 9
qemu-system-i38-14653 [000] 7751.718149: kvm_entry: vcpu 0
qemu-system-i38-14653 [000] 7751.718150: kvm_exit: [FAILED TO PARSE] exit_reason=30 guest_rip=0xc701
qemu-system-i38-14653 [000] 7751.718152: kvm_pio: pio_write at 0x70 size 1 count 1
* [[http://www.linux-kvm.org/page/Tracing|Tracing]]
====== 概觀 ======
* [[http://labs.cre8tivetech.com/2010/09/virtualization-with-kvm-introduction/|Virtualization with KVM: Introduction]]
* [[http://www.linuxinsight.com/files/kvm_whitepaper.pdf|KVM: Kernel-based Virtualization Driver by Qumranet]]
* [[http://www.redhat.com/f/pdf/rhev/DOC-KVM.pdf|KVM – KERNEL BASED VIRTUAL MACHINE]]
* [[http://benjr.tw/node/534|KVM(Kernel-based Virtual Machine) + KQEMU]]
* [[http://benjr.tw/node/532|Kernel-based Virtual Machine]]
* [[http://blog.vmsplice.net/2011/03/qemu-internals-big-picture-overview.html|QEMU Internals: Big picture overview]]
$ git clone git://git.kernel.org/pub/scm/virt/kvm/kvm.git
* [[http://lwn.net/Articles/216886/|Stable kvm userspace interface]]
KVM 初始化概觀請見 [[http://www.linux-kvm.org/page/Initialization|The initialization of a kvm]]。
- 載入 kvm.ko。
- 開啟 ''/dev/kvm'' 取得 KVM fd,此為 KVM 暴露給上層應用程序 (QEMU) 的介面,應用程序透過 ioctl 操控 KVM fd 對 KVM 發出命令。
- QEMU 對 kvm fd 發出 KVM_CREATE_VM 命令,取得 VM fd。底下均需要檢查 KVM 版本是否有提供該功能。
- KVM_SET_TSS_ADDR: 在客戶機物理位址空間中指定一段 3 個頁面的區域,供 Intel VT 存放 [[wp>Task state segment|Task state segment (TSS)]] 之用 (詳見 ''Documentation/virtual/kvm/api.txt'')。
- KVM_SET_MEMORY_REGION (已建議改用 KVM_SET_USER_MEMORY_REGION): 用來修改客戶機物理內存分配。
- KVM_CREATE_IRQCHIP
- QEMU 對 VM fd 發出 KVM_CREATE_VCPU 命令,取得 VCPU fd。客戶機中每一個 VCPU 都要有一個相對應的 VCPU fd。QEMU 對 VCPU fd 發出 KVM_RUN 命令,從根模式用戶態切換至根模式內核態進入 KVM,KVM 再透過 VMEntry 運行客戶機,此時從根模式內核態切換至非根模式。
強烈建議閱讀 [[http://blog.csdn.net/yearn520/article/details/6461047|KVM 实现机制]] 和 [[https://www.kernel.org/doc/mirror/ols2007v1.pdf#page=225|kvm: the Linux Virtual Machine Monitor]] 一文。KVM 和 QEMU 分別運行在根模式中的內核態 (kernel) 與用戶態 (user),客戶機則是運行在非根模式 (guest)。
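底下是一個極簡的用戶態示意程式 (假設性範例,非 QEMU 原始碼),串起上述的 ioctl 流程: 依序取得 KVM fd、VM fd、VCPU fd,mmap ''kvm_run'',最後發出 KVM_RUN;錯誤處理與暫存器設定皆省略。
/* 假設性範例: 在支援 KVM 的 Linux 上以 gcc 編譯即可。 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

int main(void)
{
    int kvm_fd = open("/dev/kvm", O_RDWR);                  /* KVM fd */
    printf("API version = %d\n", ioctl(kvm_fd, KVM_GET_API_VERSION, 0));

    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);             /* VM fd */

    /* 客戶機物理內存來自本進程 (用戶態) 的虛擬內存 */
    void *mem = mmap(NULL, 0x100000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    struct kvm_userspace_memory_region region = {
        .slot            = 0,
        .guest_phys_addr = 0,
        .memory_size     = 0x100000,
        .userspace_addr  = (uint64_t)mem,
    };
    ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);          /* VCPU fd */
    int mmap_size = ioctl(kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu_fd, 0);

    /* 此處省略: 載入客戶機代碼、KVM_SET_REGS / KVM_SET_SREGS 設定暫存器 */
    ioctl(vcpu_fd, KVM_RUN, 0);                /* 切入內核態,再 VMEntry 至非根模式 */
    printf("exit_reason = %d\n", run->exit_reason);  /* VMExit 後由用戶態檢視原因 */
    return 0;
}
回到內核這一側,''kvm_dev_ioctl'' (''virt/kvm/kvm_main.c'') 負責處理對 KVM fd (''/dev/kvm'') 下達的命令: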
static long kvm_dev_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
long r = -EINVAL;
switch (ioctl) {
case KVM_GET_API_VERSION:
r = -EINVAL;
if (arg)
goto out;
r = KVM_API_VERSION;
break;
case KVM_CREATE_VM:
r = kvm_dev_ioctl_create_vm(arg);
break;
... 略 ...
}
}
- ''kvm_vm_ioctl'' (''virt/kvm/kvm_main.c'') 負責處理對 VM fd 下達的命令。
static long kvm_vm_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
switch (ioctl) {
case KVM_CREATE_VCPU:
r = kvm_vm_ioctl_create_vcpu(kvm, arg);
if (r < 0)
goto out;
break;
case KVM_SET_USER_MEMORY_REGION: {
struct kvm_userspace_memory_region kvm_userspace_mem;
r = -EFAULT;
// 注意! 客戶機物理內存是從 QEMU 虛擬內存分配。底下將 KVM_SET_USER_MEMORY_REGION
// 後面接的 kvm_userspace_mem 參數從用戶態拷貝到內核態。
if (copy_from_user(&kvm_userspace_mem, argp,
sizeof kvm_userspace_mem))
goto out;
r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem, 1);
if (r)
goto out;
break;
}
... 略 ...
}
}
* kvm_vm_ioctl_set_memory_region -> kvm_set_memory_region -> \_\_kvm_set_memory_region。
/*
* Allocate some memory and give it an address in the guest physical address
* space.
*
* Discontiguous memory is allowed, mostly for framebuffers.
*
* Must be called holding mmap_sem for write.
*/
int __kvm_set_memory_region(struct kvm *kvm,
struct kvm_userspace_memory_region *mem,
int user_alloc)
{
}
- ''kvm_vcpu_ioctl'' (''virt/kvm/kvm_main.c'') 負責處理對 VCPU fd 下達的命令。
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
switch (ioctl) {
case KVM_RUN:
r = -EINVAL;
if (arg)
goto out;
r = kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
... 略 ...
}
====== CPU ======
* [[http://www.pagetable.com/?p=348|Intel VT VMCS Layout]]
* [[http://boy-asmc.blogspot.com/2009/11/intel-vmx.html|Intel VMX]]
* [[http://www.360doc.com/content/11/0505/10/6580811_114483497.shtml|與 VMCS 初始化有關]]
* 如何載入客戶機第一條指令並運行?
===== 資料結構 =====
* 在 ''arch/x86/kvm/vmx.c'' 定義 VMCS。
struct vmcs {
u32 revision_id;
u32 abort;
char data[0];
};
* VMCS (Virtual Machine Control Structure)
* [[http://download.intel.com/products/processor/manual/326019.pdf|Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3C]] 第 24 章。
* AMD 稱此結構為 VMCB (Virtual Machine Control Block),[[http://lxr.free-electrons.com/source/arch/x86/include/asm/svm.h#L180|vmcb]]。
* 存取 VMCS 是用 ''vmcs_writel'' 或 ''vmcs_readl'' 及其包裝函式。
static void vmcs_writel(unsigned long field, unsigned long value)
{
u8 error;
// arch/x86/include/asm/vmx.h
// #define ASM_VMX_VMWRITE_RAX_RDX ".byte 0x0f, 0x79, 0xd0"
// 將 RAX 內容寫入用 RDX 索引到的 VMCS 欄位,RAX 和 RDX 分別由參數 value 和 field 給值。
asm volatile (__ex(ASM_VMX_VMWRITE_RAX_RDX) "; setna %0"
: "=q"(error) : "a"(value), "d"(field) : "cc");
if (unlikely(error))
vmwrite_error(field, value);
}
static __always_inline unsigned long vmcs_readl(unsigned long field)
{
unsigned long value;
// #define ASM_VMX_VMREAD_RDX_RAX ".byte 0x0f, 0x78, 0xd0"
asm volatile (__ex_clear(ASM_VMX_VMREAD_RDX_RAX, "%0")
: "=a"(value) : "d"(field) : "cc");
return value;
}
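* ''vmcs_writel''/''vmcs_readl'' 之上另有 ''vmcs_write16''、''vmcs_write32''、''vmcs_write64'' 等包裝函式,形式大致如下 (簡化示意,實際內容請見 ''arch/x86/kvm/vmx.c''):
// 簡化示意: 各寬度的包裝最終都落到 vmcs_writel / vmcs_readl。
static void vmcs_write32(unsigned long field, u32 value)
{
        vmcs_writel(field, value);
}

static void vmcs_write64(unsigned long field, u64 value)
{
        vmcs_writel(field, value);
#ifndef CONFIG_X86_64
        // 32 位元宿主機上,64 位元欄位需分兩次寫入 (高 32 位寫到 field + 1)。
        vmcs_writel(field + 1, value >> 32);
#endif
}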
* [[http://tptp.cc/mirrors/siyobik.info/instruction/VMWRITE|VMWRITE]]
* [[http://tptp.cc/mirrors/siyobik.info/instruction/VMREAD|VMREAD]]
* 在 ''include/linux/kvm_host.h'' 定義 ''kvm'',此即為 VM。
struct kvm {
spinlock_t mmu_lock;
struct mutex slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
struct kvm_memslots *memslots;
struct srcu_struct srcu;
#ifdef CONFIG_KVM_APIC_ARCHITECTURE
u32 bsp_vcpu_id; // SMP 一般支援 APIC,此時其中一個 VCPU 扮演 BSP (bootstrap processor)。
#endif
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS]; // VM 內可含多個 VCPU。
struct kvm_arch arch; // 不同的平台會定義自己的 kvm_arch。
... 略 ...
};
* ''kvm_arch'' (''arch/x86/include/asm/kvm_host.h'')。
struct kvm_arch {
unsigned int n_used_mmu_pages;
unsigned int n_requested_mmu_pages;
unsigned int n_max_mmu_pages;
unsigned int indirect_shadow_pages;
struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
/*
* Hash table of struct kvm_mmu_page.
*/
struct list_head active_mmu_pages;
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
int iommu_flags;
struct kvm_pic *vpic;
struct kvm_ioapic *vioapic; // x86 上的 IOAPIC。
... 略 ...
};
* 在 ''include/linux/kvm_host.h'' 定義 ''kvm_vcpu'',此即為 VCPU。
struct kvm_vcpu {
struct kvm *kvm;
int cpu;
int vcpu_id;
int srcu_idx;
int mode; // VCPU 處於何種模式,如: OUTSIDE_GUEST_MODE 或 IN_GUEST_MODE。
unsigned long requests;
unsigned long guest_debug;
struct mutex mutex;
struct kvm_run *run; // 保存諸如 KVM VMExit 相關訊息。
int fpu_active;
int guest_fpu_loaded, guest_xcr0_loaded;
wait_queue_head_t wq;
struct pid *pid;
int sigset_active;
sigset_t sigset;
struct kvm_vcpu_stat stat;
... 略 ...
struct kvm_vcpu_arch arch; /* 不同 ISA 有自己的 kvm_vcpu_arch,其中包含平台特定的暫存器組。 */
};
* ''arch/x86/include/asm/kvm_host.h'' 定義 x86 自己的 ''kvm_vcpu_arch''。
struct kvm_vcpu_arch {
unsigned long regs[NR_VCPU_REGS];
// x86 SMP 平台,CPU 內有一個 LAPIC 接收 IOAPIC 送上來的中斷。
// 在此以軟體模擬。
struct kvm_lapic *apic; /* kernel irqchip context */
... 略 ...
/*
* Paging state of the vcpu
*
* If the vcpu runs in guest mode with two level paging this still saves
* the paging mode of the l1 guest. This context is always used to
* handle faults.
*/
struct kvm_mmu mmu;
... 略 ...
};
* ''struct kvm_ioapic'' (''virt/kvm/ioapic.h'')。
struct kvm_ioapic {
u64 base_address;
u32 ioregsel;
u32 id;
u32 irr;
u32 pad;
// IOAPIC 會將從週邊收到的中斷轉發給 CPU 的 LAPIC。
union kvm_ioapic_redirect_entry redirtbl[IOAPIC_NUM_PINS];
unsigned long irq_states[IOAPIC_NUM_PINS];
struct kvm_io_device dev;
struct kvm *kvm;
void (*ack_notifier)(void *opaque, int irq);
spinlock_t lock;
DECLARE_BITMAP(handled_vectors, 256);
};
* ''struct kvm_lapic'' (''arch/x86/kvm/lapic.h'')。
struct kvm_lapic {
unsigned long base_address;
struct kvm_io_device dev;
struct kvm_timer lapic_timer;
u32 divide_count;
struct kvm_vcpu *vcpu; // 指向所屬的 VCPU。
bool irr_pending;
void *regs;
gpa_t vapic_addr;
struct page *vapic_page;
};
===== 內核模組初始化 =====
* [[http://www.linux-kvm.org/page/Small_look_inside|A small look inside]] 一文針對 AMD 的 SVM 所寫,底下是以 VMX 為範例。執行路徑大略如下:
vmx_init -> kvm_init -> kvm_arch_init
- 不同平台有不同的初始入口,呼叫 ''kvm_init'' 註冊平台特定的回調函式。以 VMX 為例,''vmx_init'' (''arch/x86/kvm/vmx.c'') 為其入口函式。
static struct kvm_x86_ops vmx_x86_ops = {
... 略 ...
.vcpu_create = vmx_create_vcpu,
.vcpu_free = vmx_free_vcpu,
.vcpu_reset = vmx_vcpu_reset,
... 略 ...
.set_cr3 = vmx_set_cr3, // 設置客戶機 CR3。
.set_tdp_cr3 = vmx_set_cr3, // 設置 EPTP。
.handle_exit = vmx_handle_exit, // 當客戶機 VMExit 時,陷入 VMM,交由此函式處理。
};
static int __init vmx_init(void)
{
... 略 ...
r = kvm_init(&vmx_x86_ops, sizeof(struct vcpu_vmx),
__alignof__(struct vcpu_vmx), THIS_MODULE);
... 略 ...
if (enable_ept) {
kvm_mmu_set_mask_ptes(0ull, 0ull, 0ull, 0ull,
VMX_EPT_EXECUTABLE_MASK);
ept_set_mmio_spte_mask();
kvm_enable_tdp();
} else
kvm_disable_tdp();
return 0;
}
- ''kvm_init'' 初始化 KVM。
int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
struct module *module)
{
int r;
int cpu;
// 註冊平台特定的回調函式供後續函式使用,如: kvm_arch_hardware_setup。
r = kvm_arch_init(opaque);
... 略 ...
// 設置 VMCS 並檢測各項硬體支援,如: 是否支援 EPT/NPT、VPID 等。
r = kvm_arch_hardware_setup();
}
* 不同平台對 KVM 有不同的初始化。
int kvm_arch_init(void *opaque)
{
int r;
struct kvm_x86_ops *ops = (struct kvm_x86_ops *)opaque;
/* 檢查硬體是否支援 KVM。 */
if (!ops->cpu_has_kvm_support()) {
printk(KERN_ERR "kvm: no hardware support\n");
r = -EOPNOTSUPP;
goto out;
}
if (ops->disabled_by_bios()) {
printk(KERN_ERR "kvm: disabled by bios\n");
r = -EOPNOTSUPP;
goto out;
}
r = kvm_mmu_module_init();
kvm_set_mmio_spte_mask();
kvm_init_msr_list();
kvm_x86_ops = ops; // 相當重要!
kvm_timer_init();
perf_register_guest_info_callbacks(&kvm_guest_cbs);
if (cpu_has_xsave)
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
return 0;
}
* ''kvm_arch_hardware_setup'' (''arch/x86/kvm/x86.c'') 呼叫 ''kvm_x86_ops'' 中註冊的 ''hardware_setup'' 函式,設置 VMCS 並做相關硬體檢測。
static __init int hardware_setup(void)
{
if (setup_vmcs_config(&vmcs_config) < 0)
return -EIO;
if (boot_cpu_has(X86_FEATURE_NX))
kvm_enable_efer_bits(EFER_NX);
if (!cpu_has_vmx_vpid())
enable_vpid = 0;
if (!cpu_has_vmx_ept() ||
!cpu_has_vmx_ept_4levels()) {
enable_ept = 0;
enable_unrestricted_guest = 0;
}
... 略 ...
// 分配 VMCS 給所有的 VCPU。
return alloc_kvm_area();
}
* ''setup_vmcs_config'' (''arch/x86/kvm/vmx.c'') 設定傳入的 VMCS ''vmcs_conf'' 供其它函式使用。
static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
{
/* 設定客戶機何時需要 VMExit */
min = CPU_BASED_HLT_EXITING |
... 略 ...
CPU_BASED_MONITOR_EXITING |
CPU_BASED_INVLPG_EXITING |
CPU_BASED_RDPMC_EXITING;
/* 設定使用何種硬體加速 */
opt = CPU_BASED_TPR_SHADOW |
CPU_BASED_USE_MSR_BITMAPS |
CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
}
* 上述 VMCS 控制位 (CPU_BASED_*) 定義於 ''arch/x86/include/asm/vmx.h''。
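節錄其中幾個 (位元值對應 Intel SDM 的 primary processor-based VM-execution controls,實際數值請以內核標頭與 SDM 為準):
#define CPU_BASED_HLT_EXITING                   0x00000080
#define CPU_BASED_INVLPG_EXITING                0x00000200
#define CPU_BASED_CR3_LOAD_EXITING              0x00008000
#define CPU_BASED_CR3_STORE_EXITING             0x00010000
#define CPU_BASED_TPR_SHADOW                    0x00200000
#define CPU_BASED_USE_MSR_BITMAPS               0x10000000
#define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS   0x80000000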
* ''alloc_kvm_area'' (''arch/x86/kvm/vmx.c'') 分配 VMCS 給所有的 VCPU。
static __init int alloc_kvm_area(void)
{
int cpu;
for_each_possible_cpu(cpu) {
struct vmcs *vmcs;
vmcs = alloc_vmcs_cpu(cpu);
if (!vmcs) {
free_kvm_area();
return -ENOMEM;
}
per_cpu(vmxarea, cpu) = vmcs;
}
return 0;
}
static struct vmcs *alloc_vmcs_cpu(int cpu)
{
int node = cpu_to_node(cpu);
struct page *pages;
struct vmcs *vmcs;
pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
if (!pages)
return NULL;
vmcs = page_address(pages);
memset(vmcs, 0, vmcs_config.size);
vmcs->revision_id = vmcs_config.revision_id; /* vmcs revision id */
return vmcs;
}
* [[http://biancheng.dnbcw.info/linux/335606.html|kernel 中的 per_cpu 变量]]
===== 設置 VMCS =====
* 每一個實體 CPU 皆可綁定一個 VMCS。
struct vmcs {
u32 revision_id;
u32 abort;
char data[0];
};
/*
* Track a VMCS that may be loaded on a certain CPU. If it is (cpu!=-1), also
* remember whether it was VMLAUNCHed, and maintain a linked list of all VMCSs
* loaded on this CPU (so we can clear them if the CPU goes down).
*/
struct loaded_vmcs {
struct vmcs *vmcs;
int cpu;
int launched;
struct list_head loaded_vmcss_on_cpu_link;
};
struct vcpu_vmx {
struct kvm_vcpu vcpu;
... 略 ...
/*
* loaded_vmcs points to the VMCS currently used in this vcpu. For a
* non-nested (L1) guest, it always points to vmcs01. For a nested
* guest (L2), it points to a different VMCS.
*/
struct loaded_vmcs vmcs01;
struct loaded_vmcs *loaded_vmcs;
bool __launched; /* temporary, used in vmx_vcpu_run */
... 略 ...
};
- ''kvm_vm_ioctl'' (''virt/kvm/kvm_main.c'')。
static long kvm_vm_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
switch (ioctl) {
case KVM_CREATE_VCPU:
r = kvm_vm_ioctl_create_vcpu(kvm, arg);
if (r < 0)
goto out;
break;
... 略 ...
}
}
- ''kvm_vm_ioctl_create_vcpu'' (''virt/kvm/kvm_main.c'') 替每一個 VCPU 建立一個 ''kvm_vcpu''。
static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
{
int r;
struct kvm_vcpu *vcpu, *v;
// 替每一個 VCPU 建立一個 kvm_vcpu。
vcpu = kvm_arch_vcpu_create(kvm, id);
/* Now it's all set up, let userspace reach it */
kvm_get_kvm(kvm);
r = create_vcpu_fd(vcpu); // 返回 VCPU fd 給 QEMU。
... 略 ...
// 在此 VM 紀錄該 VCPU。
kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
smp_wmb();
atomic_inc(&kvm->online_vcpus);
mutex_unlock(&kvm->lock);
return r;
}
- ''kvm_arch_vcpu_create'' (''arch/x86/kvm/x86.c'')。
struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
unsigned int id)
{
if (check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
printk_once(KERN_WARNING
"kvm: SMP vm created on host with unstable TSC; "
"guest TSC will not be reliable\n");
return kvm_x86_ops->vcpu_create(kvm, id);
}
- ''vmx_create_vcpu'' (''arch/x86/kvm/vmx.c'')。
static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
{
... 略 ...
vmx->loaded_vmcs->vmcs = alloc_vmcs();
... 略 ...
}
static struct vmcs *alloc_vmcs_cpu(int cpu)
{
int node = cpu_to_node(cpu);
struct page *pages;
struct vmcs *vmcs;
pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order);
if (!pages)
return NULL;
vmcs = page_address(pages);
memset(vmcs, 0, vmcs_config.size);
vmcs->revision_id = vmcs_config.revision_id; /* vmcs revision id */
return vmcs;
}
static struct vmcs *alloc_vmcs(void)
{
return alloc_vmcs_cpu(raw_smp_processor_id());
}
- ''vmx_vcpu_setup'' (''arch/x86/kvm/vmx.c'') 設置 VMCS 中的各項欄位。
/*
* Sets up the vmcs for emulated real mode.
*/
static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
{
#ifdef CONFIG_X86_64
unsigned long a;
#endif
int i;
/* I/O */
vmcs_write64(IO_BITMAP_A, __pa(vmx_io_bitmap_a));
vmcs_write64(IO_BITMAP_B, __pa(vmx_io_bitmap_b));
... 略 ...
}
- 不同的平台實作不同的 ''kvm_arch_vcpu_setup''。
int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
{
int r;
vcpu->arch.mtrr_state.have_fixed = 1;
vcpu_load(vcpu);
r = kvm_arch_vcpu_reset(vcpu);
if (r == 0)
r = kvm_mmu_setup(vcpu);
vcpu_put(vcpu);
return r;
}
* vcpu_load -> kvm_arch_vcpu_load。
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
/* Address WBINVD may be executed by guest */
if (need_emulate_wbinvd(vcpu)) {
if (kvm_x86_ops->has_wbinvd_exit())
cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
else if (vcpu->cpu != -1 && vcpu->cpu != cpu)
smp_call_function_single(vcpu->cpu,
wbinvd_ipi, NULL, 1);
}
kvm_x86_ops->vcpu_load(vcpu, cpu);
... 略 ...
}
===== 運行客戶機 =====
QEMU (用戶態) 針對 VCPU 發起 KVM_RUN 命令,KVM (內核態) 處理該命令,並切至非根模式運行客戶機。
KVM_RUN -> kvm_vcpu_ioctl (kvm_main.c) -> kvm_arch_vcpu_ioctl_run (x86.c)
-> __vcpu_run (x86.c) -> vcpu_enter_guest (x86.c) -> kvm_x86_ops->run(vcpu) (vmx_vcpu_run in vmx.c)
- ''kvm_vcpu_ioctl'' (''virt/kvm/kvm_main.c'') 處理針對 VCPU 發起的命令。
static long kvm_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
switch (ioctl) {
case KVM_RUN:
r = -EINVAL;
if (arg)
goto out;
r = kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);
trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
break;
... 略 ...
}
- ''kvm_arch_vcpu_ioctl_run'' (''arch/x86/kvm/x86.c'')。
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
... 略 ...
r = __vcpu_run(vcpu);
out:
post_kvm_run_save(vcpu);
if (vcpu->sigset_active)
sigprocmask(SIG_SETMASK, &sigsaved, NULL);
return r;
}
- \_\_vcpu_run (''arch/x86/kvm/x86.c'')。
static int __vcpu_run(struct kvm_vcpu *vcpu)
{
... 略 ...
while (r > 0) {
if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
!vcpu->arch.apf.halted)
r = vcpu_enter_guest(vcpu);
else {
... 略 ...
}
... 略 ...
}
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
vapic_exit(vcpu);
return r;
}
- ''vcpu_enter_guest'' (''arch/x86/kvm/x86.c'') 切入非根模式。
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
// 檢查 VCPU 是否有待處理的事件。
if (vcpu->requests) {
}
// 注入中斷至 VCPU。
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
inject_pending_event(vcpu);
}
// 載入客戶機頁表。
r = kvm_mmu_reload(vcpu);
kvm_guest_enter();
// 進入非根模式,運行客戶機。
kvm_x86_ops->run(vcpu);
... 略 ...
// 處理 VMExit。
r = kvm_x86_ops->handle_exit(vcpu);
out:
return r;
}
* ''kvm_guest_enter''[(http://www.mail-archive.com/kvm@vger.kernel.org/msg52746.html)]。
static inline void kvm_guest_enter(void)
{
BUG_ON(preemptible());
account_system_vtime(current);
current->flags |= PF_VCPU;
/* KVM does not hold any references to rcu protected data when it
* switches CPU into a guest mode. In fact switching to a guest mode
* is very similar to exiting to userspase from rcu point of view. In
* addition CPU may stay in a guest mode for quite a long time (up to
* one time slice). Lets treat guest mode as quiescent state, just like
* we do with user-mode execution.
*/
rcu_virt_note_context_switch(smp_processor_id());
}
* ''rcu_virt_note_context_switch'' 是 ''rcu_note_context_switch'' 的包裝。''rcu'' 前綴表示它屬於 RCU (Read-Copy-Update) 子系統,用來向 RCU 回報此 CPU 進入靜止狀態 (quiescent state)。
static inline void rcu_virt_note_context_switch(int cpu)
{
rcu_note_context_switch(cpu);
}
/*
* Note a context switch. This is a quiescent state for RCU-sched,
* and requires special handling for preemptible RCU.
* The caller must have disabled preemption.
*/
void rcu_note_context_switch(int cpu)
{
trace_rcu_utilization("Start context switch");
rcu_sched_qs(cpu);
trace_rcu_utilization("End context switch");
}
- ''vmx_vcpu_run'' 之前已註冊在 ''kvm_x86_ops'' 中。''vmx_vcpu_run'' (''arch/x86/kvm/vmx.c'') 載入必要的客戶機狀態,並發起 VMEntry 切換至客戶機模式 [(http://people.cs.nctu.edu.tw/~chenwj/log/QEMU/zruan0-2012-07-03.txt)]。
static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
{
vmx->__launched = vmx->loaded_vmcs->launched;
asm(
... 略 ...
// 硬體並不會將所有客戶機暫存器載入至 CPU,部分交由 KVM 處理。
/* Load guest registers. Don't clobber flags. */
"mov %c[rax](%0), %%"R"ax \n\t"
"mov %c[rbx](%0), %%"R"bx \n\t"
"mov %c[rdx](%0), %%"R"dx \n\t"
"mov %c[rsi](%0), %%"R"si \n\t"
"mov %c[rdi](%0), %%"R"di \n\t"
"mov %c[rbp](%0), %%"R"bp \n\t"
"mov %c[rcx](%0), %%"R"cx \n\t" /* kills %0 (ecx) */
// x86/include/asm/vmx.h 以 hex 定義 ASM_VMX_VMLAUNCH 和 ASM_VMX_VMRESUME,
// 這是因應舊有的組譯器認不得 VMX_VMLAUNCH 和 VMX_VMRESUME 指令。
/* Enter guest mode */
"jne .Llaunched \n\t"
__ex(ASM_VMX_VMLAUNCH) "\n\t"
"jmp .Lkvm_vmx_return \n\t"
".Llaunched: " __ex(ASM_VMX_VMRESUME) "\n\t"
".Lkvm_vmx_return: "
// 返回根模式內核態。
/* Save guest registers, load host registers, keep flags */
... 略 ...
);
... 略 ...
}
- ''vmx_handle_exit'' (''arch/x86/kvm/vmx.c'') 之前已註冊在 ''kvm_x86_ops'' 中。
/*
* The guest has exited. See if we can fix it or if we need userspace
* assistance.
*/
static int vmx_handle_exit(struct kvm_vcpu *vcpu)
{
... 略 ...
if (exit_reason < kvm_vmx_max_exit_handlers
&& kvm_vmx_exit_handlers[exit_reason])
return kvm_vmx_exit_handlers[exit_reason](vcpu);
else {
vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->run->hw.hardware_exit_reason = exit_reason;
}
return 0;
}
* ''kvm_vmx_exit_handlers'' 針對各種 VMExit 的來源定義對應的處理函式。
/*
* The exit handlers return 1 if the exit was handled fully and guest execution
* may resume. Otherwise they set the kvm_run parameter to indicate what needs
* to be done to userspace and return 0.
*/
static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_EXCEPTION_NMI] = handle_exception,
... 略 ...
[EXIT_REASON_CR_ACCESS] = handle_cr,
[EXIT_REASON_INVLPG] = handle_invlpg,
[EXIT_REASON_MONITOR_INSTRUCTION] = handle_invalid_op,
};
* 以 INVLPG 為例,當客戶機因為執行 INVLPG 導致 VMExit 陷入 VMM,VMM 會呼叫 ''handle_invlpg''。
static int handle_invlpg(struct kvm_vcpu *vcpu)
{
// 進一步讀取 VMExit 的原因。對 INVLPG 而言,此為欲剔除的 GVA。
unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
// 交由 KVM 處理。
kvm_mmu_invlpg(vcpu, exit_qualification);
// 跳過此條已被處理過的 (客戶機) INVLPG,將客戶機 eip 指向下一條指令。
skip_emulated_instruction(vcpu);
return 1;
}
* ''kvm_mmu_invlpg'' (''kvm/mmu.c'')。
void kvm_mmu_invlpg(struct kvm_vcpu *vcpu, gva_t gva)
{
// 依情況不同有不同實現方式。
vcpu->arch.mmu.invlpg(vcpu, gva);
// 向 VCPU 注入 KVM_REQ_TLB_FLUSH 的 request。
kvm_mmu_flush_tlb(vcpu);
++vcpu->stat.invlpg;
}
處於非根模式的 CPU 在執行特定指令時,會返回 KVM。客戶機作業系統亦可主動透過 VMCALL 返回 KVM [(http://people.cs.nctu.edu.tw/~chenwj/log/QEMU/avi-2012-07-11.txt)]。
===== 模擬客戶機指令 =====
早期 VT-x 版本無法正確處理實模式的指令,因此需跳回至 KVM 模擬。此外,MMIO 也需要交由 KVM 模擬 [(http://people.cs.nctu.edu.tw/~chenwj/log/QEMU/zruan0-2012-07-03.txt)]。
====== Memory ======
作業系統創建頁表,由硬體搜尋頁表作位址轉換,同時以 TLB (硬體) 作快取。作業系統和硬體一同協作以維護頁表和 TLB 內容的一致性。內存虛擬化需要經過底下兩層位址轉換:
* GVA (Guest Virtual Address) -> GPA (Guest Physical Address)
* 由客戶機作業系統透過頁表作轉換,即傳統做法
* GPA (Guest Physical Address) -> HPA (Host Physical Address)
* 由 VMM 負責
影子頁表用來加速位址轉換 (GVA -> HPA)。當客戶機作業系統修改頁表時,影子頁表也需要修改。這屬於軟體上的加速。請見 [[http://events.linuxfoundation.org/slides/2011/linuxcon-japan/lcj2011_guangrong.pdf|KVM MMU Virtualization]] 第 7 頁。
We can't rely on invlpg and mov cr3 to tell us when we need to invalidate shadow page table entries.
So, we track guest page table modifications ourselves:
every shadowed guest page is write protected against guest modifications;
if the guest tries to modify, we trap and emulate the modifying instruction;
because we know the address, we can clear the associated shadow page table entry(ies).
支援 EPT/NPT 的硬體上仍然只有一份 TLB,其中存放 GVA -> HPA 的映射 [(http://people.cs.nctu.edu.tw/~chenwj/log/QEMU/agraf-2012-07-09.txt)]。[[http://en.wikipedia.org/wiki/Translation_lookaside_buffer#Virtualization_and_x86_TLB|Virtualization and x86 TLB]]
* [[http://www.linux-kvm.org/wiki/images/e/e5/KvmForum2007%24shadowy-depths-of-the-kvm-mmu.pdf|The Shadowy Depths of the KVM MMU]]
* [[http://developer.amd.com/assets/NPT-WP-1%201-final-TM.pdf|AMD-V™ Nested Paging]]
* [[wp>Extended Page Table]]
* [[http://www.linux-kvm.org/wiki/images/c/c7/KvmForum2008$kdf2008_11.pdf|Extending KVM with new Intel ® Virtualization technology]]
* [[http://lists.gnu.org/archive/html/qemu-devel/2012-01/msg02580.html|[Qemu-devel] nested page table translation for non-x86 operating system]]
* [[http://phorum.study-area.org/index.php?topic=51292.0|[硬體技術]記憶體管理虛擬化:AMD NPT/Intel EPT簡介]]
* [[http://blog.chinaunix.net/space.php?uid=20157960&do=blog&id=1974351|AMD Secure Virtual Machine]]
* [[http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf|Performance Evaluation of Intel EPT Hardware Assist]]
* [[http://www.vmware.com/pdf/RVI_performance.pdf|Performance Evaluation of AMD RVI Hardware Assist]]
* [[http://calab.kaist.ac.kr/~jhuh/papers/ahn_isca2012.pdf|Revisiting Hardware-Assisted Page Walks for Virtualized Systems]]
* [[https://www.ibm.com/developerworks/mydeveloperworks/blogs/a2674a1d-a968-4f17-998f-b8b38497c9f7/entry/address_types_in_kvm_qemu9?lang=zh|Address Types in KVM & QEMU]]
===== Overview =====
請見 ''Documentation/virtual/kvm/mmu.txt''。常見術語:
pfn host page frame number
hpa host physical address
hva host virtual address
gfn guest frame number
gpa guest physical address
gva guest virtual address
ngpa nested guest physical address
ngva nested guest virtual address
pte page table entry (used also to refer generically to paging structure
entries)
gpte guest pte (referring to gfns)
spte shadow pte (referring to pfns)
tdp two dimensional paging (vendor neutral term for NPT and EPT)
* 底下引文中 "the process that is using kvm" 指的就是 QEMU: QEMU 將自身的虛擬內存分配給 VM 作為客戶機物理內存。
Guest memory (gpa) is part of the user address space of the process that is using kvm. Userspace defines the translation between guest addresses and user addresses (gpa->hva); note that two gpas may alias to the same hva, but not vice versa.
* [[http://blog.stgolabs.net/2012/03/kvm-virtual-x86-mmu-setup.html#!/2012/03/kvm-virtual-x86-mmu-setup.html|kvm: virtual x86 mmu setup]],此段話講的是 GVA -> GPA。
One of the initialization steps that KVM does when a virtual machine (VM) is started, is setting up the vCPU's memory management unit (MMU) to translate virtual (lineal) addresses into physical ones within the guest's domain.
- 設置 MMU 的入口點位在 ''init_kvm_mmu'',這裡我們只關注開啟 EPT 的情況,也就是 ''tdp_enabled'' 為真。tdp 是 two dimensional paging 的縮寫。
/*
* When setting this variable to true it enables Two-Dimensional-Paging
* where the hardware walks 2 page tables:
* 1. the guest-virtual to guest-physical
* 2. while doing 1. it walks guest-physical to host-physical
* If the hardware supports that we don't need to do shadow paging.
*/
bool tdp_enabled = false;
static int init_kvm_mmu(struct kvm_vcpu *vcpu)
{
if (mmu_is_nested(vcpu))
return init_kvm_nested_mmu(vcpu);
else if (tdp_enabled)
return init_kvm_tdp_mmu(vcpu);
else
return init_kvm_softmmu(vcpu);
}
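* 補充一個常見的估算: 開啟 TDP 後,客戶機 4 級頁表查找過程中的每一個 GPA 都得先經過 4 級 EPT 轉換,最壞情況下一次 GVA -> HPA 轉換約需 (4+1) × (4+1) − 1 = 24 次內存存取;相較之下影子頁表只需一般的 4 次查找,但代價是 VMM 必須攔截並同步客戶機對頁表的修改。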
* ''include/linux/kvm_host.h'' 定義 ''struct kvm_vcpu'',這是最重要的資料結構。
struct kvm_vcpu {
struct kvm *kvm;
#ifdef CONFIG_PREEMPT_NOTIFIERS
struct preempt_notifier preempt_notifier;
#endif
int cpu;
int vcpu_id;
int srcu_idx;
int mode;
unsigned long requests;
unsigned long guest_debug;
struct mutex mutex;
struct kvm_run *run;
... 略 ...
struct kvm_vcpu_arch arch; /* 不同 ISA 有自己的 kvm_vcpu_arch */
};
* ''arch/x86/include/asm/kvm_host.h'' 定義 x86 自己的 ''kvm_vcpu_arch''。
struct kvm_vcpu_arch {
... 略 ...
/*
* Paging state of the vcpu
*
* If the vcpu runs in guest mode with two level paging this still saves
* the paging mode of the l1 guest. This context is always used to
* handle faults.
*/
struct kvm_mmu mmu;
/*
* Paging state of an L2 guest (used for nested npt)
*
* This context will save all necessary information to walk page tables
* of the an L2 guest. This context is only initialized for page table
* walking and not for faulting since we never handle l2 page faults on
* the host.
*/
struct kvm_mmu nested_mmu;
/*
* Pointer to the mmu context currently used for
* gva_to_gpa translations.
*/
struct kvm_mmu *walk_mmu;
... 略 ...
}
* 透過填充 ''walk_mmu'' 這項資料結構,可以設定如何查詢客戶機頁表。
* ''arch/x86/include/asm/kvm_host.h'' 定義 x86 上內存虛擬化相關資料結構。
struct kvm_mmu {
void (*new_cr3)(struct kvm_vcpu *vcpu);
void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index);
int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err,
bool prefault);
void (*inject_page_fault)(struct kvm_vcpu *vcpu,
struct x86_exception *fault);
void (*free)(struct kvm_vcpu *vcpu);
gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
struct x86_exception *exception);
gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access);
int (*sync_page)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *sp);
void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva);
void (*update_pte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
u64 *spte, const void *pte);
hpa_t root_hpa;
int root_level;
int shadow_root_level;
union kvm_mmu_page_role base_role;
bool direct_map;
u64 *pae_root;
u64 *lm_root;
u64 rsvd_bits_mask[2][4];
bool nx;
u64 pdptrs[4]; /* pae */
};
- ''init_kvm_tdp_mmu'' 負責填充 ''walk_mmu''。
static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
{
struct kvm_mmu *context = vcpu->arch.walk_mmu;
... 略 ...
if (!is_paging(vcpu)) {
context->nx = false;
context->gva_to_gpa = nonpaging_gva_to_gpa;
context->root_level = 0;
} else if (is_long_mode(vcpu)) {
context->nx = is_nx(vcpu);
context->root_level = PT64_ROOT_LEVEL;
reset_rsvds_bits_mask(vcpu, context);
context->gva_to_gpa = paging64_gva_to_gpa;
} else if (is_pae(vcpu)) {
context->nx = is_nx(vcpu);
context->root_level = PT32E_ROOT_LEVEL;
reset_rsvds_bits_mask(vcpu, context);
context->gva_to_gpa = paging64_gva_to_gpa;
} else {
context->nx = false;
context->root_level = PT32_ROOT_LEVEL;
reset_rsvds_bits_mask(vcpu, context);
context->gva_to_gpa = paging32_gva_to_gpa;
}
return 0;
}
* 這裡主要看 ''gva_to_gpa'' 被賦與什麼值,此函式負責 GVA -> GPA 的轉換。
* ''FNAME'' 是在 ''arch/x86/kvm/paging_tmpl.h'' 裡定義的宏,會將函數名稱擴展。[[http://blog.stgolabs.net/2012/03/kvm-hardware-assisted-paging.html#!/2012/03/kvm-hardware-assisted-paging.html|kvm: hardware assisted paging]]。
static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
struct x86_exception *exception)
{
struct guest_walker walker;
gpa_t gpa = UNMAPPED_GVA;
int r;
r = FNAME(walk_addr)(&walker, vcpu, vaddr, access);
if (r) {
gpa = gfn_to_gpa(walker.gfn);
gpa |= vaddr & ~PAGE_MASK;
} else if (exception)
*exception = walker.fault;
return gpa;
}
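* 舉例來說 (假設性數值): 若 ''walk_addr'' 查得 walker.gfn 為 0x1234,且 vaddr 的頁內偏移 (vaddr & ~PAGE_MASK) 為 0xabc,則 gpa = gfn_to_gpa(0x1234) | 0xabc = 0x1234abc;''gfn_to_gpa'' 即是把 gfn 左移 PAGE_SHIFT (x86 上為 12) 位。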
- ''walk_addr'' 會轉呼叫 ''walk_addr_generic'',''walk_addr_generic'' 會取得 GVA 對應的客戶頁表項。
static int FNAME(walk_addr)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, gva_t addr, u32 access)
{
return FNAME(walk_addr_generic)(walker, vcpu, &vcpu->arch.mmu, addr,
access);
}
/*
* Fetch a guest pte for a guest virtual address
*/
static int FNAME(walk_addr_generic)(struct guest_walker *walker,
struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gva_t addr, u32 access)
{
if (last_gpte) {
int lvl = walker->level;
gpa_t real_gpa;
gfn_t gfn;
u32 ac;
gfn = gpte_to_gfn_lvl(pte, lvl);
gfn += (addr & PT_LVL_OFFSET_MASK(lvl)) >> PAGE_SHIFT;
if (PTTYPE == 32 &&
walker->level == PT_DIRECTORY_LEVEL &&
is_cpuid_PSE36())
gfn += pse36_gfn_delta(pte);
ac = write_fault | fetch_fault | user_fault;
real_gpa = mmu->translate_gpa(vcpu, gfn_to_gpa(gfn),
ac);
if (real_gpa == UNMAPPED_GVA)
return 0;
walker->gfn = real_gpa >> PAGE_SHIFT;
break;
}
... 略 ...
error:
errcode |= write_fault | user_fault;
if (fetch_fault && (mmu->nx ||
kvm_read_cr4_bits(vcpu, X86_CR4_SMEP)))
errcode |= PFERR_FETCH_MASK;
/* 填充 walker->fault */
trace_kvm_mmu_walker_error(walker->fault.error_code);
return 0;
}
- ''kvm_read_guest_page_mmu'' (''arch/x86/kvm/x86.c'') 負責讀取 GVA 相對應的 HPA。
/*
* This function will be used to read from the physical memory of the currently
* running guest. The difference to kvm_read_guest_page is that this function
* can read from guest physical or from the guest's guest physical memory.
*/
int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
gfn_t ngfn, void *data, int offset, int len,
u32 access)
{
gfn_t real_gfn;
gpa_t ngpa;
ngpa = gfn_to_gpa(ngfn);
real_gfn = mmu->translate_gpa(vcpu, ngpa, access);
if (real_gfn == UNMAPPED_GVA)
return -EFAULT;
real_gfn = gpa_to_gfn(real_gfn);
return kvm_read_guest_page(vcpu->kvm, real_gfn, data, offset, len);
}
KVM 是將 QEMU 的虛擬內存分配給客戶機作為物理內存。KVM 將客戶機物理內存分為數個 slot[(https://lkml.org/lkml/2006/11/5/125)]。KVM 利用硬體內存虛擬化的流程大致如下[(http://people.cs.nctu.edu.tw/~chenwj/log/QEMU/agraf-2012-06-14.txt)]:
- 客戶機欲存取 GVA 0,查詢客戶機頁表得到 PTE (page table entry),該項將 GVA 0 映射至 GPA 0。
- 硬體查詢 EPT,發現 GPA 0 並無對應項,發生 EPT 頁缺失。
- 此時,KVM 介入。透過 memslot 得知 GPA 0 對映的 HVA x,再取得 HVA x 對映的 HPA y (透過 QEMU 這個進程的頁表)。最後將 GPA 0 -> HPA y 的映射填入 EPT。此步驟可參考下方的簡例。
- 客戶機再次存取 GVA 0,這時透過 EPT 即可得到對映的 HPA y,用 HPA y 存取內存。
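底下是一個簡化示意 (假設性程式碼,非 KVM 原始碼),說明 KVM 如何利用 memslot 由 GPA 查出對應的 HVA;KVM 內部實際是透過 ''gfn_to_memslot''、''gfn_to_hva'' 等函式完成。
/* 假設性資料結構與函式,僅為示意。 */
struct memslot {
    unsigned long guest_phys_addr;  /* 此 slot 的起始 GPA */
    unsigned long memory_size;      /* slot 大小 (bytes) */
    unsigned long userspace_addr;   /* 對應的 QEMU 虛擬位址 (HVA) 起點 */
};

/* GPA -> HVA: 找出涵蓋該 GPA 的 slot,再加上 slot 內偏移。 */
unsigned long gpa_to_hva(struct memslot *slots, int n, unsigned long gpa)
{
    int i;
    for (i = 0; i < n; i++) {
        struct memslot *s = &slots[i];
        if (gpa >= s->guest_phys_addr &&
            gpa - s->guest_phys_addr < s->memory_size)
            return s->userspace_addr + (gpa - s->guest_phys_addr);
    }
    return 0;  /* 查無對應 slot,通常代表 MMIO,交由 QEMU 模擬 */
}
取得 HVA 之後,KVM 再透過宿主機 (QEMU 進程) 的頁表取得 HPA (內部使用 ''get_user_pages'' 一類的介面),最後把 GPA -> HPA 填入 EPT。至於 memslot 本身 (GPA -> HVA 的映射),則由 QEMU 透過 KVM_SET_USER_MEMORY_REGION 註冊,路徑如下: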
kvm_set_phys_mem (kvm-all.c) -> kvm_set_user_memory_region (kvm-all.c)
- kvm_set_user_memory_region。
static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot)
{
struct kvm_userspace_memory_region mem;
mem.slot = slot->slot;
mem.guest_phys_addr = slot->start_addr;
mem.memory_size = slot->memory_size;
mem.userspace_addr = (unsigned long)slot->ram;
mem.flags = slot->flags;
if (s->migration_log) {
mem.flags |= KVM_MEM_LOG_DIRTY_PAGES;
}
return kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
}
* kvm_userspace_memory_region 是 KVM 裡的資料結構 (include/linux/kvm.h),KVMSlot 是 QEMU 裡的資料結構 (kvm-all.c)。
/* for KVM_CREATE_MEMORY_REGION */
struct kvm_memory_region {
__u32 slot;
__u32 flags;
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
};
/* for KVM_SET_USER_MEMORY_REGION */
struct kvm_userspace_memory_region {
__u32 slot;
__u32 flags; /* 目前只支援 KVM_MEM_LOG_DIRTY_PAGES 此 flag,KVM 用此 flag 來追蹤客戶機內存是否為 dirty。*/
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
__u64 userspace_addr; /* start of the userspace allocated memory */
};
typedef struct KVMSlot
{
target_phys_addr_t start_addr;
ram_addr_t memory_size;
void *ram;
int slot;
int flags;
} KVMSlot;
struct KVMState
{
KVMSlot slots[32];
int fd;
int vmfd;
int coalesced_mmio;
struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
bool coalesced_flush_in_progress;
int broken_set_mem_region;
int migration_log;
int vcpu_events;
int robust_singlestep;
int debugregs;
int pit_state2;
int xsave, xcrs;
int many_ioeventfds;
/* The man page (and posix) say ioctl numbers are signed int, but
* they're not. Linux, glibc and *BSD all treat ioctl numbers as
* unsigned, and treating them as signed here can break things */
unsigned irqchip_inject_ioctl;
};
* [[http://damocles.blogbus.com/logs/47970914.html|virtualized address translation]]
===== Shadow Page Table =====
使用影子頁表時,硬體 TLB 快取的與 CR3 實際指向的都是影子頁表 (GVA -> HPA) 的內容。當客戶機存取 CR3 或執行 INVLPG 時會陷入 VMM,由 VMM 接手維護影子頁表。若開啟 EPT,因為另有 EPTP 指向 EPT (GPA -> HPA),客戶機可自行存取 CR3 或執行 INVLPG 而不陷入 VMM。
- 預設情況,''setup_vmcs_config'' 會將 VM 設置為存取 CR3 或是執行 INVLPG 就觸發 VMExit。
static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
{
min = CPU_BASED_HLT_EXITING |
CPU_BASED_CR3_LOAD_EXITING |
CPU_BASED_CR3_STORE_EXITING |
... 略 ...
CPU_BASED_INVLPG_EXITING |
}
- ''vcpu_enter_guest'' (''arch/x86/kvm/x86.c'') 切入非根模式。
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
// 檢查 VCPU 是否有待處理的事件。
if (vcpu->requests) {
}
// 注入中斷至 VCPU。
if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
inject_pending_event(vcpu);
}
// 載入客戶機頁表。
r = kvm_mmu_reload(vcpu);
// 進入客戶機前的記帳 (accounting) 與 RCU 註記。
kvm_guest_enter();
// 進入非根模式,運行客戶機。
kvm_x86_ops->run(vcpu);
... 略 ...
// 處理 VMExit。
r = kvm_x86_ops->handle_exit(vcpu);
out:
return r;
}
- 若最上層 (由 CR3 指向) 的影子頁表尚未分配,''kvm_mmu_load'' (''arch/x86/kvm/mmu.c'') 會分配該影子頁表,並將 CR3 指向它。
int kvm_mmu_load(struct kvm_vcpu *vcpu)
{
int r;
r = mmu_topup_memory_caches(vcpu);
if (r)
goto out;
r = mmu_alloc_roots(vcpu);
spin_lock(&vcpu->kvm->mmu_lock);
mmu_sync_roots(vcpu);
spin_unlock(&vcpu->kvm->mmu_lock);
if (r)
goto out;
/* set_cr3() should ensure TLB has been flushed */
vcpu->arch.mmu.set_cr3(vcpu, vcpu->arch.mmu.root_hpa);
out:
return r;
}
* ''struct kvm_mmu_page'' (''arch/x86/include/asm/kvm_host.h'') 代表一個影子頁表頁面 (shadow page)。
- ''arch.mmu.set_cr3'' 會被賦值為平台特定的函式。以 VMX 為例,''set_cr3'' 會被賦值為 ''vmx_set_cr3'' (''arch/x86/kvm/vmx.c''),''vmx_set_cr3'' 透過寫入 VMCS 特定欄位設定 CR3 和 EPTP。
static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
unsigned long guest_cr3;
u64 eptp;
guest_cr3 = cr3;
if (enable_ept) {
eptp = construct_eptp(cr3);
// 設定 EPTP,指向 EPT (GPA -> HPA)。
vmcs_write64(EPT_POINTER, eptp);
guest_cr3 = is_paging(vcpu) ? kvm_read_cr3(vcpu) :
vcpu->kvm->arch.ept_identity_map_addr;
ept_load_pdptrs(vcpu);
}
vmx_flush_tlb(vcpu);
// 設定 CR3,指向客戶機頁表 (GVA -> GPA)。
vmcs_writel(GUEST_CR3, guest_cr3);
}
- ''page_fault'' (''arch/x86/kvm/paging_tmpl.h'')。
/*
* Page fault handler. There are several causes for a page fault:
* - there is no shadow pte for the guest pte
* - write access through a shadow pte marked read only so that we can set
* the dirty bit
* - write access to a shadow pte marked read only so we can update the page
* dirty bitmap, when userspace requests it
* - mmio access; in this case we will never install a present shadow pte
* - normal guest page fault due to the guest pte marked not present, not
* writable, or not executable
*
* Returns: 1 if we need to emulate the instruction, 0 otherwise, or
* a negative value on error.
*/
static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
bool prefault)
{
// MMIO。
if (unlikely(error_code & PFERR_RSVD_MASK))
return handle_mmio_page_fault(vcpu, addr, error_code,
mmu_is_nested(vcpu));
// 查找客戶機頁表。
r = FNAME(walk_addr)(&walker, vcpu, addr, error_code);
// 如果客戶機頁表沒有 GVA -> GPA 映射,向 VCPU 注入頁缺失例外,交由客戶機 OS 處理。
if (!r) {
pgprintk("%s: guest page fault\n", __func__);
if (!prefault)
inject_page_fault(vcpu, &walker.fault);
return 0;
}
// 或是影子頁表沒有 GVA -> HPA 映射,VMM 填充該影子頁表項。
sptep = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
level, &emulate, pfn, map_writable, prefault);
}
* 針對已存在的客戶機頁表項,沒有對應的影子頁表項。由 KVM 負責填充影子頁表項。
* 針對客戶機的寫動作,因為對應的影子頁表項設成只讀,所以發生頁缺失。
* 針對客戶機的 MMIO 觸發的頁缺失,不處理。
* 因為客戶機頁表項不存在,或是該客戶機存取違反客戶機頁表項設置的權限。注入頁缺失給客戶機,交還給客戶機作業系統處理。
* [[http://zhongshugu.wordpress.com/2010/06/14/shadow-page-table-in-kvm/|shadow page table in kvm]]
* [[http://the-hydra.blogspot.tw/2006/12/enlightenment-about-shadow-page-table.html|An enlightenment about shadow page table]]
* [[http://stackoverflow.com/questions/9832140/what-exactly-do-shadow-page-tables-for-vmms-do|What exactly do shadow page tables (for VMMs) do?]]
* [[http://lwn.net/Articles/216794/|Some KVM developments]]
* [[http://lwn.net/Articles/216759/|KVM: MMU: Cache shadow page tables]]
* 影子頁表如何與客戶機頁表同步? 將客戶機頁表設為寫保護,還需要什麼? 流程為何?
- ''rmap_write_protect'' (''arch/x86/kvm/mmu.c'') [(https://lkml.org/lkml/2007/1/4/116)]。
static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
int level, bool pt_protect)
{
u64 *sptep;
struct rmap_iterator iter;
bool flush = false;
for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
BUG_ON(!(*sptep & PT_PRESENT_MASK));
if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
sptep = rmap_get_first(*rmapp, &iter);
continue;
}
sptep = rmap_get_next(&iter);
}
return flush;
}
- ''spte_write_protect'' (''arch/x86/kvm/mmu.c'')。
static bool
spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
{
u64 spte = *sptep;
if (!is_writable_pte(spte) &&
!(pt_protect && spte_is_locklessly_modifiable(spte)))
return false;
rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
if (__drop_large_spte(kvm, sptep)) {
*flush |= true;
return true;
}
if (pt_protect)
spte &= ~SPTE_MMU_WRITEABLE;
spte = spte & ~PT_WRITABLE_MASK;
*flush |= mmu_spte_update(sptep, spte);
return false;
}
But making the guest kernel trap on page table accesses might not be as easy: Page tables are normal data in memory, and while switching page tables, i.e. loading the pointer to the root page table, is a privileged instruction that will trap when issued in user mode, accessing page table entries just means accessing memory, and it won't trap. The trick to still make these accesses trap is to mark the pages the page table entries reside on as invalid on the shadow page tables. http://events.ccc.de/congress/2006/Fahrplan/attachments/1132-InsideVMware.pdf
注意! 先不談虛擬化的情況,以 Linux 為例: 每個進程在內核中有一份對應的資料結構,其中存有該進程的頁表基址;內核被映射在每個進程虛擬內存的高位址處,內核修改頁表項只是普通的內存存取,並不會陷入。例如:
*pte = new_value;
這是一個內存存取,所以需要經過頁表作地址轉換。
static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
{
if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
/* CR3 accesses and invlpg don't need to cause VM Exits when EPT
enabled */
_cpu_based_exec_control &= ~(CPU_BASED_CR3_LOAD_EXITING |
CPU_BASED_CR3_STORE_EXITING |
CPU_BASED_INVLPG_EXITING);
rdmsr(MSR_IA32_VMX_EPT_VPID_CAP,
vmx_capability.ept, vmx_capability.vpid);
}
... 略 ...
}
- ''vmx_set_cr3'' (''arch/x86/kvm/vmx.c'') 設定 CR3 和 EPTP。
static void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
{
unsigned long guest_cr3;
u64 eptp;
guest_cr3 = cr3;
if (enable_ept) {
eptp = construct_eptp(cr3);
// 設定 EPTP,指向 EPT (GPA -> HPA)。
vmcs_write64(EPT_POINTER, eptp);
guest_cr3 = is_paging(vcpu) ? kvm_read_cr3(vcpu) :
vcpu->kvm->arch.ept_identity_map_addr;
ept_load_pdptrs(vcpu);
}
vmx_flush_tlb(vcpu);
// 設定 CR3,指向客戶機頁表 (GVA -> GPA)。
vmcs_writel(GUEST_CR3, guest_cr3);
}
- ''tdp_page_fault'' 在因為發生 EPT 頁缺失,陷入 VMM 時,設置 EPT 項。
static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
bool prefault)
{
spin_lock(&vcpu->kvm->mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
goto out_unlock;
kvm_mmu_free_some_pages(vcpu);
if (likely(!force_pt_level))
transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
// 設置 EPT 項。
r = __direct_map(vcpu, gpa, write, map_writable,
level, gfn, pfn, prefault);
spin_unlock(&vcpu->kvm->mmu_lock);
return r;
}
* \_\_direct_map。''spte'' 即 shadow page table entry。
static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
int map_writable, int level, gfn_t gfn, pfn_t pfn,
bool prefault)
{
for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
if (iterator.level == level) {
unsigned pte_access = ACC_ALL;
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
0, write, &emulate,
level, gfn, pfn, prefault, map_writable);
direct_pte_prefetch(vcpu, iterator.sptep);
++vcpu->stat.pf_fixed;
break;
}
if (!is_shadow_present_pte(*iterator.sptep)) {
u64 base_addr = iterator.addr;
base_addr &= PT64_LVL_ADDR_MASK(iterator.level);
pseudo_gfn = base_addr >> PAGE_SHIFT;
sp = kvm_mmu_get_page(vcpu, pseudo_gfn, iterator.addr,
iterator.level - 1,
1, ACC_ALL, iterator.sptep);
if (!sp) {
pgprintk("nonpaging_map: ENOMEM\n");
kvm_release_pfn_clean(pfn);
return -ENOMEM;
}
mmu_spte_set(iterator.sptep,
__pa(sp->spt)
| PT_PRESENT_MASK | PT_WRITABLE_MASK
| shadow_user_mask | shadow_x_mask
| shadow_accessed_mask);
}
}
return emulate;
}
為了避免每次 VMEntry/VMExit 都要沖掉所有 TLB 內容,x86 支援 VPID: TLB 項會標上 VPID,硬體即可區分不同客戶機與 VMM 的 TLB 項,切換時不必整個清空。
- ''vmx_create_vcpu'' 在創建 VCPU 時,會配置其 VPID。
static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
{
int err;
struct vcpu_vmx *vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
int cpu;
if (!vmx)
return ERR_PTR(-ENOMEM);
allocate_vpid(vmx);
... 略 ...
}
- ''allocate_vpid'' 使用 ''vmx_vpid_bitmap'' 替每一個客戶機分配一個 VPID,VMM 自己的 VPID 恆為 0。
static void allocate_vpid(struct vcpu_vmx *vmx)
{
int vpid;
vmx->vpid = 0;
if (!enable_vpid)
return;
spin_lock(&vmx_vpid_lock);
vpid = find_first_zero_bit(vmx_vpid_bitmap, VMX_NR_VPIDS);
if (vpid < VMX_NR_VPIDS) {
vmx->vpid = vpid;
__set_bit(vpid, vmx_vpid_bitmap);
}
spin_unlock(&vmx_vpid_lock);
}
客戶機存取客戶機內存,經由影子頁表或是 EPT 轉換位址時觸發頁缺失。[[http://download.intel.com/products/processor/manual/326019.pdf|Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3C]] 第 27 章,第 127 頁描述發生 EPT violation,Exit Qualification 會帶有什麼資訊。
- handle_ept_violation (vmx.c)。
static int handle_ept_violation(struct kvm_vcpu *vcpu)
{
unsigned long exit_qualification;
gpa_t gpa;
u32 error_code;
int gla_validity;
exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
if (exit_qualification & (1 << 6)) {
printk(KERN_ERR "EPT: GPA exceeds GAW!\n");
return -EINVAL;
}
gla_validity = (exit_qualification >> 7) & 0x3;
if (gla_validity != 0x3 && gla_validity != 0x1 && gla_validity != 0) {
printk(KERN_ERR "EPT: Handling EPT violation failed!\n");
printk(KERN_ERR "EPT: GPA: 0x%lx, GVA: 0x%lx\n",
(long unsigned int)vmcs_read64(GUEST_PHYSICAL_ADDRESS),
vmcs_readl(GUEST_LINEAR_ADDRESS));
printk(KERN_ERR "EPT: Exit qualification is 0x%lx\n",
(long unsigned int)exit_qualification);
vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
return 0;
}
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
trace_kvm_page_fault(gpa, exit_qualification);
/* It is a write fault? */
error_code = exit_qualification & (1U << 1);
/* ept page table is present? */
error_code |= (exit_qualification >> 3) & 0x1;
return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
}
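* Exit Qualification 的低位元大致為: bit 0-2 表示此次存取屬於讀/寫/取指,bit 3-5 表示該 GPA 在 EPT 中的讀/寫/執行權限,bit 7 表示 GUEST_LINEAR_ADDRESS 欄位是否有效。以 exit_qualification = 0x182 為例 (假設性數值): bit 1 = 1 表示是寫入存取,bit 3-5 全為 0 表示 EPT 中尚無可用映射,bit 7 = 1 表示 GVA 欄位有效;對應上面程式碼,error_code 的寫入位會被設起來。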
- kvm_mmu_page_fault (mmu.c)。
int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
void *insn, int insn_len)
{
int r, emulation_type = EMULTYPE_RETRY;
enum emulation_result er;
r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
if (r < 0)
goto out;
if (!r) {
r = 1;
goto out;
}
if (is_mmio_page_fault(vcpu, cr2))
emulation_type = 0;
er = x86_emulate_instruction(vcpu, cr2, emulation_type, insn, insn_len);
switch (er) {
case EMULATE_DONE:
return 1;
case EMULATE_DO_MMIO:
++vcpu->stat.mmio_exits;
/* fall through */
case EMULATE_FAIL:
return 0;
default:
BUG();
}
out:
return r;
}
- x86_emulate_instruction (x86.c)。
int x86_emulate_instruction(struct kvm_vcpu *vcpu,
unsigned long cr2,
int emulation_type,
void *insn,
int insn_len)
{
int r;
struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
bool writeback = true;
kvm_clear_exception_queue(vcpu);
if (!(emulation_type & EMULTYPE_NO_DECODE)) {
init_emulate_ctxt(vcpu);
ctxt->interruptibility = 0;
ctxt->have_exception = false;
ctxt->perm_ok = false;
ctxt->only_vendor_specific_insn
= emulation_type & EMULTYPE_TRAP_UD;
r = x86_decode_insn(ctxt, insn, insn_len);
trace_kvm_emulate_insn_start(vcpu);
++vcpu->stat.insn_emulation;
if (r != EMULATION_OK) {
if (emulation_type & EMULTYPE_TRAP_UD)
return EMULATE_FAIL;
if (reexecute_instruction(vcpu, cr2))
return EMULATE_DONE;
if (emulation_type & EMULTYPE_SKIP)
return EMULATE_FAIL;
return handle_emulation_failure(vcpu);
}
}
if (emulation_type & EMULTYPE_SKIP) {
kvm_rip_write(vcpu, ctxt->_eip);
return EMULATE_DONE;
}
if (retry_instruction(ctxt, cr2, emulation_type))
return EMULATE_DONE;
/* this is needed for vmware backdoor interface to work since it
changes registers values during IO operation */
if (vcpu->arch.emulate_regs_need_sync_from_vcpu) {
vcpu->arch.emulate_regs_need_sync_from_vcpu = false;
memcpy(ctxt->regs, vcpu->arch.regs, sizeof ctxt->regs);
}
restart:
r = x86_emulate_insn(ctxt);
if (r == EMULATION_INTERCEPTED)
return EMULATE_DONE;
if (r == EMULATION_FAILED) {
if (reexecute_instruction(vcpu, cr2))
return EMULATE_DONE;
return handle_emulation_failure(vcpu);
}
if (ctxt->have_exception) {
inject_emulated_exception(vcpu);
r = EMULATE_DONE;
} else if (vcpu->arch.pio.count) {
if (!vcpu->arch.pio.in)
vcpu->arch.pio.count = 0;
else
writeback = false;
r = EMULATE_DO_MMIO;
} else if (vcpu->mmio_needed) {
if (!vcpu->mmio_is_write)
writeback = false;
r = EMULATE_DO_MMIO;
} else if (r == EMULATION_RESTART)
goto restart;
else
r = EMULATE_DONE;
if (writeback) {
toggle_interruptibility(vcpu, ctxt->interruptibility);
kvm_set_rflags(vcpu, ctxt->eflags);
kvm_make_request(KVM_REQ_EVENT, vcpu);
memcpy(vcpu->arch.regs, ctxt->regs, sizeof ctxt->regs);
vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
kvm_rip_write(vcpu, ctxt->eip);
} else
vcpu->arch.emulate_regs_need_sync_to_vcpu = true;
return r;
}
* [[http://zhongshugu.wordpress.com/2010/06/17/ept-in-kvm/|EPT in kvm]]
* [[http://tptp.cc/mirrors/siyobik.info/instruction/INVVPID|INVVPID]]
* [[http://blog.chinaunix.net/uid-1858380-id-3205061.html|Intel内存虚拟化技术分析]]
* [[http://www.ibm.com/developerworks/cn/linux/l-cn-virtnew/|x86 平台硬件辅助虚拟化技术的新发展]]
====== I/O ======
VMM 只需模擬裝置的軟件接口,如: port IO、MMIO、DMA 和中斷。無需模擬裝置物理上的結構。
- ''ioapic_deliver'' (''virt/kvm/ioapic.c'')。
static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
{
union kvm_ioapic_redirect_entry *entry = &ioapic->redirtbl[irq];
struct kvm_lapic_irq irqe;
irqe.dest_id = entry->fields.dest_id;
... 略 ...
// 將 IRQ 注入到 KVM 所擁有的 VCPU/LAPIC。
return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe);
}
- ''kvm_irq_delivery_to_apic'' (''virt/kvm/irq_comm.c'')。
int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
struct kvm_lapic_irq *irq)
{
int i, r = -1;
struct kvm_vcpu *vcpu, *lowest = NULL;
kvm_for_each_vcpu(i, vcpu, kvm) {
if (!kvm_apic_present(vcpu))
continue;
if (!kvm_apic_match_dest(vcpu, src, irq->shorthand,
irq->dest_id, irq->dest_mode))
continue;
if (!kvm_is_dm_lowest_prio(irq)) {
if (r < 0)
r = 0;
r += kvm_apic_set_irq(vcpu, irq);
} else if (kvm_lapic_enabled(vcpu)) {
if (!lowest)
lowest = vcpu;
else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
lowest = vcpu;
}
}
if (lowest)
r = kvm_apic_set_irq(lowest, irq);
return r;
}
* ''kvm_apic_set_irq''。
* [[http://blog.scottlowe.org/2009/12/02/what-is-sr-iov/|What is SR-IOV]]
* [[http://static.usenix.org/event/wiov08/tech/full_papers/dowty/dowty_html/|GPU Virtualization on VMware's Hosted I/O Architecture]]
* [[http://www.zillians.com/vgpu/home]]
* [[http://blog.csdn.net/yearn520/article/details/6663532|KVM虚拟机代码揭秘——中断虚拟化]]
* [[http://blog.csdn.net/zhou0/article/details/7020288|kvm-qemu 设备IO虚拟化]]
* [[http://wenku.baidu.com/view/3894a490daef5ef7ba0d3c6e.html|修改客户操作系统优化KVM虚拟机的I_O性能]]
* [[http://kernel-demystified.com/forum/index.php/topic,14.msg14.html#msg14|Device virtualization with Qemu and KVM]]
====== Live Migration ======
[[https://events.linuxfoundation.org/images/stories/pdf/lcjp2012_yamahata_postcopy.pdf|Postcopy Live migration for QEmu/KVM]] 是就 [[https://www.kernel.org/doc/mirror/ols2007v1.pdf#page=225|kvm: the Linux Virtual Machine Monitor]] 的方法加以改良。[[https://events.linuxfoundation.org/images/stories/pdf/lcjp2012_yoshikawa.pdf|How to Mitigate Latency Problems during KVM/QEMU Live Migration]] 則是就現行做法進行優化,[[http://lwn.net/Articles/202847/|SRCU (Sleepable RCU)]] 是 [[http://lwn.net/Articles/262464/|RCU]] 的變種。
* [[http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg00258.html|[Qemu-devel] [PATCH v2 00/41] postcopy live migration]]
====== 時鐘 ======
客戶機作業系統的時間是否需要與實際時間一致,端看其應用場景而定。
* [[http://www.vmware.com/pdf/vi3_esx_vmdesched.pdf|Improving Guest Operating System Accounting for Descheduled Virtual Machines in ESX Server 3.x Systems]]
* [[http://www.vmware.com/files/pdf/Timekeeping-In-VirtualMachines.pdf|Timekeeping in VMware Virtual Machines]]
* [[http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=54601234DF57AE0B1854B35C608C3D15?doi=10.1.1.106.781&rep=rep1&type=pdf|Temporal Search: Detecting Hidden Malware Timebombs with Virtual Machines]]
====== QEMU ======
KVM 是 Linux 內核的一個模組,它會以裝置 ''/dev/kvm'' 向外界提供它的功能。QEMU 透過 ''ioctl'' 去讀寫該裝置請求 KVM 完成特定任務。KVM 主要的工作有兩個:
第一,它負責檢視客戶機 VM Exit 的原因並做相對應的處理; 第二,它負責透過 VM Entry 啟動客戶機。客戶機若因為操作 IO 而觸發 VM Exit,KVM 會轉交 QEMU 完成 IO。整個 KVM 流程基本如下[(http://people.cs.nctu.edu.tw/~chenwj/log/QEMU/agraf-2012-06-13.txt)]:
* 開啟 ''/dev/kvm'' 取得 fd。
* 透過 ''ioctl'' 操作 ''/dev/kvm'' 取得 VM fd。
* 再透過 ''ioctl'' 操作 VM fd,針對每一個 VCPU 取得個別的 fd。
其概念可以參考 [[http://mac-on-linux.svn.sourceforge.net/viewvc/mac-on-linux/trunk/src/cpu/kvm/kvm.c?revision=171&view=markup]]。
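* 對應這三種 fd,QEMU 在 ''kvm-all.c'' 中分別以 ''kvm_ioctl''、''kvm_vm_ioctl''、''kvm_vcpu_ioctl'' 包裝 ''ioctl'' 呼叫。底下為簡化示意 (實際函式為可變參數,並含 trace 與錯誤處理):
// 簡化示意: 三種 fd 各自的 ioctl 包裝。
int kvm_ioctl(KVMState *s, int type, void *arg)             /* 對 KVM fd (/dev/kvm) */
{
    return ioctl(s->fd, type, arg);
}

int kvm_vm_ioctl(KVMState *s, int type, void *arg)          /* 對 VM fd */
{
    return ioctl(s->vmfd, type, arg);
}

int kvm_vcpu_ioctl(CPUArchState *env, int type, void *arg)  /* 對 VCPU fd */
{
    return ioctl(env->kvm_fd, type, arg);
}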
* ''struct KVMState'' 和 ''struct KVMSlot'' 分別是其重要資料結構。
typedef struct KVMSlot
{
target_phys_addr_t start_addr;
ram_addr_t memory_size;
void *ram;
int slot;
int flags;
} KVMSlot;
struct KVMState
{
KVMSlot slots[32];
int fd;
int vmfd;
int coalesced_mmio;
struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
bool coalesced_flush_in_progress;
int broken_set_mem_region;
int migration_log;
int vcpu_events;
int robust_singlestep;
int debugregs;
int pit_state2;
int xsave, xcrs;
int many_ioeventfds;
/* The man page (and posix) say ioctl numbers are signed int, but
* they're not. Linux, glibc and *BSD all treat ioctl numbers as
* unsigned, and treating them as signed here can break things */
unsigned irqchip_inject_ioctl;
};
* ''CPU_COMMON'' (''cpu-defs.h'') 裡有欄位給 KVM 使用。
#define CPU_COMMON \
struct KVMState *kvm_state; \
struct kvm_run *kvm_run; \
int kvm_fd; \
int kvm_vcpu_dirty;
- ''main'' (''vl.c'') 會呼叫 ''configure_accelerator'' 檢查使用者是否選用 KVM。
int main(int argc, char **argv, char **envp)
{
... 略 ...
/* init the memory */
if (ram_size == 0) {
ram_size = DEFAULT_RAM_SIZE * 1024 * 1024;
}
configure_accelerator();
qemu_init_cpu_loop();
if (qemu_init_main_loop()) {
fprintf(stderr, "qemu_init_main_loop failed\n");
exit(1);
}
... 略 ...
}
* ''kvm_init'' (''kvm-all.c'')。
int kvm_init(void)
{
KVMState *s;
// slot 是用來記錄客戶機物理位址與 QEMU 虛擬位址的映射。
for (i = 0; i < ARRAY_SIZE(s->slots); i++) {
s->slots[i].slot = i;
}
// 開啟 ''/dev/kvm'' 取得 fd。
s->fd = qemu_open("/dev/kvm", O_RDWR);
// 透過 ''ioctl'' 操作 ''/dev/kvm'' 取得 VM fd。
s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0);
kvm_state = s; // kvm_state 為一全域變數。
memory_listener_register(&kvm_memory_listener, NULL);
}
- 當 KVM 開啟時,VCPU handler 為 ''qemu_kvm_cpu_thread_fn'',''qemu_kvm_start_vcpu'' 會喚起一個執行緒執行 ''qemu_kvm_cpu_thread_fn''。若是原本 TCG 的模式,則改由 ''qemu_tcg_init_vcpu'' 喚起 ''qemu_tcg_cpu_thread_fn''。
static void qemu_kvm_start_vcpu(CPUArchState *env)
{
env->thread = g_malloc0(sizeof(QemuThread));
env->halt_cond = g_malloc0(sizeof(QemuCond));
qemu_cond_init(env->halt_cond);
qemu_thread_create(env->thread, qemu_kvm_cpu_thread_fn, env,
QEMU_THREAD_JOINABLE);
while (env->created == 0) {
qemu_cond_wait(&qemu_cpu_cond, &qemu_global_mutex);
}
}
void qemu_init_vcpu(void *_env)
{
CPUArchState *env = _env;
env->nr_cores = smp_cores;
env->nr_threads = smp_threads;
env->stopped = 1;
if (kvm_enabled()) {
qemu_kvm_start_vcpu(env);
} else if (tcg_enabled()) {
qemu_tcg_init_vcpu(env);
} else {
qemu_dummy_start_vcpu(env);
}
}
- ''qemu_kvm_cpu_thread_fn'' 呼叫 ''kvm_cpu_exec'' 此一主要執行迴圈。
static void *qemu_kvm_cpu_thread_fn(void *arg)
{
... 略 ...
r = kvm_init_vcpu(env);
if (r < 0) {
fprintf(stderr, "kvm_init_vcpu failed: %s\n", strerror(-r));
exit(1);
}
qemu_kvm_init_cpu_signals(env);
/* signal CPU creation */
env->created = 1;
qemu_cond_signal(&qemu_cpu_cond);
while (1) {
if (cpu_can_run(env)) {
r = kvm_cpu_exec(env);
if (r == EXCP_DEBUG) {
cpu_handle_guest_debug(env);
}
}
qemu_kvm_wait_io_event(env);
}
return NULL;
}
* ''kvm_init_vcpu'' (''kvm-all.c'')。
int kvm_init_vcpu(CPUArchState *env)
{
KVMState *s = kvm_state;
long mmap_size;
int ret;
ret = kvm_vm_ioctl(s, KVM_CREATE_VCPU, env->cpu_index);
env->kvm_fd = ret; // VCPU fd 而非 KVM fd。http://lists.gnu.org/archive/html/qemu-devel/2012-06/msg02302.html
env->kvm_state = s;
env->kvm_vcpu_dirty = 1;
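// mmap_size 在此節錄中省略;實際上由 kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0) 取得。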
// QEMU 的 kvm_run 被 mmap 到 VCPU fd。這非常重要,當後續 KVM 將客戶機的 IO 交給 QEMU 執行,
// QEMU 就是透過 kvm_run 讀取 IO 相關細節。
env->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
env->kvm_fd, 0);
ret = kvm_arch_init_vcpu(env);
if (ret == 0) {
qemu_register_reset(kvm_reset_vcpu, env);
kvm_arch_reset_vcpu(env);
}
err:
return ret;
}
- 主要執行迴圈為 ''kvm_cpu_exec'' (''kvm-all.c'')。
int kvm_cpu_exec(CPUArchState *env)
{
struct kvm_run *run = env->kvm_run;
do {
... 略 ...
run_ret = kvm_vcpu_ioctl(env, KVM_RUN, 0);
// 檢視 VMExit 的原因,並做相應的處理。若 VMExit 可由 KVM (內核) 處理,由 KVM 處理。
// 其餘諸如 IO 則交給 QEMU。
switch (run->exit_reason) {
// IO 交由 QEMU (用戶態) 處理。
case KVM_EXIT_IO:
DPRINTF("handle_io\n");
kvm_handle_io(run->io.port,
(uint8_t *)run + run->io.data_offset,
run->io.direction,
run->io.size,
run->io.count);
ret = 0;
break;
case KVM_EXIT_MMIO:
DPRINTF("handle_mmio\n");
cpu_physical_memory_rw(run->mmio.phys_addr,
run->mmio.data,
run->mmio.len,
run->mmio.is_write);
ret = 0;
break;
... 略 ...
// 其餘交由平台特定的 handler 處理。
default:
DPRINTF("kvm_arch_handle_exit\n");
ret = kvm_arch_handle_exit(env, run);
break;
}
} while (ret == 0);
}
* 不同平台定義不同的 ''kvm_arch_handle_exit''。以 x86 為例,''kvm_arch_handle_exit'' (''target-i386/kvm.c'')。
int kvm_arch_handle_exit(CPUX86State *env, struct kvm_run *run)
{
}
* KVM 和 QEMU 之間會同步一些資料結構,例如: ''struct kvm_run''。
/* for KVM_RUN, returned by mmap(vcpu_fd, offset=0) */
struct kvm_run {
/* in */
__u8 request_interrupt_window;
__u8 padding1[7];
/* out */
__u32 exit_reason;
__u8 ready_for_interrupt_injection;
__u8 if_flag;
__u8 padding2[2];
/* in (pre_kvm_run), out (post_kvm_run) */
__u64 cr8;
__u64 apic_base;
... 略 ...
};
struct kvm_vcpu {
... 略 ...
struct kvm_run *run;
... 略 ...
};
static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
unsigned short port, void *val,
unsigned int count, bool in)
{
trace_kvm_pio(!in, port, size, count);
vcpu->arch.pio.port = port;
vcpu->arch.pio.in = in;
vcpu->arch.pio.count = count;
vcpu->arch.pio.size = size;
if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
vcpu->arch.pio.count = 0;
return 1;
}
// 回到 QEMU 之後,QEMU 會檢視以下欄位。
vcpu->run->exit_reason = KVM_EXIT_IO;
vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
vcpu->run->io.size = size;
vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
vcpu->run->io.count = count;
vcpu->run->io.port = port;
return 0;
}
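* 對應地,用戶態 (QEMU) 在 KVM_RUN 返回後,可依上述欄位從 mmap 得到的 ''kvm_run'' 取出 PIO 資料;QEMU 實際的處理在 ''kvm_handle_io'' (''kvm-all.c'')。底下為簡化示意 (假設性程式碼,埠號 0x3f8 僅為舉例):
/* 假設性示意: run 為 mmap(vcpu_fd) 取得的 struct kvm_run。 */
if (run->exit_reason == KVM_EXIT_IO) {
    /* PIO 資料緊跟在 kvm_run 之後,位移由 io.data_offset 給出 */
    uint8_t *data = (uint8_t *)run + run->io.data_offset;
    uint32_t i;
    if (run->io.direction == KVM_EXIT_IO_OUT && run->io.size == 1 &&
        run->io.port == 0x3f8) {                /* 以模擬 serial 輸出埠為例 */
        for (i = 0; i < run->io.count; i++)     /* 逐筆處理 count 筆資料 */
            putchar(data[i * run->io.size]);
    }
}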
* 上述流程可以參考 [[http://mac-on-linux.svn.sourceforge.net/viewvc/mac-on-linux/trunk/src/cpu/kvm/misc.c?revision=166&view=markup]] 中的 ''molcpu_mainloop''。
====== Q & A ======
- KVM 和 QEMU 的關係? \\ [[http://www.linux-kvm.org/page/FAQ#What_is_the_difference_between_KVM_and_QEMU.3F|What is the difference between KVM and QEMU?]] 和 [[http://www.fujitsu.com/downloads/MAG/vol47-3/paper18.pdf|Kernel-based Virtual Machine Technology]]。
- KVM 如何處理 guest OS 記憶體存取? \\ [[http://www.linux-kvm.org/page/Memory|http://www.linux-kvm.org/page/Memory]]
- [[http://blog.vmsplice.net/2011/03/should-i-use-qemu-or-kvm.html|Should I use QEMU or KVM?]]
- [[http://software.intel.com/en-us/blogs/2009/06/25/virtualization-and-performance-understanding-vm-exits/|Virtualization and Performance: Understanding VM Exits]] 和 [[http://etc.chinabyte.com/pdf/Performance%20Analysis%20in%20Virtualization.pdf|Performance Analysis in Virtualization]]。
- [[http://www.linux-kvm.org/wiki/images/7/70/2010-forum-threading-qemu.pdf|Multi-threading QEMU?]]
====== Submitted Patch ======
* [[http://article.gmane.org/gmane.comp.emulators.kvm.devel/92797|[PATCH] Fix typo in x86/kvm/vmx.c]]
====== 其它 ======
* [[http://lkml.indiana.edu/hypermail/linux/kernel/0610.2/1369.html|[PATCH 0/7] KVM: Kernel-based Virtual Machine]]
* [[http://kerneltrap.org/mailarchive/linux-kvm/2010/3/24/6260060/thread|OEM version of Windows in kvm (SLIC &Co)]]
* [[http://forums.mydigitallife.info/threads/12401-Modified-Bios-for-KVM-Qemu-Bochs-Bios|Modified Bios for KVM/Qemu/Bochs Bios?]]
* [[http://hi.baidu.com/elffin/blog/item/fbab8ced160438db2e2e21f0.html|BIOS中SLIC 2.1表详细组成及验证激活相关解释说明]]
* [[http://blog.csdn.net/zeo112140/article/details/7260884|内核虚拟化KVM——overview]]
* [[wp>Hyper-V]]
* [[http://www.docin.com/p-261864116.html|kvm介绍]]
* [[http://wenku.baidu.com/view/ba1cf94769eae009581bec14.html|KVM虚拟机分析]]
* [[http://zhangjun2915.blog.163.com/blog/static/3808623620105683158449/|[KVM学习笔记]kvm-kmod-2.6.33.1主要函数路径]]
* [[http://zhangjun2915.blog.163.com/blog/static/3808623620105274356676/|[KVM学习笔记]VMCS研究总结]]
* [[http://zhangjun2915.blog.163.com/blog/static/3808623620105801556417/|[KVM学习笔记]kvm-kmod-2.6.33.1数据结构]]
* [[http://old.lwn.net/lwn/images/conf/rtlws11/papers/proc/p18.pdf|Towards Linux as a Real-Time Hypervisor]]
* [[http://support.amd.com/us/Processor_TechDocs/24594_APM_v3.pdf|AMD64 Architecture Programmer’s Manual Volume 3: General-Purpose and System Instructions]]
* [[http://www.linux-kvm.org/wiki/images/d/d5/KvmForum2007$KVM-tuning-testing-SMP2.pdf|KVM tuning and testing, and SMP enhancement]]
* [[http://labs.vmware.com/download/139/|Software Techniques for Avoiding Hardware Virtualization Exits]]
====== 文章 ======
* [[http://www.ece.cmu.edu/~ece845/sp11/docs/uhlig-vt-overview.pdf|Intel virtualization technology]]
* [[http://www.ibm.com/developerworks/linux/library/l-linux-kvm/|Discover the Linux Kernel Virtual Machine]]
* [[https://www.ibm.com/developerworks/mydeveloperworks/blogs/ibmvirtualization/entry/kvm_architecture_the_key_components_of_open_virtualization_with_kvm2?lang=zh|KVM Architecture: The Key Components of Open Virtualization with KVM]]
* [[http://developer.amd.com/documentation/articles/pages/630200614.aspx|Processor-Based Virtualization, AMD64 Style, Part I]]
* [[http://developer.amd.com/documentation/articles/pages/630200615.aspx|Processor-Based Virtualization, AMD64 Style, Part II]]
====== 外部連結 ======
* [[http://www.linux-kvm.org|KVM]]
* [[https://wiki.linaro.org/Server%20virtualization%20overview|Server virtualization overview]]
* [[http://www.ibm.com/developerworks/linux/library/l-hypervisor/index.html?ca=dgr-lnxw16Lnx-Hypervisor&S_TACT=105AGX59&S_CMP=grlnxw16|Anatomy of a Linux hypervisor]]
* [[http://blog.vmsplice.net/2011/03/qemu-internals-overall-architecture-and.html|QEMU Internals: Overall architecture and threading model]]
* [[http://www.linux-tutorial.info/modules.php?name=MContent&pageid=263|Bottom Half Handling]]
* [[http://blog.vmsplice.net/2011/03/qemu-internals-big-picture-overview.html|QEMU Internals: Big picture overview]]
* [[http://www.ovirt.org/|oVirt Project]]
* [[http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Thursday_130pm_Hugh_Brock_Perry_Myers_Whats_Next.pdf|oVirt: An Open Management Framework for Virtualized Environments]]
* [[http://virtual.51cto.com/art/201112/308425.htm|红帽oVirt项目:开源RHEV虚拟化管理工具]]
* [[http://www.gluster.org/|Gluster Project]]
* [[http://stenlyho.blogspot.com/2009/01/vt-xvt-d-intel.html|從VT-x到VT-d Intel虚擬化技術發展藍圖]]