nfv-kvm-tuning

Differences

This shows you the differences between two versions of the page.

nfv-kvm-tuning [2016/01/04 06:32]
Chao Peng [Base Platform and Environment]
nfv-kvm-tuning [2016/01/13 22:46] (current)
Jiang, Yunhong [Performance/Latency Tuning]
Line 6:
   * Software: Based on Real-Time Linux https://rt.wiki.kernel.org/index.php/Main_Page.

-Please refer to [[nfv-kvm-test|kvmfornfv test]] for details.
+Please refer to [[nfv-kvm-test]] for details.

 ====== Configuration ======
  
-A right configuration is critical for improving the NFV performance/latency. Even on the same codebase, different configrations can make completely different performance/latency result.
+The right configuration is critical for improving NFV performance/latency. Even when working on the same codebase, different configurations can produce completely different performance/latency results.
  
 There are many combinations of configurations, from hardware configuration to operating system configuration and application-level configuration, and no single configuration works for every case. To tune a specific scenario, it is important to know the behaviors of different configurations and their impact.
 ===== Platform Configuration =====
  
-Some hardware features can be configured through firmware interface(e.g. BIOS) but others may not be configurable (e.g. SMI on most platforms).
+Some hardware features can be configured through the firmware interface (such as the BIOS), but others may not be configurable (e.g. SMI on most platforms).
  
   * __Power management__: Most power management features save power at the expense of latency. These features include Intel® Turbo Boost Technology, Enhanced Intel® SpeedStep, and processor C-states and P-states. Normally they should be disabled. Depending on the real-time application design and latency requirements, however, some features can be enabled if their impact on the deterministic execution of the workload is small.
Line 23:
   * __Legacy USB Support/Port 60/64 Emulation__: These features involve some emulation in firmware and can introduce random latency. It is recommended to disable them.
  
-  * __SMI__: System Management Interrupt runs outside of the kernel code and can potentially cause latency. It's a pity there is no simple to disable it. Some vendors may provide switchs in BIOS related this but most machine would not have.
+  * __SMI__: A System Management Interrupt runs outside of the kernel code and can potentially cause latency. Unfortunately, there is no simple way to disable it. Some vendors may provide related switches in the BIOS, but most machines do not.
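
Because SMIs cannot simply be switched off, it can help to at least measure how often they occur on a given platform. The following is a minimal sketch, not part of the original page. It assumes an Intel CPU that exposes the MSR_SMI_COUNT register (0x34) and a Linux host with the msr driver loaded, so that /dev/cpu/N/msr is readable as root. The helper names are illustrative.

```python
import os
import struct

MSR_SMI_COUNT = 0x34  # assumption: Intel CPU exposing the SMI counter MSR


def decode_smi_count(raw: bytes) -> int:
    """Decode an 8-byte little-endian MSR value; the SMI count is the low 32 bits."""
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF


def read_smi_count(cpu: int = 0) -> int:
    """Read MSR_SMI_COUNT via the Linux msr driver (needs root and a loaded msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The file offset selects the MSR address; each read returns 8 bytes.
        return decode_smi_count(os.pread(fd, 8, MSR_SMI_COUNT))
    finally:
        os.close(fd)


def smi_delta(before: int, after: int) -> int:
    """Number of SMIs between two samples, tolerating 32-bit counter wraparound."""
    return (after - before) & 0xFFFFFFFF
```

Sampling `read_smi_count()` before and after a latency test and passing both values to `smi_delta()` indicates whether SMIs fired during the run. A non-zero delta suggests that firmware activity may account for some of the latency spikes.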
  
 ===== Operating System Configuration =====
Line 31:
   * __Memory allocation__: Memory should be reserved for the realtime application, and hugepages should usually be used to reduce page faults and TLB misses.
  
-  * __IRQ affinity__: All the non-realtime IRQs should affinitized to non realtime CPUs to reduce the impact on realtime CPUs. Some OS distribution contains irqbalance deamon which balences the IRQs among all the cores dynamically. It should be disabled as well.
+  * __IRQ affinity__: All non-realtime IRQs should be affinitized to non-realtime CPUs to reduce the impact on realtime CPUs. Some OS distributions include the irqbalance daemon, which balances IRQs among all the cores dynamically. It should be disabled as well.
  
-  * __Device assignment for VM__: If device is used in a VM, then device passthru is desirable. In this case, IOMMU should be used.
+  * __Device assignment for VM__: If a device is used in a VM, device passthrough is desirable. In this case, the IOMMU should be enabled.
  
-  * __Tickless__: Frequent tick cause latency. CONFIG_NOHZ_FULL should be enabled in linux kernel.With CONFIG_NOHZ_FULL, the physical CPU will trigger much less tick timer interrupt (currently, 1 tick per second). This will reduce the impact to the VNF because each host timer interrupt triggers VM exit from guest to host and cause performance/latency impact.
+  * __Tickless__: Frequent ticks cause latency. CONFIG_NOHZ_FULL should be enabled in the Linux kernel. With CONFIG_NOHZ_FULL, the physical CPU triggers far fewer tick timer interrupts (currently, 1 tick per second). This can reduce latency because each host timer interrupt triggers a VM exit from guest to host, which hurts performance and latency.
  
-  * __TSC__: Mark TSC clock source as reliable. A TSC clock source that is thought as unreliable will cause kernel to continuous to enable clock source watchdog to check if TSC frequency is still correct. On latest Intel platform with Constant TSC/Invariant TSC/Synchronized TSC, the TSC is reliable already hence the watchdog is useless but cause latency.
+  * __TSC__: Mark the TSC clock source as reliable. If the TSC clock source is considered unreliable, the kernel keeps the clock source watchdog enabled to check whether the TSC frequency is still correct. On recent Intel platforms with Constant TSC/Invariant TSC/Synchronized TSC, the TSC is already reliable, so the watchdog is useless and only adds latency.
  
   * __Idle__: The poll option forces a polling idle loop, which can slightly improve the performance of waking up an idle CPU.
Line 43:
   * __RCU_NOCB__: RCU is a kernel synchronization mechanism. Refer to http://lxr.free-electrons.com/source/Documentation/RCU/whatisRCU.txt for more information. With RCU_NOCB, the impact of RCU on the VNF is reduced.
  
-  * __Disable the RT throttling__: RT Throttling is a Linux kernel mechanism that occurs when a process or thread uses 100% of the core, leaving no resources for the Linux scheduler to execute the kernel/housekeeping tasks. RT Throttling increases the forwarding hence should be disabled.
+  * __Disable the RT throttling__: RT throttling is a Linux kernel mechanism that kicks in when a process or thread uses 100% of the core, leaving no resources for the Linux scheduler to execute the kernel/housekeeping tasks. RT throttling increases latency, so it should be disabled.
  
-  * __NUMA configuration__: To achive the best latency. CPU/Memory and device allocated for realtime application/VM should be in the same NUMA node.
+  * __NUMA configuration__: To achieve the best latency, the CPU, memory, and devices allocated for the realtime application/VM should be in the same NUMA node.
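
Several of the operating system items above are usually selected on the Linux kernel command line. The sketch below is an illustration rather than the project's actual setup: it assembles such a command line fragment from a list of realtime CPUs. The boot parameters are standard Linux ones (`nohz_full=`, `rcu_nocbs=`, `tsc=reliable`, `idle=poll`, `intel_iommu=on`, `hugepages=`). Using `isolcpus=`, the 1 GiB hugepage size, and the function names are assumptions made for this example.

```python
def cpu_list(cpus):
    """Format CPU ids as a kernel-style list, e.g. [2, 3, 4, 6] -> '2-4,6'."""
    ranges = []
    start = prev = None
    for cpu in sorted(set(cpus)):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def realtime_cmdline(rt_cpus, hugepages_1g):
    """Build a kernel command line fragment for the given realtime CPUs."""
    cpus = cpu_list(rt_cpus)
    return " ".join([
        f"isolcpus={cpus}",       # assumption: keep the scheduler off the realtime CPUs
        f"nohz_full={cpus}",      # Tickless: needs CONFIG_NOHZ_FULL
        f"rcu_nocbs={cpus}",      # RCU_NOCB: offload RCU callbacks
        "tsc=reliable",           # TSC: skip the clock source watchdog
        "idle=poll",              # Idle: polling idle loop
        "intel_iommu=on",         # Device assignment: enable the IOMMU
        "default_hugepagesz=1G",  # assumption: 1 GiB pages
        "hugepagesz=1G",
        f"hugepages={hugepages_1g}",
    ])
```

For example, `realtime_cmdline([2, 3, 4, 5], 8)` yields a string that starts with `isolcpus=2-5 nohz_full=2-5 rcu_nocbs=2-5`. Which of these options are worth enabling still depends on the scenario being tuned.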
  
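
The IRQ affinity item above comes down to writing a CPU bitmask to `/proc/irq/<N>/smp_affinity` for each non-realtime IRQ. The following sketch shows one way to compute that mask. The CPU numbering and function names are hypothetical, and the actual write is left out because it needs root and a real IRQ number.

```python
def housekeeping_cpus(total_cpus, rt_cpus):
    """CPUs that remain available for non-realtime IRQs."""
    reserved = set(rt_cpus)
    return [cpu for cpu in range(total_cpus) if cpu not in reserved]


def affinity_mask(cpus):
    """Bitmask with one bit set per allowed CPU."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def format_smp_affinity(mask):
    """Format the mask as comma-separated 32-bit hex groups, most significant first."""
    groups = []
    while True:
        groups.append(f"{mask & 0xFFFFFFFF:08x}")
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(groups))
```

On an 8-CPU host with CPUs 2-7 reserved for realtime work, `format_smp_affinity(affinity_mask(housekeeping_cpus(8, range(2, 8))))` gives `00000003`, which restricts an IRQ to CPUs 0 and 1.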
  
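
RT throttling is controlled by two sysctls, `kernel.sched_rt_runtime_us` and `kernel.sched_rt_period_us`. With the usual defaults of 950000 and 1000000, realtime tasks may use at most 95% of each period, and writing -1 to `sched_rt_runtime_us` disables the limit. The sketch below is not from the original page and its helper names are illustrative. It computes the budget that realtime tasks currently get.

```python
def rt_budget_fraction(runtime_us: int, period_us: int) -> float:
    """Fraction of each period realtime tasks may consume (1.0 means unthrottled)."""
    if period_us <= 0:
        raise ValueError("period must be positive")
    if runtime_us < 0:  # -1 disables RT throttling
        return 1.0
    return min(runtime_us / period_us, 1.0)


def read_sysctl_int(name: str) -> int:
    """Read an integer from /proc/sys/kernel/<name> on a Linux host."""
    with open(f"/proc/sys/kernel/{name}") as f:
        return int(f.read().strip())


def current_rt_budget() -> float:
    """Budget currently configured on this host."""
    return rt_budget_fraction(
        read_sysctl_int("sched_rt_runtime_us"),
        read_sysctl_int("sched_rt_period_us"),
    )
```

A result below 1.0 means a busy-polling realtime thread can still be preempted for the remainder of each period, which is the latency source the item above describes.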
Line 55:
   * Make the vfio MSI interrupt non-threaded
 Threaded IRQs can help reduce interrupt latency because they avoid keeping interrupts disabled for too long in the interrupt handler. However, some interrupt handlers take very little time; for vfio, the only work is injecting the interrupt into the guest, which can be really fast. In such cases a threaded IRQ only adds the cost of a context switch between the IRQ thread and the interrupt handler. In addition, in NFV scenarios such realtime interrupts (like DPDK interrupts) have almost the highest priority, so making them non-threaded benefits the highest-priority application. [[https://gerrit.opnfv.org/gerrit/gitweb?p=kvmfornfv.git;a=commit;h=a233b3fef0ef0048071145eb233becffbdf96d0f | See code change]].
- 
-  * Use VMX preemption timer to emulate the lapic deadline timer 
-KVM emulates lapic deadline timer using native hrtimer facility which can cause lots of vmexits even when guest’s deadline is actually not hit. Instead, we use VMX preemption timer to emulate it If cpu is in non-root mode which cause vmexit only when the deadline is hit. 
  
   * Cache Allocation Technology (CAT) enabling
 Last level cache (LLC) contention is a key source of resource contention for memory-intensive workloads running on the same socket. Intel CAT can be used to partition the LLC among realtime and non-realtime apps/VMs.
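
CAT expresses each partition as a capacity bitmask over the cache ways, and the set bits of a mask must be contiguous. The sketch below illustrates how an LLC might be split into two non-overlapping masks, one for realtime and one for non-realtime work. The way counts and function names are assumptions for the example, and how the masks are programmed into the hardware is outside its scope.

```python
def contiguous_mask(first_way: int, num_ways: int) -> int:
    """Capacity bitmask covering num_ways contiguous ways starting at first_way."""
    if first_way < 0 or num_ways < 1:
        raise ValueError("need first_way >= 0 and num_ways >= 1")
    return ((1 << num_ways) - 1) << first_way


def partition_llc(total_ways: int, rt_ways: int):
    """Split the LLC into (realtime_mask, non_realtime_mask) with no shared ways."""
    if not 0 < rt_ways < total_ways:
        raise ValueError("rt_ways must leave at least one way for each side")
    rt_mask = contiguous_mask(total_ways - rt_ways, rt_ways)  # upper ways
    other_mask = contiguous_mask(0, total_ways - rt_ways)     # lower ways
    return rt_mask, other_mask
```

For a hypothetical 20-way LLC with 4 ways reserved for realtime work, `partition_llc(20, 4)` returns `(0xF0000, 0x0FFFF)`. Since the masks do not overlap, non-realtime workloads cannot evict lines from the realtime partition.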
nfv-kvm-tuning.1451889177.txt.gz · Last modified: 2016/01/04 06:32 by Chao Peng