Utilising vSphere Performance Monitoring Tools – Part Two: vscsiStats

The vscsiStats command allows you to troubleshoot storage performance issues for virtual machines. It collects data at the virtual SCSI device level for each I/O operation and reports this performance data as histograms for metrics such as I/O length, seek distance, number of outstanding I/Os, I/O latency and inter-arrival time.

This provides more performance data than esxtop, which in comparison only provides latency and throughput statistics. Also, as vscsiStats targets the virtual SCSI device level, it can report on virtual machine hard disks of all types. This can be useful for determining the behaviour of a workload in order to decide on the storage placement of a virtual machine.

In order to retrieve performance data for a virtual machine we first need to obtain the virtual machine worldGroupID and, if we want to filter the retrieval to a specific virtual machine hard disk, the handleID. This information can be retrieved as below:

vscsiStats -l

In this example, we will initially retrieve performance data for the virtual machine 'vm1', where the worldGroupID is '2718498' as shown in the output below.

Virtual Machine worldGroupID: 2718498, Virtual Machine Display Name: vm1, Virtual Machine Config File: /vmfs/volumes/54afe658-9ab37544-54a8-0026b9746656/vm1/vm1.vmx, {
Virtual SCSI Disk handleID: 11547 (scsi0:0)
Virtual SCSI Disk handleID: 11548 (scsi0:1)
Virtual SCSI Disk handleID: 11549 (scsi0:2)
}

Now we will start the retrieval of performance data for all hard disks of the specified virtual machine.

vscsiStats -s -w 2718498

The retrieval of performance data will now occur in the background; whilst data is being collected (which will take 30 minutes by default), you may print the histogram for a specific statistic by specifying the statistic name (ioLength, seekDistance, outstandingIOs, latency or interarrival):

vscsiStats -p latency

This will show the data retrieved so far for the latency of all I/Os, Read I/Os and Write I/Os in microseconds. In the below example we can see that no Write I/Os took longer than 15000 microseconds and that the slowest Write I/O took 7916 microseconds.

Histogram: latency of Write IOs in Microseconds (us) for virtual machine worldGroupID : 3595341, virtual disk handleID : 11916 (scsi0:0) {
 min : 207
 max : 7916
 mean : 371
 count : 13421
 {
 0 (<= 1)
 0 (<= 10)
 0 (<= 100)
 12000 (<= 500)
 1198 (<= 1000)
 222 (<= 5000)
 1 (<= 15000)
 0 (<= 30000)
 0 (<= 50000)
 0 (<= 100000)
 0 (> 100000) 
 }
}

If you require to stop the retrieval of performance data before this completes, you may do so with the following command.

vscsiStats -x

In the above example, we invoked the vscsiStats command to retrieve performance data for all virtual machine hard disks of a specific virtual machine. If we only require performance data for hard disk (scsi0:2), we can invoke the command and specify the handleID.

vscsiStats -s -w 2718498 -i 11549
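
The histogram output can also be produced in a comma-delimited form with the -c option, which is useful for copying the results off the host and charting them in a spreadsheet. Below is a minimal sketch of the per-disk workflow, assuming the worldGroupID and handleID from the example above and that your build of vscsiStats supports the -c flag:

vscsiStats -s -w 2718498 -i 11549
vscsiStats -p seekDistance -w 2718498 -i 11549 -c > /tmp/vm1-scsi0-2-seekDistance.csv
vscsiStats -x -w 2718498

The seekDistance histogram is particularly useful when deciding on storage placement, as a distribution clustered around small distances suggests a largely sequential workload, whereas widely spread values suggest random I/O.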

VMware vSphere Performance – Part Five: Optimizing Virtual Machine Resources

In previous articles I described a number of steps to optimise the performance of ESXi host system resources; now we will look at how to optimise virtual machines within the available resources.

Memory Configuration

Through its memory management techniques, ESXi is capable of reclaiming excess memory assigned to virtual machines when host system memory is exhausted. However, it is important that you configure virtual machine memory to satisfy the workload and do not over-allocate memory resources, as assigning excess memory can lead to a number of issues, for example:

  • An increase in the amount of overhead memory required to power on the virtual machine.
  • An increase in the size of the virtual machine swap file (VSWP), resulting in increased disk usage (see the sketch below).
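
To get a feel for the second point, the following minimal PowerCLI sketch (the calculated property name is my own) lists each virtual machine with its configured memory and the approximate size of the VSWP file that will be created at power-on, which is the configured memory minus the memory reservation:

Get-VM | Select-Object Name, MemoryGB, @{N="ApproxSwapFileGB";E={$_.MemoryGB - ($_.ExtensionData.ResourceConfig.MemoryAllocation.Reservation / 1024)}}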

By default, ESXi is configured to support the use of large pages. However, the guest operating system in some instances may require additional configuration in order to use large memory pages. For example, for a Windows Server 2012 server with Microsoft SQL Server installed we would need to grant the 'Lock pages in memory' privilege to the user account running the Microsoft SQL Server service so that the application will execute using large memory pages.

Network Configuration

It is recommended that you use the VMXNET3 virtual network adapter for all supported guest operating systems that have VMware Tools installed. This virtual machine network adapter is optimised to provide higher throughput, lower latency and less overhead when compared to the other virtual machine network adapter options. The driver required for the VMXNET3 virtual adapter is not provided by the guest operating system and therefore requires VMware Tools installed to supply the driver.

VMXNET3, the newest generation of virtual network adapter from VMware, offers performance on par with or better than its previous generations in both Windows and Linux guests. Both the driver and the device have been highly tuned to perform better on modern systems. Furthermore, VMXNET3 introduces new features and enhancements, such as TSO6 and RSS. TSO6 makes it especially useful for users deploying applications that deal with IPv6 traffic, while RSS is helpful for deployments requiring high scalability. All these features give VMXNET3 advantages that are not possible with previous generations of virtual network adapters. Moving forward, to keep pace with an ever-increasing demand for network bandwidth, we recommend customers migrate to VMXNET3 if performance is of top concern to their deployments.
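
As a quick check, the minimal PowerCLI sketch below reports the adapter type currently in use by a virtual machine (vm1 in this example) and, where required, changes it to VMXNET3; note that the change is typically made with the virtual machine powered off and the guest may detect the adapter as a new network device:

Get-VM vm1 | Get-NetworkAdapter | Select-Object Name, Type
Get-VM vm1 | Get-NetworkAdapter | Set-NetworkAdapter -Type Vmxnet3 -Confirm:$False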

CPU Configuration 

The CPU scheduler in ESXi schedules CPU activity and fairly grants CPU access to virtual machines using shares; it is important to configure a virtual machine with the number of vCPUs required for its workload. For application workloads that are unknown, my recommendation is to start small and increase the number of vCPUs gradually until you see acceptable and stable performance from the virtual machine workload. Enabling CPU Hot Plug for the virtual machine, where supported by the guest operating system and/or application, allows additional vCPUs to be added without incurring downtime for the virtual machine.
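
CPU Hot Plug can be enabled in the virtual machine options of the vSphere Web Client, or as a sketch with PowerCLI by setting the vcpu.hotadd parameter while the virtual machine is powered off (assuming the virtual machine name vm1):

New-AdvancedSetting -Entity vm1 -Name vcpu.hotadd -Value TRUE -Confirm:$False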

Over-committing CPU resources by assigning additional vCPUs to a virtual machine can lead to performance issues, as well as directly consuming additional memory for the associated virtual machine overhead. The additional CPU demand may exhaust the host system and cause the virtual machines on that host system to degrade in performance. Therefore, adding additional vCPUs to a virtual machine to resolve a perceived vCPU contention issue may actually add an extra burden to the host system and degrade performance further.

By default the VMkernel schedules a virtual machine's vCPUs to run on any logical CPU of the host system's hardware. In some cases you may wish to configure CPU affinity for the virtual machine. For example, you may want to troubleshoot the performance of a CPU workload when it is not sharing CPU resources with other workloads on the host system, but you do not have the ability to migrate the virtual machine to an isolated host system. You may also wish to use CPU affinity to measure throughput and response times of several virtual machines competing for CPU resources against specific logical CPUs on the host system.

One limitation of enabling CPU scheduling affinity for a virtual machine is that vMotion is not functional in this configuration; also, if the virtual machine is in a Distributed Resource Scheduler (DRS) cluster, the ability to enable CPU scheduling affinity is disabled.
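
For reference, CPU scheduling affinity is normally configured in the virtual machine's resource settings in the vSphere Web Client; the following is a minimal PowerCLI sketch of the equivalent change through the vSphere API, assuming a virtual machine named vm1 whose vCPUs are to be pinned to logical CPUs 0 and 1:

$VM = Get-View -ViewType VirtualMachine -Filter @{"Name" = "vm1"}
$Spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$Spec.CpuAffinity = New-Object VMware.Vim.VirtualMachineAffinityInfo
$Spec.CpuAffinity.AffinitySet = 0,1
$VM.ReconfigVM($Spec)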

Storage Configuration 

The placement of a virtual machine on a datastore may have a significant impact on the performance of that virtual machine, as the I/O requirements of all virtual machines on that shared resource may result in I/O latency if the underlying storage array is unable to meet the demand. To optimise the placement of virtual machine I/O you may use Storage vMotion to migrate virtual machines to datastores that have fewer competing workloads, or that are configured for better performance, when an I/O latency threshold is exceeded.
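
A Storage vMotion migration can be initiated from the Migrate wizard in the vSphere Web Client, or with the Move-VM cmdlet in PowerCLI; a minimal sketch, assuming the virtual machine vm1 and a less contended datastore named Datastore-02:

Move-VM -VM vm1 -Datastore Datastore-02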

When provisioning a virtual machine, the default virtual SCSI controller type is based on the guest operating system and in most cases will be adequate for the virtual machine workload. However, if this is not sufficient to satisfy the virtual machine workload, using a VMware Paravirtual SCSI (PVSCSI) controller can improve performance with higher achievable throughput and lower CPU utilization in comparison to other SCSI controller types. As with the VMXNET3 virtual network adapter, this requires VMware Tools to be installed to provide the appropriate drivers for the supported guest operating system.
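
The controller type can be changed in the virtual machine hardware settings, or as a minimal PowerCLI sketch (assuming the virtual machine vm1 is powered off and the guest operating system already has the PVSCSI driver available so that it can still boot):

Get-VM vm1 | Get-ScsiController | Set-ScsiController -Type ParaVirtual -Confirm:$False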

As discussed in a previous performance blog (http://wp.me/p15Mdc-u7), configuring the virtual machine disk type as eager-zeroed thick provides the best-performing virtual machine disk type, as these disks do not need to obtain new physical disk blocks or write zeroes during normal operations.
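
For example, a new eager-zeroed thick virtual disk can be added with PowerCLI; a minimal sketch, assuming the virtual machine vm1 and a 40GB disk:

New-HardDisk -VM vm1 -CapacityGB 40 -StorageFormat EagerZeroedThick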

If you need to remove the VMFS layer to satisfy I/O performance requirements, in some instances you may configure a virtual machine to use a raw device mapping (RDM).
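
As a sketch, an RDM can also be attached with PowerCLI, where the naa identifier below is a placeholder for the console device name of the LUN presented to the host:

New-HardDisk -VM vm1 -DiskType RawPhysical -DeviceName "/vmfs/devices/disks/naa.xxxxxxxxxxxxxxxx"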

 

VMware vSphere Performance – Part Four: Optimizing ESXi Host Storage

It is important to ensure the maximum queue depth of the HBA is configured to meet the manufacturer's and VMware's recommendations, which may depend on the combination of ESXi version and the HBA model and version.

For QLogic HBAs, the default queue depth values are configured as below:

[Table: default queue depth (AQLEN) values for QLogic HBAs]

To view a list of loaded modules on the host system, invoke the following from the vSphere Command Line interface:

esxcli system module list

In my example the module loaded for the HBA is using the QLogic native driver, so I can search the output for a match on the module name:

esxcli system module list | grep qln

In order to determine the current queue depth for an HBA, we will need to identify the device name of the HBA by selecting a host in the vSphere Web Client and browsing to Manage > Storage > Storage Adapters.

[Screenshot: Storage Adapters view in the vSphere Web Client showing the HBA device names]

Once we have identified the device name of the HBA we will use esxtop from the ESXi shell to determine the current queue depth.

1) Press ‘d’ to display statistics for storage adapters.

2) Press 'f' to display the available fields, press 'D' to select 'Queue_Stats' (AQLEN), and press Enter to return to the statistics.

[Screenshot: esxtop storage adapter view showing the AQLEN column]

In order to configure the maximum queue depth we first need to retrieve the configuration parameters for the particular module loaded, in my example 'qlnativefc':

esxcli system module parameters list -m qlnativefc

From the output of the parameters we can determine that the parameter we need to modify is 'ql2xmaxqdepth'; the change can be performed as below to set the maximum queue depth value to '32'.

esxcli system module parameters set -p ql2xmaxqdepth=32 -m qlnativefc

It is recommended to maintain consistent configuration settings for all host system HBAs in a cluster; therefore, having made the above change on one host system, you should ensure that it is implemented on all host systems containing identical HBAs in the cluster. In the event that the host systems in a cluster contain mixed HBAs, it is recommended to configure a uniform maximum queue depth value across all hosts.
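
The same esxcli namespace can be driven from PowerCLI against every host system in the cluster; a minimal sketch, assuming a cluster named Cluster-01, the qlnativefc module and the V2 Get-EsxCli interface:

foreach ($VMHost in Get-Cluster Cluster-01 | Get-VMHost) {
    $EsxCli = Get-EsxCli -VMHost $VMHost -V2
    # Set the maximum queue depth parameter for the QLogic native driver module on each host
    $EsxCli.system.module.parameters.set.Invoke(@{module = "qlnativefc"; parameterstring = "ql2xmaxqdepth=32"})
}

As with the esxcli change above, the new maximum queue depth takes effect after each host system has been rebooted.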

 

VMware vSphere Performance – Part Three: Optimizing ESXi Host CPU

In order to support ESXi, a host system requires a minimum of two CPU cores, but ultimately you need to ensure that your host has sufficient CPU resources to satisfy the CPU demand of the virtual machines and the VMkernel. It is also recommended to use CPUs that provide hardware-assisted virtualization, as the performance of virtual machines can be significantly improved when sensitive events and instructions are trapped in hardware, offloading this work from the hypervisor.

[Figure: hardware-assisted virtualization, with the VMM running in root mode below ring 0]

First generation enhancements include Intel Virtualization Technology (VT-x) and AMD’s AMD-V which both target privileged instructions with a new CPU execution mode feature that allows the VMM to run in a new root mode below ring 0. As depicted in Figure 7, privileged and sensitive calls are set to automatically trap to the hypervisor, removing the need for either binary translation or paravirtualization. The guest state is stored in Virtual Machine Control Structures (VT-x) or Virtual Machine Control Blocks (AMD-V).

Due to high hypervisor to guest transition overhead and a rigid programming model, VMware’s binary translation approach currently outperforms first generation hardware assist implementations in most circumstances. The rigid programming model in the first generation implementation leaves little room for software flexibility in managing either the frequency or the cost of hypervisor to guest transitions. Because of this, VMware only takes advantage of these first generation hardware features in limited cases such as for 64-bit guest support on Intel processors.

As well as enabling hardware assisted virtualization (Intel VT-x or AMD-V), it is also recommended to enable the following settings in the BIOS:

  • Ensure all installed CPU sockets and cores are enabled.
  • Intel Turbo Boost – Allows the CPU to run faster than the frequency specified in its thermal design power (TDP) configuration when requested by the hypervisor, provided the CPU is operating below its power, current and temperature limits.
  • Hyperthreading – Allows for two independent threads to run concurrently on a single core.

By default, if hyperthreading is enabled in the BIOS, the ESXi host system will automatically use hyperthreading. However, the default behaviour can be modified in the vSphere Web Client by selecting a host system, browsing to Manage > Settings > Hardware > Processors, selecting Edit and unchecking the hyperthreading enabled option; the host system will require a restart to apply the change.

[Screenshot: disabling hyperthreading under Manage > Settings > Hardware > Processors]

To prevent an ESXi host system from using hyperthreading with PowerCLI, we can invoke the Get-View cmdlet to retrieve the host's CpuScheduler and call the DisableHyperThreading method.

$HostSystem = "esxi1host.domain.local"

$CpuScheduler = Get-View (Get-View -ViewType HostSystem -Property ConfigManager.CpuScheduler -Filter @{"Name" = $HostSystem}).ConfigManager.CpuScheduler

$CpuScheduler.DisableHyperThreading()

It is also recommended to disable in the BIOS any hardware devices that will not be used, such as a serial port, to prevent CPU cycles being consumed by those devices.

VMware vSphere Performance – Part Two: Optimizing ESXi Host Networking

An ESXi host system requires a minimum of one network interface card (NIC); for redundancy, a host system should be configured with at least two NICs in order to satisfy VMkernel and virtual machine demand. Virtual machine communication on a host consumes CPU resources, and therefore sufficient CPU resources must also be available for concurrent VMkernel and vMotion network activity.

In some instances (as discussed in http://wp.me/p15Mdc-ua) Direct I/O may be configured to ensure high throughput for a virtual machine workload, which requires the host system to have hardware-assisted virtualization (Intel VT-d or AMD-Vi) enabled in the BIOS.

A host system can use multiple physical CPUs to process network packets from a single network adapter by enabling SplitRx mode, which can improve performance for specific workloads, in particular multicast traffic. By default, the feature is automatically enabled on a VMXNET3 virtual network adapter when the host system detects that a single network queue on a physical NIC is heavily utilised and servicing eight virtual machine network adapters with evenly distributed loads.

SplitRx mode can be enabled on the host system by modifying the 'Net.NetSplitRxMode' configuration parameter; by default SplitRx mode is enabled with the value of '1'. To disable SplitRx mode using the vSphere Web Client, select a host system, browse to Manage > Settings > System > Advanced Host System Settings and edit the value of the configuration parameter 'Net.NetSplitRxMode' to '0'.

Alternatively, we can retrieve the current value of the configuration parameter using the 'Get-AdvancedSetting' cmdlet and configure the value using the 'Set-AdvancedSetting' cmdlet:

Get-AdvancedSetting -Entity esxi1host.domain.local -Name Net.NetSplitRXMode | Select Entity, Name, Value
Get-AdvancedSetting -Entity esxi1host.domain.local -Name Net.NetSplitRXMode | Set-AdvancedSetting -Value 0 -Confirm:$False
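
SplitRx mode can also be controlled per virtual network adapter rather than host-wide using the ethernetX.emuRxMode virtual machine parameter, where X is the adapter number; a minimal sketch that explicitly enables it for the first adapter of vm1, assuming the adapter is VMXNET3 and that the change is applied at the next power-on of the adapter:

New-AdvancedSetting -Entity vm1 -Name ethernet0.emuRxMode -Value 1 -Confirm:$False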

VMware vSphere Performance – Part One: Optimizing ESXi Host Memory

Once you have assessed the hardware requirements for ESXi host systems (a minimum of 2GB is required to install) and ensured sufficient memory resources are provided to satisfy the demand of the virtual machines, system overhead and the level of failure protection required, we shall look at various options for optimising ESXi host system memory.

Hardware Assisted Memory Management Unit (MMU) Virtualization

Hardware Assisted Memory Management Unit (MMU) Virtualization allows for enhanced virtual machine performance, as the host system hardware provides an additional layer of page tables in hardware mapping guest operating system memory to host system physical memory. There is therefore no requirement to maintain shadow page tables, reducing overhead memory consumption.

  • Implemented by Intel using extended page tables (EPTs).
  • Implemented by AMD with rapid virtualization index (RVI).

In some instances the performance benefit may be negated if the virtual machine's workload results in a high frequency of misses in the translation lookaside buffer (TLB), as the time required for the host system to service a TLB miss is increased due to the absence of shadow page tables. However, the cost of TLB misses can be reduced by the guest operating system utilising large pages for the virtual machine's workload.

From the performance evaluations below we can see the following performance gains:

Intel EPT – results in performance gains of up to 48% for MMU-intensive benchmarks and up to 600% for MMU-intensive microbenchmarks.

AMD RVI – results in performance gains of up to 42% for MMU-intensive benchmarks and up to 500% for MMU-intensive microbenchmarks.

In both cases, the use of large pages are recommended for TLB-intensive workloads.

http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf

http://www.vmware.com/pdf/RVI_performance.pdf

Memory Scrub Rate

Where the host system uses ECC memory, the memory scrub rate can be set in the BIOS; this process consists of reading from each memory location, correcting bit errors (if any) with an error-correcting code (ECC), and writing the corrected data back to the same location. It is recommended to configure the memory scrub rate to the manufacturer's recommendation, which is most likely to be the default BIOS setting.

Non-uniform memory access (NUMA)

Non-uniform memory access (NUMA) allows a processor to access its own local memory faster than non-local memory (memory local to another processor or memory shared between processors). The benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users. Therefore, in most cases VMware recommends disabling node interleaving, which will enable NUMA and allow the host system to optimally place each page of a virtual machine's virtual memory.

There is also a design consideration to take into account when sizing memory for virtual machines, as when NUMA is enabled the CPU scheduler uses NUMA optimisations to assign each virtual machine to a NUMA node, keeping vCPUs and memory in the same location to preserve memory locality. If a virtual machine cannot be assigned within a single NUMA node, the CPU scheduler spreads the load across all sockets in a round-robin manner, similar to a host where NUMA is not enabled or supported.

The impact of the CPU scheduler spreading the load across all sockets will result in the virtual machine not achieving NUMA optimizations as the virtual machine load will be spread across the host system.

For more information on NUMA see the below:

https://pubs.vmware.com/vsphere-51/index.jsp#com.vmware.vsphere.resmgmt.doc/GUID-7E0C6311-5B27-408E-8F51-E4F1FC997283.html

Memory Overhead

Memory overhead is required for the VMkernel and host agents on each host system; this can be reduced with the use of a system swap file, which allows up to 1GB of memory to be reclaimed when the host system is under pressure. By default this option is disabled, but the host system can be configured to use a system swap file by invoking the below from the vSphere Command Line interface, where the example enables a system swap file on the datastore named 'Datastore-01'.

esxcli sched swap system set -d true -n Datastore-01
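
The resulting configuration can be verified from the same interface; a minimal sketch, assuming the get namespace is available in your ESXi release:

esxcli sched swap system get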

As well as the memory overhead on the host system each virtual machine will require a memory overhead to support the following:

  • VM executable (VMX) process – required to bootstrap and support the guest operating system.
  • VM monitor (VMM) – contains the virtual machine hardware data structures, such as memory mappings and CPU state.
  • Virtual machine hardware devices
  • Subsystems, such as kernel and management agents

Once a virtual machine is powered on, the memory required for the VMM and virtual hardware devices is reserved. By utilising a VMX swap file you can reduce the VMX memory reservation from 50MB or more to approximately 10MB per virtual machine; the total size of the VMX swap file is approximately 100MB.

By default the host system will create a VMX swap file in the virtual machine's working directory. However, the behaviour of the VMX swap file can be controlled through the virtual machine's configuration file.

As discussed, the VMX swap file is created by default when a virtual machine is powered on; if you need to disable this behaviour, set the 'sched.swap.vmxSwapEnabled' parameter value to 'False'.
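
As a sketch, this can be set with the same New-AdvancedSetting approach used elsewhere in this article (assuming the virtual machine vm1; the change takes effect the next time the virtual machine is powered on):

New-AdvancedSetting -Entity vm1 -Name sched.swap.vmxSwapEnabled -Value FALSE -Confirm:$False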

The VMX swap file is not related to the virtual machine swap file (VSWP), which allows a virtual machine guest operating system to consume less physical memory than configured.

By default, the host system stores each virtual machine's swap file in the same directory as the virtual machine. However, the swap file location can be configured on the host system to store the swap files in a specific datastore.

This can be modified by selecting the host system in the vSphere Web Client, browsing to Manage > Settings > Virtual Machines > Swap file location, enabling 'Use a specific datastore' and selecting the datastore to be used to store the virtual machine swap files.

Alternatively you can configure the swap file location using PowerCLI by invoking the 'Set-VMHost' cmdlet and specifying the VMSwapfileDatastore parameter, as below:

Get-VMHost esxihost1.domain.local | Set-VMHost -VMSwapfileDatastore Datastore-01 -Confirm:$False

In order for the virtual machine swap files to be created in the specified datastore, either invoke a vMotion task or apply the above with the virtual machine powered off and then power it on. One condition for creating the virtual machine swap files in this location is that the virtual machine must inherit the swap file policy from the host system. To retrieve virtual machines that are not configured with this behaviour, we can invoke the following from PowerCLI:

(Get-VMHost esxihost1.domain.local | Get-VM  | Where-Object {$_.VMSwapFilePolicy -ne "Inherit"}).Name

It is important to ensure that the datastore selected to store the virtual machine swap files is attached to each host system in the cluster, as this can impact vMotion performance for the affected virtual machine(s). If a virtual machine is unable to store its swap files in the specified datastore, the behaviour is to revert to the default and store the swap files in the same directory as the virtual machine.

You may also specify a location for the VMX swap file by configuring the virtual machine parameter 'sched.swap.vmxSwapDir' in the vSphere Web Client: select a virtual machine, browse to Manage > Settings > VM Options > Advanced > Configuration Parameters > Edit Configuration > Add Row, then add the parameter name 'sched.swap.vmxSwapDir' with the datastore location in which to create the VMX swap file as the value.

Again, you can use PowerCLI to make the above change by invoking the ‘New-AdvancedSetting’ cmdlet:

New-AdvancedSetting -Entity vm1 -Name sched.swap.vmxSwapDir -Value "/vmfs/volumes/Datastore-01"

The below article provides a sample of virtual machine memory overhead values required to power on a virtual machine:

https://pubs.vmware.com/vsphere-55/index.jsp?topic=%2Fcom.vmware.vsphere.resmgmt.doc%2FGUID-B42C72C1-F8D5-40DC-93D1-FB31849B1114.html

 

Automating the simulation of random memory workloads using the TestLimit utility with PowerShell

I was recently looking at simulating memory usage across a number of VMs where the utilisation pattern, time duration of the memory load and idle time between workloads would be random.

I was looking at using the Sysinternals TestLimit utility to achieve this, by leaking and touching memory to simulate resource usage.

As I wanted to generate random values, I will be using the Get-Random cmdlet with minimum and maximum values specified for the range.

One of the requirements of this script is to simulate memory resource usage in a continuous loop; this is achieved by providing the While statement with a condition that is always true.

While ($True)
   {

I will then generate random values for the duration to run the TestLimit utility and the percentage of memory to touch, where minimum and maximum values are specified to the Get-Random cmdlet.

$Date = Get-Date
$Minutes = Get-Random -Minimum 20 -Maximum 120
$Duration = $Date.AddMinutes($Minutes)
$Utilisation = Get-Random -Minimum 20 -Maximum 60

As I am specifying memory utilisation as a percentage, I will need to calculate this from the total physical memory installed in the host and truncate the result to round towards zero. The TestLimit utility requires that the memory amount is specified in MB, so the value is also converted following the percentage calculation.

$ComputerSystem = Get-WmiObject Win32_ComputerSystem 
$Memory = [math]::Truncate(($ComputerSystem.TotalPhysicalMemory / 1MB) / 100 * $Utilisation)

Now that I have the values I wish to pass as arguments to the TestLimit utility for the amount of memory to touch and leak, I will start the process, where TestLimit will leak and touch memory (-d) up to the total amount calculated above (-c), with the working directory of the utility set to 'C:\TestLimit'.

Start-Process testlimit64.exe -ArgumentList "-d -c $Memory" -workingdirectory "C:\TestLimit"

As I want to run the process only until the current date is greater than or equal to the random duration generated above, the PowerShell script will be paused using the Start-Sleep cmdlet, checking the current date every 60 seconds and comparing it to the duration.

Do
   { 
   Start-Sleep -Seconds 60
   }
Until ((Get-Date) -ge $Duration)

Once the current date is greater than or equal to the duration, the testlimit64 process will be terminated and the script will be paused for a random period of time generated from the minimum and maximum values. Once the script resumes, the script block is invoked again, as we want this process to run in a continuous loop.

Get-Process | Where-Object {$_.Name -eq "testlimit64"} | Stop-Process
$Sleep = Get-Random -Minimum 180 -Maximum 720
Start-Sleep -Seconds $Sleep 
}

In this instance the PowerShell script is invoked by a scheduled task at computer startup (a sketch of registering such a task is included after the download links below). The full script can be downloaded from:

https://app.box.com/s/lq0b6r9srjs5nmm55cta

The TestLimit utility can be downloaded from:

http://live.sysinternals.com/WindowsInternals/testlimit64.exe
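
Finally, as a sketch of how the script might be registered to run at computer startup on Windows Server 2012 or later (the script path and task name below are placeholders of my own):

$Action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-ExecutionPolicy Bypass -File C:\TestLimit\Simulate-MemoryLoad.ps1"
$Trigger = New-ScheduledTaskTrigger -AtStartup
Register-ScheduledTask -TaskName "Simulate-MemoryLoad" -Action $Action -Trigger $Trigger -User "SYSTEM" -RunLevel Highest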