Troubleshooting Random Slowness with VMs

A question that comes up quite often is what to do when users complain about slowness on their VMs. Of course, the first thing we do is try to isolate the problem.

So we go through our troubleshooting steps:

  1. Define the problem
  2. Gather detailed information
  3. Consider probable cause for the failure
  4. Devise a plan to solve the problem
  5. Implement the plan
  6. Observe the results of the implementation
  7. Repeat the process if the plan doesn’t resolve the problem
  8. Document

For VM slowness issues, we want to identify the exact nature of the problem. Identifying which VMs are slow, and whether they are all on the same host, helps to isolate the issue. First, find out whether any recent changes correlate with the slowness. Then I like to rule out networking by pinging the VM, although this is not conclusive: a VM can fail to respond to ping even when there is no networking problem.
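As a rough illustration of correlating recent changes with the start of the slowness, here is a minimal Python sketch. The change log entries, the function name `changes_before`, and the 24-hour window are all hypothetical; in practice the list would come from whatever change tracking you already keep.

```python
from datetime import datetime, timedelta

def changes_before(onset, changes, window_hours=24):
    """Return (timestamp, description) pairs logged within
    window_hours before the onset of slowness, newest first."""
    window_start = onset - timedelta(hours=window_hours)
    recent = [c for c in changes if window_start <= c[0] <= onset]
    return sorted(recent, key=lambda c: c[0], reverse=True)

# Hypothetical change log, for illustration only.
changes = [
    (datetime(2024, 3, 1, 9, 0), "HBA firmware update on esx-a"),
    (datetime(2024, 3, 4, 22, 30), "Storage controller failover test"),
    (datetime(2024, 3, 5, 7, 15), "New backup job on ds-2"),
]
onset = datetime(2024, 3, 5, 8, 0)  # when users first reported slowness

for when, what in changes_before(onset, changes):
    print(when.isoformat(), what)
```

Changes closest to the onset are listed first, since those are usually the ones worth checking before anything else.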

You then want to find out whether other VMs are experiencing the same problem and, if so, what they all have in common. If you can rule out networking, the next step is to get more details on the storage, starting with the type of storage array backing those VMs: active/active, active/passive, or ALUA.
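Finding the commonality can be sketched in a few lines of Python. This assumes you have already collected the affected VMs into a list of dictionaries; the field names (`host`, `datastore`), the VM names, and the function `shared_attributes` are made up for illustration.

```python
from collections import Counter

def shared_attributes(affected_vms, keys=("host", "datastore")):
    """For each key, report the most common value among the affected VMs
    as (value, how many VMs have it, total affected VMs)."""
    if not affected_vms:
        return {}
    summary = {}
    for key in keys:
        counts = Counter(vm[key] for vm in affected_vms)
        value, count = counts.most_common(1)[0]
        summary[key] = (value, count, len(affected_vms))
    return summary

# Hypothetical inventory of VMs reported as slow.
affected = [
    {"name": "vm-01", "host": "esx-a", "datastore": "ds-2"},
    {"name": "vm-02", "host": "esx-b", "datastore": "ds-2"},
    {"name": "vm-03", "host": "esx-a", "datastore": "ds-2"},
]

for key, (value, count, total) in shared_attributes(affected).items():
    print(f"{key}: {value} is shared by {count} of {total} affected VMs")
```

In this made-up data every slow VM sits on the same datastore but not on the same host, which would point the investigation toward storage rather than a single host.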

Using esxtop to check latency statistics such as DAVG, KAVG, QAVG, and GAVG against their thresholds is a great way to further pinpoint where performance degradation is happening.

  • High DAVG indicates a performance issue on the storage array or somewhere along the path to it.
  • High KAVG indicates a performance issue within the VMkernel. Possible causes include queuing, drivers, etc.
  • High QAVG indicates that commands are spending longer waiting in the queue. This could be an indicator of underperforming storage if DAVG is high as well.
  • High GAVG reflects the total latency seen by the guest, which is normally DAVG plus KAVG (QAVG is already counted as part of KAVG). If GAVG is high while the other latency metrics look acceptable, the issue could reside within the VM drivers or virtual hardware.
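The reasoning in the list above can be sketched as a small Python helper. The threshold defaults (25 ms for DAVG and GAVG, 2 ms for KAVG) are assumptions based on commonly quoted rules of thumb, not values from this post, and the "QAVG is more than half of KAVG" test is a simplification of my own; tune all of them for your environment. The values are in milliseconds per command, read manually from esxtop.

```python
def diagnose_latency(davg, kavg, qavg, gavg,
                     davg_limit=25.0, kavg_limit=2.0, gavg_limit=25.0):
    """Map esxtop latency counters (ms per command) to likely problem areas.
    The limits are assumed rules of thumb, not authoritative thresholds."""
    findings = []
    if davg > davg_limit:
        findings.append("DAVG high: storage array or the path to it")
    if kavg > kavg_limit:
        # Assumption: if queue time is most of the kernel time, blame queuing.
        if qavg > kavg / 2:
            findings.append("KAVG high, mostly QAVG: queuing in the VMkernel")
        else:
            findings.append("KAVG high: VMkernel (drivers, etc.)")
    if gavg > gavg_limit and not findings:
        findings.append("GAVG high alone: check VM drivers or virtual hardware")
    return findings or ["no threshold exceeded"]

# Example readings (made up): slow array, healthy kernel.
for line in diagnose_latency(davg=40.0, kavg=0.5, qavg=0.1, gavg=40.5):
    print(line)
```

A helper like this does not replace looking at esxtop directly, but it keeps the interpretation consistent from one troubleshooting session to the next.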

Along with the above steps, you also want to verify that your storage array can handle the demand from the VMs.

A lot of this information can be found in various resources online, but I like to have a tried-and-true script that I can work from when doing performance troubleshooting.