VMware vSphere: ESXi Host Troubleshooting
ESXi Host Understanding and Overview
I want to focus on troubleshooting ESXi in this KB more than on an overview, but if we are going to troubleshoot things it is good to at least understand what you have to work with. Before we get too deep into things, let's go over some basic terminology to be sure we are all on the same page. A virtualized environment has many different components and layers, so it is important to use proper terminology when speaking with one another:
- Datastore - This is a logical container that holds the files that comprise a virtual machine. A datastore backed by a SAN-provided LUN (via iSCSI or FC) is formatted with the Virtual Machine File System (VMFS), which is maintained by ESXi. A datastore can also be backed by an NFS share from a NAS, in which case the file system is owned by the NAS rather than by ESXi.
- Guest Operating System (Guest OS) - This refers to the operating system installed inside a Virtual Machine (VM) residing on the ESXi Host.
- Host - This is the physical host (ESXi) not the Virtual Machines.
- Hypervisor - Simply put, the hypervisor serves as a platform for running multiple Operating Systems (software based). ESXi is the hypervisor when referring to a vSphere environment.
- Raw Device Mapping (RDM) - An RDM provides a way for a virtual machine to have direct access to a LUN using the ESXi Host's physical storage subsystem (e.g. FC or iSCSI). When a VM uses a LUN as an RDM, the VM controls the filesystem on that LUN and not ESXi. This means that if a Windows VM is using a LUN as an RDM, that LUN would be formatted with NTFS, but lower level functionality (like MPIO for example) would still be controlled by the ESXi host.
- Virtual Machine - A virtual machine is a software computer that, like a physical computer, runs an operating system and applications. A virtual machine is made up of configuration files and is backed by the physical resources of a host (ESXi in this case).
- VMkernel - VMkernel is the Operating System that runs directly on the ESXi host. VMkernel manages the physical resources on the hardware, including memory, physical processors, storage, and networking controllers.
All right, now that we have a better understanding of some basic terminology and we can speak the same lingo, a quick overview of the ESXi environment is definitely in order. I have found that pictures best describe this and we can then explain it as we go:
As you can probably guess by the terminology above, VMkernel is very important for ESXi as it is essentially an equivalent to an Operating System. It provides the means for running all processes on the ESXi Host. It also has control of all hardware and manages the resources for the applications.
The following run directly within VMkernel:
- Resource scheduling - This is where CPU Scheduling, Memory Scheduling, Network Bandwidth, and Storage Bandwidth are controlled.
- VMFS - The “Virtual Machine File System” is a Cluster File System and is created by VMware to enable the best optimization available for VMs. It is one of the components that provides ESXi with the ability to share resources across hosts in clusters. It also enables things such as Snapshots, vMotion, Storage vMotion, DRS, etc to run.
- Storage and Networking - Virtual NICs and vSwitches run directly inside VMkernel along with the storage and network stack. More verbose details will be given on these later.
- Device drivers - Both VMware native and 3rd Party drivers (for example QLogic) are contained within the VMkernel. Like any drivers, they are critical for proper functionality with hardware. VMware Support Teams are hyper-focused on drivers being correct!
The main processes that run on top of VMkernel which you will want to understand are:
- DCUI - This is a low level management interface that is accessed through the console of the server. This is primarily used for configuration, but comes in handy for troubleshooting if management agents (listed below) are unresponsive.
- hostd - This process is what provides the interface to VMkernel and is used by VI Client connections (also known as VIC or "thick client" or "Web Client"). Essentially it is what allows for remote connectivity to manage the ESXi host. It is also responsible for authenticating users and keeps track of which users / groups have which privileges.
- vpxa (vCenter Agent) - This process is used to communicate to the vCenter Server (if one is in use). It is the intermediary between hostd agent and vCenter Server. If this agent fails, then communication with vCenter also fails and you will no longer be able to manage the ESXi host in vCenter until the issue is resolved. If vCenter is not in use, then this process is not used.
- Virtual Machine Monitor (VMM) - This is a very critical process as it is what provides the execution environment for a VM (i.e. what allows VMs to be powered on and partially what allows VMs to be independent from one another). This is one reason why having VMware Tools installed on VMs is so critical, as it actually creates a hook into VMM and allows for better monitoring when using things such as FT / HA / etc.
All right, so why did we go over this information? Why is it important to you? The answer is simple: it helps with everyday troubleshooting!
Let's look at a few examples to drive the point home:
- I am working on a case where the customer is complaining that Pure Storage disconnected from their ESXi host. Well, I don't even know where to start, what should I do? Well, if we look at the information above we know that the storage and networking stacks run within VMkernel. In cruising through the logs I happen to see that there is a file called "vmkernel.log". Since I know that storage and networking correlate directly with VMkernel, I look in the file and voila: I see all the errors related to the issue the customer just described and we help them resolve it!
- A customer calls into Pure Storage and says "Since adding Pure Storage to our environment we are seeing our ESXi host and vCenter Server disconnect from one another. What have you done?!". Well, we know that ESXi and vCenter communicate via vpxd (vCenter side) and vpxa (ESXi side), so we decide to look in... drumroll... vpxa.log; sure enough we find that there is a bug within vpxa that is stopping communication. We prove this and get VMware involved, who resolve the issue, and everyone is happy! Yay!
- The customer opens a ticket stating that they have a single ESXi host that isn't managed by vCenter Server. They are having performance problems on the ESXi host and swear to you that they are not doing anything on the ESXi host at the time of the issue. Well, we all trust our customers, but we want to be sure. So we decide to look in "hostd.log" and see what calls were being made from the VIC to the ESXi host around the time of the problem. What is this? 250 VMware snapshots triggered on the ESXi host at the same time every day that the performance problems hit? Well that has to be a coincidence, right?
Okay, so I think at this point we get the importance of understanding what is responsible for specific actions and why it matters where it runs and what it interacts with. This is a basic construct of the ESXi host, but since we are not VMware we don't need to know everything, just enough to be dangerous and troubleshoot cases that come our way. The information above gives us what we need to understand about the architecture.
ESXi Host Troubleshooting
Identifying PURE LUNs
One of the nice things about VMware is how it identifies each of the attached devices via the "NAA" (Network Address Authority) identifier. This makes it easy to look at the ESXi host and quickly know which LUNs are from the Pure Array and which ones are not. This is also helpful when reviewing the logs, since it lets you include or exclude what may or may not be applicable to us. Additionally, the last 24 characters of the identifier are the serial number of the LUN as it is recorded on the FlashArray.
So let's look below at an example of an NAA identifier for one of our LUNs, and then we will explain how to determine this:
So for the first part, let's look at the beginning of the identifier to determine whether this is a Pure LUN or not. I have highlighted in red above the portion of the unique identifier that lets us know this is a Pure LUN, more specifically the "624a937". Whenever you see this identifier on a LUN you know it belongs to a Pure Storage FlashArray.
So what if we want to determine what LUN that would correlate to on our FlashArray? You take the last 24 characters in the identifier to decipher this:
If you look above, highlighted in blue, you will note that this is the entire serial number of a Pure LUN. If we look on the array and do a grep for this while running "purevol list" you will see this output:
root@purearray-ct0:~# purevol list | grep -i a78e6e1d4bacd0960001001a
CLFDEV01_26 2T - 2014-05-27 15:05:41 EDT A78E6E1D4BACD0960001001A
As you can see from this example, we would be working with a LUN called "CLFDEV01_26". Now you can do whatever additional diagnosis you require with that LUN.
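The two checks above (is this a Pure LUN, and which serial does it map to) can be sketched in a few lines of Python. This is only an illustration: the function names are made up, and the sample identifier is an assumption assembled from the "624a937" prefix and the serial shown in the purevol output, with the single character between them assumed to be "0".

```python
PURE_NAA_PREFIX = "624a937"  # from the text: marks a Pure Storage FlashArray LUN
SERIAL_LENGTH = 24           # from the text: last 24 characters are the serial


def _strip_naa(naa_id: str) -> str:
    """Lower-case the identifier and drop a leading 'naa.' if present."""
    body = naa_id.strip().lower()
    if body.startswith("naa."):
        body = body[len("naa."):]
    return body


def is_pure_lun(naa_id: str) -> bool:
    """Return True if the identifier starts with the Pure Storage prefix."""
    return _strip_naa(naa_id).startswith(PURE_NAA_PREFIX)


def flasharray_serial(naa_id: str) -> str:
    """Return the last 24 characters, upper-cased as 'purevol list' prints them."""
    body = _strip_naa(naa_id)
    if not body.startswith(PURE_NAA_PREFIX):
        raise ValueError(f"{naa_id} does not look like a Pure LUN")
    return body[-SERIAL_LENGTH:].upper()


# Assumed identifier, built for illustration only (not copied from a real host).
example = "naa.624a9370a78e6e1d4bacd0960001001a"
print(is_pure_lun(example))        # True
print(flasharray_serial(example))  # A78E6E1D4BACD0960001001A
```

The returned serial is what you would grep for in "purevol list" on the array.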
VMware Log Files
VMware has A LOT of different log files depending on where / what you are looking at. Since 9 times out of 10 you will be looking at ESXi logs (sometimes vCenter logs) we will start with ESXi Host logs and then add more later. Please note that this is not an exhaustive list, but it is what you will most likely be using in the majority of your cases. You can find the logs listed below in two places within a VMware Support Bundle:
/var/log - This directory within the VMware Support bundle has the most recent log for each of the log files. They will not be zipped and will be what was seen most recently on the ESXi Host.
/var/run/log - This is essentially the "archive" directory of the logs. Here you will find older logs and most of them zipped. These are useful when you need to go back more than a day or two. You will need to unzip these files (or use zcat / zgrep / vi) to read the contents.
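If you would rather search the current and archived logs in one pass from a workstation where the bundle is extracted, something like the following Python sketch works in the same spirit as zgrep. It assumes the zipped archives end in ".gz" and share a name prefix with the current log; the function names and the example path are hypothetical.

```python
import gzip
from pathlib import Path
from typing import Iterator, List


def iter_log_lines(path: Path) -> Iterator[str]:
    """Yield lines from a plain-text or gzip-compressed log file."""
    # Assumption: compressed archives carry a ".gz" suffix.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


def search_logs(directory: Path, stem: str, needle: str) -> List[str]:
    """Return 'file: line' for every case-insensitive match of needle."""
    wanted = needle.lower()
    matches = []
    # Every file whose name starts with the stem, in sorted name order.
    for path in sorted(directory.glob(stem + "*")):
        if not path.is_file():
            continue
        for line in iter_log_lines(path):
            if wanted in line.lower():
                matches.append(f"{path.name}: {line}")
    return matches
```

For example, search_logs(Path("var/run/log"), "vmkernel", "naa.624a937") would return every line mentioning a Pure device across all vmkernel files in that directory of an extracted bundle.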
Log File Types
- hostd.log -- These are the host management service logs. This means it includes information about virtual machine and host "Tasks" and "Events". Essentially, when you power on a VM, perform a Storage vMotion, change settings, etc. it will be logged here. You will also see some storage errors here, although most of the time the VMkernel log is the better place to look for storage issues. That is not always the case though, so I wanted to list them here. It also contains information about communication with the vSphere Client and vCenter Server (via the vpxa agent) and any SDK connections that may be in use. You will want to look at these logs whenever you suspect a potential communication issue between the ESXi host and vCenter, or to get specific storage error codes to help with troubleshooting.
- shell.log – This log is fairly self explanatory. It is a log of any SSH sessions that have been established and the commands that were run during each SSH session. This is helpful if you suspect the customer may have changed settings, upgraded firmware / drivers, etc. and you need to confirm. One thing to keep in mind is that if they changed settings within the vSphere Client you would need to look in hostd.log as described previously, since this log only pertains to SSH / shell activity.
- sysboot.log – This log is wonderful to review if the customer is having issues booting their ESXi hosts and you need to better understand why. Since there have been cases where ESXi hosts are slow to boot while connected to Pure Storage, it is good to know about this file so that you can see what is happening while the host is booting and loading all of its modules. This will typically help diagnose where the breakdown is happening. Other than for boot issues you will not typically review this file.
- syslog.log – There is nothing mystical about syslog.log; it is the same as pretty much any other Linux / Unix syslog. It has the management service initializations, watchdogs, scheduled tasks and DCUI use. One benefit of syslog.log is when you are troubleshooting iSCSI cases and want to see if there are iSCSI sessions resetting, no-op issues being seen, etc. This log can be very insightful for many different things, and I would imagine you will spend a fair amount of time in it. Another thing to keep in mind, as noted previously, is that this keeps logs from DCUI (Direct Console User Interface) usage. So if someone plugs directly in to the box within a datacenter, this is where the logging is kept. Good to know, once again, if you suspect some changes may have been made.
- vobd.log – The vobd (VMkernel Observation Events) log is a wonderful tool to have at your fingertips. This log propagates kernel level errors into a quick, easy-to-read log and helps filter through a lot of noise. For instance, if you want to see which paths failed at a specific time, which volumes went PDL (Permanent Device Loss), etc. you can look here to get a quick, filtered reference. I will usually look at this log before looking at the vmkernel log file to get a high level understanding of what the ESXi host saw during whatever storage event I am investigating. Interestingly enough, this log comes from the vob daemon (thus vobd) and is used for 3rd Party application monitoring; it just so happens that this file is available for review on the ESXi host itself as well.
- vmkernel.log – This will probably be your most used / reviewed log file on an ESXi host (in parallel with vobd) as it contains the core VMkernel logs (VMkernel is essentially the ESXi host OS). It includes data like device discovery, storage communication / errors, networking device communication, NIC and HBA driver / firmware events (depending on what happened), and virtual machine activity (such as powering on a VM). Whenever there is a storage communication problem this is where you want to turn. It gives you in-depth information about the affected LUNs / devices, what the SCSI Op-Codes were for the issue, what the VMware NMP (Native Multipathing Plugin) is reporting, etc. This is where you go to understand most of the issues on an ESXi host that are relevant to us. You will want to familiarize yourself with this log file since you will live in here when working these problems.
- vmkwarning.log – This is kind of like the vobd log file but it is a summary of only the "warning" and "alert" log messages pulled from the VMkernel log. Once again, it is a nice place to get a high level understanding of what happened, fast.
- vmksummary.log – This is actually quite a useful log file as it is a summary of ESXi host uptime (reboots, power on, shutdown). It shows an hourly heartbeat with uptime, number of virtual machines running, and service resource consumption. This is a good place to come see when the ESXi host was rebooted, how many VMs are actively running on this ESXi host and how it has changed over time (performance), and if an ESXi host is over taxed. Since these are just snapshots they are obviously not the "say all, end all" but they will help give you a general direction where you could go to look at applicable logs for performance issues (coming later).
That's it, you are now a VMware Log expert and should know where to look for specific types of issues. You can also see the following VMware KB here, but it is not as in-depth so I thought I would describe them in more detail here.
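As a quick cheat sheet, the descriptions above can be condensed into a small lookup of which log to open first for a given kind of problem. This is only a summary of this section expressed as Python; the topic names are labels made up for the sketch, not anything VMware defines.

```python
from typing import Dict, List

# Topic -> logs to open first, in the order suggested by the descriptions above.
LOGS_BY_TOPIC: Dict[str, List[str]] = {
    "storage": ["vobd.log", "vmkernel.log", "vmkwarning.log"],
    "iscsi_sessions": ["syslog.log", "vmkernel.log"],
    "boot": ["sysboot.log"],
    "tasks_and_events": ["hostd.log"],
    "ssh_commands": ["shell.log"],
    "dcui_changes": ["syslog.log"],
    "uptime_and_reboots": ["vmksummary.log"],
}


def logs_for(topic: str) -> List[str]:
    """Return the suggested logs for a topic, or an empty list if unknown."""
    return list(LOGS_BY_TOPIC.get(topic, []))


print(logs_for("storage"))  # ['vobd.log', 'vmkernel.log', 'vmkwarning.log']
```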
This command shows you the individual LUNs attached to the ESXi Hosts and all of the working paths. This is very useful for knowing pathing configuration.
This command is excellent to use as it prints the mappings for the VMFS volumes (datastores) to the NAA and VMware UUID identifiers.
esxcli storage core adapter list
This is a quick command that lists the vmhbas in the ESXi Host so you know which physical (or at times virtual) hardware you're working with.
esxcli storage core device vaai status get
This command will show you each LUN that is attached to the ESXi Host and what individual VAAI capabilities they report. If you ever suspect VAAI is not negotiated, look at this output for confirmation.
esxcli storage nmp device list
This command shows which LUNs are attached to the ESXi Host as well as the "SATP" (Storage Array Type Plugin) and "PSP" (Path Selection Policy) for each LUN individually. It also provides insight into how often each LUN is switching IO paths.
vmkfstools -P -v 10 /vmfs/volumes/name_of_datastore
This command gives you all of the information you need to know about a datastore's volume. A few of the key things to note are whether ATS-Only is set (VAAI Offload), the size of the datastore, the number of partitions, the UUID, and the NAA conversion. This is a great command to refer to.
This lists the version of VMware ESXi along with its Update and Build.
Fibre Channel Commands
esxcli storage san fc events get
This command shows you what events the Fibre Channel card (FCoE at times) reports. This will report events such as PLOGIs, Link Up, Link Down, etc.
esxcli storage san fc stats get
This provides insight into any errors being seen from the individual FC vmhbas (such as Loss of sync, Loss of signal, CRCs, etc).
/proc/scsi/hba_vendor/number (example: cat /proc/scsi/qla2xxx/6)
This command gives you the driver and firmware versions of the physical HBAs in the ESXi Host. Very useful for checking the VMware "Hardware Compatibility List" (HCL) for valid configurations. This command is only available in ESXi 5.1 or lower.
/usr/lib/vmware/vmkmgmt_keyval/vmkmgmt_keyval -a (seen as vmkmgmt_keyval_-a.txt &)
This command gives you the driver and firmware versions of the physical HBAs in the ESXi Host. Very useful for checking the VMware "Hardware Compatibility List" (HCL) for valid configurations. This command is only available in ESXi 5.5 or greater.
vmkping -I vmkX 'ipaddress'
This tests a basic 'ping' from the desired ‘vmkernel’ port on the ESXi host to the specified FlashArray IP.
vmkping -d -s 8972 -I vmkX ipaddress
This allows you to test if you are able to ping the FlashArray from the desired ‘vmkernel’ port on the ESXi host using jumbo frames to ensure they are enabled throughout the network.
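The 8972 in the command above is not arbitrary. With a 9000 byte MTU, the ping payload has to leave room for the IPv4 header (20 bytes without options) and the ICMP header (8 bytes), and the -d flag prevents fragmentation so an undersized MTU anywhere in the path makes the ping fail. A small Python sketch of the arithmetic (the function name is made up):

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


print(max_ping_payload(9000))  # 8972, the value to pass to -s for jumbo frames
print(max_ping_payload(1500))  # 1472, the equivalent for a standard MTU
```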
nc -z 'ipaddress' 3260
This confirms whether or not port ‘3260’ is open throughout the network path to the FlashArray.
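If you need to run the same kind of reachability check from a machine other than the ESXi host (for example a jump box on the same iSCSI network), a rough Python equivalent of the nc test looks like this. It only proves the path from wherever it is run, not from the ESXi host itself, and the function name and timeout are assumptions for the sketch.

```python
import socket


def port_is_open(host: str, port: int = 3260, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # A completed TCP handshake is all we need, so close it right away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

Calling port_is_open("ipaddress") with the FlashArray iSCSI interface IP returns True when the iSCSI port answers and False otherwise.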
esxcli iscsi adapter auth chap get -A vmhbaX
This confirms whether or not CHAP is configured for use on the specific iSCSI HBA configured for use with the FlashArray.
This allows you to review the configuration of the 'vmkernel port' on the ESXi host. It helps confirm configuration of IP address, MTU, etc.
This allows you to review the configuration of the 'vswitch' that is configured for iSCSI use. It helps confirm configuration of MTU, VLAN, etc.
esxcli iscsi adapter list
This command gives you insight into whether the iSCSI Software Initiator is in use or a physical iSCSI card.
esxcli iscsi adapter get --adapter vmhbaX
This provides the configuration of the iSCSI vmhba that is being used.
esxcli iscsi adapter target list
This command outputs what targets are connected to the ESXi host and through which iSCSI HBAs (vmhba).
esxcli iscsi session list
This command shows iSCSI sessions to configured targets and provides iSCSI Session IDs, TargetPortal Groups, etc for troubleshooting path issues.
esxcli iscsi networkportal list
This command provides in-depth information about the iSCSI vmhba regarding the physical NICs, vmkernel ports, virtual switches, etc. in use for iSCSI. This command is critical for understanding the iSCSI configuration.
/usr/lib/vmware/vm-support/bin/nicinfo.sh (seen as "nicinfo.sh.txt")
This provides insight into NIC hardware, drivers, firmware, and all of their stats (Such as dropped packets, CRCs, etc).
NOTE: All of these commands are found in VMware Support Bundles under the "commands" folder. If you are looking for the "esxcli" commands in the Support Bundle they will be labeled as "localcli". When typing the commands on the ESXi Host SSH you can use either 'esxcli' or 'localcli' though.
Useful VMware / SCSI KB Articles
Here are KBs constantly referred to and reviewed to troubleshoot and understand VMware issues / environments:
Interpreting SCSI Sense Codes on ESXi:
In the SAN world SCSI is used for every operation. Thus when there are errors we get what are called "SCSI Sense Codes". SCSI Sense Codes provide insight into what issue is being seen, based on the responses received from the FlashArray and/or the local host's driver. Along with SCSI Sense Codes you have the "Additional Sense Code" (ASC) and "Additional Sense Code Qualifier" (ASCQ). These two work in parallel with SCSI Sense Codes to give a more in-depth description of the errors. You will work with these errors daily in a SAN environment, and this article provides great insight into both ESXi specific errors and generic SCSI errors.
Also see http://www.virten.net/vmware/esxi-sc...-code-decoder/ for a very useful tool for decoding ESXi SCSI Sense Codes.
Understanding SCSI host-side NMP errors:
This KB is awesome when you want to know what the ESXi Host's driver / firmware is reporting when issues are seen. You will want to look at it whenever you are looking at any NMP (Native Multipathing Plugin) errors within the vmkernel logs. I reference this almost daily when working on VMware tickets since NMP errors are almost always seen when tickets are opened with us. This is used in parallel with "Interpreting SCSI Sense Codes on ESXi".
SCSI Sense Keys:
This is a quick reference for understanding what the SCSI Sense Keys mean. This works in parallel with the 2 articles listed above. You will notice this is the official T10 website, which should be the "go-to" for anything SCSI.
Additional Sense Data:
This is a quick reference for understanding ASC/ASCQ codes. This works in parallel with the 3 articles listed above. You will notice this is also the official T10 website, which should be the "go-to" for anything SCSI regarding ASC/ASCQ.
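To tie the sense key, ASC, and ASCQ together, here is a minimal Python sketch that pulls those values out of a status fragment and names the sense key using the T10 sense key table. The "H: D: P: ... sense data:" layout of the sample is an assumption about how the fragment appears in vmkernel.log, and the sample itself is made up for illustration, so check it against real log lines before relying on the pattern.

```python
import re

# Sense key names as listed in the T10 sense key table.
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0x8: "BLANK CHECK",
    0x9: "VENDOR SPECIFIC", 0xA: "COPY ABORTED", 0xB: "ABORTED COMMAND",
    0xD: "VOLUME OVERFLOW", 0xE: "MISCOMPARE",
}

HEX = r"(0x[0-9a-fA-F]+)"
# Assumed layout: host / device / plugin status, then sense key, ASC, ASCQ.
SENSE_RE = re.compile(
    rf"H:{HEX} D:{HEX} P:{HEX}.*?sense data: {HEX} {HEX} {HEX}"
)


def decode_sense(line: str) -> dict:
    """Extract the status bytes and sense triple from a log fragment."""
    match = SENSE_RE.search(line)
    if match is None:
        raise ValueError("no sense data found in line")
    host, device, plugin, key, asc, ascq = (int(v, 16) for v in match.groups())
    return {
        "host_status": host,
        "device_status": device,
        "plugin_status": plugin,
        "sense_key": SENSE_KEYS.get(key, "UNKNOWN"),
        "asc": asc,
        "ascq": ascq,
    }


# Made-up fragment for illustration only.
sample = "H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0"
print(decode_sense(sample)["sense_key"])  # UNIT ATTENTION
```

The ASC/ASCQ pair (0x29 / 0x0 in the sample) is then looked up in the T10 Additional Sense Data list for the detailed meaning.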
VMware VAAI White Paper:
I use this VAAI White Paper quite regularly to review how VAAI functions and to recall default behaviors, limitations, etc. I use this in parallel with "VMware VAAI Frequently Asked Questions".
VMware VAAI Frequently Asked Questions:
This is another great reference for VAAI limitations and functionality. I would get to know this KB well since there is lots of great information!
Mr. VMware (Best Practices Confirmation)
A KB, "Fuse Tool: mr vmware", has been created to track this tool. Please refer to that KB moving forward for updates on the tool.
NOTE: Additional work is being completed on Mr. VMware to include hardware, firmware, driver, iSCSI, and other important bits of information.