Section 7 - Troubleshooting a vSphere Deployment
Fault Tolerance (VM) - provides continuous availability for applications in the event
of server failure.
Three valid use cases for Fault Tolerance...
... Protecting business critical
applications
... Clustering custom applications which
have no other supported method
... Reducing complexity compared to other
clustering solutions
Fault tolerance...
... Requires a dedicated 10-Gbit NIC
... Supports thin provisioned disks
Three vSphere
features that are supported with Fault Tolerance...
... vSphere Data Protection
... vMotion
... Enhanced vMotion Compatibility
Three features or
devices incompatible with Fault Tolerance...
... N_Port ID Virtualization (NPIV)
... CD-ROM backed by a physical device
... 3D enabled Video Devices
The following features
are not supported with FT:
Snapshots, Storage vMotion, Linked clones,
Virtual SAN, Virtual Volumes, VM Component Protection, Storage-based policy
management and I/O filters.
kernelLatency - data counter can be used to identify suspected issues with VMs on a
host trying to send more throughput to the storage system than the
configuration on the host supports
Objective 7.1 - Troubleshoot vCenter Server, ESXi Hosts and VMs
Troubleshoot
Common Installation Issues:
Make sure your hosts meet the hardware requirements as
well as the VMware HCL.
Monitor ESXi
System Health:
The Common Information Model (CIM) allows for a standard
framework to manage computing resources and presents information via the
vSphere Client.
Note: Execute Reset Sensors from the host’s
Hardware Status tab in vCenter - to remove all the CIM data
ESXi Log Files and Locations:
/var/log/auth.log
= ESXi Shell authentication success and failure log
/var/log/dhclient.log
= DHCP client service log
/var/log/esxupdate.log
= ESXi patch and update installation log
/var/log/lacp.log
= Link Aggregation Control Protocol log
/var/log/hostd.log
= Host management service log (includes VM and host Tasks and Events,
communication with vSphere Client, vCenter, SDK)
/var/log/hostd-probe.log
= Host management service responsiveness checker
/var/log/rhttproxy.log
= HTTP connections proxied on behalf of other ESXi host webservices
/var/log/shell.log
= ESXi Shell usage logs, including enable/disable and every command entered
/var/log/sysboot.log
= Early VMkernel startup and module loading
/var/log/boot.gz
= A compressed file that contains boot log information
/var/log/syslog.log
= Management service initialization, watchdogs, scheduled tasks and DCUI use
/var/log/usb.log
= USB device arbitration events
/var/log/vobd.log
= VMkernel Observation events
/var/log/vmkernel.log
= Core VMkernel logs (including device discovery, storage and networking device
and driver events, and VM startup)
/var/log/vmkwarning.log
= A summary of Warning and Alert log messages from vmkernel.log
/var/log/vmksummary.log
= A summary of ESXi host startup/shutdown, hourly heartbeat with uptime, number
of VMs running, service resource consumption
/var/log/Xorg.log
= Video acceleration
Note: vpxa = vCenter Server Agent
vCenter Log Locations...
... on Windows - C:\ProgramData\VMware\VMware
VirtualCenter\Logs
... on VA - /var/log/vmware/vpx
vCenter Log Files:
vpxd.log =
Main vCenter Server log
vpxd-profiler.log
= Profiled metrics for operations performed in vCenter Server*
*Used by VPX
Operational Dashboard (VOD) at https://VCHostnameOrIP/vod/index.html
vpxd-alert.log
= Non-fatal info logged about vpxd process
cim-diag.log
& vws.log = CIM monitoring info
drmdump =
actions proposed and taken by DRS
ls.log =
Health reports for the Licensing Services extension
vimtool.log =
Dump of string used during installation of vCenter Server
stats.log =
historical performance data collection from ESXi hosts
sms.log =
Health reports for Storage Monitoring
Service extension
eam.log =
Health reports for ESX Agent Monitor
extension
catalina.date.log = connectivity information
and status of the VMware Webmanagement Services
jointool.log =
Health status of VMwareVCMSDS service and individual ADAM database objects, and
replication logs between linked-mode vCenter Servers
Identify Common Command Line Interface (CLI)
Commands:
esxtop - used
for real time performance monitoring and troubleshooting
vmkping -
(like ping) allows for sending traffic out a specified vmkernel interface
esxcli network
name space - used for monitoring or configuring ESXi networking
esxcli storage
name space - used for monitoring or configuring ESXi storage
vmkfstools -
allows for management of VMFS volumes and virtual disks
vimtop:
g
- in vimtop displays the top four physical CPUs
f
- display all available CPUs overview
o
- network view
k
- disk view
m
- display memory overview information
To power off a
virtual machine while connected to an ESXi host using SSH:
> vim-cmd vmsvc/power.off VMID
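The VMID in that command is normally looked up first with vim-cmd vmsvc/getallvms. A minimal Python sketch of that lookup, assuming the listing puts the numeric ID first, then the VM name, then the datastore path in square brackets (the sample text and the helper names find_vmid and power_off_command are illustrative, not part of any VMware tool):

```python
import re

# Assumed row layout: "<Vmid>  <Name>  [<datastore>] <path>.vmx  <guest>  <version>"
_ROW = re.compile(r"^(\d+)\s+(.+?)\s+\[")

def find_vmid(getallvms_output: str, vm_name: str) -> int:
    """Return the numeric VMID for vm_name from a getallvms-style listing."""
    for line in getallvms_output.splitlines():
        match = _ROW.match(line)
        if match and match.group(2) == vm_name:
            return int(match.group(1))
    raise LookupError(f"VM {vm_name!r} not found")

def power_off_command(vmid: int) -> str:
    """Build the power-off command string shown in the notes."""
    return f"vim-cmd vmsvc/power.off {vmid}"

# Hypothetical listing used only to exercise the parser.
sample = """Vmid   Name        File                          Guest OS       Version
1      web01       [datastore1] web01/web01.vmx  centos64Guest  vmx-11
12     db server   [datastore1] db/db.vmx        windows8Server64Guest  vmx-11
"""
```

The header row is skipped because it does not start with a digit, and names containing spaces still match because the name is taken as everything up to the bracketed datastore.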
Identify Fault
Tolerance Network Latency Issues:
- Use dedicated 10-Gbit network for Fault Tolerance
traffic
- Use the vmkping command to verify low sub-millisecond
network latency
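To make the sub-millisecond check repeatable, the vmkping summary can be parsed rather than read by eye. A small Python sketch, assuming the output ends with a ping-style "round-trip min/avg/max = a/b/c ms" line (the sample text is made up, and the function names are hypothetical):

```python
import re

# Assumed summary line, e.g. "round-trip min/avg/max = 0.205/0.240/0.310 ms".
_RTT = re.compile(r"min/avg/max\s*=\s*([\d.]+)/([\d.]+)/([\d.]+)\s*ms")

def avg_rtt_ms(ping_output: str) -> float:
    """Return the average round-trip time in milliseconds."""
    match = _RTT.search(ping_output)
    if match is None:
        raise ValueError("no round-trip summary line found")
    return float(match.group(2))

def is_sub_millisecond(ping_output: str) -> bool:
    """True when the average round-trip time is below 1 ms."""
    return avg_rtt_ms(ping_output) < 1.0

# Hypothetical output used only to exercise the parser.
sample = """3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.205/0.240/0.310 ms"""
```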
Objective 7.2 - Troubleshooting vSphere Storage and Network Issues
Troubleshoot
Physical Network Adapter Configuration Issues:
- Be sure that physical NICs that are assigned to a
virtual switch are configured the same on the physical switch (speed, VLANs,
MTU...)
- If using IP Hash for Load Balancing method, make sure
the physical switch side has link aggregation enabled
- If using beacon probing for network failover detection,
standard practice is to use a minimum of 3 uplinks
Troubleshoot
Virtual Switch and Port Group Configuration Issues:
- Port Group/dvPort Groups - case sensitivity is required
across hosts
- vSwitch settings must be the same across hosts (otherwise,
for example, vMotion will fail)
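Case mismatches in port group names are easy to miss by eye. A minimal Python sketch that flags names differing only by letter case, assuming the per-host name lists have already been collected by some other means (the host names and the function case_mismatches are illustrative):

```python
from collections import defaultdict

def case_mismatches(port_groups_by_host: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return port group names that differ only by letter case across hosts."""
    variants = defaultdict(set)
    for names in port_groups_by_host.values():
        for name in names:
            # Group every spelling under its lowercased form.
            variants[name.lower()].add(name)
    return {key: spellings for key, spellings in variants.items() if len(spellings) > 1}

# Hypothetical inventory: esx02 has "VM network" instead of "VM Network".
hosts = {
    "esx01": {"VM Network", "Prod"},
    "esx02": {"VM network", "Prod"},
}
```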
Troubleshoot
Common Network Issues - areas:
- Virtual Machine
- ESX/ESXi Host Networking (uplinks)
- vSwitch or dvSwitch Configuration
- Physical Switch Configuration
Troubleshoot VMFS
Metadata Consistency:
Use the vSphere On-disk Metadata Analyzer (VOMA) to
identify and fix incidents of metadata corruption (for VMFS datastores or a
virtual flash resource):
# esxcli storage vmfs extent list
# voma -m vmfs -f check -d /vmfs/devices/disks/naa.1234567...
Identify Storage
I/O Constraints:
Disk Metric: Threshold (ms): Description
KAVG: 2: The amount of time the command spends in the VMkernel
DAVG: 25: The average response time per command sent to the device
GAVG: 25: The response time as perceived by the guest OS
GAVG = DAVG + KAVG
Note: If KAVG is
> 0 it usually means I/O is backed up in a device or adapter queue.
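The thresholds and the GAVG = DAVG + KAVG relationship can be expressed as a quick check. This Python sketch uses only the threshold values from the table above; the function name check_latency is made up for illustration:

```python
# Thresholds in milliseconds, taken from the table above.
THRESHOLDS_MS = {"KAVG": 2.0, "DAVG": 25.0, "GAVG": 25.0}

def check_latency(davg_ms: float, kavg_ms: float) -> dict[str, bool]:
    """Return, per metric, whether the observed latency exceeds its threshold."""
    gavg_ms = davg_ms + kavg_ms  # GAVG = DAVG + KAVG
    observed = {"KAVG": kavg_ms, "DAVG": davg_ms, "GAVG": gavg_ms}
    return {name: observed[name] > limit for name, limit in THRESHOLDS_MS.items()}
```

For example, DAVG 24 ms with KAVG 1.5 ms keeps both components under their limits while the guest-visible GAVG (25.5 ms) crosses its threshold.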
Objective 7.3 - Troubleshoot vSphere Upgrades
Monitor tab
-> System Logs -> Export Systems Logs
Choose ESX/ESXi hosts you want to export logs from
(Optional selection) Include vCenter Server and vSphere
Web Client Logs
Specify which system logs are to be exported:
- Storage
- ActiveDirectory
- VirtualMachines
- System
- Userworld
- Performance Snapshot
Download Log Bundle!
Note: CLI tool: vm-support
Configure vCenter
Logging Options: Logging settings
Select level of detail that vCenter Server uses for log
files:
- none =
Disable logging
- error =
Errors only
- warning =
Errors and Warnings
- info =
Normal logging (Default)
- verbose =
Verbose
- trivia =
Extended Verbose
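One way to remember the levels is as an ordered list in which each level includes everything below it. The Python sketch below assumes that cumulative behaviour, which the descriptions above suggest but do not state outright; is_logged is a hypothetical helper, not a vCenter API:

```python
# Ordered from least to most detail, as listed above.
LEVELS = ["none", "error", "warning", "info", "verbose", "trivia"]

def is_logged(configured_level: str, message_level: str) -> bool:
    """True if a message of message_level is written at configured_level."""
    if message_level == "none":
        raise ValueError("messages cannot have level 'none'")
    # Assumption: a level records itself and every less detailed level.
    return LEVELS.index(message_level) <= LEVELS.index(configured_level)
```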
Objective 7.4 - Troubleshoot and Monitor vSphere Performance
Describe How Tasks
and Events are Viewed in vCenter Server:
Monitor tab -> Tasks or Events
Identify Critical
Performance Metrics:
Critical points to monitor are: CPU, Memory, Networking,
and Storage
Explain Common
Memory Metrics:
Metric =
Description
SWR/s and SWW/s = Measured in megabytes per second, these
counters represent the rate at which the ESXi host is swapping memory in from
disk (SWR/s) and swapping memory out to disk (SWW/s)
SWCUR = This
is the amount of swap space currently used by the virtual machine
SWTGT = This
is the amount of swap space that the host expects the virtual machine to use
MCTL =
Indicates whether the balloon driver is installed in the virtual machine
MCTLSZ =
Amount of physical memory that the balloon driver has reclaimed
MCTLTGT =
Maximum amount of memory that the host wants to reclaim via the balloon driver
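A rough way to read these counters together is a simple triage. The ordering in this Python sketch is an assumption made for illustration, not an official rule, and memory_pressure is a made-up name:

```python
def memory_pressure(swr_mb_s: float, sww_mb_s: float,
                    swcur_mb: float, mctlsz_mb: float) -> str:
    """Classify a VM's memory state from esxtop-style counters."""
    if swr_mb_s > 0 or sww_mb_s > 0:
        return "actively swapping"    # non-zero SWR/s or SWW/s
    if swcur_mb > 0:
        return "swapped in the past"  # swap space in use, but no current swap I/O
    if mctlsz_mb > 0:
        return "ballooning"           # balloon driver has reclaimed memory
    return "no reclamation"
```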
Explain Common CPU
Metrics:
Metric =
Description
%Used =
Percentage of physical CPU time used by a group of worlds
%RDY =
Percentage of time a group was ready to run but was not provided CPU resources
%CSTP =
Percentage of time the vCPUs of a virtual machine spent in the co-stopped state,
waiting to be co-started
%SYS =
Percentage of time spent in the ESXi VMkernel on behalf of the world/resource
pool
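esxtop shows %RDY directly as a percentage, while the vCenter performance charts report CPU ready as a summation in milliseconds. This conversion is not covered in the notes above; the Python sketch assumes the commonly used formula and the 20-second real-time chart interval:

```python
def ready_percent(ready_ms: float, interval_s: int = 20) -> float:
    """Convert a CPU ready summation (ms) into a percentage of the interval.

    interval_s defaults to 20, the assumed real-time chart sample interval.
    """
    return ready_ms * 100 / (interval_s * 1000)
```

With these assumptions, a ready summation of 1000 ms in a real-time chart corresponds to 5% ready time.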
Explain Common
Network Metrics:
Metric =
Description
MbTX/s =
Amount of data transmitted in Mbps
MbRX/s =
Amount of data received in Mbps
%DRPTX =
Percentage of outbound packets dropped
%DRPRX =
Percentage of inbound packets dropped
Explain Common
Storage Metrics:
Metric =
Description
DAVG = Average
amount of time it takes a device to service a single I/O request
KAVG = The
average amount of time it takes the VMkernel to service a disk operation
GAVG = The
total latency seen from the virtual machine when performing an I/O request
ABRT/s =
Number of commands aborted per second
Identify Host
Power Management Policy:
Power Management
Policy = Description
Not supported
= Not supported / Disabled in BIOS
High Performance
= The VMkernel detects certain power management features, but will not use them
unless the system BIOS requests them for power capping or thermal events
Balanced (Default)
= The VMkernel uses the available power management features conservatively to
reduce host energy consumption with minimal compromise to performance
Low Power =
The VMkernel aggressively uses available power management features to reduce
host energy consumption at the risk of lower performance
Custom = The
VMkernel bases its power management policy on the values of several advanced
configuration parameters
Identify
CPU/Memory Contention Issues - Monitor Performance through ESXTOP
Troubleshoot
Enhanced vMotion Compatibility (EVC) Issues:
- EVC mode ensures that all ESXi hosts in a cluster
present the same CPU level/feature set to VMs, even if the CPUs on the hosts
differ
Note: CPUs still
need to be of the same CPU manufacturer.
ESXi 6.0 supports these EVC modes:
AMD Opteron Generation: 1, 2, 3, 3 (no 3DNow!), 4, “Piledriver”
Intel Generation: “Merom”, “Penryn”, “Nehalem”, “Westmere”, “Sandy Bridge”, “Ivy Bridge”, “Haswell”
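Because every host must present the same feature set, the highest EVC mode a cluster can use is the oldest generation among its hosts. A minimal Python sketch for the Intel list above, assuming each host's newest supported mode is already known (highest_common_evc_mode is an illustrative name):

```python
# Intel EVC modes from the list above, oldest to newest.
INTEL_EVC_ORDER = ["Merom", "Penryn", "Nehalem", "Westmere",
                   "Sandy Bridge", "Ivy Bridge", "Haswell"]

def highest_common_evc_mode(host_max_modes: list[str]) -> str:
    """Return the newest EVC mode that every host in the cluster supports."""
    if not host_max_modes:
        raise ValueError("at least one host is required")
    # The host with the oldest CPU generation sets the limit for the cluster.
    return min(host_max_modes, key=INTEL_EVC_ORDER.index)
```

A mode name that is not in the list raises ValueError, which also catches an AMD host mixed into an Intel cluster.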
Overview Charts:
Display multiple data sets in one panel to easily evaluate different resource
statistics, display thumbnail charts for child objects, and display charts for
a parent and a child object
Advanced Charts: Display
more information than overview charts, are configurable, and can be printed or
exported to a spreadsheet
Objective 7.5 - Troubleshoot HA and DRS Configurations and Fault Tolerance
HA Requirements:
- All hosts must be licensed for vSphere HA
- You need at least 2 hosts in the cluster
- All hosts should be configured with static IP, or, if
using DHCP, address must be persistent across reboots
- There should be at least 1 management network in common
among all hosts
- All hosts should have access to the same VM networks
and datastores
- For VM monitoring to work, VMware tools must be
installed
- supports both IPv4 and IPv6
DRS Requirements:
- Shared Storage: can be either SAN or NAS
- Place the disks of VMs on datastores that are
accessible by all hosts
- Processor Compatibility: same vendor (AMD or Intel),
and supported family for EVC
Note: CPU
Compatibility Masks - you can hide certain CPU features from the VM to prevent
vMotion failing due to incompatible CPUs
vMotion
Requirements:
- The virtual machine configuration file for ESXi hosts
must reside on VMFS
- vMotion does not support raw disks, or migrations of
applications using MSCS
- vMotion requires a private GbE (minimum)
migration network between all of the vMotion enabled hosts
Verify
vMotion/Storage vMotion Configuration:
- Proper networking (VMkernel interface for vMotion)
- CPU compatibility
- Shared storage access across all hosts
Note: When migrating
a virtual machine, 3 available options...
... Change compute resource only
... Change storage only
... Change both compute resource and storage
Verify HA Network
Configuration:
- On ESXi hosts in the cluster, vSphere HA
communications, by default, travel over VMkernel networks, except those marked
for use with vMotion
Verify HA/DRS
Cluster Configuration:
You can monitor for errors by looking at the Cluster Operational Status and
Configuration Issues screens
Troubleshoot HA
Capacity Issues: The 3 Admission
Control Policies:
- Host failures the cluster tolerates
(default): Configure vSphere HA to tolerate a specified number of host failures
- Percentage of cluster resources reserved as
failover spare capacity: Configure vSphere HA to perform admission control
by reserving a specific percentage of cluster CPU and memory resources for
recovery from host failure
- Specify failover hosts
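For the percentage-based policy, the check amounts to comparing the current failover capacity against the configured reservation for both CPU and memory. The Python sketch below is a simplified model, assuming failover capacity is (total - reserved) / total; it ignores per-VM overhead and the VM being powered on, and the function names are hypothetical:

```python
def failover_capacity_pct(total: float, reserved: float) -> float:
    """Percentage of a cluster resource still free for failover."""
    return (total - reserved) * 100 / total

def passes_admission_control(cpu_total_mhz: float, cpu_reserved_mhz: float,
                             mem_total_mb: float, mem_reserved_mb: float,
                             configured_pct: float) -> bool:
    """True when both CPU and memory keep at least configured_pct in reserve."""
    cpu_ok = failover_capacity_pct(cpu_total_mhz, cpu_reserved_mhz) >= configured_pct
    mem_ok = failover_capacity_pct(mem_total_mb, mem_reserved_mb) >= configured_pct
    return cpu_ok and mem_ok
```

With a 25% setting, a cluster with 75% of CPU free but only 20% of memory free fails the check on memory alone.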
When troubleshooting HA, look for:
- Failed or disconnected hosts
- Oversized VMs with high CPU/memory reservations
(affects slot sizes)
- Lack of capacity/resources
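The slot-size effect is easiest to see with numbers. This Python sketch models the "Host failures the cluster tolerates" policy under some assumptions: the slot is sized by the largest CPU reservation and the largest memory reservation plus overhead among powered-on VMs, with an assumed 32 MHz CPU default when no reservation is set. The names and figures are illustrative:

```python
DEFAULT_CPU_SLOT_MHZ = 32  # assumed default when no VM has a CPU reservation

def slots_per_host(host_cpu_mhz: int, host_mem_mb: int,
                   vm_reservations: list[tuple[int, int]]) -> int:
    """Number of slots a host can hold.

    vm_reservations holds (cpu_reservation_mhz, mem_reservation_plus_overhead_mb)
    for each powered-on VM; the memory value is expected to be greater than zero.
    """
    if not vm_reservations:
        raise ValueError("at least one powered-on VM is required")
    cpu_slot = max(max(cpu for cpu, _ in vm_reservations), DEFAULT_CPU_SLOT_MHZ)
    mem_slot = max(mem for _, mem in vm_reservations)
    # The scarcer resource decides how many slots fit on the host.
    return min(host_cpu_mhz // cpu_slot, host_mem_mb // mem_slot)
```

In this model, one VM reserving 2000 MHz and 4096 MB cuts a 20000 MHz / 65536 MB host from 40 slots to 10, which is why oversized reservations show up as HA capacity problems.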
Troubleshoot HA
Redundancy Issues:
- Need to design in redundancy for a cluster's HA network
traffic (either using NIC teaming, preferably to separate physical switches, or
via a secondary management network attached to a different virtual switch)
If a virtual machine has not restarted after a host
failure, 2 possible reasons...
... Virtual machine was not protected by HA
at the time of the failure
... Insufficient spare capacity on available
hosts
Interpret the DRS
Resource Distribution Graph and Target/Current Host Load Deviation:
- Accessed from Summary tab at cluster level, under
section for VMware DRS “View Resource Distribution Chart”
- The DRS Resource Distribution Chart is used to display
both memory and CPU metrics for each host in the cluster
- The DRS process runs every 5 minutes and analyses
resource metrics on each host across the cluster
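The current host load deviation can be pictured as the standard deviation of each host's load, where load is the sum of VM entitlements divided by host capacity. The Python sketch below assumes a population standard deviation, since the exact formula DRS uses is not given here; the host names and function names are hypothetical:

```python
from statistics import pstdev

def host_load_std_dev(entitlement_by_host: dict[str, float],
                      capacity_by_host: dict[str, float]) -> float:
    """Standard deviation of normalized host loads (entitlement / capacity)."""
    loads = [entitlement_by_host[host] / capacity_by_host[host]
             for host in capacity_by_host]
    return pstdev(loads)

def is_balanced(current_deviation: float, target_deviation: float) -> bool:
    """The cluster counts as balanced when current does not exceed target."""
    return current_deviation <= target_deviation

# Hypothetical two-host cluster: one host at 80% load, the other at 20%.
capacity = {"esx01": 10000.0, "esx02": 10000.0}
entitlement = {"esx01": 8000.0, "esx02": 2000.0}
```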
Troubleshoot DRS
Load Imbalance/Overcommit Issues:
- host failure
- vCenter Server is unavailable and VMs are powered on
via host connection, or changes are made to hosts or VMs
- cluster becomes invalid if user reduces reservation on
a parent resource pool while a VM is in the process of failing over
Troubleshoot Storage
vMotion Migration Issues:
- VM disks must be in persistent mode or be RDMs
- For Virtual Compatibility Mode RDMs, you can migrate
the mapping file, or convert to thick/thin during migration, as long as
destination is not NFS
- For Physical Compatibility Mode RDMs, you can migrate
mapping file only
Two scenarios that
can cause Storage DRS to be disabled on a virtual disk...
... The disk is a CD-ROM/ISO file
... The virtual machine is a template
vMotion Resource
Maps:
Provide a visual representation of hosts, datastores, and
networks associated with the VM. Also which hosts in the VM’s cluster or
datacenter are compatible.
Identify Fault
Tolerance Requirements:
- physical CPUs must be compatible with vMotion or EVC
- physical CPUs must support hardware MMU virtualization
(Intel EPT or AMD RVI)
- dedicated 10-Gbit network for FT logging
- vSphere Standard and Enterprise allow up to 2 vCPUs for
FT
- vSphere Enterprise Plus allows up to 4 vCPUs for FT
Features NOT
supported if a VM is protected by Fault Tolerance:
- VM snapshots
- Storage vMotion
- Linked Clones
- Virtual SAN
- VM Component Protection (VMCP)
- Virtual Volume datastores
- Storage-based policy management
- I/O filters
When disabling
Distributed Resource Scheduler (DRS) Cluster on vSphere 6.x Cluster...
... The resource pool hierarchy of the DRS
cluster is removed
... The affinity settings of the DRS cluster
are removed and not maintained when DRS is re-enabled
Features supported
when using Fault Tolerance in vSphere 6.x (include)...
... vMotion
... vSphere Distributed Switches
Miscellaneous
If the vSphere Client is connected directly to
an ESXi host - an administrator is unable to access the Clone Virtual
Machine wizard
If you do not see the
Hardware Status tab in the vSphere Web Client, two possible explanations...
... The Hardware Status Plug-In is disabled
... The VMware VirtualCenter Management
Webservices service is not running
To address the
warning “This host currently has no management network redundancy”...
... Add an additional uplink to the
management vmknic
... Include the advanced HA parameter
das.ignoreRedundantNetWarning
Two conditions that
can cause orphaned VMs...
... The virtual machine was deleted outside
of vCenter Server
... The ESXi host has lost access to the
storage device
Three changes that
could result in a Network rollback operation...
... Changing the IP settings of management
VMkernel network adapters
... Changing the MTU of a distributed switch
... Updating the VLAN of the management
VMkernel network adapter
Change the Data Collection Level to 3 - to review device statistics to
troubleshoot an issue (for device level information)
When attempting to
power on a virtual machine and getting “Unable to access a file since it is
locked”, two actions to address...
... Investigate the logs for both the host
and the virtual machine
... Reboot the host the virtual machine is
running on
When attempting to
migrate a virtual machine with a USB device attached, the compatibility check
fails with the error message “Currently connected device uses backing path which
is not accessible”, two resolutions...
... Make sure that the devices are not in
the process of transferring data
... Re-add and enable vMotion for each
affected USB device
In a Fully
Automated Distributed Resource Scheduler (DRS) cluster with vMotion enabled,
virtual machines are never migrated, three scenarios...
... DRS is disabled on the virtual machine
... Moving the virtual machine will violate
an affinity rule
... Virtual machine has a local device
mounted
DRS does not move a
virtual machine when it is initially powered on despite insufficient resources
on the host, three possible causes...
... DRS is disabled on the virtual machine
... The virtual machine has a device mounted
... The virtual machine has fault tolerance
enabled
The following
scenarios can cause Storage DRS to be disabled on a virtual disk...
... A virtual
machine's swap file is host-local
... A certain
location is specified for a virtual machine's .vmx swap file
... The relocate or
Storage vMotion operation is currently disabled for the virtual machine in
vCenter Server
... The home disk
of a virtual machine is protected by vSphere HA and relocating will cause loss
of vSphere HA protection
... The disk is a CD-ROM/ISO file
... If the disk is
an independent disk, Storage DRS is disabled (except in the case of relocation
or clone placement)
... If the virtual
machine has system files on a separate datastore from the home datastore
(legacy), Storage DRS is disabled on the home disk
... If the virtual
machine has a disk whose base/redo files are spread across separate datastores
(legacy), Storage DRS for the disk is disabled
... The virtual
machine has hidden disks
... The virtual machine is a template
... The virtual
machine is vSphere Fault Tolerance-enabled
... The virtual
machine is sharing files between its disks
... The virtual
machine is being Storage DRS-placed with manually specified datastores