VCP6-DCV Exam Cram Notes: Section 7 of 10

Section 7 - Troubleshooting a vSphere Deployment

Fault Tolerance (VM) - provides continuous availability for applications in the event of server failure.

Three valid use cases for Fault Tolerance...
... Protecting business critical applications
... Clustering custom applications which have no other supported method
... Reducing complexity compared to other clustering solutions

Fault tolerance...
... Requires a dedicated 10 Gbit NIC
... Supports thin provisioned disks

Three vSphere features that are supported with Fault Tolerance...
... vSphere Data Protection
... vMotion
... Enhanced vMotion Compatibility

Three features or devices incompatible with Fault Tolerance...
... N_Port ID Virtualization (NPIV)
... CD-ROM backed by a physical device
... 3D enabled Video Devices

The following features are not supported with FT:
Snapshots, Storage vMotion, Linked clones, Virtual SAN, Virtual Volumes, VM Component Protection, Storage-based policy management and I/O filters.

kernelLatency - the data counter used to identify suspected cases of VMs on a host sending more throughput to the storage system than the host's configuration supports

Objective 7.1 - Troubleshoot vCenter Server, ESXi Hosts and VMs

Troubleshoot Common Installation Issues:
Make sure your hosts meet the hardware requirements as well as the VMware HCL.

Monitor ESXi System Health:
The Common Information Model (CIM) allows for a standard framework to manage computing resources and presents information via the vSphere Client.

Note: Run Reset Sensors from the host’s Hardware Status tab in vCenter to clear all the CIM data

ESXi Log Files and Locations:
/var/log/auth.log = ESXi Shell authentication success and failure log
/var/log/dhclient.log = DHCP client service log
/var/log/esxupdate.log = ESXi patch and update installation log
/var/log/lacp.log = Link Aggregation Control Protocol log
/var/log/hostd.log = Host management service log (includes VM and host Tasks and Events, communication with vSphere Client, vCenter, SDK)
/var/log/hostd-probe.log = Host management service responsiveness checker
/var/log/rhttproxy.log = HTTP connections proxied on behalf of other ESXi host webservices
/var/log/shell.log = ESXi Shell usage logs, including enable/disable and every command entered
/var/log/sysboot.log = Early VMkernel startup and module loading
/var/log/boot.gz = A compressed file that contains boot log information
/var/log/syslog.log = Management service initialization, watchdogs, scheduled tasks and DCUI use
/var/log/usb.log = USB device arbitration events
/var/log/vobd.log = VMkernel Observation events
/var/log/vmkernel.log = Core VMkernel logs (including device discovery, storage and networking device and driver events, and VM startup)
/var/log/vmkwarning.log = A summary of Warning and Alert log messages from vmkernel.log
/var/log/vmksummary.log = A summary of ESXi host startup/shutdown, hourly heartbeat with uptime, number of VMs running, service resource consumption
/var/log/Xorg.log = Video acceleration

Note: vpxa = vCenter Server Agent

vCenter Log Locations...
... on Windows - C:\ProgramData\VMware\VMware VirtualCenter\Logs
... on VA - /var/log/vmware/vpx

vCenter Log Files:
vpxd.log = Main vCenter Server log
vpxd-profiler.log = Profiled metrics for operations performed in vCenter Server*
*Used by VPX Operational Dashboard (VOD) at https://VCHostnameOrIP/vod/index.html
vpxd-alert.log = Non-fatal info logged about vpxd process
cim-diag.log & vws.log = CIM monitoring info
drmdump = actions proposed and taken by DRS
ls.log = Health reports for the Licensing Services extension
vimtool.log = Dump of string used during installation of vCenter Server
stats.log = historical performance data collection from ESXi hosts
sms.log = Health reports for Storage Monitoring Service extension
eam.log = Health reports for ESX Agent Monitor extension
catalina.<date>.log & localhost.<date>.log = Connectivity information and status of the VMware Webmanagement Services
jointool.log = Health status of VMwareVCMSDS service and individual ADAM database objects, and replication logs between linked-mode vCenter Servers

Identify Common Command Line Interface (CLI) Commands:
esxtop - used for real time performance monitoring and troubleshooting
vmkping - (like ping) allows for sending traffic out a specified vmkernel interface
esxcli network namespace - used for monitoring or configuring ESXi networking
esxcli storage namespace - used for monitoring or configuring ESXi storage
vmkfstools - allows for management of VMFS volumes and virtual disks

vimtop (vCenter Server Appliance) interactive keys:
g - display the top four physical CPUs
f - display all available CPUs overview
o - network view
k - disk view
m - display memory overview information

To power off a virtual machine while connected to an ESXi host using SSH (look up the VMID with getallvms first):
> vim-cmd vmsvc/getallvms
> vim-cmd vmsvc/power.off VMID

Identify Fault Tolerance Network Latency Issues:
- Use dedicated 10-Gbit network for Fault Tolerance traffic
- Use the vmkping command to verify low sub-millisecond network latency
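
As a rough illustration of the vmkping check, the Python sketch below pulls the average round-trip time out of a ping-style summary line and compares it with a 1 ms ceiling. The summary line format and the names avg_rtt_ms and ft_latency_ok are assumptions for illustration only; confirm the format against the actual output on your host.

```python
import re

# Assumed ping-style summary line, for example:
#   round-trip min/avg/max = 0.153/0.187/0.255 ms
_RTT = re.compile(r"min/avg/max\s*=\s*([\d.]+)/([\d.]+)/([\d.]+)\s*ms")


def avg_rtt_ms(output: str) -> float:
    """Return the average round-trip time (ms) found in ping-style output."""
    match = _RTT.search(output)
    if match is None:
        raise ValueError("no round-trip summary line found")
    return float(match.group(2))


def ft_latency_ok(output: str, limit_ms: float = 1.0) -> bool:
    """True when the average RTT is below limit_ms (sub-millisecond by default)."""
    return avg_rtt_ms(output) < limit_ms
```

The 1 ms default simply encodes "sub-millisecond"; tighten it if your environment expects lower latency.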

Objective 7.2 - Troubleshooting vSphere Storage and Network Issues

Troubleshoot Physical Network Adapter Configuration Issues:
- Be sure that physical NICs that are assigned to a virtual switch are configured the same on the physical switch (speed, VLANs, MTU...)
- If using IP Hash for Load Balancing method, make sure the physical switch side has link aggregation enabled
- If using beacon probing for network failover detection, standard practice is to use a minimum of 3 uplinks
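
The first check in the list above amounts to comparing two sets of settings. A minimal Python sketch, assuming you have already collected the values by hand or by script (the dictionary keys and the function name config_mismatches are hypothetical):

```python
def config_mismatches(host_nic: dict, switch_port: dict,
                      keys=("speed", "mtu", "vlans")) -> dict:
    """Return {setting: (host_value, switch_value)} for settings that differ."""
    return {
        key: (host_nic.get(key), switch_port.get(key))
        for key in keys
        if host_nic.get(key) != switch_port.get(key)
    }


# Example values are made up for illustration.
host = {"speed": 10000, "mtu": 9000, "vlans": {10, 20}}
port = {"speed": 10000, "mtu": 1500, "vlans": {10, 20}}
print(config_mismatches(host, port))
```

An empty result means the compared settings agree; it says nothing about settings that were not collected.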

Troubleshoot Virtual Switch and Port Group Configuration Issues:
- Port Group/dvPort Group names are case sensitive and must match exactly across hosts
- vSwitch settings must be the same across hosts (otherwise, for example, vMotion will fail)
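
Because port group names are compared case-sensitively, a name that differs only by letter case on one host is easy to miss. The sketch below is one way to spot such names; the input shape (host name mapped to a list of port group names) and the function name are assumptions for illustration.

```python
from collections import defaultdict


def case_mismatched_port_groups(host_port_groups: dict) -> dict:
    """Group port group names that differ only by letter case across hosts.

    host_port_groups maps a host name to an iterable of port group names.
    Returns {lowercased_name: sorted list of the distinct spellings found}.
    """
    spellings = defaultdict(set)
    for names in host_port_groups.values():
        for name in names:
            spellings[name.lower()].add(name)
    return {key: sorted(found) for key, found in spellings.items() if len(found) > 1}
```

For example, passing {"esx01": ["VM Network"], "esx02": ["VM network"]} reports both spellings under the key "vm network".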

Troubleshoot Common Network Issues - areas:
- Virtual Machine
- ESX/ESXi Host Networking (uplinks)
- vSwitch or dvSwitch Configuration
- Physical Switch Configuration

Troubleshoot VMFS Metadata Consistency:
Use the vSphere On-disk Metadata Analyzer (VOMA) to identify and fix incidents of metadata corruption (for VMFS datastores or a virtual flash resource):
# esxcli storage vmfs extent list
# voma -m vmfs -f check -d /vmfs/devices/disks/naa.1234567...

Identify Storage I/O Constraints:
Disk Metric: Threshold (ms): Description
KAVG: 2: The amount of time the command spends in the VMkernel
DAVG: 25: This is the average response time per command being sent to the device
GAVG: 25: This is the response time as it is perceived by the guest OS
Note: If KAVG is > 0 it usually means I/O is backed up in a device or adapter queue.
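
The thresholds above can be applied mechanically. The sketch below is a minimal Python example that flags any metric over its threshold; it derives GAVG as DAVG + KAVG, which is how the guest-observed latency is composed. The function and constant names are illustrative.

```python
# Thresholds in milliseconds, taken from the table above.
THRESHOLDS_MS = {"KAVG": 2.0, "DAVG": 25.0, "GAVG": 25.0}


def latency_warnings(davg: float, kavg: float) -> list:
    """Return the names of the metrics that exceed their thresholds.

    GAVG is derived as DAVG + KAVG (device latency plus kernel latency).
    """
    observed = {"KAVG": kavg, "DAVG": davg, "GAVG": davg + kavg}
    return [name for name, value in observed.items() if value > THRESHOLDS_MS[name]]
```

A result such as ["KAVG", "GAVG"] points at queuing in the VMkernel rather than at the array, since DAVG itself is still under its threshold.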

Objective 7.3 - Troubleshoot vSphere Upgrades

Collect Upgrade Diagnostic Information (export a log bundle):
Monitor tab -> System Logs -> Export Systems Logs
Choose ESX/ESXi hosts you want to export logs from
(Optional selection) Include vCenter Server and vSphere Web Client Logs
Specify which system logs are to be exported:
- Storage
- ActiveDirectory
- VirtualMachines
- System
- Userworld
- Performance Snapshot
Download Log Bundle!

Note: The equivalent CLI tool on an ESXi host is vm-support

Configure vCenter Logging Options: Logging settings
Select level of detail that vCenter Server uses for log files:
- none = Disable logging
- error = Errors only
- warning = Errors and Warnings
- info = Normal logging (Default)
- verbose = Verbose
- trivia = Extended Verbose
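
The levels above are cumulative: each level also records everything from the levels listed before it. A small Python sketch of that ordering, with hypothetical names, assuming messages carry one of the same level names as their severity:

```python
# Ordered from least to most detail, matching the list above.
LEVELS = ["none", "error", "warning", "info", "verbose", "trivia"]


def is_logged(message_level: str, configured_level: str = "info") -> bool:
    """True if a message of message_level is written at configured_level."""
    if message_level == "none":
        raise ValueError("'none' is a setting, not a message severity")
    return LEVELS.index(message_level) <= LEVELS.index(configured_level)
```

With the default of info, warnings and errors are recorded but verbose and trivia messages are not.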

Objective 7.4 - Troubleshoot and Monitor vSphere Performance

Describe How Tasks and Events are Viewed in vCenter Server:
Monitor tab -> Tasks or Events

Identify Critical Performance Metrics:
Critical points to monitor are: CPU, Memory, Networking, and Storage

Explain Common Memory Metrics:
Metric = Description
SWR/s and SWW/s = Measured in MB per second, these counters represent the rate at which the ESXi host is swapping memory in from disk (SWR/s) and swapping memory out to disk (SWW/s)
SWCUR = This is the amount of swap space currently used by the virtual machine
SWTGT = This is the amount of swap space that the host expects the virtual machine to use
MCTL? = Indicates whether the balloon driver is installed in the virtual machine
MCTLSZ = Amount of physical memory that the balloon driver has reclaimed
MCTLTGT = Maximum amount of memory that the host wants to reclaim via the balloon driver
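
One way to read these counters together is sketched below in Python. It treats any non-zero value as a sign of memory pressure, which is a deliberately simple assumption rather than an official threshold; the function name and labels are illustrative.

```python
def memory_pressure_signs(mctlsz_mb: float, swcur_mb: float,
                          swr_per_s: float, sww_per_s: float) -> list:
    """Return plain-language findings for one VM's esxtop memory counters."""
    findings = []
    if mctlsz_mb > 0:
        # Balloon driver currently holds reclaimed memory.
        findings.append("ballooning")
    if swcur_mb > 0:
        # Some of the VM's memory currently sits in host swap.
        findings.append("swapped")
    if swr_per_s > 0 or sww_per_s > 0:
        # Swap is being read or written right now.
        findings.append("active swapping")
    return findings
```

A VM can show "swapped" without "active swapping" when pages were swapped out earlier and have not been touched since.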

Explain Common CPU Metrics:
Metric = Description
%USED = Percentage of physical CPU time used by a group of worlds
%RDY = Percentage of time a group was ready to run but was not provided CPU resources
%CSTP = Percentage of time the vCPUs of a virtual machine spent in the co-stopped state, waiting to be co-started
%SYS = Percentage of time spent in the ESXi VMkernel on behalf of the world/resource pool
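
esxtop reports %RDY as a percentage, while vCenter performance charts report CPU ready as a summation in milliseconds. The sketch below shows the commonly cited conversion, assuming the 20-second real-time chart interval; the function name and the optional per-vCPU division are illustrative assumptions.

```python
def ready_percent(summation_ms: float, interval_s: float = 20.0,
                  vcpus: int = 1) -> float:
    """Convert a CPU ready summation (ms) into a percentage of the interval.

    interval_s is the chart sampling interval (20 s for real-time charts).
    vcpus divides the result to give a per-vCPU average.
    """
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    return summation_ms * 100.0 / (interval_s * 1000.0 * vcpus)
```

For instance, a summation of 1000 ms in a 20-second sample works out to 1000 / 20000 = 5% ready time.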

Explain Common Network Metrics:
Metric = Description
MbTX/s = Amount of data transmitted in Mbps
MbRX/s = Amount of data received in Mbps
%DRPTX = Percentage of outbound packets dropped
%DRPRX = Percentage of inbound packets dropped
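
The drop percentages are simple ratios. A minimal sketch, assuming the percentage is taken over dropped plus delivered packets (how the counter is computed internally is not stated in these notes, so treat the denominator as an assumption):

```python
def drop_percent(dropped: int, delivered: int) -> float:
    """Percentage of packets dropped out of all packets handled."""
    total = dropped + delivered
    if total == 0:
        return 0.0  # no traffic, so nothing was dropped
    return 100.0 * dropped / total
```

Any sustained non-zero value is worth investigating.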

Explain Common Storage Metrics:
Metric = Description
DAVG = Average amount of time it takes a device to service a single I/O request
KAVG = The average amount of time it takes the VMkernel to service a disk operation
GAVG = The total latency seen from the virtual machine when performing an I/O request
ABRT/s = Number of commands aborted per second

Identify Host Power Management Policy:
Power Management Policy = Description
Not supported = Not supported / Disabled in BIOS
High Performance = The VMkernel detects certain power management features, but will not use them unless the system BIOS requests them for power capping or thermal events
Balanced (Default) = The VMkernel uses the available power management features conservatively to reduce host energy consumption with minimal compromise to performance
Low Power = The VMkernel aggressively uses available power management features to reduce host energy consumption at the risk of lower performance
Custom = The VMkernel bases its power management policy on the values of several advanced configuration parameters

Identify CPU/Memory Contention Issues - Monitor Performance through ESXTOP

Troubleshoot Enhanced vMotion Compatibility (EVC) Issues:
- EVC mode ensures that all ESXi hosts in a cluster present the same CPU level/feature set to VMs, even if the CPUs on the hosts differ
Note: CPUs still need to be of the same CPU manufacturer.

ESXi 6.0 Supports these EVC Modes:
AMD Opteron Generation: 1, 2, 3, 3 (no 3Dnow!), 4, “Piledriver”
Intel Generation: “Merom”, “Penryn”, “Nehalem”, “Westmere”, “Sandy Bridge”, “Ivy Bridge”, “Haswell”
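
Since an EVC mode exposes a common feature baseline, the highest mode a cluster can use is bounded by its oldest CPU generation. The Python sketch below illustrates this with the Intel list above; it assumes each host's generation name matches one of the listed modes exactly, and the function name is hypothetical.

```python
# Intel EVC modes from the list above, oldest first.
INTEL_EVC_MODES = ["Merom", "Penryn", "Nehalem", "Westmere",
                   "Sandy Bridge", "Ivy Bridge", "Haswell"]


def highest_common_evc_mode(host_generations: list,
                            modes: list = INTEL_EVC_MODES) -> str:
    """Return the newest EVC mode that every host in the list can support."""
    if not host_generations:
        raise ValueError("at least one host generation is required")
    # The oldest generation present sets the ceiling for the whole cluster.
    return modes[min(modes.index(gen) for gen in host_generations)]
```

A cluster mixing Haswell, Westmere and Ivy Bridge hosts would therefore be limited to the Westmere mode.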

Overview Charts: Display multiple data sets in one panel to easily evaluate different resource statistics, display thumbnail charts for child objects, and display charts for a parent and a child object
Advanced Charts: Display more information than overview charts, are configurable, and can be printed or exported to a spreadsheet

Objective 7.5 - Troubleshoot HA and DRS Configurations and Fault Tolerance

HA Requirements:
- All hosts must be licensed for vSphere HA
- You need at least 2 hosts in the cluster
- All hosts should be configured with static IP, or, if using DHCP, address must be persistent across reboots
- There should be at least 1 management network in common among all hosts
- All hosts should have access to the same VM networks and datastores
- For VM monitoring to work, VMware tools must be installed
- vSphere HA supports both IPv4 and IPv6

DRS Requirements:
- Shared Storage: can be either SAN or NAS
- Place the disks of VMs on datastores that are accessible by all hosts
- Processor Compatibility: same vendor (AMD or Intel), and supported family for EVC
Note: CPU Compatibility Masks - you can hide certain CPU features from the VM to prevent vMotion failing due to incompatible CPUs

vMotion Requirements:
- The virtual machine configuration file for ESXi hosts must reside on VMFS
- vMotion does not support raw disks, or migrations of applications using MSCS
- vMotion requires a private GbE (minimum) migration network between all of the vMotion enabled hosts

Verify vMotion/Storage vMotion Configuration:
- Proper networking (VMkernel interface for vMotion)
- CPU compatibility
- Shared storage access across all hosts

Note: When migrating a virtual machine, 3 available options...
... Change compute resource only
... Change storage only
... Change both compute resource and storage

Verify HA Network Configuration:
- On ESXi hosts in the cluster, vSphere HA communications, by default, travel over VMkernel networks, except those marked for use with vMotion

Verify HA/DRS Cluster Configuration:
You can monitor for errors by looking at the Cluster Operational Status and Configuration Issues screens

Troubleshoot HA Capacity Issues: The 3 Admission Control Policies:
- Host failures the cluster tolerates (default): Configure vSphere HA to tolerate a specified number of host failures
- Percentage of cluster resources reserved as failover spare capacity: Configure vSphere HA to perform admission control by reserving a specific percentage of cluster CPU and memory resources for recovery from host failure
- Specify failover hosts

When troubleshooting HA, look for:
- Failed or disconnected hosts
- Oversized VMs with high CPU/memory reservations (affects slot sizes)
- Lack of capacity/resources
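
To see why one oversized reservation hurts, it helps to work through the slot arithmetic used by the "Host failures the cluster tolerates" policy. The Python sketch below is a simplified model, not VMware's implementation: it assumes each VM is given as (CPU reservation in MHz, memory reservation plus overhead in MB), that the memory figure is always greater than zero, and that a VM with no CPU reservation counts as 32 MHz. All function names are illustrative.

```python
def slot_size(vms: list, default_cpu_mhz: int = 32) -> tuple:
    """Slot = largest CPU reservation and largest memory figure among the VMs."""
    cpu = max(c for c, _ in vms) or default_cpu_mhz
    mem = max(m for _, m in vms)
    return cpu, mem


def slots_per_host(host: tuple, slot: tuple) -> int:
    """Whole slots that fit in a host given as (cpu_mhz, mem_mb)."""
    (cpu, mem), (slot_cpu, slot_mem) = host, slot
    return min(cpu // slot_cpu, mem // slot_mem)


def failover_capacity(hosts: list, vms: list) -> int:
    """Number of host failures tolerated, removing the largest hosts first."""
    slot = slot_size(vms)
    counts = sorted((slots_per_host(h, slot) for h in hosts), reverse=True)
    tolerated = 0
    # Keep failing the next-largest host while the rest can still hold every VM.
    while counts[tolerated + 1:] and sum(counts[tolerated + 1:]) >= len(vms):
        tolerated += 1
    return tolerated
```

With five VMs whose largest reservations are 2000 MHz and 2048 MB, and three hosts of (9000, 9216), (9000, 6144) and (6000, 6144), the hosts provide 4, 3 and 3 slots, so the cluster tolerates one host failure. Raising a single VM's reservation enlarges the slot for every VM and can drop that number to zero.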

Troubleshoot HA Redundancy Issues:
- Design in redundancy for a cluster’s HA network traffic (either NIC teaming, preferably to separate physical switches, or a secondary management network attached to a different virtual switch)

If a virtual machine has not restarted after a host failure, 2 possible reasons...
... Virtual machine was not protected by HA at the time of the failure
... Insufficient spare capacity on available hosts

Interpret the DRS Resource Distribution Graph and Target/Current Host Load Deviation:
- Accessed from Summary tab at cluster level, under section for VMware DRS “View Resource Distribution Chart”
- The DRS Resource Distribution Chart is used to display both memory and CPU metrics for each host in the cluster
- The DRS process runs every 5 minutes and analyses resource metrics on each host across the cluster
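
The current host load deviation can be thought of as the standard deviation of per-host load, where a host's load is the sum of its VMs' entitlements divided by the host's capacity. The sketch below models that idea in Python; using the population standard deviation is an assumption, and the function names and input shape are illustrative.

```python
from statistics import pstdev


def host_load(vm_entitlements_mhz: list, capacity_mhz: float) -> float:
    """Load of one host: total VM entitlement divided by host capacity."""
    return sum(vm_entitlements_mhz) / capacity_mhz


def current_load_deviation(hosts: list) -> float:
    """Standard deviation of load across hosts given as (entitlements, capacity)."""
    return pstdev([host_load(vms, cap) for vms, cap in hosts])
```

Two equally loaded hosts give a deviation of 0.0; DRS recommends migrations when the current deviation exceeds the target deviation implied by the migration threshold.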

Troubleshoot DRS Load Imbalance/Overcommit Issues:
- host failure
- vCenter Server is unavailable and VMs are powered on via host connection, or changes are made to hosts or VMs
- cluster becomes invalid if user reduces reservation on a parent resource pool while a VM is in the process of failing over

Troubleshoot Storage vMotion Migration Issues:
- VM disks must be in persistent mode or be RDMs
- For Virtual Compatibility Mode RDMs, you can migrate the mapping file, or convert to thick/thin during migration, as long as destination is not NFS
- For Physical Compatibility Mode RDMs, you can migrate mapping file only

Two scenarios that can cause Storage DRS to be disabled on a virtual disk...
... The disk is a CD-ROM/ISO file
... The virtual machine is a template

vMotion Resource Maps:
Provide a visual representation of hosts, datastores, and networks associated with the VM. They also show which hosts in the VM’s cluster or datacenter are compatible.

Identify Fault Tolerance Requirements:
- physical CPUs must be compatible with vMotion or EVC
- physical CPUs must support hardware MMU virtualization (Intel EPT or AMD RVI)
- dedicated 10 Gbit network for FT logging
- vSphere Standard and Enterprise allow up to 2 vCPUs for FT
- vSphere Enterprise Plus allows up to 4 vCPUs for FT
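
The licensing limits in the last two bullets can be expressed as a small lookup. This is only an illustrative Python sketch; the edition strings and the function name are assumptions, not identifiers from any VMware API.

```python
# Maximum CPUs per FT-protected VM by vSphere edition, from the list above.
FT_MAX_VCPUS = {"Standard": 2, "Enterprise": 2, "Enterprise Plus": 4}


def ft_vcpu_allowed(edition: str, vcpus: int) -> bool:
    """True if a VM with this many CPUs can be FT-protected under the edition."""
    try:
        limit = FT_MAX_VCPUS[edition]
    except KeyError:
        raise ValueError(f"unknown edition: {edition!r}") from None
    return 1 <= vcpus <= limit
```

A 4-CPU VM therefore passes only under Enterprise Plus.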

Features NOT supported if a VM is protected by Fault Tolerance:
- VM snapshots
- Storage vMotion
- Linked Clones
- Virtual SAN
- VM Component Protection (VMCP)
- Virtual Volume datastores
- Storage-based policy management
- I/O filters

When disabling Distributed Resource Scheduler (DRS) on a vSphere 6.x cluster...
... The resource pool hierarchy of the DRS cluster is removed
... The affinity settings of the DRS cluster are removed and not maintained when DRS is re-enabled

Features supported when using Fault Tolerance in vSphere 6.x (include)...
... vMotion
... vSphere Distributed Switches


If the vSphere Client is connected directly to an ESXi host - an administrator is unable to access the Clone Virtual Machine wizard

If you do not see the Hardware Status tab in the vSphere Web Client, two possible explanations...
... The Hardware Status Plug-In is disabled
... The VMware VirtualCenter Management Webservices service is not running

To address the warning “This host currently has no management network redundancy”...
... Add an additional uplink to the management vmknic
... Include the advanced HA parameter das.ignoreRedundantNetWarning

Two conditions that can cause orphaned VMs...
... The virtual machine was deleted outside of vCenter Server
... The ESXi host has lost access to the storage device

Three changes that could result in a Network rollback operation...
... Changing the IP settings of management VMkernel network adapters
... Changing the MTU of a distributed switch
... Updating the VLAN of the management VMkernel network adapter

Change the Data Collection Level to 3 to review device-level statistics when troubleshooting an issue

When attempting to power on a virtual machine and getting “Unable to access a file since it is locked”, two actions to address...
... Investigate the logs for both the host and the virtual machine
... Reboot the host the virtual machine is running on

When attempting to migrate a virtual machine with a USB device attached, the compatibility check fails with the error message “Currently connected device uses backing path which is not accessible”, two resolutions...
... Make sure that the devices are not in the process of transferring data
... Re-add and enable vMotion for each affected USB device

In a Fully Automated Distributed Resource Scheduler (DRS) cluster with vMotion enabled, virtual machines are never migrated, three scenarios...
... DRS is disabled on the virtual machine
... Moving the virtual machine will violate an affinity rule
... Virtual machine has a local device mounted

DRS does not move a virtual machine when it is initially powered on despite insufficient resources on the host, three possible causes...
... DRS is disabled on the virtual machine
... The virtual machine has a device mounted
... The virtual machine has fault tolerance enabled

The following scenarios can cause Storage DRS to be disabled on a virtual disk...
... A virtual machine's swap file is host-local
... A certain location is specified for a virtual machine's .vmx swap file
... The relocate or Storage vMotion operation is currently disabled for the virtual machine in vCenter Server
... The home disk of a virtual machine is protected by vSphere HA and relocating will cause loss of vSphere HA protection
... The disk is a CD-ROM/ISO file
... If the disk is an independent disk, Storage DRS is disabled (except in the case of relocation or clone placement)
... If the virtual machine has system files on a separate datastore from the home datastore (legacy), Storage DRS is disabled on the home disk
... If the virtual machine has a disk whose base/redo files are spread across separate datastores (legacy), Storage DRS for the disk is disabled
... The virtual machine has hidden disks
... The virtual machine is a template
... The virtual machine is vSphere Fault Tolerance-enabled
... The virtual machine is sharing files between its disks
... The virtual machine is being Storage DRS-placed with manually specified datastores
