4-Node AFF A800 MCIP - Research Unbalanced ISLs / ISL Monitoring

Investigating unbalanced MetroCluster IP (MCIP) ISLs, where network utilization on one switch pair's ISL is much higher than on the other. A tip to go on:

Keep a close eye on the network port utilization of the back-end switches (MCIP Cisco switches in this case) and on the qos statistics latency show output for the cluster domain. Perhaps the imbalance is down to the iSCSI paths to the remote disks. Maybe we can tweak those paths.
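
On the Cisco side, the basic per-port rate and error counters are a reasonable first look. A sketch, assuming NX-OS; the interface number below is a placeholder for an actual ISL port from your RCF:

show interface ethernet 1/17
show interface ethernet 1/17 counters

The "input rate" / "output rate" lines give a quick utilization read per ISL port.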

In the documentation we see:

Considerations for ISLs (netapp.com)

The maximum theoretical throughput of shared ISLs (for example, 240 Gbps with six 40 Gbps ISLs) is a best-case scenario. When using multiple ISLs, statistical load balancing can impact the maximum throughput. Uneven balancing can occur and reduce throughput to that of a single ISL.

So uneven balancing is somewhat to be expected (the statement above is really about a single switch pair carrying multiple ISLs, rather than one ISL per switch pair, but I think it still applies.)
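
It's worth noting why the balancing is "statistical": if the ISL ports are bundled in a port-channel (as in the standard MCIP reference configuration files), the switch picks a member link per flow by hashing packet header fields, so a small number of long-lived flows, like the MetroCluster iSCSI connections, can all hash onto the same member. On NX-OS the hash in use is shown by:

show port-channel load-balance

With only a handful of connections, even a source/destination IP and port hash can land most of the traffic on one link.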

statistics start -object ?

There are so many objects we can get metrics from in ONTAP, but I don't see any that will help with our unbalanced ISLs.

statistics catalog object show

When I ran this on ONTAP 9.11.1 there were 692 objects. There was nothing specific to ISL (I searched and found nothing; one way to search is shown after the list below.) There are various MCC objects:

  • mcc_config
  • mcc_drc
  • mcc_hm_storage_bridge_fc_port
  • mcc_hm_storage_bridge_sas_port
  • mcc_hm_storage_switch
  • mcc_perf_cluster
  • mcc_perf_node
  • mcc_perf_vserver
  • mcc_storage
  • mcc_subsystem
  • mcc_vserver
  • mcculp
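
For reference, this is roughly how I searched the catalog (wildcards work in the -object query; some objects only show up at advanced or diagnostic privilege):

set -privilege advanced
statistics catalog object show -object *isl*
statistics catalog object show -object mcc*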

The one that looked interesting was mcculp:

  • mcculp : These counters track IO latency statistics pertaining to MCC interconnect DR node collected at MCC ULP layer.

But nothing useful in there pertaining to the ISLs.
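
For completeness, sampling it goes roughly like this (the sample-id is arbitrary):

set -privilege advanced
statistics start -object mcculp -sample-id mcculp1
statistics show -sample-id mcculp1
statistics stop -sample-id mcculp1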

QoS Statistics Latency Show

A useful check, but again nothing pertaining to the ISLs.
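
For reference, the check was along these lines (the iteration count is arbitrary):

qos statistics latency show -iterations 100

The Cluster column is the latency domain where you'd expect interconnect/ISL-induced latency to show up.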

ActiveIQ

On the AFF A800, we have e0b and e1b as the HA and MetroCluster interfaces (HA traffic and disk traffic travel through these ports).

  • e0b goes to MCIP switch 1
  • e1b goes to MCIP switch 2

See in AIQ: METROCLUSTER-INTERFACE

  • Cluster 1 Node 1 e0b/e1b = 10.1.1.1 / 10.1.2.1
  • Cluster 1 Node 2 e0b/e1b = 10.1.1.2 / 10.1.2.2
  • Cluster 2 Node 1 e0b/e1b = 10.1.1.3 / 10.1.2.3
  • Cluster 2 Node 2 e0b/e1b = 10.1.1.4 / 10.1.2.4
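
The same mapping is available on-cluster, without AIQ:

metrocluster configuration-settings interface show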

See in AIQ: METROCLUSTER-ISCSI-INITIATOR

Each node then has eight iSCSI initiators: four to the dr_partner and four to the dr_auxiliary.
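
On-cluster, the full connection list, including whether each connection goes to the HA partner, DR partner, or DR auxiliary, can be seen with:

metrocluster configuration-settings connection show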

See in AIQ: METROCLUSTER-TCP-IFSTAT

Here we can see the ifstat outputs for the MetroCluster interfaces since their last reset. For example:

Switch 1:

Cluster 1 Node 1 e0b: Receive = 77913 & Transmit = 54891
Cluster 1 Node 2 e0b: Receive = 48183 & Transmit = 69964
Cluster 2 Node 1 e0b: Receive = 62577 & Transmit = 82400
Cluster 2 Node 2 e0b: Receive = 73133 & Transmit = 54550 

Switch 2:

Cluster 1 Node 1 e1b: Receive = 62519 & Transmit = 63153
Cluster 1 Node 2 e1b: Receive = 63104 & Transmit = 57760
Cluster 2 Node 1 e1b: Receive = 63828 & Transmit = 66478 
Cluster 2 Node 2 e1b: Receive = 58863 & Transmit = 60917

In the example above, the traffic on both switches looks fairly balanced. But some of the traffic on e0b/e1b is local HA traffic. Even so, you would expect the local HA traffic to be balanced.
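
To get fresher numbers than AIQ's since-last-reset view, ifstat can be run (and zeroed) from the nodeshell. A sketch, assuming <node_name> is replaced with a real node; I believe -z resets the counters for the named interface:

system node run -node <node_name> -command "ifstat e0b"
system node run -node <node_name> -command "ifstat -z e0b"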

See in AIQ: METROCLUSTER-TCP-STATS

This has (as expected) TCP stats for the MetroCluster interface connections (i.e. connection by connection.)

See in AIQ: DISK-PATHS.XML

This is the most useful section if we're investigating disk paths and their balance.

Here we see, per controller, the actual Kbytes/sec on disk (rolling average) for the iSCSI initiator ports 0m and 0v. From this data we could work out whether one initiator is carrying more load than the other.

It would take a while to go through the data and work it out.
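
One quicker sanity check before going through all the XML: the nodeshell path listing shows which initiator port (0m or 0v) each disk path is using, though not the throughput. A sketch, assuming the classic nodeshell command is still available:

system node run -node <node_name> -command "storage show disk -p"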

It could turn out that, because the local MCIP switches are ISL-ed to each other locally, traffic we'd expect to stay in one switch fabric can traverse the ISL via the other fabric. Investigating this needs switch stats.
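
On NX-OS, assuming the ISL ports are members of a port-channel per the standard RCFs (the port-channel number below is a placeholder), the per-member traffic distribution and raw counters come from:

show port-channel traffic
show interface port-channel 10
show interface counters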


To be continued (maybe) ...

What Metrics Do We Use to Monitor the MCIP ISL?


TBC
