NetApp StorageGRID in DII and Thinking About Headroom: Part 1

Introduction

I had a question from a customer asking for the maximum IOPS of a StorageGRID system, and for advice on how heavily utilized their current system is. Essentially, a question of headroom.

From Google AI: IOPS (Input/Output Operations Per Second) measures the read and write performance of a physical storage device, like an SSD or HDD, while S3 operations relate to the performance of Amazon S3, a cloud storage service.

The above essentially says to use S3 operations instead of IOPS when talking about the performance of an S3 object storage system. Of course, if you have physical hardware with spinning disks, you can always come up with a rough IOPS figure from typical per-drive numbers like those listed below, but relating that to S3 operations is not directly possible (at least, I don't think so; please correct me if I'm wrong).

  • 5,400 RPM HDDs:       50-60 IOPS
  • 7,200 RPM SATA HDDs:  75-100 IOPS
  • 10,000 RPM SATA HDDs: 125-150 IOPS
  • 10,000 RPM SAS HDDs:  140 IOPS
  • 15,000 RPM SAS HDDs:  175-210 IOPS

With S3 storage, we also need to consider throughput.

  • Quote:
    • "Normally, your grid is sized to achieve a required throughput, defined in terms of S3 operations per second or bytes per second. For example, you might have a requirement that your grid handle 1,000 S3 operations per second, or 2,000 MB per second, of object ingests and retrievals.
    • "For example, if your grid was sized to achieve a throughput of 2,000 MB/second, and your average object size is 2 MB, then your grid was sized to be able to handle 1,000 S3 operations per second (2,000 MB / 2 MB)."
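The arithmetic in the quote can be sketched in a few lines of Python. The function name is my own and is not part of StorageGRID or DII; it only restates the division from the quoted example.

```python
def s3_ops_per_second(throughput_mb_per_s: float, avg_object_size_mb: float) -> float:
    """Convert a throughput figure (MB/s) into S3 operations per second."""
    if avg_object_size_mb <= 0:
        raise ValueError("average object size must be positive")
    return throughput_mb_per_s / avg_object_size_mb


# The example from the quote: 2,000 MB/s with 2 MB objects.
print(s3_ops_per_second(2000, 2))  # 1000.0
```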

I also want to present some data using NetApp DII (Data Infrastructure Insights) dashboards.

Fusion Tool

In NetApp's Fusion Tool (as of 27.05.2025), for Workloads we have:

  • Basic Workload Input
    • Usable Capacity (TiB)
    • Average Object Size (32KB, 64KB, 128KB, 256KB, 512KB, 1MB, 2MB, 4MB, 8MB, 16MB, 32MB, 64MB, 128MB, 256MB, 512MB)
    • Throughput: Req/s or MB/s
    • Workload Profile (Read/Write/Delete)
      • Mixed Workload 1 (50/25/25)
      • Mixed Workload 2 (25/50/25)
      • 100% Reads (100/0/0)
      • 100% Writes (0/100/0)

TR-6773i: StorageGRID Performance: NetApp StorageGRID 11.9

This is an internal TR and cannot be shared externally. For the purposes of creating my dashboards, there are some useful takeaways:
  1. We need to consider the performance of both appliance types:
    1. Storage Appliance
    2. Services Appliance (gateway/load balancer)
  2. There are a lot of factors that can influence performance:
    • Number of storage nodes at a site
    • Storage node deployment type
    • Type of StorageGRID appliance
    • Client application infrastructure
    • Workload (read, write, delete, concurrency*, object sizes)
    • Storage node object data and metadata capacity use
    • Information lifecycle management (ILM) configuration
    • Number of sites and latency between sites
    • Network infrastructure and configuration (i.e. LACP, 4 x 10GbE, 4 x 25GbE ...)
    • Platform services
    • Cross-grid replication
    • Stored object encryption
  3. Measurement of performance:
    1. S3 requests per second (requests/sec)
      • "For smaller objects, the main factor that influences performance is the transactional rate at which objects are processed by the system. Depending on the context, the object transactional rate can vary based on the operation that is performed, such as ingest rate, retrieval rate, or delete rate".
    2. Throughput (megabytes per second or MBps)
      • "For larger objects, the payload size is the main factor that affects the performance of the system. Therefore, the throughput or the bandwidth of the system is the best indicator of performance."
  4. Test results for various StorageGRID storage node appliances are given as:
    1. 100% PUT per-node performance (using 2-copy replicated ILM)
    2. 100% GET per-node performance (using 2-copy replicated ILM)
    3. 100% PUT per-node performance (using EC2+1 ILM)
    4. 100% GET per-node performance (using EC2+1 ILM)
      • with all object sizes (KB): 32, 64 ... 524288
      • threads (1024 for 32KB, 64KB, 128KB and either 8 or 4 for the rest)
      • requests per second
      • throughput (MBps)
*Concurrency in computing refers to the ability of a system to handle multiple tasks or processes simultaneously or at overlapping times.

Note (from Google AI but seems reasonable): StorageGRID erasure coding is not ideal for small objects (less than 200KB), objects that need to be retrieved quickly (high latency), or when a large number of storage nodes and sites are unavailable. It's also less efficient for frequently retrieved objects or those that require frequent repairs.

StorageGRID Fields In NetApp Data Infrastructure Insights (DII)

I want to create some dashboards that can help with the customer question. Below I've listed the fields in DII and highlighted those that I think will be of interest for creating dashboards.

Correct as of 27.05.2025.
  • netapp_storagegrid.cluster
    • agent_version
    • cluster_id
    • cluster_ip
    • cluster_name
    • cluster_oid
    • usable_percent
    • utilization_percent
  • netapp_storagegrid.node
    • agent_version
    • cluster_id
    • cluster_ip
    • cluster_name
    • cluster_oid
    • code
    • content_buckets_and_containers
    • content_objects
    • content_objects_lost
    • content_objects_lost_rate
    • cpu_seconds_total
    • http_sessions_incoming_attempted_rate
    • http_sessions_incoming_currently_established_rate
    • http_sessions_incoming_failed_rate
    • http_sessions_incoming_successful_rate
    • identity_service_failed_authorize_by_uuid_requests_rate
    • identity_service_failed_authorize_requests_rate
    • identity_service_failed_change_password_requests_rate
    • identity_service_failed_get_group_requests_rate
    • identity_service_failed_list_groups_requests_rate
    • identity_service_failed_new_connection_dials_rate
    • identity_service_failed_new_connections_rate
    • identity_service_failed_schedule_synchronization_requests_rate
    • identity_service_failed_search_group_rate
    • identity_service_failed_search_groups_rate
    • identity_service_failed_search_user_rate
    • identity_service_failed_search_users_rate
    • identity_service_failed_synchronization_scans_rate
    • identity_service_failed_tenant_synchronizations_rate
    • identity_service_failed_validate_requests_rate
    • identity_service_schedule_synchronization_time_ms_bucket_rate
    • identity_service_schedule_synchronization_time_ms_count_rate
    • identity_service_schedule_synchronization_time_ms_sum_rate
    • identity_service_total_schedule_synchronization_requests_rate
    • ilm_awaiting_background_objects
    • ilm_awaiting_client_evaluation_objects_per_second (sec)
    • ilm_awaiting_client_objects
    • ilm_awaiting_total_objects
    • ilm_scan_objects_per_second (sec)
    • ilm_scan_period_estimated_minutes
    • job
    • metadata_queries_average_latency_milliseconds (ms)
    • network_received_bytes_rate (B/s)
    • network_transmitted_bytes_rate (B/s)
    • node_name
    • node_uuid
    • operation
    • platform_services_failed_raft_requests
    • platform_services_failed_raft_requests_rate
    • platform_services_failed_replications
    • platform_services_failed_replications_rate
    • platform_services_failed_s3_notifications
    • platform_services_failed_s3_notifications_rate
    • platform_services_permanently_failed_requests
    • platform_services_permanently_failed_requests_rate
    • platform_services_total_raft_requests
    • platform_services_total_raft_requests_rate
    • platform_services_total_s3_notifications
    • platform_services_total_s3_notifications_rate
    • s3_data_transfers_bytes_ingested_rate
    • s3_data_transfers_bytes_retrieved_rate
    • s3_operations_failed
    • s3_operations_successful
    • s3_operations_failed_rate
    • s3_operations_successful_rate
    • s3_operations_unauthorized
    • s3_operations_unauthorized_rate
    • s3_requests_cancelled_total
    • s3_requests_cancelled_total_rate
    • s3_requests_total
    • s3_requests_total_rate
    • servercertificate_management_interface_cert_expiry_days
    • servercertificate_storage_api_endpoints_cert_expiry_days
    • service_cpu_seconds (sec)
    • service_cpu_seconds_rate
    • service_load
    • service_memory_usage_bytes (B)
    • service_memory_usage_bytes_rate (B/s)
    • service_network_received_bytes (B)
    • service_network_received_bytes_rate (B/s)
    • service_network_transmitted_bytes (B)
    • service_network_transmitted_bytes_rate (B/s)
    • service_restarts
    • service_restarts_rate
    • service_runtime_seconds (sec)
    • service_uptime_seconds (sec)
    • site_id
    • site_name
    • storage_state_current_maintenance
    • storage_state_current_offline
    • storage_state_current_online
    • storage_state_current_read_only
    • storage_status_no_error
    • storage_status_no_free_space
    • storage_status_transition
    • storage_status_unknownerr
    • storage_status_vols_unavail
    • storage_utilization_data_bytes (B)
    • storage_utilization_data_bytes_rate (B/s)
    • storage_utilization_metadata_allowed_bytes (B)
    • storage_utilization_metadata_bytes (B)
    • storage_utilization_metadata_bytes_rate (B/s)
    • storage_utilization_total_space_bytes (B)
    • storage_utilization_total_space_bytes_rate (B/s)
    • storage_utilization_usable_space_bytes (B)
    • storage_utilization_usable_space_bytes_rate (B/s)
    • storage_utilization_used_space
    • swift_data_transfers_bytes_ingested
    • swift_data_transfers_bytes_ingested_rate
    • swift_data_transfers_bytes_retrieved
    • swift_data_transfers_bytes_retrieved_rate
    • swift_operations_failed
    • swift_operations_failed_rate
    • swift_operations_successful
    • swift_operations_successful_rate
    • swift_operations_unauthorized
    • swift_operations_unauthorized_rate
    • usable_percent (%)
    • utilization_percent (%)
  • netapp_storagegrid.tenant
    • agent_version
    • cluster_id
    • cluster_ip
    • cluster_name
    • cluster_oid
    • storagepool_key
    • tenant_id
    • tenant_name
    • usage_data_bytes (B)
    • usage_data_bytes_rate (B/s)
    • usage_object_count
    • usage_object_count_rate
    • usage_quota_bytes (B)
    • usage_quota_bytes_rate (B/s)
  • netapp_storagegrid.bucket
    • agent_version
    • bucket_id
    • bucket_name
    • cluster_id
    • cluster_ip
    • cluster_name
    • cluster_oid
    • code_type
    • internal_volume_key
    • method_type
    • policy_id
    • storagepool_key
    • tenant_id
    • tenant_name
    • usage_data_bytes (B)
    • usage_data_bytes_rate (B/s)
    • usage_object_count
    • usage_object_count_rate
Note: After doing this, CPU and memory utilization were also added. The collectors are constantly being improved with useful metrics, so static documentation can rapidly get out of date.

Formula & Mathematics

In order to present the data we need to understand a formula:

S3 Operations per Second (S3/s) = Throughput (MB/s) / Average Object Size (MB)

In DII, we have:

S3 Operations per Second = s3_operations_failed_rate + s3_operations_successful_rate
S3 Throughput = s3_data_transfers_bytes_ingested_rate + s3_data_transfers_bytes_retrieved_rate
% write = s3_data_transfers_bytes_ingested_rate / S3 Throughput
% read = s3_data_transfers_bytes_retrieved_rate / S3 Throughput
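As a rough sketch, the four DII expressions above can be applied to a single node sample like this. The `NodeSample` class and `workload_summary` function are hypothetical names of mine, the sample values are made up, and I am assuming the byte rates are reported in bytes per second; how you actually pull the values out of DII is not shown.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """One point-in-time sample of netapp_storagegrid.node metrics."""
    s3_operations_failed_rate: float               # assumed operations/s
    s3_operations_successful_rate: float           # assumed operations/s
    s3_data_transfers_bytes_ingested_rate: float   # assumed bytes/s
    s3_data_transfers_bytes_retrieved_rate: float  # assumed bytes/s


def workload_summary(sample: NodeSample) -> dict:
    """Apply the four formulas above to one sample."""
    ops_per_s = (sample.s3_operations_failed_rate
                 + sample.s3_operations_successful_rate)
    throughput = (sample.s3_data_transfers_bytes_ingested_rate
                  + sample.s3_data_transfers_bytes_retrieved_rate)
    if throughput > 0:
        pct_write = sample.s3_data_transfers_bytes_ingested_rate / throughput
        pct_read = sample.s3_data_transfers_bytes_retrieved_rate / throughput
    else:
        # An idle node has no meaningful read/write split.
        pct_write = pct_read = 0.0
    return {
        "s3_ops_per_s": ops_per_s,
        "s3_throughput_bytes_per_s": throughput,
        "pct_write": pct_write,
        "pct_read": pct_read,
    }


# Made-up values, purely for illustration.
sample = NodeSample(5.0, 495.0, 250_000_000.0, 750_000_000.0)
summary = workload_summary(sample)
print(summary)
```

Note that the read/write split here is by bytes transferred, not by operation count, which matches the formulas above.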

This means we can work out the average object size (S3 Throughput divided by S3 Operations per Second) and the workload profile. Combined with performance data from NetApp (which they can get internally from TR-6773i; note the "i" for internal), we can roughly work out how loaded the nodes are.
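Rearranging the first formula gives the average object size as throughput divided by operations per second. A minimal sketch follows; the function name is mine, and I am assuming the DII byte rates are in bytes per second and that MB means decimal megabytes.

```python
from typing import Optional

BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes


def average_object_size_mb(throughput_bytes_per_s: float,
                           ops_per_s: float) -> Optional[float]:
    """Average object size (MB) = throughput / S3 operations per second."""
    if ops_per_s <= 0:
        # No operations in the interval, so there is no average to report.
        return None
    return throughput_bytes_per_s / ops_per_s / BYTES_PER_MB


# Made-up example: 1 GB/s of S3 traffic at 500 operations/s.
print(average_object_size_mb(1_000_000_000, 500))  # 2.0
```

Because the operations figure above includes failed operations, which may move little or no data, the result is best treated as an estimate.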

The figures in the TR are for 100% read and 100% write workloads, but you can multiply those by the calculated % read and % write above to get a maximum based on the actual workload, i.e.:

% read x 100% read maximums + % write x 100% write maximums
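Here is a small sketch of that weighting, plus the headroom it implies. The function names are mine, and the per-node maximums are placeholder numbers, not figures from the TR; substitute the values for your appliance model, object size and ILM rule.

```python
def blended_maximum(pct_read: float, pct_write: float,
                    max_read: float, max_write: float) -> float:
    """% read x 100% read maximum + % write x 100% write maximum."""
    return pct_read * max_read + pct_write * max_write


def headroom(current: float, maximum: float) -> float:
    """Fraction of the blended maximum that is still unused."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return 1.0 - current / maximum


# Placeholder per-node maximums in MB/s; NOT figures from the TR.
MAX_READ_MBPS = 1000.0
MAX_WRITE_MBPS = 500.0

maximum = blended_maximum(0.75, 0.25, MAX_READ_MBPS, MAX_WRITE_MBPS)
print(maximum)                            # 875.0
print(f"{headroom(350.0, maximum):.0%}")  # 60%
```

The same two functions work with requests per second instead of MB/s, as long as the current load and both maximums use the same unit.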

** To be continued **
