NetApp StorageGRID in DII and Thinking About Headroom: Part 1

Introduction

A customer asked me for the maximum IOPS of a StorageGRID system, and for advice on how heavily utilized their current system is. Essentially, a question of headroom.

From Google AI: IOPS (Input/Output Operations Per Second) measures the read and write performance of a physical storage device, like an SSD or HDD, while S3 operations relate to the performance of Amazon S3, a cloud storage service

The above essentially says to use S3 operations instead of IOPS when talking about the performance of an S3 object storage system. Of course, if you have physical hardware with spinning disks, you can always come up with a rough IOPS figure, but relating that to S3 operations is not directly possible (at least, I don't think so; please correct me if I'm wrong).

5,400 RPM HDDs: 50-60 IOPS
7,200 RPM SATA HDDs: 75-100 IOPS
10,000 RPM SATA HDDs: 125-150 IOPS
10,000 RPM SAS HDDs: 140 IOPS
15,000 RPM SAS HDDs: 175-210 IOPS

With S3 storage, we also need to consider throughput.

Quote:

"Normally, your grid is sized to achieve a required throughput, defined in terms of S3 operations per second or bytes per second. For example, you might have a requirement that your grid handle 1,000 S3 operations per second, or 2,000 MB per second, of object ingests and retrievals.
"For example, if your grid was sized to achieve a throughput of 2,000 MB/second, and your average object size is 2 MB, then your grid was sized to be able to handle 1,000 S3 operations per second (2,000 MB / 2 MB)."

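The arithmetic in the quote can be sketched in a few lines of Python. This is only an illustration of the relationship between throughput, average object size and S3 operations per second; the function names are my own and not part of any NetApp tool.

```python
def s3_ops_per_second(throughput_mb_s: float, avg_object_size_mb: float) -> float:
    """S3 operations per second implied by a throughput and an average object size."""
    if avg_object_size_mb <= 0:
        raise ValueError("average object size must be greater than zero")
    return throughput_mb_s / avg_object_size_mb


def throughput_mb_per_second(ops_per_s: float, avg_object_size_mb: float) -> float:
    """The same relationship, rearranged to give throughput in MB/s."""
    return ops_per_s * avg_object_size_mb


# The example from the quote: 2,000 MB/s with 2 MB objects.
print(s3_ops_per_second(2000, 2))
```

The same grid is therefore "1,000 S3 operations per second" or "2,000 MB per second" depending on which way round you state it, which is why the average object size matters so much.
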
And I want to present some data using NetApp DII (Data Infrastructure Insights) dashboards.

Fusion Tool

In NetApp's Fusion Tool (as of 27.05.2025), for Workloads we have:

Basic Workload Input

Usable Capacity (TiB)
Average Object Size (32KB, 64KB, 128KB, 256KB, 512KB, 1MB, 2MB, 4MB, 8MB, 16MB, 32MB, 64MB, 128MB, 256MB, 512MB)
Throughput: Req/s or MB/s
Workload Profile (Read/Write/Delete)

Mixed Workload 1 (50/25/25)
Mixed Workload 2 (25/50/25)
100% Reads (100/0/0)
100% Writes (0/100/0)

TR-6773i: StorageGRID Performance: NetApp StorageGRID 11.9

This is an internal TR (technical report) and cannot be shared externally. For the purposes of creating my dashboards, there are some useful takeaways.

We need to consider the performance of both types of appliance:

Storage Appliance
Services Appliance (gateway/load balancer)

There are a lot of factors that can influence performance:

Number of storage nodes at a site
Storage node deployment type
Type of StorageGRID appliance
Client application infrastructure
Workload (read, write, delete, concurrency*, object sizes)
Storage node object data and metadata capacity use
Information lifecycle management (ILM) configuration
Number of sites and latency between sites
Network infrastructure and configuration (i.e. LACP, 4 x 10GbE, 4 x 25GbE ...)
Platform services
Cross-grid replication
Stored object encryption

Measurement of performance:

S3 requests per second (requests/sec)

"For smaller objects, the main factor that influences performance is the transactional rate at which objects are processed by the system. Depending on the context, the object transactional rate can vary based on the operation that is performed, such as ingest rate, retrieval rate, or delete rate".

Throughput (megabytes per second or MBps)

"For larger objects, the payload size is the main factor that affects the performance of the system. Therefore, the throughput or the bandwidth of the system is the best indicator of performance."

Test results for various StorageGRID storage node appliances are given as -

100% PUT per-node performance (using 2-copy replicated ILM)
100% GET per-node performance (using 2-copy replicated ILM)
100% PUT per-node performance (using EC2+1 ILM)
100% GET per-node performance (using EC2+1 ILM)

with all object sizes (KB): 32, 64 ... 524288
threads (1024 for 32KB, 64KB, 128KB and either 8 or 4 for the rest)
requests per second
throughput (MBps)

*Concurrency in computing refers to the ability of a system to handle multiple tasks or processes simultaneously or at overlapping times.

Note (from Google AI but seems reasonable): StorageGRID erasure coding is not ideal for small objects (less than 200KB), objects that need to be retrieved quickly (high latency), or when a large number of storage nodes and sites are unavailable. It's also less efficient for frequently retrieved objects or those that require frequent repairs.

StorageGRID Fields In NetApp Data Infrastructure Insights (DII)

I want to create some dashboards that can help with the customer question. Below I've listed the fields in DII and highlighted those that I think will be of interest for creating dashboards.

Correct as of 27.05.2025.

netapp_storagegrid.cluster

agent_version
cluster_id
cluster_ip
cluster_name
cluster_oid
usable_percent
utilization_percent

netapp_storagegrid.node

agent_version
cluster_id
cluster_ip
cluster_name
cluster_oid
code
content_buckets_and_containers
content_objects
content_objects_lost
content_objects_lost_rate
cpu_seconds_total
http_sessions_incoming_attempted_rate
http_sessions_incoming_currently_established_rate
http_sessions_incoming_failed_rate
http_sessions_incoming_successful_rate
identity_service_failed_authorize_by_uuid_requests_rate
identity_service_failed_authorize_requests_rate
identity_service_failed_change_password_requests_rate
identity_service_failed_get_group_requests_rate
identity_service_failed_list_groups_requests_rate
identity_service_failed_new_connection_dials_rate
identity_service_failed_new_connections_rate
identity_service_failed_schedule_synchronization_requests_rate
identity_service_failed_search_group_rate
identity_service_failed_search_groups_rate
identity_service_failed_search_user_rate
identity_service_failed_search_users_rate
identity_service_failed_synchronization_scans_rate
identity_service_failed_tenant_synchronizations_rate
identity_service_failed_validate_requests_rate
identity_service_schedule_synchronization_time_ms_bucket_rate
identity_service_schedule_synchronization_time_ms_count_rate
identity_service_schedule_synchronization_time_ms_sum_rate
identity_service_total_schedule_synchronization_requests_rate
ilm_awaiting_background_objects
ilm_awaiting_client_evaluation_objects_per_second (sec)
ilm_awaiting_client_objects
ilm_awaiting_total_objects
ilm_scan_objects_per_second (sec)
ilm_scan_period_estimated_minutes
job
metadata_queries_average_latency_milliseconds (ms)
network_received_bytes_rate (B/s)
network_transmitted_bytes_rate (B/s)
node_name
node_uuid
operation
platform_services_failed_raft_requests
platform_services_failed_raft_requests_rate
platform_services_failed_replications
platform_services_failed_replications_rate
platform_services_failed_s3_notifications
platform_services_failed_s3_notifications_rate
platform_services_permanently_failed_requests
platform_services_permanently_failed_requests_rate
platform_services_total_raft_requests
platform_services_total_raft_requests_rate
platform_services_total_s3_notifications
platform_services_total_s3_notifications_rate
s3_data_transfers_bytes_ingested_rate
s3_data_transfers_bytes_retrieved_rate
s3_operations_failed
s3_operations_successful
s3_operations_failed_rate
s3_operations_successful_rate
s3_operations_unauthorized
s3_operations_unauthorized_rate
s3_requests_cancelled_total
s3_requests_cancelled_total_rate
s3_requests_total
s3_requests_total_rate
servercertificate_management_interface_cert_expiry_days
servercertificate_storage_api_endpoints_cert_expiry_days
service_cpu_seconds (sec)
service_cpu_seconds_rate
service_load
service_memory_usage_bytes (B)
service_memory_usage_bytes_rate (B/s)
service_network_received_bytes (B)
service_network_received_bytes_rate (B/s)
service_network_transmitted_bytes (B)
service_network_transmitted_bytes_rate (B/s)
service_restarts
service_restarts_rate
service_runtime_seconds (sec)
service_uptime_seconds (sec)
site_id
site_name
storage_state_current_maintenance
storage_state_current_offline
storage_state_current_online
storage_state_current_read_only
storage_status_no_error
storage_status_no_free_space
storage_status_transition
storage_status_unknownerr
storage_status_vols_unavail
storage_utilization_data_bytes (B)
storage_utilization_data_bytes_rate (B/s)
storage_utilization_metadata_allowed_bytes (B)
storage_utilization_metadata_bytes (B)
storage_utilization_metadata_bytes_rate (B/s)
storage_utilization_total_space_bytes (B)
storage_utilization_total_space_bytes_rate (B/s)
storage_utilization_usable_space_bytes (B)
storage_utilization_usable_space_bytes_rate (B/s)
storage_utilization_used_space
swift_data_transfers_bytes_ingested
swift_data_transfers_bytes_ingested_rate
swift_data_transfers_bytes_retrieved
swift_data_transfers_bytes_retrieved_rate
swift_operations_failed
swift_operations_failed_rate
swift_operations_successful
swift_operations_successful_rate
swift_operations_unauthorized
swift_operations_unauthorized_rate
usable_percent (%)
utilization_percent (%)

netapp_storagegrid.tenant

agent_version
cluster_id
cluster_ip
cluster_name
cluster_oid
storagepool_key
tenant_id
tenant_name
usage_data_bytes (B)
usage_data_bytes_rate (B/s)
usage_object_count
usage_object_count_rate
usage_quota_bytes (B)
usage_quota_bytes_rate (B/s)

netapp_storagegrid.bucket

agent_version
bucket_id
bucket_name
cluster_id
cluster_ip
cluster_name
cluster_oid
code_type
internal_volume_key
method_type
policy_id
storagepool_key
tenant_id
tenant_name
usage_data_bytes (B)
usage_data_bytes_rate (B/s)
usage_object_count
usage_object_count_rate

Note: After I compiled this list, CPU and memory utilization metrics were also added. The collectors are constantly being improved with useful metrics, so static documentation can rapidly get out of date.

Formula & Mathematics

In order to present the data we need to understand a formula:

S3 Operations per Second (S3/s) = Throughput (MB/s) / Average Object Size (MB)

In DII, we have:

S3 Operations per Second = s3_operations_failed_rate + s3_operations_successful_rate

S3 Throughput = s3_data_transfers_bytes_ingested_rate + s3_data_transfers_bytes_retrieved_rate

% write = s3_data_transfers_bytes_ingested_rate / S3 Throughput

% read = s3_data_transfers_bytes_retrieved_rate / S3 Throughput
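
As a rough sketch, the DII calculations above could be combined like this in Python. The field names are the DII ones listed earlier; the NodeSample class, the summarize function and the sample numbers are made up for illustration, and I am assuming the rate fields are per second (operations/s and bytes/s).

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """One reading of the netapp_storagegrid.node rate fields (assumed per second)."""
    s3_operations_successful_rate: float
    s3_operations_failed_rate: float
    s3_data_transfers_bytes_ingested_rate: float
    s3_data_transfers_bytes_retrieved_rate: float


def summarize(sample: NodeSample) -> dict:
    """Derive ops/s, throughput, read/write split and average object size."""
    ops = sample.s3_operations_successful_rate + sample.s3_operations_failed_rate
    ingested = sample.s3_data_transfers_bytes_ingested_rate
    retrieved = sample.s3_data_transfers_bytes_retrieved_rate
    throughput = ingested + retrieved
    return {
        "s3_ops_per_s": ops,
        "throughput_bytes_per_s": throughput,
        # Guard against an idle node to avoid dividing by zero.
        "write_fraction": ingested / throughput if throughput else 0.0,
        "read_fraction": retrieved / throughput if throughput else 0.0,
        "avg_object_size_bytes": throughput / ops if ops else 0.0,
    }


# Made-up sample: 500 ops/s moving 1 GB/s in total, 60% of it ingest.
example = NodeSample(450.0, 50.0, 600e6, 400e6)
print(summarize(example))
```

Note that the read/write split here is by bytes transferred, not by request count, and the operations rate may include requests that move little or no data, so the average object size is an approximation.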

Which means we can work out the average object size (S3 Throughput / S3 Operations per Second) and the workload profile (% read and % write). Combined with performance data from NetApp (which they can get internally from TR-6773i; note the "i" for internal), we can roughly work out how loaded the nodes are.

The figures in the TR are from 100% read and 100% write, but you can multiply those by the calculated % read and % write above, to get a maximum based on the workload, i.e.:

% read x 100% read maximums + % write x 100% write maximums
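
A minimal Python sketch of that weighted maximum, and the headroom that falls out of it, might look like the following. The per-node maximums below are placeholder numbers I invented for the example; they are NOT figures from TR-6773i. The function names are also my own, and the linear weighting is only the rough approximation described above.

```python
def blended_maximum(read_fraction: float, write_fraction: float,
                    read_max: float, write_max: float) -> float:
    """% read x 100% read maximum + % write x 100% write maximum."""
    return read_fraction * read_max + write_fraction * write_max


def headroom_fraction(observed: float, maximum: float) -> float:
    """Fraction of the estimated maximum that is still unused."""
    if maximum <= 0:
        raise ValueError("maximum must be greater than zero")
    return 1.0 - observed / maximum


# Placeholder maximums (requests/sec per node), purely illustrative.
READ_MAX = 4000.0
WRITE_MAX = 2000.0

estimated_max = blended_maximum(0.25, 0.75, READ_MAX, WRITE_MAX)
print(estimated_max)
print(headroom_fraction(500.0, estimated_max))
```

The maximums should be looked up for the matching appliance model, ILM type and object size, using requests per second for small objects or MBps for large ones, with the observed value in the same unit.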

** To be continued **
