Intermittent timeouts reported and high CPU observed after StorageGRID expansion
Applies to
- NetApp StorageGRID
- Active-Backup High Availability (HA) Groups
Issue
- Following a StorageGRID expansion, clients observe intermittent SSL/TLS handshake timeouts and degraded S3 performance on GETs and PUTs.
- Example client error observed:
* [<name>: Failed to obtain output stream from connection for object [xxx], Fixed Content Device [{xxx-xxx-xxx]] Message was: Read timed out]\n at com.filenet.engine.content.fcprovider.handlers.s3.XXX.createContentObjectFromFile(XXX.java:523)\n at com.filenet.engine.content.fcprovider.handlers.s3.XXX.createContentObjectFromFile(XXX.java:432)\n at com.filenet.engine.content.fcprovider.handlers.s3.XXXX.createContent(XXXX.java:320)\n
- Some GETs are not recorded in /var/local/log/nginx-gw/endpoint-access.log as expected, indicating that the requests are not reaching the grid.
- The active Load Balancer shows a spike in CPU and runs at ~85% or higher.
- node_nf_conntrack_entries in Prometheus shows that connections almost doubled after more Storage Nodes were added.
- HTTP sessions increase on all Storage Nodes, as shown by rate(storagegrid_http_sessions_incoming_attempted[5m]) in Prometheus.
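The two Prometheus expressions listed above can also be evaluated programmatically, which may help when comparing values before and after an expansion. The following is a minimal sketch, assuming a Prometheus-compatible HTTP API (the standard /api/v1/query instant-query endpoint) is reachable at a base URL you supply. The function names, the base URL, and the use of the instance label to identify nodes are illustrative assumptions, not part of StorageGRID. Authentication and TLS trust settings are not handled here.

```python
import json
import urllib.parse
import urllib.request

# The two PromQL expressions named in this article.
CONNTRACK_QUERY = "node_nf_conntrack_entries"
HTTP_SESSIONS_QUERY = "rate(storagegrid_http_sessions_incoming_attempted[5m])"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for a Prometheus-compatible HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_vector(payload: dict) -> dict:
    """Map each series to {label value: sample value} for an instant vector.

    Assumption: series are keyed by their 'instance' label; series without
    that label are grouped under 'unknown'.
    """
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    values = {}
    for series in payload["data"]["result"]:
        _timestamp, sample = series["value"]  # sample arrives as a string
        values[series["metric"].get("instance", "unknown")] = float(sample)
    return values


def fetch(base_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Run one instant query and return the parsed per-instance values."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_vector(json.load(response))
```

For example, calling fetch(base_url, CONNTRACK_QUERY) before and after adding Storage Nodes, with base_url set to your own Prometheus endpoint, gives two dictionaries whose values can be compared per node to confirm the near doubling of tracked connections.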

