Slice Service restarts due to snapshots taking longer than the snapshot retention period
Applies to
- NetApp SolidFire Storage Nodes
- NetApp H Series Storage Nodes
- NetApp Element software 12.3.x or earlier
Issue
- A sliceServiceUnhealthy warning such as the following is detected on a cluster that uses standard and/or remote replication snapshots with a schedule that periodically deletes those snapshots.
Example:
25 2019-06-17T18:04:33.957Z Warning service 3 37 Yes 2019-06-17T18:10:38.400Z sliceServiceUnhealthy SolidFire Application cannot communicate with a metadata service.
23 2019-06-14T17:04:54.761Z Warning service 3 37 Yes 2019-06-14T17:09:38.927Z sliceServiceUnhealthy SolidFire Application cannot communicate with a metadata service.
20 2019-06-13T20:04:28.626Z Warning service 3 37 Yes 2019-06-13T20:08:47.734Z sliceServiceUnhealthy SolidFire Application cannot communicate with a metadata service.
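As a rough illustration, the fault entries above can be parsed to see how long each sliceServiceUnhealthy fault stayed active. This is only a sketch: it assumes the first timestamp on each line is when the fault was raised and the second is when it resolved, which matches the excerpt but is not an official log schema.

```python
import re
from datetime import datetime, timezone

# Fault lines copied verbatim from the excerpt above (messages trimmed).
FAULT_LINES = """\
25 2019-06-17T18:04:33.957Z Warning service 3 37 Yes 2019-06-17T18:10:38.400Z sliceServiceUnhealthy
23 2019-06-14T17:04:54.761Z Warning service 3 37 Yes 2019-06-14T17:09:38.927Z sliceServiceUnhealthy
20 2019-06-13T20:04:28.626Z Warning service 3 37 Yes 2019-06-13T20:08:47.734Z sliceServiceUnhealthy
"""

TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z"

def parse_ts(stamp):
    """Parse an ISO-8601 UTC timestamp as it appears in the log."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

def fault_durations(text):
    """Yield (raised, seconds_active) for each fault line with two timestamps,
    assuming timestamp 1 = fault raised and timestamp 2 = fault resolved."""
    for line in text.splitlines():
        stamps = re.findall(TS, line)
        if len(stamps) == 2 and "sliceServiceUnhealthy" in line:
            raised, resolved = map(parse_ts, stamps)
            yield raised, (resolved - raised).total_seconds()

for raised, secs in fault_durations(FAULT_LINES):
    print(f"{raised.isoformat()}  fault active for ~{secs:.0f}s")
```

Under that assumption each fault cleared on its own after roughly four to six minutes, consistent with the Slice service restarting and recovering.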
- The Active IQ event log shows that the Slice service restarts shortly after each periodic snapshot deletion and occasionally generates a core dump.
Example:
11806 2019-06-17T18:09:47.613Z serviceEvent Restarted SliceService: previous run killed with signal 6 (SIGABRT) core dump coreFileCount=1 servicecorelimit=7 49 3 37 { "replay":192.1844909617118 }
11801 2019-06-17T18:04:27.232Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 302, "expirationDate": "2019-06-17T18:00:00Z" }
11800 2019-06-17T18:04:27.226Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 301, "expirationDate": "2019-06-17T18:00:00Z" }
11799 2019-06-17T18:04:27.220Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 300, "expirationDate": "2019-06-17T18:00:00Z" }
11798 2019-06-17T18:04:27.214Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 299, "expirationDate": "2019-06-17T18:00:00Z" }
8371 2019-06-14T17:09:10.752Z serviceEvent Restarted SliceService: previous run killed with signal 6(SIGABRT) core dump coreFileCount=1 servicecorelimit=7 49 3 37 { "replay": 176.2541414554888 }
8370 2019-06-14T17:04:22.484Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 199, "expirationDate": "2019-06-14T17:00:02Z" }
8369 2019-06-14T17:04:22.479Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 198, "expirationDate": "2019-06-14T17:00:01Z" }
8368 2019-06-14T17:04:21.904Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 197, "expirationDate": "2019-06-14T17:00:02Z" }
7417 2019-06-13T20:08:29.169Z serviceEvent Restarted SliceService: previous run killed with signal 6(SIGABRT) core dump coreFileCount=1 servicecorelimit=7 49 3 37 { "replay": 114.2269542371808 }
7414 2019-06-13T20:04:21.117Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 177, "expirationDate": "2019-06-13T20:00:01Z" }
7413 2019-06-13T20:04:21.111Z sliceEvent Deleted snapshot due to reaching expiration date {"snapshotID": 176, "expirationDate": "2019-06-13T20:00:01Z" }
- This issue can occur on either the source or the target side of local or remote replication clusters.
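The event excerpt shows the pattern directly: each SliceService restart lands a few minutes after a burst of snapshot deletions. A small sketch that correlates the two event kinds (event IDs and timestamps are copied from the excerpt above; only one deletion per burst is kept here for brevity, and the helper names are made up for illustration):

```python
from datetime import datetime, timezone

# (eventID, timestamp, kind) taken from the Active IQ event excerpt.
EVENTS = [
    (11806, "2019-06-17T18:09:47.613Z", "restart"),
    (11801, "2019-06-17T18:04:27.232Z", "delete"),
    (8371, "2019-06-14T17:09:10.752Z", "restart"),
    (8370, "2019-06-14T17:04:22.484Z", "delete"),
    (7417, "2019-06-13T20:08:29.169Z", "restart"),
    (7414, "2019-06-13T20:04:21.117Z", "delete"),
]

def parse_ts(stamp):
    """Parse an ISO-8601 UTC timestamp as it appears in the event log."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)

def restart_lag(events):
    """For each SliceService restart, yield (restart_time, seconds) since the
    most recent preceding snapshot-deletion event."""
    last_delete = None
    for _, stamp, kind in sorted(events, key=lambda e: e[1]):
        t = parse_ts(stamp)
        if kind == "delete":
            last_delete = t
        elif kind == "restart" and last_delete is not None:
            yield t, (t - last_delete).total_seconds()

for t, lag in restart_lag(EVENTS):
    print(f"restart at {t.isoformat()} came ~{lag:.0f}s after the last snapshot deletion")
```

In each of the three incidents the restart follows the deletion burst by roughly four to five minutes, which is the correlation the KB describes rather than a coincidence.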