StorageX migration floods secd
Applies to
- ONTAP 9
- StorageX
- Migrations
Answer
Storage migrations that use third-party multi-threaded migration tools can create secd bottlenecks. This often presents as a client authentication performance issue while migrations are running, and any other authentication work on the affected node(s) can also be impacted.
Evidence of this bottleneck can be seen in the following errors:
Secd.log
secd requests can begin to queue, and requests will sit in the queue for increasingly long periods:
[kern_secd:info:10816] debug: Worker Thread 34641252096 processing RPC 153:secd_rpc_auth_get_creds with request ID:6605 which sat in the queue for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }
This can progress to RPC requests failing due to the 23-second timeout:
[kern_secd:info:5895] .------------------------------------------------------------------------------.
[kern_secd:info:5895] | RPC TOOK TOO LONG: |
[kern_secd:info:5895] | RPC used 24 seconds (max is 23) |
[kern_secd:info:5895] | and likely caused the client to timeout |
[kern_secd:info:5895] .------------------------------------------------------------------------------.
Eventually, if secd RPC memory allocation hits 80%, these messages start to be recorded:
secd.log
[kern_secd:info:10816] [SECD MASTER THREAD] SecD RPC Server: Too many outstanding Generic RPC requests: sending System Error to RPC 153:secd_rpc_auth_get_creds Request ID:65535.
ems.log
[secd: secd.rpc.server.request.dropped:debug]: The RPC secd_rpc_auth_get_creds sent from NBLADE_CIFS was dropped by SecD due to memory pressure.
Collecting secd CM statistics can also confirm how many times this condition was hit:
nas::> set diag
nas::*> statistics start -object secd -instance secd -node NETAPP01-06 -sample-id sample_695
nas::*> statistics stop -sample-id sample_695
nas::*> statistics show -sample-id sample_695
Object: secd
Instance: secd
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: NETAPP01-06
instance_name secd
node_name NETAPP01-06
num_rpcs_dropped_due_to_low_memory
mgwd 0
nblade 98765
dblade 0
num_rpcs_failed -
mgwd 0
nblade 98753
dblade 0
libc 0
The rpc_task_queue_latency counter also records a histogram of how long each queued request stayed in the queue:
process_name secd
rpc_task_queue_latency -
<20us 16667
<40us 0
<60us 0
<80us 0
<100us 0
<200us 0
<400us 0
<600us 0
<800us 0
<1ms 0
<2ms 0
<4ms 0
<6ms 0
<8ms 0
<10ms 0
<12ms 0
<14ms 0
<16ms 0
<18ms 0
<20ms 0
<40ms 0
<60ms 0
<80ms 0
<100ms 0
<200ms 0
<400ms 0
<600ms 0
<800ms 0
<1s 0
<2s 17620
<4s 16077
<6s 43298
<8s 31813
<10s 378
<20s 23
<30s 0
<60s 0
<90s 0
<120s 0
>120s 0
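The histogram above can be summarized to quantify the backlog. As a quick illustration (the bucket counts below are copied from the sample output above; zero-count buckets are omitted), the vast majority of sampled requests in this case waited a second or more in the queue:

```python
# Queue-latency histogram buckets taken from the sample above
# (bucket label -> request count; zero-count buckets omitted).
buckets = {
    "<20us": 16667,
    "<2s": 17620,
    "<4s": 16077,
    "<6s": 43298,
    "<8s": 31813,
    "<10s": 378,
    "<20s": 23,
}

total = sum(buckets.values())
# Everything beyond the <20us bucket waited at least ~1 second in this sample.
slow = total - buckets["<20us"]

print(f"total sampled requests: {total}")          # 125876
print(f"requests queued >= 1s: {slow} ({slow / total:.1%})")  # 109209 (86.8%)
```

A healthy system would show nearly all requests landing in the microsecond buckets; a distribution skewed toward the multi-second buckets like this one points at a sustained secd backlog.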
Also, since credential lookups occur in secd_rpc_auth_get_creds, expect to see elevated counts in:
Object: secd_rpc
Instance: secd_rpc_auth_get_creds
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: vservername
Counter Value
-------------------------------- --------------------------------
instance_name secd_rpc_auth_get_creds
last_update_time Thu May 24 15:50:09 2018
longest_runtime 0ms
node_name NETAPP-06
num_calls 97699
num_failures 86
num_successes 97613
process_name secd
shortest_runtime 0ms
vserver_name vservername
vserver_uuid c4f936f2-66a6-11e7-9713-90e2bacde704
StorageX is specifically mentioned because this type of issue has been seen with this migration product.
StorageX by default uses 16 threads per CPU core (configurable), so on a large multi-processor/multi-core server its concurrency can scale up quickly. Each thread copies a file and then, at the end of that job task, sets the Security Descriptor (DACLs, SACLs, and owner information) before moving on to the next file.
For example, an 8-core server equates to 128 threads. When very small files are being migrated and each file has a unique owner, ONTAP can end up doing a large amount of credential lookup work in a short period of time. In addition, multiple servers may be running the StorageX replication agent simultaneously, multiplying the load.
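The concurrency arithmetic above can be sketched as follows (the per-core default of 16 threads comes from the StorageX behavior described above; the multi-agent count is a hypothetical illustration):

```python
THREADS_PER_CORE = 16  # StorageX default, per the vendor behavior described above

def storagex_threads(cpu_cores: int, agents: int = 1) -> int:
    """Worst-case concurrent copy threads across all replication agents."""
    return cpu_cores * THREADS_PER_CORE * agents

# A single 8-core replication agent server:
print(storagex_threads(8))     # 128 concurrent copy threads
# Three such agents migrating in parallel (hypothetical):
print(storagex_threads(8, 3))  # 384 concurrent copy threads
```

Each of those threads can trigger a credential lookup when it sets a file owner, which is why the concurrency multiplies directly into secd load.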
Why does setting a file owner cause ONTAP more secd work?
Setting the file owner causes ONTAP to build the user's credential. If that credential is not already cached, ONTAP must query the Domain Controllers for the user's credentials.
RFE: Option to disable sid owner lookup upon setting ACL for file
https://mysupport-beta.netapp.com/si...P/BURT/1153207
Alternatively, running a release that contains the following fix avoids the situation:
Avoid getting windows group membership while setting file ownership
Sample of what will be seen in a packet trace:
>>file owner is set by StorageX at the end of the file sync
Frame1 Source: StorageX Dest: ONTAP SMB2 SetInfo Request SEC_INFO/SMB2_SEC_INFO_00 File: 1.txt
Owner: S-1-5-21-1417671877-1164952658-2896985891-1156 (Domain SID-Domain RID)
>>ONTAP (if SID not cached) will need to go to Domain Controller to lookup SID
Frame2 Source: ONTAP Dest: DC LSARPC lsa_LookupSids2 request
Sid: S-1-5-21-1417671877-1164952658-2896985891-1156 (Domain SID-Domain RID)
RID: 1156 (Domain RID)
>> Domain Controller will respond with the name translation of SID
Frame3 Source: DC Dest: ONTAP LSARPC lsa_LookupSids2 response
Pointer to String (uint16): thor
>>ONTAP will build the credential via s4u2self (LDAP is fallback) to Domain Controller
Frame4 Source: ONTAP Dest: DC KRB5 TGS-REQ
padata-type: kRB5-PADATA-S4U2SELF (129)
KerberosString: thor
>>Domain Controller will respond with user’s credentials – ONTAP will usermap internally
Frame5 Source: DC Dest: ONTAP KRB5 TGS-REP
>>ONTAP responds to the original setinfo in Frame1
Frame6 Source: ONTAP Dest: StorageX SMB2 SetInfo Response
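The per-file flow in the trace above can be modeled to show why unique file owners defeat credential caching. This is a simplified illustration only; `set_owner` and the cache structure are hypothetical, not ONTAP internals:

```python
def set_owner(sid: str, cred_cache: dict, dc_calls: list) -> None:
    """Simplified model of handling an SMB2 SetInfo that sets an owner SID."""
    if sid not in cred_cache:
        # Cache miss: a round-trip to the Domain Controller is needed
        # (lsa_LookupSids2, then S4U2Self/LDAP) to build the credential.
        dc_calls.append(sid)
        cred_cache[sid] = f"credential-for-{sid}"
    # Owner is applied using the (now cached) credential.

cache, dc_calls = {}, []

# 1000 files all owned by the same user: one DC round-trip, 999 cache hits.
for _ in range(1000):
    set_owner("S-1-5-21-...-1156", cache, dc_calls)
print(len(dc_calls))  # 1

# 1000 files each with a unique owner: every single file costs a DC round-trip.
for rid in range(1000):
    set_owner(f"S-1-5-21-...-{rid}", cache, dc_calls)
print(len(dc_calls))  # 1001
```

This is why migrations of many small files with diverse ownership are the worst case: the cache hit rate collapses and every SetInfo turns into external Domain Controller traffic.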
What recommendations do we have when we hit this situation?
- Check for external server bottlenecks\delays
- Lessen StorageX threads
- Spread the load to other node(s) secd
- Work with the customer account team\PS to assist in migrations
Check for external server bottlenecks\delays
Since part of each file sync involves setting owner information on the migrated files, this puts pressure on secd processing. A normal client workload would rarely generate so many credential lookups, but a large migration with many threads syncing many small files can flood secd with them. Check for any delays or bottlenecks in external server communications (DNS, AD, LDAP, NIS, name-mapping, name-services, etc.).
Troubleshooting these delays and bottlenecks can help lessen the impact of a flood of credential lookups.
See How can I tell if an external service such as netlogon, ldap-ad, lsa, ldap-nis-namemap, or nis is responding slowly?
Also check for external server delays related to off-box Vscan, FPolicy, and auditing; anything that adds latency to the elevated op rates associated with the migration will compound the problem.
Lessen StorageX threads
This recommendation will most likely require a StorageX engagement to verify the best approach to limiting concurrency. At the time of publication, this is the known method to accomplish it.
DWORD - HKEY_LOCAL_MACHINE\SOFTWARE\Data Dynamics\StorageX\ReplicationAgent\MaxDirectoryEnumerationThreads
MaxDirectoryEnumerationThreads (REG_DWORD): the default is 0 (or not defined), which means the maximum number of threads is calculated from the CPU count of the current system.
Restart the Replication Agent (RA) after changing the value.
Linux (UNIX) Replication Agent:
The same setting can be made on the UNIX RA in the following file: /usr/local/URA/log/Registry.xml. Add the following lines under the <Replication Agent> tag:
<VALUE name = "DisableReplicationPipelining" type = "REG_DWORD"><0X00000001>
<VALUE name = "MaxDirectoryEnumerationThreads" type = "REG_DWORD"><hex value of number of enumeration threads>
Finding the right balance may require trial and error: enough threads to keep the migration on pace without overwhelming secd. The thread count can be set as low as 1.