NetApp Knowledgebase

StorageX migration floods secd

Category:
ontap-9
Specialty:
nas

Applies to

  • ONTAP 9
  • StorageX
  • Migrations

Answer

Storage migrations that use third-party multi-threaded migration tools can create secd bottlenecks. This typically presents as a client authentication performance issue while migrations are running, and any other authentication work on the affected node(s) can also be impacted.
 
Evidence of this bottleneck can be seen in the following messages:
 
secd log
secd can begin to queue requests, and requests sit in the queue for increasingly long periods:


[kern_secd:info:10816] debug: Worker Thread 34641252096 processing RPC 153:secd_rpc_auth_get_creds with request ID:6605 which sat in the queue for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }
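The queue-wait time is embedded in each of these debug lines, so they can be scraped to gauge how far behind secd is falling. A minimal sketch (the regular expression is inferred from the message format above, which may vary by release):

```python
import re

# Inferred from the secd debug message format shown above; field layout
# may differ between ONTAP releases.
PATTERN = re.compile(
    r"processing RPC (\d+):(\S+) with request ID:(\d+) "
    r"which sat in the queue for (\d+) seconds"
)

def queue_waits(lines):
    """Yield (rpc_name, request_id, wait_seconds) for matching log lines."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield m.group(2), int(m.group(3)), int(m.group(4))

log = [
    "[kern_secd:info:10816] debug: Worker Thread 34641252096 processing "
    "RPC 153:secd_rpc_auth_get_creds with request ID:6605 which sat in "
    "the queue for 23 seconds. { in run() at src/server/secd_rpc_server.cpp:2067 }",
]
for rpc, req_id, wait in queue_waits(log):
    print(rpc, req_id, wait)
```

Sorting the yielded tuples by wait time quickly surfaces the worst-queued RPCs during the migration window.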
 

This can progress to RPC requests failing outright once they exceed the 23-second timeout:
[kern_secd:info:5895] .------------------------------------------------------------------------------.
[kern_secd:info:5895] |                              RPC TOOK TOO LONG:                              |
[kern_secd:info:5895] |                       RPC used 24 seconds (max is 23)                        |
[kern_secd:info:5895] |                   and likely caused the client to timeout                    |
[kern_secd:info:5895] .------------------------------------------------------------------------------.
 

Eventually, if secd RPC memory allocation reaches 80%, messages like the following are recorded:
 
secd log
[kern_secd:info:10816] [SECD MASTER THREAD] SecD RPC Server: Too many outstanding Generic RPC requests: sending System Error to RPC 153:secd_rpc_auth_get_creds Request ID:65535.
 
EMS log
[secd: secd.rpc.server.request.dropped:debug]: The RPC secd_rpc_auth_get_creds sent from NBLADE_CIFS was dropped by SecD due to memory pressure.
 
Collecting secd counter manager (CM) statistics can also confirm how many times this condition was hit.
 
nas::> set diag
nas::*> statistics start -object secd -instance secd -node NETAPP01-06 -sample-id sample_695
nas::*> statistics stop -sample-id sample_695
nas::*> statistics show -sample-id sample_695

 
Object: secd
Instance: secd
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: NETAPP01-06

 
instance_name                                                secd
    node_name                                              NETAPP01-06
    num_rpcs_dropped_due_to_low_memory
                                mgwd                                0
                              nblade                            98765
                              dblade                                0
    num_rpcs_failed                                                 -
                                mgwd                                0
                              nblade                            98753
                              dblade                                0
                                libc                                0
 

rpc_task_queue_latency also records a histogram of each queued request and how long it stayed in the queue.
 
    process_name                                                 secd
    rpc_task_queue_latency                                          -
                               <20us                            16667
                               <40us                                0
                               <60us                                0
                               <80us                                0
                              <100us                                0
                              <200us                                0
                              <400us                                0
                              <600us                                0
                              <800us                                0
                                <1ms                                0
                                <2ms                                0
                                <4ms                                0
                                <6ms                                0
                                <8ms                                0
                               <10ms                                0
                               <12ms                                0
                               <14ms                                0
                               <16ms                                0
                               <18ms                                0
                               <20ms                                0
                               <40ms                                0
                               <60ms                                0
                               <80ms                                0
                              <100ms                                0
                              <200ms                                0
                              <400ms                                0
                              <600ms                                0
                              <800ms                                0
                                 <1s                                0
                                 <2s                            17620
                                 <4s                            16077
                                 <6s                            43298
                                 <8s                            31813
                                <10s                              378
                                <20s                               23
                                <30s                                0
                                <60s                                0
                                <90s                                0
                               <120s                                0
                               >120s                                0
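The bucket counts can be rolled up to show how much of the workload is queuing pathologically. A small sketch using the non-zero buckets from the sample output above (bins are non-cumulative, and the <1s bins below 2s are all zero here):

```python
# Non-zero rpc_task_queue_latency buckets from the sample above;
# zero-count buckets are omitted.
buckets = {
    "<20us": 16667,
    "<2s": 17620, "<4s": 16077, "<6s": 43298,
    "<8s": 31813, "<10s": 378, "<20s": 23,
}

slow_bins = {"<2s", "<4s", "<6s", "<8s", "<10s", "<20s"}
slow = sum(n for b, n in buckets.items() if b in slow_bins)
total = sum(buckets.values())
print(f"{slow}/{total} requests ({slow/total:.0%}) queued for one second or longer")
```

In this sample the overwhelming majority of requests sat in the queue for seconds, which matches the timeout messages shown earlier.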


Also, since credential lookups occur in secd_rpc_auth_get_creds, expect elevated counts in:

Object: secd_rpc
Instance: secd_rpc_auth_get_creds
Start-time: 5/24/2018 15:46:34
End-time: 5/24/2018 15:50:09
Elapsed-time: 214s
Scope: vservername

    Counter                                                     Value
    -------------------------------- --------------------------------
    instance_name                             secd_rpc_auth_get_creds
    last_update_time                         Thu May 24 15:50:09 2018
    longest_runtime                                               0ms
    node_name                                               NETAPP-06
    num_calls                                                   97699
    num_failures                                                   86
    num_successes                                               97613
    process_name                                                 secd
    shortest_runtime                                              0ms
    vserver_name                                          vservername
    vserver_uuid                             c4f936f2-66a6-11e7-9713-
                                                         90e2bacde704
 
 
 
 
StorageX is specifically mentioned because this type of issue is commonly seen with this migration product.
 
By default, StorageX uses 16 threads per CPU core (configurable), so on a large multi-processor, multi-core server, concurrency can scale quickly. Each thread copies a file and then, at the end of the job task, writes the Security Descriptor (DACLs, SACLs, and owner information). The thread then moves on to the next file.
 
For example, an 8-CPU-core server equates to 128 threads. If the migration consists of very small files and each file owner is unique, ONTAP ends up doing a large amount of credential lookup work in a short time. With StorageX, multiple servers may also be running the replication agent, multiplying the load.
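As a rough sketch of the arithmetic (the files-per-second rate and the unique-owner ratio are illustrative assumptions, not measured values):

```python
# Back-of-the-envelope estimate of credential-lookup pressure from
# a StorageX migration. Only threads_per_core and cores come from the
# text above; the rest are assumed values for illustration.
threads_per_core = 16          # StorageX default
cores = 8
agents = 1                     # replication-agent hosts participating
threads = threads_per_core * cores * agents
print(threads)                 # concurrent copy threads

files_per_thread_per_sec = 5   # assumed: very small files copy quickly
unique_owner_ratio = 1.0       # worst case: every file has a distinct owner
lookups_per_sec = int(threads * files_per_thread_per_sec * unique_owner_ratio)
print(lookups_per_sec)         # uncached credential lookups per second
```

Even at modest per-thread file rates, hundreds of uncached lookups per second can reach secd, and additional replication-agent hosts multiply the figure.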
 
Why does setting a file owner cause ONTAP more secd work?
 
Setting the file owner causes ONTAP to build the owner's credential. If that credential is not already cached, ONTAP must query the domain controllers for it.
 
This RFE can help avoid this situation in the future:
RFE: Option to disable sid owner lookup upon setting ACL for file
https://mysupport-beta.netapp.com/si...P/BURT/1153207 

Running a release with this fix can also avoid the situation:
Avoid getting windows group membership while setting file ownership
 

Sample of what will be seen in a packet trace:
 
>>file owner is set by StorageX at the end of the file sync
Frame1 Source: StorageX   Dest: ONTAP SMB2    SetInfo Request SEC_INFO/SMB2_SEC_INFO_00 File: 1.txt
Owner: S-1-5-21-1417671877-1164952658-2896985891-1156  (Domain SID-Domain RID)
 
>>ONTAP (if SID not cached) will need to go to Domain Controller to lookup SID
Frame2 Source: ONTAP  Dest: DC LSARPC lsa_LookupSids2 request
Sid: S-1-5-21-1417671877-1164952658-2896985891-1156  (Domain SID-Domain RID)
RID: 1156  (Domain RID)
 
>> Domain Controller will respond with the name translation of SID
Frame3 Source: DC  Dest: ONTAP LSARPC lsa_LookupSids2 response
Pointer to String (uint16): thor
 
>>ONTAP will build the credential via S4U2Self (LDAP is the fallback) against the Domain Controller
Frame4 Source: ONTAP  Dest: DC KRB5      TGS-REQ
padata-type: kRB5-PADATA-S4U2SELF (129)
KerberosString: thor
 
>>Domain Controller responds with the user's credentials; ONTAP performs user mapping internally
Frame5 Source: DC  Dest: ONTAP KRB5      TGS-REP
 
>>ONTAP responds to the original setinfo in Frame1
Frame6 Source: ONTAP  Dest: StorageX  SMB2    SetInfo Response
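Since the RID shown in Frames 1 and 2 is just the last sub-authority of the SID string, a small helper illustrates the split (hypothetical code for illustration, not an ONTAP interface):

```python
# Split a Windows SID string, as seen in the SetInfo and lsa_LookupSids2
# frames above, into its domain SID and relative identifier (RID).
def split_sid(sid: str):
    parts = sid.split("-")
    return "-".join(parts[:-1]), int(parts[-1])

domain, rid = split_sid("S-1-5-21-1417671877-1164952658-2896985891-1156")
print(domain)  # S-1-5-21-1417671877-1164952658-2896985891
print(rid)     # 1156
```

Every unique RID that is not already cached triggers the lookup and S4U2Self exchange traced above, which is why per-file unique owners are the worst case.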
 

What recommendations do we have when we hit this situation?
  • Check for external server bottlenecks/delays
  • Reduce StorageX threads
  • Spread the load across other nodes' secd instances
  • Work with the customer account team/PS to assist in migrations

Check for external server bottlenecks/delays
 
Since part of each file sync involves setting owner information on the migrated files, migrations can put pressure on secd processing. A normal client workload would rarely generate this many credential lookups, but a large migration, with many threads syncing many small files, can cause a flood of them. Check for any delays or bottlenecks in external server communications (DNS, AD, LDAP, NIS, name mapping, name services, etc.).
Troubleshooting these delays and bottlenecks can help lessen the impact of a flood of credential lookups.

See How can I tell if an external service such as netlogon, ldap-ad, lsa, ldap-nis-namemap, or nis is responding slowly?

Also check for external server delays related to off-box Vscan, FPolicy, and auditing: anything that can add latency to the elevated op counts associated with the migration.
 

Reduce StorageX threads

This recommendation will most likely require a StorageX engagement to verify the best approach to limiting concurrency. At the time of publication, the following is the known method to accomplish this.
 
The registry key is a DWORD: HKEY_LOCAL_MACHINE\SOFTWARE\Data Dynamics\StorageX\ReplicationAgent\MaxDirectoryEnumerationThreads
 
MaxDirectoryEnumerationThreads (REG_DWORD): the default is 0 (or not defined), which means the maximum number of threads is calculated from the CPU count of the current system.
 
Restart the RA.
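As an illustration, the key could be set from an elevated command prompt with reg.exe; the value 8 here is only an example starting point, not a StorageX or NetApp recommendation:

```bat
REM Cap the Windows Replication Agent at 8 enumeration threads (example value)
reg add "HKLM\SOFTWARE\Data Dynamics\StorageX\ReplicationAgent" ^
    /v MaxDirectoryEnumerationThreads /t REG_DWORD /d 8 /f
```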
 
Linux (UNIX) Replication Agent:
The same setting can be made on the UNIX RA in the following file: /usr/local/URA/log/Registry.xml. Add the following lines under the <Replication Agent> tag:
<VALUE name="DisableReplicationPipelining" type="REG_DWORD">0x00000001</VALUE>
<VALUE name="MaxDirectoryEnumerationThreads" type="REG_DWORD">[hex value of the desired number of enumeration threads]</VALUE>
 
Finding the right balance may require trial and error: the thread count must be low enough not to overwhelm secd but high enough to keep the migration on pace. It can be set as low as 1.
 
