What causes ESX VMFS to "disappear"?

Last updated

Feb 12, 2024


Category:: data-ontap-8

Specialty:: 7dot


Applies to

ONTAP
Data ONTAP 7 and earlier
FlexPod

Answer

What triggers a VMFS resignature event?

When should I allow VMFS resignaturing and when should I avoid it?

Virtual Machine File Systems (VMFS) have a Universally Unique Identifier (UUID) and metadata which is determined by properties of the LUN. When a LUN is discovered that contains a VMFS, the metadata in the VMFS is compared to the properties of the LUN. The purpose of this check is to distinguish between a valid path to a LUN (possibly additional paths to be used for multipathing), and a copy of a LUN. If the properties match, the LUN is determined to be a valid path to a LUN. If the properties do not match, the LUN is determined to be a snapshot or clone of the LUN, not the original LUN.

If the LUN/VMFS was not previously known, and the metadata matches, the new VMFS is mounted and becomes accessible.

If the path is an additional path to an existing, mounted VMFS, the path is added as an additional path for multipathing purposes. The path will become visible in esxcfg-mpath -l.

If the metadata does not match the LUN properties, ESX determines the LUN to be a snapshot or clone. (VMware calls this a snapshot of a VMFS. In NetApp terminology, a snapshot LUN must be cloned to be writable, because snapshots are read-only.) ESX takes one of three actions, depending on the settings of two LVM options.

Default action: Ignore LUN

LVM.DisallowSnapshotLun=1
LVM.EnableResignature=0

ESX ignores the LUN. The VMFS volumes will not be visible in /vmfs/volumes or in VirtualCenter under Configuration > Storage. The LUNs themselves (and all their paths) will still be visible in Configuration > Storage Adapters under the appropriate HBAs. Entries are logged in /var/log/vmkernel and on the physical console of the ESX host.

Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.943 cpu1:1031)LVM: 5777: Device vmhba1:0:0:1 is a snapshot:
Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.943 cpu1:1031)LVM: 5783: disk ID: len 22, lun 0, devType 0, scsi 5, h(id) 286614107086019105>
Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.943 cpu1:1031)LVM: 5785: m/d disk ID: len 22, lun 0, devType 0, scsi 5, h(id) 4520976606442506935>
Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.953 cpu1:1031)LVM: 5777: Device vmhba1:0:0:1 is a snapshot:
Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.953 cpu1:1031)LVM: 5783: disk ID: len 22, lun 0, devType 0, scsi 5, h(id) 286614107086019105>
Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.953 cpu1:1031)LVM: 5785: m/d disk ID: len 22, lun 0, devType 0, scsi 5, h(id) 4520976606442506935>
Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.958 cpu1:1031)ALERT: LVM: 4941: vmhba1:0:0:1 may be snapshot: disabling access. See resignaturing section in SAN config guide.

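The log entries above can be scanned automatically. The following is a minimal sketch, not a supported tool: the regular expressions are modeled only on the messages quoted in this article (other ESX releases may word them differently), the function name is hypothetical, and it assumes that "disk ID" reports the current LUN properties while "m/d disk ID" reports the value stored in the VMFS metadata.

```python
import re

# Patterns modeled on the vmkernel messages quoted in this article (assumption:
# other ESX releases may format these lines differently).
SNAPSHOT_RE = re.compile(r"Device (\S+) is a snapshot")
DISABLED_RE = re.compile(r"ALERT: LVM: \d+: (\S+) may be snapshot: disabling access")
HID_RE = re.compile(r"LVM: \d+: (m/d )?disk ID:.*?h\(id\) (\d+)")


def summarize_snapshot_events(log_text):
    """Collect devices flagged as snapshots and the h(id) values that were compared."""
    lun_hids, metadata_hids = set(), set()
    for is_metadata, hid in HID_RE.findall(log_text):
        # Assumption: "m/d" marks the ID recorded in the VMFS metadata.
        (metadata_hids if is_metadata else lun_hids).add(hid)
    return {
        "flagged": sorted(set(SNAPSHOT_RE.findall(log_text))),
        "disabled": sorted(set(DISABLED_RE.findall(log_text))),
        "lun_hids": sorted(lun_hids),
        "metadata_hids": sorted(metadata_hids),
    }


SAMPLE_LOG = "\n".join([
    "Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.943 cpu1:1031)LVM: 5777: Device vmhba1:0:0:1 is a snapshot:",
    "Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.943 cpu1:1031)LVM: 5783: disk ID: len 22, lun 0, devType 0, scsi 5, h(id) 286614107086019105>",
    "Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.943 cpu1:1031)LVM: 5785: m/d disk ID: len 22, lun 0, devType 0, scsi 5, h(id) 4520976606442506935>",
    "Sep 10 12:59:49 esx1 vmkernel: 0:02:55:05.958 cpu1:1031)ALERT: LVM: 4941: vmhba1:0:0:1 may be snapshot: disabling access.",
])

summary = summarize_snapshot_events(SAMPLE_LOG)
print(summary["flagged"], summary["lun_hids"] != summary["metadata_hids"])
```

A device that appears in both "flagged" and "disabled" is one that ESX is ignoring under the default settings.
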
Temporary LUN use:

LVM.DisallowSnapshotLun=0
LVM.EnableResignature=0

ESX mounts the VMFS without modification. This option cannot safely be combined with actual LUN snapshots: if a snapshot or clone of a LUN is accessed using this option while the original LUN is also mounted, ESX will confuse the two and corruption will probably result.

Warning:
Do not use this option to access actual snapshots or clones of a LUN if the original is also being accessed from the same ESX host. Results are unpredictable and corruption may result.

Permanent LUN or snapshot/clone use (with resignaturing):

LVM.DisallowSnapshotLun=1 or 0 (setting is ignored)
LVM.EnableResignature=1

When a VMFS is discovered with metadata that does not match the LUN properties, the VMFS is resignatured, meaning the metadata is updated to match the LUN properties and a new UUID is generated. The VMFS will be visible in /vmfs/volumes with the new UUID and a symlink of the form 'snap-00000001- '. The symlink name will also show in VirtualCenter under Configuration > Storage (SCSI, SAN, and NFS).
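The three behaviors described above reduce to a small decision table. This sketch only restates the article's description; the function name and the returned labels are hypothetical and are not ESX identifiers.

```python
def lvm_action(metadata_matches, disallow_snapshot_lun=1, enable_resignature=0):
    """Return the action the article describes for a discovered VMFS.

    Defaults mirror the default settings listed in the article
    (LVM.DisallowSnapshotLun=1, LVM.EnableResignature=0).
    """
    if metadata_matches:
        # Valid path to a LUN: mounted, or added as an extra path.
        return "mount"
    if enable_resignature == 1:
        # DisallowSnapshotLun is ignored when resignaturing is enabled.
        return "resignature"
    if disallow_snapshot_lun == 1:
        return "ignore"
    # DisallowSnapshotLun=0, EnableResignature=0: temporary LUN use.
    return "mount-unmodified"


for flags in [(1, 0), (0, 0), (1, 1), (0, 1)]:
    print(flags, lvm_action(False, *flags))
```
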

In general, problems with LUN/VMFS access or snapshot state should be resolved by fixing LUN properties on the filer side, or by resignaturing. Temporary LUN use as detailed above has few, if any, use cases with NetApp storage.

Caution:
Perform a LUN resignature from only one host, and change the setting back (LVM.EnableResignature=0) immediately after the resignature so that it is not repeated from another host.

Implications of resignaturing

If a LUN property changes resulting in a resignature of a VMFS, any VMs that were registered on the VMFS will become 'Inaccessible' in VirtualCenter. This is because the VM (specifically its .vmx file) is registered using the /vmfs/volumes/<UUID> path. If a VMFS is resignatured, VMs will need to be reregistered or their existing registration corrected (see below).

Events that can cause VMFS metadata and LUN properties to mismatch

Change in LUN ID
Caused by:
- Admin error (LUN unmapped then mapped to wrong ID or LUN mapped to multiple ESX hosts at different LUN IDs).
  Best action: Unmap and remap using the correct LUN ID if possible.
- Clone or mirror to a new LUN
  Best action: Resignature. Any VMs in the clone would be registered as new VMs.
- CFMODE conversion to SSI
  Best action: Resignature. Re-register VMs or fix vmInventory.xml (see below)

Change in LUN serial number
Caused by:
- NVRAM card swap (either by itself or as part of a head/controller swap or upgrade). The filer assigns new serial numbers to all LUNs. See BUG 255243.
  fas3070a> lun show -v
  /vol/failtest/lun 100g (107374182400) (r/w, online, mapped)
      Serial#: HnSsd4DgUiAW
      Share: none
      Space Reservation: enabled
      Multiprotocol Type: vmware
      Maps: esx1=13 esx2=13
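Because both the serial number and the LUN ID mappings matter here, it can help to capture them in a structured form. The sketch below parses lun show -v output laid out one field per line, as it appears on the filer console; that layout is an assumption, and the function name is hypothetical.

```python
def parse_lun_show(output):
    """Map each LUN path to its serial number and igroup -> LUN ID mappings."""
    luns = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("/vol/"):
            # First token of a LUN header line is the LUN path.
            current = line.split()[0]
            luns[current] = {"serial": None, "maps": {}}
        elif current and line.startswith("Serial#:"):
            luns[current]["serial"] = line.split(":", 1)[1].strip()
        elif current and line.startswith("Maps:"):
            for pair in line.split(":", 1)[1].split():
                igroup, lun_id = pair.split("=")
                luns[current]["maps"][igroup] = int(lun_id)
    return luns


SAMPLE_OUTPUT = """\
/vol/failtest/lun 100g (107374182400) (r/w, online, mapped)
    Serial#: HnSsd4DgUiAW
    Share: none
    Space Reservation: enabled
    Multiprotocol Type: vmware
    Maps: esx1=13 esx2=13
"""

luns = parse_lun_show(SAMPLE_OUTPUT)
print(luns)
```

Saving this dictionary before a hardware change gives a baseline to compare against afterward.
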

When to allow resignature

Allow resignaturing when a LUN containing a VMFS is cloned (FlexClone, lun clone or sis clone), mirrored (SnapMirror then break/online, or SyncMirror split) or copied (ndmpcopy, dd) within a filer or between filers. The result is a new copy of the LUN and should be resignatured. In these situations, any cloned VMs are new instances and must be registered (and possibly reconfigured to avoid conflicts of names, SIDs, IP addresses, etc.).

When to avoid resignature

When LUN properties change because of a hardware change and there are valid VMs on the VMFS, prevent LUN resignature if possible by correcting LUN properties on the filer before starting any ESX hosts or connecting them to the storage or LUNs.

In either case, set the LVM option or fix the problem on the filer side, then rescan in ESX. Rescan one ESX host first, since resignaturing only needs to happen once, then once the VMFS are visible and stable on the first host, rescan on the other ESX hosts.

Migration to SSI

The term SSI (Single System Image) comes from the fact that unlike other CFMODEs, with SSI both controllers/heads have the same worldwide nodename (WWN or WWNN). Each FCP port on both heads will have a unique worldwide portname (WWPN) based on the common WWN. The issue that impacts VMware is that unlike standby or partner mode, in SSI a LUN on each controller cannot be mapped to the same initiator(s) at the same LUN ID. If in standby or partner mode there are LUNs mapped on both heads to the same initiator(s) at the same LUN IDs, one LUN of each conflicting pair will have to be remapped to an unused LUN ID. This means the LUN ID will not match the VMFS metadata and a resignature will be necessary.

This can occur for many pairs of LUNs. Any VMs in VMFS on these LUNs will appear 'Inaccessible' and need to be re-registered. To minimize the impact, it is suggested to examine the environment to see if one conflicting VMFS contains fewer VMs, and remap that LUN to minimize the number of VMs to reregister.
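Finding the conflicting pairs and choosing which LUN to remap can be sketched as follows. The data layout, the function name and the VM counts are hypothetical; the mappings themselves would come from lun show -v on each head, and the VM counts from inspecting each datastore.

```python
def find_ssi_conflicts(head_a, head_b):
    """Find LUNs on both heads mapped to the same initiator at the same LUN ID.

    Each argument maps (initiator, lun_id) -> (lun_path, vm_count).
    For every conflict, the LUN holding fewer VMs is suggested for remapping,
    which keeps the number of VMs to reregister as small as possible.
    """
    conflicts = []
    for key in sorted(head_a.keys() & head_b.keys()):
        lun_a, lun_b = head_a[key], head_b[key]
        remap = lun_a if lun_a[1] <= lun_b[1] else lun_b
        conflicts.append({"initiator": key[0], "lun_id": key[1], "remap": remap[0]})
    return conflicts


# Hypothetical example data.
head_a = {("esx1", 0): ("/vol/a/lun0", 12), ("esx1", 1): ("/vol/a/lun1", 3)}
head_b = {("esx1", 0): ("/vol/b/lun0", 4), ("esx1", 2): ("/vol/b/lun2", 7)}

conflicts = find_ssi_conflicts(head_a, head_b)
print(conflicts)
```
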

Migration from iSCSI to FCP

When migrating from iSCSI to FCP, the same LUN ID should be used for FCP that was used for iSCSI if possible. This will not always be possible in situations where the FCP CFMODE is SSI and some iSCSI LUNs on each controller/head are mapped to the same initiator with the same ID. This must be resolved using the same procedure as for a migration to SSI above. This scenario should be prevented by not mapping LUNs on both controllers to the same initiators at the same LUN ID. Use a scheme such as the first 20 LUN IDs on one head and the next 20 on the other, or an even/odd assignment. If one controller is used for VMware and the other for other applications, these conflicts will not occur. In many cases, users will not know about this ahead of time.

Migrating from FCP to iSCSI should never have any conflicts. Keep LUN IDs the same as they were for FCP.

Controller or NVRAM swap or upgrade procedure

After swapping a controller or NVRAM card, LUN serial numbers may be changed to reflect an S/N based on the new NVRAM ID. This will cause an unnecessary resignature event that should be avoided as follows:

BEFORE swapping, collect LUN properties including LUN S/N with lun show -v. (If in takeover, do partner lun show -v. This data may also be available in AutoSupport.)
Turn OFF VMFS resignaturing on all ESX hosts.
Replace the controller or NVRAM card.
Bring up the controller, but do NOT start ESX hosts using the LUNs.
For each LUN:
- lun offline /vol/volname/lun_name
- lun serial /vol/volname/lun_name old_serial_number
- lun online /vol/volname/lun_name
Verify LUN S/Ns and IDs with lun show -v
Start ESX.
Note:
Other KnowledgeBase articles refer to the commands lun attribute list and lun attribute set. This is functionally equivalent to the above method.
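The per-LUN command sequence in the procedure above can be generated from the serial numbers saved before the swap. This is a sketch under assumptions: the function name is hypothetical, the inputs are plain path-to-serial dictionaries built from lun show -v output, and the post-swap serial in the example is invented for illustration. It only prints commands for review; it does not run anything on the filer.

```python
def serial_restore_commands(saved_serials, current_serials):
    """Build lun offline / lun serial / lun online commands for changed LUNs."""
    commands = []
    for path in sorted(saved_serials):
        old_serial = saved_serials[path]
        if current_serials.get(path) == old_serial:
            continue  # serial number unchanged, nothing to do
        commands += [
            f"lun offline {path}",
            f"lun serial {path} {old_serial}",
            f"lun online {path}",
        ]
    return commands


saved = {"/vol/failtest/lun": "HnSsd4DgUiAW"}
current = {"/vol/failtest/lun": "XXXXXXXXXXXX"}  # hypothetical post-swap serial

for command in serial_restore_commands(saved, current):
    print(command)
```
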

LUN restore

(SnapRestore, restore from tape, SnapVault, etc.)

If the LUN is being restored in place of the original, the LUN should be restored with the same properties as before. Verify LUN ID and S/N before starting/connecting ESX hosts.
If the LUN is being restored in a different location and the original still exists and is in use by ESX, the restored LUN should have a different LUN ID and serial number and the VMFS in the restored LUN should be resignatured.

How to reregister the virtual machine(s)

1. Make sure the VMFS is visible to all ESX hosts.
2. Delete the 'Inaccessible' VM.
3. Browse the datastore (click the ESX host --> Summary --> right-click the datastore).
4. Browse through the datastore to find the VM's .vmx file.
5. Right-click the .vmx file and select 'Add to inventory'.
6. Repeat steps 2-5 for each 'Inaccessible' VM.

Alternate (advanced) method

Put first ESX host into maintenance mode, or disable automatic DRS.
Migrate functioning VMs onto other hosts.
SSH into ESX service console.
cd /vmfs/volumes
ls -l
[root@esx1 volumes]# ls -l
drwxrwxrwt 1 root 1260 Nov 27 11:58 474c4a74-b4cc8c53-6e29-000423c3e840
drwxrwxrwt 1 root 980 Nov 27 08:49 474c4aa2-772bdc66-e441-000423c3e840
drwxrwxrwt 1 root 1260 Nov 27 11:58 474c955b-527b5a13-1417-000423c3e840
lrwxr-xr-x 1 root 35 Nov 29 13:36 snap-00000002-VMFS11 -> 474c955b-527b5a13-1417-000423c3e840
lrwxr-xr-x 1 root 35 Nov 29 13:36 VMFS11 -> 474c4a74-b4cc8c53-6e29-000423c3e840
lrwxr-xr-x 1 root 35 Nov 29 13:36 VMFS13 -> 474c4aa2-772bdc66-e441-000423c3e840
Note (or copy) the new UUID(s) of the datastore(s) on which the inaccessible VMs live. You may need to look in the VMFS themselves to be sure which VMs live where.
cd /etc/vmware/hostd
cp vmInventory.xml vmInventory.xml-save
Edit vmInventory.xml and change the UUID for the inaccessible VMs to the correct UUID for their datastores. (If unsure which VMs are in which datastore, look in each datastore to ensure you have the right UUID for each VM).
Save vmInventory.xml and exit the editor.
To make ESX re-read vmInventory.xml:
[root@esx1 hostd]# service mgmt-vmware restart
Stopping VMware ESX Server Management services:
VMware ESX Server Host Agent Watchdog [ OK ]
VMware ESX Server Host Agent [ OK ]
Starting VMware ESX Server Management services:
VMware ESX Server Host Agent (background) [ OK ]
Availability report startup (background) [ OK ]
Verify all VMs are properly accessible.
Bring the ESX host out of maintenance mode and/or return DRS to original settings.

Sample vmInventory.xml file with UUID paths for VMs in an NFS and a VMFS datastore:
[root@esx1 hostd]# more vmInventory.xml
<ConfigRoot>
  <ConfigEntry id="0006">
    <objID>112</objID>
    <vmxCfgPath>/vmfs/volumes/9f801592-14465f39/WinNFS8/WinNFS8.vmx</vmxCfgPath>
  </ConfigEntry>
  <ConfigEntry id="0027">
    <objID>608</objID>
    <vmxCfgPath>/vmfs/volumes/46e5a3bf-2d233fa0-1546-0014220f1381/houwin2003sp2-8/houwin2003sp2-8.vmx</vmxCfgPath>
  </ConfigEntry>
</ConfigRoot>
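The UUID edit in the alternate method can also be scripted instead of done by hand. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name and the old-to-new UUID pairing are hypothetical, and the element names are taken from the sample file in this article. Run anything like this only against a copy such as vmInventory.xml-save until the result has been checked.

```python
import xml.etree.ElementTree as ET


def remap_vmx_paths(xml_text, uuid_map):
    """Replace datastore UUIDs in vmxCfgPath entries; return (new_xml, changed_count)."""
    root = ET.fromstring(xml_text)
    changed = 0
    for node in root.iter("vmxCfgPath"):
        parts = (node.text or "").split("/")
        # Expected layout: ['', 'vmfs', 'volumes', '<UUID>', '<vm dir>', '<vm>.vmx']
        if len(parts) > 3 and parts[1:3] == ["vmfs", "volumes"] and parts[3] in uuid_map:
            parts[3] = uuid_map[parts[3]]
            node.text = "/".join(parts)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


SAMPLE_XML = """<ConfigRoot>
  <ConfigEntry id="0006">
    <objID>112</objID>
    <vmxCfgPath>/vmfs/volumes/9f801592-14465f39/WinNFS8/WinNFS8.vmx</vmxCfgPath>
  </ConfigEntry>
  <ConfigEntry id="0027">
    <objID>608</objID>
    <vmxCfgPath>/vmfs/volumes/46e5a3bf-2d233fa0-1546-0014220f1381/houwin2003sp2-8/houwin2003sp2-8.vmx</vmxCfgPath>
  </ConfigEntry>
</ConfigRoot>"""

# Hypothetical pairing of an old UUID with the UUID assigned after resignaturing.
uuid_map = {"46e5a3bf-2d233fa0-1546-0014220f1381": "474c955b-527b5a13-1417-000423c3e840"}

new_xml, changed = remap_vmx_paths(SAMPLE_XML, uuid_map)
print(changed)
```

Entries whose UUID is not in the map, such as the NFS datastore path in the sample, are left untouched.
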

Additional Information

Do not use the 'Add Storage' wizard to discover existing VMFS, whether for the first connection to a VMFS on an ESX host, additional path or snapshots/clones. The 'Add Storage' wizard is for formatting a LUN with a new VMFS.

Warning: If you format the LUN, the existing VMFS and its contents will be destroyed.

Verify that no VMs are already 'Inaccessible' or have any other problems (missing VMDK or RDM LUNs). Capturing the output of vmware-cmd -l on all ESX hosts may be useful. Note that with VMotion, the ESX host on which a VM may appear can change, but the VMX path should be constant.
Verify that all VMFS are visible as expected. Note the names and UUIDs. (ls -l /vmfs/volumes)
Verify paths to each LUN are as expected.
Capture the following from the filer(s). Note that this information is also available in a recent AutoSupport:
- lun show -v
- igroup show
- fcp show initiators (if using FCP)
- iscsi show initiators (if using iSCSI)
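For the checklist item about noting VMFS names and UUIDs, the output of ls -l /vmfs/volumes can be turned into a name-to-UUID map that is easy to save and compare later. This is a sketch with hypothetical function names; it assumes the symlink lines look like the listing shown earlier in this article and that resignatured volumes carry the snap- prefix described there.

```python
def parse_vmfs_volumes(listing):
    """Map each datastore label (symlink name) to the UUID directory it points at."""
    labels = {}
    for line in listing.splitlines():
        if " -> " not in line:
            continue  # plain UUID directories carry no label
        left, target = line.rsplit(" -> ", 1)
        labels[left.split()[-1]] = target.strip()
    return labels


def resignatured_labels(labels):
    """Return labels that carry the snap- prefix given to resignatured volumes."""
    return sorted(name for name in labels if name.startswith("snap-"))


SAMPLE_LISTING = """\
drwxrwxrwt 1 root 1260 Nov 27 11:58 474c4a74-b4cc8c53-6e29-000423c3e840
lrwxr-xr-x 1 root 35 Nov 29 13:36 snap-00000002-VMFS11 -> 474c955b-527b5a13-1417-000423c3e840
lrwxr-xr-x 1 root 35 Nov 29 13:36 VMFS11 -> 474c4a74-b4cc8c53-6e29-000423c3e840
lrwxr-xr-x 1 root 35 Nov 29 13:36 VMFS13 -> 474c4aa2-772bdc66-e441-000423c3e840
"""

labels = parse_vmfs_volumes(SAMPLE_LISTING)
print(labels["VMFS11"], resignatured_labels(labels))
```
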
Prior to ONTAP 7.2.4, LUN Serial Numbers changed as a result of:
- Head Swaps (NVRAM)
- ONTAP upgrades
- CFODisaster (MetroCluster)