vSphere and P4500 G2


Last month we installed three new vSphere 4.1 U1 servers (HP Proliant DL380G7) with plenty of RAM and CPUs on top of an HP Lefthand P4500 G2 SAS storage system. The SAN is iSCSI-based and runs over ProCurve 2910al switches. All components are redundantly attached, so we don't have any single point of failure.

 

We had to migrate 3TB of data from the old storage to the new Lefthand storage and everything went smoothly. Although we changed from an EVA4100 fibre channel storage system to "only" an iSCSI-based Lefthand storage system, speed was perfect.

In addition to the new vSphere servers and the storage system, we also installed a new backup server. This server needs direct access to the iSCSI storage too, because we want to use SAN-based backups of the VMs.

So we installed the new server with Windows Server 2008 R2 and enabled the built-in software iSCSI initiator. Furthermore, we presented all VMware storage to the backup server too.

 

That is when our problems began...

 

Starting or stopping any VM on the new storage took a LOT of time (5-10 min); sometimes VMs didn't start or stop at all. Starting several VMs at the same time caused most of them to fail with suspicious error messages, and sometimes with an "I/O error" message. VMotion was also impossible most of the time: sometimes it crashed at 10%, sometimes at 50-60%, and sometimes it worked.

Looking in the messages logfile, we found several "failed on physical path" error messages like this one:

 

vmkernel: cpu1: NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100040ebc00) to NMP device "naa.6000eb3xxxxxxxxxxxx" failed on physical path "vmhba38:C0:T4:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x9 0x4 0x2.
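
To get a quick overview of which paths and devices are affected, messages like the one above can be picked apart with a small script. The following is only a sketch in Python, not a tool from HP or VMware: the function names parse_nmp_failure and count_failures_by_path are made up for illustration, the regular expression is modelled only on the single log line shown here, and the lookup tables contain just a few standard SCSI opcodes and status codes.

```python
import re
from collections import Counter

# Regex modelled on the one sample line from this article (an assumption:
# other ESX builds may format the message differently).
NMP_FAILURE = re.compile(
    r'Command (?P<opcode>0x[0-9a-f]+) \([^)]*\) '
    r'to NMP device "(?P<device>[^"]+)" '
    r'failed on physical path "(?P<path>[^"]+)" '
    r'H:(?P<host>0x[0-9a-f]+) D:(?P<dev>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+)'
    r'(?: Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+))?',
    re.IGNORECASE,
)

# Small excerpts of standard SCSI tables; deliberately incomplete.
SCSI_OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
SCSI_STATUS = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
    0x18: "RESERVATION CONFLICT",
}


def parse_nmp_failure(line):
    """Return a dict describing one 'failed on physical path' line, or None."""
    m = NMP_FAILURE.search(line)
    if m is None:
        return None
    opcode = int(m.group("opcode"), 16)
    dev_status = int(m.group("dev"), 16)
    sense = None
    if m.group("key") is not None:
        # (sense key, ASC, ASCQ) as integers
        sense = tuple(int(m.group(g), 16) for g in ("key", "asc", "ascq"))
    return {
        "opcode": opcode,
        "opcode_name": SCSI_OPCODES.get(opcode, "unknown"),
        "device": m.group("device"),
        "path": m.group("path"),
        "host_status": int(m.group("host"), 16),
        "device_status": dev_status,
        "device_status_name": SCSI_STATUS.get(dev_status, "unknown"),
        "plugin_status": int(m.group("plugin"), 16),
        "sense": sense,
    }


def count_failures_by_path(lines):
    """Count matching failure lines per physical path."""
    counts = Counter()
    for line in lines:
        parsed = parse_nmp_failure(line)
        if parsed is not None:
            counts[parsed["path"]] += 1
    return counts


if __name__ == "__main__":
    sample = (
        'vmkernel: cpu1: NMP: nmp_CompleteCommandForPath: Command 0x2a '
        '(0x4100040ebc00) to NMP device "naa.6000eb3xxxxxxxxxxxx" failed on '
        'physical path "vmhba38:C0:T4:L0" H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0x9 0x4 0x2.'
    )
    print(parse_nmp_failure(sample))
```

For the line above, the sketch reports a WRITE(10) command (opcode 0x2a) that came back with device status CHECK CONDITION (D:0x2). Sense key 0x9 is defined as vendor specific in the SCSI standard, so the exact meaning of the sense data has to come from the storage vendor.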


I know these kinds of errors from other projects where the storage was too slow to deliver the data, but that couldn't be the problem this time.
Even when we stopped all VMs but two, these errors still came up. We suspected the backup server had something to do with it, but unpresenting the LUNs from that system and even stopping the iSCSI service on the backup server didn't fix the problem.

Rebooting the ESX servers and installing the latest patches also had no impact. We had several other projects with nearly the same hardware and never observed such problems, so what was different here?

Thinking a bit about it, we realized that we had installed the HP Lefthand MPIO DSM on the backup server. Normally we do not install any MPIO DSM on the backup server because most VMware backup products do not work well with MPIO drivers. This time, however, we did not plan to use such products and therefore installed the DSM to get proper failover and load balancing.

Searching the internet, I found Rhys Goodwin's blog; he had suffered the same problem. He figured out that installing the DSM can cause severe problems on VMware datastores because of the locking mechanism the DSM uses.

Update: thanks to Craig for commenting on this article and making me aware that I forgot to include the link to the solution. Here it is:

http://blog.rhysgoodwin.com/windows-admin/hp-lefthand-vsphere-failed-on-physical-path/

And here is his six-step guide to solve this problem:

 

1. Uninstall HP Lefthand DSM for MPIO from Windows hosts (We still want to try to present the VMFS LUNs back to the backup server at some stage)
2. Shut down all VMs
3. Shut down all the ESX hosts
4. Shut down Lefthand (Shut down the management group, not the nodes individually)
5. Power up the Lefthand and make sure all the nodes are up and volumes are all online
6. Power up the ESX hosts and VMs

End of update


By the way, this problem is also known at HP and VMware. Unfortunately, there is only a small warning at the end of an HP document that tells you not to use the MPIO DSM when you plan to map VMware datastores to this Windows server. Cool... who the hell reads manuals to the end?


Yesterday we followed his six-step plan and now we are able to start, stop and VMotion VMs as we want without affecting other running VMs.

 

So take care: do not use HP's MPIO DSM for P4000 systems with VMware datastores, and ALWAYS read manuals to the end. It will cost you less time than searching the internet for a solution.

Post your comments...

  • Craig

    Posted at 2011-09-09 01:09:58

    Thanks, good info and well written, but for anyone (else) that was left hanging..

    that would be this link.. http://blog.rhysgoodwin.com/windows-admin/hp-lefthand-vsphere-failed-on-physical-path/

    and this six-step-fix
    1. Uninstall HP Lefthand DSM for MPIO from Windows hosts (We still want to try to present the VMFS LUNs back to the backup server at some stage)
    2. Shut down all VMs
    3. Shut down all the ESX hosts
    4. Shut down Lefthand (Shut down the management group, not the nodes individually)
    5. Power up the Lefthand and make sure all the nodes are up and volumes are all online
    6. Power up the ESX hosts and VMs

