HP MSA1500 Ruined My Customer’s Week


A customer of mine with a small ESX deployment ran into some major grief this week with their MSA1500. Unfortunately for them, it took three days before they picked up the phone and got hold of me. This isn’t the first time I’ve run into a problem caused by an MSA1500, and it will not be the last. Symptoms start out minimal: VMs appear to be running slow…systems go unresponsive…then BOOM, all-out catastrophic failure!

The problem is that the controller built into the MSA1500 just was not made to support any serious throughput. The sweet spot for these devices is 2 ESX hosts and a handful of virtual machines (15 or fewer); anything more than that and you’ll be asking for trouble.

Here are some of the errors you should expect to see in your vmkernel and vmkwarning logs:

Jan  9 22:13:13 groucho vmkernel: 0:01:32:55.713 cpu0:1146)Fil3: 9811: Max Timeout retries exceeded for caller 0x928f4b (status 'Timeout')
Jan  9 22:13:13 groucho vmkernel: 0:01:32:55.713 cpu1:1087)VSCSI: 2803: Reset request on handle 8195 (0 outstanding commands)
Jan  9 22:13:13 groucho vmkernel: 0:01:32:55.713 cpu1:1054)VSCSI: 3019: Resetting handle 8195 [0/0]
Jan  9 12:15:25 groucho vmkernel: 2:18:31:13.494 cpu1:1037)WARNING: FS3: 4784: Reservation error: Timeout
Jan  9 21:28:16 groucho vmkernel: 0:00:47:59.406 cpu1:1033)VSCSI: 2803: Reset request on handle 8192 (1 outstanding commands)
Jan  9 21:28:16 groucho vmkernel: 0:00:47:59.406 cpu1:1054)VSCSI: 3019: Resetting handle 8192 [0/0]
Jan  9 12:15:25 groucho vmkernel: 2:18:31:13.494 cpu1:1037)WARNING: FS3: 4784: Reservation error: Timeout
Jan  9 15:36:08 groucho vmkernel: 2:21:51:56.008 cpu0:1034)VSCSI: 2803: Reset request on handle 8208 (3 outstanding commands)
Jan  9 15:36:08 groucho vmkernel: 2:21:51:56.008 cpu1:1054)VSCSI: 3019: Resetting handle 8208 [0/0]
Jan  9 20:40:49 groucho vmkernel: 0:00:00:02.483 cpu0:1024)CpuSched: 16758: Reset scheduler statistics
Jan  9 20:40:50 groucho vmkernel: 0:00:00:10.004 cpu1:1035)World: vm 1064: 895: Starting world FS3ResMgr with flags 1
Jan  8 07:58:05 groucho vmkernel: 1:14:13:57.304 cpu0:1034)VSCSI: 2803: Reset request on handle 8201 (1 outstanding commands)
Jan  8 07:58:05 groucho vmkernel: 1:14:13:57.305 cpu1:1054)VSCSI: 3019: Resetting handle 8201 [0/0]
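
If you want a quick read on how bad things are, you can tally these messages instead of eyeballing the log. Below is a minimal Python sketch, assuming your log lines look like the excerpt above. The category names, the `summarize` helper, and the sample list are my own inventions for illustration, not anything VMware ships.

```python
import re
from collections import Counter

# Regexes modeled on the log excerpt above. The dictionary keys are
# hypothetical labels chosen for this sketch.
PATTERNS = {
    "reservation_timeout": re.compile(r"FS3: \d+: Reservation error: Timeout"),
    "vscsi_reset_request": re.compile(r"VSCSI: \d+: Reset request on handle (\d+)"),
    "max_timeout_retries": re.compile(r"Max Timeout retries exceeded"),
}


def summarize(lines):
    """Count lines matching each symptom, plus reset requests per VSCSI handle."""
    counts = Counter()
    handles = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match is None:
                continue
            counts[name] += 1
            if name == "vscsi_reset_request":
                handles[match.group(1)] += 1
    return counts, handles


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = [
    "Jan  9 22:13:13 groucho vmkernel: 0:01:32:55.713 cpu0:1146)Fil3: 9811: "
    "Max Timeout retries exceeded for caller 0x928f4b (status 'Timeout')",
    "Jan  9 22:13:13 groucho vmkernel: 0:01:32:55.713 cpu1:1087)VSCSI: 2803: "
    "Reset request on handle 8195 (0 outstanding commands)",
    "Jan  9 12:15:25 groucho vmkernel: 2:18:31:13.494 cpu1:1037)WARNING: FS3: 4784: "
    "Reservation error: Timeout",
    "Jan  9 15:36:08 groucho vmkernel: 2:21:51:56.008 cpu0:1034)VSCSI: 2803: "
    "Reset request on handle 8208 (3 outstanding commands)",
]

counts, handles = summarize(SAMPLE)
for name, total in sorted(counts.items()):
    print(f"{name}: {total}")
for handle, total in sorted(handles.items()):
    print(f"handle {handle}: {total} reset request(s)")
```

To run it against a real log, pass an open file to `summarize` instead of `SAMPLE`; it only needs something that yields lines. A steadily climbing count of reservation timeouts and reset requests is the pattern I saw before everything fell over.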

So, what happens is that the controller in the MSA simply chokes and everything grinds to a halt. This of course does not play well with ESX. The only way to resolve it was to remote into each VM and shut it down. Luckily SOME of them responded to the VMware Tools guest shutdown…only two out of the 14 needed to be forcefully killed (kill -9). After everything was down, we shut down the ESX hosts. Then we shut down and restarted the MSA (power off: controller -> shelves; power on: shelves -> controller). Once it was back online we powered on only 2 of the 3 ESX hosts; I did not want to create too much contention on the MSA, and luckily those two hosts can still run all their VMs without a problem. Sometime next week we will be migrating to an EVA they have lying around.

So, in the end what did we learn?   MSA = Good for Test and Labs … BAD for Production