|
|
Stu Gepp
|
We have deployed 6 separate EVA4400s with a single disk shelf each in our production environment.
On Wednesday last week one of these suffered a double cache battery failure - the second battery died 65 minutes after the first.
This caused the whole SAN to go offline and we lost access to all LUNs.
HP have replaced the batteries and all is working again now, but they are saying that the offline behaviour is by design. I find this incredibly bizarre behaviour, as I wouldn't expect the cache batteries to be in the critical path - they are there just in case external power fails.
Is this behaviour common across all HP SANs?
|
|
|
|
|
|
|
|
kunalsahoo
|
|
Nov 3, 2009 19:46:23 GMT
0 pts
|
|
Hi,
This is by design across all EVAs, and it is for data safety.
When both batteries go down, there is always a possibility that unflushed data remains in the cache, so the controllers lock host access in this scenario to keep the data from being corrupted or lost. Once the batteries are back, the unflushed data is first written to the disks and then the controllers automatically release the lock.
Attached is the table which describes the best Vdisk access in different battery states... |
|
IBaltay
|
|
Nov 3, 2009 19:50:04 GMT
0 pts
|
|
Hi, surely it is mistaken. Could you provide your current controller firmware? |
|
IBaltay
|
|
Nov 3, 2009 19:52:55 GMT
0 pts
|
|
|
could you provide the termination code of the issue? |
|
Uwe Zessin
|
|
Nov 3, 2009 20:03:26 GMT
10 pts
|
|
Sure, it's for data safety, but the problem is that this behaviour cannot be turned off to at least run minimal services in write-through mode! Instead you are left "dead in the water"!
We had a problem with 2 EVAs some years ago: 3 dead cache batteries and no replacements available! I've asked around, but it seems that there isn't even a 'secret bypass' to enable access to the data.
I would not have a problem if the customer had to sign a paper that he accepts the risk of a cache loss and then somebody comes and flips a bit to allow access again. Most systems are connected through an external UPS anyway.
In what can be thought of as the previous generation of storage array, based on the HSG controller, it was possible to tell the module to ignore the state of the cache battery, and this was documented in the manual. |
|
IBaltay
|
|
Nov 3, 2009 20:35:54 GMT
6 pts
|
|
OK, I see I overlooked that the question was about the cache policy preventing access to the whole DG: it starts with the battery system on one controller being detected as no longer good and the disk drives being moved to the other controller. If the battery system then fails on the other controller too, there is no disk presentation... My questions were rather meant to find out whether the double battery failure was related to a controller firmware bug, so as to be able to prevent the recurrence of such a failure... |
|
kunalsahoo
|
|
Nov 3, 2009 20:47:31 GMT
5 pts
|
|
@IBaltay Is there any known bug in the EVA4400 firmware which marks both batteries as failed at the same time? Please share the firmware version and, if possible, the release notes of the version in which it is fixed... Does the new one have the fix, 21000 to be specific? |
|
IBaltay
|
|
Nov 3, 2009 21:08:28 GMT
7 pts
|
|
|
|
Stu Gepp
|
|
Nov 3, 2009 21:13:41 GMT
N/A: Question Author
|
|
The controller firmware is 09522000. I am aware of the previous version(s) falsely reporting battery failure but in my experience the failure cleared in a matter of milliseconds.
These batteries definitely died and rendered the SAN unusable.
I find it bizarre that the policy is that if both the batteries die I am denied access to my data. I don't even have the option to move it elsewhere where I can do writes safely - regardless of the fact that the EVA is in a datacentre with UPS and generator backup. |
|
kunalsahoo
|
|
Nov 3, 2009 21:36:50 GMT
6 pts
|
|
|
Unfortunately that's how it works... However, I am a bit surprised to see both batteries fail at the same time, which is quite rare to experience... |
|
Uwe Zessin
|
|
Nov 4, 2009 09:29:20 GMT
8 pts
|
|
|
Over the years, cache batteries have never worked 100% reliably - that is not limited to the EVA. I fully agree with Stu that it is not acceptable for an 'enterprise' product to prevent all access to the data - data integrity in mind or not. A system designer/programmer far away in a cubicle just cannot make that decision! |
|
|
Stu Gepp
|
|
Nov 4, 2009 15:08:59 GMT
N/A: Question Author
|
|
kunalsahoo,
I am trying to understand the logic behind the table you posted and I can't figure it out.
It is evident that the controller is able to disable write-back and use write-thru if the best battery available is 'low'.
Why does it not do the same when there are no batteries available? |
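To make the question concrete, here is a minimal sketch of the policy as it is described in this thread. The attached table itself is not reproduced here, so the three battery states and the mapping below are assumptions pieced together from the posts (good battery: write-back, best battery 'low': write-through, no usable battery: Vdisks not presented). The names `Battery`, `Access` and `vdisk_access` are hypothetical and are not part of any HP tool or firmware interface.

```python
from enum import Enum


class Battery(Enum):
    # Hypothetical battery states, ordered from worst to best.
    FAILED = 0
    LOW = 1
    GOOD = 2


class Access(Enum):
    NOT_PRESENTED = "not presented"
    WRITE_THROUGH = "write-through"
    WRITE_BACK = "write-back"


def vdisk_access(batteries):
    """Map the best available battery state to a host access mode.

    This mirrors the behaviour described in the thread, not a documented
    HP algorithm: the decision is driven by the best battery in the array.
    """
    best = max(batteries, key=lambda b: b.value, default=Battery.FAILED)
    if best is Battery.GOOD:
        return Access.WRITE_BACK
    if best is Battery.LOW:
        return Access.WRITE_THROUGH
    # No usable battery at all: host access is locked.
    return Access.NOT_PRESENTED


if __name__ == "__main__":
    print(vdisk_access([Battery.GOOD, Battery.FAILED]).value)
    print(vdisk_access([Battery.LOW, Battery.FAILED]).value)
    print(vdisk_access([Battery.FAILED, Battery.FAILED]).value)
```

The open question is the last branch: why the final case locks access instead of falling back to write-through as the 'low' case does.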
|
Uwe Zessin
|
|
Nov 4, 2009 16:39:09 GMT
9 pts
|
|
Because write-through is only part of the solution. It does not fix the "write hole" problem:
A short write of VRAID-1 data needs to update two disk drives, same for VRAID-5 (data + parity). During a power failure it is possible that the write hits one disk only. This results in data inconsistency.
Of course it is possible to 'fix' it after power comes back: declare one mirror copy as a master and update the other, declare all VRAID-5 data as the master and recalculate the parity.
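The write hole and the 'data as master' repair can be illustrated with a small sketch. This is a toy model of a single parity stripe using plain XOR parity, under the assumption that a data block and its parity are written as two separate steps; it is not how the EVA lays out VRAID-5 internally, and every function name here is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe():
    """Build a consistent stripe: three data blocks plus their parity."""
    data = [b"AAAA", b"BBBB", b"CCCC"]
    return {"data": data, "parity": xor_blocks(data)}


def write_with_power_loss(stripe, index, new_block, fail_before_parity):
    """Update one data block, then the parity; optionally lose power in between."""
    stripe["data"][index] = new_block      # the first disk write lands
    if fail_before_parity:
        return                             # power fails here: parity is stale
    stripe["parity"] = xor_blocks(stripe["data"])


def is_consistent(stripe):
    return stripe["parity"] == xor_blocks(stripe["data"])


def reconstruct(stripe, missing):
    """Rebuild one data block from the parity and the remaining data blocks."""
    others = [b for i, b in enumerate(stripe["data"]) if i != missing]
    return xor_blocks([stripe["parity"]] + others)


def resync_parity(stripe):
    """Repair after power returns: treat the data as master, recalculate parity."""
    stripe["parity"] = xor_blocks(stripe["data"])


if __name__ == "__main__":
    stripe = make_stripe()
    write_with_power_loss(stripe, 1, b"ZZZZ", fail_before_parity=True)
    print("consistent after interrupted write:", is_consistent(stripe))
    print("block 0 rebuilt correctly:", reconstruct(stripe, 0) == b"AAAA")
    resync_parity(stripe)
    print("consistent after resync:", is_consistent(stripe))
    print("block 0 rebuilt correctly:", reconstruct(stripe, 0) == b"AAAA")
```

The point of the `reconstruct` check is that the stale parity does no harm until a disk is lost: at that moment an untouched block would be rebuilt with wrong contents, which is why the inconsistency has to be repaired once power is back.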
The OpenVMS operating system has implemented the feature for host-based RAID-1 (known as volume shadowing) and something similar is available for the add-on RAID-5 software.
The EVA uses a lot of proprietary hardware and makes many assumptions about data layouts. This gives speed and allows a simple PPC to control the whole system, but don't expect such advanced features.
Still, there should be a way to get at the data even with the write-hole risk. |
|
|
Stu Gepp
|
|
Nov 4, 2009 16:48:57 GMT
N/A: Question Author
|
|
Thank you Uwe.
That makes it much clearer. |
|
|
Mauro Livi
|
|
Nov 4, 2009 17:38:19 GMT
3 pts
|
|
Hi all, Just saw this and became a bit concerned as I will be upgrading to XCS 09522000 in the near future. Is Stu's problem of BOTH battery failures something of an extremely rare anomaly or is this perhaps the latest problem in a long line of EVA firmware bugs? Up till now I've been under the impression that XCS 09522000 is the best and most robust firmware version to date. I would hate to think that there is a bug in there which marks BOTH batteries as failures. Obviously I can ill afford something like that.
Your feedback is appreciated as always. Mauro |
|
Uwe Zessin
|
|
Nov 4, 2009 17:55:00 GMT
8 pts
|
|
The unpresentation of vdisks when both batteries are dead was a design decision and is present in previous EVA generations as well.
It is my experience (since EVA V1.0) that the failure of both batteries is rare and has nothing to do with XCS firmware bugs. The failure of 3 out of 4 batteries I mentioned yesterday was a production quality problem. Fortunately, for the customer, it happened before the systems were put into production. |
|
|
Mauro Livi
|
|
Nov 4, 2009 18:36:33 GMT
5 pts
|
|
Thanks Uwe. Yeah I understand that the unpresentation of vdisks when both batteries are dead was a design decision...I think it was a strange decision, but it is what it is.
I have never experienced both batteries dying at the same time and just wanted to make sure that there was no bug in the firmware labeling them (both) as failed.
Thanks again. Mauro |
|
Rob Leadbeater
|
|
Nov 4, 2009 19:49:57 GMT
5 pts
|
|
Hi,
I've not had the (mis)fortune to work on an EVA4400 yet, however multiple battery failures on older EVAs are quite common, especially if they're not shut down correctly - the exact scenario the batteries are supposed to protect against! Seemingly if you get more than a few deep discharge cycles, then it's game over.
Here's another call to get some of the old HSG functionality back into the HSV product line..
Cheers,
Rob |
|
|
Stu Gepp
|
|
Nov 13, 2009 18:25:09 GMT
N/A: Question Author
|
|
Hey folks,
I have been pursuing this with HP support as well and have finally got an answer.
The root cause is a firmware bug which causes the cache battery to be marked unusable at the end of a periodic test cycle even though the battery is not in fact bad. We were very unlucky that this happened to two batteries in the same enclosure within a short time. HP knows about the problem and is engineering a fix for the next firmware version which is tentatively due late December or early January.
I have had lengthy discussions about why the SAN goes offline when the cache batteries are both unavailable. The term cache battery is partly a misnomer because these batteries not only hold up the user data cache but also the metadata which records the configuration of the vdisks and how they are distributed across the physical disks. If a physical disk were to fail while the batteries were unavailable, the metadata could be corrupted, destroying access to the entire contents of the SAN. This is common across the whole EVA product line, although higher end models have two cache batteries per controller rather than one. |
|
|
Mauro Livi
|
|
Nov 17, 2009 19:47:34 GMT
6 pts
|
|
So let me get this straight...09522000 has a known bug whereby it will mark the cache battery as bad even though it is not.
You'd be ok if it does this to one of the batteries, but if you're one of the "UNLUCKY" few where it marks BOTH batteries as bad, then your SAN would be rendered inaccessible and you'd better get replacement batteries in there PRONTO.
Does that just about cover it??? So if I upgrade and this happens, I'll need to rely on my luck and hope that it happens to only one of the batteries??? I might as well play the lottery :)
Mauro |
|
|
Stu Gepp
|
|
Nov 17, 2009 21:36:48 GMT
N/A: Question Author
|
|
Mauro,
You have summed it up nicely.
It is for this reason I have obtained a complete set of batteries for all my EVA4400s and will shortly have them on site right next to each EVA. I wanted to take HP logistics out of the time line to get replacements fitted.
The nice thing is that the expiration date on the batteries (shelf life) says 2099-12-12. I hope to have retired by the time these batteries are no longer any good. |
|
Rob Leadbeater
|
|
Nov 17, 2009 22:47:04 GMT
Unassigned
|
|
> the expiration date on the batteries
> (shelf life) says 2099-12-12
I'd be a bit dubious about that... Sure looks like an auto generated date routine gone wrong to me.
Previous EVA and HSG batteries had a 3 year shelf life, if I recall correctly. I somehow doubt that the battery technology has changed sufficiently to increase that to 90 years !
Cheers,
Rob |
|