EVA4400 cache battery failure forces SAN offline

Subject: EVA4400 cache battery failure forces SAN offline
Stu Gepp
Nov 3, 2009 19:33:54 GMT   

We have deployed 6 separate EVA4400s with a single disk shelf each in our production environment.

On Wednesday last week one of these suffered a double cache battery failure - the second battery died 65 minutes after the first.

This caused the whole SAN to go offline and we lost access to all LUNs.

HP have replaced the batteries and all is working again now, but they are saying that the offline behaviour is by design. I find this incredibly bizarre behaviour, as I wouldn't expect the cache batteries to be in the critical path - they are there just in case external power fails.

Is this behaviour common across all HP SANs?
kunalsahoo Expert in this area
Nov 3, 2009 19:46:23 GMT  0 pts  Attachment: 344042.txt

Hi,

This is by design across all EVAs, and it is there for data safety.

When both batteries go down, there is always a possibility that unflushed data remains in the cache. The controllers therefore lock host access in this scenario so that the data does not get corrupted or lost. Once the batteries are back, the unflushed data is first written to the disks, and then the controllers automatically release the lock.

Attached is the table that describes the best Vdisk access in the different battery states.
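
The attached table is not reproduced in this thread, so the following is only a minimal sketch of the policy as it is described here, not the actual XCS logic. The state names (`GOOD`, `LOW`, `FAILED`), the function `cache_policy`, and the mode strings are hypothetical; the write-through case for a 'low' battery is taken from the follow-up question further down the thread.

```python
from enum import Enum


class Battery(Enum):
    """Hypothetical battery states, ordered from worst to best."""
    FAILED = 0
    LOW = 1
    GOOD = 2


def cache_policy(batteries):
    """Return the host-access mode implied by the best available battery.

    Assumption: the array decides on the single best battery it can see.
    """
    best = max(batteries, key=lambda b: b.value, default=Battery.FAILED)
    if best is Battery.GOOD:
        return "write-back"       # cache contents are protected
    if best is Battery.LOW:
        return "write-through"    # writes go to disk before being acknowledged
    return "no host access"       # both batteries gone: Vdisks are locked


print(cache_policy([Battery.GOOD, Battery.FAILED]))
print(cache_policy([Battery.LOW, Battery.FAILED]))
print(cache_policy([Battery.FAILED, Battery.FAILED]))
```

With one good battery the sketch stays in write-back mode, which matches the observation that a single battery failure does not take the array offline.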
IBaltay Expert in this area This member has accumulated 2500 or more points
Nov 3, 2009 19:50:04 GMT  0 pts

Hi,
surely it is mistaken. Could you provide your current controller firmware?
IBaltay Expert in this area This member has accumulated 2500 or more points
Nov 3, 2009 19:52:55 GMT  0 pts

Could you provide the termination code for the issue?
Uwe Zessin This member has accumulated 20000 or more points
Nov 3, 2009 20:03:26 GMT  10 pts

Sure, it's for data safety, but the problem is that this behaviour cannot be turned off to at least run minimal services in write-through mode! Instead you are left "dead in the water"!

We had a problem with 2 EVAs some years ago: 3 dead cache batteries and no replacements available! I've asked around, but it seems that there isn't even a 'secret bypass' to enable access to the data.

I would not have a problem if the customer had to sign a paper that he accepts the risk of a cache loss and then somebody comes and flips a bit to allow access again. Most systems are connected through an external UPS anyway.

In what can be thought of as the previous generation of storage array, based on the HSG controller, it was possible to tell the module to ignore the state of the cache battery, and this was documented in the manual.
IBaltay Expert in this area This member has accumulated 2500 or more points
Nov 3, 2009 20:35:54 GMT  6 pts

OK, I see. I overlooked that the question was about the cache policy that prevents access to the whole disk group (DG). It starts when the battery system on one controller is detected as no longer good and the disk drives are moved to the other controller. If the battery system then fails on the other controller too, there is no disk presentation...
My questions were rather meant to find out whether the double battery failure was related to a controller firmware bug, so that a recurrence of such a failure could be prevented...
kunalsahoo Expert in this area
Nov 3, 2009 20:47:31 GMT  5 pts

@IBaltay
Is there any known bug in the EVA4400 firmware which marks both batteries as failed at the same time? Please share the firmware version and, if possible, the release notes of the version in which it is fixed... Does the new one have the fix, 21000 to be specific?
IBaltay Expert in this area This member has accumulated 2500 or more points
Nov 3, 2009 21:08:28 GMT  7 pts

It was meant historically, as can be seen from the ADVISORY: (Revised) HP StorageWorks 4400/6400/8400 Enterprise Virtual Array XCS 09522000 released:
http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c01850026&jumpid=reg_R1002_USEN

from which it can be seen that a lot of versions have been inactivated...

For example, the HP StorageWorks 4400 Enterprise Virtual Array release notes (XCS 09006000), p. 4, mention a fix for a firmware issue which could result in false reporting of controller component and battery failures.

http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01658405/c01658405.pdf?jumpid=reg_R1002_USEN
Stu Gepp
Nov 3, 2009 21:13:41 GMT    N/A: Question Author

The controller firmware is 09522000. I am aware of the previous version(s) falsely reporting battery failure but in my experience the failure cleared in a matter of milliseconds.

These batteries definitely died and rendered the SAN unusable.

I find it bizarre that the policy is that if both the batteries die I am denied access to my data. I don't even have the option to move it elsewhere where I can do writes safely - regardless of the fact that the EVA is in a datacentre with UPS and generator backup.
kunalsahoo Expert in this area
Nov 3, 2009 21:36:50 GMT  6 pts

Unfortunately, that's how it works... However, I am a bit surprised to see both batteries fail at the same time, which is quite rare to experience.
Uwe Zessin This member has accumulated 20000 or more points
Nov 4, 2009 09:29:20 GMT  8 pts

Over the years, cache batteries have never worked 100% reliably - that is not limited to the EVA. I fully agree with Stu that it is not acceptable for an 'enterprise' product to prevent all access to the data - data integrity in mind or not. A system designer/programmer far away in a cubicle just cannot make that decision!
Stu Gepp
Nov 4, 2009 15:08:59 GMT    N/A: Question Author

kunalsahoo,

I am trying to understand the logic behind the table you posted and I can't figure it out.

It is evident that the controller is able to disable write-back and use write-thru if the best battery available is 'low'.

Why does it not do the same when there are no batteries available?
Uwe Zessin This member has accumulated 20000 or more points
Nov 4, 2009 16:39:09 GMT  9 pts

Because write-through is only part of the solution. It does not fix the "write hole" problem:

A short write of VRAID-1 data needs to update two disk drives, same for VRAID-5 (data + parity). During a power failure it is possible that the write hits one disk only. This results in data inconsistency.

Of course it is possible to 'fix' it after power comes back: declare one mirror copy as a master and update the other, declare all VRAID-5 data as the master and recalculate the parity.

The OpenVMS operating system has implemented the feature for host-based RAID-1 (known as volume shadowing) and something similar is available for the add-on RAID-5 software.

The EVA uses a lot of proprietary hardware and makes many assumptions about data layouts. This gives speed and allows a simple PPC to control the whole system, but don't expect such advanced features.

Still, there should be a way to get at the data even with the write-hole risk.
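
The write hole and the 'fix after power comes back' described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: disks are modelled as plain Python lists, parity is a simple XOR over integer blocks, and all names (`mirror_write`, `resync_mirror`, `resync_parity`) are hypothetical. It says nothing about how the EVA lays out VRAID data internally.

```python
from functools import reduce
from operator import xor


def mirror_write(disks, value, power_fails_after=None):
    """Write value to every mirror copy; stop early to simulate a power loss."""
    for i in range(len(disks)):
        if power_fails_after is not None and i >= power_fails_after:
            return  # power lost: the remaining copies keep the old data
        disks[i] = value


def resync_mirror(disks, master=0):
    """Declare one copy the master and overwrite the other copies with it."""
    for i in range(len(disks)):
        disks[i] = disks[master]


def parity(data_blocks):
    """XOR parity over the data blocks of one stripe."""
    return reduce(xor, data_blocks, 0)


def resync_parity(stripe):
    """Treat the data blocks as the master and recalculate the parity."""
    stripe["parity"] = parity(stripe["data"])


# RAID-1 style: the write reaches the first copy only.
mirror = [0xAA, 0xAA]
mirror_write(mirror, 0xBB, power_fails_after=1)
mirror_inconsistent = mirror[0] != mirror[1]
resync_mirror(mirror, master=0)

# RAID-5 style: the data block is updated, the parity update is lost.
stripe = {"data": [1, 2, 3], "parity": parity([1, 2, 3])}
stripe["data"][0] = 7
stripe_inconsistent = parity(stripe["data"]) != stripe["parity"]
resync_parity(stripe)

print(mirror_inconsistent, mirror)
print(stripe_inconsistent, stripe)
```

Note that the resync only restores consistency. It cannot tell whether the copy chosen as master holds the new or the old data, which is why write-through alone does not close the hole.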
Stu Gepp
Nov 4, 2009 16:48:57 GMT    N/A: Question Author

Thank you Uwe.

That makes it much clearer.
Mauro Livi
Nov 4, 2009 17:38:19 GMT  3 pts

Hi all,
Just saw this and became a bit concerned as I will be upgrading to XCS 09522000 in the near future. Is Stu's problem of BOTH battery failures something of an extremely rare anomaly or is this perhaps the latest problem in a long line of EVA firmware bugs? Up till now I've been under the impression that XCS 09522000 is the best and most robust firmware version to date. I would hate to think that there is a bug in there which marks BOTH batteries as failures. Obviously I can ill afford something like that.

Your feedback is appreciated as always.
Mauro
Uwe Zessin This member has accumulated 20000 or more points
Nov 4, 2009 17:55:00 GMT  8 pts

The unpresentation of vdisks when both batteries are dead was a design decision and is present in previous EVA generations as well.

It is my experience (since EVA V1.0) that the failure of both batteries is rare and has nothing to do with XCS firmware bugs. The failure of 3 out of 4 batteries I mentioned yesterday was a production quality problem. Fortunately, for the customer, it happened before the systems were put into production.
Mauro Livi
Nov 4, 2009 18:36:33 GMT  5 pts

Thanks Uwe.
Yeah I understand that the unpresentation of vdisks when both batteries are dead was a design decision...I think it was a strange decision, but it is what it is.

I have never experienced both batteries dying at the same time and just wanted to make sure that there was no bug in the firmware labeling them (both) as failed.

Thanks again.
Mauro
Rob Leadbeater Expert in this area This member has accumulated 7500 or more points
Nov 4, 2009 19:49:57 GMT  5 pts

Hi,

I've not had the (mis)fortune to work on an EVA4400 yet; however, multiple battery failures on older EVAs are quite common, especially if they're not shut down correctly - the exact scenario the batteries are supposed to protect against! Seemingly, if you get more than a few deep discharge cycles, then it's game over.

Here's another call to get some of the old HSG functionality back into the HSV product line..

Cheers,

Rob
Stu Gepp
Nov 13, 2009 18:25:09 GMT    N/A: Question Author

Hey folks,

I have been pursuing this with HP support as well and have finally got an answer.

The root cause is a firmware bug which causes the cache battery to be marked unusable at the end of a periodic test cycle even though the battery is not in fact bad. We were very unlucky that this happened to two batteries in the same enclosure within a short time. HP knows about the problem and is engineering a fix for the next firmware version which is tentatively due late December or early January.



I have had lengthy discussions about why the SAN goes offline when the cache batteries are both unavailable. The term cache battery is partly a misnomer, because these batteries hold up not only the user data cache but also the metadata which records the configuration of the vdisks and how they are distributed across the physical disks. If a physical disk were to fail while the batteries were unavailable, the metadata could be corrupted, destroying access to the entire contents of the SAN. This is common across the whole EVA product line, although higher-end models have two cache batteries per controller rather than one.
Mauro Livi
Nov 17, 2009 19:47:34 GMT  6 pts

So let me get this straight...09522000 has a known bug whereby it will mark the cache battery as bad even though it is not.

You'd be ok if it does this to one of the batteries, but if you're one of the "UNLUCKY" few where it marks BOTH batteries as bad, then your SAN would be rendered inaccessible and you'd better get replacement batteries in there PRONTO.

Does that just about cover it??? So if I upgrade and this happens, I'll need to rely on my luck and hope that it happens to only one of the batteries??? I might as well play the lottery :)

Mauro
Stu Gepp
Nov 17, 2009 21:36:48 GMT    N/A: Question Author

Mauro,

You have summed it up nicely.

It is for this reason I have obtained a complete set of batteries for all my EVA4400s and will shortly have them on site right next to each EVA. I wanted to take HP logistics out of the time line to get replacements fitted.

The nice thing is that the expiration date on the batteries (shelf life) says 2099-12-12. I hope to have retired by the time these batteries are no longer any good.
Rob Leadbeater Expert in this area This member has accumulated 7500 or more points
Nov 17, 2009 22:47:04 GMT    Unassigned

> the expiration date on the batteries
> (shelf life) says 2099-12-12

I'd be a bit dubious about that... Sure looks like an auto generated date routine gone wrong to me.

Previous EVA and HSG batteries had a 3 year shelf life, if I recall correctly. I somehow doubt that the battery technology has changed sufficiently to increase that to 90 years !

Cheers,

Rob
 