
Proliant DL360 w/RH 8.0 hangs


Gavin Brennan
Feb 5, 2003 00:27:42 GMT   

I'm trying to track down a problem with intermittent hanging on a Proliant DL360 G2 (BIOS P26) w/RH 8.0 (kernels 2.4.18 and 2.4.19). The system hangs at random times when there is no activity at all, and requires a hard reboot to bring it back. There are no suspicious messages in the log files. It looks like it just goes to sleep and doesn't know how to wake up, but I had removed the apmd package, so I don't understand how that's possible.
Jay Richardson
Feb 5, 2003 11:53:44 GMT    Unassigned

I had a similar problem with a DL380 G2. Does the mouse still move while the system otherwise gives no response at all? If so, and your drives are hot-swappable, try removing one and re-inserting it. This is not a solution, but if the system starts responding, it could be the controller. We had almost every component replaced before our problem cleared up.
Jason Pepin
Feb 7, 2003 19:30:15 GMT    Unassigned

I have the same issue. Two brand-new DL360s loaded with Red Hat 8.0 (kernel update 2.4.18-24.8.0smp) just lock up at random. Jay, what was the last component you replaced?
Gavin Brennan
Feb 7, 2003 21:22:00 GMT    N/A: Question Author

Jay, I had tried pulling one disk out, waiting a bit, and pushing it back in again. That wasn't successful, but maybe I was too quick. I've now pulled one of the drives into a separate system so I don't have the option right now.

The screen has been blank each time the system has hung. From the disk indicators I believe that it still had power, but I had no response on console or keyboard to anything I did (except power-cycling).

I've now tried three kernels (2.4.18, 2.4.19, and 2.4.20) and a Storage Array Controller patch. Like Jason, I'm wondering: what was the last part you replaced?
Jason Pepin
Feb 12, 2003 17:30:52 GMT  7 pts

Gavin, I was able to keep one of my RH 8.0 DL360s running for the past two days by removing one of the two procs. When I only tried booting in single-proc mode, the box would still hang. Once I physically removed the second proc, the box has worked without locking up. Have you had any other luck?
Gavin Brennan
Feb 14, 2003 22:18:45 GMT    N/A: Question Author

I was able to keep one server up for almost three days 8^)> That's been the high-water mark. But it's been very erratic. It can take anywhere from 1 to 71 hours for a server to hang, or to reboot itself if I have the Compaq health utilities and ASR enabled.

I cannot open up the cases myself; that's outside my realm. But since I have to put in a service call, I need to make sure this isn't a software or firmware issue.

We have now replicated this issue on two DL360 G2s (each now with a single HD). One is running RH 7.3 (kernel 2.4.18-24.7), the other RH 8.0 (kernel 2.4.18-19.8.0smp). The fact that it happens on two boxes makes me think it's not hardware. The fact that it happens with several versions of Red Hat makes me think it's not the OS. We've run the Compaq/HP Diagnostics package and the Linux Test Project against the systems, and not found a problem. This has happened both in our main network and on an isolated test network.
Wes Strange
Feb 21, 2003 01:13:14 GMT  7 pts

I too am having a similar issue with a DL380 G3 with dual processors running RH 8.0 (2.4.18). It would lock up totally and require a hard reboot. After removing the secondary processor it has been stable for the past 10 hours (before, it took anywhere from 30 minutes to 2 hours to lock up). I've run diags on the system and no problems were found. I don't know whether it's hardware or software.
Gavin Brennan
Feb 21, 2003 14:27:49 GMT    N/A: Question Author

I think I have a stable workaround. I booted the two DL360s with "noapic" and they have stayed up for more than six days under pretty constant load (running the Linux Test Project in a loop). This option tells the kernel not to use the I/O APIC for interrupt routing, so interrupts are no longer distributed across the processors. I think that there is a driver that is not handling this feature properly, and I have no time to track it down. The slight loss in performance from running with noapic will not be a problem on this server.
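For anyone who wants to confirm that the "noapic" option really took effect after a reboot, here is a minimal Python sketch. It assumes the usual Linux layout of /proc/cmdline and /proc/interrupts, where each numbered IRQ line names its controller (IO-APIC-edge or IO-APIC-level normally, XT-PIC once the I/O APIC is not in use). The function names are made up for illustration, and the exact labels may differ between kernel versions.

```python
def booted_with_noapic(cmdline_text):
    """True if 'noapic' appears as a separate word on the kernel command line."""
    return "noapic" in cmdline_text.split()


def interrupt_controllers(interrupts_text):
    """Return the controller families named on the numbered IRQ lines."""
    found = set()
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Skip the CPU header and summary lines such as NMI:, LOC:, ERR:
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        for field in fields[1:]:
            for name in ("IO-APIC", "XT-PIC"):
                if field.startswith(name):
                    found.add(name)
    return found
```

On the server itself you would pass in the contents of /proc/cmdline and /proc/interrupts. If the option took effect, only XT-PIC should be reported.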
Wels
Feb 24, 2003 12:21:23 GMT    Unassigned

I have the same problems as you, but with a DL360 G2 and Red Hat AS 2.1. The solution I've found is to disable all the Compaq Insight agents. After this, no more hangs.

Regards.
Charles Berg
Feb 26, 2003 21:58:08 GMT    Unassigned

I have the same problem. This is with rh8.0, kernel 2.4.18-24.8.0smp, on a DL380 G3.

At first I had the 1.15 iLO firmware and the 20020910 ROM. I've since upgraded to 1.20 and 20030108. (No crashes since, but I only upgraded yesterday.)

Mine crashes pretty infrequently - I ran it for about a week straight with every test I could think of (LTP repeatedly, hundreds of kernel compiles, memtester, cpuburn, and a variety of disk benchmarks). It then crashed while (mostly) idle, and then again 3 days later.

Has setting noapic proven itself a reliable solution? Has anyone found it to not work? My machine crashes too infrequently for me to say that any change has fixed it until it runs for a couple of weeks at least.
Charles Berg
Feb 28, 2003 15:45:08 GMT    Unassigned

I found out that my OS Type in the BIOS setup was set wrong (to Windows 2000). Under Advanced in the BIOS setup there is an "MPS Table Mode" option, which is set to "Auto Set Table" by default. Apparently, with this setting, the OS Type determines which APIC setting is used (though I don't know which settings each OS Type selects). This seems consistent with the noapic setting fixing the problem. What is everyone's OS Type set to?
Gavin Brennan
Feb 28, 2003 16:15:58 GMT    N/A: Question Author

Responding to a couple of messages here:
I found that the OS setting had no effect on the hangs. Likewise the Compaq Insight package (though with the Compaq packages installed and running, ASR would reboot the box when it hung).
The system has stayed up for two periods of a week with the noapic option set; I had to move it in between. This beats my previous record of three days. Given the intermittent nature of this problem, I won't really relax until it stays up for a month.
RemigijusV
Mar 4, 2003 21:20:25 GMT    Unassigned    Attachment: 3738.pdf

Maybe you missed this: Installing Red Hat Linux on ProLiant Servers HOWTO, Version 8.0. I have attached the most important part of it.
Charles Berg
Mar 6, 2003 19:31:24 GMT    Unassigned

Is everyone with this problem using the tg3 driver? It turns out that it had a bug that crashes the system, fixed in a kernel that Red Hat released yesterday.

https://rhn.redhat.com/errata/RHBA-2003-069.html
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=69920
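To answer the "is everyone using tg3?" question on a given box, a rough Python sketch like the one below can help. It assumes the standard /proc/modules format, in which the first field of each line is the module name; the helper names are hypothetical. Note that a driver compiled directly into the kernel would not show up there.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first column is the module name
    return names


def uses_module(proc_modules_text, name="tg3"):
    """True if the named module (tg3 by default) is currently loaded."""
    return name in loaded_modules(proc_modules_text)
```

On a live system, pass in the contents of /proc/modules, for example uses_module(open("/proc/modules").read()).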
Gavin Brennan
Mar 6, 2003 19:42:43 GMT    N/A: Question Author

We are, and that looks like the real answer (using noapic being just a workaround). When we finish patching sendmail everywhere we'll apply the tg3 patch to a test box and see how we do.
Tom Cramer (Expert in this area)
Apr 14, 2003 21:59:00 GMT    Unassigned

The hang is related to the tg3 module that Red Hat uses by default. It only occurs on an SMP kernel (either a multiple-CPU machine, or a machine that supports Hyperthreading). The HP-recommended fix is to switch to the supported bcm5700 driver. The driver may be found in the driver download section for your machine. An alternative fix is to apply the errata fix documented in the Red Hat document RHBA-2003:069-12.

Tom Cramer
ISSG Linux Support
HP
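As a rough illustration of the driver switch, the Python sketch below rewrites the network alias lines of a modules.conf file. It rests on assumptions not stated in this thread: that the NIC driver is selected by lines of the form "alias eth0 tg3" in /etc/modules.conf (the usual arrangement on 2.4-era Red Hat systems), and that the bcm5700 module from HP's driver package is already installed. The function name is hypothetical.

```python
def switch_nic_driver(modules_conf_text, old="tg3", new="bcm5700"):
    """Return modules.conf text with 'alias ethN <old>' changed to <new>."""
    out = []
    for line in modules_conf_text.splitlines():
        fields = line.split()
        # Only touch lines like "alias eth0 tg3"; comments and all
        # other aliases are passed through unchanged.
        if (len(fields) == 3 and fields[0] == "alias"
                and fields[1].startswith("eth") and fields[2] == old):
            line = "alias %s %s" % (fields[1], new)
        out.append(line)
    trailing = "\n" if modules_conf_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

The function only transforms text, so keep a backup of the original file and review the result before writing it back.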
Jeff Manning
May 8, 2003 00:41:21 GMT    Unassigned

Tom, thanks for your fix. I'm stuck, though, because I can't boot my Proliant 1600 running RH 8.0: it hangs at 'Checking for new hardware'. I'm assuming that it's the tg3 issue.

I'm posting with hopes that you or someone on the list can help me to change the tg3 module to the bcm5700 driver without booting the server, or, how to boot RH w/out searching for new hardware.

Thanks in advance!

Jeff Manning
Mario Obejas
May 22, 2003 23:05:14 GMT    Unassigned

FWIW, we have been running the updated kernel (2.4.18-27) that supposedly fixed the tg3 issues and still experienced the hangs.
With a uniprocessor kernel, we experienced no hangs.

Based on the earlier portions of this thread, we tried setting the APIC to "disabled". That got rid of the freezes, so we pronounced the box production-ready and started to throw stuff on it. At a certain point, with the much-increased NFS traffic load, we started getting the freezes again.

We updated to the 2.4.20-13 kernel and also changed the APIC mode to Fully Mapped, based on this post (see 4.1.4):
http://ouray.cudenver.edu/~etumenba/smp-howto/SMP-HOWTO-4.html

Maybe it's a bit early, but several days have passed and there has been no freeze yet. I'm really hoping this is it. With the current load, we would have felt something by now. We'll post if it freezes again.

FWIW, we use the tg3 driver based on performance advantages we measured versus the "supported" bcm5700 driver.

BTW, we also have hyperthreading disabled. We may get adventurous in the future and turn it back on, but we want to change one variable at a time, if possible.
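Since the hang is reported only on SMP kernels, it can be useful to check what the running kernel actually sees before and after toggling hyperthreading or pulling a processor. The Python sketch below is one way to do that. It assumes the x86 /proc/cpuinfo layout, with one "processor : N" line per logical CPU, and relies on the Red Hat convention seen earlier in this thread of ending SMP kernel release names in "smp" (as in 2.4.18-24.8.0smp). Both helper names are made up for illustration.

```python
def logical_cpu_count(cpuinfo_text):
    """Count 'processor : N' entries in /proc/cpuinfo text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key == "processor":
            count += 1
    return count


def is_smp_release(kernel_release):
    """Guess from the release string (uname -r) whether this is an SMP build."""
    return kernel_release.strip().endswith("smp")
```

On a live system, feed these the contents of /proc/cpuinfo and the output of uname -r.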
 
© 2009 Hewlett-Packard Development Company, L.P.