[WBEL-users] RE: Server hard disk failure (Ed Lauzier)
Edward Lauzier
elauzier at platform.com
Thu Mar 31 08:40:53 CST 2005
Well, I thought I'd let a few of you in on a horror story I had a while ago
with WBEL Release 3 Respin 1. I was getting ext3 file system corruption
on a daily basis and could not pin down what was causing it. It got
so bad that I once had to pull an entire directory structure out of
lost+found, one entry at a time. At least I had the data and there was no
"real" data loss.
I then started to look into why this could be happening. I was getting
kernel panics on one box and file system corruption on another. A user
pointed out to me that it may be a Red Hat kernel bug that may or may not
have been fixed. I was testing some of our software and took advantage
of the servers going down to test our failover scenarios. After I was
finished testing, I turned off our software.
When I turned off our software, the problems stopped. This told me that
the crux of the problem could be with the kernel drivers for NFS and ext3
filesystems and how they interact.
The machine with the corrupted filesystem(s) was an NFS server.
The box with the kernel panics was an NFS client running WBEL 3R1.
Strangely, our software runs fine on all other platforms: it does not
cause these problems when the shared area is on a NetApp, EMC box, or
Solaris box.
Conclusions for WBEL 3R1:
NFS and ext3 kernel drivers may cause some problems on some hardware types.
On the problem platforms, I'm using the ASUS A7V motherboard. I also have
IBM BladeCenter servers running in a similar configuration and have not
had problems (yet). No problems either with a Sun box running Solaris 8
and sharing out an area over NFS. The user who informed me that there was
a possible kernel bug causing the problems also suggested using an ext2
filesystem, which I have not switched to. (I forget the thread...)
I'd rather see the problem identified and fixed, and move forward...
Hope this helps...
Ed
-----Original Message-----
From: whitebox-users-bounces at beau.org on behalf of whitebox-users-request at beau.org
Sent: Thu 3/31/2005 12:04 AM
To: whitebox-users at beau.org
Cc:
Subject: Whitebox-users Digest, Vol 3, Issue 43
>Message: 1
>Date: Wed, 30 Mar 2005 22:40:55 +0100
>From: Francies Moore <liz at indract.freeserve.co.uk>
>Subject: [WBEL-users] Server hard disk failure
>To: whitebox-users at beau.org
>Message-ID: <424B1CE7.4020209 at indract.freeserve.co.uk>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>Hi everyone
>One of my WBEL servers has crashed due to the failure of one of its hard
>disks. My hardware support technician (a recent Linux convert) says it
>could be a filesystem failure. Is such a thing possible on a machine
>which was doing nothing over Easter? I thought Linux was more stable
>than "that other system" in this regard.
>Whatever happened to it, it cannot reboot as the ext3 journal cannot get
>its head around the situation.
>How do I recover what is left on the surviving hard disk (which contains
>the operating system and some user files)? Do I revert to ext2 by
>deleting the journal and changing the fstab? If so, where do I find the
>journal file, and what is it called?
>Can I go back to ext3 when a new HD is fitted?
>Thanks.
>Francies