[WBEL-users] RE: Server hard disk failure (Ed Lauzier)

Edward Lauzier elauzier at platform.com
Thu Mar 31 08:40:53 CST 2005


Well, I thought I'd let a few in on a horror story I had a while ago
with WBEL Release 3 Respin 1.  I was getting ext3 filesystem corruption
on a daily basis and could not pin down what was causing it.  It got
so bad that once I had to pull an entire directory structure out of
lost+found one file at a time.  At least I had the data and there was no
"real" data loss.

I then started to look into why this could be happening.  I was getting
kernel panics on one box and file system corruption on another.  A user
pointed out to me that it may be a Red Hat kernel bug that may or may not
have been fixed.  I was testing some of our software and took advantage
of the servers going down to test our failover scenarios.  After I was
finished testing, I turned off our software.

When I turned off our software, the problems stopped.  This told me that
the crux of the problem could be with the kernel drivers for NFS and ext3
filesystems and how they interact.
The machine with the corrupted filesystem(s) was an NFS server.
The box with the kernel panics was an NFS client running WBEL 3R1.
Strange, because our software runs fine on all other platforms: it does not
cause these problems when the shared area is on a NetApp, EMC box, or
Solaris box.

Conclusions for WBEL 3R1:
NFS and ext3 kernel drivers may cause some problems on some hardware types.
On the problem platforms, I'm using the ASUS A7V motherboard.  I also have
IBM BladeCenter servers running in a similar configuration and have not
had problems (yet).  No problems either with a Sun box running Solaris 8
and sharing out an area for NFS.  The user who informed me that there was
a possible kernel bug causing the problems also suggested using an ext2
filesystem, which I have not switched to.  (I forget the thread...)
I'd rather see the problem identified and fixed, and move forward...

Hope this helps...

Ed


-----Original Message-----
From:	whitebox-users-bounces at beau.org on behalf of whitebox-users-request at beau.org
Sent:	Thu 3/31/2005 12:04 AM
To:	whitebox-users at beau.org
Cc:	
Subject:	Whitebox-users Digest, Vol 3, Issue 43
>Message: 1
>Date: Wed, 30 Mar 2005 22:40:55 +0100
>From: Francies Moore <liz at indract.freeserve.co.uk>
>Subject: [WBEL-users] Server hard disk failure
>To: whitebox-users at beau.org
>Message-ID: <424B1CE7.4020209 at indract.freeserve.co.uk>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed

>Hi everyone

>One of my WBEL servers has crashed due to the failure of one of its hard 
>disks.  My hardware support technician (a recent Linux convert) says it 
>could be a filesystem failure.  Is such a thing possible on a machine 
>which was doing nothing over Easter?  I thought Linux was more stable 
>than "that other system" in this regard.

>Whatever happened to it, it cannot reboot as the ext3 journal cannot get 
>its head around the situation.

>How do I recover what is left on the surviving hard disk (which contains 
>the operating system and some user files)?  Do I revert to ext2 by 
>deleting the journal and changing the fstab?  If so, where do I find the 
>journal file, and what is it called?

>Can I go back to ext3 when a new HD is fitted?

>Thanks.

>Francies





