Not in the Mood
May. 17th, 2009 07:34 am
My mail and webserver just crashed. At first it looked like the usual problem: one of the 3 drives in the main RAID10 array had gotten 'stuck', which, so far as I can tell, is caused by a bug in the RAID software itself.
A quick reboot and the system marked the 'stuck' drive as dirty and started a rebuild. So far, so good.
3 hours later I noticed the system was unresponsive and, upon closer inspection, was spitting out warning messages to the console so fast that I couldn't read anything but the word 'rebuilding'.
After trying everything I could to make the system calm down, I had to power it off. I restarted it again (in single user mode this time) and it seemed fine, and I told it to go back to rebuilding the array, which it did.
3 hours later it announced that a completely different drive had failed and, as it now had 2 dirty drives in one 3-drive array, it was stopping right there. And that's the current state. I have NO IDEA how to tell it to assume that a 'spare' drive is actually not a spare. I also don't know whether its complaint of a failing drive is correct. I can't even check the logs, since the logs are stored on the array that was rebuilding...
So, right now I don't have my main mail server, and I have no real idea how to fix the problem. For the moment I'm going to stop fiddling with it and see if some extended thought might solve things.
no subject
Date: 2009-05-17 01:47 pm (UTC)
I use the variety of RAID that just mirrors disks -- two drives the same. I have no troubles with it. I'm afraid to use anything more elaborate.
Do you have a backup other than the RAID?
Try a nondestructive read check on the drives?
-=- hendrik
no subject
Date: 2009-05-17 02:45 pm (UTC)
no subject
Date: 2009-05-17 02:49 pm (UTC)
no subject
Date: 2009-05-17 11:46 pm (UTC)
no subject
Date: 2009-05-18 12:18 pm (UTC)
However, it has occurred to me that I could probably boot the system up with a live CD long enough to repair the problem. My only worry is that I don't know of any Linux programs for doing a low-level hard drive check on a drive of unknown contents. Since the drives are members of a RAID10 array, I need something lower-level than a filesystem check.
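For what it's worth, a read-only check below the filesystem level can be sketched in a few lines of Python. This is only a rough illustration of the idea, not a replacement for a real tool: the function name `scan_readable` and the 4096-byte block size are made up for the example, the device path in the usage note is a placeholder, and the kernel's page cache may hide errors on blocks that were read recently. It opens the device read-only, so it never writes to the drive.

```python
import os

def scan_readable(path, block_size=4096):
    """Read `path` block by block without interpreting its contents.

    Returns (blocks_checked, bad_offsets), where bad_offsets lists the byte
    offsets of blocks whose read raised an I/O error. `path` can be a raw
    block device or an ordinary file; it is opened read-only.
    """
    bad_offsets = []
    blocks_checked = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for files and Linux block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            length = min(block_size, size - offset)
            try:
                os.pread(fd, length, offset)  # positional read, no writes
            except OSError:
                bad_offsets.append(offset)    # note the block and keep going
            blocks_checked += 1
            offset += block_size
    finally:
        os.close(fd)
    return blocks_checked, bad_offsets
```

From a live CD it would be run as root against each member drive in turn, for example `scan_readable("/dev/sdX")` with the real device name substituted. An empty list of bad offsets only means every block could be read once; it says nothing about whether the RAID metadata on the drive is consistent.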
no subject
Date: 2009-05-18 09:00 pm (UTC)
But check with man badblocks first in case I got it wrong.
-- hendrik
no subject
Date: 2009-05-17 01:51 pm (UTC)
no subject
Date: 2009-05-17 02:49 pm (UTC)