"Success?" or, "SMART is dumb!"
May. 17th, 2007 05:36 pm
So, starting from late yesterday afternoon, I ran HDD Recoverer on the broken drive. (It turns out that choosing the 'second' drive caused it to test all SATA drives. WTF? Anyway, it did the tests.) By around 3:00 pm it was obviously not going to work. It had found some bad sectors, but couldn't recover them. (When HDD Recoverer works, it recovers the sectors AND all the data.)
So, I looked up the drive characteristics in my little log book, saw that they were Maxtors, and went and downloaded the Maxtor (now Seagate) drive diagnostic and repair utility for my drives. I booted it up and had it tell me what it knew about the drives.
First thing it said was that both 250GB drives had SMART capability and were running with SMART enabled, and SMART had not reported any problems on either drive. So, I looked at my diagram, figured out it was the second drive with the errors and asked the drive for an internal diagnostic. The dialogue went something like this:
Me> Run Drive-Self-Test on drive #1
Util> Working... WHOA! This drive is seriously borked!
Me> What's wrong?
Util> No idea but the internal log shows mucho errors.
Me> So what do they say is wrong?
Util> No idea, it's just numbers, you know? Run the Big-Diagnostic to find out, and come back in three hours.
Me> So do that.
Util> Righto.
... So, I go away and make pasta sauce for a few hours, and when I come back I see:
Util> You have 5 hard sector errors on the drive. I think I can repair them, but you'll lose the data in them.
Me> Go for it.
Util> Fixed. No problems.
Me> I LOVE YOU!!!
Util> Chill out dude.
Me> Run Drive-Self-Test on drive #1
Util> No new errors reported since ten seconds ago, duh.
Me> Okay, run Big-Diagnostic on drive #1.
Util> Fine, come back in three hours.
So, I'm going to head off to a networking cocktail thingy, while the utility verifies that things are looking good, but I'm quietly confident that I'll manage to recover most of the data of the 300GB raid. Barring unforeseen problems, I may even have the web/mail server back up by tomorrow.
EDIT: Big diagnostic came up clean. I'm too tired to do any more tonight, so I'm just going to set things up to test the other drives overnight (just in case) and I'll continue with the reconstruction in the morning.
no subject
Date: 2007-05-17 11:23 pm (UTC)
no subject
Date: 2007-05-18 02:21 am (UTC)
no subject
Date: 2007-05-18 12:03 pm (UTC)
no subject
Date: 2007-05-18 03:28 am (UTC)
One thing that annoys me is that I'm abruptly not in a position to help see to it that this sort of thing doesn't happen in the future!
no subject
Date: 2007-05-18 11:59 am (UTC)
My guess is that some read error caused it to be offlined so that it was out of synch when I tried to rebuild the array. There are raid-status reporting facilities that I'm going to try using so that we'll get proper notification of this sort of thing in the future.
I think we may need to replace that third drive, but I'm not sure. We ended up replacing its twin during the last upgrade because of sudden transient read errors. This may be related.
no subject
Date: 2007-05-23 12:05 am (UTC)
So, we are currently running in full raid mode and I've enabled diagnostic reporting so that if any of the arrays degrade I'll be alerted by email.
It looks safe to run like this until you get back and we can talk about this further. Right now though, it looks like a better backup system would be a higher priority than a new drive.
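For the curious, here is roughly what a "has any array degraded?" check looks like. This is a minimal sketch only, and it assumes the arrays are Linux software RAID (md) reporting through /proc/mdstat, which nothing in this thread actually confirms. The function name `degraded_arrays` and the `SAMPLE` text are made up for illustration; the sample is not output from the real server.

```python
import re

# Status field at the end of an mdstat detail line, e.g. "[3/2] [U_U]":
# first number = devices the array should have, second = devices in use.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]".
ARRAY_RE = re.compile(r"^(md\d+)\s*:")


def degraded_arrays(mdstat_text):
    """Return the names of arrays whose status shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active, member_map = status.groups()
            if int(active) < int(expected) or "_" in member_map:
                degraded.append(current)
            current = None
    return degraded


# Hypothetical mdstat contents: md0 is healthy, md1 has lost one member.
SAMPLE = """\
Personalities : [raid1] [raid5]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid5 sdc2[2] sda2[0]
      488279296 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a real md system the same check would read /proc/mdstat instead of `SAMPLE`, and mdadm's own monitor mode can do the emailing; the sketch only shows the kind of test that sits behind such an alert.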
no subject
Date: 2007-05-18 10:57 am (UTC)
And wise enough to know it! Most people aren't that wise.