The server hosting this blog just suffered what can only be described as catastrophic hardware failure, courtesy of the crap that is the combination of Maxtor hard drives, Dell CERC 1.5 RAID controllers and the Linux driver for them (aacraid).
The machine was shut down for a moment on Thursday night, just long enough to move the rack around in the datacenter, and one of the disks did not come back in one piece.
The disk did come back online after a reboot, which prompted the controller to rebuild the array; that took 12.5 hours for 250 GB (RAID 1). The drive then failed again 10 minutes after the rebuild completed, confusing both the controller and the driver. At that point I could still log into the machine, but any command would trigger nice I/O errors all over the place.
Fortunately, XFS automatically shut the filesystem down when it hit the I/O errors, and I haven’t lost a single bit of data. The machine came back just fine with the faulty drive removed.
And you thought a RAID controller would help in this case. Obviously this one doesn’t. I should mention that we’ve observed this behaviour on another machine before this incident, and we have another machine suffering from a faulty drive in just the same way, so this isn’t an isolated case: it’s reproducible.