Hard Drive Errata

Posted by Christopher Smith Sat, 08 Dec 2007 21:47:00 GMT

So, I had my first experience with hard drive failure in my server/MythTV backend. One of my drives failed in an odd fashion. It appeared to be continuously trying to do something over the SATA bus (my hard drive light was permanently on). Linux software RAID did the right thing and dropped the drive from the arrays. Rebooting didn’t help, but powering off did. So, that makes one think the hard drive firmware had some kind of bug.

To Linux software RAID’s credit, everything recovered quite nicely simply by adding the appropriate partitions back into the appropriate arrays. Rebuild time was impressively quick. In case there was any doubt left as to how much overhead software RAID imposes on a system: during the rebuild I was able to play and record HDTV OTA content without noticing any impact, and the software RAID chewed up maybe 3% of the CPU time (if that).
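For anyone wanting to spot a dropped drive without waiting for the hard drive light to give it away, here is a minimal Python sketch that scans the text of `/proc/mdstat` for arrays with a missing member. The sample text, array names, and the `degraded_arrays` helper are all made up for illustration; they are not from my actual system. Re-adding a partition is typically done with something along the lines of `mdadm --manage /dev/md1 --add /dev/sdb2`, with device names that are again hypothetical.

```python
import re

# Hypothetical /proc/mdstat contents: md1 has lost one of its three members.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid5]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid5 sdc2[2] sda2[0]
      976559104 blocks level 5, 64k chunk, algorithm 2 [3/2] [U_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: member_status} for arrays with a missing member.

    In the member status string, 'U' marks a working member and '_' marks
    a missing or failed one.
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md1 : active raid5 ..."
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


if __name__ == "__main__":
    for name, status in degraded_arrays(SAMPLE_MDSTAT).items():
        print(f"{name} is degraded: [{status}]")
```

On a real machine you would pass in the contents of `/proc/mdstat` instead of the sample string; the exact layout of that file can vary between kernel versions, so treat the regular expressions as a starting point.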

Of course, all this has got me pretty nervous about the drive in question. SMART isn’t showing anything wrong with it, but I’ve checked with a few sysadmin types, and they’ve all pretty much said the same thing: if the drive is under warranty (and it is), send it back. It failed, and it is best to let the manufacturer figure out how and why.

Not wanting to lose all redundancy in my system, I’m of course looking at a replacement drive. My system currently has three Samsung 500GB drives, all of which I ordered at the same time, so they literally have sequential serial numbers. This is generally considered unwise, as drives from the same batch are highly likely to have the same reliability characteristics (so when one fails, the rest may follow). To that end, I’m thinking of swapping out one of the drives with a Western Digital 500GB drive that I’ve been using in my desktop. If I then order a replacement drive, I no longer have the RAID house of cards I started with.

The question is what to get. I am tempted to go with another one of either the Samsung or the Western Digital 500GB drives. They are about the same price, and they are both quiet but fast drives. The other temptation is to go with one of the 1TB drives that both companies now have on the market. The Samsung sets the record for data density (1TB with only 3 platters!) and therefore also has an impressive sustained transfer rate (which ought to be a key consideration for a MythTV backend). It is also very quiet and power efficient (mostly because of having so few platters, I suspect). The Western Digital one, on the other hand, is a very different kind of drive. It basically trades off some performance (primarily by reducing the RPM of the drive) in order to achieve unheard-of power efficiency (particularly while idle) and quiet. The Western Digital GP is particularly tempting right now, as I can get it for $250 from Newegg (not as much bang for the buck as the 500GB drives, but an amazing deal for a 1TB drive!).

If I went with one of the bigger drives, I’d use the extra space on it to store “maybe I’ll watch it someday, but probably not” content, as well as using it as scratch space for transcoding and such. Then, when the replacement drive came back from Samsung, I might try hacking Zumastor into a Kuro Box Pro, and stuff the drive in it. I could then use it as a live network backup of all my media content, with prior revisions (and yes, using Zumastor in such a way ought to be an interesting test of its capabilities). It is an interesting thought anyway.

While I was researching all this, I came across a warning from the folks over at Western Digital, noting that the current incarnation of the GP is not intended for use in a RAID (and in fact this is true for all the drives I’m mentioning). A RAID model is, of course, due shortly. The whole Enterprise vs. Desktop hard drive business has been seriously called into question by the work of Bianca Schroeder and Garth Gibson, but the Western Digital folks have a fine point here about the advantages and disadvantages of TLER (time-limited error recovery) for RAID vs. desktop type roles.
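To make the TLER trade-off concrete, here is a toy Python model of what happens when a drive hits a hard-to-read sector. It is only a sketch: the 8-second controller timeout and the 7-second TLER limit are illustrative assumptions, not figures from Western Digital’s warning, and the `array_outcome` function is a made-up name. A real controller (or Linux software RAID, which has its own timeout handling) may behave differently.

```python
def array_outcome(recovery_seconds, controller_timeout=8.0, tler_limit=None):
    """Toy model of a RAID controller waiting on a drive's error recovery.

    recovery_seconds:   how long the drive would need to recover the sector
                        on its own (hypothetical input).
    controller_timeout: how long the controller waits before declaring the
                        drive dead (assumed value).
    tler_limit:         if set, the drive gives up and reports the error
                        after this many seconds (assumed value); None models
                        a desktop drive that keeps retrying.
    """
    if tler_limit is not None:
        response_time = min(recovery_seconds, tler_limit)
    else:
        response_time = recovery_seconds

    if response_time > controller_timeout:
        # The drive stayed silent too long, so the whole drive is kicked out.
        return "drive dropped from array"
    if response_time < recovery_seconds:
        # The drive gave up early; the array can fix the sector from redundancy.
        return "error reported; RAID rebuilds the sector from redundancy"
    return "drive recovered the sector itself"


if __name__ == "__main__":
    print("desktop drive, 30s recovery:", array_outcome(30))
    print("TLER drive, 30s recovery:   ", array_outcome(30, tler_limit=7.0))
    print("TLER drive, 2s recovery:    ", array_outcome(2, tler_limit=7.0))
```

The flip side, and the reason TLER is a bad fit for a lone desktop drive, is the second case: with no redundancy behind it, a drive that gives up early simply returns an error for data it might eventually have recovered.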

Here’s what ticks me off: TLER isn’t a hardware thing. It’s a firmware thing. Just like write caching (desktop drives typically have write caching enabled and enterprise drives don’t, but you can switch the setting if you want to), you should be able to flip a bit somewhere in the drive’s firmware and then you’re off to the races. Only, hard drive vendors don’t want it that way. They want to have one cheap drive with razor-thin margins for the ultra-competitive desktop space, and then another, high-margin drive they can sell to the enterprise space. They’re running out of ways to differentiate, so now they are limiting the flexibility of their firmware. This really irks me in a Richard Stallman and Xerox kind of way. The real kicker is this, though: Google’s study of hard disk failures suggests that any drive failure that would make TLER relevant is indicative that it is time to get a replacement drive anyway. So ultimately, TLER is only really a beneficial feature if you have an absolute need to avoid a disk access latency penalty with your RAID even when a drive is about to fail. Sure, there are places for that need, but it sure sounds to me like enterprises for the most part should save their money and use cheaper desktop drives anyway.

Ah well, enough ranting. Time to make a decision.
