In Which I Demonstrate Little Respect For My Betters

Posted by Christopher Smith Fri, 08 May 2009 15:25:00 GMT

Of late, DBMS2 has been a source of interesting factoids and questionable analysis about the data warehouse industry. I’ve bitten my tongue about some of the articles, because some of them have actually been directly about goings-on at my employer, but this interview with Carson Schmidt seems free and clear of that problem while still providing analysis of debatable value. Now, let me preface this by saying AFAIK Carson Schmidt is a brilliant man who likely has forgotten more about OLAP databases than I will ever know. Indeed, knowing this, I get the feeling some of what he is saying has been misinterpreted, taken out of context, or just plain poorly analyzed. That said, I want to apply some critical thinking to the discussion.

Let’s try to break things down a few different ways:

  • Cheap drives do lose a lot of performance to error correction… but that is merely a firmware setting that costs $0 to change… if you can get access to the firmware. Enterprise class drives expect to run in a RAID config where it is better to just give up and hope that other drives have the parity data. If you are buying a lot of hard drives (i.e. you are building an EDW), you can no doubt negotiate a firmware config change.

  • Faster rotation, smaller media, more heads, etc. are all great for random seeks. They are pointless for throughput (or, more accurately, counterbalanced by greater media density). Being bound by seeks is a great way to have a visit from the fail whale.

  • The vibration issue is a real one. Solution: dampen vibrations before they get to the drive. Padding, mount suspensions, etc., etc. Even if you build it into the drive, this is actually very cheap to do right, but enterprise drives charge a premium for it.

  • Command queuing support is now a commodity feature. You can get some improvements by having better drive firmware for this, and moderate improvements by providing faster processors and more memory for deeper queues, but unless you clog the pipe with random seeks (read: bad), even commodity drives will deliver.

  • The “electronic features” section is basically talking about the performance characteristics of the node that have NOTHING TO DO WITH THE DRIVE. If you want to see cheap drives with massive CPU & RAM available to them, go check out your local Hadoop cluster.

  • The drive industry continues to price and market drives based on their interfaces far too much, and this article makes the same mistake. Storage interfaces are increasingly similar in terms of their performance capabilities. To the extent that one has better bandwidth or latency than the other, you pay for it in $’s. While storage interfaces have unique characteristics that make them better or worse suited for certain types of solutions, we really need to move past the notion that, say, having a Fibre interface means you’ll have better iops (or, for that matter, throughput) even if the SAS drive has faster rotation rates, shorter strokes, more heads, density, etc., etc.

  • Talking about SSD’s being a 100x performance win over disk drives is once again showing a focus on iops. SSD’s are awesome for iops, and you can use cache and RAID techniques to get their throughput up to impressive levels, but I’m sorry, they aren’t anywhere close to 100x better at throughput. It’s nice that they fill in for an area where magnetic disk continues to fall behind, but the bottom line is that even SSD’s are only a short-term fix that ignores the larger trend: if your software is iops bound, it is increasingly going to fall behind relative to sequential scan (even dumb sequential scan) software.

  • In general, it shouldn’t be news to anyone that expensive drives have better iops performance than cheap ones. That’s been the story of the storage industry for at least the last two decades.

Here’s a simpler summary of what is really going on here: seeks suck. Seeks have not gotten much better over time, though throughput has improved tremendously. Enterprise drives, which traditionally target the OLTP market, have the best seek performance you can get from magnetic disc media, because with OLTP all that matters is iops/$, and in many cases just iops. It turns out though that the only area where we can really move the needle on iops is if you have parallel, independent I/Os (very much the kind of thing that scatter/gather AIO is all about). So, you can’t improve seek latency much (Savvio 15K.2’s are at best 5x better), but you can improve how many seeks you can do at once… to a point. Here’s the thing though: at a certain point, sequential throughput becomes sooo much faster than random seeks that it can be as fast to do the logical equivalent of a table scan as to do hundreds of parallel indexed lookups. Good software will recognize this and do the right thing.
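
To put rough numbers on the scan-versus-seek trade-off, here is a minimal back-of-the-envelope sketch. Every figure in it (table size, row size, sequential rate, iops) is an assumed, illustrative value, not a measurement of any real drive, and the function names are made up for this example.

```python
def scan_seconds(table_bytes: float, seq_bytes_per_sec: float) -> float:
    """Time to read the whole table in one sequential pass."""
    return table_bytes / seq_bytes_per_sec


def lookup_seconds(num_lookups: int, iops: float) -> float:
    """Time to do num_lookups independent random reads at a given iops rate."""
    return num_lookups / iops


def break_even_lookups(table_bytes: float, seq_bytes_per_sec: float, iops: float) -> float:
    """Number of random lookups that take as long as one full scan."""
    return scan_seconds(table_bytes, seq_bytes_per_sec) * iops


# Assumed, illustrative figures (not measurements of any real drive):
TABLE_BYTES = 10 * 10**9   # 10 GB table
ROW_BYTES = 100            # 100-byte rows, so 100 million rows
SEQ_RATE = 100 * 10**6     # 100 MB/s sequential throughput
IOPS = 200                 # random reads per second

rows = TABLE_BYTES // ROW_BYTES
lookups = break_even_lookups(TABLE_BYTES, SEQ_RATE, IOPS)
print(f"full scan: {scan_seconds(TABLE_BYTES, SEQ_RATE):.0f} s")
print(f"break-even lookups: {lookups:.0f}")
print(f"break-even selectivity: {lookups / rows:.6%}")
```

Under these assumed numbers the scan takes 100 seconds, so an index has to narrow the query to fewer than 20,000 rows (0.02% of the table) before the indexed path wins. The model ignores caching and rows that share a block, so treat it as a rough bound only.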

Here’s the thing: assuming you are being smart in your software, a single Savvio is going to beat the snot out of a single Barracuda drive, but the Savvio is going to cost a lot more. For the same price you could buy several Barracuda drives, and then the Barracuda’s have a huge win on sequential scan performance. You have to ask yourself: “how often are my indexes eliminating so many records for me that the Savvio’s better iops still trumps the awesome scan performance of this chunk of Barracuda array?” But here’s the real killer: since parallel, independent I/O’s are where the Savvio’s are going to really show their stuff… couldn’t you instead have your software split the I/O’s over the Barracuda drives? How many iops does that get you? ;-)
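
The closing question can be roughed out with a same-budget comparison. This is a sketch under stated assumptions: the prices and per-drive specs below are hypothetical placeholders, not Savvio or Barracuda data, and it assumes the software spreads independent I/Os evenly so that iops and throughput scale linearly with drive count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    name: str
    price: float           # dollars per drive
    iops: float            # random reads per second, per drive
    seq_mb_per_sec: float  # sequential throughput, per drive


def array_for_budget(drive: Drive, budget: float) -> dict:
    """Aggregate figures for as many whole drives as the budget buys.

    Assumes I/Os are independent and spread evenly across drives, so both
    iops and throughput scale linearly with drive count (an idealization).
    """
    count = int(budget // drive.price)
    return {
        "drives": count,
        "iops": count * drive.iops,
        "seq_mb_per_sec": count * drive.seq_mb_per_sec,
    }


# Hypothetical placeholder specs, not vendor data.
enterprise = Drive("enterprise-15k", price=400.0, iops=400.0, seq_mb_per_sec=120.0)
commodity = Drive("commodity-7200", price=100.0, iops=100.0, seq_mb_per_sec=100.0)

for d in (enterprise, commodity):
    print(d.name, array_for_budget(d, budget=400.0))
```

With these made-up figures the budget buys the same aggregate iops either way, while the four commodity drives have more than three times the sequential throughput. What the model leaves out is the latency of any single seek, which is the one place the lone enterprise drive still wins.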

The trend in data warehouses is to build highly parallelized clusters that bring the processing power closer and closer to the storage (moving away from the SAN model), and to have the software treat each node as a highly sophisticated IOP, an IOPU (to play on the GPU concept) if you will, that simultaneously scans and analyzes the data on each node, and then aggregates the results. To the extent that the drive and the IO controller try to get clever with their own software, they actually get more and more in the way over time. At some point, it might theoretically be better to have hundreds of really cheap, really dense single platter, single head drives with firmware that doesn’t try any error recovery or command queuing, and leave the software running on the individual “IOPU’s” to do all the cleverness along with the actual analytical work at hand.
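
The scan-on-the-node, aggregate-afterwards shape can be sketched in a few lines. This is only a toy illustration: threads and in-memory lists stand in for real nodes and their local disks, and the names scan_node and query, along with the sample data, are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def scan_node(rows):
    """Runs "on the node": one sequential pass that filters and pre-aggregates."""
    partial = Counter()
    for region, amount in rows:
        if amount > 0:  # the analytical predicate lives next to the data
            partial[region] += amount
    return partial


def query(nodes):
    """Fan the scan out to every node at once, then merge the small partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        total = Counter()
        for partial in pool.map(scan_node, nodes):
            total.update(partial)  # Counter.update adds counts together
    return total


# Hypothetical data: each inner list stands in for the rows stored on one node.
nodes = [
    [("east", 10), ("west", 5), ("east", -3)],
    [("west", 7), ("north", 2)],
    [("east", 1), ("north", 0)],
]
print(dict(query(nodes)))
```

The point of the design is that only the small per-node summaries cross the network. The raw rows never leave the node that stores them, so each drive only ever sees a plain sequential read.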

I did find myself more in agreement with Carson on the non-storage tidbits (which I’m sure indicates just how wrong I am about the storage stuff). 10GigE seems ready to wipe Infiniband off the map, and Nehalem seems to have taken out any remaining advantages AMD had over Intel for OLAP work.

Back in the Land of the Living

Posted by Christopher Smith Tue, 20 May 2008 10:45:00 GMT

Well, our server crashed today. Weirdest bug I ever saw: we got a kernel oops when smartd tried to get health information from the drives in the 3ware RAID array. One of the drives appears to have malfunctioned, so perhaps that is related. The fragility was possibly caused by running a fairly up to date smartd on a fairly out of date kernel with SKAS patches… but it is far from clear. I need to test this out more to be sure of what the magic sequence was, but needless to say… it’s been an experience.

Hard Drive Errata

Posted by Christopher Smith Sun, 09 Dec 2007 05:47:00 GMT

So, I had my first experience with hard drive failure in my server/MythTV backend. One of my drives failed in an odd fashion. It appeared to be continuously trying to do something over the SATA bus (my hard drive light was permanently on). Linux software RAID did the right thing and dropped the drive from the arrays. Rebooting didn’t help, but powering off did. So, that makes one think the hard drive firmware had some kind of bug.

To Linux’s software RAID’s credit, everything recovered quite nicely simply by adding the appropriate partitions back in to the appropriate RAID’s. Rebuild time was impressively quick. In case there was any doubt left as to how much overhead software RAID imposes on a system… during the rebuild I was able to play and record HDTV OTA content without noticing any impact, and the software RAID chewed up maybe 3% of the CPU time (if that).
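
For anyone who wants to watch a rebuild like this from a script, here is a minimal sketch that pulls rebuild progress out of /proc/mdstat-style text. The post does not say which commands were used for the re-add, so nothing here reflects the actual setup: the sample text, array names, and device names are made up for illustration, and the helper name rebuild_progress is hypothetical.

```python
import re

ARRAY_RE = re.compile(r"^(md\d+)\s*:")
PROGRESS_RE = re.compile(r"(recovery|resync)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min")

# Illustrative sample in the style of /proc/mdstat; not output from the real system.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[0]
      488383936 blocks [2/1] [U_]
      [=>...................]  recovery =  8.5% (41658240/488383936) finish=95.3min speed=78036K/sec

md1 : active raid1 sdb2[1] sda2[0]
      1048512 blocks [2/2] [UU]

unused devices: <none>
"""


def rebuild_progress(mdstat_text):
    """Map each array that is rebuilding to its percent done and estimated minutes left."""
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)  # remember which array the following lines describe
            continue
        status = PROGRESS_RE.search(line)
        if status and current:
            progress[current] = {
                "kind": status.group(1),
                "percent": float(status.group(2)),
                "finish_min": float(status.group(3)),
            }
    return progress


print(rebuild_progress(SAMPLE_MDSTAT))
```

On a live Linux box the same function could be fed the contents of /proc/mdstat instead of the sample string. Arrays that are healthy, like md1 in the sample, simply do not appear in the result.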

Of course, all this has got me pretty nervous about the drive in question. SMART isn’t showing anything wrong with it, but I’ve checked with a few sys admin types, and they’ve all pretty much said the same thing: if the drive is under warranty (and it is), send it back. It failed, and it is best to let them figure out how/why.

Not wanting to lose all redundancy in my system, I’m of course looking at a replacement drive. My system currently has three Samsung 500GB drives, all of which I ordered at the same time, so they literally have sequential serial numbers. This is generally considered unwise as they are highly likely to have the same reliability characteristics (so when one fails, the rest may follow). To that end, I’m thinking of swapping out one of the drives with a Western Digital 500GB drive that I’ve been using in my desktop. If I then order a replacement drive, I no longer have the RAID house of cards I started with.

The question is, what to get. I am tempted to go either with another one of the Samsung’s or the Western Digital drives. They are about the same price and they are both quiet but fast drives. The other temptation is to go with the 1TB drives that both companies have offered on the market. The Samsung sets the record for data density (1TB with only 3 platters!) and therefore also has an impressive sustained transfer rate (which ought to be a key consideration for a MythTV backend). It is also very quiet and power efficient (mostly because of having so few platters I suspect). The Western Digital one, on the other hand, is a very different kind of drive. It basically trades off some performance (primarily by reducing the RPM of the drive) in order to achieve unheard of power efficiency (particularly while idle) and quiet. The Western Digital GP is particularly tempting right now as I can get it for $250 from newegg (not as much bang for the buck as the 500GB drives, but an amazing deal for a 1TB drive!).

If I went with the bigger drives, I’d use the extra space on said drive to store “maybe I’ll watch it someday, but probably not” content, as well as using it as a scratch space for transcoding and such. Then, when the replacement drive came back from Samsung, I might try hacking Zumastor into a Kuro Box Pro, and stuff the drive in it. I could then use it as a live network backup of all my media content, with prior revisions (and yes, using Zumastor in such a way ought to be an interesting test of its capabilities). It is an interesting thought anyway.

While I was researching all this, I came across a warning from the folks over at Western Digital, noting that the current incarnation of the GP was not intended for use in a RAID (and in fact this is true for all the drives I’m mentioning). A RAID model is, of course, due shortly. The whole Enterprise vs. Desktop hard drive business has been seriously called into question by the work of Bianca Schroeder and Garth Gibson, but the Western Digital folks have a fine point here about the advantages and disadvantages of TLER for RAID vs. desktop type roles.

Here’s what ticks me off: TLER isn’t a hardware thing. It’s a firmware thing. Just like write caching (desktop drives typically have write caching enabled, enterprise drives don’t, but you can switch the setting if you want to), you should be able to flip a bit somewhere in the drive’s firmware and then you’re off to the races. Only, hard drive vendors don’t want it that way. They want to have one cheap drive with razor thin margins for the ultra competitive desktop space, and then another, high margin drive they can sell to the enterprise space. They’re running out of ways to differentiate, so now they are limiting the flexibility of their firmware. This really irks me in a Richard Stallman and Xerox kind of way. The real kicker is this though: Google’s study of hard disk failures suggests that any drive failure that would make TLER relevant is indicative that it is time to get a replacement drive anyway. So ultimately, TLER is only really a beneficial feature if you have an absolute need to avoid a disk access latency penalty with your RAID even when a drive is about to fail. Sure, there are places for that need, but it sure sounds to me like enterprises for the most part should save their money and use cheaper desktop drives anyway.

Ah well, enough ranting. Time to make a decision.