In Which I Demonstrate Little Respect For My Betters

Posted by Christopher Smith Fri, 08 May 2009 15:25:00 GMT

Of late, DBMS2 has been a source of interesting factoids and questionable analysis about the data warehouse industry. I’ve bitten my tongue about some of the articles, because some of them have been directly about goings-on at my employer, but this interview with Carson Schmidt seems free and clear of that problem while still providing analysis of debatable value. Now, let me preface this by saying AFAIK Carson Schmidt is a brilliant man who has likely forgotten more about OLAP databases than I will ever know. Indeed, knowing this, I get the feeling some of what he is saying has been misinterpreted, taken out of context, or just plain poorly analyzed. That said, I want to apply some critical thinking to the discussion.

Let’s try to break things down a few different ways:

  • Cheap drives do lose a lot of performance to error correction… but that is merely a firmware setting that costs $0 to change… if you can get access to the firmware. Enterprise class drives expect to run in a RAID config where it is better to just give up and hope that other drives have the parity data. If you are buying a lot of hard drives (i.e. you are building an EDW), you can no doubt negotiate a firmware config change.

  • Faster rotation, smaller media, more heads, etc. are all great for random seeks. They are pointless for throughput (or, more accurately, counterbalanced by greater media density). Being bound by seeks is a great way to get a visit from the fail whale.

  • The vibration issue is a real one. Solution: dampen vibrations before they get to the drive. Padding, mount suspensions, etc., etc. Even if you build it in to the drive, this is actually very cheap to do right, but enterprise drives charge a premium for it.

  • Command queuing support is now a commodity feature. You can get some improvements by having better drive firmware for this, and moderate improvements by providing faster processors and more memory for deeper queues, but unless you clog the pipe with random seeks (read: bad), even commodity drives will deliver.

  • The “electronic features” section is basically talking about performance characteristics of the node that have NOTHING TO DO WITH THE DRIVE. If you want to see cheap drives with massive CPU & RAM available to them, go check out your local Hadoop cluster.

  • The drive industry continues to price and market drives based on their interfaces far too much, and this article makes the same mistake. Storage interfaces are increasingly similar in terms of their performance capabilities. To the extent that one has better bandwidth or latency than the other, you pay for it in $’s. While storage interfaces have unique characteristics that make them better or worse suited for certain types of solutions, we really need to move past the notion that, say, having a Fibre interface means you’ll have better iops (or, for that matter, throughput) even if the SAS drive has faster rotation rates, shorter strokes, more heads, higher density, etc., etc.

  • Talking about SSDs being a 100x performance win over disk drives is once again showing a focus on iops. SSDs are awesome for iops, and you can use cache and RAID techniques to get their throughput up to impressive levels, but I’m sorry, they aren’t anywhere close to 100x better at throughput. It’s nice that they fill in for an area where magnetic disk continues to fall behind, but the bottom line is that even SSDs are only a short-term fix that ignores the larger trend: if your software is iops bound, it is increasingly going to fall behind relative to sequential scan (even dumb sequential scan) software.

  • In general, it shouldn’t be news to anyone that expensive drives have better iops performance than cheap ones. That’s been the story of the storage industry for at least the last two decades.

Here’s a simpler summary of what is really going on here: seeks suck. Seeks have not gotten much better over time, though throughput has improved tremendously. Enterprise drives, which traditionally target the OLTP market, have the best seek performance you can get from magnetic disc media, because with OLTP all that matters is iops/$, and in many cases just iops. It turns out though that the only way we can really move the needle on iops is with parallel, independent I/Os (very much the kind of thing that scatter/gather AIO is all about). So, you can’t improve seek latency much (Savvio 15K.2’s are at best 5x better), but you can improve how many seeks you can do at once… to a point. Here’s the thing though: at a certain point, sequential throughput becomes sooo much faster than random seeks that it can be as fast to do the logical equivalent of a table scan as to do hundreds of parallel indexed lookups. Good software will recognize this and do the right thing.
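To put a rough shape on that trade-off, here is a minimal back-of-the-envelope sketch. Every number in it (table size, row count, sequential rate, iops) is a made-up assumption for illustration, not a measurement of any real drive, and the function names are mine. It simply asks: how many random lookups can you do in the time one full sequential scan takes?

```python
def scan_seconds(table_bytes, seq_bytes_per_sec):
    """Time to read the whole table sequentially."""
    return table_bytes / seq_bytes_per_sec


def lookup_seconds(num_lookups, iops):
    """Time to do num_lookups random reads at a given iops rate."""
    return num_lookups / iops


def break_even_lookups(table_bytes, seq_bytes_per_sec, iops):
    """Number of random lookups that take as long as one full scan."""
    return scan_seconds(table_bytes, seq_bytes_per_sec) * iops


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
TABLE_BYTES = 100 * 10**9   # 100 GB table (assumption)
ROWS = 10**9                # one billion rows (assumption)
SEQ_BPS = 100 * 10**6       # 100 MB/s sequential read (assumption)
IOPS = 200                  # random reads per second (assumption)

lookups = break_even_lookups(TABLE_BYTES, SEQ_BPS, IOPS)
fraction = lookups / ROWS

print(f"full scan: {scan_seconds(TABLE_BYTES, SEQ_BPS):.0f} s")
print(f"break-even: {lookups:.0f} lookups ({fraction:.4%} of rows)")
```

With these assumed numbers, an index only wins if it narrows the query down to a tiny fraction of the rows; past that point the dumb scan is faster. Real systems complicate this (caching, multiple rows per page, queued I/O), so treat it as a way to think about the crossover, not a prediction.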

Here’s the thing: assuming you are being smart in your software, a single Savvio is going to beat the snot out of a single Barracuda drive, but the Savvio is going to cost a lot more. For the same price you could buy several Barracuda drives, and then the Barracudas have a huge win on sequential scan performance. You have to ask yourself: “how often are my indexes eliminating so many records for me that the Savvio’s better iops still trumps the awesome scan performance of this chunk of Barracuda array?” But here’s the real killer: since parallel, independent I/Os are where the Savvios are going to really show their stuff… couldn’t you instead have your software split the I/Os over the Barracuda drives? How many iops does that get you? ;-)
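The same-budget comparison can be sketched like this. The two drives below are generic stand-ins with invented prices and specs (I am deliberately not attaching these numbers to the Savvio or the Barracuda), and the model assumes the software spreads independent I/Os perfectly evenly across drives, which is the optimistic case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    name: str
    price_usd: float
    iops: float            # random reads per second, per drive
    seq_mb_per_sec: float  # sequential throughput, per drive


def array_for_budget(drive, budget_usd):
    """Return (drive count, aggregate iops, aggregate MB/s) for a budget.

    Assumes independent I/Os are spread evenly, so both iops and
    sequential throughput scale linearly with the number of drives.
    """
    count = int(budget_usd // drive.price_usd)
    return count, count * drive.iops, count * drive.seq_mb_per_sec


# Hypothetical drives: every figure here is an assumption for illustration.
fast_drive = Drive("fast enterprise drive", 400.0, 400.0, 120.0)
cheap_drive = Drive("cheap commodity drive", 100.0, 100.0, 100.0)

BUDGET_USD = 400.0
for drive in (fast_drive, cheap_drive):
    count, iops, mbps = array_for_budget(drive, BUDGET_USD)
    print(f"{drive.name}: {count} drive(s), {iops:.0f} iops, {mbps:.0f} MB/s")
```

With these invented numbers the cheap array matches the fast drive on aggregate iops and is well ahead on scan throughput. Plug in real prices and specs and the answer may differ; the point is that the comparison should be per dollar and per array, not per drive.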

The trend in data warehouses is to build highly parallelized clusters that bring the processing power closer and closer to the storage (moving away from the SAN model), and to have the software treat each node as a highly sophisticated IOP, an IOPU (to play on the GPU concept) if you will, that simultaneously scans and analyzes the data on each node, and then aggregates the results. To the extent that the drive and the IO controller try to get clever with their own software, they actually get more and more in the way over time. At some point, it might theoretically be better to have hundreds of really cheap, really dense single platter, single head drives with firmware that doesn’t try any error recovery or command queuing, and leave the software running on the individual “IOPUs” to do all the cleverness along with the actual analytical work at hand.
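As a toy illustration of that scan-locally-then-aggregate shape, here is a sketch in which each “node” is just an in-memory range of integers and a thread pool stands in for the cluster. The function names and data are mine, not any real product’s API, and nothing here touches a disk.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_and_aggregate(partition, predicate):
    """Sequentially scan one node's rows and return a partial (count, total)."""
    count = 0
    total = 0
    for row in partition:
        if predicate(row):
            count += 1
            total += row
    return count, total


def cluster_query(partitions, predicate):
    """Scatter the scan to every partition, then gather and merge partials.

    Requires at least one partition, since the pool needs one worker each.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(
            pool.map(lambda p: scan_and_aggregate(p, predicate), partitions)
        )
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total


# Three hypothetical "nodes", each holding a slice of the data.
partitions = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
result = cluster_query(partitions, lambda x: x % 2 == 0)
print(result)
```

The coordinator only ever sees small partial results; the bulk of the data never leaves the node that stores it, which is the whole point of moving the processing toward the storage.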

I did find myself more in agreement with Carson on the non-storage tidbits (which I’m sure indicates just how wrong I am about the storage stuff). 10GigE seems ready to wipe Infiniband off the map, and Nehalem seems to have taken out any remaining advantages AMD had over Intel for OLAP work.
