Reality is Trying to Put the Onion Out of Business

Posted by Christopher Smith Fri, 06 Nov 2009 22:25:00 GMT

Okay, these are real news stories, all within the last 24 hours. You can’t make this stuff up.

In order of increasing ridiculousness:

I feel like someone is messing with me.

In Which I Demonstrate Little Respect For My Betters

Posted by Christopher Smith Fri, 08 May 2009 15:25:00 GMT

Of late, DBMS2 has been a source of interesting factoids and questionable analysis about the data warehouse industry. I’ve bitten my tongue about some of the articles, because some of them have actually been directly about goings-on at my employer, but this interview with Carson Schmidt seems free and clear of that problem while still providing analysis of debatable value. Now, let me preface this by saying AFAIK Carson Schmidt is a brilliant man who likely has forgotten more about OLAP databases than I will ever know. Indeed, knowing this, I get the feeling some of what he is saying has been misinterpreted, taken out of context, or just plain poorly analyzed. That said, I want to apply some critical thinking to the discussion.

Let’s try to break things down a few different ways:

  • Cheap drives do lose a lot of performance to error correction… but that is merely a firmware setting that costs $0 to change… if you can get access to the firmware. Enterprise class drives expect to run in a RAID config where it is better to just give up and hope that other drives have the parity data. If you are buying a lot of hard drives (i.e. you are building an EDW), you can no doubt negotiate a firmware config change.

  • Faster rotation, smaller media, more heads, etc. are all great for random seeks. They are pointless for throughput (or, more accurately, counterbalanced by greater media density). Being bound by seeks is a great way to have a visit from the fail whale.

  • The vibration issue is a real one. Solution: dampen vibrations before they get to the drive. Padding, mount suspensions, etc., etc. Even if you build it in to the drive, this is actually very cheap to do right, but enterprise drives charge a premium for it.

  • Command queuing support is now a commodity feature. You can get some improvements by having better drive firmware for this, and moderate improvements by providing faster processors and more memory for deeper queues, but unless you clog the pipe with random seeks (read: bad), even commodity drives will deliver.

  • The “electronic features” section is basically talking about performance characteristics of the node that have NOTHING TO DO WITH THE DRIVE. If you want to see cheap drives with massive CPU & RAM available to them, go check out your local Hadoop cluster.

  • The drive industry continues to price and market drives based on their interfaces far too much, and this article makes the same mistake. Storage interfaces are increasingly similar in terms of their performance capabilities. To the extent that one has better bandwidth or latency than the other, you pay for it in $’s. While storage interfaces have unique characteristics that make them better or worse suited for certain types of solutions, we really need to move past the notion that say having a Fibre interface means you’ll have better iops or for that matter throughput even if the SAS drive has faster rotation rates, shorter strokes, more heads, density, etc., etc.

  • Talking about SSD’s being a 100x performance win over disk drives is once again showing a focus on iops. SSD’s are awesome for iops, and you can use cache and RAID techniques to get their throughput up to impressive levels, but I’m sorry they aren’t anywhere close to 100x better at throughput. It’s nice that they fill in for an area where magnetic disk continues to fall behind, but the bottom line is that even SSD’s are only a short term fix that ignore the larger trend: if your software is iops bound, it is increasingly going to fall behind relative to sequential scan (even dumb sequential scan) software.

  • In general, it shouldn’t be news to anyone that expensive drives have better iops performance than cheap ones. That’s been the story of the storage industry for at least the last two decades.

Here’s a simpler summary of what is really going on here: seeks suck. Seeks have not gotten much better over time, though throughput has improved tremendously. Enterprise drives, which traditionally target the OLTP market, have the best seek performance you can get from magnetic disc media, because with OLTP all that matters is iops/$, and in many cases just iops. It turns out though that the only area where we can really move the needle on iops is if you have parallel, independent I/Os (very much the kind of thing that scatter/gather AIO is all about). So, you can’t improve seek latency much (Savvio 15K.2’s are at best 5x better), but you can improve how many seeks you can do at once… to a point. Here’s the thing though: at a certain point, sequential throughput becomes sooo much faster than random seeks that it can be as fast to do the logical equivalent of a table scan as it is to do hundreds of parallel indexed lookups. Good software will recognize this and do the right thing.
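To put a rough shape on that crossover, here is a back-of-the-envelope sketch. The table size, sequential rate, and iops figure are numbers I made up for illustration, not measurements of any particular drive, and the function names are mine.

```python
def scan_seconds(table_bytes, seq_bytes_per_s):
    """Time to read the whole table sequentially."""
    return table_bytes / seq_bytes_per_s


def lookup_seconds(n_lookups, iops):
    """Time to perform n random reads at a sustained iops rate."""
    return n_lookups / iops


def breakeven_lookups(table_bytes, seq_bytes_per_s, iops):
    """Number of random lookups that cost as much time as one full scan."""
    return scan_seconds(table_bytes, seq_bytes_per_s) * iops


# Assumed, illustrative figures only.
TABLE_BYTES = 10 * 10**9   # a 10 GB table
SEQ_BPS = 100 * 10**6      # 100 MB/s sequential read
IOPS = 200                 # random reads per second

print(breakeven_lookups(TABLE_BYTES, SEQ_BPS, IOPS))
```

Under those assumptions, any query that needs more random lookups than the break-even count would have been better off just scanning the table. The faster sequential throughput grows relative to iops, the lower that threshold gets.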

Here’s the thing: assuming you are being smart in your software, a single Savvio is going to beat the snot out of a single Barracuda drive, but the Savvio is going to cost a lot more. For the same price you could buy several Barracuda drives, and then the Barracudas have a huge win on sequential scan performance. You have to ask yourself: “how often are my indexes eliminating so many records for me that the Savvio’s better iops still trumps the awesome scan performance of this chunk of Barracuda array?” But here’s the real killer: since parallel, independent I/O’s are where the Savvios are going to really show their stuff… couldn’t you instead have your software split the I/O’s over the Barracuda drives? How many iops does that get you? ;-)
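The same-budget comparison can be sketched like this. The prices and performance figures are placeholders I picked so the arithmetic is easy to follow; they are not the specs of any Savvio or Barracuda model. The sketch also assumes the software spreads independent I/O’s perfectly across the drives, which is the idealized case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    name: str
    price: float      # cost per drive, arbitrary currency units
    iops: float       # random reads per second, per drive
    seq_mb_s: float   # sequential throughput in MB/s, per drive


def array_for_budget(drive, budget):
    """Return (drive count, aggregate iops, aggregate MB/s) for a budget.

    Assumes perfect scaling across drives, which real arrays only approach.
    """
    count = int(budget // drive.price)
    return count, count * drive.iops, count * drive.seq_mb_s


# Hypothetical drives, not real product specs.
fast_drive = Drive("fast", price=400.0, iops=400.0, seq_mb_s=120.0)
cheap_drive = Drive("cheap", price=100.0, iops=100.0, seq_mb_s=100.0)

for drive in (fast_drive, cheap_drive):
    print(drive.name, array_for_budget(drive, budget=400.0))
```

With these made-up numbers the cheap array matches the fast drive on aggregate iops and comes out well ahead on scan throughput. Plug in real quotes and benchmarks before drawing conclusions for your own hardware.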

The trend in data warehouses is to build highly parallelized clusters that bring the processing power closer and closer to the storage (moving away from the SAN model), and to have the software treat each node as a highly sophisticated IOP, an IOPU (to play on the GPU concept) if you will, that simultaneously scans and analyzes the data on each node, and then aggregates the results. To the extent that the drive and the IO controller try to get clever with their own software, they actually get more and more in the way over time. At some point, it might theoretically be better to have hundreds of really cheap, really dense single platter, single head drives with firmware that doesn’t try any error recovery or command queuing, and leave the software running on the individual “IOPU’s” to do all the cleverness along with the actual analytical work at hand.
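A toy version of that scan-locally-then-aggregate pattern, for the curious. Threads stand in for nodes and in-memory lists stand in for each node’s local disks; in a real cluster each scan would run on the machine that owns the storage. The names `scan_node` and `cluster_query` are mine, not from any product.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_node(rows, predicate):
    """Sequentially scan one node's local rows; return a partial (count, total)."""
    count = 0
    total = 0
    for value in rows:
        if predicate(value):
            count += 1
            total += value
    return count, total


def cluster_query(nodes, predicate):
    """Run the scan on every node at once, then aggregate the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda rows: scan_node(rows, predicate), nodes))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total


# Three pretend nodes, each holding its own slice of the data.
nodes = [[1, 5, 9], [2, 6, 10], [3, 7, 11]]
print(cluster_query(nodes, lambda v: v > 4))
```

Only the small partial results cross the “network” here. The bulk data never leaves the node that scanned it, which is the whole point of moving the processing next to the storage.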

I did find myself more in agreement with Carson on the non-storage tidbits (which I’m sure indicates just how wrong I am about the storage stuff). 10GigE seems ready to wipe Infiniband off the map, and Nehalem seems to have taken out any remaining advantages AMD had over Intel for OLAP work.

Passing of the Torch

Posted by Christopher Smith Thu, 05 Feb 2009 15:49:00 GMT

It’s hard not to try to read more into the rather interesting timing of two freakishly coincidental occurrences at Microsoft and Google.

Wow. Just Wow.

Posted by Christopher Smith Sat, 01 Nov 2008 02:09:00 GMT

Well, I’d say the software business has been transformed overnight by a Court of Appeals ruling. This should add more chaos to the Nasdaq (as if it needed it). It’s going to be a bumpy ride for a while, folks.

The New Hardware

Posted by Christopher Smith Sun, 28 Sep 2008 13:28:00 GMT

So, after much theatrics, our little co-op managed to get it together and install new hardware. We ended up going with the Silicon Mechanics Rackform nServe A266. It’s drool-worthy hardware, so I thought I’d detail what a wonderful little box this is for our needs.

A bit about Silicon Mechanics: they are a SuperMicro system builder who’ve always made their presence felt in the local Los Angeles Linux community. They show up at SCALE each year without fail, often donating one of their servers as a raffle prize (and you don’t even have to give them your business card to get it!). They clearly care a great deal about their hardware and about the Linux community, so they seemed the right kind of people for our project (and to their credit, they were very patient and helpful with us, particularly considering how small an order we were).

The A266 appealed to us for a lot of reasons, but probably the biggest one was its power footprint and how easy Silicon Mechanics makes it to figure out what your power draw is really going to be. Right there as you are configuring your hardware they estimate your power consumption in very precise terms, so you can know your power footprint before you ever see hardware, and more importantly you can easily figure out what kind of adjustments to make to get the best bang for the watt. This is a brilliant reaction to data center concerns shifting away from rackspace and towards heat and power consumption. I really hope other vendors adopt this themselves. The system uses Opteron cores, which aren’t exactly legendary for their power footprint (AMD is apparently about to bring out some real miserly Opterons towards the end of this year, but until then Intel clearly has the edge), but by going with 2 quad core, high efficiency Opteron 2347 HE’s, we were able to get significant processing power without burning through our entire power budget.

The real win on the power consumption front though was with the RAM. We’re doing server virtualization, so RAM is our most valuable resource. Accordingly, we loaded up on 32GB of RAM. Now, if you have an Intel based system, you end up with FB-DIMMS, which suck power and generate heat like a hair dryer. They have low voltage FB-DIMMS out now, but I have yet to see anything to suggest that this competes well with the DDR2 memory we got in our system. The power savings on the RAM were so great, they totally overcame any excess heat from the CPU. While the memory is by no means the fastest you can get, AMD’s architecture, with HyperTransport and its 3 levels of cache, tends not to be as sensitive to such things, and for our needs, more RAM is way more valuable than faster RAM.

As we expected from Silicon Mechanics, the system’s construction is first rate (I’ll try to get pictures up at some point). When we installed it at the colo, we got numerous positive comments from the ops who passed by. These guys see boxes all the time, and while they tend to be hardware junkies, they also tend to be blasé about the usual fare. About the only criticism I could make is that the box is definitely noisy, but this is most likely on account of the extensive cooling efforts (our hard drive temperatures appear significantly lower than with our old 4U box).

We got the system with SuperMicro’s IPMI 2.0 card with full KVM over LAN support. This really improves our ability to manage the system remotely, which is a big concern for us. I guess we could have bought a network KVM of our own, but I like having it all integrated in with the system. Unfortunately, I managed to seize up our LAN at one point, which is probably one of the main scenarios where the IPMI card isn’t going to save you.

Our storage needs are kind of weird. We need space, but we also need fairly decent IOPS. While each individual member doesn’t test the storage system much, collectively you can end up with a lot of seeks going on at the same time. We ended up going with a fairly interesting strategy. We got 4 500GB drives and hooked them up to a 3Ware 9650SE RAID controller with built in battery backed cache. Normally I’m not a huge fan of hardware RAID or storage caches, but in this instance it makes a lot of sense. With all the RAM we have, we can actually expect excellent filesystem caching performance, but the one thing Linux’s filesystem cache can’t help with is a write that needs to be flushed to disk. This was particularly painful as we were going with a RAID-5 configuration, which doesn’t exactly give you great write IOPS. Piling on to this was our selection of high density 7200rpm drives instead of VelociRaptors or low latency SAS drives. The battery backed cache is a big game changer in this regard. It supports a variety of modes of operation that trade off between performance and reliability, but we went with the “balanced” mode, where it journals writes in the cache, and then signals to the OS that they are complete, while completing the actual write to the RAID at a later point. The net effect is that we can handle bursts of write IO’s very quickly and our peak write IOPS is much higher than it otherwise would be. When we initially set up the system, it did still seem kind of sensitive to high IO loads, but after some tuning it seems to be much more efficient. For our drives we went with Seagate’s Barracuda ES.2’s, whose firmware seems particularly good at handling multiple in flight IO’s. We could have gone with Western Digital’s slightly cheaper and much more energy efficient RE2-GP drives, but their latencies are so much worse than the Seagates’, and thanks to the RAM we had plenty of room in our power budget.

The case has 8 hot-swap drive bays, so drive failures can be handled by the colo’s ops without so much as a hiccup for the system (and 3ware’s 3DM2 software allows you to manually flash a particular drive’s LED so the ops can be sure to pull the right one). Knowing this, we deliberately underdid our drive order, with the idea being that we’d simply order new drives on an as-needed basis, hopefully benefiting from cutthroat evolution of the hard drive market, such that future drives would be denser, faster, and cheaper.
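Going back to the RAID-5 write concern from the storage discussion above, here is a rough sketch of why the battery backed cache matters. A small random write to RAID-5 classically costs four disk I/O’s (read old data, read old parity, write new data, write new parity). The per-drive iops figure and burst size below are assumptions for illustration, not measurements from our array, and the sketch ignores any coalescing the controller might do.

```python
RAID5_WRITE_PENALTY = 4  # read data, read parity, write data, write parity


def raid5_random_write_iops(n_drives, per_drive_iops):
    """Sustained small random write rate of the array, ignoring any cache."""
    return n_drives * per_drive_iops / RAID5_WRITE_PENALTY


def drain_seconds(burst_writes, n_drives, per_drive_iops):
    """Time for the array to work through a burst the cache already acknowledged."""
    return burst_writes / raid5_random_write_iops(n_drives, per_drive_iops)


# Assumed figures: 4 drives at roughly 80 random iops each, a burst of 400 writes.
print(raid5_random_write_iops(4, 80))
print(drain_seconds(400, 4, 80))
```

As long as the burst fits in the cache, the OS sees those writes complete right away and the array drains them in the background. Without the cache, every one of those writes would wait its turn at the array’s much lower sustained rate.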

This whole thing is powered by a 90% efficiency redundant 700W PSU (which is another key part of keeping our power budget down). Our previous system didn’t have a redundant power supply, and while it never failed on us, I lived in fear of getting that midnight call. I fear not now.

So far, our experience with the system has been pretty amazing. Horribly abusive emerge’s inside my virtual instance fly by like it is no big deal. This blog, despite still being Typo 4, is so much zippier than its previous instantiation. It’s hard to know how much credit ought to go to our new software platform (more on that another time), but it is clear that at the very least a huge chunk of it belongs to this new hardware.

Sometimes a Picture Is Worth More Than 1000 Words

Posted by Christopher Smith Fri, 13 Jun 2008 00:59:00 GMT

Normally I’m not a big fan of Valleywag, but days like today are the ones they really suit up for. Without further ado, let me summarize today’s tech news:

Back in the Land of the Living

Posted by Christopher Smith Tue, 20 May 2008 10:45:00 GMT

Well, our server crashed today. Weirdest bug I ever saw: we got a kernel oops when smartd tried to get health information from the drives in the 3ware RAID array. One of the drives appears to have malfunctioned, so perhaps that is related. The fragility was possibly caused by running a fairly up to date smartd on a fairly out of date kernel with SKAS patches… but it is far from clear. I need to test this out more to be sure of what the magic sequence was, but needless to say… it’s been an experience.

Valleywag hasn't gone downhill, News has

Posted by Christopher Smith Wed, 07 May 2008 15:30:00 GMT

I can’t believe anyone in the tech community is still covering the events at JavaOne, but sure enough, we-troll-for-hits ValleyWag was there to capture Neil Young’s appearance yesterday. Now, I remember when Douglas Adams showed up for the Keynote on the last day of the conference, and that made sense. It was the last day of the conference and everyone was fried, if they hadn’t left town already. Douglas, true to form, provided some great entertainment and geek cred to start off the last day push. But Neil Young is to Java as the Smurfs are to the Iraq War. Could Sun make a more profound statement about how JavaOne jumped the shark long ago than to have an aging rocker whose seminal moments occurred before Java was ever invented keynote on the second day of the event? Best quote from the whole experience goes to Dan Farber’s blog entry, where after carefully promoting BluRay, Java, the PS3, and most importantly his Archive project, we read: “…As an artist I try to remove myself from the business,” Young said. “I steer myself away from that…”.

The previous article captures how Mark Kirk has skillfully managed to create controversy in order to get media attention during an election year. “Online porn” doesn’t quite drag voters’ attention away from all the other election year theatrics, and “online child predator” is so yesterday’s news, but “rape rooms” is a surefire hit. Is there any trick from Hussein’s regime that politicians won’t copy and/or trivialize?

Darl McBride Does His Iraqi Minister of Intelligence Imitation

Posted by Christopher Smith Thu, 01 May 2008 21:40:00 GMT

ArsTechnica was there to catch SCO’s CEO describing an interesting variant of reality. Highlights include objectively verifiable claims that books on how to program Linux don’t exist, that there is no difference between Linux and Unix, and directly contradicting his own SVP’s earlier testimony that they have evidence that System V Unix is in Linux. Don’t be shocked if he later claims Shakespeare copied System V, that Linus assassinated JFK, and that Poland was never dominated by the Soviet Union.

In Your Face ComScore

Posted by Christopher Smith Fri, 18 Apr 2008 00:23:00 GMT

Man, I so wanted to say something when ComScore’s initial report came out, but my insider status (barely insider really) makes it dangerous. So, it is with great joy that I let c|net do the talking for me.
