The Pilot Is Expecting Some Turbulence Ahead
Unfortunately, the server this blog is hosted on is getting a wee bit flaky (for definitions of “wee bit” that involve “several times an hour”). While it isn’t crashing, it is hanging for several minutes at a time. This appears to be a UML bug, and probably an old one that has long since been fixed at that, but we have limited options in addressing it for a host of reasons that you truly want to hear nothing about. To further complicate things, my DSL at home has become “intermittent” (the best kind, because it always comes up when you call tech support), so I can’t even host the blog there to achieve better uptimes.
The good news is, new hardware should be arriving soon, and with it an entirely new platform (built on OpenVZ). Cross your fingers and hope that all goes well, and maybe you’ll be able to read more mindless drivel from yours truly Really Soon Now.
Darl McBride Does His Iraqi Minister of Intelligence Imitation
ArsTechnica was there to catch SCO’s CEO describing an interesting variant of reality. Highlights include objectively verifiable claims that books on how to program Linux don’t exist, that there is no difference between Linux and Unix, and directly contradicting his own SVP’s earlier testimony that they have evidence that System V Unix is in Linux. Don’t be shocked if he later claims Shakespeare copied System V, that Linus assassinated JFK, and that Poland was never dominated by the Soviet Union.
Linux AIO sucks less
So, for the last little while on the Zumastor project, I’ve been working on integrating AIO into the code base in order to absorb some of the latency penalty that we experience from disk seeks.
Critical for this was getting AIO to work with poll(2), because the ddsnapd daemon follows the tried-and-true “poll then do something” loop that allows for efficient, scalable, and relatively simple Unix server design. Unfortunately (or fortunately, if you are familiar with POSIX ;-), Linux’s native AIO doesn’t follow the POSIX AIO spec, and instead implements its own event queue for notification of completion of IO operations. This event queue isn’t exposed as a file, so you can’t poll it. So, I hacked together a library that spawns a separate thread which does nothing but read in events and copy them out to a pipe, so that the main thread can poll said pipe just like any other file descriptor. Ugly? Yes. Wasteful? Yes. Easier to work with than the apparent alternatives? Yes.
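For the curious, the helper-thread trick looks roughly like the sketch below. It is Python rather than the C the real library would be written in, and the `completions` queue is only a stand-in (an assumption for illustration) for blocking on the kernel's AIO event queue; the names `forward_events` and `poll_loop` are made up for this example.

```python
import os
import queue
import select
import struct
import threading

# Hypothetical stand-in for the kernel's AIO completion queue. A real
# helper thread would block in io_getevents() instead of queue.get().
completions = queue.Queue()

def forward_events(write_fd):
    """Helper thread: wait for completion events and copy them to the pipe."""
    while True:
        request_id = completions.get()
        if request_id is None:  # sentinel value: shut the thread down
            os.close(write_fd)
            return
        # An 8-byte write to a pipe is atomic, so events never interleave.
        os.write(write_fd, struct.pack("=Q", request_id))

def poll_loop(read_fd, expected):
    """Main thread: the usual 'poll, then do something' loop."""
    poller = select.poll()
    poller.register(read_fd, select.POLLIN)
    seen = []
    while len(seen) < expected:
        for fd, _events in poller.poll():
            seen.append(struct.unpack("=Q", os.read(fd, 8))[0])
    return seen

read_fd, write_fd = os.pipe()
worker = threading.Thread(target=forward_events, args=(write_fd,))
worker.start()

# Pretend three IO requests have completed.
for request_id in (1, 2, 3):
    completions.put(request_id)

finished = poll_loop(read_fd, expected=3)

completions.put(None)
worker.join()
os.close(read_fd)
```

The cost is plain to see: an extra thread, an extra pipe, and an extra copy of every event, all just to get a file descriptor that poll will accept.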
I got most of the way through the process. I discovered what appears to be some kind of race condition in AIO where the vast majority of the time I was losing completion events if I submitted multiple IO requests at once. I still haven’t tracked it down, but while looking for possible sources of the problem, I discovered a heretofore unknown (well, by those of us on the project at least) syscall: eventfd(2).
eventfd does for AIO what signalfd(2) does for signals. In other words: it does the obvious thing that we wanted in the first place but were too mentally challenged to find. The moral of the story: even if you think a Linux API (AIO in this case) sucks, expect it to suck less, and question why when it doesn’t.
"badram" to the Rescue
I’ve hit a real “lucky” streak with my laptops of late, with one thing going wrong after another. The latest was yesterday, when I discovered just about every process I tried to spawn died with a segmentation fault. Not good. I stuck in my memtest86+ boot CD and let it grind away all night. After running several rounds of tests, it presented me with the magical badram incantation to tell the Linux kernel to ignore the problematic bits of memory. I dutifully copied down the values, edited my grub command line, and “voila!” everything is working just fine. I’d never used it before, but I have to say I’m impressed with how effective it is.
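As I understand the BadRAM pattern format, the incantation is a list of address/mask pairs: an address counts as bad when it agrees with the pattern address on every bit that is set in the mask. The sketch below illustrates that matching rule only. The values are invented for the example (they are not the ones memtest86+ gave me), and the function names are hypothetical.

```python
def parse_badram(option):
    """Split 'badram=addr,mask[,addr,mask...]' into (address, mask) pairs."""
    values = [int(v, 16) for v in option.split("=", 1)[1].split(",")]
    if len(values) % 2:
        raise ValueError("badram expects address,mask pairs")
    return list(zip(values[0::2], values[1::2]))

def is_bad(address, patterns):
    """True if the address matches any pattern on all bits set in its mask."""
    return any((address & mask) == (base & mask) for base, mask in patterns)

# Made-up example values, for illustration only.
patterns = parse_badram("badram=0x01000000,0xfffffffc")

inside = is_bad(0x01000003, patterns)   # differs only in the two masked-out low bits
outside = is_bad(0x01000004, patterns)  # differs in a bit the mask cares about
```

A mask with more zero bits covers a larger region, which is how a single pair can describe a whole faulty row or column of a memory chip.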
Hard Drive Errata
So, I had my first experience with hard drive failure in my server/MythTV backend. One of my drives failed in an odd fashion. It appeared to be continuously trying to do something over the SATA bus (my hard drive light was permanently on). Linux software RAID did the right thing and dropped the drive from the arrays. Rebooting didn’t help, but powering off did. So, that makes one think the hard drive firmware had some kind of bug.
To Linux’s software RAID’s credit, everything recovered quite nicely simply by adding the appropriate partitions back into the appropriate RAID arrays. Rebuild time was impressively quick. In case there was any doubt left as to how much overhead software RAID imposes on a system… during the rebuild I was able to play and record HDTV OTA content without noticing any impact, and the software RAID chewed up maybe 3% of the CPU time (if that).
Of course, all this has got me pretty nervous about the drive in question. SMART isn’t showing anything wrong with it, but I’ve checked with a few sys admin types, and they’ve all pretty much said the same thing: if the drive is under warranty (and it is), send it back. It failed, and it is best to let them figure out how/why.
Not wanting to lose all redundancy in my system, I’m of course looking at a replacement drive. My system currently has three Samsung 500GB drives, all of which I ordered at the same time, so they literally have sequential serial numbers. This is generally considered unwise as they are highly likely to have the same reliability characteristics (so when one fails, the rest may follow). To that end I’m thinking of swapping out one of the drives with a Western Digital 500GB drive that I’ve been using in my desktop. If I then order a replacement drive I no longer have the RAID house of cards I started with.
The question is, what to get. I am tempted to go either with another one of the Samsungs or the Western Digital drives. They are about the same price and they are both quiet but fast drives. The other temptation is to go with the 1TB drives that both companies have offered on the market. The Samsung sets the record for data density (1TB with only 3 platters!) and therefore also has an impressive sustained transfer rate (which ought to be a key consideration for a MythTV backend). It is also very quiet and power efficient (mostly because of having so few platters, I suspect). The Western Digital one, on the other hand, is a very different kind of drive. It basically trades off some performance (primarily by reducing the RPM of the drive) in order to achieve unheard-of power efficiency (particularly while idle) and quiet. The Western Digital GP is particularly tempting right now as I can get it for $250 from newegg (not as much bang for the buck as the 500GB drives, but an amazing deal for a 1TB drive!).
If I went with the bigger drives, I’d use the extra space on said drive to store “maybe I’ll watch it someday, but probably not” content, as well as using it as a scratch space for transcoding and such. Then, when the replacement drive came back from Samsung, I might try hacking Zumastor into a Kuro Box Pro, and stuff the drive in it. I could then use it as a live network backup of all my media content, with prior revisions (and yes, using Zumastor in such a way ought to be an interesting test of its capabilities). It is an interesting thought anyway.
While I was researching all this, I came across a warning from the folks over at Western Digital, noting that the current incarnation of the GP was not intended for use in a RAID (and in fact this is true for all the drives I’m mentioning). A RAID model is, of course, due shortly. The whole Enterprise vs. Desktop hard drive business has been seriously called into question by the work of Bianca Schroeder and Garth Gibson, but the Western Digital folks have a fine point here about the advantages and disadvantages of TLER for RAID vs. desktop type roles.
Here’s what ticks me off: TLER isn’t a hardware thing. It’s a firmware thing. Just like write caching (desktop drives typically have write caching enabled, enterprise drives don’t, but you can switch the setting if you want to), you should be able to flip a bit somewhere in the drive’s firmware and then you’re off to the races. Only, hard drive vendors don’t want it that way. They want to have one cheap drive with razor thin margins for the ultra competitive desktop space, and then another, high margin drive they can sell to the enterprise space. They’re running out of ways to differentiate, so now they are limiting the flexibility of their firmware. This really irks me in a Richard Stallman and Xerox kind of way. The real kicker is this though: Google’s study of hard disk failures suggests that any drive failure that would make TLER relevant is indicative that it is time to get a replacement drive anyway. So ultimately, TLER is only really a beneficial feature if you have an absolute need to avoid a disk access latency penalty with your RAID even when a drive is about to fail. Sure, there are places for that need, but it sure sounds to me like enterprises for the most part should save their money and use cheaper desktop drives anyway.
Ah well, enough ranting. Time to make a decision.
Linux & Security, or How I Learned To Start Worrying And Hate Linux Advocates, Freezing Software, and Lazy Sys Admins
Security is a tricky business. Computer security is just a nightmare. Every now and then you see software companies that fail to appreciate this, and make promises that “this one can’t be hacked”. As Bruce Schneier’s famous saying goes, security is a process, not a product. Products aren’t “secure”, and even expressions like “more secure” are at best highly subjective. This is why I am always enraged to hear Linux advocates who push Linux on the grounds that “it is more secure”. You can rest assured that when someone uses that line, they probably know next to nothing about computer security, and probably not that much more about one of Windows or Linux (sometimes both). Today’s story about phishing should be a wake-up call.
Here’s the problem with harping on the security button when advocating Linux. First of all, consumers, for the most part, don’t care. You know why most software is provided “as is” with no warranty? Because consumers would prefer that to a product released years later, with fewer features, and much greater cost. Let’s just say for a moment you win that battle, and in a post 9/11 world, you manage to convince them that security should be their first priority. Well then, the designers of competing software, particularly Windows, would simply shift resource allocations and product priorities accordingly, and within a surprisingly short period of time the Linux world would find itself falling behind the massive software juggernauts of the world, for they are, if nothing else, responsive to changes in customer expectations. In fact, we’ve already seen a glimpse of this with Windows XP SP2, which Microsoft pushed very aggressively despite not making a dime of revenue off of it. Let’s just say for a moment though that these big software juggernauts are too slow to notice the shift and have too much old crufty code to fix up to get the job done anytime soon. Here’s where the real can of worms starts to open up: what happens if people actually buy your argument and start shifting over to Linux en masse? Linux becomes preferred target #1 for hackers, and suddenly Linux starts falling victim to all kinds of nasty exploits (and consumers will be a lot more angry about this than they are about security problems with Windows, because you’ve made them more concerned about the issue and made the mistake of promising a product-based solution). I’ve heard people wave off this argument, but I think they fail to look at the raw data. There have been, over the years, a lot of remote exploits found in various distributions of Linux. Maybe not as many as have been found with Windows, but we’re within an order of magnitude here. Despite this, Windows has had easily a hundred times more instances of malware and security compromises reported. That should tell you something.
Now, don’t get me wrong, I feel more confident about the security of my Linux systems than I ever feel about any Windows systems I have set up. There are some good reasons for this. For starters, I know how Linux works a lot better than Windows. I know just enough about Windows to be really dangerous, but Linux I really understand. So, I can spot oddities a lot more easily, and I know a lot more about what needs to be done to secure, monitor, and respond to security threats on Linux. Admittedly, knowing and doing are two different things, but I have that problem with Windows too. :-) I’m also more confident about Linux because of the “process”. I know Microsoft has sophisticated processes for auditing code and finding bugs but I don’t know much about them so I can’t tell how much I can trust them. Linux’s open source processes are transparent and really do help minimize and mitigate security flaws. Finally, the Linux community tends to have a more sophisticated user base, which means they are more likely to have a proper security process in place, which makes Linux a less desirable target in the first place.
The biggest delusion I’ve seen in this regard comes from people who deny that there are any “real” security holes in Linux, because they’ve never had a compromised system. So first off, there are plenty of well documented cases where worms or other malware have been able to exploit security flaws in Linux. Secondly: how do you know you have never been compromised? Few people use tools like tripwire, rkhunter, chkrootkit, Nessus, etc. correctly, and even if you do, some rootkits are VERY good at hiding. The most insane counter to this point has been, “well, if I can’t observe it, is it really a problem”. The “If a tree falls in the forest…” metaphor breaks down when your machine is used to stage a giant DDoS blackmail, a phishing scam, or a big spam dump. You may not observe it, but you can bet that somebody will someday, and you may find yourself at the wrong end of a criminal investigation, a lawsuit, etc. More importantly, exploited machines, loosely coordinated through a “botnet”, are probably the biggest security threat on the Internet right now. Just like a poorly maintained house in a neighbourhood, compromised machines drag down everyone in the neighbourhood (and unfortunately, on the Internet, the neighbourhood is quite large). Don’t be lazy: keep your systems patched.
This is why I am so irritated by “software freezes”. I’ve worked on more than a few projects where the attitude is “I know a new version of software X is out, but we’ve deployed with this old version and it is working for us so far… I don’t think it is worth the risk to upgrade to the new version. It might break things.” Sure, it probably will break things. To quote the agile software folks: embrace change. Repeatedly updating your software will teach you where you are making too many assumptions, where the underlying APIs are least mature, where your protocols lack backward compatibility, where you need to clean up your build and deployment process, etc. Sadly, by not embracing these changes, your software starts to ossify, and it becomes nearly impossible to contemplate platform upgrades. That becomes a huge issue when a new security flaw is discovered and you need to roll out a patch. Just expect to have to do an update every couple of weeks as new security flaws are discovered, take the hit with all the breaks that come from that, and the world will be a better place for all of us. I say this having worked at companies that are still running slightly patched variants of the software platforms they first launched with… years ago. Updating is a huge mess for them, and I suspect whenever a new security flaw is discovered in their platform, all hell breaks out in the IT department.
Sadly, I think we’ve got enough of a botnet problem right now that the Internet could start to be a real ugly place. It’s time everyone recognized the mess and cleaned up their part of the neighbourhood.
As If You Needed Another Reason to Have an LWN Subscription
Run, don’t walk, over to LWN to read the first article in a new series by Ulrich Drepper entitled What every programmer should know about memory. It is very detailed and well written: in other words, the kind of content I always expect when I wander over to LWN. They are an outstanding group producing quality work for far too little money, so I encourage anyone who doesn’t have a subscription to get one (as little as $2.50/month). Sure, you could wait for Ulrich’s article to be released and read it for free, but if you sign up you can read it and content like it a week before everyone else and know that you are supporting this kind of quality work.
Game Over Dude
Groklaw has the story. Finally after years of legal insanity, the SCO fiasco appears to be reaching some point of closure. Anyone interested in shorting SCOX?
User Mode Linux performance tuning
I haven’t posted for a while, partly because I’ve been on vacation, but particularly because the UML instance I’m using has been behaving… poorly. Every time I post an article it seems to hang for quite some time (minutes, not seconds). I’m starting to get some ideas about what my problems might be.
So, it has taken me a while, but a few things have become apparent to me. First, and most likely the biggest contributor, is that I’m running a fairly new host glibc with a fairly old host kernel and then on top of that I’ve got a fairly new UML kernel. In particular, I suspect this is running into problems with various bugs UML has had with NPTL.
On top of this, I realized I’ve had my UML kernels defaulting to using completely fair queuing, which is probably counterproductive. It’s probably more efficient to get rid of queuing altogether and let things go right to the host, otherwise you get all kinds of ugliness from having two levels of IO scheduling going on.
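For reference, the scheduler can be switched at runtime through sysfs, which on 2.6 kernels exposes it as `/sys/block/<device>/queue/scheduler` with the active choice shown in square brackets. The sketch below assumes that “getting rid of queuing” means picking the `noop` elevator, and that the UML block device is called `ubda`; both are assumptions, so adjust to taste. The helper names are made up for this example.

```python
from pathlib import Path

def parse_schedulers(text):
    """Return (active, available) from a line like 'noop deadline [cfq]'."""
    names = text.split()
    active = next((n.strip("[]") for n in names if n.startswith("[")), None)
    return active, [n.strip("[]") for n in names]

def set_scheduler(device, name, sysfs_root="/sys/block"):
    """Select an IO scheduler for a block device (needs root on a real system)."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    active, available = parse_schedulers(path.read_text())
    if name not in available:
        raise ValueError(f"{name!r} not offered by the kernel: {available}")
    if name != active:
        path.write_text(name)
    return path

# Parsing a typical 2.6-era scheduler line.
active, available = parse_schedulers("noop anticipatory deadline [cfq]")

# Inside the UML guest, as root, one might then call:
#     set_scheduler("ubda", "noop")   # "ubda" is an assumed device name
```

The same choice can be made at boot with the `elevator=noop` kernel parameter, which saves having to remember to do it after every restart.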
Anyway, I’ve changed my IO scheduling, and hopefully this weekend I’ll be able to put together a new set of host/UML kernels built on 2.6.20 or maybe even newer (if I dare).
[Crossing fingers]
Obsessing Over Cell Phones
So, for several months now, I’ve been dragging around a Sprint PPC-6700 with a cracked LCD screen. This turns out to be a real pain with touch screens, because it breaks the touch circuitry. I’ve been putting off doing something about it for a variety of reasons, but particularly given the availability of the Neo 1973, I feel the time has come to make a decision.
Despite two posts on it, I’m really not interested in getting an iPhone. Beyond that, I’m quite confused about what I want to do. My choices would appear as follows:
Just get a functional PPC-6700. I actually have insurance on the phone, so this is a cheap, $50 option that gets me back to where I was.
Upgrade to the Mogul, aka the PPC-6800. I can get a partial discount on the phone if I commit to another two years with Sprint (which I’m pretty much committed to anyway due to having recently added yet another member of the family on to our plan). This means it’d cost me slightly less than $500, which will require some doing to justify.
Give up on having my cell phone be a full fledged mobile PC, and instead go with one of Sprint’s other phones. Possibly something with basic web browsing capabilities, but the focus would be on phone capabilities.
Finally, there is the Neo 1973 lingering in the background. Honestly, if it wasn’t the first real Linux phone (i.e. where you can really access Linux and mess with it), I wouldn’t give it the time of day. However, it is such a beast and better still it has a nice VGA touch screen, and that makes it incredibly interesting.
Complicating this whole mess is the insane situation with cell phones that is the US marketplace. You can basically choose between four carriers, all of which basically suck. We ended up on Sprint mostly through the following function: Verizon ticked us off, Cingular had given us trouble before and had been particularly obnoxious to my mother-in-law who’d been an AT&T customer before the merger, and T-Mobile had horrid reception at my office. This led us to Sprint (who actually have caused us headaches before too, but we hadn’t dealt with them for the longest so… ;-).
Throw in to that the joy that is cell phone contracts (we’re effectively locked in to Sprint for two more years). The iPhone may not be cool enough to make me think about breaking contracts or paying for two cell phone services at once, but the Neo 1973 is. Now, the current rev lacks WiFi support, and in general, do you really want the 1.0 version of such experimental hardware? (Answer: YES! I’m a geek.)
Complicating this has been my experiences so far with the 6700. In true Windows fashion, it’s been somewhat unstable. In particular, it seems to react somewhat poorly to receiving a call while I have the phone locked… which is most of the time I receive a call. It also is a little underpowered; it is clear to me that the extra processing power and RAM in the Mogul, while seemingly minor, would actually make a huge difference with end user experience. I’ve actually not installed much software on the phone, but this is mostly because I’ve had bad experiences with most of the installs I have done. I’ve had problems with running out of memory, not having the right runtime installed, problems getting some software to install on the mini-SD card, etc. The browser on it has been handy on more than a few occasions, although it is hard to justify the $15/month service charge for the data service (unless you own Sprint stock ;-). One problem is that as web sites get more and more AJAX-y, they work more and more poorly with the crippled version of IE embedded in it (and yes, I have tried installing a Gecko based browser, with very unfortunate consequences). A number of sites are frustratingly close to working completely, like Reddit, which works well enough except the login button, which is the key to unlocking the site’s potential (similar problems with the MTA website). I’ve used the e-mail support a fair bit, but it’s never really worked with work e-mail, and I only get the top 100 e-mails in each folder, which proves to be severely limiting for someone who abuses IMAP as much as I do. More than anything else, I’ve found QVGA to be an extraordinarily limiting resolution for working with anything other than “designed for QVGA” content.
At this point, I am leaning to canceling the data service, going with one of Sprint’s Sanyo phones (they seem to have the best reception and voice quality), and then debating with myself whether I’d really actually use a Neo 1973, and maybe waiting for early adopter reports to filter in. On the other hand, I’d like to *be* one of those early adopters! ;-)
As much as I like the Neo 1973’s openness, and as much as I sometimes question the ROI of my data services, I probably would have gone with the Mogul except for its shortcomings in one particular area: screen resolution. I have to hand it to Apple for not allowing themselves to be confined to a QVGA world. It’s a sad world to be in, and yet it is impossible to get a Windows Mobile phone in the US with >QVGA resolution. Outside the US there are a number of models to choose from, but the insanity of the way the cellphone market works in the US means that we get the latest and greatest last. Indeed, most of what makes the iPhone such a marvelous phone is that it is available in the US first. I imagine when it shows up in some other countries, a lot of people will shrug their shoulders and say, “So what?”. Given what I know is possible with mobile handsets today, I find it incredibly hard to spend >$400 on a cell phone and end up with a display whose resolution is the same as what was available two years ago, and hasn’t been seen on the desktop in like twenty years. Heck, I’d rather have a power-efficient but high resolution grayscale screen than the low-res color crud we have right now. While perfectly fine for scrolling through a list of menu options, it is not really suitable for reading web content that inevitably has 200+ pixels set aside on the left just for navigation links (one could blame the net for this, and I do, but phones that address this are already available, I just want them on my provider’s network).
So, I’m sure if anyone actually reads all this, they’ll have suggestions. Please fire away. I’m all ears.