Smith's Corollary to Greenspun's Tenth Rule
Any sufficiently complicated and extensible C program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of C++… and the C programmers involved will proudly sneer at C++ for its ad hoc design process, poor syntax, dangerous corner cases and bloat.
The Raw and The Cooked: C++0x Extensible Literals
When I look at C++, and squint a certain way, it appears to be a heroic attempt to retrofit a real type system on top of C’s terribly weak one. C’s weak typing is, for the most part (because we can’t possibly break backwards compatibility… except for when we do), augmented with C++’s strong typing. C’s typeless preprocessor is augmented with C++’s so-thoughtful-about-types-it-is-sentient template system. C’s structs are augmented with C++ objects and operator overloading. C’s weak typing and don’t-ask-me-why-just-cast-it operator are augmented with the far stricter, more precise, and admittedly verbose static_cast<>, dynamic_cast<>, const_cast<>, and reinterpret_cast<>. C’s unfortunate format string and varargs oriented I/O functions are augmented with C++’s strongly typed std::iostreams. Retrofits can sometimes be stronger and more powerful than other approaches, but they are almost inevitably more complex, less elegant, and generally less lovable than “from scratch” solutions. I think many programmers feel the C++ committee was engaged in an academic exercise to demonstrate just how true this principle could be. Looking over the C++0x proposals, it appears that a strong sentiment on the C++ committee’s part was that C++98 was too limited a case study and that they could go further to produce more spectacular results, an opinion that has been greeted by jeers and cheers (seems like mostly jeers on the blogosphere, but then… it’s the blogosphere).
One of the benefits of all this effort is that in C++ user defined types are practically first class citizens in C++’s type system, largely indistinguishable from primitive types (Java has also realized they missed this boat and is attempting to correct it, albeit through an alternative trajectory more consistent with its nature with things like autoboxing). I say “largely”, because there are some subtle differences (well, in true C++ fashion, they are only subtle until you encounter them, at which point they are as subtle as a punch in the face) that continue to annoy, and the C++0x committee’s holy quest is leading them to find new ways to address this. Perhaps one of the more interesting efforts to bridge the remaining gaps between user-defined types and primitive types in C++ is the Extensible Literals proposal.
The Abysmal Status Quo
If you’ve worked in C or C++ (or Java for that matter) for a while, particularly if you’ve also had exposure to scripting languages, you have probably come to recognize how limited the built-in literals are. You have literals for the built-in types. You also have initializers for arrays and structs, which were so limited that the C99 committee felt compelled to improve them (see designated initializers). The Java world, after a great deal of reflection which I presume involved the use of powerful hallucinogens, has concluded XML is the best language around for defining data and there is no way another programming language can match it. So Java has simply chosen an idiom where what one would normally think of as a fine opportunity to use literals is instead a fine opportunity to define a new DTD and/or XML Schema and then write out some verbosely tagged data (always being careful to insert XML entities instead of > or < signs).
When one looks at Boost’s Program Options Library, one can’t help but marvel at the syntactic trickery employed to do something that is so simple and straightforward in, say, Perl. (Look mom! I used “Perl” and “simple” in the same sentence!). What I find most disturbing though, is that user-defined types can’t have literals associated with them. The closest you can get is a constructor that takes primitive types (which do have literals) as arguments. So, for example, if you are working with Unicode strings, you invariably end up writing something like:
UnicodeString("this is a Unicode string", "UTF-8")
Now, in their infinite wisdom, the C++0x committee has addressed the Unicode issue separately from user defined types, so now you can do something like:
u8"this is a Unicode string"
The existence of this particular literal extension to C++0x if anything demonstrates literals are important. This is particularly true as the committee, in what can be described as a philosophical feud with the C99 folks, has not provided a corresponding built in set of literals for the complex number type (“see, we can do complex numbers entirely as a library, thereby keeping our syntax simple…” —just try to say that with a straight face).
To Boldly Go Where No C++ Compiler Has Gone Before
Into this mess, the C++0x committee has brashly bravely charged. The result is the extensible literals (i.e. user-defined literals) proposal. As with all good things in C++, there turns out to be a fair bit of complexity to the matter, but it all makes sense once one thinks about the hoops one is having the compiler jump through. The new literals mechanism is built off of suffixes (apparently, to use prefixes for user defined literals would invoke a computer apocalypse of sorts… I don’t understand the details, but I heard mumblings about California’s governor going back in time in starkers and Google’s distributed computer calling itself “Skynet”), allowing for things like 123km to translate literally in to some object representing 123 kilometers, which I have to say seems rather cool and… straightforward at first glance. That’s the simple concept from which the inevitable complexity begins.
Two Dancers Alternating Through Double Hops, One in Black and One in Yellow, Various Other Pairs of Dancers, a Guy with a Truly Impressive Falsetto, and a Guy In Stripes Who Is Definitely Not A Dancer
It turns out, to be able to express all the wonderfulness that should be C++ literals, the proposal introduces two distinct types of literals: The Raw And The Cooked.
Editor’s Note: For those of you who didn’t enter the workforce until after the original C++ standard was introduced, you may have to contact one of your elders to properly understand the cultural reference. For those of you wondering what this link has to do with the dichotomy between the natural and artificial world… I can only say that most of us at the time didn’t get the video either, but during that era the cool thing to do was have an abstract, obtuse video that completely went over the heads of most of the audience —that’s the way it was, and we liked it!
The raw literal is defined as the raw sequence of characters that form a literal. It’s the raw bytes of the literal before the compiler has had its way with it (although after the preprocessor has expanded any macros and any string literal concatenations have been done… just so that we can’t have a completely simple definition and the C preprocessor can continue to be the bane of every C++ developer’s experience).
The cooked form is defined as the typed value that the literal string represents (before all the magical user-defined literal processing happens). In particular, this allows one to be able to have that user defined literal in the “123km” example operate on the integer value “123” rather than have to first transform { ‘1’, ‘2’, ‘3’, ‘\0’ } in to a useful binary value.
My God… It’s Full of Operators
C++ operator overloading is simultaneously one of its most useful and most abused features. In what will undoubtedly ensure that everyone will either love or hate extensible literals, the proposal follows the C++ standards tradition and adds… more operators to overload in to the language. Bravely eschewing the C++/CLI’s approach of overloading the meaning of yet another symbol (I seriously could have imagined something like Foo::$Foo()), instead we have some new operators. For raw literals, we have the form:
T operator "suffix" (char const*)
Where T is the type of the literal, and “suffix” the magical suffix that identifies the user defined literal. So, when the compiler sees: 123km it interprets that (if the appropriate function is defined) as a call to long long operator"km"("123") (just like a constructor it has the option of throwing an exception… that one is going to keep the exception safety nuts up for months).
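To make the raw form concrete, here is a minimal sketch. It uses the spelling that eventually shipped in C++11 (operator""_km rather than the proposal's operator"km", with suffixes lacking a leading underscore reserved for the standard library), since that is what compilers accept today. The Kilometers type and the raw_demo helper are invented for illustration.

```cpp
#include <cstdlib>

// Hypothetical unit type, invented for this sketch.
struct Kilometers {
    long long value;
};

// Raw form: the compiler hands over the literal's spelling as a
// null-terminated string, so 123_km arrives here as "123".
Kilometers operator""_km(const char* digits) {
    return Kilometers{std::strtoll(digits, nullptr, 10)};
}

// Usage: the suffix turns the literal into a Kilometers value.
inline long long raw_demo() {
    Kilometers distance = 123_km;
    return distance.value;
}
```

Because only the raw operator is declared here, the compiler has no cooked overload to prefer and passes the characters through untouched.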
So far, though, things aren’t complicated enough to really meet the usual standard, multi-paradigm mischief that we’ve come to expect from C++. I mean, it’s barely OO, and doesn’t really tie in to the whole generic/functional programming world C++ developers have come to know and love. Fortunately, this proposal leaves no stone unturned. Indeed, this particular stone has been turned… and kicked around a fair bit afterwards, in the form of this:
template<char...> T operator "suffix" ();
Yup, that’s not just a templated version, but a variadic templated version. For those of you playing this at home, Variadic Templates are also part of the C++0x standard. They are essentially a way of turning LISP-y looking Typelists in to C-y looking vararg functions. So, our 123km example, if someone had defined an extensible literal operator like this:
template <char...> long long operator"km"();
then the compiler interprets that as a call to:
operator"km"<'1', '2', '3'>()
Note the lack of null at the end there? I’m sure that was put in there to ensure inconsistency with the other form. ;-)
Anyway, the primary advantage of the variadic template form is that if you tag the function with the new “constexpr” keyword, the entire thing can be evaluated at compile time like all good template metaprogramming foo. One could argue a sufficiently smart compiler might be able to determine when to do compile time evaluation of the former form in cases where it was possible. However, if there is one myth that the C++ committee considers heresy, it must be the myth of the Sufficiently Smart Compiler (one of life’s little ironies is how this view directly results in C++ compilers having to be the most sophisticated, complex, and nuanced compilers known to man… but I digress).
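Here is a rough sketch of the variadic template form doing its work at compile time, written in the syntax that ended up in C++11 (operator""_b with a leading-underscore suffix, rather than the proposal's operator"b"). The _b and _len suffixes and the fold_binary helper are my own inventions, and the sketch assumes the literal contains only the digits 0 and 1.

```cpp
#include <cstddef>

// Base case: no characters left, so the accumulator is the answer.
constexpr unsigned long long fold_binary(unsigned long long acc) {
    return acc;
}

// Recursive case: shift the accumulator left and add the next binary digit.
template <typename... Rest>
constexpr unsigned long long fold_binary(unsigned long long acc, char digit,
                                         Rest... rest) {
    return fold_binary(acc * 2 + (digit - '0'), rest...);
}

// Template form: 101_b becomes operator""_b<'1', '0', '1'>().
template <char... Digits>
constexpr unsigned long long operator""_b() {
    return fold_binary(0, Digits...);
}

// Counts the characters in the pack; there is no trailing '\0' in it.
template <char... Digits>
constexpr std::size_t operator""_len() {
    return sizeof...(Digits);
}

// Both checks are resolved by the compiler, not at run time.
static_assert(101_b == 5, "binary literal folded at compile time");
static_assert(123_len == 3, "the pack holds exactly three characters");
```

The static_assert lines are the interesting part: they only compile because the whole call chain is constexpr.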
I hope your head isn’t spinning and spewing forth pea soup right now, because we’re just firing up the barbie to move on to “cooked form” extensible literals.
Looking at the proposal, cooked literals seem actually quite straight forward and the rational way to handle the 123km example I keep bandying about. According to the proposal, one would define:
Kilometer operator"km"(unsigned long long);
which would then be invoked with 123. There is no special templated form, and cooked literals wisely take precedence over raw literals. There are similar forms for doubles and the various forms of C-style strings that now exist in C++0x (strings prefixed with “u” and “U” both get their special form), although surprisingly the functions are length terminated, so for example, one could make a convenience literal for std::strings such that:
"this is an std::string\n"s
creates an std::string by defining a literal operator as follows:
std::string operator"s"(const char* s, size_t length)
{
    return std::string(s, length);
}
While at first this might seem counter intuitive, it does provide a nice way to distinguish between cooked literals and non-templated raw literals.
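A small sketch of both cooked forms, again in the spelling C++11 finally adopted (operator""_km and operator""_str, since un-underscored suffixes are reserved). The Kilometer type and the _str suffix are made up for the example. The explicit length parameter is what lets a string literal with an embedded null survive intact.

```cpp
#include <cstddef>
#include <string>

// Hypothetical unit type, invented for this sketch.
struct Kilometer {
    unsigned long long value;
};

// Cooked integer form: the compiler has already parsed "123" into 123.
constexpr Kilometer operator""_km(unsigned long long value) {
    return Kilometer{value};
}

// Cooked string form: the pointer comes with an explicit length, so the
// operator never has to go hunting for a terminating '\0'.
inline std::string operator""_str(const char* s, std::size_t length) {
    return std::string(s, length);
}

// Usage: the embedded '\0' is kept because the length says 3, not 1.
inline std::size_t embedded_null_size() {
    return ("a\0b"_str).size();
}
```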
I’ve Come To Bury Extensible Literals, Not to Praise Them
The one thing I didn’t see in the proposal is how negative integers are handled (it seems odd that the proposal would imply defaulting to unsigned integers, but perhaps I’m missing something). Unfortunately, the proposal also doesn’t really address a simple way to define hierarchical/structured literals like Boost’s Program Options library desperately cries out for. While one could define literals for maps and hash maps, I just don’t see them even remotely approaching the elegance you find in scripting languages, and it still seems like there isn’t a convenient way to have a literal composed of compound expressions. Then again, it’d be hard to distinguish between the latter and a sufficiently clever constructor and literals for the constructor’s type parameters.
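On the negative integer question: in the version of the feature that shipped with C++11, the minus sign is never part of the literal. An expression like -5_km is the unary minus operator applied to the result of 5_km, so the literal operator only ever sees a non-negative value and the user-defined type has to supply its own negation. A minimal sketch, with a hypothetical Kilometer type:

```cpp
// Hypothetical unit type, invented for this sketch.
struct Kilometer {
    long long value;
};

// The literal operator only ever receives the unsigned magnitude.
// (This sketch assumes the magnitude fits in a long long.)
constexpr Kilometer operator""_km(unsigned long long magnitude) {
    return Kilometer{static_cast<long long>(magnitude)};
}

// Negation is an ordinary operator applied after the literal is built.
constexpr Kilometer operator-(Kilometer k) {
    return Kilometer{-k.value};
}

// -5_km parses as -(5_km), not as a literal with the value -5.
static_assert((-5_km).value == -5, "unary minus applied to the literal");
```

Without the operator- overload, -5_km simply fails to compile.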
There are lots of cool uses one can foresee for this proposal. Obviously one could make a units/measurements system that was much more seamlessly integrated through the use of literals for constructing measurement instances. Having an agreement upon literal syntaxes for string objects would bring them one step closer to first class status beside their C-style predecessors. One particularly cool example from the proposal is to have literals for internationalization efforts, such that “foo”_i18n might translate to: ‘lookup the key “foo” in the appropriate i18n table and use the appropriate value’, which might reduce i18n friction enough that developers would adopt sane i18n development practices without first having several sessions on the rack. One can see some interesting abuses, as well. I have to wonder at the extent to which they can be employed recursively. I can’t see a reason why one couldn’t create literals with side effects, although hopefully this practice would be viewed as in poor taste.
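As a rough idea of what the i18n literal might look like, here is a sketch in the syntax C++11 ended up with (operator""_i18n). The translation table, its single entry, and the fall-back-to-the-key behavior are all assumptions of mine, not something the proposal specifies.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical translation table; a real one would be loaded per locale.
inline const std::map<std::string, std::string>& translations() {
    static const std::map<std::string, std::string> table = {
        {"greeting", "bonjour"},
    };
    return table;
}

// Looks the literal up as a key and falls back to the key itself
// when no translation exists (an assumption made for this sketch).
inline std::string operator""_i18n(const char* key, std::size_t length) {
    const std::string lookup(key, length);
    const auto found = translations().find(lookup);
    return found != translations().end() ? found->second : lookup;
}
```

Note that this literal does a map lookup at run time every time it is evaluated, which is exactly the sort of hidden work that could make literals surprising.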
Despite the tongue-in-cheek comments found throughout this article, and some of the shortcomings in the concept and the specifics of the proposal, I actually quite like the design of extensible literals and hope that a somewhat more polished version of it will make it into the standard and the top-tier C++ compilers quickly. While a complex solution for what at first glance seems like a simple problem, like a lot of C++ features, most of the complexity is pushed to the library designer side (i.e. the person writing the extensible literal), while using the feature seems likely to be simple and straightforward in the most common cases. The former is, in my opinion, intrinsic to C++’s nature, forgivable, and arguably a feature. Designing code for reuse is always difficult, and sometimes languages which make it deceptively easy encourage very poor designs that would effectively be stillborn in C++. If, however, you can make using said code fairly straightforward, you make it easier for less sophisticated developers to leverage the skills of the masters. In the end, this proposal strikes me as emblematic of the language itself: yes, it is complex under the hood; yes, it has a face only a mother could love; yet, beneath all that it is powerful, pragmatic, and cleaner to work with than the typical hackery C and C++ programmers tend to come up with to address this issue.
It's Full Of Stars
Apparently Arthur C. Clarke is rendezvousing with Rama or whatever is out there. Really, there is too much to say to say anything at all. He was truly a unique and interesting man, and his contributions to science cannot be overstated (really, when was the last time you thought about the contributions to science by a man known primarily for writing fiction?).
Another Odd Little Case In C++
The “great” thing about C++ is that it is such a complex language that there are always little corner cases to consider that you completely forgot about. I ran in to one such case today.
So, let’s consider 3 functions:
T fooVal();
T& fooRef();
T* fooPtr();
Now we all know what to expect of fooVal() right? Similarly, we’re all quite clear on fooPtr(). We therefore know what fooRef() does, because we all know that a reference is just syntactic sugar around a pointer, so it really does exactly the same thing as fooPtr, right? Technically they are slightly different. fooPtr() is guaranteed to return back a pointer value, but it doesn’t guarantee that you can dereference that pointer and actually get at a valid T. It could literally be a pointer to deallocated memory or even NULL. fooRef() on the other hand, is returning a valid T by reference. Technically it is returning a pointer, but it is a pointer that is guaranteed to dereference cleanly. If the implementation of fooRef() tries to return a temporary T (i.e. one that is lexically scoped to live only inside fooRef()), the compiler will smack it around some, whereas you could actually get away with returning a pointer to a temporary T in fooPtr().
But this is the stuff that C++ programmers typically remember after they’ve been burned by it several times. The funky part is how pointers and references differ on the caller side. Imagine this code:
T foo = fooVal();          // all is well
const T& foo2 = fooVal();  // compiles?
T* foo3 = &fooVal();       // won't compile.. but imagine if it did?
So, “foo” seems fine, but foo2 & foo3 seem really, really broken. Why? Well, fooVal() returns a T, but it is only a transient T. It is only on the stack for a short second before getting popped off as part of the cleanup from calling fooVal(). Now, in the case of “foo”, we’re fine, because the compiler will copy the return value of fooVal() in to foo before it is cleaned up. So both foo2 and foo3 have problems, because they end up pointing at this temporary value, which immediately gets deallocated. You know that if the compiler let you do foo3, as soon as you dereferenced it, all hell would break loose.
Here’s the weird corner case though: foo2 is perfectly fine and you can access it to your heart’s content.
Yup, there is a special rule, presumably in order to make references behave more like value types. If you bind a reference variable to the return value of a function that returns by value, the compiler will keep that temporary around until the variable falls out of scope. (Strictly speaking, the reference has to be declared const T& for standard C++ to accept it; binding a temporary to a plain T& compiles only as a compiler extension.) How weird is that, eh?
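If you want to watch the rule in action, a sketch like the following records when the temporary is actually destroyed. The Tracer type, the events log, and the helper functions are all invented for the demonstration.

```cpp
#include <string>
#include <vector>

// Shared log of lifetime events, in the order they happen.
inline std::vector<std::string>& events() {
    static std::vector<std::string> log;
    return log;
}

// Hypothetical type that reports its own construction and destruction.
struct Tracer {
    Tracer() { events().push_back("constructed"); }
    Tracer(const Tracer&) { events().push_back("copied"); }
    ~Tracer() { events().push_back("destroyed"); }
};

// Returns by value, so the caller receives a temporary.
inline Tracer makeTracer() {
    return Tracer();
}

inline std::vector<std::string> lifetime_demo() {
    events().clear();
    {
        // The temporary is bound to a const reference, which extends
        // its lifetime to match the reference's scope.
        const Tracer& ref = makeTracer();
        (void)ref;
        events().push_back("still in scope");
    }  // ref goes out of scope here, and only now is the temporary destroyed
    return events();
}
```

In the returned log, the "still in scope" entry lands before the final "destroyed" entry, which is the lifetime extension doing its job.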
Why Don't We Check Our Math?
One of life’s little mysteries is why so few traditional mainstream languages have support for catching overflows in fixed-width arithmetic types. Java, for all its concerns about bounds checking and memory errors, doesn’t really provide any mechanism for catching overflows. C’s view of the matter is to make all unsigned math wrap around and leave the signed case literally undefined. C++ did nothing to improve upon C’s behavior. It’s just a mess.
One could perhaps make the argument that these kinds of errors rarely show up, but I see them all the time when I review code.
I can’t count how often I’ve seen code like this:
size_t buffer_size;
...
/* skip on down to the evil stuff */
unsigned char *iter = buffer;
while ((*iter++ = getc(file)) != EOF) {
    if ((iter - buffer) == buffer_size) {
        buffer_size += buffer_size; /* wraps around once buffer_size > SIZE_MAX/2 */
        buffer = realloc(buffer, buffer_size);
    }
}
All is well and good unless buffer_size ever gets to be greater than SIZE_MAX/2, and then suddenly you are writing off in to lala land. Yeah, that’d mean realloc() would have to succeed in allocating >SIZE_MAX/2 memory, but with our modern systems still primarily running 32-bit code, despite having multiple gigs of memory, this isn’t exactly unheard of. Code like this can be found everywhere. Heck, if you check back a few generations of GNU’s corelib functions you’ll find something almost exactly like the above.
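The guard that is missing from code like that is tiny. As a sketch (try_double is a name I made up), checking against half the maximum before doubling is enough to turn the silent wraparound into a reportable failure:

```cpp
#include <cstddef>
#include <limits>

// Doubles `current` into `doubled` only when the result fits in size_t.
// Returns false, leaving `doubled` untouched, when doubling would wrap.
inline bool try_double(std::size_t current, std::size_t& doubled) {
    if (current > std::numeric_limits<std::size_t>::max() / 2) {
        return false;  // current * 2 would wrap around
    }
    doubled = current * 2;
    return true;
}
```

A caller would bail out (or fall back to a smaller growth step) when try_double returns false instead of handing the wrapped value to realloc().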
Statically typed functional programming languages tend to handle this issue through boxed types. Dynamically typed languages tend to handle it by simply checking for overflow and automagically promoting to wider and wider arithmetic types in the event that an overflow occurs. Both approaches are decent approximations of an ideal solution, but they are both a response to the problem that mainstream languages seem to have their head in the sand about.
I’ve mentioned this to some people, and have received comments like “well C is very close to the metal, so they want to expose you to how the CPU does the math”. Great! Most CPUs have an overflow flag just waiting to let you know that all hell has broken loose, so surely C takes advantage of this? ;-)
The reality is that with a simple check of a register value, we can save ourselves a ton of bugs. This is a really cheap safety feature that one could always disable in performance sensitive code that had been carefully reviewed.
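Neither C nor C++ exposes that flag portably, but GCC and Clang offer a compiler-specific builtin, __builtin_add_overflow, that reports whether an addition overflowed and typically lets the compiler lean on the hardware's overflow detection. The sketch below assumes one of those compilers; checked_add and max_plus_one_is_caught are hypothetical helper names.

```cpp
#include <limits>
#include <stdexcept>

// Adds two ints, throwing instead of silently overflowing.
// __builtin_add_overflow is a GCC/Clang extension, not standard C++.
inline int checked_add(int a, int b) {
    int result = 0;
    if (__builtin_add_overflow(a, b, &result)) {
        throw std::overflow_error("integer overflow in checked_add");
    }
    return result;
}

// Usage: adding 1 to the largest int is reported rather than wrapped.
inline bool max_plus_one_is_caught() {
    try {
        checked_add(std::numeric_limits<int>::max(), 1);
    } catch (const std::overflow_error&) {
        return true;
    }
    return false;
}
```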
What brought this to mind was that I was dusting off some old code that I’ve recovered from a crashed drive and I found an old project of mine called “checkedmath” which addressed this shortcoming in C++. C++, for all of its shortcomings, provides just enough support for metaprogramming that you can generally come up with a way to address a lot of those shortcomings in code. In this case, I added overflow checking by taking advantage of operator overloading. I’m going to polish it off a bit before posting it, but the basics look something like this:
#include <limits>

template <typename T>
struct CheckedNumber {
    // Adds aNumber to the stored value, throwing instead of overflowing.
    CheckedNumber<T>& operator+=(const T aNumber) {
        if (value >= 0) {
            // Overflow past max() is only possible when value is non-negative.
            if ((std::numeric_limits<T>::max() - value) < aNumber) {
                throw arithmetic_bounds_exception(*this, aNumber, "+");
            }
        } else if ((std::numeric_limits<T>::min() - value) > aNumber) {
            // Underflow below min() is only possible when value is negative.
            throw arithmetic_bounds_exception(*this, aNumber, "+");
        }
        value += aNumber;
        return *this;
    }
private:
    T value;
};
Now, that doesn’t take advantage of the hardware’s overflow detection, but my plan was always to get out a generic version that could pretty much work on any platform and then write some more efficient specializations in inline assembler (if I ever got around to re-bootstrapping my assembly programming knowledge) at a later date. The actual code is more generic than the above (probably more than it needs to be really), but you get the idea.
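For anyone who wants something they can compile right away, here is a self-contained sketch of the same idea. It is not the actual checkedmath code: arithmetic_bounds_error is a stand-in for the exception type whose definition isn't shown above, and the constructor, get(), and add_throws helper are additions of mine to make the example usable.

```cpp
#include <limits>
#include <stdexcept>

// Stand-in exception type, invented for this sketch.
struct arithmetic_bounds_error : std::overflow_error {
    arithmetic_bounds_error()
        : std::overflow_error("checked addition out of range") {}
};

template <typename T>
class CheckedNumber {
public:
    explicit CheckedNumber(T initial) : value_(initial) {}

    // Same range test as the snippet above: compare against the remaining
    // headroom before adding, so the check itself can never overflow.
    CheckedNumber& operator+=(const T rhs) {
        if (value_ >= 0) {
            if ((std::numeric_limits<T>::max() - value_) < rhs) {
                throw arithmetic_bounds_error();
            }
        } else if ((std::numeric_limits<T>::min() - value_) > rhs) {
            throw arithmetic_bounds_error();
        }
        value_ += rhs;
        return *this;
    }

    T get() const { return value_; }

private:
    T value_;
};

// Usage helper: reports whether start += delta would be rejected.
template <typename T>
bool add_throws(T start, T delta) {
    CheckedNumber<T> number(start);
    try {
        number += delta;
    } catch (const arithmetic_bounds_error&) {
        return true;
    }
    return false;
}
```

An addition that fits goes through untouched, while one that would cross max() or min() throws before the stored value is modified.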
The reason I never finished this project was that after I figured out how to do it right, it occurred to me that surely someone else had already done the same thing. Now it’s a year later and I have yet to see anything like this. So, I’m going to throw it out to the blogosphere: anyone seen anything like this?
UPDATE: Apparently VB does handle overflow.
UPDATE: Looks like Microsoft has SafeInt. It doesn’t do boxed types and lacks optimizations, but it’s still a good start. I may still push my CheckedNumber implementation out at some point, but at least there is a semi-decent implementation of checked arithmetic out there.