You Gotta Know When to Lock 'Em, Know When to Unlock 'Em, Know When to Recurse 'Em, Know When to Let 'Em Run Concurrently... 1

Posted by Christopher Smith Thu, 28 Sep 2006 13:00:00 GMT

Wow, my first experience with a trackback-based thread! I’m such a blogging newbie that this seems like a big deal to me. Michael Suess has examined my article on why recursive threads are sometimes good and has come up with some excellent questions as well as some solid criticism. Particularly given the excellent questions, I feel I should follow up.

First, I think his opinion (“I have not seen this used in practice a whole lot and it is usually a sign that something is wrong with your synchronisation patterns anyways”) is on the money. Most programming languages, API’s, etc. have some dark and dangerous corners that one should generally not tread down unless one knows exactly why one is doing it, and recursive locks are definitely one of those corners.

One interesting comment was that he is less familiar with databases than HPC, as I had selected databases as an example because developers generally have a broad exposure to working with them. I’ve done enough HPC that I can understand part of his reluctance to accept recursive locks. I can’t think of a case where when doing HPC development I have so much as contemplated using a recursive lock.

On to his questions:

Is it really a smart idea to log in to the same database where the actual transactions are carried out as well?

Generally, I’d say no. That still doesn’t make it a great idea to deadlock in that scenario. I will say though that logging isn’t just transaction logging in the DB sense (indeed, I generally see transaction logging being integrated in with the DB, so it doesn’t fit well with my paradigm of two very loosely coupled subsystems sharing a resource). Often logging is used for trace debugging without halting a live system. It can also be the canonical record of the order of events that got the database in to its current state. Logging frameworks of the log4* ilk tend to allow logs to be routed to different destinations (potentially multiple) based on logging meta-data and a configuration file that can be changed while the code is running. So, while you might route “what if the DB fails” data somewhere else, you might still have use for logging other data to a database somewhere. Given that scenario, you’ll find lots of operations that like to have a single transactional persistent store (well, more likely a single mirrored transactional persistent store). This simplifies operations, configuration, hardware purchasing, etc., etc. It is a bit of a crime, but the punishment for said crime ought to be poor performance and poor concurrency, not random deadlocks.

Does the logging really need to happen inside the transaction?

Absolutely not, but the alternatives are generally slower and/or less elegant. Mr. Suess suggests copying any data you need and update the log afterwards. You can certainly do this, but imagine explaining to a programmer why they need to do this: “well see, there’s this chance that this code might be configured to log through a persistent store framework, and that framework needs some kind of locking around certain shared resources, and…”. You could go with a shorter argument along the lines of “logging could be slow”, which does hold more water, but has the problem that this only makes sense if you are in a heavily contested segment of code (certainly likely if you are worry about copying overhead) and logging will often be enabled. The latter bit is where you have a problem. For every bit of logging code I write that is of the “most of the time, executed in production” variety, I write ten times as much logging code that is “only going to be executed in a context where things are so messed up the performance overhead is going to the least of our concerns”. In that latter case, what I care about is how fast the code runs and with how minimal contention in the “logging disabled” case. More importantly, I’m not going to want to spend much time getting the logging tweaked exactly right, so the notion of checking if logging is enabled, then copying, then unlocking, then checking if I copied any data for logging, then logging (and then if it’s C freeing up any heap allocated bits) is not going to appeal to me. Finally, consider the possibility of a failure that occurs after when you would have logged if you had a recursive lock, but before you get out of your transaction. Now you have to put your logging inside a destructor/finally block/else branch/cleanup label that one uses “goto” to reach. That’s just inherently uglier than simply logging and being done with it. Recursive locks are evil, but code complexity is worse.

Is it a smart move to pay the performance penalty associated with recursive locks all the time, only because there may be the chance that logging is enabled and the logging framework uses the same database as its back-end?

Absolutely not. Indeed, I’d want to have my persistent store be configurable so that if I want/need the extra performance and know that I don’t need recursive locks, I could stick to normal locks (that said, vendors often choose to have slower, bullet proof code, rather faster code that requires more smarts to configure and use properly… such vendors tend not to be used often for HPC work, so I’ll forgive Mr. Suess for not considering this case ;-).

Ignoring the specific database case, the general case is where you have a component which cross-cuts over multiple loosely coupled or even orthogonal components, and requires the use of mutexes. If you are lucky, you can find ways to give each of the loosely coupled components their own mutex for the cross-cutting component, but if not, recursive mutex’s may be the easiest way to avoid tight coupling.

Mr. Suess also takes issue with my notion of recursive locks being useful for page-level locking if you have to lock two records that may or may not be on the same page. His suggested algorithm is to always lock the record with the smaller primary key first, and then only acquire a lock for the second record if it is on the same page. It is a good strategy, and one that avoids deadlocks without having to think very carefully.

I would say the above approach is getting dangerously close to a recursive lock. Yeah, you don’t have a counter, but you are going to have to have some variable that tracks whether you need to unlock the second record (and third, if there are more lookups). To the extent that it isn’t the same, you are losing some ecapsulation of the locking. But that’s not the big problem. The big problem is that the algorithm relies on two assumptions:

  1. You have a priori (before doing any locking) knowledge of the primary keys of both records.
  2. Your API provides you with a way of determining whether two primary keys share the same page.

Case #1 is often not true. For example, if your records have a parent-child structure, you might not know the primary key of the parent until you’d already locked the child (or vice-versa). You could then unlock after getting the info and then lock them in the appropriate order, but of course, you’ve burned up some extra synchronization time (read: less concurrency) and by then the parent-child relationship might have changed.

Case #2 is surprisingly often not true. The reason for this is that people like encapsulation, and exposing things like page/bucket allocations is seen as unseemly. I’m not sure I entirely agree, but I can’t think of a case where this was exposed (this is one of many reasons why programmers tend to hate page-level locking).

Mr. Suess does come rather close to identifying where the recursive key might cause you problems in this scenario. Using it to safely and cleanly do page level locking with two records works iff you have only one thread that is locking multiple records. Otherwise the first thread might try to lock page A, then page B, while another thread tries to lock page B, then page A. Deadlock ensues. So where does this approach come in handy? Well, when you have some batch process that runs in the background (saying auditing the database’s data, deletes orphaned descendants or perhaps doing some data mining) that does more significant analysis than the interactive processes.

So yeah, this is one of those dark areas where you really should only use it if you know exactly what you are doing.

Mr. Suess also does not understand this case:

the lock explicitly exists as a trade off between concurrency and resources like CPU, memory and IO (caching forexample).

I’ll take responsibility for his confusion, because I didn’t provide nearly enough context. I’ll do so now:

One of the better arguments against recursive locks is that you should never hold on to a lock long enough that reach a point where you might try to lock the same lock again without having enough context to know if you still are holding on to it. This makes a lot of sense, particularly if you are trying to maximize concurrency.

However, in the real world, your code often runs on systems with maybe one or two CPU’s (I say this of course as the world is moving strongly in the direction of more and more cores, so if the trend continues this will be a less and less reasonable case). As a consequence, it is often acceptible to trade-off concurrency for better CPU efficiency, reduced I/O utilization, or reduced working set size (really smart code employs the strategy pattern and decides at run time what is the better trade-off). So, holding a lock extra long so that you can reuse the results of an expensive computation and have it still be valid, avoid having to reread data some data from an I/O device, or avoiding having to copy relatively large data structures around may be a perfectly reasonable choice.

Of course, when you are designing a synchronization API, you tend to think in terms of developers trying to maximizing scalability, rather than maximizing efficiency, but a lot of applications benefit far more by maximizing the latter.

In closing, I just couldn’t resist multilating the amusing quote from the comp.programming.threads list:

Having a mutex API without support for recursive mutexes is akin to having a policy to always use prophylactics, but to never use Plan B: it is probably all for the best and you might feel morally superior, but you leave yourself open to the possibility of some additional complexities that you hadn’t planned on. ;-)

When Recursive Locks Are Good

Posted by Christopher Smith Mon, 25 Sep 2006 16:00:00 GMT

This old post by David Butenhof somehow floated to the top of the reddit ether today. It is a great read, and he makes some very strong points, but I think he’s looking at it too much from a systems perspective, and missing the key advantage of recursive locks that matters for application development: encapsulation.

Now, to say he missed encapsulation is not entirely fair. He clearly is aware of the concept, but feels you should never hold on to a lock long enough for encapsulation to matter. I hear what he’s saying, but crossing encapsulation boundaries doesn’t necessarily mean you are spending that much time holding on to a lock. It just means that the lock is being used by two disparate systems that really shouldn’t have to know about each other.

Here’s a simple case that I’ve seen come up: you have some kind of transactional persistent store (read: a database). Your application updates said persistent store fairly regularly. During those updates, it often needs to acquire locks because certain bits of the persistent store aren’t reentrant (say the local write-back cache). Fortunately the system is cleverly designed so that in the 90% case the non-reentrant code isn’t touched, so you still get a high level of concurrency. You app, being the good app that it is, also has a generic logging framework. It supports all kinds of different logging levels and targets, and has this wonderful efficient and reentrant interface where it only chews up 4 CPU cycles in the event that a log is disabled. It’s so great that you even log directly from your transactions with all the transactional context making it to the log. All is great right?

Now what if you configure the logging system to have the transactional persistent store as a logging target? ;-)

The problem is that your logging framework shouldn’t have to care whether you already have a lock on the persistent store. Now, Mr. Butenhof’s argument is that you shouldn’t do the logging until you are done with the lock, but there is important information in the context of the transaction that would be lost once the lock was released. You could copy the information, but you’d be copying data while holding the lock for a logging event that may very well be disabled, and even if it was enabled, the probabilities that both the application and the logging framework need the lock are so low, that this actually increases the average length of time locks are held, thereby increasing contention, not to mention adding overhead in general. The best part is that to even know that this work might necessary you have to look at the config file.

Now, you could make the lock manage all this for you, but ultimately it boils down to you implementing the recursive lock yourself, and probably in a much less efficient way. In short, as with all things, it’s hard to make sweeping categorical statements, and this is why API designers typically have to provide more flexibility than they want to.

Specific contexts where recursive locks are helpful:

  • the probability that the lock will be used recursively is small (say you have to lock two records in a table that has page-level locking)
  • the lock explicitly exists as a trade off between concurrency and resources like CPU, memory and IO (caching for example).
  • the lock protects a common resource that is shared by two highly disparate systems (see above)

Not surprisingly, there tends to be a lot of overlap between those cases.

I would say that in general, recursive locks should not be your first choice, as all the drawbacks mentioned in that post do exist. That said, understanding those drawbacks, and knowing when and how to use recursive locks is key for implementing concurrent systems.