Multithreading is hard
As CPU clock speeds stop scaling the way they used to and manufacturers shift toward multicore designs, the computer industry is taking a greater interest in multithreaded applications as the way to keep improving performance. I have a prediction about this: we will see lots of buggy programs for a while. We already got a taste of this when SMP and then hyperthreaded CPUs started appearing, and programs, particularly drivers, started blowing up on systems that could run more than one logical thread simultaneously. I predict this for a simple reason:
Efficient multithreading is hard.
The reason I bring this up is that as I try to rewrite my video filter system to take advantage of multiple threads, I find that it is a very difficult task and that it is taking much longer than I would like. This is not to say that I even remotely believed it was easy, but I like challenges more than chores. Multithreading requires a new way of thinking and invites a whole new class of bugs that makes it much more challenging than regular single-threaded programming. Here are a few of the tips (or rather scars) that I've picked up over the years.
How not to do parallelism
I cringe when I hear people say that they're writing single-threaded code now but plan to make it multithreaded later. This is anywhere from hard to impossible and is asking for disaster.
The first n00b mistake I usually see is people thinking that making code multithreaded simply requires adding locks everywhere. The result is usually at least one of:
- Deadlocks galore due to not following a lock hierarchy.
- Lock-stepped threads that don't actually get any parallelism.
- A lot of CPU burned both in synchronization primitives and in context switches.
The other possibility is a complete lack of locks, in which case running the application becomes a crapshoot and the odds on the Don't Pass bet start looking pretty good.
Deadlocks
If you are writing multithreaded code and don't know what a lock hierarchy is, look it up. Now.
There are four conditions for a deadlock to occur:
- There must be a cyclical lock dependency.
- There must be at least two threads involved, each of which already has one or more locks and is attempting to acquire another.
- The locks in question must not be preemptable, i.e. no lock stealing.
- The locks in question must not be shareable.
This is really easy to accomplish with a stock mutex such as a Win32 critical section: the last two conditions are already satisfied. All you need is one thread trying A->B and another thread doing B->A, and you are skirting disaster. What a lock hierarchy does is enforce a particular ordering so you can never have a cycle. If you enforce that you always take B after A whenever you take both, it isn't possible to have an A<->B deadlock. You don't need to establish a strict ordering; a weak ordering is sufficient to prevent cycles, which you can get by arranging all locks in a tree. Another helpful tip is that if a lock is only a leaf, i.e. you never take any nested locks with it, that lock can never contribute to a deadlock.
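As a minimal sketch of what that ordering looks like in practice, assume two hypothetical critical sections csA and csB, with the rule that csA is always taken before csB whenever both are needed:

#include <windows.h>

CRITICAL_SECTION csA, csB;        // hierarchy rule: csA (outer) before csB (inner)

void UpdateBoth() {
    EnterCriticalSection(&csA);   // outer lock first
    EnterCriticalSection(&csB);   // inner lock second, never the reverse
    // ... touch data guarded by both locks ...
    LeaveCriticalSection(&csB);
    LeaveCriticalSection(&csA);
}

void UpdateInnerOnly() {
    EnterCriticalSection(&csB);   // csB is a leaf here: no nested locks taken
    // ... touch data guarded by csB alone ...
    LeaveCriticalSection(&csB);
}

As long as no code path ever takes csB and then csA, the A<->B cycle described above simply can't form.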
One place where you can get smacked with a deadlock is whenever you have callbacks. Firing a callback while holding an internal lock is a recipe for disaster if the callback then issues a synchronous request on another thread that tries to use your object. There are often ways to finagle the code so it drops the lock when issuing the callback, but if you're not careful these can cause internal deadlocks instead, so I like to simply keep callbacks as simple as possible. Posting messages to a queue is a good solution.
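One way to do the drop-the-lock finagling, sketched here with a hypothetical mLock/mCallback pair, is to snapshot what you need while holding the lock and only call out after releasing it:

typedef void (*Callback)(int);

class Notifier {
public:
    void NotifyClients(int value) {
        mLock.Lock();
        Callback cb = mCallback;   // snapshot the callback pointer under the lock
        mLock.Unlock();
        if (cb)
            cb(value);             // call out with no locks held
    }

private:
    Lock     mLock;                // hypothetical mutex wrapper with Lock()/Unlock()
    Callback mCallback;
};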
Another gotcha is hidden locks — all locks must be accounted for in a lock hierarchy. The Win32 loader lock is usually what bites people in the a$$. If you are executing code in DllMain() you are under loader lock, and that includes static constructors or destructors in a DLL. In particular, waiting for a thread to exit from a static destructor practically guarantees a deadlock since it only takes one DLL that hasn't called DisableThreadLibraryCalls() to receive the thread detach call to its DllMain(). Another one I've stepped on is attempting to send window messages from the callback of a DirectShow sample grabber — if you attempt to Stop() the filter graph from the UI thread, that thread will block waiting for the Sample Grabber to release a sample, and the callback will block in SendMessage() waiting for the message loop.
Windows NT and its descendants offer some help with regard to tracking down deadlocks. If a process is started under a debugger, all of its critical sections will have debug nodes associated with them that are hooked into a central list. WinDbg's !locks command will dump this list and show you all the locks that are currently outstanding along with the thread IDs of the holders. This will very rapidly show you which critical sections and threads are involved in the cycle.
Lock-stepping threads
Multithreading doesn't help with latency, only with throughput. If you want a particular operation to run faster with multiple threads, you have to break it apart into smaller stages and run them in parallel without lock-stepping the pipeline stages together. If you break a single operation into three parts with a linear dependency of A->B->C and don't have a second piece of data to process, you aren't going to accomplish anything because you can't possibly run more than one stage at a time. It's thus desirable to maintain queues between stages to try to keep all of them busy and to reduce lock wait time. Such queues don't necessarily have to be very long; just one buffer can help a lot, and two buffers may be enough. It works fine for video display after all (double-buffering and triple-buffering, respectively).
Here's a stereotypical way to process a thread-safe queue:
void ProcessQueue() {
    queueLock.Lock();
    Queue::const_iterator it(queue.begin()), itEnd(queue.end());
    for(; it!=itEnd; ++it)
        Process(*it);
    queue.clear();
    queueLock.Unlock();
}
Assuming Process() is simple, it's safe, right? Well, yes, except that the entire queue is locked while you're processing it. If this is being called from a worker thread that isn't doing much else other than calling this function, you might find that the other threads that are trying to post into the queue are blocking often and the queue can't be kept full enough to keep the worker busy.
What you want to do is hold the lock for the shortest time possible:
void ProcessQueue() {
    queueLock.Lock();
    queue2.swap(queue);
    queueLock.Unlock();
    Queue::const_iterator it(queue2.begin()), itEnd(queue2.end());
    for(; it!=itEnd; ++it)
        Process(*it);
    queue2.clear();
}
This increases the memory requirements slightly and requires an O(1) swappable container, but otherwise gets you much lower lock contention for very little cost. Notice that only the entry queue needs to be locked; queue2 doesn't need to be locked because it is only ever touched by the ProcessQueue() function.
You'll notice that in the above, all processing occurs in the form of calls to a Process() function. This can be a bit of a pain if you have a series of asynchronous calls that need to happen in a sequence. When writing multithreaded code I often lament that C++ doesn't support coroutines, which would make it much easier to write such sequences without resorting to a thread per sequence. Support for aborting would be required, though, perhaps handled by throwing an exception from the continuation point. Instead, I have to swiss-cheese my routines into a state-machine-like form.
Efficient multithreaded systems have low lock contention, since every clock you spend waiting on a lock is a clock that could have been spent doing useful work instead. That means worker threads need to have something to work on ahead of time and can't just receive jobs right out of the previous stage like a single-threaded system might. This is why VirtualDub has a pipeline of data buffers between the I/O thread and the processing thread: it covers the varying latency of I/O reads and ensures that the processing thread always has something to do, assuming that the I/O thread can keep up.
Atomic primitives are a cheap trick to avoid locks — they're basically single operations that the CPU implements as uninterruptable, so you get the safety of a lock without having to take one. In Win32, these are implemented by the Interlocked*() primitives, many of which are directly implemented by the Visual C++ compiler as intrinsics that generate LOCKed instructions. Using these will often result in both faster and smaller code, and they're great for counters. The catch is that you have to squeeze all of your interactions into a single word, as that is usually all you can atomically manipulate. I'm fond of the "single-pointer queue," which simply uses InterlockedExchangePointer() to push data in and out of the queue. This works best when the consumer only cares about the latest data item, not all of the ones since its last pass.
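Here's a minimal sketch of that single-pointer queue, assuming a hypothetical Frame type: the producer swaps the newest item into the slot and frees whatever the consumer never got to, while the consumer swaps the slot out with NULL.

#include <windows.h>

struct Frame { int frameNumber; };

void *volatile g_latestFrame;       // the entire "queue" is one pointer

// Producer: publish the newest frame, discarding any frame the consumer skipped.
void PostFrame(Frame *f) {
    Frame *old = (Frame *)InterlockedExchangePointer(&g_latestFrame, f);
    if (old)
        delete old;                 // stale frame the consumer never picked up
}

// Consumer: grab the latest frame, or NULL if nothing new has arrived.
Frame *GrabLatestFrame() {
    return (Frame *)InterlockedExchangePointer(&g_latestFrame, NULL);
}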
Sleep() is not a synchronization primitive
I cannot count how many times I have seen people do this:
while(!flag) Sleep(1);
Now, there are a number of variants of this in common use, including Sleep(10), Sleep(100), and even Sleep(500). They all have one thing in common: they all suck. This either causes lots of unnecessary context switches or adds latency to the wakeup, and simply leads to poor performance. It also tends to lead to bugs because the flag is often not set in a thread-safe manner: it needs to be volatile at least, and often needs memory barriers or atomic operations, but invariably just ends up being a plain bool. See below for why this is bad.
The common place I see this is waiting for a Win32 thread to exit. I used to come up with all sorts of goofy, elaborate event-based systems to do this myself until I discovered that thread handles are waitable. Just use WaitForSingleObject() on the thread handle and you're done. No risking race conditions with the thread exit code. I think pthreads has pthread_join() for this too.
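A minimal sketch of that pattern, assuming a hypothetical worker started with _beginthreadex():

#include <windows.h>
#include <process.h>

unsigned __stdcall WorkerProc(void *) {
    // ... do the work ...
    return 0;
}

void RunWorkerAndJoin() {
    HANDLE hThread = (HANDLE)_beginthreadex(NULL, 0, WorkerProc, NULL, 0, NULL);
    // ... tell the worker to finish, e.g. by setting an event ...
    WaitForSingleObject(hThread, INFINITE);   // returns once the thread has exited
    CloseHandle(hThread);                     // the handle still needs to be closed
}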
Every once in a while, a poor soul will reinvent the really bad version of this by using Sleep(0), which consumes a lot of CPU time. Sleep(0) on Win32 also only yields to threads of equal or higher priority. I always know when someone has used one of these because all the fans suddenly turn on in my computer. What's really fun is when the zero-wait loop ends up starving a lower-priority thread that is supposed to set the flag, and what's supposed to be a 10ms wait becomes 400ms or even seconds. This situation is known as priority inversion.
I try never to use polling loops like this within VirtualDub — I always try to use waits on events whenever possible so waiting threads can wake up as soon as possible. When I eliminated a polling loop from the shutdown of the Dubber class a while back, it dropped the time for stopping a preview from about 400ms to nearly instantaneous.
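For the flag-polling loop above, the event-based equivalent might look like this sketch, assuming a hypothetical auto-reset event shared between the two threads:

#include <windows.h>

HANDLE g_hReadyEvent;   // created once with CreateEvent(NULL, FALSE, FALSE, NULL)

// Signaling thread: wake the waiter immediately instead of setting a flag.
void SignalReady() {
    SetEvent(g_hReadyEvent);
}

// Waiting thread: no Sleep() loop, no burned CPU, no added wakeup latency.
void WaitUntilReady() {
    WaitForSingleObject(g_hReadyEvent, INFINITE);
}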
A bit too lockless
Nearly any data that is shared needs to be under lock or handled with equivalent algorithms. The one major exception is data that is constant and treated as read-only the entire time that multiple threads are using it; any mutable data definitely needs to be protected. This includes the tiniest data, especially counters such as reference counts. Contrary to popular opinion, using operator++() in C++ doesn't ensure that you will only get a single-instruction increment:
00000000: A1 00 00 00 00     mov     eax,[?blah@@YAHXZ]
00000005: 40                 inc     eax
00000006: A3 00 00 00 00     mov     [?blah@@YAHXZ],eax
0000000B: C3                 ret
It's only a one-instruction window. Would you actually hit this in practice? The answer is an emphatic yes, even on single-CPU systems, and in as little as a couple of hours. You might even hit it constantly on a single-CPU system if the scheduling points happen to line up just right. If you manage to get the compiler to generate a single INC instruction you are OK on a non-HT single CPU, even without a LOCK prefix, because x86 uses instruction restart. On any other system you must use a locked atomic increment if you are not using explicit locks, even if the parallel execution is just from Hyper-Threading Technology. For these reasons it's essential to test on both single-CPU and HT/SMP systems. Microsoft COM requires that IUnknown reference counting be implemented in a thread-safe manner, preferably using InterlockedIncrement() and InterlockedDecrement(); I do this as well for my own internal refcounts using a class called VDAtomicInt.
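As a hedged sketch of what such a thread-safe reference count can look like (not the actual VDAtomicInt implementation), the Interlocked primitives are all you need:

#include <windows.h>

class RefCountedObject {
public:
    RefCountedObject() : mRefCount(1) {}

    int AddRef() {
        return (int)InterlockedIncrement(&mRefCount);
    }

    int Release() {
        int rc = (int)InterlockedDecrement(&mRefCount);
        if (!rc)
            delete this;            // last reference gone
        return rc;
    }

protected:
    virtual ~RefCountedObject() {}

private:
    volatile LONG mRefCount;        // only ever touched via Interlocked*()
};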
Another fun mistake is forgetting the volatile keyword, which tells the compiler that a variable can be modified at any time. I got bit by this in a loop as follows:
while(!mTimeToExit) {    // bool mTimeToExit;
    if (!doSomething())
        wait();
}
Most of the time the compiler will let you slide doing this. Occasionally, though, you get a bright one that deduces that this->mTimeToExit is not modified by any code in the loop and is not aliased by another pointer. Visual C++ is especially apt to do this if you play dangerously and use /Oa (assume no aliasing). The result is that the compiler reads the flag only once and caches it, so the loop never exits at all. volatile is required to prevent the compiler from caching it. On some platforms you will need memory barriers as well to ensure that the modified flag is not trapped in the cache or written out-of-order, but fortunately the x86 processor-ordered memory model is extremely forgiving in this regard.
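For reference, the fix for the loop above is simply to declare the flag volatile; here's a sketch with the surrounding pieces stubbed out:

class Worker {
public:
    Worker() : mTimeToExit(false) {}

    void RequestExit() {
        mTimeToExit = true;         // set from another thread
    }

    void Run() {
        while(!mTimeToExit) {       // volatile forces a fresh read every pass
            if (!doSomething())
                wait();
        }
    }

private:
    bool doSomething() { return false; }   // stand-ins for the real work
    void wait() {}

    volatile bool mTimeToExit;
};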
Application
I meant to write about the current state of my new video filter system, but I realized that just writing about multithreading first would be an entire blog entry, and I've been a bit lax about writing. (It was about 10PM when I started writing this, and it's past midnight now, which is why I often don't blog. I still enjoy it, though.) Besides, I've wanted to write about some pet peeves for a while now. The multithreading issues in the new video filter system have been bad enough that I've had to write a debugging console into VirtualDub to debug frame aliasing corruption in the frame caches, which is due to the complexity of fully asynchronous operation. Delving into that is more than enough for another day, however.