§ Multithreading is hard

As increases in CPU clock speed slow and it becomes harder to scale chips, CPU manufacturers are shifting toward multicore designs, and the computer industry is taking a greater interest in multithreaded applications as the way to keep improving performance. I have a prediction about this: we will see lots of buggy programs for a while. We already got a taste of this when SMP and then hyperthreaded CPUs started appearing, and programs, particularly drivers, started blowing up on systems that could simultaneously run more than one logical thread. The reason I predict this is simple:

Efficient multithreading is hard.

The reason I bring this up is that as I try to rewrite my video filter system to take advantage of multiple threads, I find that it is a very difficult task and that it is taking much longer than I would like. This is not to say that I even remotely believed it was easy, but I like challenges more than chores. Multithreading requires a new way of thinking and invites a whole new class of bugs that makes it much more challenging than regular single-threaded programming. Here are a few of the tips (or rather scars) that I've picked up over the years.

How not to do parallelism

I cringe when I hear people say that they're writing single-threaded code now but plan to make it multithreaded later. This is anywhere from hard to impossible and is asking for disaster.

The first n00b mistake I usually see is people thinking that multithreading code simply requires adding locks everywhere. The result is usually at least one of:

- a program that runs slower than the single-threaded version, because the threads spend most of their time contending for locks;
- a program that deadlocks.

The other possibility is a complete lack of locks, in which case running the application becomes a crapshoot and the odds on the Don't Pass bet start looking pretty good.


If you are writing multithreaded code and don't know what a lock hierarchy is, look it up. Now.

There are four conditions for a deadlock to occur:

- hold and wait: a thread holds one lock while waiting for another;
- circular wait: there is a cycle of threads, each waiting for a lock the next one holds;
- mutual exclusion: only one thread can hold a lock at a time;
- no preemption: a lock cannot be forcibly taken away from its holder.

This is really easy to accomplish with a stock mutex such as a Win32 critical section: the last two conditions are already satisfied. All you need is one thread trying A->B and another thread doing B->A, and you are skirting disaster. What a lock hierarchy does is enforce a particular ordering so you can never have a cycle. If you enforce that you always take B after A whenever you take both, it isn't possible to have an A<->B deadlock. You don't need to establish a strict ordering; a weak ordering is sufficient to prevent cycles, which you can get by arranging all locks in a tree. Another helpful tip is that if a lock is only a leaf, i.e. you never take any nested locks with it, that lock can never contribute to a deadlock.
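One way to make a lock hierarchy enforceable rather than just documented is to check it at runtime in debug builds. Here's a minimal sketch of that idea (the class and names are mine, not VirtualDub's): each lock carries a rank, a thread-local variable tracks the highest rank currently held, and taking a lock whose rank isn't strictly greater trips an assert before the ordering violation can become a field deadlock.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical debug-checked lock: ranks must strictly increase as locks
// are nested, so any A->B vs. B->A ordering violation asserts immediately.
class RankedMutex {
public:
    explicit RankedMutex(int rank) : mRank(rank) {}

    void Lock() {
        assert(sHeldRank < mRank && "lock hierarchy violation");
        mMutex.lock();
        mPrevRank = sHeldRank;   // only the holder touches this while locked
        sHeldRank = mRank;
    }

    void Unlock() {
        sHeldRank = mPrevRank;
        mMutex.unlock();
    }

private:
    std::mutex mMutex;
    int mRank;
    int mPrevRank = 0;
    static thread_local int sHeldRank;
};

thread_local int RankedMutex::sHeldRank = 0;
```

With `RankedMutex a(1), b(2);`, taking A then B works, while taking B then A asserts in a debug build instead of deadlocking at some later, worse time.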

One place where you can get smacked with a deadlock is whenever you have callbacks. Firing a callback while holding an internal lock is a recipe for disaster if the callback then issues a synchronous request on another thread that tries to use your object. There are often ways to finagle the code so it drops the lock when issuing the callback, but if you're not careful these can cause internal deadlocks instead, so I like to simply keep callbacks as simple as possible. Posting messages to a queue is a good solution.
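One simple way to drop the lock before firing callbacks is to snapshot the callback list under the lock and invoke the copies afterward. A sketch of that pattern, using standard-library types rather than anything from VirtualDub:

```cpp
#include <functional>
#include <mutex>
#include <vector>

// Illustrative notifier: callbacks are registered under mLock, but invoked
// with the lock released, so a callback that re-enters this object (even
// from another thread issuing a synchronous request) cannot deadlock on it.
class Notifier {
public:
    void AddCallback(std::function<void()> cb) {
        std::lock_guard<std::mutex> guard(mLock);
        mCallbacks.push_back(std::move(cb));
    }

    void Notify() {
        std::vector<std::function<void()>> local;
        {
            std::lock_guard<std::mutex> guard(mLock);
            local = mCallbacks;      // snapshot taken under the lock
        }
        for (auto& cb : local)       // invoked with the lock dropped
            cb();
    }

private:
    std::mutex mLock;
    std::vector<std::function<void()>> mCallbacks;
};
```

The cost is that a callback may run just after it was removed by another thread; that's the usual trade-off of this pattern, and one reason posting messages to a queue is often cleaner still.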

Another gotcha is hidden locks: all locks must be accounted for in a lock hierarchy. The Win32 loader lock is usually what bites people in the a$$. If you are executing code in DllMain(), you are under the loader lock, and that includes static constructors and destructors in a DLL. In particular, waiting for a thread to exit from a static destructor practically guarantees a deadlock, since it only takes one DLL that hasn't called DisableThreadLibraryCalls() to receive the thread detach call to its DllMain(). Another one I've stepped on is attempting to send window messages from the callback of a DirectShow sample grabber: if you attempt to Stop() the filter graph from the UI thread, that thread will block waiting for the Sample Grabber to release a sample, and the callback will block in SendMessage() waiting for the message loop.

Windows NT and its descendants offer some help with regard to tracking down deadlocks. If a process is started under a debugger, all of its critical sections will have debug nodes associated with them that are hooked into a central list. WinDbg's !locks command will dump this list and show you all the locks that are currently outstanding along with the thread IDs of the holders. This will very rapidly show you which critical sections and threads are involved in the cycle.

Lock-stepping threads

Multithreading doesn't help with latency, only with throughput. If you want a particular operation to run faster with multiple threads, you have to break it apart into smaller stages and run them in parallel without lock-stepping the pipeline stages together. If you break a single operation into three parts with a linear dependency of A->B->C and don't have a second piece of data to process, you aren't going to accomplish anything because you can't possibly run more than one stage at a time. It's thus desirable to maintain queues between stages to try to keep all of them busy and to reduce lock wait time. Such queues don't necessarily have to be very long; just one buffer can help a lot, and two buffers may be enough. It works fine for video display after all (double-buffering and triple-buffering, respectively).

Here's a stereotypical way to process a thread-safe queue:

void ProcessQueue() {
    queueLock.Lock();
    Queue::const_iterator it(queue.begin()), itEnd(queue.end());
    for(; it!=itEnd; ++it)
        Process(*it);
    queue.clear();
    queueLock.Unlock();
}

Assuming Process() is simple, it's safe, right? Well, yes, except that the entire queue is locked while you're processing it. If this is being called from a worker thread that isn't doing much else other than calling this function, you might find that the other threads that are trying to post into the queue are blocking often and the queue can't be kept full enough to keep the worker busy.

What you want to do is hold the lock for the shortest time possible:

void ProcessQueue() {
    queueLock.Lock();
    queue2.swap(queue);
    queueLock.Unlock();
    Queue::const_iterator it(queue2.begin()), itEnd(queue2.end());
    for(; it!=itEnd; ++it)
        Process(*it);
    queue2.clear();
}

This increases the memory requirements slightly and requires an O(1) swappable container, but otherwise gets you much lower lock contention for very little cost. Notice that only the entry queue needs to be locked; queue2 doesn't need to be locked because it is only ever touched by the ProcessQueue() function.
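For concreteness, here is a self-contained, runnable version of the same swap trick using standard-library types (the names and the int payload are just for illustration): the lock is held only for the O(1) swap, and all the actual processing happens on the local copy.

```cpp
#include <deque>
#include <mutex>

// Shared state: producers push under queueLock; the worker swaps the whole
// queue out under the lock, then processes it with the lock released.
std::mutex queueLock;
std::deque<int> sharedQueue;

int processedSum = 0;               // stand-in for real per-item work

void Process(int v) { processedSum += v; }

void PostToQueue(int v) {
    std::lock_guard<std::mutex> guard(queueLock);
    sharedQueue.push_back(v);
}

void ProcessQueue() {
    std::deque<int> local;
    {
        std::lock_guard<std::mutex> guard(queueLock);
        local.swap(sharedQueue);    // O(1); sharedQueue is now empty
    }
    for (int v : local)             // lock already released
        Process(v);
}
```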

You'll notice that in the above, all processing occurs in the form of calls to a Process() function. This can be a bit of a pain if you have a series of asynchronous calls that need to happen in a sequence. When writing multithreaded code I often lament that C++ doesn't support coroutines, which would make it much easier to write such sequences without resorting to a thread per sequence. Support for aborting would be required, though, perhaps handled by throwing an exception from the continuation point. Instead, I have to swiss-cheese my routines into a state-machine-like form.

Efficient multithreaded systems have low lock contention, since every clock you spend waiting on a lock is a clock that could have been spent doing useful work instead. That means worker threads need to have something to work on ahead of time and can't just receive jobs right out of the previous stage like a single-threaded system might. This is why VirtualDub has a pipeline of data buffers between the I/O thread and the processing thread: it covers the varying latency of I/O reads and ensures that the processing thread always has something to do, assuming that the I/O thread can keep up.

Atomic primitives are a cheap trick to avoid locks: they're basically single operations that the CPU implements as uninterruptable, so you get the safety of a lock without having to take one. In Win32, these are implemented by the Interlocked*() primitives, many of which are directly implemented by the Visual C++ compiler as intrinsics that generate LOCKed instructions. Using these will often result in both faster and smaller code, and they're great for counters. The catch is that you have to squeeze all of your interactions into a single word, as that is usually all you can atomically manipulate. I'm fond of the "single-pointer queue," which simply uses InterlockedExchangePointer() to push data in and out of the queue. This works best when the consumer only cares about the latest data item, not all of the ones since its last pass.
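A sketch of that single-pointer queue, using std::atomic's exchange as the portable equivalent of InterlockedExchangePointer() (the Frame type and function names are illustrative, not from VirtualDub):

```cpp
#include <atomic>

struct Frame { int id; };

// One atomic pointer is the entire "queue": it holds either the newest
// unconsumed item or null.
std::atomic<Frame*> gLatest{nullptr};

// Producer: publish the newest item with a single atomic exchange, and
// discard whatever stale item the consumer never got around to taking.
void PublishFrame(Frame* f) {
    Frame* old = gLatest.exchange(f);
    delete old;                       // harmless if null
}

// Consumer: atomically take the slot, leaving null behind; returns null
// if nothing new has arrived since the last call.
Frame* TakeLatestFrame() {
    return gLatest.exchange(nullptr);
}
```

Note how this naturally implements "latest item wins": intermediate frames the consumer never saw are simply thrown away, which is exactly what you want for something like a preview display.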

Sleep() is not a synchronization primitive

I cannot count how many times I have seen people do this:

while(!flag)
    Sleep(1);
Now, there are a number of variants of this in common use, including Sleep(10), Sleep(100), and even Sleep(500). They all have one thing in common: they all suck. This either causes lots of unnecessary context switches or adds latency to the wakeup, and simply leads to poor performance. It also tends to lead to bugs because the flag is often not set in a thread-safe manner: it needs to be volatile at least, and often needs memory barriers or atomic operations, but invariably just ends up being a plain bool. See below for why this is bad.

The common place I see this is waiting for a Win32 thread to exit. I used to come up with all sorts of goofy, elaborate event-based systems to do this myself until I discovered that thread handles are waitable. Just use WaitForSingleObject() on the thread handle and you're done. No risking race conditions with the thread exit code. I think pthreads has pthread_join() for this too.

Every once in a while, a poor soul will reinvent the really bad version of this by using Sleep(0), which consumes a lot of CPU time. Sleep(0) on Win32 also only yields to threads of equal or higher priority. I always know when someone has used one of these because all the fans suddenly turn on in my computer. What's really fun is when the zero-wait loop ends up starving a lower-priority thread that is supposed to set the flag, and what's supposed to be a 10ms wait becomes 400ms or even seconds. This situation is known as priority inversion.

I try never to use polling loops like this within VirtualDub: I always try to wait on events whenever possible so waiting threads can wake up as soon as possible. When I eliminated a polling loop from the shutdown of the Dubber class a while back, it dropped the time for stopping a preview from about 400ms to nearly instantaneous.

A bit too lockless

Nearly any data that is shared needs to be under a lock or handled with equivalent lockless algorithms. The one major exception is data that is constant and treated as read-only the entire time that multiple threads are using it; any mutable data definitely needs to be protected. This includes even the tiniest data, especially counters such as reference counts. Contrary to popular opinion, using ++ on a variable in C or C++ doesn't ensure that you will get a single-instruction increment:

  00000000: A1 00 00 00 00     mov         eax,[?blah@@YAHXZ]
  00000005: 40                 inc         eax
  00000006: A3 00 00 00 00     mov         [?blah@@YAHXZ],eax
  0000000B: C3                 ret

The race window is only a couple of instructions wide. Would you actually hit it in practice? The answer is an emphatic yes, even on single-CPU systems, and in as little as a couple of hours; if the scheduling points happen to line up just right, you might even hit it constantly on a single-CPU system. If you manage to get the compiler to generate a single INC instruction you are OK on a non-HT single CPU, even without a LOCK prefix, because x86 uses instruction restart. On any other system you must use a locked atomic increment if you are not using explicit locks, even if the parallel execution is just from Hyperthreading Technology. For these reasons it's essential to test on both single-CPU and HT/SMP systems. Microsoft COM requires that IUnknown reference counting be implemented in a thread-safe manner, preferably using InterlockedIncrement() and InterlockedDecrement(); I do this as well for my own internal refcounts using a class called VDAtomicInt.

Another fun mistake is forgetting the volatile keyword, which tells the compiler that a variable can be modified at any time. I got bit by this in a loop as follows:

while(!mTimeToExit) {   // bool mTimeToExit;
    if (!doSomething())
        break;
}

Most of the time the compiler will let you slide doing this. Occasionally, though, you get a bright one that deduces that this->mTimeToExit is not modified by any code in the loop and is not aliased by another pointer. Visual C++ is especially apt to do this if you play dangerously and use /Oa (assume no aliasing). The result is that the compiler reads the flag only once and caches it, so the loop never exits at all. volatile is required to prevent the compiler from caching it. On some platforms you will need memory barriers as well to ensure that the modified flag is not trapped in the cache or written out-of-order, but fortunately the x86 processor-ordered memory model is extremely forgiving in this regard.
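Today the cleaner fix is std::atomic<bool> rather than volatile: it both prevents the compiler from caching the load and supplies the cross-thread ordering guarantees that volatile alone does not give you on weakly ordered platforms. A minimal sketch (names are mine, echoing the loop above):

```cpp
#include <atomic>

std::atomic<bool> mTimeToExit{false};

// The atomic load cannot be hoisted out of the loop by the optimizer, so
// a store from another thread is guaranteed to eventually be observed.
// maxIterations stands in for real work and keeps the sketch testable.
int RunUntilTold(int maxIterations) {
    int iterations = 0;
    while (!mTimeToExit.load() && iterations < maxIterations)
        ++iterations;               // stand-in for doSomething()
    return iterations;
}
```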


I meant to write about the current state of my new video filter system, but I realized that just writing about multithreading first would be an entire blog entry, and I've been a bit lax about writing. (It was about 10PM when I started writing this, and it's past midnight now, which is why I often don't blog. I still enjoy it, though.) Besides, I've wanted to write about some pet peeves for a while now. The multithreading issues in the new video filter system have been bad enough that I've had to write a debugging console into VirtualDub to debug frame aliasing corruption in the frame caches, which is due to the complexity of fully asynchronous operation. Delving into that is more than enough for another day, however.


Comments posted:

About that volatile keyword.. It's probably very easy to forget to define one. I wonder if this would be a valid workaround:

bool Process() {
    Lock(c_s); // Lock is a class, the destructor leaves c_s
    /* code */
    return bReturn; // bool bReturn;
}


So, the loop will break when bReturn == false, and bReturn can be changed in another thread (under c_s). The tricky part is the destructor call: whether it occurs before or after reading bReturn in Process().

BiShop - 20 08 05 - 06:14

"When writing multithreaded code I often lament that C++ doesn't support coroutines" -- If you haven't seen it already, you might want to take a look at "Coroutines in C" by Simon Tatham:

You can see it in action in Putty's SSH protocol implementation:
(search for crBegin etc macros)

It's possibly one of the worst abuses of CPP that I've seen in real life code, but does actually work if you are careful and makes things like network protocol state machines much easier to read.

Petteri Kangaslampi - 20 08 05 - 06:20

BTW: a little bug in that code I posted:
Lock lock(critical_section);
instead of
Lock(critical_section);
BiShop - 20 08 05 - 06:34

Nope. It will tend to work because locks are typically handled by an OS function, about which the compiler doesn't know anything and has to assume that anything can happen within, including formatting a hard disk or changing bReturn. However, if the lock is done entirely in user-space code and global optimizations are enabled, the compiler might still be able to detect that the variable "can't be changed." This is especially apt to happen if inlining occurs. Volatile is still required.

That is indeed an awful hack, and unfortunately, using static variables means that the coroutines aren't reentrant, which is a disaster for multithreading. It's also incompatible with several C++ constructs, notably construction in the middle of a function and exception handling.

The C# compiler in Visual Studio .NET 2005 has a really neat trick: it can invisibly transform an iterator member function into a state machine, so that you can use "yield return" in the middle of a function. It simply stores the locals in a custom class object that implements IEnumerator and rewrites the control flow into a state machine. Unfortunately, the fatal flaw is that you can only yield in the top-level function, because it still uses the native stack.

Phaeron - 20 08 05 - 15:03

"On some platforms you will need memory barriers as well to ensure that the modified flag is not trapped in the cache or written out-of-order, but fortunately the x86 processor-ordered memory model is extremely forgiving in this regard."

if it is indeed so, then why are there instructions (on the P4) for implementing memory barriers? do you say that they are superfluous? ...and is there a Windows API to access these instructions, or can they be called only using asm{} ?

User - 21 08 05 - 09:54

The P4 fencing instructions are usually only required when accessing memory that has been marked as Write Combining (WC) or using streaming write instructions that force the WC memory model (movnti, movntq, movntps, maskmovq). Write combining memory has out-of-order writes, so you need to fence them to ensure that the data has been written to memory before other code or even another device tries to read it. In user code, you will usually only have to worry about issuing __asm {sfence} or the _mm_sfence() compiler intrinsic when doing streaming writes; the other case you're likely to hit is video memory, but in that case either Direct3D or the video driver is likely to do the store fence for you on the surface unlock.

All normal memory that you deal with in Windows on x86 is marked write-back (WB). The write-back memory model uses strong write ordering, in which all writes must execute in-order. (Actually, it uses processor ordering, which means that writes are only strongly ordered with respect to a single CPU, but that difference is not usually a problem.) The strict write ordering means that you never have to worry about the release of a lock getting committed to memory before the shared variable that it protects, which would violate the lock. On some platforms a weak memory model is used and writes may be reordered, so synchronization primitives and lockless algorithms must use the appropriate fences to force write ordering when necessary.

Phaeron - 21 08 05 - 17:33

but... most memcpy's i've seen (for example on bsd, or the recommended intel/AMD memcpy implementations) use movntq! and i haven't seen any sfence instructions before / after it. so, every time i call memcpy() i can get scrambled write-order, even on a single processor. isn't that a bug? and is hyperthreading considered 1 or 2 procs here? what about a multicore chip (intel merom for example)? and don't amd/intel chips do write-combining on complete cache lines (64 or 128 bytes), even if memory is defined as write-back? this assumption doesn't seem to be a very good one on most new intel/amd processors...

User - 22 08 05 - 00:25

The AMD fast memcpy(), at least, most definitely does sfence at the end; streaming copiers that don't fence in some manner, and don't declare that it must be done by the caller, are broken. You won't notice it in most cases because the CPU only has a very limited number of write buffers that will fill up very quickly, but it's dangerous nevertheless. Also, you probably won't see it on a single CPU system because the CPU will forward from the write buffers to loads; you'd need to either have multiple CPUs or be working with I/O memory on devices to hit the problems.

Hyperthreading Technology exposes most or all of the problems you'd see on an SMP system. Multiple-thread races, including an unlocked single-instruction ADD to memory, can definitely be seen on one HT-enabled processor. I don't know if a WC race can, though, since the write buffers may be shared. I think multicore CPUs have separate L2 caches, so I would think that those would be virtually the same as SMP with regard to what we're talking about here. But I haven't looked into that.

The write combining that is done with write-back (WB) memory is different than with write-combining (WC) memory. In the WB case, the writes are collected into the L1 cache line, and for that to happen, the CPU first has to have exclusive access to it. In a multi-CPU system, this results in a Read For Ownership (RFO) request going out to the other CPUs to obtain exclusivity on that cache line. The other CPUs then have to fire bus requests for the modified cache line to be written back and returned to shared status so they can read the modified data. WB semantics require that this arbitration happen in such a way that externally the writes occur in the correct sequence, no matter how the first CPU handles it internally.

Section 7.2 of the third volume of the IA-32 architecture manual covers memory ordering on IA-32 in excruciating detail, including the effect of using instructions that force write combining mode. It's a few very dense pages that ultimately assure you that the IA-32 memory model is very forgiving.

Phaeron - 22 08 05 - 01:50
