Tuesday, June 26, 2007

Got Multicore? Think Asymmetric!

Multicore CPU are now widely available, yet many applications are not tapping into their true potential. Sure, web applications, and more generally container-based applications have an inherent degree of coarse parallelism (basically at the request level), and they will scale fairly well on new CPU. However, most client-side applications don't fall in the same pattern. Also, some server-side applications (like batch processing) are not intrinsically parallel as well. Or maybe they are?

A few months ago, I was consulting on the design of the next generation of a (server-side) banking application. One of the modules was a batch processor, basically importing huge files into a database. For several reasons (file format, business policies), the file had to be read sequentially, processed sequentially, and imported into the database. The processing time was usually dominated by a single huge file, so the obvious technique to exploit a multicore (use several instances to import different files in parallel) would have not been effective.
Note that when we think of parallelism in this way, we're looking for symmetric parallelism, where each thread performs basically the same job (process a request, or import a file, or whatever). There is only so much you can do with symmetrical parallelism, especially on a client (more on this later). Sometimes (of course, not all the times), it's better to think asymmetrically, that is, model the processing as a pipeline.

Even for the batch application, we can see at least three stages in the pipeline:
- reading from the file
- doing any relevant processing
- storing into the database
You can have up to three different threads performing these tasks in parallel: while thread 1 is reading record 3, thread 2 will process record 2, and thread 3 will store [the processed] record 1. Of course, you need some buffering in between (more on this in a short while).
Actually, in our case, it was pretty obvious that the processing wasn't taking enough CPU to justify a separate thread: it could be merged with the read file operation. What was actually funny (almost exhilarating :-) was to discover that despite the immensely powerful database server, storing into the database was much slower than reading from the file (truth to be said, the file was stored in an immensely powerful file server as well). A smart guy in the bank quickly realized that it was our fault: we could have issued several parallel store operations, basically turning stage two of the pipeline into a symmetrical parallel engine. That worked like a charm, and the total time dropped by a factor of about 6 (more than I expected: we were also using the multi-processor, multi-core DB server better, not just the batch server multicore CPU).

Just a few weeks later (meaningful coincidence?), I stumbled across a nice paper: Understand packet-processing performance when employing multicore processors by Edwin Verplanke (Embedded Systems Design Europe, April 2007). Guess what, their design is quite similar to ours, an asymmetric pipeline with a symmetric stage.

Indeed, the pipeline model is extremely useful also when dealing with legacy code which has never been designed to be thread-safe. I know that many projects aimed at squeezing some degree of parallelism out of that kind of code fails, because the programmers quickly find themselves adding locks and semaphores everywhere, thus slowing down the beast so much that there is either no gain or even a loss.
This is often due to an attempt to exploit symmetrical parallelism, which on legacy, client-side code is a recipe for resource contention.Instead, thinking of pipelined, asymmetrical parallelism often brings some good results.
For instance, I've recently overheard a discussion on how to make a graphical application faster on multicore. One of the guy contended that since the rendering stage is not thread-safe, there is basically nothing they can do (except doing some irrelevant background stuff just to keep a core busy). Of course, that's because he was thinking of symmetrical parallelism. There are actually several logical stages in the pipeline before rendering takes place: we "just" have to model the pipeline explicitly, and allocate stages to different threads.

As I've anticipated, pipelines need some kind of buffering between stages. Those buffers must be thread safe. The banking code was written in C#, and so we simply used a monitor-protected queue, and that was it. However, in high-performance C/C++ applications we may want to go a step further, and look into lock-free data structures.

A nice example comes from Bjarne Stroustrup himself: Lock-free Dynamically Resizable Arrays. The paper has also a great bibliography, and I must say that the concept of descriptor (by Harris) is so simple and effective that I would call it a stroke of genius. I just wish a better name than "descriptor" was adopted :-).

For more predictable environments, like packet processing above, we should also keep in mind a simple, interesting pattern that I always teach in my "design patterns" course (actually in a version tailored for embedded / real-time programming, which does not [yet] appear on my website [enquiries welcome :-)]. You can find it in Pattern Languages of Program Design Vol. 2, under the name Resource Exchanger, and it can be easily made lock-free. I don't know of an online version of that paper, but there is a reference in the online Pattern Almanac.
If you plan to adopt the Resource Exchanger, make sure to properly tweak the published design to suit your needs (most often, you can scale it down quite a bit). Indeed, over the years I've seen quite a few hard-core C programmers slowing themselves down in endless memcpy calls where a resource exchanger would have done the job oh so nicely.

A final note: I want to highlight the fact that symmetric parallelism can still be quite effective in many cases, including some kind of batch processing or client-side applications. For instance, back in the Pentium II times, I've implemented a parallel sort algorithm for a multiprocessor (not multicore) machine. Of course, there were significant challenges, as the threads had to work on the same data structure, without locks, and (that was kinda hard) without having one processor invalidating the cache line of the other (which happens quite naturally in discrete multiprocessing if you do nothing about it). The algorithm was then retrofitted into an existing application. So, yes, of course it's often possible to go symmetrical, we just have to know when to use what, at which cost :-).