Tuesday, June 26, 2007

Got Multicore? Think Asymmetric!

Multicore CPU are now widely available, yet many applications are not tapping into their true potential. Sure, web applications, and more generally container-based applications have an inherent degree of coarse parallelism (basically at the request level), and they will scale fairly well on new CPU. However, most client-side applications don't fall in the same pattern. Also, some server-side applications (like batch processing) are not intrinsically parallel as well. Or maybe they are?

A few months ago, I was consulting on the design of the next generation of a (server-side) banking application. One of the modules was a batch processor, basically importing huge files into a database. For several reasons (file format, business policies), the file had to be read sequentially, processed sequentially, and imported into the database. The processing time was usually dominated by a single huge file, so the obvious technique to exploit a multicore (use several instances to import different files in parallel) would have not been effective.
Note that when we think of parallelism in this way, we're looking for symmetric parallelism, where each thread performs basically the same job (process a request, or import a file, or whatever). There is only so much you can do with symmetrical parallelism, especially on a client (more on this later). Sometimes (of course, not all the times), it's better to think asymmetrically, that is, model the processing as a pipeline.

Even for the batch application, we can see at least three stages in the pipeline:
- reading from the file
- doing any relevant processing
- storing into the database
You can have up to three different threads performing these tasks in parallel: while thread 1 is reading record 3, thread 2 will process record 2, and thread 3 will store [the processed] record 1. Of course, you need some buffering in between (more on this in a short while).
Actually, in our case, it was pretty obvious that the processing wasn't taking enough CPU to justify a separate thread: it could be merged with the read file operation. What was actually funny (almost exhilarating :-) was to discover that despite the immensely powerful database server, storing into the database was much slower than reading from the file (truth to be said, the file was stored in an immensely powerful file server as well). A smart guy in the bank quickly realized that it was our fault: we could have issued several parallel store operations, basically turning stage two of the pipeline into a symmetrical parallel engine. That worked like a charm, and the total time dropped by a factor of about 6 (more than I expected: we were also using the multi-processor, multi-core DB server better, not just the batch server multicore CPU).

Just a few weeks later (meaningful coincidence?), I stumbled across a nice paper: Understand packet-processing performance when employing multicore processors by Edwin Verplanke (Embedded Systems Design Europe, April 2007). Guess what, their design is quite similar to ours, an asymmetric pipeline with a symmetric stage.

Indeed, the pipeline model is extremely useful also when dealing with legacy code which has never been designed to be thread-safe. I know that many projects aimed at squeezing some degree of parallelism out of that kind of code fails, because the programmers quickly find themselves adding locks and semaphores everywhere, thus slowing down the beast so much that there is either no gain or even a loss.
This is often due to an attempt to exploit symmetrical parallelism, which on legacy, client-side code is a recipe for resource contention.Instead, thinking of pipelined, asymmetrical parallelism often brings some good results.
For instance, I've recently overheard a discussion on how to make a graphical application faster on multicore. One of the guy contended that since the rendering stage is not thread-safe, there is basically nothing they can do (except doing some irrelevant background stuff just to keep a core busy). Of course, that's because he was thinking of symmetrical parallelism. There are actually several logical stages in the pipeline before rendering takes place: we "just" have to model the pipeline explicitly, and allocate stages to different threads.

As I've anticipated, pipelines need some kind of buffering between stages. Those buffers must be thread safe. The banking code was written in C#, and so we simply used a monitor-protected queue, and that was it. However, in high-performance C/C++ applications we may want to go a step further, and look into lock-free data structures.

A nice example comes from Bjarne Stroustrup himself: Lock-free Dynamically Resizable Arrays. The paper has also a great bibliography, and I must say that the concept of descriptor (by Harris) is so simple and effective that I would call it a stroke of genius. I just wish a better name than "descriptor" was adopted :-).

For more predictable environments, like packet processing above, we should also keep in mind a simple, interesting pattern that I always teach in my "design patterns" course (actually in a version tailored for embedded / real-time programming, which does not [yet] appear on my website [enquiries welcome :-)]. You can find it in Pattern Languages of Program Design Vol. 2, under the name Resource Exchanger, and it can be easily made lock-free. I don't know of an online version of that paper, but there is a reference in the online Pattern Almanac.
If you plan to adopt the Resource Exchanger, make sure to properly tweak the published design to suit your needs (most often, you can scale it down quite a bit). Indeed, over the years I've seen quite a few hard-core C programmers slowing themselves down in endless memcpy calls where a resource exchanger would have done the job oh so nicely.

A final note: I want to highlight the fact that symmetric parallelism can still be quite effective in many cases, including some kind of batch processing or client-side applications. For instance, back in the Pentium II times, I've implemented a parallel sort algorithm for a multiprocessor (not multicore) machine. Of course, there were significant challenges, as the threads had to work on the same data structure, without locks, and (that was kinda hard) without having one processor invalidating the cache line of the other (which happens quite naturally in discrete multiprocessing if you do nothing about it). The algorithm was then retrofitted into an existing application. So, yes, of course it's often possible to go symmetrical, we just have to know when to use what, at which cost :-).

7 comments:

Anonymous said...

Per prima cosa, ciao a tutti :)
Poi, post interessante, che mostra come sia possibile "riscoprire" pratiche classiche della progettazione di device elettronici anche nella programmazione tradizionale.
In realtà non ho moto da aggiungere a questo post (ho sempre considerato Carlo un mostro sacro da quando lo leggevo sulle riviste Infomedia), se non che ho dei problemi con il link all'articolo di Stroustrup ;)

Ciao,
Eros

Carlo Pescio said...

Grazie Eros,
il link era sbagliato (mancava un http:// davanti ed era diventato un url relativo...). Dovrebbe essere a posto adesso...

Miki said...

Come al solito, un post incredibilmente interessante! Ne aprofitto subito per chiederti un approfondimento su Resource Exchanger.
Devo dire che spesso ci sorprendi con l'acutezza delle tue soluzioni: ma come riesci ad affrontare i problemi sempre da punti di vista nuovi e originali?

Carlo Pescio said...

Eros: ripensandoci un momento, la tua osservazione e' in effetti piuttosto profonda...
Le CPU, da tempo immemore ormai (anche monocore quindi) sono strutturate in pipeline, ad es. prefetch, decoding, ALU, ecc. Questo permetteva un piccolo parallelismo "gratuito", nel senso che la maggior parte dei programmatori non si preoccupava di "non rompere la pipeline", affidandosi al compilatore. (tipicamente la pipeline delle vecchie CPU si rompeva con i salti, con gli indirizzamenti indiretti, ecc).
Ricordo ancora quando, con l'integrazione del coprocessore matematico nella CPU, gia' si discuteva del fatto che il miglior scheduling avrebbe dovuto mescolare istruzioni intere e floating point, per dare ai due "core" la possibilita' di lavorare in parallelo (a granularita' molto fine). Crollava uno dei punti fermi, ovvero che un programma che usava solo l'aritmetica intera fosse inerentemente piu' veloce di uno che usava i floating point: il programma piu' veloce era quello misto, se riuscivamo a sfruttare bene la pipeline (cosa piu' facile in assembly che in altri linguaggi).
Via via che la struttura interna delle CPU diventa piu' complessa, il guadagno indotto dal (micro)parallelismo "gratuito" diventa sempre piu' piccolo, e l'architettura del software va in qualche modo adattata per sfruttare a dovere un (macro) parallelismo...

Carlo Pescio said...

Miki: sul resource exchanger faro' in modo di tornare in futuro.
La domanda e' troppo complimentosa :-), diciamo che ho da tempo in mente un post sul ruolo del (famigerato?) "software architect", dove forse riusciro' a dare una risposta indiretta...

Anonymous said...

Ciao Carlo,
non pensavo di aver fatto un commento così profondo... il fatto è che anche se vengo da un background da "softwareista" puro ho avuto la fortuna di lavorare per un po' di tempo nella progettazione di device digitali e di programmare a basso livello DSP... in questo mondo stare attenti al tipo di parallelismo a cui tu accennavi è quasi la regola, e farlo per qualche anno "mi ha svegliato molto rispetto a questa prassi".
Quindi il mio era un moto di felicità verso un analisi interessante di un caso molto reale dove questo tipo di tecniche ha portato giovamento.

Anonymous said...

Concordo in pieno con le osservazioni di Eros e di Carlo.
Aggiungo da parte mia che il multicore e' davvero una rivoluzione per chi progetta software. L'aumento di potenza questa volta non e' "lineare", obbliga a ripensare le architetture e gli algoritmi (se lo si vuole sfruttare).
E sono tantissime le applicazioni che possono beneficiare del multicore... Il piu' e' rendersene conto: spesso si pensa che la maggiore potenza serva solo a chi sa gia' di averne bisogno, a chi non basta l'hw attuale. Invece io sono convinto che serve a tutti, per fare qualcosa che prima era impossibile.
Penso che riuscire a sfruttare il multicore sia una delle sfide di questi anni.