You are viewing kristiannielsen

Kristian Nielsen - Micro-benchmarking pthread_cond_broadcast()
September 5th, 2010
06:38 pm

[Link]

Previous Entry Share Next Entry
Micro-benchmarking pthread_cond_broadcast()

In my work on group commit for MariaDB, I have the following situation:

A group of threads are going to participate in group commit. This means that one of the threads, called the group leader, will run an fsync() for all of them, while the other threads wait. Once the group leader is done, it needs to wake up all of the other threads.

The obvious way to do this is to have the group leader call pthread_cond_broadcast() on a condition that the other threads are waiting for with pthread_cond_wait():

  bool wakeup= false;
  pthread_cond_t wakeup_cond;
  pthread_mutex_t wakeup_mutex

Waiter:

  pthread_mutex_lock(&wakeup_mutex);
  while (!wakeup)
    pthread_cond_wait(&wakeup_cond, &wakeup_mutex);
  pthread_mutex_unlock(&wakeup_mutex);
  // Continue processing after group commit is now done.

Group leader:

  pthread_mutex_lock(&wakeup_mutex);
  wakeup= true;
  pthread_cond_broadcast(&wakeup_cond);
  pthread_mutex_unlock(&wakeup_mutex);

Note the association of the condition with a mutex. This association is inherent in the way pthread condition variables work. The mutex must be locked when calling into pthread_mutex_wait(), and will be obtained again before the call returns. (Check the man page for pthread_cond_wait() for details).

Now, when I think about how these condition variables work, something strikes me as somewhat odd.

The idea is that the broadcast signals every waiting thread to wake up. However, because of the associated mutex, only one thread will actually be able to wake up; this thread will obtain a lock on the mutex, and all other to-be-awoken threads will now have to wait for this mutex! Only after the first thread releases this mutex will the next thread wakeup holding the mutex, then after releasing the third thread can wake up, and so on.

So if we have say 100 threads waiting, the last one will have to wait for the first 99 threads to each be scheduled and each release the mutex, one after the other in a completely serialised fashion.

But what I really want is to just let them all run at once in parallel (or at least as many as my machine has spare cores for). There is another way to achieve this, by simply using a separate condition and mutex for each thread, and have the group leader signal each one individually:

Waiter:

  pthread_mutex_lock(&me->wakeup_mutex);
  while (!me->wakeup)
    pthread_cond_wait(&me->wakeup_cond, &me->wakeup_mutex);
  pthread_mutex_unlock(&me->wakeup_mutex);

Group leader:

  for waiter in <all waiters>
    pthread_mutex_lock(&waiter->wakeup_mutex);
    waiter->wakeup= true;
    pthread_cond_signal(&wakeup_cond);
    pthread_mutex_unlock(&wakeup_mutex);

This way, every waiter is free to start running as soon as woken up by the leader; no waiters have to wait for one another. This seems advantageous, especially as number of cores increases (rumours are that 48 core machines are becoming commodity).

"Seems" advantageous. But is it really? Let us micro-benchmark it.

For this, I start up 5000 threads. Each thread goes to wait on a condition, either a single shared one, or distinct in each thread. The main program then signals the threads to wakeup, either with a single pthread_cond_broadcast(), or with one pthread_cond_signal() per thread. Each thread records the time they woke up, and the main program collects these times and computes how long it took between starting to signal the condition(s) and wakeup of the last thread. (Here is the full C source code for the test program).

I ran the program on an Intel quad Core i7 with hyperthreading enabled, the most parallel machine I have easy access to. The results is the following:

pthread_cond_broadcast(): 46.9 msec
pthread_cond_signal(): 17.6 msec

Conclusion: pthread_cond_broadcast() is slower, as I speculated. I would expect the effect to be more pronounced on systems with more cores; it would be interesting if readers with access to such systems could try the test program and comment below on the results.

Tags: , , , ,

(10 comments | Leave a comment)

Comments
 
From:(Anonymous)
Date:September 5th, 2010 08:14 pm (UTC)

CV is not a event.

(Link)
It seems what you want is a "event". Try using a semaphore (sem_post/sem_getvalue) or FUTEX_WAKE.
[User Picture]
From:pingback_bot
Date:September 5th, 2010 08:15 pm (UTC)

Micro-benchmarking pthread_cond_broadcast()

(Link)
From:jorela
Date:September 6th, 2010 09:28 am (UTC)

Added futex wake (broadcast/signal) and tested on "Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz"

(Link)
jonas@perch:/tmp> for m in signal broadcast futex_signal futex_broadcast; do for i in `seq 3`; do ./a.out 1000 $m; done; done
Time for wakeup all using signal: 0.00882006 seconds
Time for wakeup all using signal: 0.00598502 seconds
Time for wakeup all using signal: 0.0101359 seconds
Time for wakeup all using broadcast: 0.00740504 seconds
Time for wakeup all using broadcast: 0.00699377 seconds
Time for wakeup all using broadcast: 0.00806093 seconds
Time for wakeup all using futex_signal: 0.00543499 seconds
Time for wakeup all using futex_signal: 0.00562286 seconds
Time for wakeup all using futex_signal: 0.00497603 seconds
Time for wakeup all using futex_broadcast: 0.004071 seconds
Time for wakeup all using futex_broadcast: 0.00385094 seconds
Time for wakeup all using futex_broadcast: 0.00391102 seconds
From:kristiannielsen
Date:September 6th, 2010 10:42 am (UTC)

Re: Added futex wake (broadcast/signal) and tested on "Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz"

(Link)
Thanks for posting those numbers!

I like the futex calls in Linux, but they are not portable of course. We used them (with
fallback to pthreads stuff on non-Linux) when we implemented the NDB multithreading
support.
From:(Anonymous)
Date:September 6th, 2010 05:24 pm (UTC)

Results on a 48 cores Magny-Cours box

(Link)
On a 48 cores Magny-Cours box:

Time for wakeup all using signal: 0.0220308 seconds
Time for wakeup all using signal: 0.02145 seconds
Time for wakeup all using signal: 0.0213821 seconds
Time for wakeup all using broadcast: 0.11098 seconds
Time for wakeup all using broadcast: 0.111336 seconds
Time for wakeup all using broadcast: 0.106276 seconds

Regards,
Didier.
From:(Anonymous)
Date:September 6th, 2010 05:28 pm (UTC)

Results on a 32 cores Nehalem EX box:

(Link)

On a 32 cores Nehalem EX box (64 virtual CPUs with hyperthreading):

Time for wakeup all using signal: 0.0174789 seconds
Time for wakeup all using signal: 0.017889 seconds
Time for wakeup all using signal: 0.018132 seconds
Time for wakeup all using broadcast: 0.057833 seconds
Time for wakeup all using broadcast: 0.054671 seconds
Time for wakeup all using broadcast: 0.0584559 seconds

Regards,
Didier.
From:kristiannielsen
Date:September 6th, 2010 06:22 pm (UTC)

Re: Results on a 32 cores Nehalem EX box:

(Link)
Thanks!
So five times faster on these boxes, cool!
From:(Anonymous)
Date:February 17th, 2011 07:20 pm (UTC)

Hello, world!

(Link)
How do you do?
From:(Anonymous)
Date:March 19th, 2011 01:36 am (UTC)

personal credit score 3 in 1

(Link)
Satisfy stop me to contrive collection
From:(Anonymous)
Date:March 21st, 2011 11:27 pm (UTC)

Hi-Fi cinema

(Link)
Can I links here?
Powered by LiveJournal.com