I just spotted an excellent article on how reducing the number of cores used by a multi-threaded actually increased it’s performance. This seems counter intuitive at first, but it is a sad reality. It is very easy to create contention across threads in a multi-threaded app which in turn lowers performance.
A few months ago, I experienced similar results while hacking on ActiveMQ. I noticed that passing messages from producer connections to consumer connections was dramatically faster if the producer and consumer were being serviced by the same thread. I decided that the next version of the broker would to be need to be built using a thread management framework which could optimize itself so that those connections could collocate onto one thread if possible.
Then I saw the the libdispatch API (it forms the foundation of the Grand Central Dispatch technology in OS X) and fell in love with it’s simplicity and power. I realized that implementation of that API could in theory provides the threading optimizations I was looking for. So I started hacking on HawtDispatch, a Java/Scala clone of libdispatch.
The central concepts to libdispatch and hawtdispatch are global and serial queues. Global queues are executors which execute tasks concurrently using a fixed size thread pool. Serial queues are executors without an assigned thread and which execute tasks in FIFO order. When tasks added to a serial queue, the serial queue gets added to a global queue so that the serial queue can execute it’s tasks. Multiple serial queues execute concurrently on the global queue.
The overhead of a serial queue is very small, it’s just a several counters and a couple of linked lists. You can use them like a lightweight thread. Feel free to create thousands of them. If you squint at it just right, they allows you to use erlang style actor thread model.
Now that you have an idea how HawtDispatch is used, lets get back to what kinds of optimizations it can do to help with cross thread contention. HawtDispatch generally uses a concurrent linked list to enqueue a task in serial queue, but there are times when it can avoid that synchronization of the concurrent linked list. For example, if the serial queue is currently executing in the current thread, then an enqueue can just add the task to a non synchronized linked list. HawtDispatch also supports ‘pining’ a serial queue to one of the threads in the global queue’s thread pool. This allows you to force serial queues to collocate onto one thread so that when they do need to communicate, there is no thread contention involved.
But you still run into cases where you need to move tons of events from one serial queue to another which is executing on a different thread. For these cases, you use a custom event source. It allows you to coalesce a bunch of events generated on on thread as a single event delivered to the another queue. HawtDispatch will aggregate custom events into a thread local (to avoid contention) and once the current thread has drained all execution queues, it will deliver those custom events to their target queues.
This post is already getting kind of long, so I’ve have to do a follow up post on how all that interacts with network IO events. But the general idea is, yes, keeping stuff on 1 core is fast, but it won’t scale once your CPU bound, so having a framework like can HawtDispatch help minimize cross thread contention while still providing the ability to scale up to multiple cores as load increases.