STOMP Messaging Benchmarks: ActiveMQ vs Apollo vs HornetQ vs RabbitMQ

I’ve recently run STOMP benchmarks against the latest releases of the four most feature-packed STOMP servers: ActiveMQ, Apollo, HornetQ, and RabbitMQ.

STOMP is an asynchronous messaging protocol whose design roots are based on the HTTP protocol.  Its simplicity has made the protocol tremendously popular, since it reduces the complexity of integrating different platforms and languages.  There are a multitude of client libraries for your language of choice to interface with STOMP servers.
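To give a feel for just how HTTP-like the wire format is, here is a minimal sketch of a producer talking STOMP over a raw socket.  The broker address, credentials, and the /queue/test destination are assumptions for illustration only, not part of the benchmark code:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

/**
 * Minimal STOMP producer sketch: a frame is a command line, headers,
 * a blank line, an optional body, and a NUL terminator.
 */
public class StompSendExample {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("localhost", 61613);   // assumed broker host/port
        OutputStream out = socket.getOutputStream();
        InputStream in = socket.getInputStream();

        // CONNECT frame: command, headers, blank line, NUL terminator.
        out.write("CONNECT\nlogin:guest\npasscode:guest\n\n\u0000".getBytes("UTF-8"));
        out.flush();

        // Read the broker's CONNECTED frame (up to the NUL byte).
        StringBuilder reply = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != 0) {
            reply.append((char) c);
        }
        System.out.println(reply);

        // SEND frame: publish a small text body to a queue.
        out.write("SEND\ndestination:/queue/test\n\nhello stomp world\u0000".getBytes("UTF-8"));
        out.write("DISCONNECT\n\n\u0000".getBytes("UTF-8"));
        out.flush();
        socket.close();
    }
}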

The benchmark focuses on finding the maximum producer and consumer throughput for a variety of messaging scenarios.  For example, it benchmarks every combination of the following scenario options:

  • Queues or Topics
  • 1, 5, or 10 Producers
  • 1, 5, or 10 Consumers
  • 1, 5, or 10 Destinations
  • 20 byte, 1k, or 256k message bodies

The benchmark warms up each scenario for 3 seconds.  Then, for 15 seconds, it samples the total number of messages that were produced/consumed every second.  Finally, the destination gets drained of any messages before benchmarking the next scenario.  The benchmark generates a little HTML report with a graph for each scenario run, displaying the messaging rate for each of the servers over the 15 second sampling interval.
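Conceptually, the per-scenario measurement loop looks something like the sketch below.  This is a simplification for illustration, not the actual stomp-benchmark code; the messageCounter is assumed to be incremented by the producer/consumer client threads:

import java.util.concurrent.atomic.AtomicLong;

/** Simplified sketch of the warm-up / sample / drain cycle described above. */
public class SamplingSketch {
    // Bumped by producer/consumer threads for every message handled.
    static final AtomicLong messageCounter = new AtomicLong();

    public static void main(String[] args) throws InterruptedException {
        Thread.sleep(3000);                 // 3 second warm up; these messages are not counted
        long[] samples = new long[15];
        long last = messageCounter.get();
        for (int i = 0; i < samples.length; i++) {
            Thread.sleep(1000);             // sample once per second for 15 seconds
            long now = messageCounter.get();
            samples[i] = now - last;        // messages produced/consumed during this second
            last = now;
        }
        // ... drain the destination, then record 'samples' for this scenario's graph
    }
}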

I’ve run the benchmarks on a couple of different machines and I’ve posted the results at: http://hiramchirino.com/stomp-benchmark/
Note: the graphs can take a while to load as they are generated on the client side using the excellent flot JavaScript library.

Since anyone can get access to an EC2 instance to reproduce these results, the rest of this article will focus on the results obtained on the EC2 High-CPU Extra Large instance.  If you want to reproduce them, just spin up a new Amazon Linux 64-bit AMI and then run the following commands on it:

sudo yum install -y screen
curl https://nodeload.github.com/chirino/stomp-benchmark/tarball/master | tar -zxv
mv chirino-stomp-benchmark-* stomp-benchmark
screen ./stomp-benchmark/bin/benchmark-all


Note: RabbitMQ 2.7.0 sometimes dies midway through the benchmark.  It seems RabbitMQ does not enforce very strict flow control, and you can get into situations where it runs out of memory if you place too much load on it.  It seems that the crash becomes more likely as you increase the core speed of the CPU or reduce the amount of physical memory on the box.  Luckily, the RabbitMQ folks are aware of the issue and hopefully will fix it by the next release.

The ‘Throughput to an Unsubscribed Topic’ benchmarking scenario is interesting just to get an idea/baseline of the fastest possible rate at which a producer can send to the server.  Since there are no attached consumers, the broker should be doing very little work, since it’s just dropping all the messages that get sent to it.

The Queue Load/Unload scenarios are very important to look at if your application uses queues.  You often run into situations where messages start accumulating in a queue with either no consumers or with not enough consumers to keep up with the producer load.  This benchmark first runs a producer for 30 seconds enqueuing non-persistent messages and then runs a producer enqueuing persistent messages for 30 seconds.  Finally, it runs a consumer to dequeue the messages for 30 seconds.  An interesting observation in this scenario is that Apollo was the only server which could dequeue at around the same maximum enqueue rates, which is important if you ever want your consumers to catch up with fast producers.


The Fan In/Out Load Scenarios help you look at cases where you have either multiple producers or multiple consumers running against a single destination.  They help you see how performance will be affected as you scale up the producers and consumers.  You should follow the “10 Producers” columns and “10 Consumers” rows to really get a sense of which servers do well as you increase the number of clients on a single destination.

The Partitioned Load Scenarios look at how well the server scales as you start to increase load on multiple destinations at different message sizes.

I’ve tried to make the benchmark as fair as possible to all the contenders; all the source code to the benchmark is available on GitHub.  Please open an issue or send me a pull request if you think of ways to improve it!

HawtDispatch Event Based IO

My previous post promised a follow up to explain how network IO events are handled by HawtDispatch.  Before I get into the details, I urge you to read Mark McGranaghan’s post on Threaded vs Evented Servers.  He does an excellent job describing how event driven servers scale in comparison to threaded servers.   This post will try to highlight how HawtDispatch provides an excellent framework for the implementation of event based servers.

When implementing event based servers, there are generally two patterns used: the reactor pattern and the proactor pattern.  The reactor pattern can be thought of as a synchronous version of the proactor pattern.  In the reactor pattern, IO events are serviced by the thread running the IO handling loop.  In the proactor pattern, the thread running the IO event loop passes off the IO event to another thread for processing.  HawtDispatch can support both styles of IO processing.
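As a rough illustration of the difference, here is a sketch using plain NIO and an executor.  This is not HawtDispatch's internals, just the two dispatch styles side by side; a real server would also suspend/re-register interest ops when handing work off:

import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Rough reactor vs. proactor illustration (assumes only SocketChannels are registered). */
public class IoPatternsSketch {
    static final ExecutorService workers = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());

    static void eventLoop(Selector selector, boolean proactor) throws Exception {
        while (true) {
            selector.select();
            for (SelectionKey key : selector.selectedKeys()) {
                final SocketChannel channel = (SocketChannel) key.channel();
                if (proactor) {
                    // Proactor: hand the IO event off to another thread for processing.
                    workers.execute(new Runnable() {
                        public void run() { handle(channel); }
                    });
                } else {
                    // Reactor: service the event right here on the IO loop thread.
                    handle(channel);
                }
            }
            selector.selectedKeys().clear();
        }
    }

    static void handle(SocketChannel channel) {
        try {
            ByteBuffer buffer = ByteBuffer.allocate(4096);
            channel.read(buffer);   // ... decode the protocol from 'buffer'
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}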

HawtDispatch uses a fixed-size thread pool sized to match the number of cores on your system.  Each thread in the pool runs an IO handling loop.  When an NIO event source is created, it gets assigned to one of the threads.  When network events occur, the source causes callbacks to run on the dispatch queue targeted by the event source.  Typically that target dispatch queue is a serial queue which the application uses to handle the network protocol.  Since it’s a serial queue, the handling of the event can be done in a thread-safe way.  This is the proactor pattern, since the serial queue can execute on any of the threads in the thread pool.
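Wiring that up looks roughly like the following sketch.  The method names are from memory of the HawtDispatch API, so treat them as an approximation and check the javadocs; the channel is assumed to already be connected and configured non-blocking:

import static org.fusesource.hawtdispatch.Dispatch.*;

import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;
import org.fusesource.hawtdispatch.DispatchQueue;
import org.fusesource.hawtdispatch.DispatchSource;

public class NioSourceSketch {
    public static DispatchSource watch(final SocketChannel channel) {
        // A serial queue handles all protocol work for this connection,
        // so no explicit locking is needed (the proactor style described above).
        final DispatchQueue connectionQueue = createQueue("connection");
        DispatchSource source = createSource(channel, SelectionKey.OP_READ, connectionQueue);
        source.setEventHandler(new Runnable() {
            public void run() {
                ByteBuffer buffer = ByteBuffer.allocate(4096);
                try {
                    channel.read(buffer);   // ... decode the protocol frame here
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        source.resume();   // sources start suspended; resume to begin receiving events
        return source;
    }
}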

To use the reactor pattern, HawtDispatch supports ‘pinning’ the serial queue to a thread.  When a dispatch source is created on a pinned dispatch queue, the event source gets registered against that same ‘pinned’ thread.  The benefit of the reactor pattern is that it avoids some of the cross-thread synchronization needed for the proactor pattern and provides cheaper GCs.  The downside to the reactor pattern is that you may have to manage rebalancing network sources across all the available threads.  Luckily, HawtDispatch does support moving pinned dispatch queues and sources to different threads.

Scaling Up with HawtDispatch

I just spotted an excellent article on how reducing the number of cores used by a multi-threaded application actually increased its performance.  This seems counterintuitive at first, but it is a sad reality.  It is very easy to create contention across threads in a multi-threaded app, which in turn lowers performance.

A few months ago, I experienced similar results while hacking on ActiveMQ.  I noticed that passing messages from producer connections to consumer connections was dramatically faster if the producer and consumer were being serviced by the same thread.  I decided that the next version of the broker would need to be built using a thread management framework which could optimize itself so that those connections could be collocated onto one thread if possible.

Then I saw the libdispatch API (it forms the foundation of the Grand Central Dispatch technology in OS X) and fell in love with its simplicity and power.  I realized that an implementation of that API could in theory provide the threading optimizations I was looking for.  So I started hacking on HawtDispatch, a Java/Scala clone of libdispatch.

The central concepts in libdispatch and HawtDispatch are global and serial queues.  Global queues are executors which execute tasks concurrently using a fixed-size thread pool.  Serial queues are executors without an assigned thread which execute tasks in FIFO order.  When tasks are added to a serial queue, the serial queue gets added to a global queue so that the serial queue can execute its tasks.  Multiple serial queues execute concurrently on the global queue.
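In code, that looks roughly like the sketch below.  Again, the method names are from memory of the HawtDispatch API and are only an approximation:

import static org.fusesource.hawtdispatch.Dispatch.*;
import org.fusesource.hawtdispatch.DispatchQueue;

public class QueueBasics {
    public static void main(String[] args) throws Exception {
        // A serial queue: no dedicated thread, tasks run in FIFO order
        // on whichever global-queue thread picks the queue up.
        final DispatchQueue mailbox = createQueue("mailbox");

        for (int i = 0; i < 10; i++) {
            final int n = i;
            mailbox.execute(new Runnable() {
                public void run() {
                    // Runs serially with every other task on 'mailbox',
                    // so state guarded by this queue needs no locks.
                    System.out.println("task " + n);
                }
            });
        }

        // The global queue is a plain concurrent executor backed by the
        // fixed-size thread pool (one thread per core).
        getGlobalQueue().execute(new Runnable() {
            public void run() { System.out.println("runs concurrently"); }
        });

        Thread.sleep(1000); // crude: give the pool time to run the tasks before exiting
    }
}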

The overhead of a serial queue is very small: it’s just a few counters and a couple of linked lists.  You can use them like lightweight threads; feel free to create thousands of them.  If you squint at it just right, they allow you to use an Erlang-style actor threading model.

Now that you have an idea of how HawtDispatch is used, let’s get back to what kinds of optimizations it can do to help with cross-thread contention.  HawtDispatch generally uses a concurrent linked list to enqueue a task on a serial queue, but there are times when it can avoid the synchronization of the concurrent linked list.  For example, if the serial queue is currently executing in the current thread, then an enqueue can just add the task to a non-synchronized linked list.  HawtDispatch also supports ‘pinning’ a serial queue to one of the threads in the global queue’s thread pool.  This allows you to force serial queues to collocate onto one thread so that when they do need to communicate, there is no thread contention involved.

But you still run into cases where you need to move tons of events from one serial queue to another which is executing on a different thread.  For these cases, you use a custom event source.  It allows you to coalesce a bunch of events generated on one thread into a single event delivered to another queue.  HawtDispatch will aggregate custom events into a thread-local (to avoid contention), and once the current thread has drained all its execution queues, it will deliver those custom events to their target queues.
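A custom event source looks roughly like this sketch.  As above, the API names are from memory (EventAggregators, CustomDispatchSource, merge, getData), so double-check them against the HawtDispatch javadocs:

import static org.fusesource.hawtdispatch.Dispatch.*;

import org.fusesource.hawtdispatch.CustomDispatchSource;
import org.fusesource.hawtdispatch.DispatchQueue;
import org.fusesource.hawtdispatch.EventAggregators;

public class CoalescingSketch {
    public static void main(String[] args) throws Exception {
        DispatchQueue consumerQueue = createQueue("consumer");

        // Events merged with INTEGER_ADD: many merge(1) calls from producer
        // threads coalesce into a single integer delivered to consumerQueue.
        final CustomDispatchSource<Integer, Integer> batches =
                createSource(EventAggregators.INTEGER_ADD, consumerQueue);
        batches.setEventHandler(new Runnable() {
            public void run() {
                int count = batches.getData();   // how many events were coalesced
                System.out.println("received a batch of " + count + " events");
            }
        });
        batches.resume();

        // Producer side: fire lots of tiny events; they are buffered per thread
        // and delivered to the target queue in batches.
        for (int i = 0; i < 1000; i++) {
            batches.merge(1);
        }

        Thread.sleep(1000); // crude: let the pool drain before the JVM exits
    }
}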

This post is already getting kind of long, so I’ll have to do a follow-up post on how all of that interacts with network IO events.  But the general idea is: yes, keeping stuff on one core is fast, but it won’t scale once you’re CPU bound, so having a framework like HawtDispatch can help minimize cross-thread contention while still providing the ability to scale up to multiple cores as load increases.

Fuse Community Day: San Francisco

I just found out I’m going to be heading out to San Francisco to attend the Fuse Community Day!

Progress Software is sponsoring an Apache ServiceMix, ActiveMQ, CXF & Camel Community Day on Thursday, December 10th, at the Hyatt Hotel in Burlingame. Join us at this free event and meet committers and founders of Apache ServiceMix, ActiveMQ, CXF and Camel who have successfully implemented enterprise applications and deployed these projects in production.

Should be fun to meet users/developers of these kick-ass Apache based projects.  If you plan on going, make sure you register for the event.  It’ll be nice to meet everyone!

Python messaging: ActiveMQ and RabbitMQ

Dejan just posted a nice writeup comparing the performance of ActiveMQ to RabbitMQ in the case of python clients.  Interesting results:

… both ActiveMQ and RabbitMQ are decent brokers that will serve their purpose well in normal conditions, but put to their extremes in terms of throughput, scalability and reliability, ActiveMQ currently outperforms RabbitMQ for messaging usage in Python.