STOMP Messaging Benchmarks: ActiveMQ vs Apollo vs HornetQ vs RabbitMQ

I’ve recently run STOMP benchmarks against the latest releases of the 4 most feature-packed STOMP servers: ActiveMQ, Apollo, HornetQ, and RabbitMQ.

STOMP is an asynchronous messaging protocol whose design is rooted in the HTTP protocol.  Its simplicity has made the protocol tremendously popular, since it reduces the complexity of integrating different platforms and languages.  There are a multitude of client libraries for your language of choice to interface with STOMP servers.

The benchmark focuses on finding the maximum producer and consumer throughput for a variety of messaging scenarios.  It benchmarks every combination of the following scenario options:

  • Queues or Topics
  • 1, 5, or 10 Producers
  • 1, 5, or  10 Consumers
  • 1, 5, or 10 Destinations
  • 20 byte, 1k, or 256k message bodies
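
If every combination above is exercised, the scenario matrix is the full cross product of those options. A minimal sketch of enumerating it (illustrative, not the actual benchmark code):

```java
import java.util.ArrayList;
import java.util.List;

public class ScenarioMatrix {
    // Builds a label for every combination of the scenario options
    // listed above: destination kind x producers x consumers x
    // destinations x message body size.
    public static List<String> scenarios() {
        String[] kinds = {"queue", "topic"};
        int[] producers = {1, 5, 10};
        int[] consumers = {1, 5, 10};
        int[] destinations = {1, 5, 10};
        String[] sizes = {"20b", "1k", "256k"};
        List<String> result = new ArrayList<>();
        for (String kind : kinds)
            for (int p : producers)
                for (int c : consumers)
                    for (int d : destinations)
                        for (String size : sizes)
                            result.add(kind + "_" + p + "p_" + c + "c_" + d + "d_" + size);
        return result;
    }
}
```

That works out to 2 × 3 × 3 × 3 × 3 = 162 scenarios per server.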

The benchmark warms up each scenario for 3 seconds.  Then, for 15 seconds, it samples the total number of messages that were produced/consumed each second.  Finally, the destination is drained of any remaining messages before the next scenario is benchmarked.  The benchmark generates a little HTML report with a graph for each scenario run, displaying the messaging rate of each server over the 15-second sampling interval.
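
Since the sampling records cumulative message counters once a second, the per-second rates plotted in the report can be derived as successive differences. A small sketch of that derivation (illustrative names, not the actual benchmark code):

```java
public class RateSamples {
    // Given cumulative message counts sampled once per second,
    // returns the messages-per-second rate for each interval.
    public static long[] rates(long[] cumulativeCounts) {
        long[] rates = new long[cumulativeCounts.length - 1];
        for (int i = 1; i < cumulativeCounts.length; i++) {
            rates[i - 1] = cumulativeCounts[i] - cumulativeCounts[i - 1];
        }
        return rates;
    }
}
```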

I’ve run the benchmarks on a couple of different machines and I’ve posted the results at: http://hiramchirino.com/stomp-benchmark/
Note: the graphs can take a while to load since they are generated on the client side using the excellent Flot JavaScript library.

Since anyone can get access to an EC2 instance to reproduce those results, the rest of this article will focus on the results obtained on the EC2 High-CPU Extra Large instance.  If you want to reproduce them, just spin up a new Amazon Linux 64-bit AMI and then run the following commands on it:

sudo yum install -y screen
curl https://nodeload.github.com/chirino/stomp-benchmark/tarball/master | tar -zxv
mv chirino-stomp-benchmark-* stomp-benchmark
screen ./stomp-benchmark/bin/benchmark-all


Note: RabbitMQ 2.7.0 sometimes dies midway through the benchmark.  It seems RabbitMQ does not enforce very strict flow control, so you can get into situations where it runs out of memory if you place too much load on it.  The crash seems to become more likely as you increase the core speed of the CPU or reduce the amount of physical memory on the box.  Luckily, the RabbitMQ folks are aware of the issue and will hopefully fix it in the next release.

The ‘Throughput to an Unsubscribed Topic’ benchmark scenario is interesting for getting a baseline idea of the fastest possible rate at which a producer can send to the server.  Since there are no attached consumers, the broker should be doing very little work; it’s just dropping all the messages that get sent to it.
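
To make the scenarios concrete, here is a minimal sketch of the STOMP SEND frame a producer writes for each message. The frame layout (command line, headers, blank line, body, NUL terminator) comes from the STOMP spec; the destination naming and the persistent header follow common broker conventions and are assumptions here:

```java
public class StompFrames {
    // Builds a STOMP SEND frame: command, headers, blank line,
    // body, and a NUL byte terminating the frame.
    public static String send(String destination, boolean persistent, String body) {
        StringBuilder frame = new StringBuilder();
        frame.append("SEND\n");
        frame.append("destination:").append(destination).append('\n');
        if (persistent) {
            frame.append("persistent:true\n"); // broker-convention header, used in the persistent queue runs
        }
        frame.append('\n');  // blank line separates headers from body
        frame.append(body);
        frame.append('\0');  // NUL terminates the frame
        return frame.toString();
    }
}
```

Writing frames like this in a tight loop against a topic with no subscribers is essentially what the unsubscribed-topic scenario measures.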

The Queue Load/Unload scenarios are very important to look at if your application uses queues.  You oftentimes run into situations where messages start accumulating in a queue with either no consumers or not enough consumers to keep up with the producer load.  This benchmark first runs a producer for 30 seconds enqueuing non-persistent messages, then runs a producer enqueuing persistent messages for 30 seconds.  Finally, it runs a consumer dequeuing messages for 30 seconds.  An interesting observation in this scenario is that Apollo was the only server that could dequeue at around the same maximum enqueue rate, which is important if you ever want your consumers to catch up with fast producers.


The Fan In/Out Load Scenarios help you look at cases where you have either multiple producers or multiple consumers running against a single destination.  They help you see how performance is affected as you scale up the producers and consumers.  You should follow the “10 Producers” columns and “10 Consumers” rows to really get a sense of which servers do well as you increase the number of clients on a single destination.

The Partitioned Load Scenarios look at how well the server scales as you start to increase load on multiple destinations at different message sizes.

I’ve tried to make the benchmark as fair as possible to all the contenders; all the source code for the benchmark is available on GitHub.  Please open an issue or send me a pull request if you think of ways to improve it!

Fuse Community Day: San Francisco

I just found out I’m going to be heading out to San Francisco to attend the Fuse Community Day!

Progress Software is sponsoring an Apache ServiceMix, ActiveMQ, CXF & Camel Community Day on Thursday, December 10th, at the Hyatt Hotel in Burlingame. Join us at this free event and meet committers and founders of Apache ServiceMix, ActiveMQ, CXF, and Camel who have successfully implemented enterprise applications and deployed these projects in production.

Should be fun to meet users and developers of these kick-ass Apache-based projects.  If you plan on going, make sure you register for the event.  It’ll be nice to meet everyone!

STOMP Clarification

I just saw a tweet which demonstrates that the STOMP spec still needs more clarification. I think Brian McCallister, the founding architect of the protocol, will agree that one of the tenets of the protocol was for it to be simple enough to be used even by a user who connects directly to a server via telnet.

And to support that use case, newlines after the frame terminator are a natural occurrence. But it might be easier to describe it as:

  • A STOMP frame may have zero or more newlines preceding its command verb.
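
A parser that honors that rule just skips end-of-line characters before reading the command verb. A minimal sketch (illustrative, not any particular server’s code):

```java
import java.io.IOException;
import java.io.InputStream;

public class StompCommandReader {
    // Reads the command verb of the next frame, skipping any newlines
    // (and carriage returns) left over after the previous frame's
    // NUL terminator.
    public static String readCommand(InputStream in) throws IOException {
        int b;
        while ((b = in.read()) == '\n' || b == '\r') {
            // skip leading end-of-line characters
        }
        StringBuilder command = new StringBuilder();
        while (b != -1 && b != '\n') {
            if (b != '\r') command.append((char) b);
            b = in.read();
        }
        return command.toString();
    }
}
```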

ActiveMQ Protobuf Implementation Features

I promised I would follow up on my previous post, “The ActiveMQ Protobuf Implementation Rocks!”.

So you might be asking yourself: what’s the secret sauce? Well, before I get into that, let me first explain the class model that our proto compiler generates.

For every message definition in the ‘.proto’ file, the compiler will generate 3 classes:

  • the message interface: implemented by both the bean and buffer classes. It has all the ‘getters’.
  • the bean class: has all the ‘setters’ and ‘merge’ methods.
  • the buffer class: has all the encoding and decoding methods. It does not allow mutation.

The message interface also defines the freeze(), frozen(), and copy() methods, which allow you to make an instance immutable, check whether an instance is immutable, and create a mutable copy. Buffer classes are always immutable. Bean classes can transition to being immutable via freeze(). freeze() naturally returns a buffer object; copy() naturally returns a bean object.

This class model gives substantial flexibility. Besides making it easy to transition from immutable to mutable and back, the message interface lets you implement business methods that operate against either type of instance. You could use the bean class purely in a builder style to always generate a buffer instance, or you could just use them like traditional Java bean objects.
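
To illustrate the model, here is a tiny hand-written sketch of the pattern with a single field (the class and method names are illustrative, not the actual compiler output):

```java
interface Person {                     // the message interface: getters only
    String getName();
    PersonBuffer freeze();
    PersonBean copy();
}

class PersonBean implements Person {   // the bean: mutable, with setters
    private String name;
    private PersonBuffer frozen;       // set once freeze() has been called

    public PersonBean setName(String name) {
        assert frozen == null : "frozen beans must not be mutated";
        this.name = name;
        return this;
    }
    public String getName() { return name; }
    public PersonBuffer freeze() {     // transition to immutable; returns a buffer
        if (frozen == null) frozen = new PersonBuffer(name);
        return frozen;
    }
    public PersonBean copy() { return new PersonBean().setName(name); }
}

class PersonBuffer implements Person { // the buffer: always immutable
    private final String name;
    PersonBuffer(String name) { this.name = name; }
    public String getName() { return name; }
    public PersonBuffer freeze() { return this; }
    public PersonBean copy() { return new PersonBean().setName(name); }
}
```

Builder-style usage then reads naturally: `PersonBuffer p = new PersonBean().setName("hiram").freeze();`.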

Once a bean instance is frozen, any attempt to modify the instance will throw an assertion error if assertions are enabled in your JVM. So the CPU cost of validating program correctness can be disabled at run time.

Finally, the most important feature of the buffer class is that it holds on to either the byte array it was created from or the frozen bean that created it, and sometimes both, after it builds one from the other. This has several implications. Firstly, once a buffer is encoded to a byte[], subsequent encoding passes are free. This is also true when a buffer is decoded: the next encoding is free since the buffer still retains the original encoding of the message. The other benefit, which the benchmark highlighted, is that deferred decoding is possible. A newly created buffer class will not decode the data until a field is accessed. This is also true of nested messages encoded in a buffer: while the outer message may get decoded, a nested message will not be decoded until its fields are accessed.
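
The caching behavior can be sketched with a toy single-field buffer (illustrative, not the actual ActiveMQ protobuf API):

```java
import java.nio.charset.StandardCharsets;

class CachedBuffer {
    private byte[] encoded;   // retained wire form, once we have it
    private String decoded;   // retained field value, once we have it

    static CachedBuffer fromBytes(byte[] data) {
        CachedBuffer b = new CachedBuffer();
        b.encoded = data;     // deferred decoding: nothing is decoded yet
        return b;
    }
    static CachedBuffer fromValue(String value) {
        CachedBuffer b = new CachedBuffer();
        b.decoded = value;
        return b;
    }
    byte[] toBytes() {        // free after the first call, or if built from bytes
        if (encoded == null) encoded = decoded.getBytes(StandardCharsets.UTF_8);
        return encoded;
    }
    String value() {          // decoding happens on first field access
        if (decoded == null) decoded = new String(encoded, StandardCharsets.UTF_8);
        return decoded;
    }
}
```

A buffer decoded from the wire hands back the exact same byte array when re-encoded, which is why encode-after-decode costs nothing.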

The ActiveMQ Protobuf Implementation Rocks!

While reading Comparing the Java Serialization Options I ran across a cool Google Code project which has done an excellent job benchmarking a wide variety of serialization options for Java.

I had been researching the protobuf encoding format for a while and really liked it, but I did not really like the Java implementation that Google had published. It was kind of clunky to use, and I saw several optimizations that were missing. Optimizations that could create huge performance wins when applied to the usage patterns of an enterprise messaging system like Apache ActiveMQ. So I created a new protobuf implementation in the ActiveMQ project.

Naturally, I was curious to see how the ActiveMQ protobuf implementation stacked up against similar technologies. So I grabbed the V1 benchmark source code and added our implementation to it. If you want to do the same, apply this patch.

Once I ran the benchmark, I was very pleased with the results. I’m including the performance graphs of our implementation, standard protobuf, and Thrift for comparison.

As it turns out, our implementation looks awesome in the benchmark! How about that decoding speed!

It’s getting late here, so I’ll have to do a follow-up post explaining why we did so much better.