Performance tuning in a typical JEE environment is an art. Like any other artform, it only gets better with experience. There are a lot of moving parts to consider - load balancers, JEE containers, application design - frameworks like ORM mappers and IOC engines, JVM performance - garbage collection configuration, database prformance, database access, disk IO. In a distributed environment, there are even more bottlenecks introduced - serialization overhead of marshalling and unmarshalling java objects, overhead of keeping the cluster coherent, overhead of avoiding the split-brain problem and so on. One of the most overlooked or least proiritized area is network overhead. Understanding how data is being moved across your servers and "how much" data is being moved across and then fine-tuning this network overhead can bring a tremendous performance boost to your application.
My motivation to write this article came from a recent visit I made to one of our customers. They have a distributed JEE middleware layer which uses protocols like JMS and RMI to move data across multiple JVMs running on multiple commodity hardware boxes. The JMS layer is used for sending asynchronous messages whereas the RMI layer is used for sending broadcast messages to multiple consumers. During very high usage, the network pipe gets extremely busy (up to 70%). This is a gigabit network we are talking about. Obviously, there was a concern of the network getting completely clogged at the rate at which the usage was growing. In fact, the network would get so busy at peak usage times, that they had to introduce artificial pauses and they couldn't afford to send real time messages any more. This is the reason I was there to demonstrate how Terracotta could reduce this overhead significantly. We dicided to replace the RMI-boradcast layer with Terracotta DSO as a pilot. Let's look at how RMI and Terracotta would work in this scenario to understand the test.
A normal RMI invocation works as follows: a client application invokes a method on an object on the server. The parameters are marshalled and sent out over the network to the server machine, where the computations are performed, results are marshalled and sent back to the client. The two main areas where delays occur are the marshalling (object serialization and deserialization) and the network overheads. Object serialization overhead is a lot higher if you are sending deep object graphs as parameters since a deep-clone serialization (and deserialization) needs to be performed. In the above mentioned case, each consumer JVM ran a RMI server and the client JVM (the broadcaster) would iterate through a list of consumer IPs or hostnames and send a message (invoke a method) on each of the consumer JVMs.
Terracotta is an open source clustering solution that works as follows: each clustered JVM is policed or managed by a central Terracotta server. Whenever a shared object is mutated (within the boundaries of a lock), the delta level fine grained (field level) changes are maintained, and propagated to other clustered JVMs if and when the same object is accessed. Since there is no serialization and deserialization involved, there is no marshalling and unmarshalling overhead - no matter how deep the objec graph is. Also, since Terracotta only ships fine-grained changes, the difference in the amount of data sent across the network can be huge depending on the case. All this is possible with Terracotta as it maintains physical object identity across the cluster (i.e. foo == bar is preserved, not just foo.equals(bar)).
Ok, all this is in theory. Let's look at the actual test now. To simulate the customer environment, I wrote a simple RMI application with 1 broadcaster node broadcasting to 3 consumer nodes an object graph which is 10 levels deep and is of total size 10k. For this result, the boradcast happened every 10 millisenconds and I used ntop to monitor network usage.
The RMI broadcast sent a total of 2.6 GigaBytes of data over one hour i.e. ~ 722 KiloBytes per second.
With Terrcotta, the overall network usage over one hour was 491 MegaBytes i.e. ~ 136 KiloBytes per second.
This is a 5.3 times improvement by using the Terracotta solution. Now if you scale these results to 40 consumers and 4 broadcasters (actual real world scenario) - the improvement is ~ 11 times.
Let's analyze these results. In this case, Terracotta is able to achieve such improvements because in the entire 10K object graph which is 10 levels deep, only one String value (100 bytes) is changing and being broadcasted. With the RMI solution, the entire 10K object graph is being broadcasted to every consumer node whereas Terracotta only sends the bytes that have changed.
If we consider a non-broadcast use case where the changes do not need to be sent to every other node in the cluster, Terracotta would do a lot better whereas a serialization based clustering solution would still send the entire object graph to every other node to keep the cluster coherent. (There are some serialization based clustering solutions that implement a buddy replication system or some other mechanism where the object graph is not shipped to every other node, but the drawbacks associated with those is outside the scope here.) Terracotta guarantees cluster coherency whithout having to send the changes to every peer in the cluster because the central server knows which object is being accessed on which node through the lock semantics and by preserving object identity. In fact, due to the object identity being preserved, Terracotta sends fine grained delta-level changes ONLY WHERE they need to be sent i.e. only to the JVM where the same object is accessed.
All of the above improvements are over and above not having to maintain any RMI code and working with natural POJOs only. I have not pointed out the performance gains in the above test seen as a result of no serialization and not having to marshall and unmarshall the 10K, 10 level deep object graph as that is outside the scope of this discussion. I wanted to point out the impact network overhead can have on a typical JEE application.
For those who are interested, in the next part of this article I will post the source code of the test and instructions on how to run it.