A couple months ago, I wrote an article about how Hadoop infrastructure should use C++/LLVM, not Java, to be as scalable and efficient as possible. And to be competitive with Google infrastructure. Discussions surrounding Java vs C++ performance often seem to morph into something bordering on religion, are muddled with arguments about various tweaks that can be done in one language or the other, and then dissipate into the abyss of non-action. Very few discussions focus on the real issue.
I thought I'd take a different tact, and benchmark what I believe the real problem with Java is, vis-a-vis its use in large scale settings. Rather than focus on the relative Java-vs-C++ performance, I instead benchmarked the behaviour of multiple benchmarks running concurrently. In a desktop setting, a user may primarily be running one active application. It is possible for the overhead of a single Java VM to get buried in the over-supply of cores/memory/MHz/cache that are available on the PC. But in a large-scale server setting, it's desirable to max out the resources with as many workloads as the hardware can handle, hopefully without making usage of any one resource unbalanced w.r.t. the rest of the resources. And thus, any overhead imparted also means displacement of other work that's not done. Nothing comes for free; sometimes it's just not noticed (which is the "dirty little secret" of Java benchmarking I noted in the mentioned article).
So how does Java performance fare, when there is more than one thing going on? That's what I set out to benchmark. I highly recommend people repeat the same kind of methodology on whatever workloads they deem appropriate. For a quick-and-dirty assessment, I used the benchmarks from here, with the binaries (C++, i686 versions) and compiled Java classes as-is (ran with -server option). But rather than be concerned with Java-vs-C++ performance, I benchmarked each language relative to itself when a number of benchmarks were loaded on the same core.
To do this, I first timed each C++ benchmark individually. Based on those times, I created bundles of benchmarks to be run sequentially, each bundle taking approximately 30 seconds to complete end-to-end. Then for each of Java and C++ separately, I ran from 2 to 10 bundles both serially and in parallel, to see how much more efficiently the system could run the work when it had access to all the work-to-be-done in parallel. This is multi-tasking, and it's how real systems work. Any Java overhead will impart "displacement" of some kind, be it in terms of MHz, cache consumption, etc. And this is the kind of methodology that will illuminate such overhead in real terms.
The above graphic shows the results. Given concurrency, the system can complete Java workloads (bundles) about 15% faster than for the serial case. But for C++, it's about 30% faster. Benchmarking other (more real) workloads will have differing results, and I'd like to encourage others to do so. Note that this C++ multi-tasking advantage is cumulative with whatever gains it may have over Java on single task benchmarks. Keep in mind that large cluster scale applications are the kinds that are worth running after profile-guided compilation. This is a way to offer C++ applications some of the dynamic profiling gains that Java offers, without the Java overhead.
bundle 1: matrix + nestedloop + random + sieveFor some other reading, you can look at why the Hypertable project chose C++ over Java. And to clear up some mis-conceptions: 1) no, LLVM does not JIT C/C++ code, but it does offer later stage optimizations relative to traditional compilation and linking, and 2) Google's Dalvik (Java-like) VM is an endpoint (desktop, mobile device) technology, and not as relevant to the focus of this article.
bundle 2: fibo + hash
bundle 3: hash2 + heapsort + matrix + strcat
bundle 4: methcall + random + sieve + strcat
bundle 5: nestedloop + objinst + hash
bundle 6: sieve + random + nestedloop + matrix
bundle 7: hash + fibo
bundle 8: strcat + matrix + heapsort + hash2
bundle 9: strcat + sieve + random + methcall
bundle 10: hash + objinst + nestedloop
Disclosure: no positions