Hadoop should target C++/LLVM, not Java (because of watts)

Over the years, there have been many contentious arguments about the performance of C++ versus Java. Oddly, every one I found addressed only one kind of performance (work/time). I can't find any benchmarking of something at least as important in today's massive-scale computing environments: work/watt. A dirty little secret about JIT technologies like Java is that they throw a lot more CPU resources at the problem, trying to get up to par with native C++ code. JITs use more memory and periodically run background optimizer tasks. These overheads are somewhat offset in work/time performance by the extra optimizations that dynamic information makes possible, but the result is a hungrier appetite for watts. Another dirty little secret about Java vs. C++ benchmarks is that they compare single workloads. Try running 100 VMs, each with a Java and a C++ benchmark in it, and Java's hungrier appetite for resources (MHz, cache, RAM) will show. But of course, Java folks don't mention that.
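For anyone who wants to run the missing benchmark, here's a minimal sketch of how work/watt could be measured on a Linux box that exposes Intel's RAPL energy counters through the powercap sysfs interface. The sysfs path and the toy workload are assumptions for illustration; real counters wrap and vary by kernel and CPU, which this ignores:

```cpp
// work_per_watt.cpp -- minimal work/watt measurement sketch (Linux + Intel RAPL).
// Assumes a cumulative package-energy counter at the sysfs path below; the exact
// path differs by machine, and counter wraparound is not handled here.
#include <chrono>
#include <cstdint>
#include <fstream>
#include <iostream>

static uint64_t read_energy_uj() {
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    uint64_t uj = 0;
    f >> uj;  // cumulative package energy in microjoules
    return uj;
}

int main() {
    const uint64_t e0 = read_energy_uj();
    const auto t0 = std::chrono::steady_clock::now();

    // --- the workload under test goes here (toy placeholder) ---
    volatile double x = 0.0;
    for (long i = 0; i < 200000000L; ++i) x += 1.0 / (i + 1);

    const auto t1 = std::chrono::steady_clock::now();
    const uint64_t e1 = read_energy_uj();

    const double joules  = (e1 - e0) / 1e6;
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "time:   " << seconds << " s\n"
              << "energy: " << joules << " J\n"
              << "power:  " << joules / seconds << " W (avg)\n";
    return 0;
}
```

Run the same workload in Java and in C++ on the same box, and you get a work/watt comparison instead of the usual work/time one.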

But let's say, for the sake of (non-)argument, that Java can achieve 1:1 work/time performance relative to C++ for a single program. If Java consumes 15% more power doing it, does it matter on a PC? Most people don't care. Does it matter in small-scale server environments? Maybe not. Does it matter when you deploy Hadoop on a 10,000-node cluster, and the holistic inefficiency (multiple things running concurrently) climbs to 30%? Ask the people who sign the checks for the power bill. Unfortunately, inefficiency scales really well.
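To put a rough number on it, here's a back-of-envelope sketch. The per-node wattage and electricity rate are assumptions picked purely for illustration; only the node count and the 30% overhead come from the argument above:

```cpp
// power_bill.cpp -- back-of-envelope cost of a 30% power overhead at scale.
// All inputs are illustrative assumptions, not measurements.
#include <iostream>

int main() {
    const double nodes          = 10000;     // cluster size from the argument above
    const double watts_per_node = 300.0;     // assumed average draw per node
    const double overhead       = 0.30;      // assumed holistic JVM inefficiency
    const double hours_per_year = 24 * 365;
    const double usd_per_kwh    = 0.10;      // assumed industrial electricity rate

    const double wasted_kwh = nodes * watts_per_node * overhead * hours_per_year / 1000.0;
    std::cout << "wasted energy: " << wasted_kwh << " kWh/year\n"
              << "wasted money:  $" << wasted_kwh * usd_per_kwh << "/year\n";
    return 0;
}
```

Under those assumptions, that's roughly 7.9 million kWh, or about $790K a year, before you even count the extra cooling.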

Btw, Google's MapReduce framework is C++ based. So is Hypertable, the clone of Google's Bigtable distributed data storage system. The rationale for choosing C++ for Hypertable is explained here. I realize that Java's appeal is the write-once, run-anywhere philosophy, as well as all the class libraries that come with it. But there's another way to get portability: compile from C/C++/Python/etc. to LLVM intermediate representation, which can then be optimized for whatever platform comprises each node in the cluster. A bonus of using LLVM as the representation distributed to nodes is that OpenCL can also be compiled to LLVM. That retains a nice GPGPU abstraction across heterogeneous nodes (including those with GPGPU-like processing capabilities), without the Java overhead.
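To make that concrete, here's a minimal sketch of what the pipeline could look like with stock LLVM tools. The mapper is a toy and the per-node tuning flags are illustrative; this is the shape of the idea, not a worked-out Hadoop replacement:

```cpp
// mapper.cpp -- toy word-count map step, written once in portable C++.
//
// Ship the bitcode, not the binary. Something like:
//   clang++ -O2 -emit-llvm -c mapper.cpp -o mapper.bc   # compile once to LLVM IR
// Then each node lowers the same .bc to its own hardware:
//   llc -O2 -mcpu=native mapper.bc -o mapper.s          # tune for the local CPU
//   clang++ mapper.s -o mapper                          # assemble and link locally
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // Read lines on stdin, emit "word\t1" pairs on stdout, streaming-style.
    std::string line, word;
    while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        while (in >> word)
            std::cout << word << "\t1\n";
    }
    return 0;
}
```

The same mapper.bc gets lowered differently on each node, and a kernel written in OpenCL could ride the same IR down to a GPGPU-equipped node, which is exactly the abstraction I'm after.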

Now, I don't have a problem with Java being one of the workloads that can run on each Hadoop node (even scripting languages have their time and place). But I believe Hadoop's Java infrastructure will prove to be a competitive disadvantage, and will provoke a massive amount of wasted watts. "Write once, waste everywhere..." In the same way that Intel tends to retain a process advantage over other CPU vendors, I believe Google will retain a power advantage over others with their MapReduce (and, well, their servers are well-tuned too).

Disclosure: no positions