Sunday, May 10, 2009

Hadoop should target C++/LLVM, not Java (because of watts)

Over the years, there have been many contentious arguments about the performance of C++ versus Java. Oddly, every one I found addressed only one kind of performance (work/time). I can't find any benchmarking of something at least as important in today's massive-scale-computing environments, work/watt. A dirty little secret about JIT technologies like Java, is that they throw a lot more CPU resources at the problem, trying to get up to par with native C++ code. JITs use more memory, and periodically run background optimizer tasks. These overheads are somewhat offset in work/time performance, by extra optimizations which can be performed with more dynamic information. But it results in a hungrier appetite for watts. Another dirty little secret about Java vs C++ benchmarks is that they compare single-workloads. Try running 100 VMs, each with a Java and C++ benchmark in it and Java's hungrier appetite for resources (MHz, cache, RAM) will show. But of course, Java folks don't mention that.

But let's say for the sake of (non-)argument, that Java can achieve a 1:1 work/time performance relative to C++, for a single program. If Java consumes 15% more power doing it, does it matter on a PC? Most people don't dare. Does it matter for small-scale server environments? Maybe not. Does it matter when you deploy Hadoop on a 10,000 node cluster, and the holistic inefficiency (multiple things running concurrently) goes to 30%? Ask the people who sign the checks for the power bill. Unfortunately, inefficiency scales really well.

Btw, Google's MapReduce framework is C++ based. So isn't Hypertable, the clone of Google's Bigtable distributed data storage system. The rationale for choosing C++ for Hypertable is explained here. I realize that Java's appeal is the write-once, run anywhere philosophy as well as all the class libraries that come with it. But there's another way to get at portability. And that's to compile from C/C++/Python/etc to LLVM intermediate representation, which can then be optimized for whatever platform comprises each node in the cluster. A bonus in using LLVM as the representation to distribute to nodes, is that OpenCL can also be compiled to LLVM. This retains a nice GPGPU abstraction across heterogeneous nodes (including those including GPGPU-like processing capabilities), without the Java overhead.

Now I don't have a problem with Java being one of the workloads that can be run on each Hadoop node (even script languages have their time and place). But I believe Hadoop's Java infrastructure will prove to be a competitive disadvantage, and will provoke a mass amount of wasted watts. "Write once, waste everywhere..." In the way that Intel tends to retain a process advantage over other CPU vendors, I believe Google will retain a power advantage over others with their MapReduce (and well, their servers are well-tuned too).

Disclosure: no positions

2 comments:

srowen said...

(I thought this was intriguing enough, and was proud enough of my reply on this to the mahout-dev list, that I will favor you with a cross post here!)

The difference in power consumption between a fully loaded machine and
idle isn't so large (the figure 50% sticks in my head?), but the
difference between a fully loaded and half-loaded machine is quite
small. That is, if the hard disk is up, processor is at full speed,
all memory is fully powered, then using all or most is not a big deal.
Power consumption drops only if you are really idle.

I don't have numbers to back this up at my fingertips, though they're
informed by figures I've seen in the past. I think that's what one
would need to evaluate this argument, and I have a different intuition
about how much this could matter.

The main argument here seems to be, basically, that Java competes well
in wall-time performance by better parallelism and more memory usage.
Maybe, that's an interesting question. Is LLVM going to be more
efficient than Java? unclear, both have an overhead I suppose. But
again interesting question.


But, the topic really does matter. Wasting time means wasting energy,
and when we get to distributed cluster scale, it matters to the
environment. At Google they do a good job of keeping teams really
clear about how much their operations are costing -- it is staggering
sometimes. Developers who might run a big job, oops, see it fail,
start it up again, oops, wrong argument again... might think twice
when the realize how many pounds of CO2 their mistake just pumped into
the atmosphere.

(Mahout folks will now appreciate why I have been messing with the
code all over to try to micro-optimize for performance. I think there
is still not enough attention given to efficiency yet, but hey it's at
0.1.)


And, I think I agree with the conclusion of the blog post for a
different reason:

The Java/C++ performance gap for most apps is pretty negligible these
days. Why? I actually think given a fixed amount of *developer* time,
one can make a faster Java app than C++ app. Why? I can develop
faster, against a larger and more stable collection of libraries,
spend less time debugging, leaving more time to optimize the result.

But that does hit a certain plateau. Given enough developer time, I
can get native code to run faster than even JITted Java. I myself am
hard-pressed to optimize my code (Mahout - Taste) further in Java
without drastic measures.

It may take a lot of time to actually beat Java performance in C++,
but, as the scale of your operations grows, the return on that 1%
improvement you eke out grows. And of course -- when we talk about
code headed for Hadoop, we are definitely talking about large-scale
operations.

For reference, of course, Google operates at such a scale that they
use a C++-based MapReduce framework. It is just almost always
worthwhile to spend the time to beat Java performance.

This isn't going to be true of all users of distributed computing
frameworks, so it's not inherently wrong that Hadoop is in Java, but,
I did find myself saying "hmm, Java?" the first time I heard of
Hadoop.


But isn't this what this whole Hadoop streaming business is about?
letting you farm out the computation itself to whatever native process
you like and just using Hadoop for the management? because that of
course is fine.

rgomes1997 said...

Hi,

I wouldnt like to comment your concerns about power consumption but I'd like to contribute with some ideas.

1. If you consider RTSJ (Real Time System Java) you could use ITC (Initialization Time Compilation) instead of JIT. RTSJ can speed up your Java application too if you use "Soft Real Time Threads", which is not difficult to implement and can prevent GC to manage memory you can manage easily yourself (Scoped Memory).

These links may be of your interest:

* http://java.sun.com/javase/technologies/realtime/reference/doc_2.1/release/JavaRTSCompilation.html

* http://www.rtsj.org/specjavadoc/book_index.html

2. IBM has a very interesting research project called X10 which generates code in Java and/or C++ as output. The input language is something based on Scala (see release 1.7.x).
You could use it to Write-Once- Run-Everywhere, does not matter if you have a JVM or your your native OS.

A very interesting improvement over Scala is that X10 does not use MPI but it uses PGAS, which is beneficial as STM but provides maximum performance for local data.

IBM X10 Language
* http://www.x10-lang.org/
* http://dist.codehaus.org/x10/documentation/languagespec/x10-173.pdf

STM (Software Transactional Memory)
* http://en.wikipedia.org/wiki/Software_transactional_memory

PGAS (Partitioned Global Address Space)
* http://en.wikipedia.org/wiki/Partitioned_global_address_space

Regards

Richard Gomes
http://www.jquantlib.org/index.php/User:RichardGomes