Stuff that bothers me: “100x faster than Hadoop”

source link: https://erikbern.com/2013/04/27/stuff-that-bothers-me-100x-faster-than-hadoop.html

2013-04-27

The simplest way to get featured on a big data blog these days seems to be:

  1. Build something that does 1 thing super well but nothing else
  2. Benchmark it against Hadoop
  3. Publish stats showing that it's 100x faster than Hadoop

Spark claims it's 100x faster than Hadoop, and there are a lot of stats showing Redshift is 10x faster than Hadoop. There are a bunch of papers with similar claims. I spent five minutes Googling “Xx faster than Hadoop” and found a ton of other stats.

(Btw, when people say this, I generally take it to mean that Z is X times faster than Hadoop MapReduce. Just nitpicking.)

Anyway, these stats bother me a lot because everyone knows that

  • Horizontal scalability comes at a very high price, because things get I/O bound. That's fine, because you can always throw more hardware at the problem.
  • Flexibility comes at a price, and that's totally fine for most people. Hadoop supports pretty much anything that can be reduced to a series of MapReduce jobs, which in practice turns out to be most stuff.
  • Ease of use comes at a price, and that's fine. There's a reason a lot of people choose Python over C++, after all. OK, writing MapReduce jobs in Java sucks, but there are a lot of nice tools out there to make it simple (subtle product placement: check out Luigi).
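
To make the flexibility point a bit more concrete, here is a rough sketch of what "a series of MapReduce jobs" means. It is a toy, single-process illustration in plain Python, not Hadoop's API: `run_mapreduce` and the mapper and reducer names are made up for this example. The first job counts words, and the second job takes the first job's output as its input and groups words by how often they appear.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(records, mapper, reducer):
    """Toy in-memory MapReduce (hypothetical helper, not a Hadoop API).

    Map each record to (key, value) pairs, group values by key,
    then reduce each group. Keys are processed in sorted order.
    """
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper(r) for r in records):
        groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


# Job 1: classic word count.
def split_words(line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(word, counts):
    yield word, sum(counts)


# Job 2: consumes job 1's output and groups words by their count.
def swap(pair):
    word, count = pair
    yield count, word


def collect_words(count, words):
    yield count, sorted(words)


lines = ["spark is fast", "hadoop is flexible", "fast is relative"]
word_counts = run_mapreduce(lines, split_words, sum_counts)
words_by_count = run_mapreduce(word_counts, swap, collect_words)

print(word_counts)
print(words_by_count)
```

Neither job knows anything about the other. Chaining them is just a matter of feeding one job's output into the next, which is why so many problems fit this model, and also why nobody should expect it to win a benchmark against a tool built for exactly one of those problems.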

I think Spark is a really cool piece of technology, so don't get me wrong. I just think it's stupid to compare Hadoop and Spark when they are clearly two very different products with different use cases. You wouldn't compare Tokyo Cabinet to MySQL either. So please never ever say that something is X times faster than Hadoop again.

Tagged with: math

