People say "should you run Hadoop in the cloud?". I say "it depends".
I think there is value in Hadoop-in-cloud, I talked about doing it in 2010 at Berlin Buzzwords 2010; since then I've had more experience with using Hadoop and implementing cloud infrastructures.
- If your data is stored in a cloud provider's storage infrastructure, doing the analysis locally is the only rational action. It's that "work near the data" philosophy.
- If you are only doing some computation -say nightly- then you can rent some cluster time. Even if compute performance is worse, you can just rent some more machines to compensate.
- You may be able to achieve better security through isolation of clusters (depends on your IaaS vendor's abilities).
- No upfront capex; fund from ongoing revenue.
- Easier to expand your cluster; no need to buy more racks, find more rack space.
- You don't need to care about the problems of networking.
- Less of a problem of heterogenous clusters if you expand later.
- Cost of data storage can only increase at a rate proportional to ingress/retention rates.
- Cost of cluster time increases at a rate proportional to analysis performed. There is no "spare cluster time" for low priority work.
- Even if CPU time can scale up, IO rate of persistent data may not.
- Hadoop contains lots of assumptions about running in a static infrastructure; it's scheduling and recovery algorithms assume this.
- HDFS assumes failures are independent, and places data accordingly (Google's Availability in Globally Distributed Storage Systems paper shows this doesn't hold in physical infrastructures, my notion of failure topologies expands on that)
- MR blacklists failing machines, rather than releasing them and requesting new ones.
- Worker nodes handle failure of master nodes by spinning on the hostname, not querying (dynamic) configuration data for new hostnames. Some of the HA HDFS may address that, I'm not tracking it enough.
- Topology scripts are static. I've been slowly tweaking topology logic in 0.23+ but haven't put the dynamicness in there yet (HDFS and MR cache (name->rack) mappings on the assumption that the data is coming from slow to exec scripts, not fast & refreshable in-VM data).
- Schedulers assume #of machines are static, don't allocate and release compute nodes based on demand and with knowledge of cost and quantum of CPU rentals. (I'm not sure quantum is the right term there, I mean the fact that VMs may be rented by the hour, 15 minutes, etc, so your scheduler should retain them for 59 minutes after acquiring them.
- Scheduling doesn't bill different users for their cluster use in a way that is easily mapped to cluster time.
A lot of these are tractable, you just have to put in the effort. The Stratosphere team in Berlin are doing lots of excellent work here, including taking a higher level query language and generating an execution plan that is IaaS aware -you can optimise for fast (many machines) or lower cost (use less machines more efficiently).
In comparison, a physical cluster:
- Offers a lower cost/TB of any corporate filestore to date other than people's desktop computers (which have a high TCO and power cost that is generally ignored), so enables you to store lots of stuff you would otherwise discard.
- Let's you choose the hardware optimised for your current and predicted workloads.
- Has free time for the low priority background work as well as the quicker queries that near-real-time UIs like.
- May be directly accessible from desktops in the organisation (depends on security model of cluster).
- Is easily hooked up to Jenkins infrastructure for execution of work as CI jobs.
- Let's you do fancy tricks like striping of different MR versions across the racks for in-rack locality and different sets of task trackers for foreground vs background work, and different JTs (reduces memory use, cost of failure, etc).
- Is way, way easier to hook up to internal databases, log feeds. To do ETL into your corporate oracle servers, you will need to run something behind the firewall to fetch it off the IaaS storage layer, rather than have your reducers push it to the RDBMS itself.
This is why I say "it depends" -it depends on where you collect your data and what you plan to do with it.
As for the way Hadoop doesn't currently work so well in such infrastructures, well the code is there for people to fix. It's also a lot easier to test in-cloud behaviour, including resilience to failure, than it is with physical clusters.
[Photo: Descending Mt Ranier, 2000]