2012-07-31

Welcome to Chaos

Welcome to Bristol

Netflix have published their "Chaos Monkey" code on GitHub, ASL-licensed. I have already filed my first issue, having looked through the code -an issue that is already marked as fixed.

Netflix bring to the world the original Chaos Monkey -tested against production services.

Those of us playing with failures, reliability and availability in the Hadoop world also need something that can generate failures, though for testing the needs are slightly different:
  1. Failures triggered somewhat repeatably.
  2. Be more aggressive.
  3. Support more back ends than Amazon -desktop, physical, private IaaS infrastructures.
#1 and #2 are config tuning -faster killing, seeded execution.
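
A minimal sketch of the "seeded execution" idea -the class is made up, not Chaos Monkey code, but it shows how a seeded java.util.Random makes a kill sequence replayable:

  import java.util.List;
  import java.util.Random;

  /** Hypothetical sketch: deterministic victim selection, so a failure run can be replayed. */
  public class SeededVictimPicker {
    private final Random random;

    public SeededVictimPicker(long seed) {
      // The same seed against the same instance list yields the same kill sequence.
      this.random = new Random(seed);
    }

    public String pickVictim(List<String> instanceIds) {
      return instanceIds.get(random.nextInt(instanceIds.size()));
    }
  }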

#3? Needs more back ends. The nice thing here is that there's very little you need to implement when all you are doing is talking to an Infrastructure Service to kill machines; the CloudClient interface has one method:
  void terminateInstance(String instanceId);
That needs to be aided with something to produce a list of instances, and of course there's the per-infrastructure configuration of URLs and authentication.


My colleague Enis has been doing something for this for HBase testing; independently I've done something in Groovy for my availability work, draft package org.apache.chaos. I've done three back ends:
  1. SSH in to a machine and kill a process by pid file.
  2. Pop up a dialog telling the user to kill a machine (not so daft, good for semi-automated testing).
  3. Issue virtualbox commands to kill a VM.
All of these are fairly straightforward to migrate to the Chaos Monkey; they are all driven by config files enumerating the list of target machines, plus some back-end specific options (e.g. pid file locations, list of vbox UUIDs).
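
To give a feel for how little a new back end needs, here's a rough sketch of a VirtualBox client -illustrative only, not the actual org.apache.chaos or Simian Army code- which treats the VM UUID as the instance ID and shells out to the stock VBoxManage tool:

  import java.io.IOException;

  /** The one-method interface quoted above. */
  interface CloudClient {
    void terminateInstance(String instanceId);
  }

  /** Illustrative sketch of a VirtualBox back end: instanceId is a VM UUID from the config file. */
  public class VirtualBoxClient implements CloudClient {
    @Override
    public void terminateInstance(String instanceId) {
      // "poweroff" is the abrupt option -the VM equivalent of pulling the plug.
      ProcessBuilder pb = new ProcessBuilder("VBoxManage", "controlvm", instanceId, "poweroff");
      try {
        Process p = pb.start();
        int rc = p.waitFor();
        if (rc != 0) {
          throw new RuntimeException("VBoxManage failed with exit code " + rc);
        }
      } catch (IOException e) {
        throw new RuntimeException("Failed to terminate VM " + instanceId, e);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted while terminating VM " + instanceId, e);
      }
    }
  }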

Then there are the other possibilities: VMWare, fencing devices on the LAN, SSH in and issue ifup/ifdown commands (though note that some infrastructures, such as vSphere, recognise that explicit action and take things off HA monitoring). All relatively straightforward.

Which means: we can use the Chaos Monkey as a foundation for testing how distributed systems, especially the Hadoop stack components, react to machine failover -across a broad set of virtual and physical infrastructures. 


That I see the appeal of.


Because everyone needs a Chaos Monkey.

[update 13:24 PST, fixed first name, thank you Vanessa Alvarez!]

2012-07-26

Reminder to #Hadoop recruiters: do your research

It was only last week that I blogged about failing Hadoop recruiter approaches.

Key take-aways were:
  • Do your research from the publicly available data.
  • Use the graph, don't abuse it.
  • Never try to phone me.
Given it is fresh in my blog, and that that blog is associated with my name and LI profile, it's disappointing to see that people aren't reading it.

Danger Vengeful God

A couple of days ago, someone tried to get hold of me on Twitter looking for SAP testers, which is as relevant and appealing to me as a career opportunity in marketing.

Today, some LI email from Skype.com:
From: James
Date: 26 July 2012 07:35
Subject: It's time for Skype.
To: Steve Loughran

Dear Steve,

Apologies for the unsolicited nature of the message; it seemed the most confidential way to approach you, although I shall try to contact you by phone as well.

I am currently working within Skype's Talent Acquisition team and the key focus in my role is to search and hire top talent in the market place. I am currently looking at succession planning both for now but also on a longer term plan. I would be keen to have a conversation with you about potential opportunities, and introduce myself and tell you more about Skype (Microsoft) and our hiring in London around "Big Data".

I look forward to hearing from you.
James
See that? He's promising to try to contact me by phone, as if I'm going to be grateful. Yet only last week I stated that doing exactly that triggers my "do that and I will never speak to you" policy. Nor do I consider unsolicited emails confidential -as you can see.

I despair.

The data is there, use it. If you don't, well, what kind of data mining company are you?

For anyone who does want an exciting and challenging job in the Hadoop world, one where your contributions will go back into open source and you will become well known and widely valued, can I instead recommend one of the open Hortonworks positions? We. Are. Having. Fun.

As an example, here is a scene at yesterday's first birthday BBQ:
Garden Party


Eric14, co-founder and CTO, on the left; Bikas on the right, wearing Buzz Lightyear balloons after food and beverages; nearby, Owen is wearing facepaint. Bikas is working on Hadoop on Windows, and has two large monitors showing Windows displays in his office, alongside the Mac laptop. The outcome of that work is not just that Hadoop will be a first class citizen on the Windows platform: you'll get excellent desktop and Excel integration. Join that team and you get to play with this stuff early -and bring it to the world.
 
I promise I will not phone anyone up about these jobs.

2012-07-19

Defending HDFS


It seems like everyone is picking on HDFS this week.

Why?
Limited Press @ #4 Hurlingham Road

Some possibilities
  1. There are some fundamental limitations of HDFS that suddenly everyone has noticed.
  2. People who develop filesystems have noticed that Hadoop is becoming much more popular and wish to help by contributing code, tests and documentation to the open source platform, passing on their experiences running the Hadoop application stack and hardening Hadoop to the specific failure modes and other quirks of their filesystems.
  3. Everyone whose line of business is selling storage infrastructure has realised that not only are they not getting new sales deals for Hadoop clusters, but that Hadoop HDFS is making it harder to justify the prices of "Big Iron" storage.
If you look at the press releases, action two, "test and improve the Hadoop stack" isn't being done by the "legacy" DFS vendors. These are the existing filesystems that are having Hadoop support retrofitted - usually by adding locality awareness to SAN-hosted location independence and a new filesystem driver with topology information for Hadoop. A key aid to making this possible is Hadoop's conscious decision to not support full Posix semantics, so it's easier to flip in new filesystems (a key one being Amazon S3's object store, which is also non-Posix).

I've looked at NetApp and Lustre before. Whamcloud must be happy that Intel bought them this week, and I look forward to consuming beer with Eric Barton some time this week. I know they were looking at Hadoop integration -and have no idea what will happen now.

GPFS, well, I'll just note that they don't quote a price, instead say "have an account team contact you". If the acquisition process involves an account team, you know it won't be cents per GB. Client-side licensing is something I thought went away once you moved off Windows Server, but clearly not.

CleverSafe. This uses erasure coding as a way of storing data efficiently; it's a bit like parity encoding in RAID but not quite, because instead of the data being written to disks with parity blocks, the data gets split up into blocks and scattered through the DFS. Reading in the blocks involves pulling in multiple blocks and merging them. If you over-replicate the blocks you can get better IO bandwidth -grab the first ones coming in and discard the later ones.
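
As a back-of-the-envelope comparison (my numbers, not CleverSafe's): 3x replication costs three blocks of raw disk per block stored, while an MDS erasure code such as Reed-Solomon with 10 data and 4 coded blocks costs 1.4x and can survive the loss of any 4 of the 14:

  /** Rough storage-overhead arithmetic: replication vs. a hypothetical 10+4 erasure code. */
  public class StorageOverhead {
    public static void main(String[] args) {
      double replication = 3.0;              // three full copies of every block
      int dataBlocks = 10, codedBlocks = 4;  // a 10+4 code, parameters picked for illustration
      double ecOverhead = (dataBlocks + codedBlocks) / (double) dataBlocks;
      System.out.printf("3x replication:    %.1fx raw storage per byte%n", replication);
      System.out.printf("10+4 erasure code: %.1fx raw storage per byte%n", ecOverhead);
      // 3.0x vs 1.4x -the saving that makes erasure coding appealing for cold data
    }
  }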

Of course, you then end up with the bandwidth costs of pulling in everything over the network -you're in 10GbE territory and pretending there aren't locality issues, as well as worrying about bisection bandwidth between racks.

Or you go to some SAN system with its costs and limitations. I don't know what CleverSafe say here -potential customers should ask that. Some of the cloud block stores use e-coding; it keeps costs down and latency is something the customers have to take the hit on.

I actually think there could be some good opportunities to do something like this for cold data or stuff you want to replicate across sites: you'd spread enough of the blocks over 3 sites that you could rebuild them from any two, ideally.


Ceph: Reading the papers, it's interesting. I haven't played with it or got any knowledge of its real-world limitations.


MapR. Not much to say there, except to note the quoted Hadoop scalability numbers aren't right. Today, Hadoop is in use in clusters up to 4000+ servers and 45+PB of storage (Yahoo!, Facebook). Those are real numbers, not projections from a spreadsheet.

There are multiple Hadoop clusters running at scales of 10-40 PB, as well as lots of little ones from gigabytes up. From those large clusters, we in the Hadoop dev world have come across problems, problems we are, as an open source project, perfectly open about.

This does make it easy for anyone to point at the JIRA and say "look, the namenode can't...", or "look, the filesystem doesn't..." That's something we just have to recognise and accept.

Fine: other people can point to the large HDFS clusters and say "it has limits", but remember this: they are pointing at large HDFS clusters. Nobody is pointing at large Hadoop-on-filesystem-X clusters, for X != HDFS -because there aren't any public instances of those.

All you get are proof of concepts, powerpoint and possibly clusters of a few hundred nodes -smaller than the test cluster I have access to.

If you are working on a DFS, well, Hadoop MapReduce is another use case you've got to deal with -and fast. The technical problem is straightforward -a new filesystem client class. The hard part is solving the economics problem of a filesystem that is designed to not only store data on standard servers and disks -but to do the computation there.
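
To make "a new filesystem client class" concrete: Hadoop lets you plug in or wrap a FileSystem implementation and, crucially, report block locations so work can be placed near the data. A minimal sketch -FilterFileSystem used purely for brevity, and lookupHosts() standing in for whatever topology query the vendor's storage actually offers:

  import java.io.IOException;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.FilterFileSystem;
  import org.apache.hadoop.fs.Path;

  /** Illustrative only: a wrapper filesystem that injects vendor-supplied topology information. */
  public class VendorTopologyFileSystem extends FilterFileSystem {

    public VendorTopologyFileSystem(FileSystem underlying) {
      super(underlying);
    }

    @Override
    public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
        throws IOException {
      // A real driver would ask the storage back end which hosts hold this byte range;
      // lookupHosts() is a placeholder for that vendor-specific call.
      String[] hosts = lookupHosts(file.getPath(), start, len);
      // names and hosts collapsed together here for brevity
      return new BlockLocation[] { new BlockLocation(hosts, hosts, start, len) };
    }

    private String[] lookupHosts(Path path, long start, long len) {
      return new String[] { "localhost" }; // placeholder
    }
  }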

Any pure-storage story has to explain why you also need a separate rack or two of compute nodes, and why SAN failures aren't considered a problem.

Then they have to answer a software issue: how can they be confident that the entire Hadoop software stack runs well on their filesystem? And if it doesn't, what processes have they in place to get errors corrected -including cases where the Hadoop-layer applications and libraries aren't working as expected?
Another issue is that some of the filesystems are closed source. That may be good from their business model perspective, but it means that all fixes are at the schedule of the sole organisation with access to the source. Not having any experience of those filesystems, I don't know whether or not that is an issue. All I can do is point out that it took Apple three years to make the rename() operation atomic, and hence compliant with POSIX. Which is scary, as I do use that OS on my non-Linux boxes. And before that, I used NTFS, which is probably worse.

Hadoop's development is in the open; security is already in HDFS (remember when that was a critique? A year ago?), HA is coming along nicely in the 1.x and 2.x lines. Scale limits? Most people aren't going to encounter them, so don't worry about that. Everyone who points to HDFS and says "HDFS can't" is going to have to find some new things to point to soon.

For anyone playing with alternate filesystems to hdfs://, file:// and s3://, here are some things to ask your vendor:
  1. How do you qualify the Hadoop stack against your filesystem?
  2. If there is an incompatibility, how does it get fixed?
  3. Can I get the source, or is there an alternative way of getting an emergency fix out in a hurry?
  4. What are the hardware costs for storage nodes?
  5. What are the hardware costs for compute nodes?
  6. What are the hardware costs for interconnect?
  7. How can I incrementally expand the storage/compute infrastructure?
  8. What are the licencing charges for storage and for each client wishing to access it?
  9. What is required in terms of hardware support contracts (replacement disks on site etc), and cost of any non-optional software support contracts?
  10. What other acquisition and operational costs are there?
I don't know the answers to those questions -they are things to ask the account teams. From the Hadoop perspective:
  1. Qualification is done as part of the release process of the Hadoop artifacts.
  2. Fix it in the source, or convince someone else to (support contracts, etc.).
  3. Source? See http://hadoop.apache.org/
  4. Server hardware? SATA storage -servers depend on the CPU and RAM you want.
  5. Compute nodes? See above.
  6. Interconnect? Good q. 2x1GbE getting more popular, I hear. 10 GbE still expensive
  7. Adding new servers is easy, expanding the network may depend on the switches you have.
  8. Licensing? Not for the Open Source bits.
  9. H/W support: you need a strategy for the master nodes, inc. Namenode storage.
  10. There's support licensing (which from Hortonworks is entirely optional), and the power budget of the servers.
Server power budget is something nobody is happy about. It's where reducing the space taken up by cold data would have add-on benefits -there's a Joule/bit/year cost for all data kept on spinning media. The trouble is: there's no easy solution.
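
To put a rough number on that -every figure here is an assumption, not a measurement- take a 3TB drive drawing about 8W, spinning 24x7:

  /** Back-of-the-envelope power cost of keeping cold data on spinning disks; inputs assumed. */
  public class SpindlePower {
    public static void main(String[] args) {
      double wattsPerDisk = 8.0;       // assumed average draw of one spinning drive
      double terabytesPerDisk = 3.0;   // assumed capacity
      double hoursPerYear = 24 * 365;
      double kWhPerDiskYear = wattsPerDisk * hoursPerYear / 1000.0;
      System.out.printf("%.0f kWh/disk/year, %.1f kWh/TB/year%n",
          kWhPerDiskYear, kWhPerDiskYear / terabytesPerDisk);
      // ~70 kWh per disk per year -before replication, the servers, cooling and PUE
    }
  }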

I look forward to a time in the future when solid state storage competes with HDD on a bit by bit basis, and that cold data can be moved to it -where wear levelling matters less precisely because it is cold- and warm data can live on it for speed of lookup as well as power. I don't know when that time will be -or even if.

[Artwork, Limited Press on #4 Hurlingham Road. A nice commissioned work.]

2012-07-08

Nobody ever got fired for using Hadoop on a cluster


Over a weekend in London I enjoyed reading a recent Microsoft Research paper from Rowstron et al., Nobody ever got fired for using Hadoop on a cluster.

The key points they make are:
  1. A couple of hundred GB of DRAM costs less than a new server
  2. Public stats show that most Hadoop analysis jobs use only a few tens of GB.
  3. If you can load all the data into DRAM on a single server, you can do far more sophisticated and stateful algorithms than MapReduce
They then go on to demonstrate this with some algorithms and examples that show that you can do analysis in RAM way, way faster than by streaming HDD data past many CPUs.
Avebury Stone Circle

This paper makes a good point. Stateless MapReduce isn't an ideal way to work with things, which is why counters (which cause JT scale problems), per-task state (dangerous but convenient) and multiple iterations (purer; inefficient) come out to play. It's why graph stuff is good.

Think for a moment, though, about what Hadoop HDFS+MR delivers.
  • Very low cost storage of TB to PB of data -letting you keep all the historical data for analysis.
  • I/O bandwidth that scales with the number of HDDs.
  • An algorithm that, being isolated and stateless, is embarrassingly parallel.
  • A buffered execution process that permits recovery from servers that fail.
  • An execution engine that can distribute work across tens, hundreds or even thousands of machines.
HDFS delivers the storage, MapReduce provides a way to get at the data. As Rowstron and colleagues note, for "small" datasets, datasets that fit into RAM, it's not always the best approach. Furthermore, that falling cost of DRAM means that you can start predicting cost/GB of RAM in future, and should start thinking "what can I do with 256GB RAM on a single server?" The paper asks the question: Why then, Hadoop? Well, one reason the paper authors note themselves:
The load time is significant, representing 20–25% of the total execution time. This suggests that as we scale up single servers by increasing memory sizes and core counts, we will need to simultaneously scale up I/O bandwidth to avoid this becoming a bottleneck.
A server with 16 SFF HDDs can give you 16 TB of storage today; 32 TB in the future, probably matched by 32 cores at that time. The IO bandwidth, even with those 16 disks, will be a tenth of what you get from ten such servers. It's the IO bandwidth that was a driver for MapReduce -as the original Google paper points out.  Observing in a 2012 paper that IO bandwidth was lagging DRAM isn't new -that's a complaint going back to the late 1980s.
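
A quick sanity check on that, assuming ~100 MB/s of sequential throughput per disk:

  /** Rough aggregate-scan arithmetic: one dense server vs. ten of them; disk speed assumed. */
  public class ScanTime {
    public static void main(String[] args) {
      double mbPerSecPerDisk = 100.0;  // assumed sequential read rate of one HDD
      int disksPerServer = 16;
      double terabytes = 16.0;         // the dataset: one full server's worth of storage
      double oneServer = disksPerServer * mbPerSecPerDisk;  // ~1.6 GB/s
      double tenServers = 10 * oneServer;                   // ~16 GB/s
      double dataMB = terabytes * 1024 * 1024;
      System.out.printf("1 server: %.1f hours; 10 servers: %.1f minutes%n",
          dataMB / oneServer / 3600, dataMB / tenServers / 60);
      // ~2.9 hours to stream 16TB through one server vs ~17 minutes through ten
    }
  }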

If you want great IO bandwidth from HDDs, you need lots of them in parallel. RAID-5 filesystems with striped storage deliver this at a price; HDFS delivers it at a tangibly lower price. As the cost of SSDs falls, when they get integrated into the motherboards, you'll get something with better bandwidth and latency numbers (I'm ignoring wear levelling here, and hoping that at the right price point SSD could be used for cold data as well as warm data). SSD at the price/GB of today's HDDs would let you store hundreds of TB in servers, transform the power budget of a Hadoop cluster, and make random access much less expensive. That could be a big change.

Even with SSDs, you need lots in parallel to get the bandwidth -more than a single server's storage capacity. If, in say 5-10 years, you could get away with a "small" cluster with a few tens of machines, ideally SSD storage, and lots of DRAM per server, you'd get an interesting setup. A small enough set of machines that a failure would be significantly less likely, changing the policy needed to handle failures. Less demand for stateless operations and use of persistent storage to propagate results; more value in streaming across stages. Fewer bandwidth problems, especially with multiple 10GbE links to every node. Lots of DRAM and CPUs.

This would be a nice "Hadoop 3.x" cluster.

Use an SSD-aware descendant of HDFS that was wear-levelling aware, maybe mix HDD for cold storage, SSD for newly added data. Execute work that used RAM not just for the computations, but for the storage of results. Maybe use some of that RAM for replicated storage -as work is finished, forward it to other nodes that just keep it in RAM for the next stages in a batch mode, stream directly to them for other uses. You'd gain the storage capacity and bandwidth on which a single server will always lag a set of servers, while being able to run algorithms that can be more stateful.

In this world, for single rack clusters, you'd care less about disk locality (the network can handle that better than before), and more about which tier of storage the data was in. Which is what Ananthanarayanan et al. argued in Disk-Locality in Datacenter Computing Considered Irrelevant. You could view that rack as a storage system with varying degrees of latency to files, and request server capacity 64 GB at a time. The latter, of course, is what the YARN RM lets you do, though currently its memory slots are measured in smaller numbers like 4-8GB.
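
For what it's worth, the knobs for bigger allocations already exist on the YARN side; a sketch, using property names from the Hadoop 2.x yarn-default.xml (the 64GB and 96GB figures are mine, and these are cluster-side settings -setting them on a client Configuration here only illustrates the names):

  import org.apache.hadoop.conf.Configuration;

  /** Sketch of the YARN memory knobs involved in handing out big single-node allocations. */
  public class BigContainerConfig {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      // how much memory a NodeManager offers to the scheduler (normally set in yarn-site.xml)
      conf.setInt("yarn.nodemanager.resource.memory-mb", 96 * 1024);
      // the largest single allocation the scheduler will grant -raise it to hand out 64GB slices
      conf.setInt("yarn.scheduler.maximum-allocation-mb", 64 * 1024);
      System.out.println(conf.get("yarn.scheduler.maximum-allocation-mb"));
    }
  }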

YARN could work as a scheduler here -with RAM-centric algorithms running in it. What you'd have to do is ensure that most of the blocks of a file are stored in the same rack (trivial in a single-rack system), so that all bandwidth consumed is rack-local. Then bring up the job on a single machine, ask for the data and hope that the network bandwidth coming off each node is adequate for the traffic generated by this job and all the others. Because at load time, there will be a lot of network IO. Provided that data load is only a small fraction of the analysis -the up front load time- this may be possible.

It seems to me that the best way to keep that network bandwidth down would be to store the data in RAM for a series of queries. You'd bring up something that acted as an in-RAM cache of the chosen dataset(s), then run multiple operations against it -either sequential or, better yet, parallelised across the many cores in that server. Because there'd be no seek time penalty, you can do work in the way the algorithm wants, not sequentially. Because it's in DRAM, different threads could work across the entire set simultaneously.
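
A toy sketch of that pattern in plain Java -load once, then let every core scan the same in-memory array; the "query" is just a filter-and-count stand-in:

  import java.util.concurrent.ThreadLocalRandom;

  /** Toy illustration: one in-RAM dataset, scanned concurrently by all cores. */
  public class InRamScan {
    public static void main(String[] args) throws Exception {
      // "Load" a dataset into RAM once; in real life this would come off HDFS.
      final long[] data = new long[64 * 1024 * 1024];
      for (int i = 0; i < data.length; i++) {
        data[i] = ThreadLocalRandom.current().nextLong(1000);
      }
      final int cores = Runtime.getRuntime().availableProcessors();
      final long[] partial = new long[cores];
      Thread[] workers = new Thread[cores];
      for (int t = 0; t < cores; t++) {
        final int id = t;
        workers[t] = new Thread(new Runnable() {
          public void run() {
            // Each thread scans its own stripe of the shared array -no seeks, no shuffle.
            long count = 0;
            for (int i = id; i < data.length; i += cores) {
              if (data[i] > 500) count++;
            }
            partial[id] = count;
          }
        });
        workers[t].start();
      }
      long total = 0;
      for (int t = 0; t < cores; t++) {
        workers[t].join();
        total += partial[t];
      }
      System.out.println("matches: " + total);
    }
  }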

Yes, you could have fun here.

People have already recognised that trend towards hundreds of GB of RAM and tens of cores -the graph algorithms are being written for that. Yet they also benefit from having more than one server, as that helps them scale beyond a single server.

Being able to host a single-machine node with 200GB of data is something HDFS+YARN would support, but so would GPFS+Platform -so where's the compelling reason for Hadoop? The answer has to be that which today justifies running Giraph or Hama in the cluster -for specific analyses that can be performed with data generated by other Hadoop code, and so that the output can be processed downstream within the Hadoop layers.

For the MS Research example, you'd run an initial MR job to get the subsets of the data you want to work with out of the larger archives, then feed that data into the allocated RAM-analysis node. If that data could be streamed in during the reduce phase, network bandwidth would be less, though restart costs higher. The analysis node could be used for specific algorithms that know that access to any part of the local data is at RAM speeds, but that any other HDFS input file is available if absolutely necessary. HDFS and HBase could both be output points.

That's the way to view this world: not an either/or situation, which is where a lot of the Anti-MapReduce stories seem to start off, but with the question "what can you do here?", where the here is "a cluster with petabytes of data able to analyse it with MapReduce and Graph infrastructures -and able to host other analysis algorithms too?". Especially as once that cluster uses YARN to manage resources, you can run new code without having to buy new machines or hide the logic in long lived Map tasks.

The challenge then becomes not "how to process all the data loaded into RAM on a server", but "how to work with data that is stored in RAM across a small set of servers?" -the algorithm problem. At the infrastructure level, "how to effectively place data and schedule work for such algorithms, especially if the number of servers per job and the duration are such that failures can be handled by checkpointing rather than restarting that shard of the job". And "we can rent extra CPU time off the IaaS layer".

Finally, for people building up clusters, that perennial question crops up: many small servers vs fewer larger servers? I actually think the larger servers have a good story, provided you stick to servers with affordable ASPs and power budgets. More chance of local data, more flexible resource allocation (you could run something that used 128GB of RAM and 8 cores), and fewer machines to worry about. The cost: storage capacity and bandwidth; the failure of a high-storage-capacity node generates more traffic. Oh, and you need to worry about network load more too, but fewer machines reduce the cost of faster switches.

[Photo: the neolithic Avebury stone circle, midway between Bristol and London]

[updated: reread the text and cleaned up]

2012-07-07

The Hadoop recruiter minefield

There's a lovely post from Meebo on The Recruiter Honeypot, which looks at how recruiters work from the startup perspective.

I have no experience of that. What I do have is a fleshed out LI profile, on account of that is where I filled in my resume for the Hortonworks team, rather than have some paper thing elsewhere. Now I can see who is recruiting for Hadoop skills -and who is good and bad at it.

The trees are all that remains

The Meebo article praised trying to contact candidates by phone. That may offer a good time-to-hire strategy, but I'm not sure. A London-based recruiter tried exactly that to get me to join Spotify in Finland -by phoning me at work.

The conversation went something like

me: "Hello?"
them (excitedly): "Hello, I'm ... I'm phoning up to see if you're interested in working at Spotify"
(tersely) "How did you get this number?"
(smugly) "I called your front desk and asked to be put through"
(tensely) "You interrupted me"
(obliviously) "Spotify are a great startup and they've got vacancies in Finland..."
(tensely) "I don't like being interrupted when I'm coding"
(mildly apologetically) "Sorry, but there's this great opportunity and -"
(interrupting) "Don't call me at work again"
(desperately) "But, but, -how can I contact you?"
(harshly) "email -work out the address yourself"
hangs up.

That sixty second phone call broke my concentration. It may have cost the recruiter half an hour of time to work out the phone number, dial through etc, time they billed their customer. But it also cost me half an hour of my time which I happen to value more than their time. And it left me bearing ill will towards both the recruiter and their employer.

A day later they contacted me via LinkedIn. Not via any of my email addresses -all of which are easy to find if you put some effort in- but via LI.
Following on from our conversation.

I am currently recruiting for a Technical Lead of Analysis at Spotify, based in Stockholm. Hence, I would be very keen to talk to you directly to discuss your situation and the role in further detail.

I would be very keen to speak to you ASAP to send/brief you on the role and salary package, discuss your current situation in further detail and identify whether the role would be suitable for you or anyone else you may know?

If you are interested, please do provide me with your contact details (mobile or skype id) and a preferable time to chat.
My reply essentially told them that their phone call was an unwelcome interruption and that I was not prepared to talk to them about anything.
-Not interested

-I will note that you haven't done enough research to see what bits of Hadoop I work on -which, given its open source, is pretty weak. 
-I was pretty unhappy you called me. As a developer, I can't work with interruptions, and don't like them. Calling me disrupted what I was doing and cost me time. Your tactic may work for other vacancies, but for software jobs your strategy seems designed to alienate your potential candidates. It's also why I won't provide you with details of anyone else who may be suitable: I don't want them to blame me for ruining their coding times.

I'm not going to mark you as a contact, as per http://bit.ly/m0j1qM

-Steve
To design and build great things, you do need time to get things done without interruptions. Any company that implicitly permits developers being interrupted without warning before they've even joined the company is a warning sign. It'd be like job interviews that require three meetings with PowerPoint slides and an interviewer calendar-scheduling problem that is NP-hard. That's an organisation trapped in Outlook.

I was findable by email -if the recruiter was serious they could even have worked out my home phone number. But no, they chose their standard algorithm, "call them at work", despite the fact that this algorithm is likely to find their target in one of two states: "doing something like email, where they may be amenable to interruption, especially if they are looking for new employment", and "busy".

I don't know the probability of developers being in either state (I prefer "busy"), but if you are recruiting you do have to consider that the "call up your targets at work" algorithm may lead to success -but it can also generate more ill will towards your company than other techniques. Oh, and when looking for developers, you want the ones who can be productive.

Marca's Mercedes?

Looking at LI, approaches come in various forms.

1. The "friend request from a complete stranger".
Daniel P- has indicated you are a person they've done business with at F-: 
Apologies for contacting out of the blue but I was wondering if you would be interested in discussing an opportunity with a company developing real-time machine learning software. If this is not of interest please archive. Dan
I dismiss these automatically because of my "how can we be friends if I don't know you" policy, which is clearly stated on my LI profile. People who don't read that haven't done any research at all. Why this policy? It says "do not disturb". Paying for the premium recruiter account also pays the salaries of Wittenauer, Jakob and others that I have respect for. Recruiters who don't actually bother to pay LI the money for that account aren't serious professionals, nor do they contribute indirectly to Hadoop.

2. The desperate approach from somewhere round the world. In the past month I've been contacted by some Russian company offering a choice of London, NYC or Moscow, and by a Japanese company.
"I came across to your Linkedin profile and wanted to speak with you about the possible job opportunity in Tokyo. My client is a global internet conglomerate based in Tokyo and they are looking for Hadoop engineers for their big data group. 100% of the daily communication is being done in English in this team so no Japanese language skill is required. If you are interested, we can have a casual discussion over the phone or Skype to talk more about this position. Please let me know how I can get in touch with you."
I view this as a metric of Hadoop's success -and a sign that companies are trying to make Hadoop a central part of their business. It could also be a sign of a company in some high-visibility firefighting project, desperately trying to upskill to meet utterly unrealistic expectations on a project they don't have any understanding of at all. I do feel sorry for everyone involved here -and so the "sorry" note was at least polite.

3. The utterly under-researched approach. These are just painful. Here's one I was tempted to forward to Mike Olson with the note "if these are your partners, I'd worry about their data mining skills".

Take this one.
Steve,

I'm conducting a selective search to identify a knowledgeable individual that could assist G- in standing up an authorized Hadoop training practice. Your profile came to the top of my list.

G- has entered into a partnership with Cloudera.

If you could you let me know your interest in working with G- in a contract capacity to assist us building this product line, I'd certainly appreciate it. Also, if you know of anybody that may be interested in such an opportunity, please point them my way. thanks in advance for responding to my inquiry.

No attempt to consider the fact that I now work with the experts in Hadoop internals, the best QA team I've ever met, and am having lots of fun. Instead they say "would you like to work as a contractor teaching people about Hadoop" and say "we are a partner of Cloudera" as a metric of competence. Yet they are clearly not au-fait with the Hadoop world, including the fact that Hadoop experts are being offered jobs in Tokyo, Moscow and London, so there is no need to take up some contract position where your work alternates between "unpaid" and "doing the same course again".

Tip: if you are trying to get one of the core Hadoop developers to work as an instructor, you have to make it appealing. That wasn't it. I'd recommend something that not only competed with the #2 approaches in terms of compensation, but offered opportunities to do interesting Hadoop-related work as well as lecturing. Something like "three days a week you'll be teaching everything from basic Hadoop up to how to make Hadoop a key part of the next generation of enterprise applications. The remaining two days will be spent working in a part of Hadoop of your choice -including the new layers, tools to help end users, and doing papers and research in the system. We will also provide $XYZ of EC2 cluster time for you to do that work at realistic scales."

4. The very ambitious NIH complete rewrite of everything approach. While I'm not interested in working on these, they are fascinating enough that I follow up just to find out what is happening. A couple of months ago someone asked me if I was interested in
"a role working for a company who rejected Hadoop for their own in-house developed tool which can handle more data?"
This obviously merits a followup, given that known users of the stack include: Y!, Facebook, LI, eBay, Twitter -and that MS effectively stopped work on Dryad because even they couldn't justify the R&D spend for producing a complete competitor to a platform that is getting more sophisticated over time.
Your customer is clearly optimistic that they can sustain a competitive edge against the combined R&D spend of the Hadoop community, and have also taken a strategic decision to abandon the emergent layers above Hadoop MR as well as the integration tooling with data sources and back end databases.
Their reply:
"They're not looking to commercially compete with Hadoop but have simply rejected it in favour of their own tool which they feel offers more to their machine learning based software used for real-time 1to1 customer intelligence across multiple touchpoints."
I can see some of the appeal there -some aspects of the MR layer are biased towards scale over latency (e.g. heartbeat-driven workload pushing out, creation of new JVMs for specific task fractions). But I would not start from scratch there, as what you gain in speed you lose in the NRE costs of developing your own stack, the fact that you don't have the tools above (Pig, Hive, Mahout, Spring Data), and the fact that while there is a shortage in Hadoop skills, you can at least be confident that there is more than one person in the world outside your own company who knows Hadoop. No internal project has that luxury, which is why the recruiters are forced into approaching the Hadoop people -any of whom will need retraining before they can contribute to or use the in-house platform. That's the problem with NIH -the stuff you build is isolated.

Private Parking for Netscape communications

The approaches which do at least get a polite declination, then, are:

Emails to me that know who I am. Special mention of the Facebook recruiters here who actually cited some of my public presentations and then explained how that work would relate directly to their problems, and why working on that stuff at Facebook would be even more interesting, challenging and fun.

Emails to me from people I know -and introductions from others. The Apache & UK development community is not that big -if you know someone in that world, they may know others. If LinkedIn says someone is one hop away, rather than doing the impersonal "LI introduction" thing, get the contact you have in common to do an intro via email. Use the LI graph to your advantage, but personalise the approach. Example: someone from Microsoft who saw that we both knew Savas got him to do the email intros.

One painful failure here is recruiters who didn't look at the expertise and experience of the people at the company they were recruiting for, and so failed to turn anonymous approaches into direct approaches.

As an example, I got hit on by someone recruiting for IMDb who didn't look at the LI histories of founders like Col Needham and his colleagues, and so didn't know that the original Bristol group are all ex-HPLabs.

" Would you be interested in IMDb? They would be particularly interetsed in your Hadoop and and Cassandra experience. "

I forwarded that to Col and said "get them to do their homework". Interesting to know that IMDb are playing w/ Cassandra though -I wonder if they've done the Perl to ZK bridge too?

Others that spring to mind here include Google and Amazon. Yes, they are big and anonymous companies, but at least build graphs of your staff and people you are curious about.

Therefore, here's my counterpoint to the Meebo story, for the recruiters themselves.
  1. Do your homework before contacting anyone. That's shared history with staff from the co. you are recruiting for, specific skills of the person you are approaching -whether or not that actually matches your needs, etc. If there is shared history, approach the contacts in the company to see what they know of/think of the "target," then get them to reach out direct with an introduction. You may think you lose your fee, but if the target knows people in the company you'll do that anyway.

  2. Personalise. That means look at the specific skills and experience of the target and shape your email accordingly. Look at the blog postings, what they say on Twitter, anything stuck on SlideShare. Look at any papers and books they've written -and read them. If they've done things like that it may be their skills are not what you are looking for, they may have opinions you don't want -or it may be they are even better than what you were expecting and you should invest more effort in that individual than others.
  3. Make the lifestyle appealing. Personalise more than just technical issues. Look at the target's place of residence, lifestyle, etc. And shape accordingly: "our office is based within half an hour of some excellent skiing," "we have a great team who love to party out of hours," "the schools in the area are good," "housing is affordable," "you can work from home 2 days a week, and at customers on the continent 3 days".

  4. Make it appealing and interesting. That's not just compensation, but things that may appeal to developers in this world: scale of operations, area of work, algorithms ("cutting edge graphs", "near-real-time", "tier-1-web-property"), hardware ("ASIC-hosted algorithms with multi-TB of RAM/server"). On that topic, "an office kegerator" is not that compelling. Near my home I have the reference public houses of The Highbury Vaults and The Hillgrove within two and six minutes' walk respectively, along with some tier 2 and tier 3 establishments: Cotham Porter Stores, Beerd, the White Bear, Colston Arms, the Scotchman and His Pack, the Robin Hood, the Green Man, the Hare on the Hill and the Kingsdown Vaults. My beer storage infrastructure is outsourced into a High Availability solution with both redundant service nodes and multiple routing options.

  5. Don't end your missive with "do you know anyone else who might be interested?" This is the recruitment equivalent of saying "I have admired you for a long time and would like you to be my [girl|boy]friend -and if not which of your friends am I likely to get into bed with?" It destroys any perception that the recipient is considered unique and valued. Don't destroy the recruitment opportunity by trying to get the candidate to do your work for you.

  6. Never phone me at work. Which means, now I work from home up until the late evenings, never phone me at all.
[photos: the abandoned Netscape campus. Nothing left in the parking lot but a slowly rusting Mercedes. Netscape took the web mainstream, and died in the process.]

2012-07-02

Visiting the France Hadoop Users Group



At the invitation of Cedric Carbone from Talend, I went over to Paris to join in the third France HUG event, giving my talk and listening to the others.

It was really good to meet a group of people all of whom are involved in this stuff. These were Hadoop users, not "curious about all the hype" users -and I enjoyed not just talking about Hortonworks, Hadoop and HDP-1, but listening to what they are up to and the issues they have. I also got some lessons in technical French -all the Hadoop product names are designed for the language.

Now my HDP-1 slides:
I gave a quick demo at the end of the stuff I'm doing on availability -not the bit where vSphere recognises and kills a VM hosting a failed service, restarting it on the same physical host or, if that host itself is in trouble, elsewhere. Instead I showed how the JT can be set up to not only reload its queue of jobs, but go into "safe mode" either at the request of a remote administrator, or when it detects that the filesystem is offline.

Being able to put the JT into safe mode is beneficial not just for dealing with unplanned availability issues, but planned DFS maintenance. When you flip the switch on the JT it doesn't kill tasks in flight, but it doesn't worry if they fail; it doesn't blacklist tasktrackers or consider the job as failing. You can't schedule new jobs in safe mode either -while that's implicit in that HDFS isn't there to save your JARs, this just makes it more formal. When the FS comes out of safe mode, it reschedules the queued work (the whole job), and new requests can be added.

The UI can show safe mode too -though I've realised that before the code is frozen, the JT setSafeMode() call should take a string explaining why the system has entered this state. It would be set automatically on DFS failure, while a manual request would pick up whatever explanation you asked for, e.g. "Emergency Switch maintenance". Anyone who goes to the JT status page would see this message.
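
To show the shape of that change (this is the proposal, not code that exists today), the call would go from a bare flag to something like:

  /** Hypothetical sketch of a JT safe-mode call that carries an explanation with it. */
  public interface SafeModeProtocol {

    enum SafeModeAction { ENTER, LEAVE, GET_STATUS }

    /**
     * @param action enter, leave or query safe mode
     * @param reason free-text explanation ("Emergency switch maintenance", "DFS offline", ...)
     *               to show on the JT status page and, ideally, in exceptions sent to clients
     * @return true if the JobTracker is now in safe mode
     */
    boolean setSafeMode(SafeModeAction action, String reason);
  }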

DFS could have the same -in fact maybe the JT ought just to support a message of the day feature. That's feature creep: knowing why things are down is better. Indeed, I could imagine it being part of the payload of exceptions clients get: "W. says no jobs today".

The other talks gave me a chance to revise some French, with a talk on HCatalog and an attendee of the Hadoop Summit giving their summary of the event.

Afterwards: baguette, fromage, Kronenbourg 1664 -stuff worth travelling over for.

One interesting discussion I had was on the topic of ECC DRAM in servers. Having done all the reading on availability last year, I not only think that ECC is essential in servers, I think the time when it should be in desktop systems is drawing close. When machines are shipping with 4+GB of DRAM in even a laptop, P(single-bit-error) is getting high, as Nightingale's "Cycles, Cells and Platters" paper showed.

Yet as the attendees pointed out, today you are being given a choice of five ECC'd servers vs 15 non-ECC, and the tangible benefit of more servers is higher than the probabilistic benefit of ECC'd servers.

Why is ECC so expensive? It's not the RAM, which adds lg(data bits + ECC bits) worth of DRAM: 6 bits for a 32 bit word, 7 bits for a 64 bit line. The percentage of extra ECC bits over DRAM decreases as the line gets wider, so as we move to wider memory buses, the incremental $ and W cost of ECC should decrease. Why then the premium? Chipsets and motherboards. ECC is viewed as a premium luxury that isn't even an option in the consumer chipsets, Atom parts in particular. That's just an attempt to gain a price premium on server designs -not in the chipset and mainboard, but on the CPUs themselves.
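
The arithmetic behind those numbers -the smallest k such that 2^k >= data bits + k + 1- works out as:

  /** Check-bit count for a single-error-correcting Hamming code over various word widths. */
  public class EccBits {
    static int checkBits(int dataBits) {
      int k = 0;
      while ((1 << k) < dataBits + k + 1) {
        k++;
      }
      return k;
    }

    public static void main(String[] args) {
      for (int width : new int[] {32, 64, 128, 256}) {
        int k = checkBits(width);
        System.out.printf("%d data bits: %d check bits (%.1f%% overhead)%n",
            width, k, 100.0 * k / width);
      }
      // 32 -> 6 (18.8%), 64 -> 7 (10.9%), 128 -> 8 (6.3%), 256 -> 9 (3.5%):
      // the relative cost of the extra ECC bits shrinks as the line gets wider.
    }
  }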

This led into a discussion on what the ideal Hadoop worker node would be. Not an enterprise-priced hot-swap-everything box -that makes sense for the master nodes, but not for the many workers. There are the 'web scale' server boxes that the PC manufacturers will sell you if you have a large enough order. These are the many-disk, nearly-no-end-user-maintenance designs that you can buy in bulk -but it has to be bulk. They can contain assumptions that they are the sole inhabitants of a rack (things like front-panel networking needing a compatible ToR switch), and they are for organisations whose clusters are so big that disk failures are treated as an ongoing operations task, which can be addressed by decommissioning that node and then replacing the disks at leisure. At scale this is the right tactic -as the decommissioning load is spread across the entire cluster's network bandwidth, it's not an expensive operation LAN-wise.

In smaller clusters you not only lack the spare disk capacity to handle a 24 TB server outage, the impact on your bandwidth is higher. Provided you can power off the server, swap over the disk, have the machine booted, the disk formatted and mounted and Datanode live within 15 minutes, you do a warm swap today. If servers were made for that swap in/out to be easier than with the current web-scale servers, this could/should be possible.

What do we want then?

  • Easy to warm swap disks in and out
  • Not mandate a single model per rack -so network ports, PSUs and racks can't require this.
  • ECC memory
  • Not require the latest and best CPUs and chipsets. A bit behind the model curve makes a big difference in ASP.
  • Low peak power budget. I know the i7 chipsets can power off idle cores, and even raise the voltage/clock speed of the active cores in such a situation so they finish their work faster in the same power envelope. But in a rack you need to consider the whole rack's power budget: how to stay not only within it, but within the billing range of the colocation site -some of whom bill by peak possible power budget, not actual use.
  • Topology information. Some of the HP racks now do this for the admin tools -be nice to see how to get that into Hadoop.
  • Support 2x1GbE LAN, with throughput that meets the peak load of re-replicating an entire lost rack.
  • Option for adding a lot more RAM, without having to pull out and throw away the RAM it initially came with.
  • Blinking lights on the front that you can control in the user-level software. You know you need it. This may seem facetious but it's how you direct people in the datacentre to the box they need to look at, and it's how you qualify the network topology (light up all boxes in a row across all racks -boxes that don't light are offline or mis-connected). Plus Thinking Machines showed that Blinky Lights look good at scale.
  • Option to buy compatible CPUs to fill in the spare socket in twelve months time. That means the parts should still be on the price list.
  • Option to buy compatible servers 12 months down the line, where compatible means "same OS", even if storage and CPU may have incremented.
  • CPU parts with excellent Hadoop/Java performance, good native compression and checksumming.
  • Linux support for everything on the motherboard.
  • Options for: GPU, SSD
  • Maybe: USB BIOS updates. On a 50 node rack this is just about manageable and means you can skimp on the ILO management board and matching network.
  • Ability to get system state (temp, PSU happiness) into OS and then into the management tools of the cluster -even the Hadoop cluster.

No need for:

  • RAID controllers.
  • Hot swap PSUs, though ability to share peer PSUs is handy. A cold-swap PSU and an on-site spare should suffice, given the rate of PSU failures ought to be much, much less than those of disks.
  • Top of the line CPU parts whose cost ramps up much more than performance versus the previous model.
  • Dedicated management LAN and ILO cards. Nice and makes managing large clusters much easier, but they add cost that small clusters can't justify.
  • Over-expensive interconnect (did anyone say Infiniband?)

I don't know who is doing this yet -if you look at the prepackaged Hadoop stacks they are all existing hardware. Things will no doubt change -and once that happens, once people start optimising hardware for Hadoop, we will all get excellent value for money. And Blinking Lights on the front of our boxes that MR jobs can control.