A WS-I standards body for Hadoop? -1 to that

see no evil at sunset

There's an article from IBM which argues that Hadoop needs to copy WS-I and OASIS to have a set of standards:
In the early 2000s, the service-oriented architecture world didn’t begin to mature until industry groups such as OASIS and WS-I stabilized a core group of specs such as WSDL, SOAP, and the like.
I despair. Anyone who credits OASIS and WS-I for this does not know their history -or is trying to rewrite it.

The initial interop came from the soapbuilders mailing list, in which Sam Ruby (IBM, ASF), Davanum Srinivas (CA->WSO2->IBM; dims at Apache) and Glen Daniels (at the time, Macromedia + Apache) all played key parts. Anyone in IBM curious about soapbuilders should ask Sam about it.

Soapbuilders was the engineers: finding problems, fixing them. Agile, fast, focused on problems, fast turnaround in the evolving SOAP stacks.

It died. Killed by WS-I

WS-I wasn't the engineers, it was the standards people. Suddenly the battle lines were drawn over whose idea was going to be standardised. Take the problem of shipping binary data around. Base-64? Works, but inefficiently. Microsoft's DIME? SOAP with Attachments? MTOM? WS-I was where the battles were fought, sometimes won, sometimes solved with "all of them".

I have been known to express opinions on the cause of interop problems in SOAP; I'm not going to revisit that, except to note that the focus of SOAP interop settled on Java<->.NET interop, which was addressed not by standards bodies but by plugfests, the standards themselves being too vague to cover the hard issues, especially defining which proper subsets of WS-* different stacks would support, and which proper subsets of WSDL could be generated and parsed. Ideally the parseable subset was a superset of the generatable subset, but, well, there can always be surprises. Those discussions would end up on soapbuilders, where maybe the developers could fix them without the standards team getting in the way.

I'm going to pick on one specification as the worst of all standards. WS-A.
  1. Replaced a simple concept "URL" with an arbitrary lump of XML.
  2. Went through an iterative development process that resulted in multiple versions
  3. Used xml namespaces to identify the versions. 
  4. Used three namespaces -2003, 2004 and 2005/08- to identify four different versions. The 2005/08 one has both an "interim" and a "final" release.
That's what happens when you put a standards body like WS-I in charge of something as simple as a URL. Not only do they make a vast mess of it, you get that vast mess in different XML namespaces.

As for OASIS? WS-RF. That's all I need to say to anyone involved in WS-* or any of the grid proposals. A standard for managing "resources" -leased things at the end of WS-A addresses, across a network. A standard that managed to include two different WS-A versions in it.

Think about that for a minute. You have a "standard" produced by a standards body that is somehow affiliated with the UN and has some official status, yet it cannot push out one of the WS-* specification suites without incorporating two different concepts of "how to address a SOAP endpoint" within that single 'coherent' suite of documents -ending up including non-normative drafts as well as the final versions.

I am not going to make any statements on Hadoop standardisation in this context as it will probably be taken as an official stance rather than my personal opinions. There is a section on Hadoop Compatibility that I wrote in Defining Hadoop; it sounds like some people and organisations ought to read that article.

I do want to close with the following point:
WS-I and OASIS were not bodies capable of producing a "standard", where "standard" could be defined as a coherent, consistent and testable set of protocols. Instead they were places where vendors could push their own agendas, where the winners were the organisations capable of funding the most participants in the standards process, or those willing to do the most back-room deals with others.

The compromises needed to get anything out the door in even a hopelessly untimely manner produced an incoherent mess of XML namespaces, schemas and protocol issues that anyone working on SOAP stacks still has to deal with today.
REST did not win just because it was architecturally cleaner, or because it was more powerful. It won because the alternative was the set of WS-* specifications that came out of WS-I and OASIS. Those organisations did not set WS-* on its route to global success; they condemned it to the niche of intra-enterprise Java/.NET communications, a decade after CORBA could have done the same thing better.

[photo: sunset on Nelson Street from St Michael's Hill]


Hadoop in Practice - "Applied Hadoop"

Recent train journeys to and from London have given me a chance to get the laptop out and read some of the collected PDFs of things I know I should read.

St Pauls Graffiti

I was given a PDF copy of Hadoop in Practice [Holmes, 2012] on account of the fact that I'd intermittently been in the preview program -but I'd not looked at it in any detail until now. The (unexpectedly) slow train journeys to and from London have been an opportunity to unfold the laptop and read it -and, at home, while I wait for EC2 to respond to whirr requests, to read it to the end -though not in as much detail as it deserves.

The key premises of this book are
  1.  You've read one of the general purpose "this is Hadoop" books -either the Definitive Guide or Hadoop in Action.
  2.  You want to do more with Hadoop.
  3.  You aren't concerned with managing the cluster.
  4.  You are concerned about how to integrate a Hadoop cluster with the rest of your organisation.
#3 means that there's nothing here on metrics, logging or low-level things. This is a book for developers and (yes) architects; less the operations people. Even so, the sections on integration with other systems -especially hooking up to log sources and databases- cover things they need to know about.
Although it starts off with a quick overview of Hadoop and MapReduce, internals -such as how HDFS works- are relegated to appendices for the curious. Instead, the first detailed chapter looks at Ingress and Egress, or, so as not to scare readers, "Moving Data in and out of Hadoop", looking mostly at Flume, mentioning Chukwa and Scribe, and then moving on to using Oozie-scheduled MR jobs to pull data -something covered in an example in the book.

It doesn't delve into the aspects of this problem you'd need to worry about in production -data rates, the risk that MR pull jobs can either overload the endpoints or, unless they are split up well, can create imbalanced filesystems. Ops problems -or just too much to worry about right now.  What it does do is show why a workflow engine like Oozie is useful: to automate the regular work.

It glues the Hadoop ecosystem together. Want to parse XML? Grab the XML input reader from Mahout. Want to work with JSON? Twitter's Elephant Bird… etc. In fact the serialization chapter went into the depths of XML and JSON parsing -and showed the problems, so justifying the next stage: Protobuf, Avro and Thrift.

There's a chapter on tuning problems which focuses more on code-level issues than hardware; this is where the line between ops & developers gets blurred. I think I'd have approached the problem in a different order, but the tactics are all valid.

Installation-wise, Alex points everyone at a version of CDH without LZO support; he has to talk people through building it. I don't know where Cloudera stand on that, as I know yum -y install hadoop-lzo works for HDP, and is up there with hadoop-native, hadoop-pipes, hadoop-libhdfs and snappy as RPMs to add (update: see below). I'd have liked to have seen Bigtop as the centre of the universe, to be more neutral -something to hope for in the second edition.

There's a few chapters on "data science" stuff: bloom filters, simple graph operations, R & Hadoop integration. I get the feeling that this section is very handy if you know your statistics and want to work with a new toolset. The problem I have there is a personal one: I've forgotten too much of what I knew about statistics. Min, max, mean, Poisson, Gaussian and Weibull distributions, the notion of Markov chains -these are all concepts I know about, but ask me the equation behind a Poisson distribution and I stare as blankly at the questioner as our pet rabbit does when asked why he's been chewing power cables: there's no comprehension going on behind the eyeballs. I really need something that covers "statistics for people who used to know it vaguely -using R & Pig as the tools". There's a good argument for all developers to know more stats. This book isn't that -it does assume you know your statistics, at least better than I do.

Alex Holmes delves into MRUnit, which is a good way of unit testing individual operations. I tend to do something else: MiniMRCluster -but that one, while more authentic, can push problems onto different threads and so make it harder to identify root causes of problems -or isolate tests. MRUnit doesn't have that flaw, and nor does LocalJobRunner -which also gets coverage. The only thing that grated against me there was that the tests were done in Java -I've been using Groovy as my test language for the whole of 2012, and the sheer verbosity of setting up lists in Java, and the crudeness of JUnit's assertions compared to Groovy's assert statements, are painful to look at.

For anyone who's never used Groovy, its assert statement takes advantage of the compile-on-demand features of the language. On an assertion failure, the output walks through the entire expression tree, evaluates every part in turn and gives you the complete tree for your debugging pleasure. You can write one all-encompassing assertion, rather than break down each part of a large query into various assertNotNull, assertTrue, assertEquals calls -and if the single assert fails, there should be enough information for you to track down the cause.  That's why I like testing in Groovy, irrespective of whether or not your production code is in Java.

Other points: the ebook comes with your email address at the bottom, but no epub-esque security. This works on your Linux workstation as well as whatever tablet you choose to own -and relies on publicity & guilt to stop sharing. Which is probably a good strategy. That eBook comes with a feature I've never seen before: the page numbers in the contents match exactly the page numbers in the book -there must be some Framemaker magic that tells Preview &c the offset to apply after the user hits the "go to page" button.

Summary: this isn't a book for newbies -precisely because it delves into Applied Hadoop. Even so, it's something you ought to have to hand, just so you aren't one of the people posting questions to user@hadoop that everyone else stares at and generally refuses to answer -the "hello, I have got a pseudo-distributed cluster that cannot find localhost, here is the screenshot of the DOS console, please help!!!" kind -while forgetting to even include the screenshot of their hadoop.bat command line failing because they've forgotten to do something foundational like install Java.

Everyone but @castagna will learn something new -in fact maybe even him, because he needs something to read on test runs and trains to London (which is where I'm writing this, somewhere between Reading and London Paddington).

Update: Eric Sammer says of the LZO thing "hadoop-lzo in cdh, it's because of license concerns that we don't distrib."


Rethinking JVM & System configuration languages


I've been busy in Apache Whirr, with a complete service that installs HDP-1 on a set of cluster nodes -WHIRR-667; the source is all up on Github for people to play with. As a result, someone asked me why I'm not using SmartFrog to provision Hadoop clusters.
Having used it as a tool for a number of years, I'm aware of its flaws:

Specification language
  • Hard to track down where something gets defined
  • x-reference syntax a PITA to use and debug
  • Fuzzy distinction about LAZY eval vs. pre-deploy evaluation (LAZY is interpreted at deployment, but 'when' is ambiguous)
  •  RMI is wrong approach: brittle, often undertested in real world situations, & doesn't handle service restarts as references break.
  •  Wire format was serialized Java objects; the Object->Text->Parse->Object round trip proved surprisingly problematic (not defining the text encoding didn't help).
  •  Security so fiddly that we would often turn it off.
  •  Doesn't work unless Java is installed and the network is up -so not so good for basic machine setup from inside the machine itself, only outside-in (which is partly what Whirr does).
  •  Java doesn't let you get at many of the OS-specific details (permissions, process specifics); you end up hacking execs to do this.
  •  The way you imported other templates (#import keyword) was C-era -multiple imports would take place, and the order in which they were loaded mattered.
  •  Shows its age -doesn't use dependency injection and becomes hard to work with (NB: whirr doesn't inject either)
In defence:
  •    it's not WS-*
  •    language better than XML (especially spring XML)
  •    good for writing distributed tests in
  •    Most XML languages insert variable/x-ref syntaxes in different ways (ant, maven, XSD, ...); SF has a formal reference syntax that doesn't change.
  • Being able to x-ref to dynamic data as well as static is powerful, albeit dangerous as the values can vary depending on where you resolve the values, as well as changing per run. And they stop you doing more static analysis of the specification.
  • Being able to refer to string & int constants in Java source is convenient too (classpath issues notwithstanding). For example, I could say:
serviceName: CONSTANT org.smartfrog.package.Myclass.SERVICE;

    The constant would then be grabbed from source. This may seem minor, but consider how often string constants are replicated in configuration files as well as source -and how a typo on either side creates obscure bugs. Eliminating that duplication reduces problems.
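A sketch of how an interpreter might resolve such a reference, here in Python rather than Java (the module, class and constant names below are invented for the demo; the real SmartFrog mechanism resolves Java classes from the classpath):

```python
import importlib
import sys
import types

# Fake module standing in for compiled code on the classpath
# (all names here are hypothetical, for demonstration only).
mod = types.ModuleType("org_smartfrog_package")

class Myclass:
    SERVICE = "hadoop-namenode"

mod.Myclass = Myclass
sys.modules["org_smartfrog_package"] = mod

def resolve_constant(ref: str):
    """Resolve a 'module.Class.NAME' reference to the constant's value,
    so the config file names the constant instead of duplicating it."""
    module_path, cls_name, const_name = ref.rsplit(".", 2)
    holder = getattr(importlib.import_module(module_path), cls_name)
    return getattr(holder, const_name)

service_name = resolve_constant("org_smartfrog_package.Myclass.SERVICE")
print(service_name)  # hadoop-namenode
```

A typo in the reference fails fast at resolution time, instead of silently creating a string that matches nothing.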
Looking at Whirr I can see how the two-level property file config design has limits (all extended services need to have their handlers declared in every config that uses them); templates of some form or other would correct this.

Ignoring the specific issue of VM setup (I need to write a long blog there criticising the entire concept of VM configuration as it is today, as it's like linking a C++ app by hand), I'd do things differently.
I think we need a post-properties, post-SF language: a strict superset of JSON which it could be compiled down to, with property expansion in x-refs, the ability to declare which attributes to inject/are mandatory, and some Prolog & Erlang-style list syntax to make list play easier. No dynamic values, because that prevents evaluation in advance.

"org.apache.whirr.hdp.Hdp1": org.apache.whirr.hadoop.Hadoop {
  "port": 50070,
  "logdir": "/var/log/${user}",
  //Extend the list of things to inject
  "org.smartfrog.inject": ["logdir" |super:"org.smartfrog.inject"]

The template being extended would be this:
"org.apache.whirr.hadoop.Hadoop": {
  "timeout": 60000,
  "port": 50070,
  "description": "hadoop",
  "org.smartfrog.class": "org.apache.whirr.service.hadoop.HadoopClusterAction",
  "org.smartfrog.inject": ["timeout", "port","install" "configure","user"],
  "org.smartfrog.require": [install", "configure"]

This would compile down to an expanded piece of JSON; as it expands out to plain JSON, you could use it anywhere JSON is accepted.
"org.apache.whirr.hdp.Hdp1":  {
  "timeout": 60000,
  "port": 50070,
  "description": "hadoop",
  "logdir": "/var/log/mapred",
  "org.smartfrog.inject": ["logdir" ,"timeout", "port","install" "configure","user"],
  "org.smartfrog.class": "org.apache.whirr.service.hadoop.HadoopClusterAction",
  "org.smartfrog.require": [install", "configure"]

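The expansion step itself is just a merge rule: start from the parent template, overlay the child's attributes, and append to -rather than replace- the inject list. A minimal Python sketch of that rule, assuming templates are plain dicts with the names from the examples above:

```python
def expand(parent: dict, child: dict) -> dict:
    """Expand a child template against its parent: child keys win,
    except the inject list, where the child's entries are prepended
    to the parent's (the "| super:" syntax above)."""
    merged = dict(parent)          # start from the parent template
    for key, value in child.items():
        if key == "org.smartfrog.inject":
            merged[key] = value + parent.get(key, [])
        else:
            merged[key] = value
    return merged

hadoop = {
    "timeout": 60000,
    "port": 50070,
    "description": "hadoop",
    "org.smartfrog.class": "org.apache.whirr.service.hadoop.HadoopClusterAction",
    "org.smartfrog.inject": ["timeout", "port", "install", "configure", "user"],
    "org.smartfrog.require": ["install", "configure"],
}

hdp1 = expand(hadoop, {
    "logdir": "/var/log/mapred",
    "org.smartfrog.inject": ["logdir"],
})

print(hdp1["org.smartfrog.inject"])
# ['logdir', 'timeout', 'port', 'install', 'configure', 'user']
```

The output dict matches the expanded JSON above; real template resolution would also need the x-ref and ${user} expansion, which this sketch skips.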
  1. Importing is a troublespot -if you required fully qualified template references that mapped to specific package & file names, then you could just have a directory path tree (a la Python), possibly with zip file/JAR file bundling, and have the templates located there.
  2. I'm avoiding worrying about references; you'd need a syntax outside of strings to do this. It'd be a lot simpler than the SF one -fully qualified refs again, up/down the current tree, and to the super-template.
  3. No runtime references.
This syntax would be parseable in multiple languages; expansion to pure JSON would be the serialization format.
A Java interpreter could take that and execute it, doing attribute injection where requested, failing if a required value is missing. Behind the scenes you'd have things that do stuff. I'd also look very closely at using Java at all, not just because I'm enjoying living in a half-post-Java world (Groovy for tests, GUIs &c), but because Java doesn't let you get at the OS-specific details that machine setup needs.

One other possibility here is that given it's JSON, embrace JavaScript more fully. What if you have not only the configuration params, but the option of adding .JS code in there too; you could have some fun there.

A cluster would be defined from this, here using the same role-name concept that whirr uses, with something like:
"1 hadoop-namenode+hadoop-jobtracker, 512 hadoop-tasktracker+hadoop-datanode"

In a JSON template language you'd split things up more & use lists. It's more verbose, yet tunable.
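Unfolding the compact role string into that list form is mechanical; a rough Python sketch (the function and key names are my own, chosen to match the JSON shape used in this post):

```python
def parse_roles(spec: str):
    """Turn a whirr-style "count role+role, ..." string into a list
    of template dicts: one entry per instance-template group."""
    templates = []
    for group in spec.split(","):
        count, roles = group.strip().split(" ", 1)
        templates.append({
            "Services": roles.split("+"),
            "Count": int(count),
        })
    return templates

cluster = parse_roles(
    "1 hadoop-namenode+hadoop-jobtracker, "
    "512 hadoop-tasktracker+hadoop-datanode")
print(cluster)
```

The list form carries the same information, but leaves room for per-template attributes -hardware size, AMI, placement- which the one-line string cannot express.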
Your cluster templates would extend the basic ones, so a cluster targeting EC2 could extend "org.apache.whirr.hdp.Hdp1" and add the EC2 options of AMI location and AWS region (West Coast 2, obviously), as well as authentication details -or leave that to the end. (There are some thoughts on mixins arising here; let's not go there, but I can see the value.)

stevecluster: ClusterSpec org.apache.whirr.hdp.Hdp1 {
  "templates": {
    "manager": {
      "Services": ["hadoop-namenode", "hadoop-jobtracker"],
      "Count": "1"
    },
    "worker": {
      "Services": ["hadoop-tasktracker", "hadoop-datanode"],
      "Count": "255"
    }
  }
}
A template without the login facts would need to be given the final properties on startup, props that could be injected as system properties (launch-cluster --conf stevecluster.jsx -Dstevecluster.ec2-ami=us-west2/ami5454). Properties set this way would automatically override anything set. That is, unless there is (somehow) support for a final attribute, which Hadoop uses to stop end users overwriting some of the admin-set config values with their own. Without going into per-key attributes, you could have a special key, final, which took a list of which of the peer attributes were final. Actually, thinking about it more, @final would be better. Which would be hard to turn into JSON…
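The override-unless-final rule is simple to state in code; a sketch of those semantics, assuming the special "final" key holds a list of locked peer attributes (the config values here are hypothetical):

```python
def apply_overrides(config: dict, overrides: dict) -> dict:
    """Overlay -D style overrides onto a resolved config, skipping
    any attribute listed in the peer "final" list -mirroring the
    final=true flag Hadoop uses on configuration properties."""
    final_keys = set(config.get("final", []))
    result = dict(config)
    for key, value in overrides.items():
        if key not in final_keys:
            result[key] = value
    return result

config = {
    "ec2-ami": "us-east/ami1234",
    "port": 50070,
    "final": ["port"],   # the admin has locked the port down
}
resolved = apply_overrides(
    config, {"ec2-ami": "us-west2/ami5454", "port": 8020})
print(resolved["ec2-ami"], resolved["port"])  # us-west2/ami5454 50070
```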

I could imagine using the same template language to generate compatible properties files today; this JSON-template stuff would just be a preprocess operation to generate a .properties file. That's making me think of XSLT, which is even scarier than mixins.

I have no plans to do anything like this.

I just think a template extension to JSON would be very handy, and that some aspects of the SmartFrog template language are very powerful & convenient, irrespective of how they are used.
If someone were to do this, the obvious place in Apache-land would be in commons-configuration, as then everything which read its config that way would get the resolved config. That framework is built with hierarchical property files -think log4j.properties- so resolves everything to a string and then converts to numbers afterwards. Lists and subtrees are likely to be trouble here -albeit fantastic.


After a week of OS/X mail, I'm (almost) pining for Outlook

Jamaica Street

Because networking from the hotel room last week was limited to a tethered 3G phone, I switched to a local email program for my messages, saving bandwidth and allowing offline use. That email program was Apple Mail for Mountain Lion. I then decided to follow through by using it for a whole seven days. Never again.

First, the UI isn't that great. The most glaring problem is that its read/unread marker is a small pastel blue dot to the left of the summary -a summary that has the sender in bold, with the first couple of lines below. Every other modern email program (Outlook, Thunderbird, Gmail, Y! Mail, live.com) uses bold to mean "unread", but no, Apple think "bold is for the sender", and "unread can be a barely visible dot to the side".

I could maybe get used to that. What I can't get used to is the way that emails on a gmail account seem to magically get deleted, even though I didn't delete them. It looks a bit like there is some auto-aging feature, but it deletes entire conversations, and does it without warning. Fortunately, very fortunately, gmail moved the messages to the "bin" folder, where I've been able to select them all and restore them to the inbox.

It's destroyed my trust in the program. If you can't rely on it not to discard conversations, you can't rely on it. At which point, it's in the do not use category.

What does that leave for the machine? Thunderbird, and, er, Outlook for OS/X. Having the latter installed, I'm considering that with IMAP->Gmail. This could be some leftover from my time at HP; I am secretly missing large ppt-ware and MS Word documents hitting my inbox, maybe even missing the bizarre dialogs that would pop up.

My past issues with Outlook on Windows are well documented. That set of blog entries are the best argument as to why I shouldn't try Outlook on OS/X.

[photo: jamaica street; painting the lamp post to match the wall is becoming a tradition. It makes for better front-on photos if they've done it right, as the lamp post becomes invisible]


Strata EU: Logistics

I was at Strata EU last week -the first time ORA had hosted it in the EU.

Rather than go into the details, I'm going to look more at logistics. As a speaker, I got to stay in the hotel, the London Hilton Metropole, positioned where the Westway flyover rises off Edgware Road; 3.2 miles from where I grew up in West Hampstead. The hotel was very close to Paddington Station, ideally positioned for people coming from LHR or Bristol. Unfortunately, I was approaching from Portsmouth, so ended up at London Waterloo, south of the river.

A Sunday evening was the ideal time to try out my Boris Bike key and cycle over there in the half hour of free-ride time you get. I first took the footbridge over the river to Charing Cross and then went over Trafalgar Square before starting this -negotiating one of the bridges of death didn't appeal to me.

Getting the bicycle proved harder than I thought, as the key wouldn't let me pick any up; plugging it in by the touch screen brought up a page saying "call Transport for London". Which I did, above the traffic noise, and got someone who said I had yet to authenticate the key and had to do that there and then, including answering one of the security questions. Without getting the laptop out I couldn't do that, but managed to get by without it -and at the end was told the answer to the question, which involved Boris and some very negative phrases. They must get that a lot.

When I got the key in the post, TfL had included a nice map of central London showing all the bike rental sites. What that map didn't do was show sensible cycling routes. I could certainly get to the hotel via Regent St, Oxford St, Marble Arch and then Edgware Road -trivial routing- but not one that leaves you happy.

Instead I used the cycling layer of an Open Street Map viewer on my phone and meandered up the expensive parts of Westminster, over Hyde Park and then up, where I got fed up of repeatedly location checking and just went up Edgware Road instead, soon to dock the bike. Some blue lines on the TfL map would have been convenient.

This was my first trip on the TfL rental bikes, and they were surprising.
  1. They are barges with awful friction and rolling resistance. I know they are powering some blinky LED lights, but even so they are slow. The gearing doesn't help either; it goes low but its top option would be low-mid-ring on my commuter.
  2. Those blinky lights are pretty awful, especially the front one. The only way you'd be seen against the illumination of the chicken fast food restaurants on the Edgware Road would be as a silhouette eclipsing the chicken broilers in the front windows. You are in darkness on Hyde Park too -these are not for nighttime MTB races.
  3. The brakes are dire too, with minimal reaction. I'd view that condition on my commuter as an emergency, not a normal state of affairs. I've realised why they are so bad -if they were set up the way mine are -light touch onto disk brakes- too many riders would be straight over the front bars as they (literally) hit that first junction. You just need to keep your speed down, especially given the inertia of the land-barge.
  4. Not a good turning circle.
Overall: not great, but they got me to where I wanted to be without going near the tube; given some time I'll learn my way around better.
The hotel was OK, except I couldn't get the wifi to work in my room, even when entering my (surname, room) info. A call to reception informed me that I actually needed to pay extra for wifi. That was like falling back in time. I almost expected them to tell me that there was a phone socket for my modem. I declined the option of room wifi and just flipped my mobile into Wifi hotspot mode to take advantage of my "unlimited" data option that I'd bought from 3 this month. Functional, albeit slow.

The room was on floor 10 -in the morning I could see the tower block near my house & from there orient myself to the trees behind, hence to the trees above. That's the closest I've been to it for 15 years. Maybe I should visit it sometime.

The next day, breakfast and conference. I found a good cafe nearby with Illy coffee and chocolate croissants -something to remember the next time I am in Paddington station waiting for a train.
The conference was fun -loitering near the booth meant I spent more time meeting other attendees than in talks -but the few I made were good. In particular: James Cheshire's visualisation talk showed some beautiful visualisations of data projected or animated onto maps of London; a talk on Cause and Effect really laid down how to do effective tests -a key point being a negative result is a result, so don't ignore it.

I also enjoyed Isabel Drost's talk on big data mistakes, where she got everyone to own up to getting things wrong -like creating too many small files, accidentally deleting the root tree of the FS, running jobs that bring down the cluster, etc. A lot of the examples credited someone called "Steve" -I have to own up to being this person. I consider breaking things to be an art. Indeed, I couldn't even watch her slides without having to file a bugzilla entry: https://issues.apache.org/ooo/show_bug.cgi?id=120767
cat and mouse
If there was one problem with the conference site -it was that rooms were too scattered. After day 1 you'd learned your way around, but it still took five minutes to get to each talk -cutting each talk down by five minutes. It also stopped you running out of a talk you didn't like and going to another one. Not that I'd do that -or expect anyone in the audience of my talk to do such a thing.


Ingress and Egress

Last week someone from British Telecom/BT came round to boost my networking, running Fibre to the Cabinet and then re-enabling the existing Copper-to-the-Home from there.
Upload statistics
As the graph shows, it's got a lot faster than the Virgin cable: download has gone from 12.7 Mb/s (vs. a promised 20 Mb/s) to ~54-55 Mb/s, a 4x improvement, while upload has gone from a throttled 2 Mb/s to 15 Mb/s -7x. That 7x upload speed is what I was really looking for -both the ADSL and cable offerings are weak here, with Sky being the worst at 0.8 Mb/s, which is pretty atrocious. The cable modem offering also suffered from collapsing under load in the evening, especially when the students were back (i.e. this time of year). I don't have that problem any more.

Now I can not only download things, I can upload them. In fact, this network is now so fast that you can see other problems. As an example, the flickr uploader used to crawl through each photo upload. Now it sprints up -so much so that the per-photo fixup at the end becomes the pause in the progress bar, not a minor detail.

It's on the downloads, though, where problems arise -problems down to TCP and HTTP. HTTP likes to open a new connection on every GET/POST/PUT/HEAD/whatever operation. TCP has throttling built in to stop it flooding the network. Part of that throttling is slow start: rather than streaming data at the full rate claimed by the far end, TCP slowly ramps up its window size based on the acknowledgements coming back to it. Acknowledgements that depend on getting back from the remote host -and hence on the round trip time, not the bandwidth. Even though my bandwidth has improved in both directions, the distance to remote servers and the number of hops is roughly the same -it's just that now the slow start is visible.

Take an example: the NetFlix progress bar at the start of a video. It begins slowly filling up. Suddenly, half way along, it picks up speed and fills the rest of the bar in a second, compared to the 4-5 seconds for the first half.

What I am seeing there is latency in action.

It shows the real difference between a 100 Mb/s LAN and WAN connections at a sizeable fraction of that. 100 Mb/s LAN isn't too bad for pushing data between two boxes adjacent to each other -and ramp-up time is negligible. Over a distance, it's latency and round trip times that make short-lived TCP operations -of which HTTP GETs are a key example- way slower than they need to be.

Google have a paper discussing this and arguing for increasing the initial window size. For those of us with long fat pipes, this makes sense. I don't know about all those mobile things though.
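The round-trip arithmetic is easy to sketch: under idealised slow start the window doubles each RTT, so a short transfer's duration is a function of round trips, not bandwidth. A rough Python model (classic doubling, 1460-byte segments, no losses; comparing a small initial window with the larger one the Google paper argues for):

```python
def rtts_to_send(total_bytes, init_window_segments, mss=1460):
    """Round trips needed to move total_bytes under idealised
    TCP slow start: window doubles every RTT, no losses."""
    window = init_window_segments * mss  # bytes sendable in first RTT
    sent, rtts = 0, 0
    while sent < total_bytes:
        sent += window
        window *= 2
        rtts += 1
    return rtts

# A ~100 KB HTTP response: on a 100 ms path every round trip costs
# 0.1 s, however fat the pipe is.
print(rtts_to_send(100_000, 4))   # 5 round trips -> ~0.5 s
print(rtts_to_send(100_000, 10))  # 3 round trips -> ~0.3 s
```

On the LAN the round trips are sub-millisecond, so the same ramp-up is invisible; over a long path it dominates the transfer.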