2012-05-19

Possible Stonebraker Trajectories

City Farm, St Werburgh's

In my time I've managed to say bad things about SQL in an email discussion that had the (now officially late) Jim Gray in the CC:. I also faulted the first generation of grid engines for believing the storage vendors when they said "don't worry about storage" when I was participating in a panel with the lead of the Condor project. Despite this, the ACM doesn't give me a blog page.

It's a shame, then, to see the ACM letting Stonebraker publish another of his rants, Possible Hadoop Trajectories.

First he gets a dig in at Java developers "discovering parallel processing". Actually, they've had threads and small clusters for a long time. What Hadoop brings to the planet is that it opens up the ability to work with thousands of cores and PB of data to lots of people. The query languages, Pig and Hive, mean that you don't need to learn Java either -any more than you need Objective C skills to use an iPhone.

Then he tries the "it's so inefficient the planet dies" story. At least they are planning to build their next supercomputer by a hydro plant -but it's still going to have a power budget of Megawatts, so calling Hadoop environmentally unfriendly is ironic. It's like Airbus saying their planes are greener than Boeing's -when both are flying people across the Atlantic(*).

If there's one thing that really annoys me here it's a bit from the opening paragraph:
"we applaud Hadoop for its success in this area, which we believe is due largely to the simplicity and accessibility of its environment. "
Exactly.
  1. It is simple to use because Map(x):(k,y) and Reduce(k, [y1,...,yn]): z are easy to understand and play with. You can write efficient routines without knowing relational algebra and set theory, unlike, say, RDBMs [Codd71].
  2. It is accessible because it runs (slowly) on your laptop and massively faster on your production cluster. It is also economically accessible because it is free to download and start to play with. You may need training or support -which Hortonworks will gladly offer -but that payment is optional. You can learn through the books and trial and error. You can learn to support your own system. Doing so does give you the duty to rummage through the code yourself, but if you contribute any fixes you have made back, even your in-house support efforts benefit the community as a whole.
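To make the Map(x):(k,y) and Reduce(k, [y1,...,yn]): z signatures from point 1 concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the names map_fn, reduce_fn and run_mapreduce are made up for illustration, and the in-memory "shuffle" stands in for what a real cluster does across machines.

```python
from collections import defaultdict

def map_fn(line):
    """Map(x) -> [(k, y)]: emit (word, 1) for every word in one line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce(k, [y1, ..., yn]) -> z: add up the counts for one word."""
    return sum(values)

def run_mapreduce(records, mapper, reducer):
    """Hypothetical driver: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle" step, done in memory
    return {key: reducer(key, values) for key, values in groups.items()}

counts = run_mapreduce(["the cat sat", "the dog sat down"], map_fn, reduce_fn)
print(counts)
```

The two user-written functions know nothing about distribution; that separation is roughly what lets the same logic run on a laptop or a cluster.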
There is nothing wrong with simplicity and accessibility. This is why PHP is one of the key development platforms of Facebook. When Facebook wanted those PHP developers to work with Hadoop, they didn't say "go learn Java". They said "here's an SQL bridge", called Hive. For those people who already know SQL, Hive lets you work with Hadoop without having to write a line of Java. There is nothing wrong with that and it does not make sense to denounce Hadoop because someone wrote tooling to help SQL experts work with it.

That does not mean that SQL is a good language. That little fact has been forgotten since RDBMs became widespread, when developers learned to write things like "SELECT * FROM users WHERE name="steve"". SQL is a language designed to make script injection the default operation; something like "SELECT * FROM users WHERE name=""; DROP TABLE users".
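A small sketch of the injection problem, using Python's standard sqlite3 module and an in-memory database. The table layout and the find_unsafe/find_safe helpers are invented for illustration. One caveat: sqlite3's execute() refuses to run more than one statement per call, so the stacked "; DROP TABLE" form would not get through here; the sketch uses a tautology instead, which is the same class of bug.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, card TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("steve", "1111"), ("alice", "2222")],
)

def find_unsafe(name):
    # Hand-rolled string construction: the input becomes part of the SQL text.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_safe(name):
    # Parameterized query: the input is bound as a value, never parsed as SQL.
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()

hostile = "' OR '1'='1"
leaked = find_unsafe(hostile)  # WHERE clause is always true: every row matches
safe = find_safe(hostile)      # no user has that literal name: no rows

print(leaked)
print(safe)
```

The safe version is no harder to write; the point is that the language makes the unsafe one the obvious first attempt.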

SQL started out as SEQUEL: "Structured English Query Language". It was written on the expectation that business people -presumably the same people that COBOL was targeted at- would sit at their shiny new IBM teletype and type in an 'ad-hoc query'. That's right: SQL was not targeted at developers, but "normal people" -and to be easy for COBOL and PL/I developers to embed. A key goal of the SQL language was to present the same capabilities, and a consistent syntax, to users of the PL/I and COBOL host languages and to ad hoc query users.[Chamberlin81].

Nowadays, the main experts in SQL are people like Facebook's PHP devs, and script hackers. Java developers run from it, hence the broad set of O/R mapping tools. Enterprise Java Beans were first; someone had a vision that people would write reusable "beans" to represent enterprise entities (User, Customer, Purchase), and that there would be some kind of market for that. Well, that died, but Hibernate and Spring keep letting Java devs write distributed database transactions without having to learn SQL. Where are Stonebraker's snide language-elitist comments then? Why no ACM article saying "ORM tools have finally brought the power of the database to Java developers"? Is it because he felt that ORM was a good idea, or that he recognizes that tools to make working with databases easier benefited him?

The harsh truth is that SQL is not a particularly good language for expressing relations and predicates. Back in 1984 the illustrious C.J. Date (as in "Introduction to Database Systems" Date) published a 47-page dot-matrix-printed critique of the language [Date84] -an article whose criticism of the difficulty of embedding SQL into PL/I is effectively the precursor to all critiques of O/R mapping. It's SQL/Language mapping, and there have been problems mixing code and SQL ever since System R first booted. A key problem is that all SQL does is read and write data from the DB, but programs need more than that. So you end up mixing SQL queries in that COBOL-esque syntax with the real code, either through some contrived ORM process or some hand-rolled string construction thing that at best is a maintenance task and at worst leaves your entire site's credit card records up on pastebin.

If you did want to work with databases properly, you'd need a programming language which makes relations and predicate calculus integral parts of the language: Prolog, Linq and, effectively, Erlang. Linq interests me as it is the most recent attempt, and because Dryad/Linq showed that it could do more than just database lookups.
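As a rough analogue only (this is Python, not Linq, Prolog or Erlang), here is what a query looks like when the relation and the predicate are ordinary language constructs rather than a string handed to another system. The User and Purchase types and the sample data are made up for illustration.

```python
from typing import NamedTuple

class User(NamedTuple):
    id: int
    name: str

class Purchase(NamedTuple):
    user_id: int
    amount: float

users = [User(1, "steve"), User(2, "alice")]
purchases = [Purchase(1, 30.0), Purchase(2, 5.0), Purchase(1, 12.5)]

# Join users to purchases and filter with a predicate, all in the host
# language: names are checked by the interpreter, and there is no query
# string to build or escape.
big_spenders = sorted({
    u.name
    for u in users
    for p in purchases
    if p.user_id == u.id and p.amount > 10
})

print(big_spenders)
```

A comprehension over in-memory lists is nowhere near a database, but it shows the shape of the idea: the query is code, not text embedded in code.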

Returning to System R, the database from which DB2 and Oracle DB are derived, [Chamberlin81] concludes with a lovely sentence:
We feel that our experience with System R has clearly demonstrated the feasibility of applying a relational database system to a real production environment.
Which can be translated as: "even though people preferred more efficient low-level data storage techniques, hand-tuned for the specific application, pre-written in assembly language, COBOL or PL/I, the System R team -including the illustrious and now sadly absent Jim Gray- felt that making data easier to work with outweighed the alternative".

That's something Stonebraker appears to have missed. The RDBMs isn't an end in itself -it's a means to an end. A tool. As is Hadoop. A tool to let you work with data at a scale and price point that the commercial RDBMs can't play at.

Is MapReduce the meta-algorithm to solve everything? Of course not. The Stratosphere team in Berlin and the Asterix team at UC Irvine are key leaders in the academic space -there are both ideas and code to pick up here. Then there are the real world projects coming out of the web companies, who do have to work at a scale and price point that RDBMs can't match: Pig, HBase, Hama, Giraph, S3; other key-value stores nearby: Cassandra, Project Voldemort. All of these worked for their organizations.

Which is why I have a quote; a slight mutation of the System R conclusion based on the experiences of all the Hadoop users:
We feel that our experience with Hadoop has clearly demonstrated the feasibility of applying a Hadoop system to a real production environment.
For anyone interested in things like Stratosphere, the Graph Layer, what Yarn allows &c, there's a two-day workshop after Berlin Buzzwords, "Beyond MapReduce" -free for all conference attendees. Stonebraker is cordially invited to attend the conference and the workshop. I'll gladly sit next to him on a panel and say things he won't agree with.

(*) This post was written on an A340-400 between SFO and LHR. I do have all the cited papers on my laptop. If you are going to argue with the RDBMs people, you need to know where they are coming from.

[Chamberlin81] D. Chamberlin et al., A History and Evaluation of System R, 1981.
[Codd71] E. F. Codd, A Database Sublanguage Founded on the Relational Calculus, 1971.
[Date84] C. J. Date, A Critique of the SQL Database Language, 1984.