Spec vs Test

Moored to a bollard

One of the things I've been busy doing this week is tightening the filesystem contract for Hadoop, as defined in FileSystemContractBaseTest: there are some assumptions in there that were clearly so "obvious" that nobody thought them worth testing.
  1. that isFile() is true for a file of size zero & size n for n>0
  2. that you can't rename a directory to a child directory of itself, at depth 1 or depth n>1
  3. that you can't rename the root directory to anything
  4. that if you write a file in one case, you can't read it using a filename in a different case (breaks on local for NTFS and HFS+)
  5. that if you have a file in one case and write a file with the same name in a different case, you have two files with different case values.
  6. that you can successfully read a file whose contents cover the full byte range 0-255.
  7. that when you overwrite an existing file with new data, and read that file again, you get the new data back.
  8. that you can have long filenames and paths
Of these tests, #1 and #7 are directly related to some problems I've encountered while implementing a Hadoop Filesystem for OpenStack, in collaboration with Rackspace and Mirantis. #1: OpenStack Swift doesn't differentiate a zero-byte file from an empty directory -they are both 0-byte objects, which become directories when there are objects with path names under that object's name. #7: Swift is eventually consistent, so you can't rely on an overwrite being immediately visible.
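Assumption #7 can be sketched as a minimal contract test. This is not the actual FileSystemContractBaseTest code -it's an illustrative sketch using java.nio against the local filesystem, and the class and method names are my own invention:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class OverwriteContractSketch {

    // Write data, overwrite it with new data, then verify a fresh read
    // returns the new bytes -the property an eventually consistent
    // store may silently break.
    static boolean overwriteReturnsNewData(Path file) throws IOException {
        byte[] original = "original contents".getBytes(StandardCharsets.UTF_8);
        byte[] updated  = "updated contents".getBytes(StandardCharsets.UTF_8);
        Files.write(file, original);                // create the file
        Files.write(file, updated);                 // overwrite in place
        byte[] readBack = Files.readAllBytes(file); // read it again
        return Arrays.equals(updated, readBack);    // must see the new data
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("contract", ".dat");
        try {
            System.out.println(overwriteReturnsNewData(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

On a strongly consistent filesystem (like the local one here) this prints true; the point of putting it in the contract is that a blobstore-backed implementation has to earn that "true" rather than assume it.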

The others? Things I thought of as I went along, looking at assumptions in the tests (byte ranges of generated test data), and assumptions so implicit nobody bothered to specify them: case logic, mv dir dir/subdir being banned, etc. And experience with Windows NT filesystems.

This is important, because FileSystemContractBaseTest is effectively the definition of the expected behaviour of a Hadoop-compatible filesystem.

It's in a medium-level procedural language, but it can be automatically verified by machines, at least for the test datasets provided, and we can hook it up to Jenkins for automated tests.

And when an implementation of FileSystem fails any of the tests, we can point to it and say "what are you going to do about that?"

If there is a weakness, it's dataset size. HDFS will let you create a file of size >1PB if you have a datacentre with the capacity and the time, but our tests don't go anywhere near that big. Even the tests against S3 & OpenStack don't (currently) try to push up files >4GB to see how that gets handled. I think I'll add a test-time property to let you choose the file size for a new test, "testBigFilesAreHandled()".
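A sketch of what that test-time property could look like. The property name test.fs.file.size is my own invention (the real patch may choose a different key), and this uses java.nio rather than the Hadoop FileSystem API:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BigFileSketch {

    // Hypothetical property: -Dtest.fs.file.size=17179869184 for a 16GB run.
    // Defaults to 4MB so the test stays fast in everyday builds.
    static long testFileSize() {
        return Long.getLong("test.fs.file.size", 4L * 1024 * 1024);
    }

    public static void main(String[] args) throws IOException {
        long size = testFileSize();
        Path tmp = Files.createTempFile("bigfile", ".dat");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            // Stream the file out in blocks rather than buffering it,
            // so the same code path works at any configured size.
            byte[] block = new byte[64 * 1024];
            long written = 0;
            while (written < size) {
                int len = (int) Math.min(block.length, size - written);
                out.write(block, 0, len);
                written += len;
            }
        }
        System.out.println(Files.size(tmp) == size);
        Files.deleteIfExists(tmp);
    }
}
```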

The point is: tests are specifications. If you write the tests after the code, they may simply be verifying your assumptions; if you write them first, or spend some time thinking "what are the foundational expectations -byte ranges, case sensitivity, etc.?", you come up with more ideas about what is wanted -and more specifications to write.

Your tests can't prove that your implementation really, really matches all the requirements of the specification, and it's really hard to test some of the concurrency aspects (how do you simulate the deletion or renaming of a sub-tree of a directory that is in the process of being renamed or deleted?). Code walkthroughs are the best we can do in Java today.
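To illustrate why this is hard, here's a sketch of a rename/delete race on the local filesystem with java.nio (class name mine). The two operations are submitted concurrently, so which interleaving you get varies from run to run -about the only generic assertion you can make is that exactly one of the source and destination paths exists afterwards:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RenameDeleteRaceSketch {
    public static void main(String[] args) throws Exception {
        Path base = Files.createTempDirectory("race");
        Path dir = Files.createDirectory(base.resolve("dir"));
        Path child = Files.createDirectory(dir.resolve("sub"));
        Path target = base.resolve("renamed");

        ExecutorService pool = Executors.newFixedThreadPool(2);
        // One thread renames the parent while another deletes a child
        // inside it; either may "lose" the race, and both are allowed to.
        Future<?> rename = pool.submit(() -> {
            try { Files.move(dir, target); } catch (IOException ignored) { }
        });
        Future<?> delete = pool.submit(() -> {
            try { Files.deleteIfExists(child); } catch (IOException ignored) { }
        });
        rename.get();
        delete.get();
        pool.shutdown();

        // The only interleaving-independent assertion: the directory is
        // at exactly one of the two paths, never both, never neither.
        System.out.println(Files.exists(dir) ^ Files.exists(target));
    }
}
```

Note that this only exercises one schedule per run; proving the property holds under *all* interleavings is exactly what tests like this can't do, which is why the walkthroughs still matter.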

Despite those limits, for the majority of the code in an application, tests + reviews are mostly adequate. Which is why I've stated before: the tests are the closest we have to a specification of Hadoop's behaviour, other than the de facto behaviour of what gets released by the ASF as a non-alpha, non-beta release of the Apache Hadoop source tree (with some binaries created for convenience).

Where are we weak?
  • testing of concurrency handling.
  • failure modes, especially in the distributed bit.
  • behaviour on systems and networks that are in some way "wrong" -Hadoop contains some implicit expectations about the skills of whoever installs it.
I'd love to know how better to test this stuff, or at least prove valid behaviour in Java with the Java5+ memory model on an x86 part, 1GbE ethernet between the cluster nodes (with small or jumbo frames), & something external talking to the cluster from another network.


[photo: van crashed into a wall due to snow and ice on nine-tree hill, tied to a bollard with some tape to stop it sliding any further]


Hadoop Summit: Call for Papers

There's a CFP out for this summer's (US) Hadoop Summit.

Mt Shasta

Having reviewed one of the tracks for the forthcoming Hadoop Summit EU, we -Johannes Kirschnick, Bill de hÓra and myself- had to go through 30+ abstracts and pick under 10 of them. That's a 2:1 rejection rate -so we had to be ruthless. Rather than pick abstracts based on author, employer, whether or not they were on the review committee (i.e. me), whether they were a world-class public speaker people would willingly travel to hear speak (i.e. me), or whether they were friends or colleagues, we had one basic criterion: "would this be interesting and relevant for the audience?" -where the audience were people who had paid to come to Amsterdam to hear about the future of Hadoop.

That translated into:
  1. Does it show the future of a core component of the Hadoop stack? (HDFS, YARN, HBase, ... )?
  2. Or : Does it show the future of a new-but-potentially essential part of the stack?
  3. Does it sound interesting?
  4. Is it going to be relevant to the audience within the next 18 months?
That focus on near-term futures (the stuff you'd discuss using the present tense in French) meant that some of the sessions that would be so profound they'd be memorable for decades (specifically, Hadoop: Embracing Future Hardware, by S. Loughran) had to be left out. Which is a shame, because there were some really interesting things I'd like to hear about -but I'm not representative of the audience.

Here I am, at an in-law's house in London, laptop out, headphones on with Underworld playing loud enough to drown out the whining of an 11 year old complaining that the lockdown of the other laptop and consequent upgrade to JDK 7 has stopped Minecraft. The IDE is showing the code for a converged HADOOP-8545 SwiftFS filesystem; a terminal is building Hadoop locally so that Maven doesn't decide to pull a Hadoop snapshot from the Apache repos, even though I built one locally myself only yesterday. Today's trunk benefits from a review and commit by Suresh of my HADOOP-9119 patch, one that verifies that when you overwrite a file in the FS with a new file, you get the contents of the new file back. (Because eventual consistency may not be what you want.) My notion of "Hadoop future" is not that of people who don't have spreadsheets full of JIRA issues to hand -and with them in mind, we have had to be pragmatic about what people want to know about.

We've also had to go through those proposals and sift those more-immediate future talks based on which ones sound the best. Which was done based on the abstracts.

The better the abstract, the better rank your paper got.

Some of the proposals competed directly with others for the same area and theme. When that happened, the one with the most compelling abstract got in. When there were competing up-and-coming parts of the Hadoop stack, we had to select between them -again, based on the abstract.

This is important because a vague proposal -"we will show some interesting developments in Project XYZ"- isn't going to get in based on (speaker, org). That's for keynotes. For the technical talks, we needed to know enough about the talk to decide whether or not it was interesting and compelling for the audience.

The result of that is, we hope, an excellent set of talks for the audience, even though the limited space meant that coverage of all that is going on in the Hadoop world is incomplete, and there will be some late-breaking features coming along. There should be some space for lightning talks and demo sessions, and presumably when I'm loitering by the Hortonworks booth I'll give demos of what I've been doing. There will be ways to see really new stuff.

What you won't get is full talks on such topics -which is why if you are working on something that you think is excellent and compelling to the audience of the next Hadoop Summit:

Submit a proposal with a really compelling abstract

[Photo: sunset on Mount Shasta, CA, August 26 2012 -overnight stop on our return from Crater Lake to Mountain View]


Gmail account (not mine) potentially 0wned

A household security announcement:

2013-01-07 21:00 GMT Potential security breach of a home gmail account.
At or around 02:00 last night (GMT), at least five people, including myself, were sent a URL by my wife.
  • As she's been in London, I've only just got access to this laptop
  • A curl of the link shows it has a javascript malware page; I haven't looked at what the contents are, but it's clearly trying to 0wn the browser.
  • It's too late to use the google account activity log to see what's up -it only goes back 12 hours.  I should have known about that feature this morning. For that reason we aren't sure what happened.
  • Her password was a Bristol street name (not ours) + a number: weak entropy; it may have been brute-forced. Alternatively, the mail may not have come from her account at all, as not many people seem to have got it. I will look through my deleted items list and the headers.
  • Firefox and Chrome are up to date.
  • Thunderbird is up to date and not used for gmail
  • No bounce mail came into the gmail account. For that reason we believe that the message was not sent to all addresses in the contact list. The account may not have been compromised at all, though something did know that the five of us (at least) knew each other. This could be from some other email that is in the inbox of someone else who has been compromised.
  • There is no email in her mailbox that contains everyone's email address.
  • There's no obvious sign of contamination of gmail, such as a filter to hide the bounce responses that spamming everyone would inevitably generate.
  • Even though flash is set to auto-update, it hasn't picked up the most recent release, as the interval between emergency out-of-band flash updates is much less than the check interval of the flash updater. Whoever wrote it was optimistic and assumed that you'd update flash to get new features, not to stop it being one of the key attack routes for clients.
  • Java 1.6 is installed, though I disabled it on both browsers some time last year.
  • Acrobat pro is the default viewer of PDF files for firefox and thunderbird
  • the default app for  microsoft apps is the MS office suite
  • although MS word is set to check for updates weekly, it does not have the Nov 13 2012 critical update. The implication here is that MS Office automated update checking is broken.
  • we have adblock and flashblock to keep adverts out and flash pages from strangers away. Everyone should do this.
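On the weak-entropy point above, a back-of-envelope comparison of the search space of "street name + number" versus a four-word passphrase. All the figures here are illustrative assumptions, not measurements of the actual password:

```java
public class PasswordEntropySketch {
    // Assumed figures: an attacker's dictionary of ~10,000 UK street
    // names, a 1-4 digit suffix (10,000 possibilities), versus four
    // words drawn from a 7,776-word diceware list.
    public static void main(String[] args) {
        double streetNamePlusNumber = 10_000.0 * 10_000;  // ~1e8 candidates
        double fourWordPassphrase = Math.pow(7776, 4);    // ~3.7e15 candidates
        // The passphrase space is over a million times larger.
        System.out.println(fourWordPassphrase > streetNamePlusNumber * 1_000_000);
    }
}
```

At rates a brute-forcer can manage against an online login, 1e8 candidates is within reach; 3.7e15 is not -which is the argument for the pass phrase change below.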
I have 3 theories.
  1. the password was gained through some brute force attack
  2. some malware gained access to the system via flash, acroread, MS office or, possibly though unlikely, Java. If this is the case, the Mac laptop has to be considered compromised.
  3. someone generated some spoofed emails.
Immediate Actions
  1. The password has been changed to a pass phrase.
  2. We have switched to 2-step authentication on google; a text is sent to the phone when logging in on a browser without the cookies, and you get device-specific logins for IMAP clients. You can also generate sets of ready-to-use auth keys for use when travelling without a phone -which is enough to make me switch too.
  3. I've verified that neither gmail nor google docs contains any of our credit card numbers. Apart from the last four digits of my number in an iTunes receipt, all is well. Yes, I know about Apple iCloud's vulnerability to hacking with those digits [ http://www.wired.com/gadgetlab/2012/08/apple-amazon-mat-honan-hacking/all/ ], but there's little that can be done there -it's Apple's side. If any card number was in a file visible from gmail I'd have had to revoke it via the bank.
  4. Updating the AV software, rebooting.
  5. Updating flash, acrobat, MS office.
  6. Updating Java and making sure that Java 1.6 isn't on the box (you need to install the full JDK for this).
  7. Making Apple Preview the default PDF viewer for firefox and thunderbird
  8. maybe: installing Apache OpenOffice and making that the default viewer for MS Office documents from browsers. I suspect some end-user resistance there.
  9. Forcing proper pass phrases across all of Bina's accounts -her login password was compromised by a ten year old in 2012 so as to get extra time on the home computer. I do not consider the replacement much better.
If I see any signs of the laptop being compromised, it's rebuild time. The only reason I'm not doing it now is that I don't know how to do this on a mac -yet.

22:09 Update
  • Be aware that installing Java 7 re-enables Java browser plugins, even if previously disabled. Turn it off in the (new) Java control panel, and then verify in the browser.
  • Thunderbird picks up the whole set of installed plugins -including any newly re-enabled Java 7 plugin, and flash. This is very serious, as it makes recipients vulnerable to targeted flash or Java attacks.
  • AV scanners are happy.
  • The headers show that the messages came from a different domain but were routed via gmail.
  • I didn't think google mail would do that for unauthenticated accounts -which makes me suspect it was a brute-force attack.
I'm concluding that the gmail login was guessed by brute force. The 2-step auth and new password should prevent this happening again, though we have to consider the contacts list and (personal) emails compromised.