Greylisting - like blacklisting only more forgiving

How not to fix a car

Reading the paper The φ Accrual Failure Detector has made me realise something that I should have recognised before: blacklisting errant nodes is too harsh -we should be assigning a score to them and then ordering them based on perceived reliability, rather than using a simple reliable/unreliable flag.

In particular: the smaller the cluster, the more you have to make do with unreliable nodes. It doesn't matter if your car is unreliable, if it is all you have. You will use it, even if it means you end up trying to tape up an exhaust in a car park in Snowdonia, holding the part in place with a lead acid battery mistakenly placed on its side.

Similarly, on a 3-node cluster, if you want three region servers on different nodes, you have to accept that they all get in, even if sometimes unreliable.

This changes how you view cluster failures. We should track the total failures over time, and some weighted moving average of recent failures -the latter to give us a score of unreliability, giving us a reliability score of 1-unreliability, assuming I can normalise unreliability to a floating point value in the range 0-1.
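
A minimal sketch of that scoring, assuming an exponentially weighted moving average as the "weighted moving average". The class name, field names and the smoothing factor alpha are illustrative assumptions, not anything in Hoya.

```python
from dataclasses import dataclass


@dataclass
class NodeHistory:
    """Failure history for one node; all names here are illustrative."""
    total_failures: int = 0      # lifetime count, never decays
    unreliability: float = 0.0   # recent-failure score, stays in [0, 1]

    def record(self, failed: bool, alpha: float = 0.3) -> None:
        """Fold one container outcome into the moving average.

        alpha is an assumed smoothing factor: higher values weight
        recent outcomes more heavily.
        """
        if failed:
            self.total_failures += 1
        outcome = 1.0 if failed else 0.0
        self.unreliability = alpha * outcome + (1.0 - alpha) * self.unreliability

    @property
    def reliability(self) -> float:
        return 1.0 - self.unreliability


node = NodeHistory()
for failed in (True, True, False):
    node.record(failed)
```

Because every outcome is 0 or 1 and alpha is between 0 and 1, the score cannot leave the 0-1 range, which takes care of the normalisation.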

When specifically requesting nodes, we only ask for those with a recent reliability over a threshold; when we get them back we first sort for reliability and try to allocate all role instances to the most reliable nodes (sometimes YARN gives you more allocations than you asked for). We may still end up with some allocations on nodes that are less reliable than the threshold allows.
That threshold will depend on cluster size -we need to tune that based on the cluster size provided by the RM (issue: does it return current cluster size or maximum cluster size?).
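
As a rough sketch of that sort-then-allocate step: the threshold rule, the default score for never-seen hosts and both function names below are assumptions made up for illustration.

```python
def reliability_threshold(cluster_size: int) -> float:
    """Assumed tuning rule: small clusters must tolerate flakier nodes."""
    if cluster_size <= 3:
        return 0.0   # take whatever we are given
    if cluster_size <= 20:
        return 0.5
    return 0.8


def triage(allocated_hosts, reliability, cluster_size, wanted):
    """Split allocated hosts into (use, suspect_or_surplus).

    reliability maps host -> score in [0, 1]; hosts never seen before
    default to 0.9, which is an assumption.
    """
    def score(host):
        return reliability.get(host, 0.9)

    threshold = reliability_threshold(cluster_size)
    ranked = sorted(allocated_hosts, key=score, reverse=True)
    top, surplus = ranked[:wanted], ranked[wanted:]
    keep = [h for h in top if score(h) >= threshold]
    suspect = [h for h in top if score(h) < threshold]
    return keep, suspect + surplus


scores = {"a": 0.95, "b": 0.4, "c": 0.7}
small = triage(["b", "a", "c", "d"], scores, cluster_size=10, wanted=3)
large = triage(["b", "a", "c", "d"], scores, cluster_size=100, wanted=3)
```

The second list is whatever still needs a decision: surplus containers to hand back, plus allocations on nodes that miss the threshold.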

What to do with allocations on nodes that fail the threshold?
  1. discard them, ask for a new instance immediately: high risk of receiving the old one again
  2. discard them, wait, then ask for a new instance: lower risk.
  3. ask for a new instance first, then discard the old one at the sooner of: when the new allocation comes in, or some time period after making the request. This probably has the lowest risk, precisely because if there is capacity in the cluster we can't get that old container; we'll get a new one on an arbitrary node. If there isn't capacity, when we release the container some time period after making the request, we get it back again. That delayed release is critical to ensuring we get something back if there is no space.
What to do if we get the same host back again? Maybe just take what we are given, especially in case #3 when we know that the container was released after a timeout. It'll be on the wrong side of the threshold, but let's see what happens -it may just be that now it works (some other service blocking a port has finished, etc). And if not, it gets marked as more unreliable.
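
The timing rule in option 3 is small enough to write down. This is only a sketch: the function name and the use of plain seconds are assumptions, and a real AM would drive this from its allocation callbacks.

```python
from typing import Optional


def release_time(requested_at: float, timeout: float,
                 replacement_at: Optional[float]) -> float:
    """When to release the suspect container under option 3.

    Times are seconds on any monotonic clock; replacement_at is None
    if no new allocation ever arrives.
    """
    deadline = requested_at + timeout
    if replacement_at is None:
        return deadline
    return min(replacement_at, deadline)


def released_by_timeout(requested_at: float, timeout: float,
                        replacement_at: Optional[float]) -> bool:
    """True if the cluster had no spare capacity within the timeout."""
    return replacement_at is None or replacement_at > requested_at + timeout
```

If released_by_timeout is true and the same host then comes back, that is the case where taking what we are given looks reasonable.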

If we do start off giving all nodes a reliability of under 100%, then we can even distinguish "unknown" from "known good" and "known unreliable". This gives applications a power they don't have today -a way to not trust the as-yet-unknown parts of a cluster.
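
One way to picture the three states, as a sketch only: the starting value, the two cut-offs and the smoothing factor are all assumed numbers picked for illustration.

```python
INITIAL_RELIABILITY = 0.8   # assumed prior for a node we have never used
GOOD = 0.9                  # assumed cut-off for "known good"
BAD = 0.5                   # assumed cut-off for "known unreliable"


def update(reliability: float, succeeded: bool, alpha: float = 0.3) -> float:
    """Move the score toward 1.0 on success and toward 0.0 on failure."""
    target = 1.0 if succeeded else 0.0
    return alpha * target + (1.0 - alpha) * reliability


def classify(reliability: float) -> str:
    if reliability >= GOOD:
        return "known good"
    if reliability < BAD:
        return "known unreliable"
    return "unknown"


proven = update(update(INITIAL_RELIABILITY, True), True)
flaky = update(update(INITIAL_RELIABILITY, False), False)
```

A node only reaches "known good" by succeeding, so an untouched node can never be mistaken for a proven one.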

If using this for HDD monitoring, I'd certainly want to consider brand new disks as less than 100% reliable at first, and try to avoid storing data in >1 drive below a specific reliability threshold, though that just makes block placement even more complex.

I like this design --I just need the relevant equations.


Hoya as an architecture for YARN apps


Someone -and I won't name them- commented on my proposal for a Hadoop Summit EU talk, Secrets of YARN development: "I am reading YARN source code for the last few days now and am curious to get your thoughts on this topic - as I think HOYA is a bad example (sorry!) and even the DistributedShell is not making any sense."

My response: I don't believe that DShell is a good reference architecture for a YARN app. It sticks all the logic for the AM into the service class itself, doesn't do much on failures, and avoids the topics of RPC and security entirely. It introduces the concepts but if you start with it and evolve it, you end up with a messy codebase that is hard to test -and you are left delving into the MR code to work out how to deal with YARN RM security tokens, RPC service setup, and other details that you'd need to know in production.

Whereas Hoya
  • Embraces the service model as the glue to building a more complex application. Shows my SmartFrog experience in building workflows and apps from service aggregation.
  • Completely splits the model of the YARN app from the YARN-integration layer, producing a model-controller design. Where the model can be tested independently of YARN itself.
  • Provides a mock YARN runtime to test some aspects of the system --failures, placement history, best-effort placement-history reload after unplanned AM failures --and lays the way for simulating whether the model can handle 1000+ node clusters.
  • Contains a test suite that even kills HBase masters and Region Servers to verify that the system recovers.
  • Implements the secure RPC stuff that Dshell doesn't and which isn't documented anywhere that I could find.
  • Bundles itself up into a tarball with a launcher script -it does not rely on Hadoop or YARN being installed on the client machine.
So yes, I do think Hoya is a good example.

Where it is weak is
  1. It's now got too sophisticated for an intro to YARN.
  2. I made the mistake of using protobuf for RPC, which is needless complexity and pain. Unless you really, really want interop and are willing to waste a couple of days implementing marshalling code, I'd stick to the classic Hadoop RPC. Or look at Thrift.
  3. I need to revisit and clean up bits of the client side provider/template setup logic.
  4. We need to implement anti-affinity by rejecting multiple assignments to the same host for non-affine roles.
  5. It's pure AM-side, starting HBase or Accumulo on the remote containers, but doesn't try hooking the containers up to the AM for any kind of IPC.
  6. We need to improve its failure handling with more exponential backoff, moving average blacklisting and some other details. This is really fascinating, and as Andrew Purtell pointed me at phi-accrual failure detection, is clearly an opportunity to do some interesting work.
I'd actually like to pull the mock YARN stuff out for re-use --same for any blacklisting code written for long-lived apps.
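
For weakness #4, the rejection logic could look something like the sketch below. It is an illustration only: the function name and the (container id, host) tuples are assumptions, not Hoya's actual types.

```python
def filter_anti_affine(allocations, hosts_in_use):
    """Split (container_id, host) pairs into accepted and rejected.

    An allocation is rejected if its host already runs an instance of
    this role, or if an earlier allocation in the same batch claimed it.
    """
    taken = set(hosts_in_use)
    accepted, rejected = [], []
    for container_id, host in allocations:
        if host in taken:
            rejected.append((container_id, host))
        else:
            taken.add(host)
            accepted.append((container_id, host))
    return accepted, rejected


batch = [("c1", "n1"), ("c2", "n2"), ("c3", "n2"), ("c4", "n3")]
accepted, rejected = filter_anti_affine(batch, hosts_in_use={"n1"})
```

The rejected containers would be released back to YARN and re-requested, which is where the backoff in #6 comes in.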

I also filed a JIRA "rework DShell to be a good reference design", which means implement the MVC split and add a secure RPC service API to cover that topic.

Otherwise: have a look at the twill project in incubation. If someone is going to start writing a YARN app, I'd say: start there. 


My policy on open source surveys: ask the infrastructure, not the people

An email trickling into my inbox reminds me to repeat my existing stance on requests to complete surveys about open source software development: I don't do them.


The availability of the email addresses of developers in OSS projects may make people think that they could gain some insight by asking those developers questions as part of some research project, but consider this:
  1. You won't be the first person to have thought of this -and tried to conduct a survey.
  2. The only people answering your survey will be people who either enjoy filling in surveys, or who haven't been approached repeatedly before.
  3. Therefore your sample set will be utterly unrealistic, consisting of people new to open source (and not yet bored of completing surveys), or who like filling in surveys.
  4. Accordingly any conclusions you come to could be discounted based on the unrepresentative, self-selecting sample set.
The way to innovate in understanding open source projects -and so to generate defensible results- is to ask the infrastructure: the SCM tools, the mailing list logs, the JIRA/bugzilla issue trackers. There are APIs for all of this.

Here then are some better ideas than yet-another-surveymonkey email to get answers whose significance can be disputed:
  1. Look at the patch history for a project and identify the bodies of code with the highest rate of change -and the lowest. Why the differences? Is the code with the highest velocity the most unreliable, or merely the most important?
  2. Look at the stack traces in the bug reports. Do they correlate with the modules in (1)?
  3. Does the frequency of stack traces against a source module increase after the patch to that area ships? Or does it decrease? That is, do patches actually reduce the number of defects, or, as Brooks said in the Mythical Man Month, simply move them around?
  4. Perform automated complexity analysis on source. Are the most complex bits the least reliable? What is their code velocity?
  5. Is the amount of discussion on a patch related to the complexity of the destination or the code in the patch?
  6. Does the complexity of a project increase or decrease over time?
  7. Does the code coverage of a project increase or decrease over time?
See? Lots of things you could do -by asking the machines. This is the data-science way, not running surveys against a partially-self-selecting set of subjects and hoping that it is in some way representative of the majority of open source software projects and developers.
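
As a taste of idea (1), here is a sketch that ranks files by churn from the text that `git log --numstat --format=` prints (tab-separated added, deleted, path). The function name is made up, the sample data is invented, and renamed files are not handled.

```python
from collections import Counter


def churn_by_path(numstat_text: str) -> Counter:
    """Sum added+deleted lines per path from git numstat output."""
    churn = Counter()
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue   # blank separator lines between commits
        added, deleted, path = parts
        if added == "-" or deleted == "-":
            continue   # binary files report "-" instead of counts
        churn[path] += int(added) + int(deleted)
    return churn


# Invented sample standing in for real git output.
sample = (
    "10\t2\tsrc/rpc/Server.java\n"
    "\n"
    "3\t3\tsrc/rpc/Server.java\n"
    "-\t-\tdocs/logo.png\n"
    "1\t0\tREADME.txt\n"
)
ranked = churn_by_path(sample).most_common()
```

To run it against a real checkout, feed it the stdout of subprocess.run(["git", "log", "--numstat", "--format="], capture_output=True, text=True, check=True).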

[photo: ski lifts in the cloud, Austria, december 2013]


Television Viewing & the Deanonymization of Large Sparse Datasets.

[preamble: this is not me writing against collecting data analysing user behaviour, including TV viewing actions. I cherish the fact that Netflix recommends different things to different family members, and I'm happy for the iPlayer team to get some generic use data and recognise that nobody actually wants to watch Graham Norton purely from the way that all viewers stop watching before the introductory credits are over. What is important here is that I get things in exchange: suggestions, content. What appears to be going on here is that a device I bought is sending details on TV watching activity so as to better place adverts on a bit of the screen I paid for, possibly in future even interstitially during the startup of a service like Netflix or iPlayer. I don't appear to have got anything in exchange, and nobody asked me if I wanted the adverts let alone the collection of the details of myself and my family, including an 11 year old child.]

Graham Norton on iPlayer

Just after Christmas I wandered down to Richer Sounds and bought a new TV, first one in a decade, probably second TV we've owned since the late 1980s. My goal was a large monitor with support for free to air DTV and HD DTV, along with the HDMI and RGB ports to plug in useful things, including a (new) PS3 which would run iPlayer and Netflix. I ended up getting a deeply discounted LG Smart TV as the "smart" bits came with the monitor that I wanted.

I covered the experience back in March, where I stated that I felt that the smart bit was AOL-like in its collection of icons of things I didn't want and couldn't delete, its dumbed down versions of Netflix and iPlayer, and its unwanted adverts in the corner. But that's it, the Netflix tablet/TV integration compensates for the weak TV interface, and avoids the problem of PS3 access time limits on school nights, as the PS3 can stay hidden until weekends.


Last week I finally acceded to the TV's "new update available" popups, after which came the "reboot your TV" message. Which I did, to then get told that I had to accept an updated privacy policy. I started to look at this, but after screen 4 of 20+ gave up, mentioning it briefly on that social networking stuff (who give me things like Elephant-Bird in exchange for their logging my volunteered access -access where I turn off location notification in all devices).

I did later regret not capturing that entire privacy policy by camera, and tried to see if I could find it on line, but at the time, the search term "LG SmartTV privacy policy" returned next to nothing apart from a really good policy for the LG UK web site, which even goes into the detail of identifying each cookie and its role. I couldn't see the policy after a quick perusal of the TV menus, so that was it.

Only a few days later, Libby Miller pointed me at an article by DoctorBeet, who'd spun wireshark up to listen to what the TV was saying, and so showed how his LG TV is doing an HTTP form POST to a remote site of every channel change, as well as details on filenames in USB sticks.

This is a pretty serious change from what a normal television does. DoctorBeet went further and looked at why. Primarily it appears to be for advert placement, including in that corner of the "smart" portal, or at start time after you select "premium" content like iPlayer or Netflix. I haven't seen that, which is good -an extra 1.5MB download for an advert I'd have to stare through is not something I'd have been happy with.

Anyway, go look at his article, or even a captured request.

I'm thinking of setting up wireshark to do the same for an evening. I made an attempt yesterday but as the TV is CAT-5 to a 1Gb/s hub, then an ether over power bridge to get into the base station, it's harder than I'd thought. My entire wired network is on switched ports so I can't packet sniff, and the 100 Mb/s hub I dredged up from the loft turned out to be switched too. That means I'd have to do something innovative like use the WEP-only 802.11b ether to wifi bridge I also found in that box, hooked up to an open wifi base station plugged into the real router. Maybe at the weekend. A couple of days' logs would actually be an interesting dataset even if it just logs PS3 activity hours as time-on-HDMI-port-1.

What I did do is go to the "opt out of adverts" settings page DoctorBeet had found, scrolled down and eventually followed some legal info link to get back to the privacy settings. Which I did photo this time, and which are now up on Flickr.

Some key points of this policy

Information considered to be non personally identifiable includes MAC addresses and "information about the live content you are watching"

LG Smart TV Privacy Policy

That's an interesting concept, which I will get back to. For now, note that that specific phrase is not indexed anywhere in BigTable, implying it is not published anywhere that google can index it.
Phrase not found: "information about the live content you are watching"

Or "until you sit through every page with a camera this policy doesn't get out much"

If you have issues, don't use the television

LG Smart TV Privacy Policy

That's at least consistent with customer support.

Anyway, there are a lot more slides. One of them gives a contact who, when you tap his name in to LinkedIn, turns out not only to be the head of legal at LGE UK, but also to be one hop away from me: datamining in action.

Now, returning to a key point: Is TV channel data Non-personal information?

Alternatively: If I had the TV viewing data of a large proportion of a country, how would I deanonymize it?

The answer there is straightforward: I'd use the work of [Arvind Narayanan and Vitaly Shmatikov, 2008], Robust De-anonymization of Large Sparse Datasets.

In that seminal paper, Narayanan and Shmatikov took the anonymized Netflix dataset of (viewers->(movies, rankings)+), and deanonymized it by comparing film reviews on Netflix with IMDb reviews, looking for reviews that appeared on IMDb shortly after a Netflix review with ratings matching/close to that of the Netflix review. They then took the sequence of a viewer's watched movies and looked to see if a large set of their Netflix reviews met that match criteria. At the end of which they managed to deanonymize some Netflix viewers -correlating them with an IMDb reviewer many standard deviations out from any other candidate. They could then use this match to identify those movies which the viewer had seen and yet not reviewed on IMDb.

The authors had some advantages: both Netflix and IMDb had reviews, albeit on a different scale. The TV details don't, so the process would be more ad-hoc:

  1. Discard all events that aren't movies
  2. Assume that anything where the user comes in late to some threshold isn't a significant "watch event" and discard.
  3. Assume that anything where the user watches all the way to the end is a significant "watch event" and may be reviewed later.
  4. Treat watch events where the viewer changes channel some distance into a movie -say 20 min- as a significant watch failure event, which may be reviewed negatively.
  5. Consider watch events where the user was on the same channel for some time before the movie began as less significant than when they tuned in early.
  6. If information is collected when a user explicitly records a movie, a "recording event", that is treated even more significantly.
  7. Go through the IMDb data looking for any reviews appearing a short time after a significant set of watch events, expecting higher ratings from significant watch events and recording events, and potentially low ratings from a significant watch failure.
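
Steps 1-6 amount to a classifier over viewing events, sketched below. Every threshold other than the 20 minutes, every weight, and the shape of the event record are assumptions; I have no idea what fields the TV actually logs.

```python
from dataclasses import dataclass
from typing import Optional

LATE_START = 10 * 60      # assumed: joining more than 10 min in is not a real watch
ABANDON_AFTER = 20 * 60   # the "say 20 min" of step 4
CHANNEL_IDLE = 30 * 60    # assumed: on the channel this long beforehand = passive


@dataclass
class ViewingEvent:
    is_movie: bool
    duration: int      # programme length in seconds
    tuned_in: int      # seconds relative to programme start (negative = early)
    tuned_out: int     # seconds relative to programme start
    recorded: bool = False


def significance(e: ViewingEvent) -> Optional[float]:
    """Signed weight for an event, or None if it is discarded.

    Positive hints at a later favourable review, negative at a poor one.
    The magnitudes are placeholders.
    """
    if not e.is_movie:
        return None                # step 1
    if e.recorded:
        return 2.0                 # step 6
    if e.tuned_in > LATE_START:
        return None                # step 2
    if e.tuned_out >= e.duration:
        if e.tuned_in < -CHANNEL_IDLE:
            return 0.5             # step 5: the channel was just left on
        return 1.0                 # step 3
    if e.tuned_out >= ABANDON_AFTER:
        return -1.0                # step 4
    return None                    # assumed: an early exit says nothing
```

Step 7 would then need only the timestamps and weights of the surviving events, which is the point: nothing here requires a name.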

I don't know how many matches you'd get here -as the paper shows, it's the real outliers you find, especially the watchers of obscure content.

Even so, the fact that it would be possible to identify at least one viewer this way shows that TV watching data is personal information. And I'm confident that it can be done, based on the maths and the specific example in the Robust De-anonymization of Large Sparse Datasets paper.

Conclusion: irrespective of the cookie debate, TV watching data may be personal -so the entire dataset of individual users must be treated this way, with all the restrictions on EU use of personal data, and the rights of those of us with a television.


Foreign News

The cracks all the way to the top of the small feudal island-state of Great Britain became visible this week, as a show trial and exposure of police and state security activities exposed the means by which the regime retains power.

Stokes Croft Royal Wedding Day

For centuries Britain has endured a caste system, where those at the bottom had little education or career prospects, while those in the ruling "upper class" lived an entirely separate life -a life that began with a segregated education from their school, "eton", to their universities, oxford and cambridge and then employment in "the city" or political power in "parliament". Similar to the French Polytechniques system, while it guarantees uniformity and consistency amongst the hereditary rulers, the lack of diversity reduces adaptability. Thus the elite of this island have had trouble leading it out of the crises that have befallen it since 2008 -when it became clear that its offshore tax-haven financial system had outgrown the rest of the country. The emergency measures taken after the near-collapse of the country's economy have worsened the lives of all outside a small elite -exacerbating the risks of instability.

This month some of the curtains on the inner dealings of that ruling oligarchy were lifted, giving the rest of the country a glimpse into the corrupt life of the few. A show trial of the editors of a newspaper showed how the media channels -owned by a few offshore corporations- were granted free rein by the rulers, in exchange for providing the politicians with their support and the repetition of a message that placed the blame for the economic woes on the previous administration and outgroups such as asylum seekers and "welfare scroungers".

A disclosure of how the media were creating stories based on intercepting the voicemail messages of anyone of interest forced the government to hand a few of the guilty to the legal system -while hoping that the intimate relationship between these newspaper editors and those in government does not get emphasised. Even so, this scandal has already forced the government to postpone approving a transaction that would give a single foreign oligarch, Murdoch, near absolute control of television and the press. Public clamour for some form of regulation of the press has also forced the regime to -reluctantly- add some statutory limitations to their actions. It remains to be seen what effect this has -and whether the press will exact their revenge on the country's rulers.

A few miles away, in the country's "parliament", the MPs exercised some of their few remaining privileges of oversight. The "plebgate" affair represented a case in which the feared police, "the Met" were grilled over their actions. Normally the Met is given a free hand to suppress dissent and ensure stability across the lower castes, but in "plebgate" the police were caught on CCTV and audio recordings making false accusations about one of the rulers. The thought that "the Met" could turn on their masters clearly terrifies them: the grilling of the police chiefs represents the public part of a power struggle to define who exactly is in charge.

Alongside this, the heads of the state security apparatus were interviewed over the increasingly embarrassing revelations that they had been intercepting the electronic communications of the populace of the country, "the subjects" as they are known. This comes as no surprise to the rulers, who recognise that with the mainstream media being part of the oligarchy, any form of organised dissent will be online. Monitoring of facebook and google is part of this -during the 2011 civil unrest, calls were even made by the press and politicians to disable some of these communications channels. Again, the rulers have to walk a fine line between appearing concerned about these revelations, while avoiding worsening those relationships which are critical for keeping the small hereditary elite in power.

Given the interdependencies between the rulers, the press and the state security forces, no doubt these cracks will soon be painted over. Even so, irrespective of the public facade, it may be a while before the different parts of what is termed "the establishment" trust each other again.


Mavericks and Applications


One action this week was a full OS/X roll on three boxes; fairly event free. The main feature for me is "better multiscreen support". There's also now an OS/X version of Linux's Powertop; as with powertop, it is more of a developer "your app is killing the battery" tool than something end users can actually do anything with -other than complain.

The other big change is to Safari, but as I don't use that, it's moot.

The fact that it's a free upgrade is interesting -and with Safari being a centrepiece of that upgrade, maybe the goal of the upgrade is to accelerate adoption of the latest Safari and stop people using Firefox & Chrome. The more market share in browsers you have, the more web sites work in it -and as Safari is only used on macs, it can't have more desktop browser market share than the market share apple have in the PC business itself. A better Safari could maximise that market share -while its emphasis on integration with iPads and iPhones rewards people who live in the single-vendor-device space, making us owners of Android phones feel left out.

One offering that did get headlines was "Free iWork", but that turns out to be "Free on new systems"; if you have an existing OS/X box, you get to pay $20 or so per app -same as before.

Except, if you go to the apple app store, the reviews of the new suite from existing users are pretty negative: dumbed down to the point where users with existing spreadsheets, documents and presentations are finding things missing -where in keynote a lot of the fancy "make your presentations look impressive" features are gone.

They're not going to come back, now that iWork is a freebie.

The NRE costs of maintaining iWork are now part of the cost of the Mac -and OS upgrades are going the same way. Even if apple maintain market share and ASP margins over Windows PCs, the software stack costs have just gone up.

Which means those applications have gone from "premium applications with revenue through the app store", with a business plan of "be compelling enough to sell new copies as well as regular upgrades from our existing customer base", to "bundled stuff to justify the premium cost of our machines".

That's a big difference, and I don't see it being a driver for the iWork suite being enhanced with features more compelling to the experts.

Where apple are likely to go is cross-device and apple cloud integration, again to reward the faithful single-vendor customers. Indeed, you do get the free apple iCloud versions of the iWork apps, which look nice on Safari -obviously. Apple's business model there -upselling storage- does depend on storage demand, but the harsh truth is, it needs a lot of documents to use up the 5GB of free storage. Photographs, now, they do take up space, which clearly explains why the new iPhoto has put work into iPhoto-to-iCloud sharing. Yet it does still retain Flickr sharing, which, with 1TB of storage, must be a competitor to iCloud for public photos, while facebook remains a destination for private pics.

I wonder whether that Flickr uploader will still be there the next time Apple push out a free update to the OS and applications.
[photo: a line of Tuk Tuks, Tanzania/Kenya Border]


Hadoop 2: shipping!

Deamz & Soker: Asterix

The week-long vote is in -Hadoop 2 is now officially released by Apache!

Anyone who wants to use this release should download Hadoop via the Apache Mirrors.

Maven and Ivy users: the version you want to refer to is 2.2.0

The artifacts haven't trickled through to the public repos as of 2013-10-16-11:18 GMT -they are on the ASF staging repo and I've been using them happily all week.

This release marks an epic of development, with YARN being a fundamental rethink of what you can run in a Hadoop cluster: anything you can get to run in a distributed cluster where failures will happen, where the Hadoop FileSystem is the API for filesystem access -be it in Java or a native client- and where data is measured by the Petabyte.

YARN is going to get a lot of press for the way it transforms what you can do in the cluster, but HDFS itself has changed a lot. This is the first ASF release with active/passive HA -which is why ZooKeeper is now on the classpath. CDH 4.x shipped with an earlier version of this -and as we haven't heard of any dramatic data loss events, consider it well tested in the field. Admittedly, if you misconfigure things failover may not happen, but that's something you can qualify with a kill -9 of the active namenode service. Do remember to have >1 ZooKeeper instance before you try this -testing ZK failure should also be part of the qualification process. I think 5 is a number considered safer than 3, though I've heard of one cluster running with 9. Nobody has admitted going up to 11.
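
The arithmetic behind "5 is safer than 3": a ZooKeeper ensemble needs a strict majority of its servers up, so the number of failures it survives is easy to work out. The function name is just for illustration.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can fail while a strict majority stays up."""
    return (ensemble_size - 1) // 2


for n in (1, 3, 5, 9, 11):
    print(n, "servers survive", tolerated_failures(n), "failures")
```

Three servers survive one failure and five survive two, which matters when one node is already down for maintenance. An even-sized ensemble buys nothing over the odd size below it.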

This release also adds to HDFS
  1. NFS support: you can mount the FS as an NFS v3 filesystem. This doesn't give you the ability to write to anywhere other than the tail of a file -HDFS is still not-Posix. But then neither is NFS: its caching means that there is a few seconds worth of eventual consistency across nodes (Distributed Systems, Coulouris, Dollimore & Kindberg, p331).
  2. Snapshots: you can snapshot some of a filesystem and roll back to it later. Judging by the JIRAs, quotas get quite complex there. What it does mean is that it is harder to lose data by accidental rm -rf operations.
  3. HDFS federation: datanodes can store data for different HDFS namenodes -Block Storage is now a service- while clients can mount different HDFS filesystems to get access to the data. This is primarily of relevance to people working at Yahoo! and facebook scale -everyone else can just get more RAM for their NN and tune the GC options to not lock the server too much.
What Hadoop 2 also adds is extensive testing all the way up the stack. In particular, there's a new HBase release coming out soon -hopefully HBase 0.96 will be out in days. Lots of other things have been tested against it -which has helped to identify any incompatibilities between the Hadoop 1.x MapReduce API (MRv1) and Hadoop 2's MRv2, while also getting patches into the rest of the stack where appropriate. As new releases trickle out, everything will end up being built and qualified on Hadoop 2.

Which is why when you look at the features in Hadoop 2.x, as well as headline items "YARN, HDFS Snapshots, ...", you should also consider the testing and QA that went into this -this is the first stable Hadoop 2 release -the first one extensively tested all the way up the stack. Which is why everyone doing that QA -my colleagues, Matt Foley's QA team, the Bigtop developers, and anyone else working with Hadoop who built and tested their code against Hadoop 2.1 beta and later RCs, and reported bugs- deserves credit.

QA teams: your work is appreciated! Take the rest of the week off!

[photo: Deamz and Soker at St Philips, near Old Market]