2014-08-11

Distributed Computing Papers #1: Implementing Remote Procedure Calls

Col de la Machine



Stonebraker is at it again, publishing another denunciation of the Hadoop stack, the main critique being its reimplementations of old Google designs. Well, I've argued in the past that we should be thinking past Google, but not for the reasons Stonebraker picks up on.

My issue is that hardware and systems are changing, and that we should be designing for those future systems: mixed-moving-to-total SSD, many-core NUMA, faster networking, v18n as the deployment model. Which is what people are working on in different parts of Hadoop-land, I'm pleased to say.

But yes, we do like our Google papers —while incomplete, Google still goes to a lot of effort to describe what they have built. We don't get to read much about, say, Amazon's infrastructure, by comparison. Snowden seems to have some slideware on large government projects, but they don't go into the implementation details. And given that the goal of one of those projects was to sample Yahoo! chat frames, we can take it as a given that those apps are pretty dated too. That leaves: Microsoft, Yahoo! Labs, and what LinkedIn and Netflix are doing —the latter all appearing in source form and integrating with Hadoop from day one. No need to read and implement when you can just git clone and build locally.

There's also the very old work, and if everyone is going to list their favourite papers —I'm going to look at some other companies: DEC, Xerox, Sun and Apollo.

What the people there did then was profound at the time, and much of it forms the foundation of what we are building today. Some aspects of the work have fallen by the wayside —something we need to recognise, and consider whether that happened because it was ultimately perceived as a wrongness on the face of the planet (CORBA), or because some less-powerful-yet-adequate interim solution became incredibly successful. We also need to look at what we have inherited from that era, whether the assumptions and goals of that period hold today, and consider the implications of those decisions as applied to today's distributed systems.

Starting with Birrell and Nelson, 1984, Implementing Remote Procedure Calls.

This is it: the paper that said "here is how to make calling a remote machine look like calling some subroutine on a local system".
The primary purpose of our RPC project was to make distributed computation easy. Previously, it was observed within our research community that the construction of communicating programs was a difficult task, undertaken only by members of a select group of communication experts.
Prior to this, people got to write their own communication mechanism, whatever it was. Presumably some low-level socket-ish link between two processes and hand-marshalled data.

After Birrell's paper, RPC became the way. It didn't just explain the idea, it showed the core implementation architecture: stubs on the client; a server "exporting" a service interface by way of a server-side stub; and an RPC engine to receive requests, unmarshal them and hand them off to the server code.
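That stub/engine split can be sketched in a few lines. This is an illustrative toy —the names (`RpcEngine`, `ClientStub`) are mine, and the "network" is an in-process byte hand-off— not the paper's actual implementation:

```python
import json

class RpcEngine:
    """Server side: receives a marshalled request, dispatches to the
    exported implementation, returns a marshalled response."""
    def __init__(self):
        self.exports = {}

    def export(self, interface, implementation):
        self.exports[interface] = implementation

    def handle(self, wire_bytes):
        request = json.loads(wire_bytes)           # unmarshal the request
        impl = self.exports[request["interface"]]
        method = getattr(impl, request["method"])
        result = method(*request["args"])          # hand off to server code
        return json.dumps({"result": result})      # marshal the response

class ClientStub:
    """Client side: makes a remote method look like a local call."""
    def __init__(self, engine, interface):
        self.engine, self.interface = engine, interface

    def call(self, method, *args):
        # marshal the call; in a real system this crosses the network
        wire = json.dumps({"interface": self.interface,
                           "method": method, "args": list(args)})
        response = self.engine.handle(wire)
        return json.loads(response)["result"]

# "Exporting" a service interface:
class Calculator:
    def add(self, a, b):
        return a + b

engine = RpcEngine()
engine.export("Calculator", Calculator())
stub = ClientStub(engine, "Calculator")
result = stub.call("add", 2, 3)   # looks just like a local call
```

The point being: all the marshalling and dispatch lives in the stub and engine, so the caller's code is indistinguishable from a local invocation.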

It also did service location with Xerox Grapevine, in which services were located by type and instance. This allowed instances to be moved around, and for sets of services to be enumerated. It also removed any hard-coding of service to machine, something that classic DNS has tended to reinforce. Yes, we have a global registry, but now we have to hard-code services to machines and ports ("namenode1:50070"), then try to play games with floating IP addresses (which apps —JVMs by default— can cache), or pass lists of the set of all possible hosts down to the clients... tricks we have to do because DNS is all we are left with.
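A Grapevine-style lookup —by service type and instance, rather than host and port— can be sketched like this. The registry API here is hypothetical, purely to show the indirection that hard-coded host:port bindings lose:

```python
class ServiceRegistry:
    def __init__(self):
        self.services = {}   # service type -> {instance name: address}

    def register(self, svc_type, instance, address):
        self.services.setdefault(svc_type, {})[instance] = address

    def lookup(self, svc_type, instance):
        return self.services[svc_type][instance]

    def enumerate(self, svc_type):
        # Grapevine let you enumerate the set of instances of a type
        return sorted(self.services.get(svc_type, {}))

registry = ServiceRegistry()
registry.register("namenode", "production", "host17:8020")
registry.register("namenode", "staging", "host3:8020")
# The production instance moves hosts; clients bound to
# ("namenode", "production") never see a hard-coded host:port
registry.register("namenode", "production", "host22:8020")
```

Clients resolve ("namenode", "production") at call time, so moving the instance is invisible to them —exactly the property that caching a floating IP in a JVM breaks.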

Ignoring that, RPC has become one of the core ways to communicate between machines. For people saying "REST!": if the client-side implementation is making what appear to be blocking procedure calls, I'm viewing it as a descendant of RPC —same if you are using JAX-RS to implement a back end that maps down to a blocking method call. That notion of "avoid a state machine by implementing state implicitly in the stack of the caller and its PC register" is there. And as Lamport makes clear: a computer is a state machine.

Hence: RPC is the dominant code-side metaphor for client and server applications to talk to each other, irrespective of the inter-machine protocols.

There are a couple of others that spring to mind:

Message passing. Message passing comes into and falls out of fashion. What is an Enterprise Service Bus but a very large message delivery system? And as you look at Erlang-based services, messages are a core design. Then there's Akka: message passing within the application, which is starting to appeal to me as an architecture for complex applications.

Shared state. Entries in a distributed filesystem, ZooKeeper znodes, tuple-spaces and even RDBMS tables cover this. Programs talk indirectly by publishing something to a (possibly persistent) location; recipients poll or subscribe to changes.

We need to look at those more.

Meanwhile, returning to the RPC paper, another bit of the abstract merits attention:
RPC will, we hope, remove unnecessary difficulties, leaving only the fundamental difficulties of building distributed systems: timing, independent failure of components, and the coexistence of independent execution environments.
As the authors call out, all you need do is handle "things happening in parallel" and "things going wrong". They knew these issues existed from the outset, yet because RPC makes the fact that you are calling stuff remotely "transparent", it's easy for us developers to forget about the failure modes and the delays.

Which is the wonderful and terrible thing about RPC calls: they look just like local calls until things go wrong, and then they either fail unexpectedly, or block for an indeterminate period. Which, if you are not careful, propagates.

Case in point: HDFS-6803, "Documenting DFSClient#DFSInputStream expectations reading and preading in concurrent context". That seems such a tiny little detail: should the positioned read operation readFully(long position, byte[] buffer, int offset, int length) be thread-safe and, if the implementation is up to it, reentrant?

To which my current view is yes, but we shouldn't make it invisible to others. Why? If we try to hide that a seek/read/seek sequence is in progress, you either have to jump through hoops caching and serving up a previous position, or make getPos() synchronized, as proposed in HDFS-6813. Which effectively means that getPos() has been elevated from a lightweight variable fetch to something that can block until an RPC call sequence in a different thread has completed, successfully or otherwise. And that's a very, very different world. People calling that getPos() operation may not expect an in-progress positioned read to surface, but nor do they expect it to take minutes —which it can now do.
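The hazard can be sketched without any HDFS at all: a toy stream in which the positioned read and the position getter share one lock (an illustration of the coupling, not the DFSInputStream code —the class and delays here are invented):

```python
import threading
import time

class ToyInputStream:
    def __init__(self):
        self.lock = threading.Lock()
        self.pos = 0

    def read_fully(self, position, length, remote_delay):
        # the positioned read holds the lock for as long as the
        # (simulated) remote call takes
        with self.lock:
            time.sleep(remote_delay)
            return b"x" * length

    def get_pos(self):
        # a "lightweight" getter... that now waits on the same lock
        with self.lock:
            return self.pos

stream = ToyInputStream()
reader = threading.Thread(target=stream.read_fully, args=(0, 4, 0.2))
reader.start()
time.sleep(0.05)              # let the positioned read take the lock first
start = time.time()
p = stream.get_pos()          # blocks until the in-flight read completes
blocked_for = time.time() - start
reader.join()
```

Here get_pos() stalls for the remainder of the 0.2s "remote" read; swap the sleep for a real RPC timeout measured in minutes and you get the surprise described above.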

And where do we have such potentially-blocking RPC calls inside synchronized method calls? All over the place. At least in my code. Because most of the time RPC calls work -they work so well we've gotten complacent.



(currently testing scheduled-window-reset container failure counting)
(photo: looking ahead on Col de la Machine, Vercors Massif — one of the most significantly exposed roads I've ever encountered)

2014-08-08

Modern Alpinism

I am back from a 2+ week holiday in the French Alps —the first since 2009. More specifically, the first since the DVLA took away my driving licence on the basis that I was medically unfit to drive. They now consider me no more dangerous than anyone else, so we could drive down there. My main issue with driving now is that I've forgotten things: motorway awareness, the width of my car (very important in Bristol), what it feels like to go round corners fast enough for the car to indicate through the steering wheel that it doesn't want to —and worst of all, how to parallel park into small gaps.

Two weeks in the French Alps, along with the 2-day approaches/retreats, is a good revision there, giving me the learning matrix of:

                                  Day   Night   Bad Weather
  Autoroute
  French village
  French urban (excluding Paris)
  Windy country roads
  Alpine passes


There's a few left, and I didn't try any overtakes, but otherwise that's reasonable coverage of the configuration space.

While there: MTB work, some on-road work including a 100 mile epic, a bit of basic Alpine hut hiking/scrambling —as well as staying with friends, eating too many cheese-based products (tartiflettes &c), consuming French beer and cider, catching the Tour de France Chamrousse stage and generally having fun.

I got to review some of the places last visited in 2009:
break time

Where I'm pleased to see that I don't seem significantly rounder

Col du Barrioz



The main difference this time was that someone was rinsing their clothes in this village fountain half-way up the Col du Barrioz, so we had to sprint off before we got harassed —crossing the pass and sheltering from some rain in a bar/tabac, where I opted not to drink that day

Col du Barrioz

a policy which the camera's EXIF data implies lasted exactly nine minutes

Col du Barrioz



Anyway: an excellent holiday —it was great to be back in the Alps.

2014-05-28

Can I have the password for my hotel room's shower?

A conversation you never get at a hotel when you check in:
"how many people will be having showers?"
"oh, three of us"
"OK, here are three vouchers for hot water. Keep them handy as you'll need to retype them at random points in the day"
"thank you. Is the login screen in a random EU language and in a font that looks really tiny when I try to enter it, with a random set of characters that are near impossible to type reliably on an on-screen keyboard especially as the UI immediately converts them to * symbols out of a misguided fear that someone will be looking over my shoulder trying to steal some shower-time?"
"Why, yes -how very perceptive of you. Oh, one more thing -hot water quotas"
"hot water quotas?"
"yes, every voucher is good for 100 Litres of water/day. If you go over that rate then you will be billed at 20 c/Litre."
"That's a lot!"
"Yes, we recommend you only have quick showers. But don't worry, the flow rate of the shower is very low on this hot water scheme, so you can still have three minutes worth of showering without having to worry"
"'this' hot water scheme?"
"yes -you can buy a premium-hot-water-upgrade that not only gives you 500L/day, it doubles the flow rate of the shower."
"oh, I think I will just go to the cafe round the corner -they have free hot water without any need for a login"
"if that is what you want. Is there anything else?"
"Yes, where is my room?"
"It's on the 17th floor -the stairs are over there. With your luggage you could get everything up in two goes -it will only take about fifteen minutes"
"17 floors! Fifteen minutes! Don't you have a lift?"
"Ah -do you mean our premium automated floor-transport service? Why yes, we do have one. It won't even add much to your bill. Would you like to buy a login? First -how many people will plan on using the lift every day -and how many times?"

2014-05-14

LG: this is not your television -certainly not your kids'

If there's one difference between the current "internet of things" concept and its predecessors (ubiquitous computing, JINI, Cooltown, etc.), it is that it's not just devices with internet connectivity: there's a presumption that those devices are generating data for remote systems, and making use of that processed data.

This can be a great thing. Devices with integrated awareness of the global aggregate datasets have great potential to benefit the owners, myself included. And I'm confident that when that data starts to be collected, it'll be in Hadoop clusters, running code I've helped author.

But we need to start thinking now about how to deliver an Internet of Things-that-benefit-the-owner. If the connectivity and data analysis is designed to benefit someone else, then it's gone from a utility to a threat.

It is critical that we make sure that the emergence of the "Internet of Things" does not become perceived as a threat to those of us who own those things. If not, the vision and opportunity will not be realised. Which is why I'm starting to worry about my television -to the extent that not only am I not applying a new system update which includes a critical "terms and conditions update", I'm thinking of composing a letter to the UK Information Commissioner's Office on the topic.

This is a 16-month-old telly, one I first reviewed in 2013, where I implied its "smart" features were like AOL's, or those 1998-era home PCs pre-cluttered with junk you didn't want.

Later on last year, it turned out that LG Smart TVs were discreetly uploading terrestrial TV viewing data, along with USB filenames, and that its privacy policy considered such viewing data anonymous.

LG got some bad press there, which they are reacting to with that new upgrade -disabling smart TV features until you agree to the new policy. Apply the system upgrade and iPlayer gets disabled until you consent -to a policy that pretty much enumerates every possible way to extract information about the user short of videoing everything you do.

Here are some points in the new and improved privacy policy which grab my attention.

Accept it or the device doesn't work: "in order to gain access to the full range of Smart TV services, you must agree to our Privacy Policy, which facilitates a greater exchange of information between your LG Smart TV and our systems." That's for access to BBC iPlayer and Netflix -third-party services.

Viewing information includes data from HDMI devices: "Viewing Information may include the name of the channel or program watched, requests to view content, the terms you use to search for content, details of actions taken while viewing (e.g., play, stop, pause, etc.), the duration that content was watched, input method (RF, Component, HDMI) and search queries."

You opt out of all EU data protection: "By agreeing to this Privacy Policy you expressly consent to us and our business associates and suppliers processing your data in any jurisdiction"

The section I really want to call out is the paragraph on "Protecting the Privacy of Children"

Protecting the privacy of children is important to us. For that reason, none of our Smart TV services are directed at anyone under 13 and they are not structured specifically to attract anyone under 13. We also do not knowingly collect or maintain personal information from users who are under 13. Should we learn or be notified that we have collected information from users under the age of 13, we will promptly delete such personal information.


This is something that nobody could say with a straight face. The privacy policy states that LG has the right to monitor all terrestrial TV viewing -including CBBC, CBeebies and other kids' channels- and push that upstream. The "viewing information" also includes information on external inputs, so perhaps even playing content from a games console is monitored.


Actual use case


This is clearly an illegal use of a television: two children are trying to play a game on it. Which -for better or worse- children do. And yet it is now something that LG are pretending their smart TV is "not for".

Someone should tell the marketing department that children can't use Smart TVs, as their UK site says otherwise, with the phrase "LG Smart TV's Game World provides family entertainment." Maybe they mean "families where all the kids are over 13". Except, also on that page: "Enjoy hours of free 3D content including documentaries, sports, kids and music concerts and rent the latest 3D Disney movies exclusively with LG Smart TV." We have actually operated a say-no-to-Disney policy for 12 years, but I believe that they do target children under the age of thirteen.

Any assertion that the TV's advanced features aren't there for use by children aged twelve and under is bogus -and the site and marketing show this. The fact that providers like Netflix and iPlayer have kids' content shows they are targeting children. If LG didn't want kids to use that content, they'd have approached those organisations and said "leave the kids' content out on our machines"

So what now? 

1. I'm not applying the update, so haven't accepted the T&C changes. I wonder what's going to happen there? Is everything suddenly going to stop working in a big server-side switch, or will I just be assumed to have accepted -and my data collected as if I had agreed?

2. I'm debating contacting LG to say "a twelve year old uses our television, please stop collecting data on it". This would really put them on the spot to see how they react.

3. I'm not happy about the data-out-of-EU policy. I know web sites -remote servers- have such policies. But can consumer goods, bought at a shop down the road, have a set of T&Cs that say that, for them to work, all EU data protection laws have to be discarded?

What happens on these smart TVs is important -it's an example of how a traditionally offline consumer device is being wired up to the rest of the net- and we need to define now what everyone's expectations should be. We consumers should expect to be the owners of the machines -and in control of the data. Vendors of the devices have opportunities to make great uses of the data -but they have to do it in a way that brings tangible benefit. Better advertising placement on a TV you've bought isn't such a benefit -at least to me. And if it is -why isn't it on the product web pages?

LG are being leading edge here -but right now they are almost becoming a "what not to do" story. And with every update to their firmware, things only get worse.

2014-05-04

Fundamental Flaws in Android's "contacts-includes-search" feature

One under-reported aspect of the Android 4.4 release is that it adds -by default, apparently- searches inside your contacts.
"why not mix searches with your contacts"

That is, whereas before you had a list of people you wanted to phone, which you could search and then call, now it can also include "nearby places that include your query".

A key point is being missed here. When I start typing the name of someone in the contacts list, I am not trying to "query" something, just do a lookup of people in a small table. And if there isn't a match there, it is not that I want the search broadened -it is that I have mistyped someone's name and would like some fuzzy matching.

What do I get, then, if I mistype "andes" instead of the surname of a friend, "anders"?
because random locations are exactly what you want

I don't get a "showing results for anders" dialog, the way you do if you misspell something in Google search.

No, I get a hostel and a set of apartments in Santiago, Chile, and a mountain sports shop in New Hampshire.

None of these are nearby by any definition of "nearby" except that used by astronomers, where "inner solar system" is considered close.

Maybe, just maybe, if the "search nearby places" feature did actually search nearby, then it could be useful -though as there are so many other search bars on an Android phone (top of the front page, in Maps, in Chrome), I have never stared at the phone thinking "I wonder if there is a way to look up things on the web from here?" I seriously doubt that.

But there is no way that I can conceivably consider returning two hotels in South America and a US shop to be a nearby search; it is of no use whatsoever.

I find it hard to conceive what developer team managed to come up with a search threshold where the cutoff for "not nearby" appears to be somewhere outside the orbit of the moon. No doubt there's some bug where the fact that a Java byte is signed means setting one to 180 degrees results in the number being chopped down -but really, couldn't the QA team write a test that entered a key term not in the contacts, and fail the test if the top three results included one or more entries on a different continent, or even hemisphere?

2014-01-15

Greylisting - like blacklisting only more forgiving

How not to fix a car



Reading the paper The φ Accrual Failure Detector has made me realise something that I should have recognised before: blacklisting errant nodes is too harsh -we should be assigning a score to them and then ordering them based on perceived reliability, rather than using a simple reliable/unreliable flag.

In particular: the smaller the cluster, the more you have to make do with unreliable nodes. It doesn't matter if your car is unreliable, if it is all you have. You will use it, even if it means you end up trying to tape up an exhaust in a car park in Snowdonia, holding the part in place with a lead acid battery mistakenly placed on its side.

Similarly, on a 3-node cluster, if you want three region servers on different nodes, you have to accept that they all get in, even if sometimes unreliable.

This changes how you view cluster failures. We should track the total failures over time, and some weighted moving average of recent failures -the latter to give us a score of unreliability. Reliability is then 1 - unreliability, assuming I can normalise unreliability to a floating point value in the range 0-1.
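As a sketch of what that scoring might look like -the decay constant here is an assumption for illustration, not a tuned value:

```python
class NodeHistory:
    """Per-node failure tracking: a lifetime counter plus an
    exponentially weighted moving average of recent failures,
    normalised to [0, 1]."""

    DECAY = 0.8   # weight given to history at each observation (assumed)

    def __init__(self):
        self.total_failures = 0
        self.unreliability = 0.0   # stays in [0, 1] by construction

    def observe(self, failed):
        if failed:
            self.total_failures += 1
        # EMA: the old score decays; a failure this period pushes
        # the score towards 1, a success pushes it towards 0
        outcome = 1.0 if failed else 0.0
        self.unreliability = (self.DECAY * self.unreliability
                              + (1 - self.DECAY) * outcome)

    @property
    def reliability(self):
        return 1.0 - self.unreliability

node = NodeHistory()
for failed in [True, True, False, False, False]:
    node.observe(failed)
```

Two early failures leave a lingering (but decaying) mark on the moving average while the lifetime counter keeps the full history -which is exactly the "recent vs total" split described above.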

When specifically requesting nodes, we only ask for those with a recent reliability over a threshold; when we get them back we first sort by reliability and try to allocate all role instances to the most reliable nodes (sometimes YARN gives you more allocations than you asked for). We may still end up with some allocations on nodes below the reliability threshold.
That threshold will depend on cluster size -we need to tune it based on the cluster size provided by the RM (issue: does it return current cluster size or maximum cluster size?).

What to do with allocations on nodes below the threshold?
options
  1. discard them, ask for a new instance immediately: high risk of receiving the old one again.
  2. discard them, wait, then ask for a new instance: lower risk.
  3. ask for a new instance first, then release the old one at the soonest of: the new allocation arriving, or some time period after making the request. This probably has the lowest risk, precisely because if there is capacity in the cluster we won't get that old container back -we'll get a new one on an arbitrary node. If there isn't capacity, then when we release the container some time period after making the request, we get it back again. That delayed release is critical to ensuring we get something back if there is no space.
What to do if we get the same host back again? Maybe just take what we are given, especially in case #3 when we know that the container was released after a timeout. It'll be below the reliability threshold, but let's see what happens -it may just be that now it works (some other service blocking a port has finished, etc.). And if not, it gets marked as more unreliable.
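The sort-and-place step above might look something like this; `place()` and the score map are hypothetical names for illustration, not any YARN API:

```python
def place(allocations, reliability, instances_needed, threshold):
    """allocations: list of hostnames YARN handed back.
    reliability: host -> score in [0, 1]; unknown hosts score 0.
    Returns (accepted, surplus, suspect)."""
    # most reliable nodes first
    ranked = sorted(allocations,
                    key=lambda h: reliability.get(h, 0.0),
                    reverse=True)
    accepted = ranked[:instances_needed]
    # YARN sometimes over-allocates; these go back
    surplus = ranked[instances_needed:]
    # accepted nodes that sit below the reliability bar -candidates
    # for the discard/re-request dance described above
    suspect = [h for h in accepted
               if reliability.get(h, 0.0) < threshold]
    return accepted, surplus, suspect

scores = {"node1": 0.95, "node2": 0.40, "node3": 0.80, "node4": 0.99}
accepted, surplus, suspect = place(
    ["node1", "node2", "node3", "node4"], scores,
    instances_needed=3, threshold=0.75)
```

Here the unreliable node2 is the one returned as surplus; had it been needed to make up the numbers, it would instead land in the suspect list for delayed release.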

If we do start off giving all nodes a reliability of under 100%, then we can even distinguish "unknown" from "known good" and "known unreliable". This gives applications a power they don't have today -a way to not trust the as-yet-unknown parts of a cluster

If using this for HDD monitoring, I'd certainly want to consider brand new disks as less than 100% reliable at first, and try to avoid storing data on more than one drive below a specific reliability threshold -though that just makes block placement even more complex


I like this design --I just need the relevant equations

2014-01-06

Hoya as an architecture for YARN apps

Sepr@Bearpit

Someone -and I won't name them- commented on my proposal for a Hadoop Summit EU talk, Secrets of YARN development: "I am reading YARN source code for the last few days now and am curious to get your thoughts on this topic - as I think HOYA is a bad example (sorry!) and even the DistributedShell is not making any sense."

My response: I don't believe that DShell is a good reference architecture for a YARN app. It sticks all the logic for the AM into the service class itself, doesn't do much on failures, and avoids the whole topics of RPC and security. It introduces the concepts, but if you start with it and evolve it, you end up with a messy codebase that is hard to test -and you are left delving into the MR code to work out how to deal with YARN RM security tokens, RPC service setup, and other details that you'd need to know in production

Whereas Hoya
  • Embraces the service model as the glue to building a more complex application. Shows my SmartFrog experience in building workflows and apps from service aggregation.
  • Completely splits the model of the YARN app from the YARN-integration layer, producing a model-controller design where the model can be tested independently of YARN itself.
  • Provides a mock YARN runtime to test some aspects of the system --failures, placement history, best-effort placement-history reload after unplanned AM failures-- and lays the way for simulating that the model can handle 1000+ clusters.
  • Contains a test suite that even kills HBase masters and Region Servers to verify that the system recovers.
  • Implements the secure RPC stuff that DShell doesn't, and which isn't documented anywhere that I could find.
  • Bundles itself up into a tarball with a launcher script -it does not rely on Hadoop or YARN being installed on the client machine.
So yes, I do think Hoya is a good example
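That model/controller split is the piece I'd call out as the reusable pattern: a placement model with no YARN dependencies at all, driven by a thin YARN-facing controller. A minimal sketch (class and method names are mine, not Hoya's):

```python
class AppModel:
    """Pure application state: which roles need how many instances.
    No YARN imports, so it can be unit-tested -or driven by a mock
    runtime- without a cluster."""

    def __init__(self, desired):
        self.desired = dict(desired)            # role -> target count
        self.running = {r: 0 for r in desired}  # role -> live count

    def outstanding_requests(self):
        """What the controller should ask the RM for next."""
        return {role: self.desired[role] - self.running[role]
                for role in self.desired
                if self.desired[role] > self.running[role]}

    # Events the controller translates from YARN callbacks:
    def on_container_started(self, role):
        self.running[role] += 1

    def on_container_failed(self, role):
        self.running[role] -= 1

model = AppModel({"master": 1, "worker": 3})
model.on_container_started("worker")
```

The controller's only job is to translate RM callbacks into these events and the outstanding-request map into container requests -everything interesting is testable in-process.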

Where it is weak is
  1. It's now got too sophisticated for an intro to YARN.
  2. I made the mistake of using protobuf for RPC, which is needless complexity and pain. Unless you really, really want interop and are willing to waste a couple of days implementing marshalling code, I'd stick to the classic Hadoop RPC. Or look at Thrift.
  3. I need to revisit and cleanup of bits of the client side provider/template setup logic.
  4. We need to implement anti-affinity by rejecting multiple assignments to the same host for non-affine roles.
  5. It's pure AM-side, starting HBase or Accumulo on the remote containers, but doesn't try hooking the containers up to the AM for any kind of IPC.
  6. We need to improve its failure handling with more exponential backoff, moving-average blacklisting and some other details. This is really fascinating and, as Andrew Purtell pointed me at phi-accrual failure detection, is clearly an opportunity for some interesting work.
I'd actually like to pull the mock YARN stuff out for re-use --same for any blacklisting code written for long-lived apps.

I also filed a JIRA "rework DShell to be a good reference design", which means implement the MVC split and add a secure RPC service API to cover that topic.

Otherwise: have a look at the Twill project in incubation. If someone is going to start writing a YARN app, I'd say: start there.