Followup: where can we do interesting things at low-levels

I managed to offend Andrew Purtell of Intel in my last post, so to make up here's some issues why Hadoop benefits from his work

Jamaica Street Feb 2012

Hardware is changing, CPU arch, memory layouts, cross-core synchronization, instructions built in the CPU. For efficient code we need everything from minimal branch misprediction to things that cut down core operations -especially decompression, checksums, encryption.

That's why the amount of native code in Hadoop is only going to increase -and that's going to take more than just skill to write -it takes the work to identify the problems and bottlenecks in the first place.

Java gets in the way here. You don't know the layout of a java class in memory, so it's near impossible to design cache-efficient designs, pulling in a class on a single cache line, avoiding false sharing within a struct. Even java's volatile opcode forces in a synchronisation barrier -making it much more expensive than C/C++ volatile. It's memory manager didn't go NUMA-aware until Java7 -it's been lagging hardware, but openJDK may speed things up there. Except now there's the problem of testing Hadoop on trunk builds of openJDK.

Looking forward, the end of HDDs is inevitable, but before then, mixed SSD and HDD will be the trend -and HDFS has to use that, somehow deciding where to put data, and knowing where it is.

If you are scheduling an MR job, and you have fast 10GbE networking, then the job placement problem becomes less of "where is there a slot on any of the three server with the data" and more "which of my 3 replicas are in SSD" -then doing work on or near that server, as things will complete that much faster.

And to have some storage of replicas across tiers will need HDFS knowing about the options, and making decisions. Those decisions will be hard, and you'll need something like a rebalancer in each host, moving cold data out of the SSD, so leaving space for warm data. Unless the SSD is actually used for write-through data, in which case the server you wrote two now has two copies, one on SSD, one on HDD. That may be a viable layout for two of the replicae. With P(SSD-fail) << P(HDD fail), you may even consider having more two-copy blocks, though there you would want them on separate boxes.

Other fun areas are OS integration, the underlying filesystem and networking stack. If you look at how today's servers & OSes are tuned, its towards things like databases. In a world where it's many-server designs, parallel operations across a cluster, things could be different

So yes: lots of areas to play in here, all welcome. Especially with tests.

[Photo: Jamaica Street. Centre looks like it is by Paris]

x86 Parts and Server Farms

The Cycle by 3Dom

Intel have ~85% market share in x86 parts; AMD having the rest. That would seem good except for a quote from that report:
Total MPU shipments decreased 6% quarter-over-quarter in the fourth quarter, well below normal seasonality of down 1% quarter-over-quarter due to slowing PC demand and share loss to tablets. We note microprocessor shipments have now declined sequentially in four of the last five quarters.
On the desktop Intel managed to surf the success of Windows & the corresponding demand for x86 parts, so fund the R&D and the Fabs needed to stay ahead of more space efficient RISC CPU architectures. Intel could bring things out like the P5 superscalar CPU with the Pentium, then with the Pentium Pro introduce the first mass market speculative execution CPU architecture, the P6. For these parts and their successors, performance came before power.

The growing performance of those desktop parts not only killed the non-x86 Unix workstation business, they brought the x86 into the server world, pretty much taking a sharp knife to all the Unix on home-rolled CPU business -only IBM's Power and the SPARC lines remain, though how long SPARC will survive remains to be seen. (I'm ignoring Itanium, it being an Intel part too).

That's why when the Hadoop community talks about "commodity" servers, x86 parts are implicit in there. And with 85% market share, that's a lot of CPUs. Or, if you are in networking business, a lot of potentially 10GbE network ports -as nobody else is interested in them.

Intel are supplying the parts for those clusters then, which means they know the number that is selling, and they know something bad about it: most large clusters ship with only one socket filled up.

Problem: Hadoop is usually IO intensive; RAM is often a limiting factor on work. Fill in both sockets and you need to double the RAM too -but even then, may end up idling for data coming off disk or network.

Problem: On the spreadsheets used to cost out a big cluster -you have not only the rows needed to set the parts for each node (ram, storage, CPU), there's one for power. Fill that second socket and not only do you take a CAPEX hit, your OPEX looks worse as your power budget can jump up. And while x86 parts can crank back their wattage when idle, that says "when that second socket you filled is underused, your power bill isn't so bad", as in "If you still overspent on the CAPEX then your OPEX is slightly better".

On a big cluster, not filling that second socket can save enough money you could add another rack of servers -and the storage that comes with it. Storage is a tangible benefit.

Problem: GPUs are where the cycles are coming from, and in HPC clusters, it's worth compiling your code down to it -and working to make sure your code can compile down -and people are exploring GPUs in Hadoop clusters, especially as YARN delivers more cluster flexibility. Intel's attempts to get into the GPU business have failed -apart from the lower power display chipsets used in laptops when they are in battery mode, so GPUs aren't a revenue stream.

Problem on a big datacentre the buyer gets to negotiate prices. This is at its worst (for Intel, AMD, Dell, HP, SuperMicro, SGI) on the big cloud farms, as it will be a take-it-or-leave-it deal for thousands of nodes, and people like Amazon care about both capital cost of parts, and the operational cost of them. The people doing the negotiating know about bill of material curves (see below), and won't be rushing to by the latest part, not if last quarter's part costs significantly less for not much performance difference.

Problem: virtualisation may reduce demand for servers. The more you move to virtual server hosting, the less you need servers for every individual application. The adoption of virtualisation may have driven demand for new blade systems, but now that servers have consolidated, demand may level out -except for the new clusters, HPC racks and cloud service providers.

Problem: The desktop and laptop market are stagnating: the growing end-user products are smart phones and tablets, and with the exception of MS Surface Professional, Intel have barely a presence. Arm owns that market, and AMD have announced their plans to build server-side ARM parts and chipsets. AMD, with only 15% market share, have little to lose after all.

It's not just the CPU design that matters here, it's storage too. As I discussed in my Hadoop Summit EU talk proposal, Hadoop needs to plan ahead here. Tiered storage, many-core WIMPy parts, that's something Hadoop can adopt, especially if we start the groundwork.

And what's part of the nightly Jenkins build of Hadoop? Hadoop on ARM.

There we have it then: Hadoop clusters becoming a big part of the enterprise datacentre, but without demand for CPU parts compared to virtualisation blades and classic HPC systems. GPUs and parts moving from the desktop and ARM cores from the tablet -all have their eyes on the server market, offering massive CAPEX and OPEX savings. Oh, and the desktop market going away while the tablet and phone market remains elusive.

Yes, if you were the premier vendor of today's desktop & server CPUs, you'd be worrying about these -and thinking "even if the desktop business is going down, how can we increase demand for our parts in a growing Hadoop cluster business?"

That can be done in three ways: getting that second socket filled, making sure the CPUs sold have the latest-and greatest parts, and expanding demand for Hadoop clusters.

There is also the SSD business, which is money on the table as far as Intel are concerned: stuff to take from the HDD vendors. There may be competitors -such as Samsung- but if INTC can do mainboards with SSD hooked off the PCIex bus rather than pretending to be disk, there are opportunities that they can exploit.

Finally: 10GbE networking. This is something you need for the latest massive-storage-capacity servers anyway, unless your cluster has so many machines that the loss of single 36TB server isn't significant and the cost of 10GbE switches is. Lower cost of 10GbE switches would be nice.

If you look at the Intel Distribution for Apache Hadoop, then, you can see two features: a justification for multiple CPU sockets filled with the latest parts (encryption using new AES opcodes), and SSD speedup, though not the explicit tiering that the HDFS developers are starting to talk about.

What's not so obvious is why do their own bundling of this, which means they have to take on the entire QA process -which isn't something I've seen in terms of JIRAs filed- and it leaves CPU demand out of mainstream Hadoop. Unless Intel's Hadoop-related product obtains market share rapidly, I'm not sure why they didn't just try and get the CPU-intensive features in early.
[Photo: 3Dom's "The Cycle"; went up on Stokes Croft over the weekend"]


The Cheddar Expedition

Last weekend I took advantage of a confluence of events: sunny weather and no parental obligations, to go for a quick ride down to Cheddar and back.

Plan: southwest via Long Ashton/Festival Way to Backwell; meander on to Yatton. From there, the Strawberry Line; a traffic free cycle route along a disused railwayline to Axbridge Cheddar. A curved approach, but a flat one -my left knee is still recovering from something I did to it in January. Cheddar is the other side of the Mendip hills; a minor 300 metre summit, but with most of the roads having sustained 15-25% gradients, it's not easy.

The old railwaypath would get me over on a flat and quiet road, then a single climb up the spectacular Cheddar Gorge, down the other side and then a fairly undulating route home along country lanes.

Departure time: 12:30. Expected return time: 17:00-17:30, "before it gets dark"

Actual return time: 18:30, "an hour after sunset"

To be fair, I did get to see a very dramatic sunset.
Mendips sunset

It's just that I had same feeling you get if you are up some cliff or mountain when the sun goes down and you hadn't intended to be there at that time.

Sunset can mean you are in trouble. It's why sunrise is somewhat a nicer thing to see in the mountains, though as it means that the avalanche, serac collapse and other ice-melt-triggered objective hazards are becoming more likely, even that can be unwelcome. That's in the Alps though. Here sunset is the one to worry about, as it means you have a lot of work in darkness to get home.

Why did I come in an hour late? Route closures.

The first problem was that the Strawberry Line's tunnel through the crest of Mendips was closed. There was no advance signage of this until I got to the path closure, followed a sign through some "farm mud" track, and was the dumped on the A38 road. Of all the routes over the Mendips that you can cycle, this has to be the one marked "least pleasant". At the summit I turned off and got onto a bridlepath, so could get off the main road. Of course, being just a track in the woods, it was fairly muddy, so not having tyres with tread on them meant the descent was fairly slow.

Eventually I did get the bike down, and then continued on to Cheddar, coming in about half an hour late.

When I get to Cheddar though, there are lots of signs "Cheddar Gorge shops still open". That's a warning sign. You only get signs staying "still open" if it was considered likely that you thought they'd be closed -which left me to wondering "why would they be closed".

The reason people may have thought they were closed, as it turned out, were that the gorge road had been shut since November due to flooding. The whole of Mendips is a limestome massif, with all drainage in the southern watershed being underground. There'd been enough rain that the caves flooded and blocked up; the water surfaced in the gorge, and caused lots of problems. Which were now my problems.

At this point it's 3pm, 2-2.5 hours to go, and I'm on the wrong side of the hills with the route I came out of play, and the route I'd planned to take back also closed. This is not ideal. I have some on-road options, but either it's back on the main roads or carry on down to Wells and up, which is pretty time consuming.

Immediate action: cafe for hot chocolate, chocolate of some other form and wifi for the phone to download some high-res OpenStreetMap datasets.

It'd have been straightforward if I'd been a bike set up for the conditions, meaning tyres that had traction or control in mud. I didn't
Cheddar Gorge Bridleway

For the uphill I could just get off and push -at least my shoes had good traction- but the undulations and descents were hard, not least because you don't want to lose control a short distance away from a 200 metre cliff edge:

Cheddar Gorge Bridleway

I didn't get off this bridleway until about 16:30, which then fed forward to the schedule for the rest of the ride. Sunset before I'd got past the airport, darkness from there to home.

Anyway: two route closures and a bicycle not set up for off-road alternative options resulted in me getting home well after dark. That's when you are glad that modern bike lights are so good.

And to be fair, I did get some great views, not just the sunset, but the late sun over the Somerset Levels, here with Glastonbury Tor, sometimes known by its more fanciful name "Isle of Avalon"

The gorge itself is good from up here, though you only see the top half. It cuts right down to sea level, a geological feature from the last ice age.

Cheddar Gorge Bridleway

In 1903 Britain's oldest complete human skeleton was found here: a 9000 year old stone-age era inhabitant of the area.

In 1997 mitochondrial DNA sampling of the town found someone who was a direct descendant on the maternal side.

That's fascinating on it's own: some of the people living in the town are descendants of the same people that lived their nine thousand years earlier. Whatever changes happened in the UK: romans, saxons, vikings, whatever, they just lived through it all. And they came up with what can be -if done properly- a most excellent cheese.

Footnote: cavers cleared out the blocked sinkholes earlier this week and the road is now intermittently open.


Load balancing, distribution and failure modes of distributed applications


I've just been catching up on the RapGenius/Heroku row: that Heroku's load distribution algorithm "random placement" was killing performance as requests were still be routed to blocked worker nodes -dynos.

It's an interesting problem -to me, the failure to explain this to customers is indefensible. It's not an unexpected problem, it's an intrinsic part of the architecture, so talk about it sooner rather than later.


What about the actual "random distribution of incoming requests" architecture?
  • Eliminates the need to share queue statistics to the front end nodes, so scales better with the #of front end and back end nodes.
  • If all requests take a similarish time to execute (some normal distribution with no long tail), and incoming requests have their own predictable style (say poisson distribution), then scattered scheduling should work well. You've got independent incoming requests being scattered out to a pool of workers that will get their work done. Scattering the work randomly leads to a balanced spreading of the load from each front-end node, if all front-end nodes choose a different set of random nodes at each time t, then the total work on the system gets distributed.
Assuming the incoming request rate for any particular time of day & day of week follows a specific poisson distribution, you could calculate the rate of requests needing serviced, combine that with the service processing distribution and estimate how many worker nodes you'd need for your SLA.

Where does it fail?
  1. If the front end nodes aren't random enough  -they keep picking the same workers and building up queues there.
  2. If the front end nodes aren't independently random -instead they are picking the same nodes as others, while leaving others idle. Again queues build up.
  3. If failed worker nodes aren't detected.
  4. If the time for worker nodes to complete a request has a long-tail built in to some of the requests, so causing some requests to take up much more time, so reducing throughput on that node.
  5. If overloaded worker nodes aren't detected, and queues build up worse there.

#4 & #5: long-tail requests with requests still hitting the overloaded worker nodes appears to be the problem hitting rapgenius. And once those queues build up, it takes a long time for them to go away: latency sucks.

What could be done here?
  1. worker nodes to provide queue information to the front-end nodes, they could pick two or three workers and route to the one with the shortest queue. Needs: protocol to poll before submitting; assumes that polling 3 nodes is enough to find one that isn't overloaded.
  2. tuple-space style "who wants this work?" scattered requests: any of the worker nodes can accept a request based on its workload. This is a pull model rather than a push model. Needs: a T-space that can handle the rate of change.
  3. Something central to collect workload stats and share it with the front end nodes. Needs: something central, data collection & publishing.
  4. Workers to publish their queue details to something distributed (T-Space or multicast load details), front-end nodes to collect this and use it in their decision making.
There: lots of choices. I shall wait to see what Heroku come up with.  Given their existing front-end-push model, I could see #1: check queues on >1 worker, or #4, workers to publish load, being most appealing, with #1 the simplest.

Before everyone goes "that's obvious! Why didn't they think of it!", a cautionary tail.

Way back in 2001, working on one of the early Web Services, XMLRPC requests would come in to some Level7 load balancing thing that would then direct work to the app server back ends processing it. Those machines would talk to other services in the cluster, and return the responses.

Except one day, one of the JVMs failed to find one of the nodes it depended on; some transient DNS failure. Of course, as we all now know, Java caches DNS failures unless you tell it not to: that transient failure was uprated to a permanent failure.

As a result, incoming XMLRPC requests did end up getting serviced very quickly -they were serviced by being turned into failure responses and sent back to the caller.

Which meant that the queue for that app server was significantly shorter than for all the other nodes in the cluster.

Which meant that the L7 router directed more work its way -amplifying a failure from, say 1in 8 requests, to about 1 in 2.

The front end router didn't know that the requests were failing, all it knew was that the observed liveness of the app server "fields requests, returns responses" seemed valid. It was fulfilling its goal of "send work to the machines with the shortest queue".

It was just unfortunate that the machine with the shortest queue wasn't actually live, in the strict "performing useful work" definition of liveness.

That's why a random distribution of requests has some appeal to me: not only is it simpler, it avoids that failure-amplification which we encountered, where fast-failing requests create the illusion of a shorter queue.

As a fun exercise, consider what it would take to automate detection of the failed node. It's not enough for the L7 router to observe that HTTP 500 responses are coming back, because a series of invalid requests could trigger that. You need to submit valid requests to the application from outside, and, if any of them fail, track down which worker node is playing up.

That, for the curious, is precisely why Apache Axis has  two features that I added based on my experiences here.

1. a health page, happyaxis.jsp.This gives the system state, returns 200 if happy, 500 if not -for the things upstream to react to- and is designed for humans to parse too.

2. The other feature, which is much less obvious, is that Axis can include the hostname in an AxisFault; with its addHostnameIfNeeded() method, a hostname may be returned to the caller. If enabled, you can push down some of the fault diagnostics to the clients, as people phone you up saying "I am getting failures from 'host4'", rather than them phoning you up saying "One request in eight is failing". Your remote automated liveness tests submitting reference work will now let you track down this problem without waiting for the call -or looking at the request logs to see which box has the problems. In a virtual world, rm that VM.

Anyway, the key points are this.
  1. Random distribution of workload is both simpler to implement, and somewhat resilient to some failure modes that convert to shorter request queues.
  2. Adding diagnostics to error responses aids in tracking down problems.
On reflection, I think it was that project where my obsession with useful network diagnostics came from.

Update: I've thought of another way to identify an overloaded worker node.
  1. Have each worker node declare its (moving) average service interval from accepting onto the queue to completion.
  2. Have the front-end servers maintain a moving average of the service interval declared by the last few worker nodes given work.
  3. The front end servers pick a worker node at random, as before.
  4. Include a query of the service interval & queue depth into the request forwarding protocol. The front end servers connect to a node in the back, ask it's for it's service interval, then decide whether or not to forward the request.
  5. The front end node can then implement a local policy for job submission that takes into account the current quoted service interval of the selected worker node in comparison with its moving average of job submissions. If the interval is significantly higher, the front-end node may choose to try another worker node to see if its interval is significantly less than that of the first node probed. If so, work could be sent there -and the moving average updated.
This approach doesn't add any more communications if all is working well (the load info can go into the first TCP ACK packet), and is entirely distributed. It's also a very simple evolution of the random distribution strategy.

[photo: a giant lizard made out of old CDs on a building in Madrid]