2012-11-19

AWS: why the bias towards US-east?

As MastodonC will point out, Amazon's US-East sites are the most polluting, not just because they have a high CO2 footprint, but because the coal they (and other east coast industries) burn is polluting in other ways, such as sulphur. It's not as bad as, say, a steelworks (having had relatives living near Ravenscraig Steelworks I can vouch for this), but as datacentres can be placed near other electricity sources, it's needless.

I intermittently use US-West-2, up in Oregon, where the melting snow creates electricity.

Crater Lake Tour 2012

Unfortunately there's an implicit bias in the AWS APIs towards US-East. Where's the default site for S3 Buckets? US-East. Where's the default site for EC2 instances? US-East. What is the default location for EMR jobs? The same -to the extent that the command line client treats requesting a different site as "uncommon":

Uncommon Options
 --debug               Print stack traces when exceptions occur
 --endpoint ENDPOINT   EMR web service host to connect to
 --region REGION       The region to use for the endpoint
 --apps-path APPS_PATH Specify s3:// path to the base of the emr public bucket to use. e.g s3://us-east-1.elasticmapreduce
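
The effect of an implicit default can be sketched in a few lines of Python. This is a hypothetical helper, not code from any AWS client: the name emr_settings and its return fields are made up for illustration, and the apps path for regions other than us-east-1 is assumed to follow the naming pattern shown in the help text above.

```python
# Hypothetical illustration of an implicit default region.
# Nothing here is a real AWS API; it only mimics the behaviour described.
DEFAULT_REGION = "us-east-1"


def emr_settings(region=None):
    """Return the region and apps path a client would end up using."""
    chosen = region or DEFAULT_REGION
    return {
        "region": chosen,
        # Assumption: other regions follow the s3://<region>.elasticmapreduce pattern.
        "apps_path": "s3://%s.elasticmapreduce" % chosen,
        # Records whether the caller actually asked for this region.
        "explicit": region is not None,
    }


# Nobody asked for US-East here, but that is where the job lands.
print(emr_settings())
# Oregon is only reached by passing the "uncommon" option.
print(emr_settings("us-west-2"))
```

Anyone who does not know or care about regions never sets the option, which is how a default turns into a majority.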


Because of all the implicit "us-east" bias, it becomes self-reinforcing. Once you've got a bucket on S3 east, that's where you want to run your webapps, otherwise you get billed for the remote bandwidth. Once you've got the webapps, that's where your logs go, hence even more reason to run your MR jobs on the same site: it's where your data lives.

Because it's the default location for stuff, it's also the default location for people serving up data on the site: RPM and Maven repositories, public datasets. This pushes you towards that location, both to avoid the costs of downloading that data from other sites and for the speed gain.

Why the bias? Either it's where the majority of servers lie, or through a combination of cost of electricity, site PUE and bandwidth, it's got the lowest operating costs -hence the most profit per CPU-hour, MB stored or MB downloaded.

That's a shame, because Amazon themselves have better options. They're being crucified by Parliament over their tax avoidance strategies -it'd be tactically wise to have something positive to talk about.

[Photo: Crater Lake & Mt Thielsen. Smoke is a forest fire blowing up from CA]

2012-11-17

Crater Lake: T+11

Following on from my "Page Mill T+20" trip, in late August we ended up at Crater Lake for the Corvallis "Mid Valley Bicycle Club" annual circuit of the lake.

2001:

Crater Lake

and 2012
Crater Lake Tour 2012
The colours are different as in 2012 the Lassen fires 80 miles to the south are adding a light smoke to the air.

The original picture was taken with a Sony camera, 2048*1536; 3 Megapixels. The resolution is less than my desktop monitor, which makes it appear grainy as a background.

In 2012, the original size of 4000*3000 means nearly four times as many dots; the Panasonic compact has a Leica lens and makes up for the loss of a viewfinder by the ability to display a grid + diagonals over the image, to increase P(horizontal(horizon)).
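
The pixel arithmetic is easy to check. A quick sketch in Python, using only the two image sizes quoted above (the variable names are mine):

```python
# Image sizes quoted in the post.
old_w, old_h = 2048, 1536   # 2001 photo
new_w, new_h = 4000, 3000   # 2012 photo

old_pixels = old_w * old_h
new_pixels = new_w * new_h

# How many times more dots the 2012 image has.
ratio = new_pixels / old_pixels

print(old_pixels)       # 3145728, i.e. about 3 megapixels
print(new_pixels)       # 12000000, i.e. 12 megapixels
print(round(ratio, 2))  # 3.81
```

So the newer picture has a little under four times the dots, which works out to roughly twice the detail along each edge.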

In August 2001, Bina was 5 months pregnant; now our son is 10 and did the loop on a tandem, working with Mike Wilson, who races in the PNW CX circuit in the category below his age to make it more challenging. That may seem to have given Alexander some help -but it also meant that he was made to do it at a fairly aggressive pace, with none of this resting business.

Crater Lake Tour 2012
I did it on a borrowed MTB, with knobbly tires, and took a couple of detours to add 12+ miles to my route. Even so, compared to Alexander, I look suspiciously tired.

Crater Lake Tour 2012

I got back to the campground (because of those detours, honest) about an hour after him, tired, needing my rest and refreshments.

What is the ten year old doing? Running around chasing chipmunks. Then he comes over and tries to steal my beer.

Crater Lake Tour 2012

That's it then -isn't it? I may as well retire now.

Were it not for the fact that university education is becoming so expensive that my son will need a large amount of cash to get through it, I would have no further contribution to make towards my DNA's survival.

2012-11-16

And now: the People's Republic of Bristol

There was an election round England and Wales yesterday. Mostly it was for a new position: Police Commissioner, which was so uninspiring that one polling station in Newport, Wales, had a turnout of exactly zero.
Stokes Croft

In Bristol, we had something else: a Mayoral Election -one decided by a first choice/second choice voting system. The three main parties, some of the "troublemaker parties" -Greens, Respect. And some independents, including one who lives in a van near Stokes Croft.

The results are in, and today we have something new in the city: an Independent Mayor.

I've met George Ferguson a couple of times -he's done a lot for the city, and, as they say, could "organise a piss-up in a brewery" -as he owns one of the local breweries.

This could show a profound change: the locals would rather have someone in charge who wasn't beholden to a party line coming from London, who stated clearly that he'd be appointing his cabinet (from the existing councillors?) on merit, not just from the subset of those from a single party.

There are some other factors at play: a large proportion of voters from Liberal Democrat strongholds appear to have gone for George Ferguson -and those areas had the highest turnout. My own ward was at 20% turnout -and when I dropped round to leave our two postal vote envelopes they were pleasantly surprised. As an attempt to raise awareness and interest in elections, it's failed.


It'll be interesting to see how having an independent works out. Patronage has always been one of the ways a political party achieves loyalty, and I wonder how many people in the council will be working for him, rather than against him.

In the meantime -I shall head down to the Canteen, Stokes Croft, and have one of his beers there.

2012-11-14

A Hadoop Standards Body? It's called the Apache Software Foundation

I am writing this on the ICE502 train from Mannheim to Frankfurt. To my left, my friend Paolo Castagna pages through the emails from Cloudera HQ that are slowly trickling into his phone; I'm out of network range so can't go over to the small-kids (kleinerkinder) compartment and skype in to a Hortonworks team meet.

We are on our way back from ApacheCon EU.
Zooming in

Over the last week, the topics of the talks I've attended have included (omitting my own): Cassandra development, RDF processing in Apache Hadoop (ask Paolo there), Logging futures, post-Apache Maven build tools, Apache OpenOffice cloud integration, CloudStack, Apache HBase status quo -Lars showed how all the HDFS work we've been doing is really going to benefit Apache HBase there, NoSQL ORM, Apache Mahout, and many others. A large proportion of the Apache Hadoop Datacentre Stack is there -and we can sit down and discuss issues. It may be an internal issue: how to move away from commons-logging; it may be something cross project, such as how HDFS could let HBase explicitly request a block placement policy for each region server that kept all replicas on the same rack; or it could be something indirectly relevant like Apache OpenOffice slideshow improvements.

We've been treated to slides from Steve Watt of HP showing their prototype Arm-64 server systems, which will offer tens of servers in a 2U unit -a profound achievement. We've been treated to some excellent beer at the Adobe reception, which went from 18:00 until we were evicted at 21:00.

I met lots of people, some I knew, some I'd never met face to face before, some who were complete strangers until this week. We've been in the same talks, eaten at the same tables, drunk beer in the two restaurants and the cafe in this town, discussing everything from OSGi classloading in Apache Karaf and Jumbo Ethernet frames to what to do when the remains of a decomposing whale end up in your datacentre. The people I was in the cafes with included Lars George (Cloudera), Steve Watt (HP), Isabel Drost (Nokia), and three people who had a whale-related incident in their facility.
A whale? a whale?

Not once did anyone say: "Let's give some standards body the Apache Hadoop trademark and the right to define our APIs as well as the exact semantics of the implementation!"

Nobody said that. Not even whispered it.

Because from the open source perspective, it makes no sense whatsoever. The subject that did come up was "Jackson versioning grief" -which relates to an open JIRA.

I gave a talk saying there is lots of work, and pointing people at svn.apache.org and issues.apache.org, saying "get involved" -and discussing how to do so.

Key things to do
  • gain trust by getting on the lists and being visible (and competent, obviously)
  • help review other people's patches rather than just your own
  • don't try and do big things in Apache HDFS (risk of data loss) or Apache MapReduce (performance and scale risks).
What I did emphasise is that we do want more people helping -and that we need to improve how this is done. I did not suggest that we could do this through "under an industry forum—either an established group or one that is specifically focused on big data.".

What I suggested was -and these are entirely personal opinions -
  1. some mechanism for mentoring in external development projects, so that they don't fail, get neglected, or appear without any warning -and creating integration problems.
  2. better distributed development, so that those of us outside the Bay Area can be involved in the development. Google+ events, more pure-online meetings in various timezones. The YARN event that Arun organised is something I want to praise here: we remote attendees got webex audio and remote slideshare. Even so it was very late in the EU evenings and there's always an imbalance between people in the room -the visible, vocal audience- and people down the speaker phone.
  3. better patch integration through Git and Gerrit. Even if svn is the normative repo, we should be able to accept patches as pull requests that go through Gerrit review; people can update their patches trivially through merging trunk with their branch and pushing out their branch to a public repo.
I also mentioned tests. Not just tests of new features -where we are obsessive about "no features without tests"- but tests that improve the coverage of the system and formalise its semantics.

If there is ambiguity in the behaviour of bits of Apache Hadoop, tests added to the Apache source repository, svn.apache.org, define that behaviour. Regression testing the entire stack finds problems, which is why we love to do that -especially things like testing how repeated runs of Apache HBase's functional test suites succeed while our test infrastructure triggers NameNode failover, or how the deployment of Yahoo!'s existing applications on the new MRv2 engine in YARN improves the performance of those applications -while finding any regressions in MRv2 from the MRv1 runtime.

Testing against Apache Hadoop is the way to guarantee compatibility with Apache Hadoop -because the Apache Hadoop code is Hadoop.
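
To show what I mean by a test defining behaviour, here is a minimal sketch in Python. Everything in it is hypothetical: ToyFileSystem is a made-up in-memory class, not a Hadoop API, and the rule it picks (rename onto an existing file fails and changes nothing) is just one plausible answer to an ambiguous question. The point is that once the test is in the source tree, the ambiguity is gone.

```python
import unittest


class ToyFileSystem:
    """Hypothetical in-memory filesystem, for illustration only."""

    def __init__(self):
        self.files = {}

    def create(self, path, data):
        self.files[path] = data

    def rename(self, src, dst):
        # Chosen semantics: never overwrite; report failure via the return value.
        if src not in self.files or dst in self.files:
            return False
        self.files[dst] = self.files.pop(src)
        return True


class RenameSemanticsTest(unittest.TestCase):
    def test_rename_onto_existing_file_fails_and_changes_nothing(self):
        fs = ToyFileSystem()
        fs.create("/a", b"source")
        fs.create("/b", b"destination")
        self.assertFalse(fs.rename("/a", "/b"))
        self.assertEqual(fs.files, {"/a": b"source", "/b": b"destination"})

    def test_rename_to_new_path_moves_data(self):
        fs = ToyFileSystem()
        fs.create("/a", b"source")
        self.assertTrue(fs.rename("/a", "/c"))
        self.assertEqual(fs.files, {"/c": b"source"})


# Run the two tests without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(RenameSemanticsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

An alternative implementation that overwrote the destination would fail the first test, and that failure is the compatibility report.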

At the root of the svn.apache.org/hadoop source tree, in the Apache tarballs and RPMs, and in those products that include the ASF artifacts or forks thereof is a file: LICENSE.TXT
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
What does that mean? It means:

Anyone is free to write whatever distributed filesystem they want, implement whatever distributed computing platforms on top of that they choose -but they cannot call it Hadoop.

There's a nice simple metric here:

If you can't file bug reports against something in issues.apache.org, it's not an Apache product, and hence not Apache Hadoop.

For that reason: I'm not convinced that the Hadoop stack needs to care about the compatibility concerns of people trying to produce alternative platforms, any more than Microsoft needs to care about the work in Linux to run Windows device drivers.