PUE, CO2 and NYT

Crater Lake Tour 2012

The NY Times has published an article "exposing" the shocking power wasted in datacentres. It's an interesting read, even if metrics like "1.8 trillion gigabytes" take some work to convert into meaningful values. Assuming they use the HDD vendors' decimal abuse of the G and T prefixes in their disk specs, they work out as:
"2,000 gigabytes": ~2 TB.
"50,000 gigabytes": ~50 TB.
"roughly a million gigabytes": ~1 PB.
"1.8 trillion gigabytes": ~1.8 ZB (1.8e12 × 1e9 bytes).
"76 billion kilowatt hours": 76e9 kWh = 76e6 MWh = 76e3 GWh = 76 TWh.
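For anyone who wants to check the arithmetic, here's a quick Python sketch of the conversions (the "gigabytes" figures are the NYT's; the decimal prefixes, 1 GB = 1e9 bytes, are the HDD vendors'):

```python
# Convert the NYT's "gigabytes" figures into sensible units,
# using the HDD vendors' decimal prefixes: 1 GB = 1e9 bytes.
GB = 1e9  # bytes

PREFIXES = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def si(nbytes: float) -> str:
    """Render a byte count with the largest decimal prefix that fits."""
    i = 0
    while nbytes >= 1000 and i < len(PREFIXES) - 1:
        nbytes /= 1000
        i += 1
    return f"{nbytes:g} {PREFIXES[i]}"

for label, gigabytes in [
    ("2,000 gigabytes", 2_000),
    ("50,000 gigabytes", 50_000),
    ("roughly a million gigabytes", 1e6),
    ("1.8 trillion gigabytes", 1.8e12),
]:
    print(f"{label}: {si(gigabytes * GB)}")

# The energy figure: 76 billion kWh
kwh = 76e9
print(f"76 billion kWh = {kwh / 1e9:g} TWh")
```

Note that a trillion gigabytes lands you in zettabyte territory, three prefixes up from the terabytes most people can visualise.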

There's already a scathing rebuttal, which doesn't say much I disagree with.

One part of the NYT article involved looking round a "datacenter" and discovering lots of unused machines, and services that only get used intermittently. I'm assuming this is some kind of enterprise datacentre: a room or two set up a decade ago to host machines. Those underused machines should be eliminated, their disk images converted to VMs and hosted under a hypervisor. Result: less floor space, CPU power and HDD momentum wasted.

Those enterprise datacentres are the ones whose PUE tends to be pretty bad -because it's mixed in with the rest of the site's aircon budget, it's not as significant & visible a cost as it is for the big facilities. Google, Amazon and Facebook do care about this; they are probably the people backing the ARM-based servers, such as those running Hadoop Jenkins builds. What those vendors care about tends to be cost, though: cost of HW, cost of power, cost of land, cost of packets.

What the article doesn't look at -but the folks at MastodonC will presumably cover at Strata EU- is not the energy cost of computation, but the CO2 cost. Those datacentres in VA, where Amazon US-East is, have awful CO2 footprints, being all coal-powered. That's why it's ironic that the NYT complains about Amazon's diesel generators being polluting -in a part of the world where mountain-top mining converts entire mountains into smoke. They'd have been better off looking at the CO2 footprint of the datacentres, and of the other industries in the area.

MastodonC's dashboard is why I'm storing data and spinning up t1.micro instances in US-West 2 (Oregon) -the lowest CO2 footprint of Amazon's US sites.

I was also kind of miffed at the paper's criticism of power lines "financed by millions of ordinary ratepayers". Surely freeways were "financed by millions of ordinary ratepayers", yet the NYT has never done a shocking critique of Walmart's use of them to ship consumer goods round in fuel-inefficient diesel trucks, despite the fact that an energy-efficient alternative (electric trains) has existed for decades.

One thing the NYT does hint at is the storage cost -and hence the power cost- of old email attachments. It makes me think that I should clean some of the old junk up. What they don't pick up on is the dark secret of Youtube: the percentage of videos that are of cats. If you want someone to blame, blame the phones that make taking such videos trivial, and the people who upload them.

[Photo: Crater Lake, OR. The sky is hazy as the forest fires in Lassen and west of Redding are bringing smoke up from CA].


My Hadoop-related Speaking Schedule

I'm back from the US, where I had lots of fun getting the HA HDP-1 stuff out the door -I now know about Linux Resource Agents, and far too much about Bash -though that knowledge turns out to be terrifyingly useful.

Here's a pic of me sitting outside a cabin in Yosemite Valley where we spent a couple of nights -Camp 4 wasn't on the permitted accommodation list this time.
Curry Camp Cabin, Yosemite

Some people may be thinking "cabin?", "Yosemite?" and "Isn't that where all those people caught Hantavirus and died?". The answer is yes -though they were in wooden-walled tent-things about 100 metres away, and the epidemiology assessments show that even for them the risk is very small. The press like headlines like "20,000 people may be at risk" -missing the point that the larger the set of people "present" for the same number "ill", the smaller P(ill | present). Which is good, as P(die | ill)=0.4.
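To make the conditional-probability point concrete, here's a toy sketch. The only figure taken from the text above is P(die | ill)=0.4; the case count of ~10 is my rough recollection of the 2012 outbreak numbers, and the exposure counts are made up for illustration:

```python
# Toy illustration: with a fixed number of cases, the bigger the
# "exposed" headline figure, the smaller the per-person risk.
P_DIE_GIVEN_ILL = 0.4   # case fatality rate quoted above
N_ILL = 10              # assumed case count (roughly the 2012 outbreak)

def p_ill_given_present(n_ill: int, n_present: int) -> float:
    """Crude frequentist estimate: cases / people present."""
    return n_ill / n_present

for n_present in (1_000, 20_000, 100_000):
    p_ill = p_ill_given_present(N_ILL, n_present)
    p_die = p_ill * P_DIE_GIVEN_ILL  # P(die|present) via the chain rule
    print(f"present={n_present:>7}: P(ill|present)={p_ill:.5f}  "
          f"P(die|present)={p_die:.5f}")
```

So the scarier the "20,000 may be at risk" headline, the better the odds for any one person who was there.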

Even so, I've had some good discussions with the family doctor and the UK Health Protection Agency, who did write a letter saying "if you show symptoms of flu within 6 weeks of visiting, get to a hospital for a blood test". As the doctor said, "we don't get many cases of Hantavirus in Bristol", so it's not something they are geared up for. You can tell that when they start looking at the same web pages you've already read.

Well, we've got 1-2 weeks left to go. And it was excellent in Yosemite, though next time I'd stay more in Tuolumne Meadows than in the valley itself (too busy), and maybe sort out the paperwork to go back-country.


Assuming that I remain alive for the next fortnight, here's where I'll be speaking over the next few months.

Strata EU: Data Availability and Integrity in Apache Hadoop.

I've already done a preview of the talk at a little workshop in Bristol -the live demo of RHEL HA failover did work, so I hope to repeat it. I'll be manning the Hortonworks booth and wearing branded T-shirts, so will be findable -though I plan to attend some of the talks. In particular, one of the people behind Spatial Analysis UK will be talking -and I just love their maps.

Big Data Con London, Hadoop as a Data Refinery.

Here I'll be exploring the "Data Refinery" metaphor as a way to visualise and communicate the role of the Hadoop stack in existing organisations.

ApacheCon EU, Introduction to Hadoop-dev.

I'm going to talk about the Hadoop development process: QA, testing and contributions. This isn't going to be a basic "here's SVN" talk, or a "Hortonworks and Cloudera can handle everything" one, but one that looks at the current process -both its strengths and weaknesses. As a committer who was not only on their own for some years, but is still in a different TZ, I know the problems that arise. I believe it is essential for people using Hadoop in the field to get their feedback in, through JIRA, tests & patches.

If there is one thing that I think needs work, it is a semi-formalised process for external projects to do mentored work relating to Hadoop -companies, individuals, interns and university research. All too often we don't know that someone is working on a feature until they turn up with something big that cuts across the projects -and at that point it's too late to shape it, to open it up to external input, or even to comprehend it. Just as Apache has an incubator, I think we need something structured -as the alternative is that this work falls on the floor and ends up wasted.