Monday, September 10, 2012

There are SPOFs hiding everywhere!

One of my test systems sent me a few hundred emails between 2:38 and 7:45am JST. It is just a test server, so the alerts didn't go to my phone, and I only noticed the problem around 7:30am. No paying clients were affected.

I dived straight in and found my Rackspace UK box couldn't resolve api.qqtrend.com. But the same DNS lookup worked for me, and other DNS lookups were working on the Rackspace box. I also found it couldn't ping the DNS server (at GoDaddy), though I could from my office LAN, and also from a U.S. server.

Conclusion (totally wrong - see below): Rackspace UK data centre had issues.

So I logged in to Rackspace to check for alerts and post a support ticket. There was a mention, phrased very vaguely, saying "someone else" had DNS problems. Hmmm. By this time I could ping the GoDaddy server, but DNS lookup still failed. Uncertain, I pinged around a bit more, and by that time DNS had started working.

I.e. the underlying problem had already been fixed; it was just taking time to spread through the internet, and I could have "fixed it" with no effort by just staying in bed 15 minutes longer. Oh well.

Anyway, it turns out it was a sociopath attacking GoDaddy: http://www.bbc.co.uk/news/business-19549367


Here are his reasons:
    "i'm taking godaddy down bacause well i'd like to test how the cyber security is safe and for more reasons that i can not talk now." (sic)

OK, English may not be his first language. But even allowing for that, he is not coming across as an upstanding member of society. Not protesting, no cause, just wanting to see if he had the skills to annoy a lot of people. This is a guy (gal?) who badly needs a girlfriend/boyfriend. If you know him, introduce him to someone. Please.

I know GoDaddy had a big PR screw-up by initially supporting SOPA, but they had the courage and sense to listen to people and change their position. Still a good company in my mind.

But the silver lining is that it nicely illustrated that we (QQ Trend) have a Single Point Of Failure, at the DNS and registrar level, that had been overlooked. We already had server1 and server2 as the two endpoints. We have them on different continents, with different cloud providers (no secret: Amazon and Rackspace). And I thought that was solidity to boast about. I was about to add server3 on a third continent as an option for customers who really, really need 100% uptime.

But what today's problem reveals is that if all three servers are on the same domain (server1.example.com, server2.example.com, server3.example.com), then we have a potential single point of failure in the DNS, and even in the registrar.

I think we need an alternative domain, at a completely different registrar, with DNS at an independent ISP. Then at the script level we add that in as one of the failover endpoints. The new names will point to the same three servers.
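A minimal sketch of what that script-level failover might look like, in Python. The example.net names stand in for the alternative domain and are made up, as is the first_resolvable helper; the real client scripts may well look quite different. The idea is simply to walk the endpoint list and skip any name that DNS cannot resolve.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical endpoint list: the same three servers, reachable under two
# domains that are held at different registrars with independent DNS hosting.
ENDPOINTS = [
    "server1.example.com",
    "server2.example.com",
    "server3.example.com",
    "server1.example.net",  # alternative domain (assumed name)
    "server2.example.net",
    "server3.example.net",
]


def first_resolvable(
    hosts: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Optional[Tuple[str, str]]:
    """Return (hostname, ip) for the first host that resolves, else None."""
    for host in hosts:
        try:
            return host, resolve(host)
        except OSError:
            # socket.gaierror (DNS lookup failure) is a subclass of OSError;
            # move on to the next endpoint instead of giving up.
            continue
    return None
```

Calling first_resolvable(ENDPOINTS) returns the first name that still resolves, so if the DNS for example.com is down, the client ends up on an example.net name for the same server. This only covers the DNS lookup; a real client would still need to handle a server that resolves but does not answer.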

However big Amazon or GoDaddy (or any infrastructure provider) get, and however many data centres they have around the world, they are still open to attacks, politics and human error inside their organization. We're service providers building on top of their infrastructure. It is our job to accept their limitations and do something about them.



1 comment:

darren said...

GoDaddy sent an apology, which I thought was good. PR trick? Of course, but it is amazing how many companies don't apologize (a sign of trouble at the top, and/or of lawyers instilling debilitating fear into management).

What particularly impressed me was that they didn't blame the outage on attackers, or events beyond their control; instead they took responsibility: "The service outage was due to a series of internal network events that corrupted router data tables. Once the issues were identified, we took corrective actions..."

Again, I take that as a sign of quality at the top.