On the 3rd of October, after some quite high load, my website crashed and went offline. Given I’d just gone to Sweden, this was a bit awkward.
Looking at the stats, on the 2nd of October it was getting on average 5 page loads *per second*, except the traffic wasn’t evenly spread: it peaked much higher, with 9% of the whole day’s traffic arriving between 7pm and 8pm.
At the time, the site was running on a simple Debian BigV VM with 1 GB of RAM, SATA disk storage, 2 cores and 1 GB of swap. I thought it might be interesting to look at the architecture I was using, which let a relatively low-specification machine handle 25,000 page views in 24 hours with 9,000 unique visitors.
When a request is sent for a page on the website, the first thing it hits is nginx. If it’s a static asset (an uploaded image like the one above, or part of the theme etc.) it is served directly from disk by nginx. If it’s anything else, it’s sent to Varnish, which has a cache of pages it’s previously loaded sitting in memory (malloc). If the page isn’t found in the Varnish cache, it’s passed back to Apache, and the WordPress/PHP/MySQL stack sorts it out and sends it back. The next time that page is asked for, Varnish will send the cached version.
I installed WordPress from the Debian package, for ease of upgrades, and there are two performance-related WordPress plugins installed – W3 Total Cache and WP Varnish. W3 Total Cache does a bit of caching into memcache, and a few other tweaks, but the majority of the load is handled in Varnish. Using Apache to do webserver-ing makes life simpler, because we don’t have to mess around rewriting .htaccess rules into a different syntax.
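To make that flow concrete, here is a minimal Python sketch of the decision each layer makes. It is an illustration only, not the actual nginx or Varnish configuration: the path prefixes, `make_handler`, `read_static` and `render_page` are all hypothetical names, and a plain dictionary stands in for Varnish’s in-memory cache.

```python
from typing import Callable, Dict, Tuple

# Assumed URL prefixes for static assets; the real nginx rules may differ.
STATIC_PREFIXES = ("/wp-content/uploads/", "/wp-content/themes/")


def make_handler(
    read_static: Callable[[str], str],
    render_page: Callable[[str], str],
) -> Callable[[str], Tuple[str, str]]:
    """Build a request handler that mimics the nginx -> Varnish -> Apache tiers."""
    page_cache: Dict[str, str] = {}  # stands in for Varnish's malloc storage

    def handle(path: str) -> Tuple[str, str]:
        # Tier 1: static assets are read straight from disk (nginx's job).
        if path.startswith(STATIC_PREFIXES):
            return ("nginx", read_static(path))
        # Tier 2: a page that was rendered before comes out of memory (Varnish).
        if path in page_cache:
            return ("varnish-hit", page_cache[path])
        # Tier 3: cache miss, so the expensive Apache/WordPress/MySQL stack runs
        # once and the result is kept for the next visitor.
        body = render_page(path)
        page_cache[path] = body
        return ("apache", body)

    return handle
```

The point of the layering is that `render_page`, the slow part, runs once per page rather than once per visitor.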
WP Varnish is basically the hook we need to flush the varnish cache whenever something changes – when someone comments, when a page is updated, when a new page is added – WP Varnish will issue varnish with a “flush” that will ensure that viewers see the most up to date page.
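The idea of flushing on change can be sketched in a few lines of Python. This is not how WP Varnish is implemented; `PageCache` and `on_content_changed` are hypothetical names, and the cache is an in-memory dictionary rather than a real Varnish instance.

```python
from typing import Callable, Dict


class PageCache:
    """Toy page cache: render once, serve from memory until purged."""

    def __init__(self, render: Callable[[str], str]) -> None:
        self._render = render
        self._pages: Dict[str, str] = {}

    def get(self, path: str) -> str:
        # Only call the (slow) renderer when the page is not cached yet.
        if path not in self._pages:
            self._pages[path] = self._render(path)
        return self._pages[path]

    def purge(self, path: str) -> None:
        # Forget the cached copy; the next get() re-renders the page.
        self._pages.pop(path, None)


def on_content_changed(cache: PageCache, path: str) -> None:
    """Hypothetical hook, called after a comment is posted or a page is edited."""
    cache.purge(path)
```

Without the purge, visitors would keep seeing the old copy for as long as it stayed in the cache, which is exactly the problem the plugin exists to solve.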
The nice thing about this is that users can look at pages without ever touching Apache or the database once, with the page being dumped out over the network port from RAM, and the images etc, simply being read off the disk – resulting in fast server response times, and great scaling.
When, on the third, Varnish crashed, I hadn’t ever envisaged that amount of traffic – in fact I’d deliberately made the server relatively small/underspecified to see how it would perform under pressure.
Arguably, it’s overcomplicated (W3 Total Cache + Memcache seem like recipes for confusion and pain) and I could just use nginx, php-fpm and varnish, but this setup gives me the flexibility and known quantity that is Apache, whilst still letting me scale my site to reddit-proof-like proportions.
I think the real lesson I learnt is that all the services should be running under runit or monit, so that if a service stops responding, it’ll be automatically restarted.
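For anyone unfamiliar with what a supervisor does, the core of it is a loop like the Python sketch below. This is not how runit or monit work internally, and `supervise` and its parameters are hypothetical. It also only notices a process that has exited; catching one that is still running but has stopped responding needs a health check on top, which is the sort of thing monit can be configured to do.

```python
import subprocess
import sys
import time
from typing import List


def supervise(command: List[str], max_restarts: int = 3,
              delay_seconds: float = 1.0) -> int:
    """Run `command`, restarting it whenever it exits.

    Gives up after `max_restarts` restarts and returns how many times
    the command was launched in total.
    """
    launches = 0
    while launches <= max_restarts:
        process = subprocess.Popen(command)
        launches += 1
        process.wait()  # blocks until the service stops for any reason
        if launches <= max_restarts:
            time.sleep(delay_seconds)  # brief pause to avoid a tight crash loop
    return launches


if __name__ == "__main__":
    # Stand-in "service" that exits immediately, so every run ends in a restart.
    print(supervise([sys.executable, "-c", "pass"], max_restarts=2, delay_seconds=0.1))
```

A real supervisor would keep going indefinitely and log each restart; the restart limit here just keeps the example finite.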
I recently came across this photo – some of the things that I took on holiday with me back in 2006:
The book is an interesting one: Kevin Mitnick – a notorious former black/greyhat computer hacker/cracker – talks to former associates about other alleged hits.
Obviously, in the same way as watching Frank Abagnale’s Catch Me If You Can doesn’t mean you support the passing of fraudulent checks or posing as airline pilots, clearly I also don’t endorse any of the things described in The Art of Intrusion – but the really valuable thing about the book is that it allows you to get inside the minds of ‘the bad guys’, and to see and understand how and why they do things.
The prequel to The Art of Intrusion is slightly different. The Art of Deception is the story of Kevin Mitnick’s own run from the FBI – Mitnick famously evaded the FBI for 2 and a half years before his arrest, during which time he managed to gain unauthorised access to the voicemail of the FBI officer who’d been assigned to his case (allowing him to evade capture for some time longer).
A few weekends ago, I was at Blue Light Camp – billed as “the first truly interdisciplinary emergency services unconference in the UK”. As the name implies, there were many people from a variety of different emergency services backgrounds, and so when I saw a talk titled The Art of Deception, I vaguely remembered the book and wandered along. Kate Norman of an NHS trust (or known better to me as a friend of Ian Forrester) had recently read the book and was interested in people’s opinions. No-one else had read the book, but the discussion that followed was quite insightful.
I hadn’t gone along to talk internet security. In theory, yes, I’ve been in ‘Cyber Security’ competitions, but largely my aim in attending this event was to listen, learn and meet some passionate and enthusiastic “blue lights”. The discussion was interesting because we really covered a lot of ground: privacy online, uses of social media and websites being taken down/defaced.
The question was: “What can one do about one’s website being defaced/hacked/DDOS’d/etc?”
I think really the answer is quite simple: “You can apologise and do your best to bring things back to normal as fast as you can with the resources you have available”.
Ultimately, whatever you do, you can never be fully confident your website is secure. In the same way, however good a driver you are, even if you’ve done advanced driving courses, someone can still drive into your rear end at traffic lights or cut you up on a motorway and cause a collision. Even if you take all the possible precautions, there’s still some risk involved.
In terms of compromise of websites; even if your penetration testers haven’t found any serious flaws in your CMS (hint: if this happens, hire someone else), even if your base operating system is all patched and up to date, it’s not unlikely that tomorrow, someone will discover a vulnerability that affects one of them, and that your regime of patching doesn’t happen that quickly because you value stability.
It’s a very thin line to tread, and ultimately, it’s wisest to recognise that you’re going to do your best, but at some point in the next 10 years, you’ll need to apologise to your users. Being good at apologising to your users is not a skill to be sniffed at. If you can do it well, and explain what happened in terms the users and your management understand, then so much the better. There are worse things you could do than looking into the best ways to apologise to your users – to me this seems like a good use of training time.
During the session at Blue Light Camp I brought up this XKCD cartoon:
The amusing thing about me reading The Art of Intrusion was that it was 2006. 6 years ago. I was a teenager. I was still at school, and that must have been a library book (I’ve never owned a copy of it). It was just one of the security orientated books I read at the time (along with Bruce Schneier’s “Secrets and Lies: Digital Security in a Networked World”).
The types of attack and the types of thinking described in the books are alive and well today. There isn’t a problem with legislation – illegal acts are quite clearly illegal – and there have been many years in which to learn how best to respond to security issues.
What scared me though is how far we’ve come in terms of the pervasiveness of technology since 2006 (back then government websites were mainly brochures, I hadn’t joined Facebook yet, Twitter really didn’t exist), and yet the basic premises of responsible and realistic net security are still not well known.
How can we fix this? How can one explain net security to the masses? As in ‘nothing is ever truly safe’ not ‘you need a password with lower and upper case and numbers’? As in ‘we fucked up, we’re really sorry, have some cake’.
I don’t know the answer, but I think it’s probably not going to be by prepending everything with “cyber” and trying to scare the shit out of everyone.
At Blue Light Camp I described Kevin Mitnick as “a bad person”.
I was asked: “Well, did anyone die because of him?”
I responded in the negative.
“Well on the scale of people we deal with, he’s not a very bad person then!”
Rick Falkvinge’s website Falkvinge.net recently frontpaged reddit.
Actually, in the default setup, he had three articles (#16, #22, #24), which, as he says, is a record for him.
Why is this a big deal? Well, with reddit being Alexa-ranked 133 and getting about 8.7 million visitors every day, being on the front page 3 times at once means you’re going to get a lot of traffic in a relatively short space of time. Think of 3 phone numbers being read out concurrently on 3 TV stations that all point to the same call centre – that’s falkvinge.net.
This is pretty much a nightmare scenario from a systems administration point of view, as you have to prepare for lots of traffic in a short window of time. In addition, with social media, you don’t have the foggiest clue how popular something is going to be. Something posted to reddit is much more likely to generate little or only a smallish amount of traffic than it is to cripple your webserver, so you actually need to be constantly prepared for a spike that usually never comes.
Stats for the 24 hours when I had 3 articles on Reddit’s front: 421 gigs of data served, 21.7M HTTP requests, peak 630 reqs/sec
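As a rough back-of-the-envelope check on those figures, the Python sketch below turns the three numbers in the tweet into averages. It assumes “gigs” means 10^9 bytes and that the totals cover exactly 24 hours; `traffic_summary` is just an illustrative helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def traffic_summary(total_bytes: float, total_requests: float, peak_rps: float) -> dict:
    """Derive average load figures from one day's totals."""
    average_rps = total_requests / SECONDS_PER_DAY
    return {
        "average_rps": average_rps,
        "peak_to_average": peak_rps / average_rps,
        "bytes_per_request": total_bytes / total_requests,
        "average_mbit_per_s": total_bytes * 8 / SECONDS_PER_DAY / 1e6,
    }


# Figures from the tweet: 421 GB served, 21.7M requests, peak of 630 requests/sec.
summary = traffic_summary(421e9, 21.7e6, 630)
for name, value in summary.items():
    print(f"{name}: {value:,.1f}")
```

Under those assumptions this works out to roughly 250 requests per second averaged over the day, a peak about 2.5 times that, around 19 kB per request and about 39 Mbit/s of sustained bandwidth.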
Rick has a somewhat customised WordPress setup with the W3 Total Cache plugin on the latest version of Ubuntu, almost certainly with Apache from what I can tell. It’s anyone’s guess what hardware it’s running on (UPDATE: this is the hardware he’s running). Fairly standard as far as I can see – it’s mainly static content and not outrageously interactive or personalised. There are some images, but they don’t form the main part of the site.
Again, I could not have survived that traffic peak without @CloudFlare (see previous tweet)
Rick’s solution to the problem is to use the “cloud” Infrastructure-As-A-Service provider Cloudflare, which is essentially a caching reverse proxy/CDN combined with a distributed DNS service. What this means in practice is that its users are able to use Cloudflare to handle these unexpected large peaks in traffic without changing their infrastructure.
Using a blackbox called Cloudflare to scale one’s website is all very well, but doesn’t suit everyone and presents an interesting sysadmin challenge:
How would you build a setup for a simple-ish WordPress instance, like Rick’s, to cope with the levels of traffic he mentions?
As a geek, one generally gets good at fixing things.
An interesting thing about technology, as opposed to say, carpentry, is that generally it’s very very small things that have significant implications. Frequently you spend a lot more time looking for the problem than you do actually implementing the solution.
- The symptoms: your website is taking a long time to load
- Diagnosis: check reproducibility, check server load, check for user error, check server error logs, see strange message in them and google.
- The problem: there’s a memory limit in the webserver program that’s set too low
- The solution: double a number in a config file and restart the webserver program
- The fix: do the solution (takes less than a minute)
One of the downsides of this is that it’s really difficult to predict how long it’s going to take you to fix something. If fixing the problem is quick, yet correctly diagnosing the problem is much more time consuming, things can be frustrating for end users who ask the perfectly reasonable question:
When will it be fixed?
which as you can see doesn’t really have an easy answer – by the time you’re completely sure you’ve correctly diagnosed the problem, you’ve probably already fixed it.
Someone on reddit very eloquently summed up how you should explain the situation next time:
“Imagine you had lost your keys. You have no idea where they are. Now, tell me, when will you have found them?”
Inspired by a post on /r/sysadmin
Ok, I hear you: I work with computers and the Internet – you understand that.
But what exactly does a “Systems Administrator” at a “hosting company” do? Indeed… what’s a “hosting company”? What do you mean by “Systems Administrator”?
These are frequent questions I encounter when explaining to friends and family how I’m gainfully employed. Actually, that’s a lie. I wish they were frequent questions.
More realistic conversations go like this:
> So Tim, what is it that you do?
> I work with computers.
> Oh, I’m sorry. I guess someone has to.
> So Tim, what is it that you do?
> I work in IT.
> Ahh, perhaps you can help me: I’ve been having this problem where my work emails don’t seem to be displayed how they used to be, can you help?
> So Tim, what is it that you do?
> I help people over the phone fix problems on their webserver computers.
> You’re one of those fucking patronising arseholes I have to speak to on the phone?! Why don’t you ever listen? Why does nothing ever work? I hate you and your whole life!
As you can tell, people have many misconceptions about my job. Largely they’re just ignorant about how technology works, and have memories of computers and business IT systems breaking down.
But frequently, the problem isn’t preconceptions, but the absence of a frame of reference. To many, websites are things that ‘occur’ on your screen when you ‘have the internet.’ Explaining that there is a computer behind them usually conjures ideas of a desktop computer somewhere (complete with mouse, speakers and screen) somehow ‘broadcasting a website’. Until this morning, I think this was a problem for my parents.
My parents are intellectually curious: they’re interested in knowing things simply because knowing them is interesting. They’re happy debating about politics, linguistics, science, history, botany and pretty much everything else. They’re (apparently!) very proud of their son, but they had limited comprehension of anything beyond ‘he works with computers, Linux and the internet.’ Clearly, I needed to straighten a few things out.
So I arranged to take a day’s leave and spend it in the office with my parents, explaining to them what I do, giving them a datacentre tour and filling in their knowledge so that they would feel they fully understand what I do, what the company does, and why people pay us money.
That day was today. I went through and explained:
- Vaguely ‘how websites work’
  - What important terminology means (and a test!)
  - What a client-server relationship refers to
  - What happens when a server is overloaded
  - A simplified explanation of the process from the web browser to database server and back
  - A simple explanation of how you can split services out onto different physical servers to scale a website up
  - The difference between dynamic and static websites
  - What clustering means and a few advantages of it
- Where a hosting company comes in
  - Websites need servers
  - Servers need network, power and a stable environment.
  - We have all those things
  - We rent servers to people
  - People put their websites on our servers
  - We help people if they have issues
- What a datacentre provides servers
  - Stable environment (cooling, fire suppression, physical security, etc.)
- Subtleties of different types of hosting
  - Dedicated servers (traditionally more powerful)
  - Virtual servers (easy scaling up and down)
  - “Cloud servers”
- Why the company is great
  - Because I work there!
  - Honesty and professional integrity
  - Friendly, knowledgeable colleagues
  - Technical “no nonsense” approach
- How things work behind the scenes
  - How we can use text chat to communicate
  - How we interact with customers (CRM and phone)
  - How we know if things break (monitoring systems)
  - How I can (in theory at least) work from home/anywhere in the world
- Where I come in
  - Pretty face and general awesomeness
  - Amazing sense of modesty
  - Helping people with technical issues
  - Giving extra help and support to those who want to pay for it (managed hosting)
  - Attempting to explain things to parents
I’ve spent most of the day showing them what I mean by ‘hosting’, answering questions, giving them a datacentre tour, and basically explaining simply what we do.
And that took a whole day, and I’m still fearful I’ll get a call along the lines of
“Tim, what is a server?”
So I’m still at a loss as to how to explain what I do to friends and family. My current favourite explanation is courtesy of Rich Quick, but frankly, I’m not really sure I’ll ever be able to explain my work easily to nontechnical people:
> So Tim, what is it that you do?
> You know in Formula One, you have engineers in the pits who change the car’s tires, put fuel in it, fix things, give the driver advice and generally make the car go faster?
> Yeah, of course.
> Well, I do that, but for websites!