Jamis Buck came on with 37signals as a full-time employee on Wednesday 3rd. On Thuesday the 4th at 7 AM (my time), we embarked on what would become a forty-four hour struggle to keep Basecamp alive during the worst server ills in the history of the application.
It all started out with plans for a tech upgrade of Basecamp. We had been running a development version for a long time on Rails 0.10 and FastCGI — it was time to take it live. See, until this Wednesday, we had been running Basecamp on mod_ruby for production since the launch. And the production site was still running 0.9.5 of Rails.
Considering how major the upgrade was going to be, we decided to bite it off in chunks:
- Get the new version of Basecamp running while keeping Apache/mod_ruby
- Move to Apache/FastCGI
- Move to lighttpd/FastCGI
We had also gone through a lot of trouble to verify that everything would work. Even though we had been running Apache/FastCGI in development for a long time, we got mod_ruby up again and verified that it would all work. We spent a whole day combing Basecamp for issues related to the Rails 0.10 upgrade. In other words, we felt prepared.
Converting hundreds and hundreds of megabytes to UTF-8
It didn't start out too well. As part of the tech upgrade, we had decided to do a long overdue conversion of the Basecamp content to UTF-, but our estimations on how long it would take was way off. Instead of taking less than 90 minutes (which was the window we had reserved as scheduled downtime), it took more like 4 hours! Iconv sure did decided to take its sweet time chewing through the many hundreds of megabyte of content.
So stress was already in the picture by then. But it would get worse. A lot worse.
Another part of the upgrade involved moving from Ruby 1.8.1 to 1.8.2. That in turn made it a good idea to recompile mod_ruby and get on the latest version (we were still running 1.0.7 with 1.2.2 as the latest).
That's where it started going from bad to worse. Or rather, that's where we started to discover what a mess we were really in. As another part of the upgrade also involved pulling out a SSL offloading card (that wasn't helping anything) and going back to the kernel from before the SSL upgrade, it could just as well (and probably more so) have been related to that. Or all of the above.
Chewing off larger chunks
In any case, once we had all that in place and where getting ready to test, mod_ruby wouldn't start. Or it would start some times, then die. Other times not at all. It all seemed terribly periodic. As one who've had problems with mod_ruby in the past, I foolishly ascribed it to either mod_ruby acting up or our inability to install it correctly.
But there was no time for assigning blame. We were already plenty behind schedule, so we decided to move to step 2: Apache/FastCGI. That didn't work very well either. Partly because of some issues with mod_fastcgi that would throw odd "ioctl device" errors when we started it up (which later revealed itself as being because mod_fastcgi needed to be both user and group owner of the IPC directory). Partly because I was blind to the httpd.conf reading
<IfModule fastcgi.c and not
<IfModule mod_fastcgi.c. We obviously weren't well-prepared for a premature move up the chunky road map.
And then it appeared. Our what-we-thought-to-be savior! As mod_ruby and mod_fastcgi on Apache were both proving unable to get us in the air, giving step 3 a shoot suddenly looked like a "why not" (despite being about a week before we thought we had to).
So we did. Aside from some fumbling with the SSL certificates (lighttpd needs them in the combined pem format where Apache has crt and key separate), it took almost no time to get lighttpd ready for launch. We certainly hadn't tested things as much as we would have liked, but with the site already down, there was little to lose. And so it happened that Basecamp returned to the air for the first time after an extended downtime period running lighttpd with FastCGI.
It worked! (Or so we thought). The site was alive, fast and responsive, just in time for the US market to wake up. We congratulated ourselves on the courage to move up the road map at an accelerated pace, kept our eye on the application log (that was flying by as all the Europeans finally could get through to do their daily business), and marveled at the incredibly low memory consumption of the new setup (roughly 300MB vs 3GB!). Jamis had been up for a long time so we was sent off to bed.
The horror is unveiled
But our joy and self-congratulations were premature to say the least. Small clues soon emerged that all was not right. The SSL part of the site was unusual slow. lighttpd would some times jump really high in resource consumption for no apparent reason.
Unknowingly, we had ventured all the way into the belly of the beast. And then I turned on the light to reveal its horror:
tail -f log/lighttpd.error.log.
Oh. My. God. It was a regular death zone. The server and FCGIs fighting for their life at an exhausting rate and with depressing low success. The lighttpd error log was sizzling hot. Tens if not a hundred entries per minute. Socket is not connected, error-handler not found, Software caused connection abort, emergency exit: fastcgi, fcgi-server disabled.
Thus ensued 6 or 7 hours spent doing not much else than keeping the site reasonably alive. Restarting lighttpd every 15-20 minutes, wasting stuck FCGIs, and feeding Jan Kneschke (the author of lighttpd) a never ending stream of error logs and ktraces. We managed to beat back a bunch of the more harmless errors through a good handful of patches being applied on the fly. But besides trimming the log file down to only contain the most nasty stuff, we weren't significantly more successful in stopping the blood-sheet.
As the daily rush on Basecamp withered down, so did the temperature in the boiler room. lighttpd was still crashing once an hour, the FCGIs were still being killed of regularly, but at least it didn't need constant typing and all of your attention across 6-7 terminals to keep the engine together.
Working on FastCGI
The quieter time gave us a chance to investigate some of the FastCGI killings and thanks to a patch by Aredridel (I'm ever grateful!), we gained access to the original exceptions. That proved most helpful. We cleared out a few bugs on that account, but were also left baffled by Ruby crash bugs from marshal and IO.
By then, the heaviest load on Basecamp had been relieved. lighttpd wasn't crashing any more, but the FCGIs were still dying regularly. lighttpd did a good job decommissioning the fallen and replacing with new troops, though.
So with a total of around ten patches applied to lighttpd, ruby-fastcgi, and Rails, we felt that maybe, just maybe, we had averted the crisis. And with 20 straight hours in my chair, it was just in time. My legs were numb. My knees hurt even when I went to bed. It had not been a healthy day at work.
We continued to think that the crisis was if not avoided then contained for another 14 hours as lighttpd stayed alive through the night.
And then the second Basecamp rush hour wave hit. BAM! We were back in the death zone. Not nearly as bad as the day before, but lighttpd was still caving in about once per hour. The error log was again filling up with critical stuff and we were back to tearing our hair out.
During all the time with lighttpd, we had desperately trying to get Apache back on its feet using both mod_ruby and FastCGI. mod_ruby was dead in the water and when ever we tried giving Apache a shot with FastCGI, it blew up four times as fast as lighttpd.
Maybe something is wrong
This had now been going on for close to forty hours. Everything we tried crashed and burned. But then, in the brief pockets of serenity, we started to wonder. Why are we having this many problems with everything we try? Why didn't any of this show up on the staging server? Or when we ran our (albeit limited) load tests? The more we thought, the greater our suspicion grew. Maybe we weren't all to blame. Maybe the guilty man was Mies — our server.
With a strong suspicion in hand, we expropriated Ta-da List from our other web server and got Basecamp setup on it. Tests looked promising. We flipped the switch on one of the production domain (Basecamp has clients spread across 5 domains). Errors in the log? None. Resource expenditure by lighttpd? Minimal.
Fuck, fuck, fuck, fuck, fuck, fuck!
The jury was back. We had just wasted forty hours of our lives growing disillusioned by our apparent total lack of skill in keeping Basecamp in a meaningful state of production. My knees were x'ed for nothing. I had forced our new employee into sleep deprivation and all-day shifts for nothing.
WE HAD GAWD DAMN A BROKEN SERVER!
In a state of equal shock and joy, we moved all of the production domains over to the new server. Everything Just Worked. No FGCI killings, no explosions in resource expenditure, no freezes.
We're still not sure what caused Mies to go insane. It was a trusty server for little under a year. But now it is most certainly not. And we've taken the consequence and sentenced Mies to the penalty of death. That's right, the server is dead man walking. If I had my desire fulfilled, it would be ripped out of the server cabinet and burned in the backyard. May you rest in peace, you crazy mofo!
So Murphy tells us that everything that can go wrong, will go wrong. And we tend to use that to blame ourselves for lack of preparation or missing the willingness to act. And most of the time we should. But some times, just some times, the cause is not your shoddy software. Or your half-assed testing procedures. It's the misplaced belief that of course it's not the OS/Server/Whatever holy entity causing you grief.
It's what we think we know that isn't so.
Basecamp is now running lighttpd with FastCGI. And we're loving it. If there's anything to blame lighttpd for, it's fighting too good of a battle against stacked odds. If we had just been confined to Apache, we would probably have suspected Mies sooner as nothing would have worked.