Ruby on Rails
Ta-da List


March 05, 15:12

Forty-four grueling hours (or Welcome to 37s!)

Jamis Buck came on with 37signals as a full-time employee on Wednesday the 3rd. On Thursday the 4th at 7 AM (my time), we embarked on what would become a forty-four hour struggle to keep Basecamp alive through the worst server ills in the history of the application.

It all started out with plans for a tech upgrade of Basecamp. We had been running a development version for a long time on Rails 0.10 and FastCGI — it was time to take it live. See, until this Wednesday, we had been running Basecamp on mod_ruby in production since the launch. And the production site was still running Rails 0.9.5.

Considering how major the upgrade was going to be, we decided to bite it off in chunks:

  1. Get the new version of Basecamp running while keeping Apache/mod_ruby
  2. Move to Apache/FastCGI
  3. Move to lighttpd/FastCGI

We had also gone through a lot of trouble to verify that everything would work. Even though we had been running Apache/FastCGI in development for a long time, we got mod_ruby up again and verified that it would all work. We spent a whole day combing Basecamp for issues related to the Rails 0.10 upgrade. In other words, we felt prepared.

Converting hundreds and hundreds of megabytes to UTF-8
It didn't start out too well. As part of the tech upgrade, we had decided to do a long overdue conversion of the Basecamp content to UTF-8, but our estimate of how long it would take was way off. Instead of taking less than 90 minutes (which was the window we had reserved as scheduled downtime), it took more like 4 hours! Iconv sure did decide to take its sweet time chewing through the many hundreds of megabytes of content.
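Mechanically, each individual conversion is a one-liner; the pain is doing it to hundreds of megabytes of live data. A minimal sketch of a single-string conversion — the post used Ruby 1.8's Iconv, while `String#encode` is the modern equivalent (the sample string here is made up):

```ruby
# Convert a Latin-1 (ISO-8859-1) string to UTF-8.
# Ruby 1.8 did this with Iconv.conv('UTF-8', 'ISO-8859-1', str);
# on modern Rubys, String#encode does the same job.
latin1 = "caf\xE9".force_encoding("ISO-8859-1")  # Latin-1 bytes for "café"
utf8   = latin1.encode("UTF-8")                  # re-encode into UTF-8
puts utf8.bytes.inspect                          # the é becomes two bytes (0xC3 0xA9)
```

Run over every text column of every table, with the site up and users writing, that one-liner turns into the multi-hour grind described above.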

So stress was already in the picture by then. But it would get worse. A lot worse.

Another part of the upgrade involved moving from Ruby 1.8.1 to 1.8.2. That in turn made it a good idea to recompile mod_ruby and get on the latest version (we were still running 1.0.7, with 1.2.2 being the latest).

That's where it started going from bad to worse. Or rather, that's where we started to discover what a mess we were really in. Since another part of the upgrade involved pulling out an SSL offloading card (which wasn't helping anything) and going back to the kernel from before the SSL upgrade, it could just as well (and probably more likely) have been related to that. Or all of the above.

Chewing off larger chunks
In any case, once we had all that in place and were getting ready to test, mod_ruby wouldn't start. Or it would start sometimes, then die. Other times not at all. It all seemed terribly erratic. As someone who has had problems with mod_ruby in the past, I foolishly ascribed it to either mod_ruby acting up or our inability to install it correctly.

But there was no time for assigning blame. We were already plenty behind schedule, so we decided to move to step 2: Apache/FastCGI. That didn't work very well either. Partly because of some issues with mod_fastcgi that would throw odd "ioctl device" errors when we started it up (which later turned out to be because mod_fastcgi needed to be both user and group owner of the IPC directory). Partly because I was blind to the httpd.conf reading <IfModule fastcgi.c> and not <IfModule mod_fastcgi.c>. We obviously weren't well prepared for a premature move up the chunky road map.
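To illustrate the IfModule mix-up: a block guarded by the wrong module name is silently skipped rather than flagged as an error, which is exactly what makes it so easy to be blind to. The directive names below are real for Apache 1.3-era mod_fastcgi; the paths are made up:

```apache
# Silently ignored -- no module registers under the name "fastcgi.c",
# so Apache skips the whole block without complaint:
<IfModule fastcgi.c>
    AddHandler fastcgi-script .fcgi
</IfModule>

# mod_fastcgi registers under its source file name, so this is the
# form that actually takes effect:
<IfModule mod_fastcgi.c>
    FastCgiIpcDir /var/run/fastcgi   # must be owned (user AND group) by the server
    AddHandler fastcgi-script .fcgi
</IfModule>
```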

And then it appeared. Our what-we-thought-to-be savior! With mod_ruby and mod_fastcgi on Apache both proving unable to get us in the air, giving step 3 a shot suddenly looked like a "why not" (despite it coming about a week before we thought we had to make that move).

So we did. Aside from some fumbling with the SSL certificates (lighttpd needs them in a combined pem format, whereas Apache keeps the crt and key separate), it took almost no time to get lighttpd ready for launch. We certainly hadn't tested things as much as we would have liked, but with the site already down, there was little to lose. And so it happened that Basecamp returned to the air after an extended downtime period, running lighttpd with FastCGI for the first time.
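For the record, the certificate juggling is a one-liner; the filenames here are illustrative:

```shell
# lighttpd wants the certificate and private key in one combined PEM file,
# where Apache keeps them as separate crt and key files:
cat server.crt server.key > server.pem
chmod 600 server.pem   # the combined file now contains the private key
# lighttpd.conf then points at it with: ssl.pemfile = "/path/to/server.pem"
```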

It worked! (Or so we thought). The site was alive, fast and responsive, just in time for the US market to wake up. We congratulated ourselves on the courage to move up the road map at an accelerated pace, kept our eyes on the application log (which was flying by as all the Europeans could finally get through to do their daily business), and marveled at the incredibly low memory consumption of the new setup (roughly 300MB vs 3GB!). Jamis had been up for a long time, so he was sent off to bed.

The horror is unveiled
But our joy and self-congratulation were premature, to say the least. Small clues soon emerged that all was not right. The SSL part of the site was unusually slow. lighttpd would sometimes jump really high in resource consumption for no apparent reason.

Unknowingly, we had ventured all the way into the belly of the beast. And then I turned on the light to reveal its horror: tail -f log/lighttpd.error.log.

Oh. My. God. It was a regular death zone. The server and FCGIs fighting for their lives at an exhausting rate and with depressingly low success. The lighttpd error log was sizzling hot. Tens, if not a hundred, entries per minute: Socket is not connected, error-handler not found, Software caused connection abort, emergency exit: fastcgi, fcgi-server disabled.

Thus ensued 6 or 7 hours spent doing little else than keeping the site reasonably alive: restarting lighttpd every 15-20 minutes, wasting stuck FCGIs, and feeding Jan Kneschke (the author of lighttpd) a never-ending stream of error logs and ktraces. We managed to beat back a bunch of the more harmless errors through a good handful of patches applied on the fly. But besides trimming the log file down to only the nastiest stuff, we weren't significantly more successful in stopping the bloodshed.

As the daily rush on Basecamp withered, so did the temperature in the boiler room. lighttpd was still crashing once an hour, the FCGIs were still being killed off regularly, but at least it no longer took constant typing and all of our attention across 6-7 terminals to keep the engine together.

Working on FastCGI
The quieter time gave us a chance to investigate some of the FastCGI killings, and thanks to a patch by Aredridel (I'm ever grateful!), we gained access to the original exceptions. That proved most helpful. We cleared out a few bugs on that account, but were also left baffled by Ruby crash bugs from marshal and IO.

By then, the heaviest load on Basecamp had been relieved. lighttpd wasn't crashing any more, but the FCGIs were still dying regularly. lighttpd did a good job decommissioning the fallen and replacing them with new troops, though.

So with a total of around ten patches applied to lighttpd, ruby-fastcgi, and Rails, we felt that maybe, just maybe, we had averted the crisis. And with 20 straight hours in my chair, it was just in time. My legs were numb. My knees hurt even when I went to bed. It had not been a healthy day at work.

We continued to think that the crisis was, if not averted, then contained for another 14 hours, as lighttpd stayed alive through the night.

And then the second Basecamp rush hour wave hit. BAM! We were back in the death zone. Not nearly as bad as the day before, but lighttpd was still caving in about once per hour. The error log was again filling up with critical stuff and we were back to tearing our hair out.

During all this time with lighttpd, we had been desperately trying to get Apache back on its feet using both mod_ruby and FastCGI. mod_ruby was dead in the water, and whenever we tried giving Apache a shot with FastCGI, it blew up four times as fast as lighttpd.

Maybe something is wrong
This had now been going on for close to forty hours. Everything we tried crashed and burned. But then, in the brief pockets of serenity, we started to wonder. Why are we having this many problems with everything we try? Why didn't any of this show up on the staging server? Or when we ran our (albeit limited) load tests? The more we thought, the greater our suspicion grew. Maybe we weren't entirely to blame. Maybe the guilty man was Mies — our server.

With a strong suspicion in hand, we expropriated Ta-da List from our other web server and got Basecamp set up on it. Tests looked promising. We flipped the switch on one of the production domains (Basecamp has clients spread across 5 domains). Errors in the log? None. Resource expenditure by lighttpd? Minimal.

Fuck, fuck, fuck, fuck, fuck, fuck!

The jury was back. We had just wasted forty hours of our lives growing disillusioned by our apparent total lack of skill in keeping Basecamp in a meaningful state of production. My knees were x'ed for nothing. I had forced our new employee into sleep deprivation and all-day shifts for nothing.


In a state of equal shock and joy, we moved all of the production domains over to the new server. Everything Just Worked. No FCGI killings, no explosions in resource expenditure, no freezes.

We're still not sure what caused Mies to go insane. It was a trusty server for a little under a year. But now it most certainly is not. And we've taken the consequence and sentenced Mies to death. That's right, the server is a dead man walking. If I had my way, it would be ripped out of the server cabinet and burned in the backyard. May you rest in peace, you crazy mofo!

The bottom line
So Murphy tells us that everything that can go wrong will go wrong. And we tend to use that to blame ourselves for lack of preparation or lack of willingness to act. And most of the time we should. But sometimes, just sometimes, the cause is not your shoddy software. Or your half-assed testing procedures. It's the misplaced belief that of course it's not the OS/server/whatever holy entity causing you grief.

It's what we think we know that isn't so.

Basecamp is now running lighttpd with FastCGI. And we're loving it. If there's anything to blame lighttpd for, it's fighting too good a battle against stacked odds. If we had been confined to just Apache, we would probably have suspected Mies sooner, as nothing would have worked.

Challenge by Goynang on March 05, 17:42

Why did you let yourselves get to a place where you couldn't just switch back to the previous working system? Why upgrade in place and give yourselves no way back if/when it goes pear-shaped?

Or am I being too simplistic?

Challenge by David Heinemeier Hansson on March 05, 18:03

The problem was that we couldn't go back. We actually did try backing off to the old mod_ruby, but to no avail. What we should have done was to not even contemplate upgrading the existing server, but instead build an entirely new setup on a new server, so that we could just have flipped the switch back and forth between the two machines.

We didn't account for the possibility that the existing Apache/mod_ruby combination would be rendered unusable, since we had tested that combination on a staging server. But of course the staging server wasn't an exact replica of the production server. Especially not in terms of the kernel/hardware instabilities.

Challenge by Goynang on March 05, 18:32

" instead built an entirely new setup on a new server, such that we could just have flipped the switch back and forth between the two machines."

That's what I meant. That kind of setup is always safer, as you know that however bad things get, you can always just switch back.

Doing even the smallest tweak to a live server scares me silly.

Challenge by John Speno on March 05, 21:39

What hardware and OS was Mies? Does it not have a hardware log of some kind like a kernel error file? If so, what errors if any were reported there?

I'm with Goynang. When I had to do this stuff, we always built a new machine and switched to it instead of messing with the production system we were replacing. I'm sure you've learned that one all too well now.

Best of luck!

Challenge by Chriztian Steinmeier on March 05, 22:07

David - seriously... you should consider writing for the big screen - that was a great read!! :-)

Challenge by Cody on March 05, 22:37

That was a great read, very suspenseful! I'm glad it all worked out. :-)

Challenge by Stian Grytøyr on March 05, 23:09

Interesting read. Brings back a couple of painful memories.

Incidentally, you might want to consider dumping Mies into one of these:

Should be quite satisfying :)

Challenge by Jonathan Nolen on March 06, 0:55

Hi, David. This is Jonathan Nolen -- we had lunch together at Building Basecamp San Francisco in the fall.

I'm sorry to hear about your horrendous experience last week. It brings back memories of my last startup experience, where we were running everything on one server and deploying straight from CVS. Needless to say -- a sub-optimal solution.

After having many problems with our production deployment, we decided to come up with a plan to prevent downtime in the future. It has worked pretty well. Maybe you can get some useful ideas from our solution.

Here's how it works:

We maintain three running copies of our application at all times. In your case, since you're changing the environment as well as your code, you'll probably want to have three independent servers. Each server is directly accessible by IP, but only one is accessible by hostname.

We can change which server is the LIVE server (accessible by hostname) instantly, at any time. We use a Cisco Content Switch for this, but you can probably do it with just about any load balancer by taking servers in and out of the round-robin.

The servers start out identical, with (for example) Basecamp version 1.4 installed and running. Server A is the LIVE server in this case. Now it's time for an upgrade. You upgrade Server B and Server C to Basecamp 1.5. You switch the load balancer to make Server B the LIVE server and take Server A offline. Server A (still running 1.4) is now the ROLLBACK server. Server C (running 1.5) is now the EMERGENCY server.

So now you have options. If you have inadvertently released broken code, you can simply switch the load-balancer back to your ROLLBACK server and you are back to normal.* Then you can fix your problem without being in firedrill mode. If you encounter a problem that is not related to code (like OS problems or hardware failures) then you flip to the EMERGENCY server (server C) and continue with the most current version of the code while you work to fix Server B, which is now designated the EMERGENCY server (once you fix it, anyway).

When you are ready to upgrade to Basecamp 1.6, you upgrade Server A and Server C (the ROLLBACK and the EMERGENCY) and then change the load-balancer. Again, you're covered for most contingencies.

So you have to buy more hardware to make this work. But hardware is relatively cheap these days. And if it can save your customers from downtime and your staff (with only two of you) from punishing all-nighters, it's _so_ worth it. Hope this helps. If you have questions, feel free to contact me.

* This assumes that you haven't put your DB (which I assume is running elsewhere) into an incompatible state. We work really hard to avoid those kinds of changes.
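The rotation described above can be sketched as a toy model in Ruby (the ServerPool class and server names are illustrative, not a real deployment tool):

```ruby
# Toy model of the LIVE / ROLLBACK / EMERGENCY rotation.
class ServerPool
  attr_reader :roles, :versions

  def initialize(live:, rollback:, emergency:, version:)
    @roles    = { live: live, rollback: rollback, emergency: emergency }
    @versions = { live => version, rollback => version, emergency => version }
  end

  # Upgrade the two offline boxes, then promote the ROLLBACK box to LIVE.
  # The old LIVE box keeps the previous version and becomes the new ROLLBACK.
  def upgrade!(new_version)
    @versions[@roles[:rollback]]  = new_version
    @versions[@roles[:emergency]] = new_version
    @roles = { live:      @roles[:rollback],
               rollback:  @roles[:live],
               emergency: @roles[:emergency] }
  end
end

pool = ServerPool.new(live: "A", rollback: "B", emergency: "C", version: "1.4")
pool.upgrade!("1.5")
puts pool.roles.inspect   # B is now LIVE on 1.5; A is the 1.4 ROLLBACK
```

After each upgrade there is always one box a version behind (the rollback path) and one spare on the current version (the emergency path), which is the whole point of the scheme.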

Challenge by Aredridel on March 06, 2:55

Hey — glad I could help with what little I did. I feel your pain all too acutely. My server of several years started succumbing to bizarre crashes this week, too, just ending as your woes started. Sixty or so hours of extra work in a week sucks so unbelievably much.

That said, I've moved to Lighttpd fronting my servers as well, with Apache doing some of the lifting, as well as a myriad of proxied-together WEBrick apps. Load is lighter. Response time is better, and with lighttpd's rather good error handling for downed processes, I think we'll see a lot fewer glitches as time goes on.

Challenge by Bill Gates on March 06, 3:31

Fuck, fuck, fuck, fuck, fuck, fuck!

Ah! The joys of working on open source software!

Challenge by Eric Hodel on March 06, 8:57

I always turn off that IfModule stuff because I damn well don't want the server starting up if it's missing those modules.

Challenge by Derek on March 06, 12:17

Here's an "interesting article" about how Google manages their search operation.

Challenge by Luke on March 07, 4:32

David, I love your work, but I think the above expressions of support here are letting you off lightly. As a customer, this stuff boggles my mind.

There's just one live server? It's not really the same as your staging server? And you upgraded it directly? Without a fallback position? And when things went very wrong you persisted? Ouch ouch ouch ouch ouch.

It's not that I don't sympathise - I've been in these situations before myself.

But building solid web applications is your THING. And that's not just about development. You need to take some of your developer brilliance & insight, and invest a tiny bit into the operations side of what you do too.

The kind of physical security redundancy talked about at doesn't help anyone if your operations are shot. You need a safe development > staging > production > rollback regime.

Some suggestions:

- When upgrading, always have a failsafe way to fall back. A switch you can flip in an instant.
- That usually means you can't upgrade a live server directly. Run up a new system in parallel, then switch across.
- You need something like what Jonathan Nolen said above, 3 identical servers: production/rollback/emergency, or at least production/rollback. That's NOT including your dev & staging machines.
- When something goes wrong, USE that fallback position. Don't forge ahead through a hundred more disasters, save yourself some grey hairs! Work out what's wrong with the upgrade process you tested. Or how your staging environment didn't really match your production one.
- I know EXACTLY what you meant about suspecting the hardware/OS last. You assume it's your own fault and start tweaking the upgrade process. But if you had an identical staging & live environment, you'd KNOW that your upgrade process is OK, so it must be the hardware.

No doubt after a gruelling weekend like this you'll come up with your own commonsense operational safeguards as a result.

Anyhow, it's just a learning experience, I hope your heart has stopped racing from it all. I still love your work. BaseCamp is still beautiful. And I want to learn to program, purely on the basis that RoR looks like so much fun.

Challenge by Ryan Christensen on March 07, 7:41

If anything, though, you have to give them credit for going into this much detail about the outage.

I'm sure they realize(d) how some people will react to the details.. how some will see flaws in their previous methods of operation revealed through this. Such is the life of a company that values open communication.

Challenge by Jason Watkins on March 07, 22:10

I'm curious how folks stage database upgrades. Obviously upgrading the front end is relatively straightforward provided you have 3 servers and a floating IP or a Cisco redirection box.

But how do you handle upgrades that require a schema change? Adding a column seems simple. Dropping a column as well. But what about changing the type or constraint of a column?
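One common answer (not given in the thread) is the expand/contract pattern: make the type change in stages so old and new application code can run side by side. The table and column names below are hypothetical:

```sql
-- Hypothetical change of users.phone from INTEGER to VARCHAR, in stages:

-- 1. Expand: add the new column alongside the old one.
ALTER TABLE users ADD COLUMN phone_v2 VARCHAR(32);

-- 2. Backfill, while the application writes to both columns.
UPDATE users SET phone_v2 = CAST(phone AS CHAR(32));

-- 3. Contract: once every server reads only phone_v2, drop the old column.
ALTER TABLE users DROP COLUMN phone;
```

Each stage is individually rollback-safe, which is what makes it compatible with the LIVE/ROLLBACK/EMERGENCY server rotation.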

Challenge by Luke on March 07, 23:28

For sure, the transparency is a good thing. Many eyes make all bugs shallow...

Challenge by George Moschovitis on March 09, 10:48

I know the feeling, happened to me in the past. Forgive me the irony, but I found your post most entertaining (and informative).

Challenge by Jake on March 11, 19:52

Oh man that was great! You are one STINKIN GOOD WRITER!!! The only thing I would change is marking out the cussin' =) (but that's just me).

I agree that you should have 2-3+ servers, but I also know that sometimes it's not an option for whatever reason. Glad to know you are dedicated enough to work that hard on it though! Keep it up, you guys are stinkin awesome!! (Alright, go take a shower...)

Challenge by Mike Woodhouse on March 13, 13:32

I once spent a miserable Friday night (all night!) trying to get a database server to stop misbehaving. We swapped out all kinds of hardware bits and pieces, reinstalled any number of software components and after 10 hours of hair-tearing, finally changed the network cable. That fixed it. I now try (but often fail) to remember to target the easy, cheap and innocuous elements first - even when I have a "good" idea about the probable cause of a problem...

Challenge by Robert Pierce on March 20, 0:23

As a customer with multiple Basecamp sites I was thoroughly intrigued (and scared) by reading your informative account. I'm just glad I didn't know about all this while it was happening--it was during that exact timeframe when one of the Basecamp sites was being utilized to coordinate an extremely important business deal ($40+ million) with principals, lawyers, cpas, etc relying on the system to pull together a very time compressed deal. I shudder to think of how much $#@! I would've had to endure if the site had come down all the way. Please, please tell me you guys'll be more circumspect before trying naked live upgrades again. My blood pressure... :)

Challenge by raphinou on April 06, 17:25

That's a reason I like working in a chrooted system: you make a backup and can go back when you want. With debian, it's just a question of using debootstrap.

For the next time maybe? :-)

