While there wasn't much hesitation before sharing our troubles with the latest upgrade, I did pause for just a second. Will people think less of us because we didn't do this or that? How will the story reflect in the light of calm analysis and perfect hindsight? Luke offered just that view in the comments:
...building solid web applications is your THING. And that's not just about development. You need to take some of your developer brilliance & insight, and invest a tiny bit into the operations side of what you do too.
Which is a fair stance. But it's of course also what makes a lot of companies terrified of sharing stories of trouble. And what gave me pause. In the end, though, I believe that the net benefit of transparency far outshines the small bruises you'll inevitably suffer in return.
As an example, we're currently conducting an open-ended questionnaire with our customers asking what they like, don't like, and would love to see changed. I took special comfort in this comment on what to like about Basecamp:
The clean interface, ease of use, basic tasks are easy to execute. All features aside, I greatly appreciate the transparency of the operation. Not only has David released Rails to the world, but you guys (or David) had the balls to discuss the hardships felt during your recent upgrades. Most companies would leave it at "we apologize for the failures"... but letting/having a partner discuss the issues in detail is something you just don't get every day. It's extremely refreshing, and something more companies should emulate.
I believe this is part of the competitive advantage that small companies can enjoy if they dare. The freedom to say that you made mistakes. How you made them and why. The freedom as a partner of the company to say fuck, if that's how you feel.
P.S.: It definitely seems like the transparency is infectious. Jamis just shared another story on what can happen when you go to the keyboard ill. Hopefully we'll soon return to the stories about how we're conquering the world instead of how we're failing to ;)
I, for one, appreciate those kinds of stories. To begin with, it makes you and 37signals look like human beings, as opposed to perfect entities without the human signature trait of flaws. I don't know about you, but I prefer humans to robots (no offence, robots).
Other than that, stories such as this feel real. I mean, preaching about patterns is educational, well and good, and all. But it's not real. Please, do write real stories from time to time (but don't preach any less).
Sharing a story about a mistake you've made doesn't make you less competent; after all, we all learn the most from our mistakes. If you didn't make any, you wouldn't be getting any (or much) better.
I do agree with your stance on providing details about the failure even though it could have hurt you.
These types of failures are not that uncommon even for companies that have millions invested in application development and IT infrastructure. We have many environments: development, integration, staging, and production. Our staging and production environment hardware costs about $250k each, or a little over $500k together. This means we have a production mirror that we can test our rollouts on that is exactly the same as production (firewalls, load balancers, SAN, you name it).
1. Invest in a staging environment (not your dev/integration env) and have a "Deployment Procedure/Plan" that details exactly what needs to take place in this rollout. You can have something like a bullet-point list of all the tasks that need to be completed. Have an area for signoff.
2. Create an "Installation Qualification" document. This details all the steps to verify your install went correctly: something like testing the database connection, testing an example customer, and then running through a basic smoke test.
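The "Installation Qualification" idea above can be sketched as a tiny checklist runner. This is a minimal illustration, not anyone's actual tooling: the check names and the always-true blocks are placeholders for real verifications (a database ping, loading a known customer record, fetching the front page), and signoff only comes back true when every check passes.

```ruby
# A minimal sketch of an "Installation Qualification" runner: each check is a
# named block that returns true/false, and the rollout only gets signoff when
# every check passes. The checks below are stand-ins for the real things.
class InstallationQualification
  def initialize
    @checks = {}
  end

  def check(name, &block)
    @checks[name] = block
  end

  def run
    results = {}
    @checks.each do |name, block|
      results[name] = begin
        !!block.call
      rescue StandardError
        false # a check that raises counts as a failure, not a crash
      end
    end
    results
  end

  def signoff?
    run.values.all?
  end
end

iq = InstallationQualification.new
iq.check("database connection") { true } # stand-in for a real connectivity test
iq.check("example customer")    { true } # stand-in for loading a known record
iq.check("basic smoke test")    { true } # stand-in for fetching the front page
```

The point is simply that the document's bullet list becomes executable: a failed or raising check blocks signoff instead of being forgotten in the heat of a rollout.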
Having this staging environment and the policy framework that guides the rollout has let us avert several major issues, including session consistency across multiple web nodes and some load balancer problems. It might not be feasible for a smaller company to have something like this, but it sure helped us.
After all our investment in planning, testing, procedures, hardware, and great staff, we still run into issues.
I, too, think you did a great job posting the details of the problem. While there may be a bit of grumbling that it happened, you've shown a great deal of resiliency, tenacity, and honesty -- all of which should help customers and future customers feel even better about you.
Please also post about the improvements you're making going forward as a result of this problem. Again, I think it will be a win for you in the long-term.
(The ideas about enterprise websites above are all good stuff. It's what I do for a living, and it's important for any business that wants to stay afloat on the Internet.)
I think it was a good idea to tell all. You have built a community of users and keeping quiet just puts up a wall between them. Some people will criticize you for it, but the majority will be grateful.
Even M$ and Apple and others have had similar problems. Have you ever tried getting to Apple.com during MacWorld? How many times have you heard of some security problem in M$ software…
No matter what your planning is, Murphy will find a way to screw everything up. We programmers are used to making things work, and when something goes wrong we blame our logic. The truth is we are human; we make mistakes no matter how smart we are, and when a company admits that, I think it is pretty cool.
I happened to be experimenting with Basecamp when all this was going on, and my initial conclusion was, "wow, Rails must be a lot more unstable than I thought." So it is mutually useful to know that wasn't the source of the problem.
I think having a duplicate environment, even a minimal one, in which you can practice your rollouts will be extremely helpful and can avert some major issues. You want to separate this from your development/integration server, as that box typically has extra development libraries, insecure settings, and experimental changes to deal with.
But you need to have a documented rollout framework (again, in a small company it might only be documented in people's heads). If you have a document that outlines a detailed, specific plan for how the application is to be installed, you are less likely to forget something and run into an issue.
Obviously mistakes happen, or something goes wrong in the hardware or software environment. You should always have a rollback plan that will allow you to get back to the state you were in before the rollout took place. We also have an image server; basically, we take a snapshot just before a rollout so that we can re-image any server from bare metal within about 5 minutes of a fatal error. FYI, we have used this several times, and it is well worth it if you run into something major and have to retreat!
An image server is also good if you have a fatal hardware error or something and you need to bring a machine back. This could be rollout-independent, and is in our case.
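At the application level, the simplest version of the rollback plan above is the releases-plus-symlink layout common in Rails deployments: every rollout lands in its own timestamped directory, a "current" symlink points at the live one, and rolling back is just repointing the symlink. This sketch is illustrative only; the `ReleaseSwitcher` class and its paths are made up, it operates on a throwaway temp directory, and a real setup would also restart the app servers after the switch.

```ruby
require "fileutils"
require "tmpdir"

# A minimal sketch of symlink-based rollback: releases/<timestamp> directories
# hold each rollout, and "current" is a symlink to whichever one is live.
class ReleaseSwitcher
  def initialize(root)
    @releases = File.join(root, "releases")
    @current  = File.join(root, "current")
    FileUtils.mkdir_p(@releases)
  end

  # Create a new release directory and point "current" at it.
  def deploy(name)
    dir = File.join(@releases, name)
    FileUtils.mkdir_p(dir)
    FileUtils.rm_f(@current)
    File.symlink(dir, @current)
  end

  # Repoint "current" at the previous release (by sorted timestamp name).
  def rollback
    previous = Dir.children(@releases).sort[-2] or raise "nothing to roll back to"
    FileUtils.rm_f(@current)
    File.symlink(File.join(@releases, previous), @current)
  end

  def live_release
    File.basename(File.readlink(@current))
  end
end

site = ReleaseSwitcher.new(Dir.mktmpdir)
site.deploy("20050308")
site.deploy("20050309")
site.rollback # the 20050309 rollout went bad; retreat to 20050308
```

Because the switch is a single symlink swap, "get back to the state you were in before the rollout" takes seconds rather than a re-image, with the image server still there as the heavier fallback.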
So many procedures and things to think about. Maybe we should create an open-source IT infrastructure maintenance group where we can document some of this stuff for the general public.
I have never seen anything quite like this in the public domain before; what is out there is mostly very general.
Challenge by Goynang on March 09, 21:05
Sharing your troubles impresses me more than it worries me. I might get worried if *every* post was about errors and problems though. It's not like you have staff admitting to rm -f accidents or anything - ah, hang on. ;)
How about letting us know what steps you have taken to avoid the problems ever happening again?
Challenge by Luke on March 09, 23:18
Can't I have my cake and eat it too?
Please don't be insulted, I'm impressed that you told us. Most companies try to maintain plausible deniability over the actual cause of their problems. An organisation this transparent is incredibly refreshing. A blog that allows instant feedback from real customers is even better.
But you have to expect some constructive criticism when you have this transparency & these feedback mechanisms. After all, that's what they're for, right?
You won't just get people saying "I wish the yellow fade was a pale blue fade". You'll also get some saying "Ack, your upgrade procedure is totally broken! Anyhow, here are some suggestions on how to fix it..."
So I don't think LESS of you guys. Because I know the broken operation process is temporary, it will be fixed. (Thanks, in part, to the helpful suggestions you've received in comments to these two posts?) But the transparency & responsiveness is permanent.
Ditto on the staging environment. It's absolutely foolhardy to deploy without one.
While reading the original article, I was cringing as you described all of these deployment issues that you were discovering in production. They all may have been related to the sick server, but when I read it, I was thinking, "You didn't test iconv with a snapshot of your production data before launch? You decided that production was the best place to test the various combinations of apache/lighttpd/fastcgi?"
What would you have done if everything had gone wrong? When would you have made the decision to roll back, or would you have been unable to?
In all of our deployments, we have various go/no-go points. We partition off a production server group, stage into it (so app servers, load balancers, web servers, database servers, search servers, etc. are all exactly the same as the rest of the farm), run smoke tests, and decide whether to push the release out to the rest of the cluster. If we decide to go further, we take the site dark if needed, push the app out to the rest of the farm, test again while monitoring logs and alerts, and if it passes, then we go live. At any of the go/no-go points, we have a full plan in place for rolling back to the prior state.
We're dealing with a lot more servers and tiers (and so we need this heavyweight process), but the end result is the same: try everything before you do it with your bare tush exposed to the world.
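The go/no-go flow described above can be boiled down to a small gating sketch. To be clear, this is my own illustration, not the commenter's actual tooling: `staged_rollout` and its lambda parameters are hypothetical names, and the deploy, smoke-test, and rollback steps are placeholders for the real partition-stage-test-push machinery.

```ruby
# A minimal sketch of go/no-go gating: deploy to a partitioned canary group
# first, smoke-test it, and only push to the full farm if that gate passes.
# Any failed gate triggers the prepared rollback instead of going live.
def staged_rollout(canary:, full_farm:, smoke_test:, rollback:)
  canary.call
  return rollback.call("canary smoke test failed") unless smoke_test.call(:canary)

  full_farm.call
  return rollback.call("farm smoke test failed") unless smoke_test.call(:farm)

  :live
end

log = []
result = staged_rollout(
  canary:     -> { log << "deploy to canary group" },
  full_farm:  -> { log << "deploy to rest of farm" },
  smoke_test: ->(stage) { true },                      # pretend both gates pass
  rollback:   ->(why)   { log << "rollback: #{why}"; :rolled_back }
)
```

The design point is that every gate has exactly two exits, "go" or "execute the rollback plan," so there is never a moment where you discover mid-rollout that you have no way back.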
So now the question is, does 37s publish *more* discussing the behind-the-scenes operations, what they'll do next time, their planned infrastructure changes that are in progress, and/or reply to the strong comments made in criticism? Should they be expected to, or will this rolling snowball get the best of them?
At what point is a company being too transparent (if there even is such a thing)? Or, even better -- at what point does/should a company define a ceiling to their "transparency" for the sake of [insert-reason-here]?
Challenge by Jason Fried on March 10, 7:52
We're a private company so we'll share what we feel comfortable sharing. Right now we feel comfortable sharing a lot and plan to continue down that track, but the real answer is "it depends." David has posted a couple messages, we've posted a couple at SvN, but certainly we've been up to more than just those few things.
And of course there are certainly things about our operation that are private and will remain private. We don't make sales numbers public, for example.
re: your PS. It's definitely easier to write about what you screwed up on. #1, writing about it allows you to think about what went wrong outside of the fog of battle; #2, it allows you to break it down logically; and #3, while writing about it you can think about how you might have done it better, so as to avoid the missteps.
Bombs are much more fun to read about anyway - it helps me figure out what I'd have done.
Who wants to read about how good I am with regular expressions or httpd.conf anyway?
Keep the comm lines open. We all learn from it.
Challenge by David Morton on April 10, 6:29
By all means, be honest. An open, honest explanation is the best way to calm irate customers. Once, as a sysadmin for an ISP, I accidentally nuked our entire company website, and the backups were mysteriously missing. I managed to piece it back together from cached files on various computers around the office.
As a self-imposed punishment, I answered a lot of the phone calls myself, and when people complained, I simply said, "I'm sorry, that was my fault." In every case, it just blew them away and took the argument right out of their mouths. None of them were upset after I talked to them, and most were chuckling about "those complex computers" after that.
BTW, the accident was a very weird code malfunction that took me several days to track down. It was basically a rm -rf * that started in the wrong location due to a missing LDAP record. I added a validity check on that parameter after that! :)
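That kind of validity check is worth spelling out, since a destructive command that trusts an empty lookup result is exactly how an `rm -rf *` starts in the wrong place. This is a generic sketch, not the commenter's actual fix: the `SAFE_ROOT` path and `wipe_site!` helper are invented for illustration, standing in for whatever value the missing LDAP record should have supplied.

```ruby
require "fileutils"

# A minimal sketch of a guard around a destructive operation: refuse to wipe
# anything unless the looked-up path is non-empty and lives under the expected
# root. SAFE_ROOT and wipe_site! are made-up names for illustration.
SAFE_ROOT = "/var/www/sites"

def wipe_site!(path)
  # Guard 1: a missing directory-service record must not silently become "".
  if path.nil? || path.strip.empty?
    raise ArgumentError, "empty path from directory lookup"
  end

  # Guard 2: only paths under the expected root may ever be removed.
  expanded = File.expand_path(path)
  unless expanded.start_with?(SAFE_ROOT + "/")
    raise ArgumentError, "refusing to remove #{expanded}: outside #{SAFE_ROOT}"
  end

  FileUtils.rm_rf(expanded)
end
```

With both guards in place, a missing record raises a loud error instead of quietly expanding the delete to whatever the current working directory happens to be.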