I won't grow up,
I don't want to go to school.
Just to learn to be a parrot,
And recite a silly rule.
If growing up means
It would be beneath my dignity to climb a tree,
I'll never grow up, never grow up, never grow up
Not me!
- Peter Pan
I love what Facebook calls the Hacker Way, the idea that anyone can make a difference by doing more and by talking less, by moving quickly, and by constantly improving. I love that GameChanger has embraced this ethos. But some of our recent hack attempts have lead to embarrassing and unacceptable site crashes. And with those screwups came pressure to "grow up" and shelve the Hacker Way.
Three months ago, we decided to fix our problems without "growing up." Though we can't tolerate embarrassing bugs and crashes, we need to eradicate them in a way that doesn't destroy the hacker spirit that fuels the company. Instead, our answer to these problems is to relentlessly driving down the risk of hackery.
I didn't always take that approach. At a past company, we — and I was a primary culprit — established a culture of extreme caution. Though we described ourselves as "agile", we introduced a functional spec process, and then a design spec process and then a test spec process. We mandated signoffs by an ever-growing list of stakeholders. We took responsibilities away from individual developers and delegated them to specialists. In short, we systematically eliminated any possibility of screwing up by adding processes, reviews, and checklists. To be fair, it worked at preventing catastrophes. But, in the process, we de-empowered ourselves. New employees never felt as if they could change things. We sucked the fun out of what we were doing. And we slowed down.
Preserving the Hacker Way is as important as preventing screwups. My colleagues at GameChanger are building a foundation of technology and process that lets us have fun and make customers happy. I should note that I can't take credit. Lots of people are working hard to make our product safe for customers and hackers alike. Here's what we're doing.
Measure everything
We realized that we didn't know enough about how our system was behaving. We installed graphite and statsd and built APIs that made it easy for everyone to measure the behavior of the systems they were building. The graphs were a quick prod to action and improvement. Were we really making that many database calls? Were our queues really blowing up dramatically under load? The graphs surfaced bugs, which we fixed. And we proudly demo'ed "before" and "after" graphs to the whole company.
Monitor aggressively
Having data handy was only useful if we acted on it. Automated alerts were our prod to action. If a bad deploy causes errors to spike, we find out right away. If a queue suddenly explodes in size, we get notified. We constantly tune our thresholds so that they are just right — not so frequent to spam us with false positives, not so laid back that our customers suffer needlessly.
Automate relentlessly
Testing GameChanger is hard. It's a distributed, asynchronous system comprising a web site, and API, dozens of queue processors, an iOS app, and an Android app. All of these parts interact and they change frequently.
To assure that these parts work together well, we've been relentlessly automating our testing. We've set up a Jenkins instance that runs
- pylint checks and PyVows tests against our server-side code
- Javascript tests in a headless browser
- mobile app tests againsts headless simulators
- Selenium WebDriver "smoke tests" against multiple browsers
- continuous deployment of test versions of our mobile app
In the mean time, our mobile team is working on automating crash reporting and response with Crashlytics.
Eliminate noise
Noise is the enemy of automation. People ignore automated alerts and tests when there are false positives and duplicate notifications. We hate noise. We keep our email inboxes clear of Jenkins notifications by routing them to HipChat. We're moving our alerts to nagios, which allows us to acknowledge known problems and clear our inboxes of noise.
Reduce batch size
Big changes are inherently riskier than small ones. But keeping changes small requires conscious effort. Our mobile team is building tighter, more focused releases. Our server teams rely on a feature flag framework that lets us ship and test incomplete features without exposing them to customers.
Eat our own dog food
We routinely release risky changes to staff before enabling them for everyone. A couple of weeks back, we noticed that a staff-only jQuery upgrade broke our core checkout flow. We fixed the problem before releasing that upgrade to all users.
Review prudently
Every code change goes through mandatory pre-commit code review. Because we keep our batches small, code reviews only take a few minutes and require few changes. Since I'm a new employee, code reviews have often corrected my misunderstandings about our system. Reviews have also prompted us to come to quick agreement on design conventions and coding standards. It's easy to get carried away with code reviews, but I've been impressed with our commitment to keeping reviews quick and informative.
Fix stuff that doesn't work
When a part of our system constantly causes us problems, we fix it. We recently launched a rewritten version of our streaming server — and crushed a bunch of database load in the process. We also fixed post-deploy error spikes caused by inadequate load balancing.
Learn from every mistake
Despite our efforts, we still screw up sometimes. A common conclusion from our post-mortems is that we need to simplify development. Recently, for instance, we broke some page layouts in IE when accidentally deleted an HTML doctype from a base template. In our post-mortem, we realized that we were making a repetitive change to our thirteen base templates, and that we messed up one of the thirteen edits. The mistake slipped past the original committer and the code reviewer (me). We realized that having so many base templates was an invitation for error and we're now working on consolidating those thirteen templates into one.
We're also reducing library version management by using a standardized requirements.txt file. We're simplifying web UI development with a style guide.
Over time, we're getting smarter and our system is getting simpler. That's the point.
So, is it working?
We only started these initiatives three months ago and the results have been incredible:
- Web error spikes are half as frequent as they used to be, and that doesn't include improvements we made before we started measuring.
- Our support team used to tell us when our messaging and streaming queues weren't working. Now we tell them — and we fix them fast.
- More times than we can count, a failing smoke test or pylint check has preventing us from a disastrous deploy.
We all recognize that there's tons more to do. Our system can become a lot simpler and more reliable. We need to close some monitoring loopholes. We still to root out some false positives. We are honing our skills at writing automated tests. But the bottom line is that we're becoming more reliable, and we're keeping the Hacker Way alive and well. We haven't burdened ourselves with tons of rules and processes. We're continuing to experiment with new libraries, to roll out new tools, to invent quirky traditions, and to have fun.
We haven't grown up and we don't plan on it.