Sunday
Apr 14, 2013

What the heck should I be doing right now?

I forbade myself from coding this weekend.

I desperately want to code. That's why I've forbidden myself. I've spent the last two weeks writing a sweet set of iOS build scripts. At first, the entire iOS build process made no sense and I was hacking blindly. Then I started to figure things out. And then the vision of the Perfect Build started to burn into my mind. I could think about nothing else. And I willingly flung myself at it. Each commit brought me closer, but my progress enlightened me to how much better things could be.  I left work on Friday night at 10 pm happy, but exhausted. On my Saturday train ride to Philadelphia, I mentally obsessed about how to eliminate three more configuration settings and how to blow away a hard-coded value.

This feeling of exhausted, obsessed happiness is what first got me excited about coding. I remember experiencing it for the first time in 1992, when my friend Jed and I were on the phone late every night trying to figure out our AP Pascal assignments. I didn't sleep much. I didn't care. Nothing felt as great or as fulfilling as a problem's transformation from impossible to solvable to perfectable.

I think people without this obsessive streak have limited potential as programmers. Sooner or later, you hit the bug or the limitation that requires you to understand your code really deeply. At GameChanger, for instance, our DB developers have become MongoDB whisperers. To build a scalable system on top of MongoDB, they've had to understand how it works really well, and they've had to become absurdly familiar with its logging and its quirks. I hit this point with CSS a few years ago and with Javascript. And now I'm hitting it with Apple's serpentine build tools and process.

There's a paradox, though. You need to obsess about code to be a good engineer. To obsess about code, you need to willfully ignore the big picture sometimes. You can't get to an adequate depth of understanding if you always stop at the point where customers are happy and where the business makes money. Increasing your understanding enables you to do things that make customers even happier and that make the business even more successful. But losing sight of the big picture makes your work less valuable, and that makes you a bad engineer.

After working all Friday night on my build scripts, dreaming about them, and then mulling them on my train ride, I decided it was time to step away. My build scripts will benefit GameChanger, but were they really the best way for me to spend a working Saturday? Reluctantly, I decided that they were not. I probably should be spending my time regaining the perspective that I suspend when I write code. So I spent a few hours actually using our product at the Penn-Princeton baseball game. And I learned a few things while doing so.

I stuck to my pledge not to code (save for a short Facebook API emergency). It wasn't easy. I really wanted my coding fix. But my weekend of perspective was valuable. I should do this every weekend.

Sunday
Oct 14, 2012

Nobody codes alone

For the past few months at GameChanger, my team has had a "nobody codes alone" policy. Our three developers — Nick, Ben, and I — work together on the same features at the same time. It's an unusual practice. Conventional wisdom says that we should work on separate projects to minimize coordination overhead.

Conventional wisdom is wrong for two reasons. The first reason is pretty intuitive, but the second one is surprising.

The intuitive reason: working together simplifies code and reduces bugs. Any time wasted on coordination is gained back by time NOT spent on fixing bugs. I've lost count of the number of times that I've taken a complex, esoteric idea and simplified it after talking with someone else. Bugs reveal themselves more quickly in simple code. People working together are good at simplifying and distilling each other's ideas.

The surprise, though, is that co-development makes non-engineers more productive. Designers, testers, and product managers have to juggle every project that's in flight. When a bunch of projects are happening simultaneously, the burden falls on non-engineers who have to deal with interruptions and last-second requests. In most places I've worked, PMs, testers, and designers are overworked and stressed out. They crave the opportunity to do few things and to do them well. Co-developing features helps non-engineers become happier and more productive.

Not every team is capable of working together on everything. Ben, Nick and I can do it because we don't freak out at ambiguous situations. Our work collides frequently, and, when it does, someone invariably speaks up. When that happens, we stop what we're doing and we talk. Then we figure out a plan and get back to coding.

Monday
Oct 08, 2012

Reducing the TDD cycle

Rails and Rspec

If you've been practicing Test Driven Development, you are familiar with the normal cycle:

  1. Write a failing test
  2. Write enough code to pass the test
  3. Refactor while keeping the test in a passing state

Earlier this year I started developing in Rails with Rspec and found my cycle to be closer to the following:

  1. Write a failing test
  2. Wait for Rails environment to load and my test to run
  3. Write enough code to pass the test
  4. Wait for Rails environment to load and my test to run
  5. Refactor while keeping the test in a passing state
  6. Wait for Rails environment to load and my test to run

These wait stages will vary depending on your environment, but in my case it can take as long as 15 seconds to run a test whose own runtime is measured in milliseconds. After a while I found myself starting to skip steps, writing larger and larger tests, filling in more code at once, etc. My rationale was: if it's going to take so long to run one test, I might as well write a few at once. When I realized what I was doing, I thought there had to be a better way. After some brief searching, I found it!

Spork

Spork is a service that preloads your Rails environment, and then forks a copy of your server when you run tests. This reduces your TDD cycle back to the normal 3 step process, saving you valuable time.

Setting up Spork only takes a few minutes. For the latest instructions, see the spork-rails gem.

Installing

This assumes you have already installed the rspec-rails gem, and configured your spec_helper.rb file.

Add spork-rails to your Gemfile

group :test, :development do
  ...
  gem "spork-rails"
  ...
end

Configuring

After installing the gem, you need to configure Spork. You can bootstrap your test helper file by running:

spork rspec --bootstrap

When it completes, it will tell you to modify your spec_helper.rb file and follow the instructions within. The bootstrap command will have edited your spec_helper.rb file and added two new blocks at the top:

require 'spork'
#uncomment the following line to use spork with the debugger
#require 'spork/ext/ruby-debug'

Spork.prefork do
  # Loading more in this block will cause your tests to run faster. However,
  # if you change any configuration or code from libraries loaded here, you'll
  # need to restart spork for it take effect.

end

Spork.each_run do
  # This code will be run each time you run your specs.

end

# The previous contents of the file will be at the bottom
...

Generally, all you need to do is move everything that was in the file before, and is now below the Spork sections, into the Spork.prefork block. This block instructs Spork to perform this work only once, when it starts your initial Rails environment. Thus, when your environment is forked by Spork for a test run, all this overhead is avoided.

If you need to do some activity for each forked environment, place it in the Spork.each_run block.
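As a rough sketch, a spec_helper.rb might look something like this after the move. The contents of the prefork block are an assumption based on the boilerplate that rspec-rails typically generates; your own file will differ, and whatever was in it before is what belongs there.

```ruby
require 'spork'

Spork.prefork do
  # Assumed contents: typical rspec-rails boilerplate, loaded once when Spork starts.
  ENV["RAILS_ENV"] ||= 'test'
  require File.expand_path("../../config/environment", __FILE__)
  require 'rspec/rails'

  # Load shared helpers from spec/support, if your project has any.
  Dir[Rails.root.join("spec/support/**/*.rb")].each { |f| require f }

  RSpec.configure do |config|
    config.use_transactional_fixtures = true
  end
end

Spork.each_run do
  # Runs in every forked process, just before your specs.
  # Put per-run setup here (left empty in this sketch).
end
```

Anything inside Spork.prefork is only re-read when Spork restarts, so keep code that changes often out of that block.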

Usage

Now that Spork is configured, you can start it and see how much faster your test runs are.

To start Spork:

$ spork
Using RSpec, Rails
Preloading Rails environment
Loading Spork.prefork block...
Spork is ready and listening on 8989!

Spork is now up and running, and listening on port 8989.

To run your tests using Spork:

rspec --drb spec/

You will know it's working because:

  • your tests run immediately
  • in your spork terminal you see a message indicating it is running your tests

If you want RSpec to default to using Spork, you can edit your .rspec file and add the --drb option to it. This way when you run RSpec it will look for Spork and use it if available, otherwise it will load your Rails environment normally.

Rubymine Support

For those of you using RubyMine, you can also leverage Spork. They have a great help page that provides the instructions here: Using DRB Server

Caveats

If you are making changes that would normally require you to restart Rails, you will now need to remember to restart Spork instead. This can be automated using tools like Guard, which I'll cover in another post.

Saturday
Sep 15, 2012

How we fixed more bugs by deleting our bug DB

 

The entirety of our new bug database

Two weeks ago, I got frustrated with the hundreds of bugs and feature requests in our database. So I deleted the whole thing. Then an odd thing happened: we started fixing bugs. We fixed almost 30 bugs last week, easily our best bug fix rate since I've been at GameChanger.

Joel Spolsky inspired me with his post on Software Inventory. I followed his advice almost to the letter:

At some point you realize that you’ve put too much work into the bug database and not quite enough work into the product.

  • Suggestion: use a triage system to decide if a bug is even worth recording.
  • Do not allow more than two weeks (in fix time) of bugs to get into the bug database.
  • If you have more than that, stop and fix bugs until you feel like you’re fixing stupid bugs. Then close as “won’t fix” everything left in the bug database. Don’t worry, the severe bugs will come back.

In our release cycle, two weeks is an eternity. So we don't wait for two weeks of bugs to pile up. We wait until our bug column in Trello is roughly a screen-and-a-half tall.

Then we fix bugs! Our new bug column is too small to ignore. Some of the bugs that popped up were old bugs that customers had complained about for months. They had nowhere to hide in our tiny Trello column. So we fixed them.

The normal Huge Bug Database works in the opposite way. It requires a triage system (meeting), which requires agreement on a system of priority and severity (something to argue about). Then there needs to be some sort of scheme (meeting) for scheduling bug fixes along with feature work. And someone's got to make sure (emails) that the small percentage of Chosen Bugs are actually fixed before release goes out. That's a ton of meetings, arguments, emails, and management for little benefit.

Instead of documenting, categorizing, and scheduling bugs, we're fixing them and going back to writing features. Yay!

Sunday
Aug 05, 2012

Transforming a 200k-line pile of spaghetti

This week, an intriguing question made the rounds at StackExchange: "I've inherited 200k lines of spaghetti code — what now?" Since I apparently lack enough karma to post on StackExchange — I created my account today — I'll respond here.

At its core, this is a people problem and not a technology problem. I've failed when I haven't recognized this fact. The core question isn't one of SCM, build systems, or coding practices. The core problem is convincing a group of people (scientists, in this case) to adopt a new set of practices and to change the behaviors that led to spaghetti code. 

To make this transformation, some of the same people who are writing spaghetti code today will have to become evangelists for your ideas.

Here's what I'd do: 

  • Observe for a couple of weeks. Understand where the team is experiencing the most pain. What short term pressures does the team face? Learn what people are good at. Figure out who's most excited about making changes.
  • Form a rough vision of what success looks like. The leading answer at StackExchange is quite thorough and it's a great starting point. But remember that this vision will necessarily be unique to each organization.
  • Start by solving a problem that people already care about. Are people complaining about lost source code? Embarrassing bugs? Time spent on supporting legacy code? Too many feature requests and too little time? Bad user experience? Late releases? Whatever it is, a 200k-line code base is going to have lots of problems. The team is going to care about some of those problems more than others. With your first big initiative, earn people's gratitude! At a past job, the team became resentful when I tried to solve an important problem that the team didn't care about. I likely would have succeeded if I'd spent my first few months on problems that mattered to them.
  • Support other people's good ideas. You can't tame a 200k-line code base by yourself. If you need other people to come up with good ideas, you better support those people when their ideas come along. My colleague Andrew was incensed at our group chat software and drove our adoption of HipChat. A few weeks later, our entire development workflow is built on HipChat. Build notifications, deploy notifications, code review requests, and production alerts all route to HipChat rooms. At a company that hates email, nobody's ideas on continuous integration, monitoring, and alerting would have worked without a chat platform that the team loved.
  • Be optimistic. Every problem has a solution. Your 200k-line code base won't be beautiful overnight, but there are going to be some great victories along the way. Enjoy them.
  • Don't obsess over your failures. When you're dealing with a 200k-line code base, you're going to have to make a lot of changes. Not every change will work. Don't worry about it. I tried to turn every Monday into a bug-fixing day. It worked for a little bit but it proved hard to keep our attention on bugs when we were hustling to wrap up higher priority work on Mondays. I didn't force the issue. We'll find another way to prioritize bug fixes.
  • Sell, sell, sell. When you make something better, make sure that other people understand it and can adopt it themselves. Do demos. Hold training classes. Pair program. Do what you need to do to make sure that good ideas get critical mass.

Transforming a codebase is really about transforming a team. And transforming a team is about getting people excited to make big changes. Enjoy the challenge, and good luck!

Wednesday
Jul 04, 2012

Risk is invisible

Risk is invisible but reward is not. This basic fact has been at the root of more than a few calamities, not the least of which are the recent financial crisis and the dot-com boom and bust. A company's or a society's attitude towards risk is a core part of its culture. And we'd like to have a culture that encourages smart risks and discourages stupid ones. How do we do that?

Software engineers are responsible for avoiding stupid risks. But we often don't. Under pressure to meet a public deadline or to ship a highly visible feature, engineers routinely take reckless — but invisible — risks. We can introduce security loopholes, skimp on testing, or skip monitoring entirely. To make matters worse, customers — who can't see the risks — are usually thrilled by this behavior and software managers sometimes reward it. When you aren't attuned to the risks of software development, it's easy to mistake recklessness for "customer focus." How do we prevent this from happening?

It obviously helps when managers already understand risk. I was pleasantly surprised and impressed when our investor Jos White stopped by GameChanger and advised us to prioritize technical architecture. Jos is a wildly successful three-time entrepreneur. He founded three $100M+ companies, but he's never been an engineer. So when he came to speak with us, I was excited to hear his story. And I wasn't surprised to hear him talk about finding untapped markets and finding great teams who believed in their bones that they could bring their products to life. But I wasn't expecting a self-described marketing guy to talk about the importance of doing architectural work. Jos understood that good architecture was essential for scaling at low risk, even if that work doesn't pay off instantly. Jos has been around the block a few times, and so he seemingly "gets" the risk management balance that software companies need to master. Others learn to "get it" by working with technical people they trust.

As with many interesting problems, the root answer is cultural. Companies need to create a culture that values good risks and that discourages bad ones. In software, this comes down to valuing risk reduction and the people who practice it well:

  • Does your company have a technical career path that rewards people for paying attention to details?
  • Do engineers get public attention and praise for risk mitigation? Or, alternatively, do they get attention and praise for heroic responses to problems that they should have prevented?
  • Do engineers have prestige in your company's culture?
  • Do engineers get rewarded or chastised for speaking truth to power?

For our part, engineers need to use direct no-nonsense language when informing others about risk. At GameChanger, I was impressed when my colleague Doug made risk tangible by illustrating risks with graphs. A line that moved down and to the right was a good way to illustrate that we lowered our risk of crippling performance problems in the future.

It's hard to create a culture that takes good risks and avoids foolish ones. But it's an essential challenge for any company to grasp.

Sunday
May 27, 2012

I won't grow up

I won't grow up,
I don't want to go to school.
Just to learn to be a parrot,
And recite a silly rule.
If growing up means
It would be beneath my dignity to climb a tree,
I'll never grow up, never grow up, never grow up
Not me!

- Peter Pan

I love what Facebook calls the Hacker Way, the idea that anyone can make a difference by doing more and by talking less, by moving quickly, and by constantly improving. I love that GameChanger has embraced this ethos. But some of our recent hack attempts have led to embarrassing and unacceptable site crashes. And with those screwups came pressure to "grow up" and shelve the Hacker Way.

Three months ago, we decided to fix our problems without "growing up." Though we can't tolerate embarrassing bugs and crashes, we need to eradicate them in a way that doesn't destroy the hacker spirit that fuels the company. Instead, our answer to these problems is to relentlessly drive down the risk of hackery.

I didn't always take that approach. At a past company, we — and I was a primary culprit — established a culture of extreme caution. Though we described ourselves as "agile", we introduced a functional spec process, and then a design spec process and then a test spec process. We mandated signoffs by an ever-growing list of stakeholders. We took responsibilities away from individual developers and delegated them to specialists. In short, we systematically eliminated any possibility of screwing up by adding processes, reviews, and checklists. To be fair, it worked at preventing catastrophes. But, in the process, we de-empowered ourselves. New employees never felt as if they could change things. We sucked the fun out of what we were doing. And we slowed down.

Preserving the Hacker Way is as important as preventing screwups. My colleagues at GameChanger are building a foundation of technology and process that lets us have fun and make customers happy. I should note that I can't take credit. Lots of people are working hard to make our product safe for customers and hackers alike. Here's what we're doing.

Measure everything

We realized that we didn't know enough about how our system was behaving. We installed graphite and statsd and built APIs that made it easy for everyone to measure the behavior of the systems they were building. The graphs were a quick prod to action and improvement. Were we really making that many database calls? Were our queues really blowing up dramatically under load? The graphs surfaced bugs, which we fixed. And we proudly demo'ed "before" and "after" graphs to the whole company.
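To give a feel for how little code it takes to measure something, here is a minimal sketch of a StatsD client. It is written in Ruby only to match the other code on this blog, and it is not our actual API: the class name, method names, and metric names are all hypothetical. The only things taken from StatsD itself are its plain-text UDP format (`name:value|c` for counters, `name:value|ms` for timers) and its default port, 8125.

```ruby
require 'socket'

# Minimal StatsD client sketch. TinyStats and its method names are hypothetical.
class TinyStats
  def initialize(host = '127.0.0.1', port = 8125, socket = UDPSocket.new)
    @host = host
    @port = port
    @socket = socket
  end

  # Count an event, for example one database call.
  def increment(name, by = 1)
    send_metric("#{name}:#{by}|c")
  end

  # Record a duration in milliseconds.
  def timing(name, ms)
    send_metric("#{name}:#{ms}|ms")
  end

  # Time a block, report how long it took, and return the block's result.
  def time(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    timing(name, (elapsed * 1000).round)
    result
  end

  private

  # Fire-and-forget over UDP; returns the payload that was sent.
  def send_metric(payload)
    @socket.send(payload, 0, @host, @port)
    payload
  rescue SystemCallError
    # Metrics are best-effort and should never break the caller.
    nil
  end
end
```

With something like this in place, counting database calls is one line at the call site, for example `stats.increment('db.calls')`, and the graph shows up in Graphite without further work.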

Monitor aggressively

Having data handy was only useful if we acted on it. Automated alerts were our prod to action. If a bad deploy causes errors to spike, we find out right away. If a queue suddenly explodes in size, we get notified. We constantly tune our thresholds so that they are just right: not so sensitive that they spam us with false positives, not so lax that our customers suffer needlessly.
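One common way to trade off sensitivity against false positives is to alert only when a metric stays above its threshold for several samples in a row. The sketch below illustrates that idea; it is an assumption for the sake of example, not a description of our alerting rules, and the method name and default are hypothetical.

```ruby
# Returns true when the most recent `min_consecutive` samples all exceed
# `threshold`. A single spike does not trigger an alert; a sustained one does.
def alert?(samples, threshold, min_consecutive = 3)
  return false if samples.length < min_consecutive
  samples.last(min_consecutive).all? { |value| value > threshold }
end
```

Raising `min_consecutive` cuts down on noise at the cost of a slower alert; lowering it does the opposite. Tuning means finding the value where neither cost hurts.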

Automate relentlessly

Testing GameChanger is hard. It's a distributed, asynchronous system comprising a web site, an API, dozens of queue processors, an iOS app, and an Android app. All of these parts interact and they change frequently.

To ensure that these parts work together well, we've been relentlessly automating our testing. We've set up a Jenkins instance that runs:

  • pylint checks and PyVows tests against our server-side code
  • JavaScript tests in a headless browser
  • mobile app tests against headless simulators
  • Selenium WebDriver "smoke tests" against multiple browsers
  • continuous deployment of test versions of our mobile app

In the meantime, our mobile team is working on automating crash reporting and response with Crashlytics.

Eliminate noise

Noise is the enemy of automation. People ignore automated alerts and tests when there are false positives and duplicate notifications. We hate noise. We keep our email inboxes clear of Jenkins notifications by routing them to HipChat. We're moving our alerts to nagios, which allows us to acknowledge known problems and clear our inboxes of noise.

Reduce batch size 

Big changes are inherently riskier than small ones. But keeping changes small requires conscious effort. Our mobile team is building tighter, more focused releases. Our server teams rely on a feature flag framework that lets us ship and test incomplete features without exposing them to customers.

Eat our own dog food

We routinely release risky changes to staff before enabling them for everyone. A couple of weeks back, we noticed that a staff-only jQuery upgrade broke our core checkout flow. We fixed the problem before releasing that upgrade to all users.
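Feature flags and staff-only releases fit together naturally: a flag can be off, on for staff, or on for everyone. Here is a minimal sketch of that idea. It is not our framework; the class, the stage names, and the shape of the user record are all hypothetical, and it is written in Ruby only to match the other code on this blog.

```ruby
# Hypothetical feature-flag registry with three rollout stages.
class FeatureFlags
  STAGES = [:off, :staff, :everyone].freeze

  def initialize
    # Flags nobody has configured default to :off.
    @stages = Hash.new(:off)
  end

  # Move a flag to a rollout stage.
  def set(flag, stage)
    raise ArgumentError, "unknown stage: #{stage}" unless STAGES.include?(stage)
    @stages[flag] = stage
  end

  # `user` is assumed to be a hash with an optional :staff key.
  def enabled?(flag, user)
    case @stages[flag]
    when :everyone then true
    when :staff then user.fetch(:staff, false)
    else false
    end
  end
end
```

A risky change would ship behind a flag set to `:staff`, get exercised by employees for a while, and only then be switched to `:everyone`.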

Review prudently

Every code change goes through mandatory pre-commit code review. Because we keep our batches small, code reviews only take a few minutes and require few changes. Since I'm a new employee, code reviews have often corrected my misunderstandings about our system. Reviews have also prompted us to come to quick agreement on design conventions and coding standards. It's easy to get carried away with code reviews, but I've been impressed with our commitment to keeping reviews quick and informative.

Fix stuff that doesn't work

When a part of our system constantly causes us problems, we fix it. We recently launched a rewritten version of our streaming server — and crushed a bunch of database load in the process. We also fixed post-deploy error spikes caused by inadequate load balancing.

Learn from every mistake

Despite our efforts, we still screw up sometimes. A common conclusion from our post-mortems is that we need to simplify development. Recently, for instance, we broke some page layouts in IE when we accidentally deleted an HTML doctype from a base template. In our post-mortem, we realized that we were making a repetitive change to our thirteen base templates, and that we messed up one of the thirteen edits. The mistake slipped past the original committer and the code reviewer (me). We realized that having so many base templates was an invitation for error and we're now working on consolidating those thirteen templates into one.

We're also reducing library version management by using a standardized requirements.txt file. We're simplifying web UI development with a style guide.

Over time, we're getting smarter and our system is getting simpler. That's the point.

So, is it working?

We only started these initiatives three months ago and the results have been incredible:

  • Web error spikes are half as frequent as they used to be, and that doesn't include improvements we made before we started measuring.
  • Our support team used to tell us when our messaging and streaming queues weren't working. Now we tell them — and we fix them fast.
  • More times than we can count, a failing smoke test or pylint check has prevented us from making a disastrous deploy.

We all recognize that there's tons more to do. Our system can become a lot simpler and more reliable. We need to close some monitoring loopholes. We still need to root out some false positives. We are honing our skills at writing automated tests. But the bottom line is that we're becoming more reliable, and we're keeping the Hacker Way alive and well. We haven't burdened ourselves with tons of rules and processes. We're continuing to experiment with new libraries, to roll out new tools, to invent quirky traditions, and to have fun.

We haven't grown up and we don't plan on it.

Sunday
Apr 15, 2012

Being Careful Is Not a Solution

When someone makes a big mistake, it's tempting to tell him or her to be careful. The developer who deployed buggy code should have been more careful. The support rep who messed up a customer's data should have followed the procedure more carefully. Nobody should have deployed anything during a high traffic time. When problems happen, people get angry and ask, "Why wasn't someone more careful?" And most of the time, some contrite person steps up and says, "I screwed up and should have been more careful."

Being careful is good, but being too careful is bad. An excess of carefulness doesn't atone for systemic flaws. That is, we shouldn't transfer responsibility from an error-prone system to whoever happened to make the most recent error. When I find myself promising to be more careful or telling someone else to be careful, a little voice in my head tells me, "You're ignoring the real problem."

It should be hard for a developer to break significant functionality.

It should be hard for the product to crash.

It should be hard for a customer to make a mistake that requires them to contact support.

It should be easy to know quickly when something is wrong.

All of those aspirations are possible. They just require more hard thought than being careful does. But, through good system design, judicious automation, meaningful alerting, and obsessive iteration, I've found that it is possible to build systems that enable me and my customers to be carefree.