We recently looked at our hiring process to determine what works and what doesn't. By "works" I mean what gives us insight into a candidate's future performance. The first thing we realized was that we had no process at all.
Step 1: Create a process
The first thing we did was look at data from other companies. Google has published some information about its hiring process, in particular about what didn't work. Brain teasers were out, but there wasn't much solid information about what does work. The problem is that without a lot of data you really don't know how good your process is. You certainly know how well the people you hired have performed, but you have no way to know how good the people you didn't hire were, and without a highly structured process it's hard to match interview performance to real-world performance.
The next thing we did was have everyone come up with a list of questions they typically ask candidates, and for each one we asked, "What does the answer to this question tell us?" Does it tell us how well the candidate will perform as an engineer? Does it tell us how good the candidate is at interviewing? Or does it give us some vague sense of a quality we think an engineer should have that may or may not translate to real-world performance? It's hard to know which questions fall into the first category, but it's a little easier to filter out questions that fall into the last two.
We started by not presuming any relationships that we didn't have data to support. Does performance under stress in an interview translate into performance under stress in the real world? I don't know, so presuming that relationship doesn't give us useful information. Does the ability to "stand out" in an interview correlate with being a good engineer? Probably not. For each question we tried to determine how and why it would affect our decision to hire someone. Is specialized technical knowledge critical, important, or just nice to have? What kinds of questions will tell us whether a candidate is a good fit with the team?
After we came up with what we felt was a good set of questions, we looked at the environment of the interview itself. Since we have no data to support the hypothesis that being good at interviewing correlates with being good at software engineering, we wanted to create an atmosphere that was as relaxed as possible. We actually explain our process and the thinking behind it, because we don't want candidates to feel that stumbling over an answer out of nervousness will reflect negatively on them. We want to give them every opportunity to show us what they've got, because we don't want to miss out on someone really good over one bad interview.
The last thing we looked at was our filter for candidates. This one was tough because we don't want to waste time on someone who is obviously not going to work out, but at the same time we don't want a filter so restrictive that we miss out on good candidates. What we settled on was FizzBuzz: a very simple test of whether they can program at all, or at least figure out how to use Google. We try to impress on recruiters that we don't want them to apply their own filters, because frankly we don't trust them to make better decisions than we would, but we haven't met with a lot of success on that front.
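For reference, FizzBuzz really is as small as it sounds. Something along the lines of the following Ruby is enough to pass the filter; any language is fine, and this is just an illustration, not our exact prompt.

    # Print the numbers 1 to 100, but print "Fizz" for multiples of 3,
    # "Buzz" for multiples of 5, and "FizzBuzz" for multiples of both.
    (1..100).each do |n|
      if n % 15 == 0
        puts 'FizzBuzz'
      elsif n % 3 == 0
        puts 'Fizz'
      elsif n % 5 == 0
        puts 'Buzz'
      else
        puts n
      end
    end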
So, how well is it working? I don't know. We certainly removed things that were not likely to be useful and put some priorities on the process that everyone understands and agrees on. I'd like to say time will tell, but we just don't hire enough engineers to collect the kind of data that a Google or Facebook could. Oh, and if you're looking for a job with some challenging problems, we are hiring.
Monday, September 1, 2014
Saturday, August 9, 2014
S3 with strong consistency
We use S3 extensively here at Korrelate, but we frequently run into problems with its eventual consistency model. We looked into working around it by using the US West region, which has read-after-write consistency, but most of our infrastructure is in the US East region and moving it would be a lot of work. Netflix has a project called s3mper that provides a consistency-checking layer for Hadoop using DynamoDB, but we really needed something for Ruby.
Since we also have a lot of infrastructure built around Redis, we decided to use it for our consistency layer because it's very fast and has built-in key expiration. The implementation is fairly simple: every write to S3 also writes a key to Redis containing the etag of the new object. When a read method is called, the etag stored in Redis is checked against the etag returned by the S3 client to see if they match. If they do, the read proceeds as normal. If they don't, an AWS::S3::Errors::PreconditionFailed is raised, and the client decides how to handle it, whether that's retrying or doing something else. If the Redis key is nil, the data is assumed to be consistent.
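Here is a minimal sketch of the idea, assuming the aws-sdk v1 gem (which provides AWS::S3) and the redis gem. The class and method names are illustrative rather than our actual wrapper, and it raises a local error class where the real code raises AWS::S3::Errors::PreconditionFailed, just to keep the sketch independent of the SDK's error constructors.

    require 'aws-sdk'     # v1 API, provides AWS::S3
    require 'redis'
    require 'digest/md5'

    # Stand-in for AWS::S3::Errors::PreconditionFailed, so the sketch doesn't
    # depend on the SDK's internal error constructors.
    class StaleReadError < StandardError; end

    class ConsistentBucket
      ETAG_TTL = 24 * 60 * 60 # expire tracking keys after 24 hours

      def initialize(bucket_name, redis = Redis.new)
        @bucket = AWS::S3.new.buckets[bucket_name]
        @redis  = redis
      end

      # Write the object and record its expected etag in Redis. For simple
      # (non-multipart) PUTs, S3's etag is the MD5 of the body.
      def write(key, body)
        @bucket.objects[key].write(body)
        @redis.setex(etag_key(key), ETAG_TTL, Digest::MD5.hexdigest(body))
      end

      # Read the object, raising if the etag S3 returns doesn't match the one
      # recorded at write time. A nil Redis key means no recent write is on
      # record, so the data is assumed consistent.
      def read(key)
        expected = @redis.get(etag_key(key))
        object   = @bucket.objects[key]
        if expected && object.etag.delete('"') != expected
          raise StaleReadError, "stale read for #{key}"
        end
        object.read
      end

      private

      def etag_key(key)
        "s3:etag:#{key}"
      end
    end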
In practice it never takes more than a second or two to get consistent data after a write, but we set the Redis keys to expire after 24 hours to give ourselves plenty of buffer without polluting the DB with an endless number of keys.
This is still incomplete because it doesn't cover listing methods in ObjectCollection like with_prefix and each, but it's a start.
Thursday, August 7, 2014
The Premortem
We are all familiar with the postmortem. When things go wrong, we want to understand why so that we can hopefully avoid the same failures in the future. This works well for classes of failure that are due to things like infrastructure or procedural deficiencies. Maybe you had a missing alert in your monitoring system, or something wasn't communicated to the right person. The problem is that we can't plan for the unexpected, and it's the unexpected that causes the most problems.
The way the premortem works is this: imagine it is one year from now and (insert your project here) has failed horribly. Please write a brief history explaining what went wrong.
What this attempts to do is to bypass our natural tendency to downplay potential issues. Whether you are in favor of the project or not, this exercise will engage your imagination to come up with failure scenarios you might not have otherwise considered.
Give it a shot and please post comments about what you thought of it.
Wednesday, August 6, 2014
You are bad at estimating
For the last four years or so at Korrelate we have been using Scrum: two-week sprints, estimated stories, and planning based on those estimates. In that time we have learned one very important lesson: we are very bad at estimating. In retrospect, this shouldn't have come as a surprise to anyone. There is a mountain of research across multiple fields that demonstrates just how bad expert estimates are. They are so bad that, on average, they are worse than randomly assigned numbers. If you think you are somehow different and can do it better, you are wrong. You may be thinking that your estimates have been fairly accurate and your sprints largely successful. There are some reasonable explanations for this phenomenon:
- You are doing a lot of very similar tasks. Given a history of nearly identical work, it is possible to come up with fairly good estimates
- You have a large enough team that your bad estimates have roughly balanced each other out so far
- You are working a lot of extra hours
- You have yet to experience regression to the mean
Given that we know we are bad at estimating, what should we do? I propose a fairly simple change: weight all stories equally. That's right, don't estimate anything. This may sound crazy at first, but the evidence shows that equal weighting is on average better than expert estimates and as good or nearly as good as sophisticated algorithms. You would probably do better than your current estimates by basing them on the number of words in the story.
Now, obviously some stories will require more effort than others; we just don't have a good idea of which ones those are. I propose another change to help here: break every story up into the smallest pieces that make logical sense. Some stories will become epics that contain multiple smaller stories. If a story can't be sensibly broken up, just leave it. Do not make any attempt to equalize stories either within the epic or against other stories; that's just a back door to estimating. I think this is the best attempt we can and should make to reduce the difference in effort between stories.
So, without estimates, how do you plan? This brings up the issue of sprints, and my final proposal is that we abandon them as well. If you want to know how many stories the team is likely to complete over the next month, just take the average number of stories they have completed over the last few months. The time each story took to finish is irrelevant, and you can still follow trends in velocity over time and use them to answer questions like how adding another engineer will affect velocity and how long it will take to ramp them up. Resist the urge to add your "professional intuition" into the equation; you will only screw it up. Trust the data, not your gut.
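As a concrete illustration of that kind of planning (the numbers here are made up), the whole forecast fits in a few lines of Ruby:

    # Hypothetical counts of stories completed in each of the last six months.
    completed_per_month = [14, 9, 17, 12, 11, 15]

    average = completed_per_month.inject(0) { |sum, n| sum + n }.to_f / completed_per_month.size
    puts "Expect roughly #{average.round} stories next month"

    # Quoting the recent best and worst months is more honest than a single number.
    puts "Recent range: #{completed_per_month.min} to #{completed_per_month.max} stories per month"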
I'd love to hear about your personal experience with estimating or not estimating, sprints versus continuous deployment, or anything else related to improving the development process.
Sunday, December 1, 2013
Upgrade to Storm 0.9.0-wip16 with JRuby 1.7.4
At Korrelate, we've been running Storm 0.8.3-wip3 with JRuby 1.6 for many months, and we recently decided to upgrade to JRuby 1.7 and Storm 0.9.0. We made a few brief attempts in the past and ran into problems that exceeded the time we had available for the upgrade, but this time we had enough downtime to work through them. Here is a rundown of what we did to get everything working.
We are using version 0.6.6 of Redstorm to handle deployment and testing.
The first hurdle to overcome was getting it to run locally using redstorm local.
In your project's ivy directory, create a file called storm_dependencies.xml. Notice the snakeyaml version is set to 1.11. This is a fix for https://github.com/nathanmarz/storm/pull/567
Make sure you are using JRuby 1.7.4; we ran into additional problems with 1.7.8 and backed it out for now. Now create a file called topology_dependencies.xml in your ivy directory.
This will deploy your topology with JRuby 1.7.4 and its dependencies.
You should now be able to run your topology in local mode; the next step is to deploy it to your cluster. First, we'll look at the changes required in your topology code. There are several bugs in JRuby 1.7.4 related to loading files from a jar. Make sure you only require RUBY-6970.rb in your topology. Thanks to Jordan Sissel for doing most of the work for this fix.
Since we had to upgrade snakeyaml to run locally, we also have to update it in the Storm cluster. The following change is required in project.clj in the root of your Storm distribution.
The last thing you may have to do is make a small change to the Storm code for a bug that occurs if you're deploying on Ubuntu (and possibly other platforms). If you're using storm-deploy, you can fork the storm GitHub repo to make the changes and point storm-deploy at your repo in the storm.clj file.
A few other related notes:
For java.lang.OutOfMemoryError: PermGen space errors, you can put this in your .ruby-env file if you're using a compatible Ruby version manager like RVM:
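    # The original snippet isn't preserved here; something along these lines
    # should work (the exact PermGen size is a guess, tune it for your topology):
    JRUBY_OPTS="-J-XX:MaxPermSize=512m"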
Use t.options = '--use-color' if you are missing color output in your tests
I didn't track this as I was doing it so let me know if I missed anything. Good luck!