raganwald
(This is a snapshot of my old weblog. New posts and selected republished essays can be found at raganwald.com.)

Monday, February 11, 2008
  The Naive Approach to Hiring People


Every once in a while I read (or write!) something about hiring programmers. What to look for in a résumé. What to put in your résumé. Why _____ is my favourite interview question. Why _____ sucks as an interview question. Whether we need to filter the absolute dreck out.

I’ve even written one of those _____ sucks posts on my blog, and I’m here to tell you, I was wrong. And I’m going to tell you why I was wrong. But first, here is an interesting programming problem-style interview question. I’m not suggesting it is good or bad, for reasons that will become obvious.

An interesting interview question

You have a large collection of documents, each of which accurately describes a single person’s properties. One document, one person. To keep this light, perhaps you are looking for a compatible bridge partner. The documents are online player profiles, and you are interested in finding a suitable partner. The properties are multi-valued: there is a large set of properties, and for each property in each document there is either no value or a selection from a set of values. One value might be number of years of experience; another might be whether they overcall in third position with a weak hand (where “no value” means the other person did not answer that question and “no” means they do not overcall).

This is an iterative problem: you have to perform the selection on a regular basis, perhaps once each month. And each month, there is a new set of documents and persons to classify. Having performed the classification, you can check the game results at your local bridge club and see how everybody did, both the people you selected as potential partners and the people you rejected.

Describe a strategy for picking the best partners based on their profiles.

Before we discuss whether it is a useful problem, let me tell you who I’m interviewing with this hypothetical question: Technical Hiring Managers, specifically people who are technical themselves and are also responsible for hiring other technical people. Part of their job is looking through piles of résumés, picking out the good ones to phone screen.

What I’m looking for in a “correct” answer is a basic understanding of Document Classification. Given that we are talking about programming and programmers, a really good answer will discuss things like Naïve Bayes Classifiers. Like programs that can distinguish Ham from Spam.1

The point is that someone with at least a basic understanding of document classification knows how to apply what we know about document classification to the problem of selecting candidates to phone screen based on their résumés.

Someone with an understanding of document classification knows how to apply what we know about document classification to the problem of selecting questions to ask in phone screens and in face-to-face interviews, not to mention what to do with the answers. (And the emotionally nice thing about this is that it’s an interview question for interviewers to solve.2)
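
In case the mechanics are hazy, here is a minimal sketch of such a classifier in Ruby. It is a toy, not a recipe, and every label and feature name in it is invented for illustration; the only point is that each feature contributes a probability, and the probabilities combine into a classification.

# A toy naive Bayes classifier: count how often each feature appears
# under each label, then combine the per-feature probabilities.
class NaiveBayes
  def initialize
    @feature_counts = Hash.new { |counts, label| counts[label] = Hash.new(0) }
    @label_counts   = Hash.new(0)
  end

  # features is an array of strings, e.g. ["five_years_jee", "uses_junit"]
  def train(label, features)
    @label_counts[label] += 1
    features.each { |feature| @feature_counts[label][feature] += 1 }
  end

  # Returns the label with the highest (log) posterior probability.
  def classify(features)
    total      = @label_counts.values.sum.to_f
    vocabulary = @feature_counts.values.flat_map(&:keys).uniq.size
    @label_counts.keys.max_by do |label|
      score = Math.log(@label_counts[label] / total)
      seen  = @feature_counts[label].values.sum
      features.each do |feature|
        # Laplace smoothing, so an unseen feature does not zero everything out
        score += Math.log((@feature_counts[label][feature] + 1.0) / (seen + vocabulary))
      end
      score
    end
  end
end

# Hypothetical usage: the labels and features are made up.
filter = NaiveBayes.new
filter.train(:interview,    ["ships_side_projects", "five_years_jee"])
filter.train(:no_interview, ["five_years_jee", "no_version_control"])
filter.classify(["ships_side_projects", "five_years_jee"]) # => :interview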

If Statements vs. Classification

A very senior Microsoft developer who moved to Google told me that Google works and thinks at a higher level of abstraction than Microsoft. “Google uses Bayesian filtering the way Microsoft uses the if statement,” he said.
—Joel Spolsky, Microsoft Jet

I could make a reasonable argument that someone who doesn’t think of selecting candidates as a classification problem might miss the fact that the things to look for—years of experience with a specific technology, length of time in the most recent position—are merely document features with probabilities attached to them. I could make the argument that they are “thinking in if statements” about hiring programmers.

I could go on about how saying a particular job requires “Five years of JEE” is an if statement, and one that is far from universal. Someone who thinks like that is not a good interviewer; they really ought to be thinking in terms of the probability that someone with five years of JEE will be Ham and not Spam.

Oh, the irony. I would be arguing that the interesting question is useful because it identifies people who pose questions like that as being bad interviewers!

There are really two approaches to take in selecting candidates. The first is the approach of the if statement: You form a model of what the candidate ought to do, work out what they ought to know in order to do that, and then you work out the questions to ask (or the features to look for) that demonstrate the candidate knows those things. If they know this and this and this and if they don’t have this bad thing or that bad thing, call them in for an interview (or, if you are interviewing them and they have demonstrated their strength, hire).

The second approach is the classifier approach. Each feature you look for, each question you ask, is associated with a probability. You put them all together and you classify them as interview/no interview or hire/no hire with a certain degree of confidence.

So is the classifier the same thing as the if statements, only with percentages instead of boolean logic? Perhaps we could simply make up a score card (10 points for each year of JEE, 15 points if they use JUnit, &c.)? No.
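
To make the contrast concrete, the score card is easy enough to write down, and writing it down exposes the problem: somebody has to invent the weights and the threshold out of thin air. The numbers below are, of course, made up.

# The "thinking in if statements" approach: someone decides, up front,
# which features matter and how much. The weights never learn anything.
def score_card(candidate)
  score  = 0
  score += 10 * candidate[:years_of_jee].to_i
  score += 15 if candidate[:uses_junit]
  score -= 50 if candidate[:gap_in_employment]
  score
end

def worth_interviewing?(candidate)
  score_card(candidate) >= 40 # an arbitrary threshold, chosen by fiat
end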

The most important thing about most classifiers is that they can be remarkably naïve and still work. In fact, they often work better when they are naïve. Specifically, they do not attempt to draw a logical connection between the features that best classify candidates and the actual job requirements. Classifiers work by training themselves to recognize the differences that have the greatest statistical relevance to the correct classification.

That’s the naïveté at work: they have no idea that experience in functional programming is irrelevant to a job writing Javascript; they just notice that the people with FP experience tend to do well in Javascript jobs, so they start considering it relevant.

Training day

Document classification systems are trained, typically using supervised learning: “These are the résumés of the good people. These are the résumés of the ones we had to fire.”

Here’s a thought experiment: Pretend you are trying to write a mechanical document classifier. Let’s see if designing a machine to perform the process can identify some opportunities to improve the way humans perform the process. (As a bonus, we might actually identify ways machines could augment the process, but that is not our objective.)

If you were writing a document classifier for résumés, the first thing you would probably write would be a feature that updated the training corpus whenever a programmer completed their initial probation: If their first formal review was positive, their résumé would be added to the “Interview” bin. Otherwise, it would be added to the “No Interview” bin.
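
As a sketch, reusing the toy classifier from earlier and hand-waving the feature extraction (every name here is hypothetical), that feature is almost embarrassingly small:

# Hypothetical hook: when a programmer's first formal review comes in,
# file their resume's features into the appropriate training bin.
def record_probation_review(classifier, resume_features, review_was_positive)
  label = review_was_positive ? :interview : :no_interview
  classifier.train(label, resume_features)
end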




Programming Collective Intelligence breaks out of “thinking in if statements” and provides practical examples for building systems that reason based on learning from data and behaviour, such as the Naïve Bayesian Filters discussed in this essay and collaborative filters such as recommendation engines.

This need for training is the big, big difference between approaching hiring people as an exercise in if statements and approaching it as an exercise in classification. If you are working with if statements, you only change the if statements when something radical changes in the job or in the pool of people applying for the job.

But if you are approaching hiring people as an exercise in classification, you are constantly training your classifier. In fact, the quality of your results is driven by your process for training, for continuous improvement. It’s a process problem: how do we do a good job of training our classifier and keeping it trained?

Consider the training process I mentioned above: you build a document classifier, and you feed it the résumés of people you hire after they complete probation. If they quit or are fired, they are marked “No Interview.” If they get a lukewarm review, they are marked “No Interview.” But if they get a good review, they are marked “Interview.” What do you think?

Ok, thanks for using the comment link to tell me what you think. Here’s what I think: this is dangerously incomplete. Pretend we’re sorting emails into Ham and Spam. Training our résumé classifier based on who we thought was originally worth an interview is like training our email classifier based on which emails ended up in our inbox. It totally ignores the good emails that were classified as junk. To classify emails properly, you have to go into your junk mail folder every once in a while and find the one or two good emails that were misclassified as junk, then mark them “not junk.”

Our thought experiment has identified a critical component of classification systems: to train such a system, you have to identify your false negatives, just as junk mail filters let you sort through your junk mail and mark some items not junk.
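
In code, the correction is just more training; the hard part is not the method, it is getting outcome data about the people we turned away. Again, everything here is hypothetical, and it assumes the toy classifier from earlier:

# Hypothetical correction: we passed on this resume, later learned the person
# went on to do well somewhere else, so we feed it back in as a false negative.
def mark_not_junk(classifier, rejected_resume_features)
  classifier.train(:interview, rejected_resume_features)
end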

Where hiring people is concerned, what is the process for checking our junk mail filter? How do we find out whether any of the résumés we passed over belonged to people worth hiring? I don’t have an answer to this question, but thinking of résumé selection as an exercise in document classification identifies it as an obvious weakness in the way most companies handle interviewing: as an industry, we don’t do much to train our selection process.

A metric fuckload of process

A company really obsessed with hiring well would keep statistics. I know, I can feel your discomfort. More paperwork, more process, more forms to fill out. But honestly, every process is improved when you start to measure it. Maybe we measure too many things, or the wrong things. My ex-colleague Peter Holden is a terrific operational manager. His metric for metrics is to ask whether a particular measurement is a management report, meaning—in his operations lingo—is that piece of data used to make an active decision in the business?

For example, if we actually store résumés and also the outcomes—whether we hired them, how they did—and then use that data to constantly improve how we select résumés, then that is a management report and that is data worth collecting.

Likewise, we could ask questions in interviews and actually track who answered correctly and who answered incorrectly and whether the answer had any correlation with a candidate’s eventual job performance. Does that sound like too much work? Seriously? Are you drinking the same kool-aid I’m drinking about the importance of hiring good people and the critical need to avoid bad hires?

The bottom line in my interviewing technique is that smart people can generally tell if they’re talking to other smart people by having a conversation with them on a difficult or highly technical subject, and the interview question is really just a pretext to have a conversation on a difficult subject so that the interviewer’s judgment can form an opinion on whether this is a smart person or not.
—Joel Spolsky, The Phone Screen

Or let’s move up a level. Many people like the touchy-feely voodoo approach to interviewing. Joel Spolsky calls certain questions “a pretext to have a conversation on a difficult subject so that the interviewer’s judgment can form an opinion on whether this is a smart person or not.” So maybe the answer to the question can’t be tracked in a neat yes/no, right/wrong way.

But you know what you can track? How about tracking whether each interviewer is a reliable filter? Do you keep statistics for which interviewers let too much Spam through, for which interviewers are so conservative that they statistically must be turning good people (Hams) away?

No? I must be honest with you. Until now, neither did I. Although I do not speak for Mobile Commons, I’ll bet we will be discussing it soon. We’re serious about growing, we’re serious about hiring really good people, and we don’t want to put on the blinders and demand “Five years of JEE.” Which means we want to talk to a lot of people who are “Smart and Get Things Done.” And which also means we need to get really, really good at bringing good people on board.

Which means we want to ask the questions that actually help us distinguish the best from the not-so-best. Which brings me back to my interesting question above, and why I won’t say whether it’s good or bad. Because I haven’t trained my filter by asking it of a representative sample and then determining the correlation between a supposedly correct answer and actual fitness for the job.

And the only way to know whether it is useful is to incorporate it into a classifier and see whether it ends up with a high conditional probability of predicting the classification we care about.
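
Even something as crude as the following sketch would be a start on measuring interviewers as filters. The record format and every name in it are invented, and notice what it cannot compute: the Hams an interviewer turned away, because we have no outcome data for the people we rejected.

# Hypothetical records: each is { interviewer:, said_hire:, worked_out: },
# where worked_out is nil if we never found out (almost always the rejections).
def interviewer_stats(records)
  records.group_by { |record| record[:interviewer] }.map do |interviewer, screens|
    known     = screens.reject { |record| record[:worked_out].nil? }
    passed    = known.select { |record| record[:said_hire] }
    bad_hires = passed.count { |record| !record[:worked_out] }
    ham_rate  = passed.empty? ? nil : 1.0 - bad_hires.fdiv(passed.size)
    { interviewer: interviewer, screened: screens.size, ham_rate: ham_rate }
  end
end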

Summary

I am not suggesting that naïve Bayesian filters can outperform human interviewers, or that fuzzy questions like "How would you design a Monopoly game?" have no place in hiring, or that an experienced programmer cannot tell if another person is an experienced programmer by talking to them.

I am especially not suggesting that people do not make false statements: many of the people I have interviewed in my career really believed that working on one Java application for two years made them experienced programmers with strong OO architecture skills.

But as stated clearly above, I am claiming that someone with at least a basic understanding of document classification knows how to apply what we know about document classification to the problem of selecting candidates to phone screen based on their résumés. I am claiming that what we know about training classification systems can be applied to improving the hiring process.

And mostly I am claiming that when we take a single question or feature, like "Years of experience," or perhaps, "Ability to write FizzBuzz in an interview," the correct way to reason about its applicability to the hiring process is to think of its statistical correlation to our objective, not to try to construct a chain of if statements.

If you find this interesting, Games People Play discusses what to do about the fact that candidates will say or do anything to get a job, including lie about their experience.



An Apology

Remember I told you I was wrong about thinking something sucked?

Did you ever take that test yourself? Deckard?
—Rachael, Blade Runner

Once upon a time, I was asked an interview question, and I gave a very thorough answer, including all of the usual correct answers plus an unusual nuance, a corner case that most people probably would have skipped. It cost me the job: as it turned out, the interviewer told me I was mistaken. I carried that on my back for years, even though the job probably wasn’t all that great a fit.

But now, I realize that worrying about answering the question correctly is thinking in if statements. If I get it correct, then I must be fit for the job. Not true at all. There could be a classifier question where there is a strong reverse correlation between getting the question correct and confidence in classifying you as “Ham.”3

The only thing that matters about that interviewer is whether, on the whole, he does a good job of separating Ham from Spam. Perhaps he does, in which case I was simply one of those statistical necessities, a false negative. Or, just as plausibly, the question itself may have been highly valid, as was his interpretation of the answer: the only thing that mattered might have been that answering in the manner the interviewer expected was highly correlated with job success, and that answering in the manner I did was negatively correlated with job success.

Naïve classifiers are brutal in that way. They don’t work the way you expect them to work. Spam filters give relevance to all sorts of words you wouldn’t expect. Or to phrases you don’t expect (thanks to interesting work with Markov Models). It’s a precise, bloodless process.

It isn’t personal. And for that reason, we really ought to back away from thinking about hiring in if statements. It’s a path that leads right towards taking it personally. As interviewees, we take questions or puzzles that we find difficult very personally. We get angry if we are asked things we consider irrelevant to the job. Secretly, we want interviewers to validate our worth, not just by saying “Hire,” but by valuing the things we value about ourselves, which means we look for interviewers to have if statements that align with our notions of competence.

And as interviewers, it is difficult to take ourselves out of the equation. If we only hire people just like us, we have no opportunity to learn and improve on our hiring practices. Hiring people unlike ourselves is hard if we hire with if statements. It requires valuing our incompetence instead of our competence.

Approaching the problem as a problem in classification is our road out of that emotional swamp. It’s a process we can explain and understand without being personal, without judging ourselves as people or our candidates.

With this new understanding, I apologize to that interviewer for my criticism of the interview process. I will try to improve my approach to discussing interviewing and interview questions in the future.



  1. There are a lot of classification algorithms, and this essay is not a claim that Naïve Bayes is ideal for any or all hiring purposes. But I use it as an example because most people understand spam filters and roughly how they work. [back]

  2. Although this isn’t the subject of the essay, please feel free to use this question in the following manner: If you find yourself in an interview where the interviewer bombards you with puzzle after puzzle in an effort to impress you with how smart he is, when he folds his arms and asks you if you have any questions for him, pull this one out. Let me know how it goes :-) [back]

  3. Reverse-engineering classifiers can be futile, but one can imagine a question that reveals the person answering it is highly overqualified for a basic clerking job. Or something. [back]
 

Comments on “The Naive Approach to Hiring People”:
Interesting thoughts about a very hard problem -- I hadn't thought of it as a data projection problem before (which, of course, it is), so thanks for the insight.

One thing I also wonder is whether it's useful, given a sufficient sample, to treat hiring as a regression problem estimating a real-valued fitness function. In other words, instead of having just a fit/not fit classification, you have a range of values, perhaps 0 to 1, where 1 is the best person you could possibly hire, and 0 is the worst. This opens up some interesting possibilities: setting different thresholds depending on how badly you need someone, telling the difference between someone who's good enough and someone who can become one of the real technical leaders in your group, etc.

Regardless, my organization is likely too small to collect enough data to be meaningful, so we have to settle for Joel's idea of getting the conversation going and determining whether we have found another smart person. We mostly hire kids straight out of college (we are located in a small college town, so that's most of the employable technical workforce), so, happily, we rarely have any use for metrics like years of experience using X.
 
AN:

Terrific comment. Actually, this post was inspired by the book I am hawking in it :-)

My weird premise is that if you take any of the non-"thinking in if statements" algorithms and apply it to a human process like hiring or project management or code reviews you will get a better result than thinking in if statements.

So if you don't have a big corpus suitable for document projection, maybe you can build a collaborative filtering model or something else...

And as I identified, the weakness of "thinking in naïve Bayesian classifiers" is that we do not have a good way of inspecting the people we reject to see if they went on to be good hires elsewhere.
 
Thanks for the book recommendation. I've got it on my list to pick up next month, but it's always nice to see a vote of confidence from someone whose blog I enjoy reading.

Even before the book came out, one of the things I've had on my feature wishlist for our software for a while was a context-sensitive default value selector. The idea being that as users fill in fields of a data entry form, you suggest likely values for other fields, based on using a Bayesian classifier to correlate values from previously entered data and see whether there's a value that rises above a probability threshold to determine whether it's likely enough to be worth suggesting.

By the way, if you're interested in studying more about advanced approaches to regression and classification problems, there's great coverage of nonlinear (i.e. "neural") approaches in Ripley's Pattern Recognition and Neural Networks. The math is fairly elaborate, but my impression is that you won't be overly concerned by that.
 
When reading your post, I got to the problem description and immediately thought "either you try to figure out all the rules for matching players, or you use Bayesian statistics". I could have described the Bayesian approach in general terms and why it would be good (all from reading on Reddit), but if you asked me to go a level deeper and show exactly how to structure the system, I would have had to say 'I don't use it much and I would have to look that up' (which is what I did). After some Wikipedia, I could probably discuss it as a more informed lay person, albeit not an expert.

Now here's a thought. What about letting an interviewee use Google during an interview? What would that tell you? Would it let people bluff, or could it help you find the Ham? I mean, there's a lot to know out there, and at least someone who could google up answers to things they didn't immediately know would show some problem solving skills.
 
Reg, it's posts like these that make you a must-read in my book; you are consistently one of the calmest, most thoughtful, most intelligent, and least dogmatic voices around, and it's always a joy to read your thoughts.

(I've wanted to say that for ages, and this post particularly demonstrates why I have so much time for you.)

dmh2000: Funny you should wonder about that. I've just been for an interview where I found myself being asked about motor-tasks (ie. "how do you do basic-task X?", "what is specific-attribute Y used for?") of things I haven't looked at for 4 years, despite being very explicit about not having looked at them for 4 years. Two of the questions really stuck in my head, and I looked them up when I got back. Within two minutes of Googling, I had both answers... So next time I find myself in that position, I'll be asking if I can look it up - or even taking along my Nokia 770 and looking it up myself... just to bring the point home.

(I still haven't had an answer about the job, incidentally, but I'm not getting my hopes up.)
 
On false negatives:

It wouldn't be too hard to set up an alert for a given applicant starting work at a competitor or startup, assuming some list of employees is published. If you don't want to go to all the work of faux-headhunting, you can just track how long they keep the job.

Then, of course, it would be in your competitors' interest to keep someone on the directory for as long as possible...
 
This reduces the problem of hiring to the problem of objectively evaluating current and past employees, which is easier, but still a hard problem - even if you ignore social issues.

It's still a great idea. Funny to consider the long range implications. If machines/algorithms can trade stocks and hire people, could a machine-led company be successful? This is very likely a much easier problem than general AI. :-)
 
Vladimir:

This reduces the problem of hiring to the problem of objectively evaluating current and past employees, which is easier, but still a hard problem - even if you ignore social issues.

Well, you remember the old riddle: "How do you eat an Elephant?"--"One bite at a time."

First, evaluating résumés can definitely be assisted by document classification, possibly helping to rank them, with the feedback being whether a human phone screening them decides to bring them in for an interview.

Frankly, I would be amazed if a company like Google isn’t picking this low-hanging fruit right now.

Next, phone screens are often 3-5 standardized questions. A human may ask the questions and have the right to terminate the interview with a "No Hire" if the candidate is an obvious reject. But the human might rate their satisfaction with the answers, perhaps zero to five stars.

An algorithm might assign them a final score, correcting for statistically harsh or generous interviewers, and adjusting the relative rank of the questions based on whether the candidates selected for formal interviews are rated Hire or No Hire.

A stockpile of questions could be developed, and the system could introduce an additional question from time to time on a trial basis—perhaps the three questions become five when the system is training new questions.

This could easily be automated with a "phone screen application" that displays the questions to be asked.

These two steps—selecting candidates to screen and screening them—can both benefit from classification as an assistant to humans.
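
To sketch what I mean by correcting for statistically harsh or generous interviewers, one simple approach is to standardize each interviewer’s ratings against their own history. This is an illustration, not a spec, and the names are invented:

# Rough sketch: convert a zero-to-five-star rating into a z-score relative to
# this interviewer's own history, so a harsh interviewer's 3 and a generous
# interviewer's 5 can mean roughly the same thing.
def standardized_rating(rating, interviewers_past_ratings)
  mean     = interviewers_past_ratings.sum(0.0) / interviewers_past_ratings.size
  variance = interviewers_past_ratings.sum(0.0) { |r| (r - mean)**2 } / interviewers_past_ratings.size
  stddev   = Math.sqrt(variance)
  stddev.zero? ? 0.0 : (rating - mean) / stddev
end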

I wonder if I have just described a new Y Combinator startup?
 
Taking this one step further, what if something like this was used by someone like monster.com to identify which jobs you'd be a really good fit for?
 
On False Negatives:

A large enough organization could just hire negatives once in a while, to keep the filter well trained. I'll bet the bad-hire rate for most technical organizations is in the 5% neighborhood; hiring 1% of applicants without regard to what the filter says is well worth it if it keeps the filter producing, say, 3% false positives.

If you're just filtering resumes to see who's worth an interview, an organization doesn't need to be very big to be able to do interviews on the negatives once in a while.
 
A very rough (and perhaps nonsensical) idea for a startup:

A web site that is billed as a way for hiring managers to track candidates. In the background (but fully disclosed) the system is silently harnessing the collective intelligence of all hiring managers inputting data about candidates into the system to produce a sort of Rotten Tomatoes for human resources. As each hiring manager gets a resume, they enter that person into the site. If they end up hiring that person, after the 90-day review (or whatever time span is appropriate) they go back in and up-vote or down-vote based on whether that person turned out to be a good hire.

This could solve the problem of "checking your junk folder". You could go back and look at the ratings of candidates you passed up and see how they performed for the companies that did hire them. Of course, all the statistics would be presented anonymously.
 



