Lies, Damned Lies, and...
Statistics—by which I mean collected evidence—is useful for making decisions in a probabilistic environment. A probabilistic environment is one in which we have several alternatives, the alternatives have different outcomes, and we cannot assure ourselves of selecting the most beneficial outcome with the evidence we are able to gather.
The best we can do is make a choice that gives us the greatest likelihood of a good outcome, or perhaps limits the likelihood of a bad outcome.
We humans like to use evidence of past outcomes to guide ourselves when making future choices. We’re pattern-matching machines. If we ate a green banana and it hurt our stomach in the past, we avoid green bananas in the future. Almost everything we do in our lives and our careers is based on this principle, although we wrap it up in books full of formulae and impressive phrases like Bayesian filtering.
The trouble with this is that humans rarely apply even a modicum of common sense to probabilistic decisions. For example, one huge issue is called sampling bias. Consider hiring programmers, a popular topic. What does a certification from Sun or Microsoft tell us about a candidate?
Well, let’s look at the evidence. Let’s pick 10,000 people at random and divide them into two groups, programmers and non-programmers. What percentage of the programmers have certification? What percentage of the non-programmers have certification?
This evidence we just collected is exactly the same kind of evidence that spam classification systems use to determine whether emails are spam or not. So, can we apply this evidence to selecting people to interview for jobs as programmers?
The catch in this case is that people applying for jobs as programmers are not the same kind of sample as the population at large. The samples are different. The filter (“select people with certification”) is most effective when the sample of people applying for the job closely resembles the population at large, and least effective when it does not.
Imagine two companies, Alpha-Geeks (“A”) and BigCo (“B”). Alpha-Geeks is a startup working on something hip that you cannot explain to your mother-in-law, and it is using one of the technologies covered by certification. BigCo is in the consulting industry; its clients are big corporations everyone has heard of.
The sample of people applying for jobs with Company A is very different from the sample of people applying for jobs with Company B. Don’t you think that Company B attracts many, many more submissions than Company A? And aren’t those submissions much more heavily weighted towards the average or even the mediocre? (“Weighted towards” doesn’t mean that talented people don’t apply for jobs at Company B; please keep your cool.)
The filter is going to be far more useful with Company B than Company A, because the sample of people applying for jobs at Company B is far more similar to the evidence sample than the sample of people applying for jobs at Company A.
And that’s the key to making good decisions based on evidence: your evidence sample must be very similar to your decision environment.
(I am not going down the rabbit hole of saying that people with certification have some sort of personality that doesn’t match a startup, or anything along those lines. Such a thing may explain this result, but no explanation is needed: the very fact that the samples of applicants differ from each other is enough to understand the principle of sampling bias.)
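To make the arithmetic concrete, here is a minimal sketch in Python, with every number invented for illustration: the two likelihoods stand in for the certification rates we measured in the population at large, and the two base rates stand in for the very different applicant pools at Alpha-Geeks and BigCo.

```python
# A sketch with invented numbers: the same evidence ("holds a certification")
# is worth a lot in one applicant pool and almost nothing in another.

def p_capable_given_certified(p_capable, p_cert_if_capable, p_cert_if_not):
    """Bayes' rule: probability an applicant is a capable programmer,
    given that they hold a certification."""
    p_certified = (p_cert_if_capable * p_capable
                   + p_cert_if_not * (1.0 - p_capable))
    return p_cert_if_capable * p_capable / p_certified

# Evidence gathered from the population at large (made up):
# programmers hold certifications far more often than non-programmers.
P_CERT_IF_CAPABLE = 0.40
P_CERT_IF_NOT = 0.05

# BigCo's applicant pool resembles the population at large:
# only 10% of applicants are capable programmers.
print("BigCo:      ", p_capable_given_certified(0.10, P_CERT_IF_CAPABLE, P_CERT_IF_NOT))
# Alpha-Geeks' pool is heavily self-selected: 90% are already capable.
print("Alpha-Geeks:", p_capable_given_certified(0.90, P_CERT_IF_CAPABLE, P_CERT_IF_NOT))
```

With these made-up numbers, the filter lifts BigCo from a 10% chance that a random applicant is capable to about 47% for a certified one, while at Alpha-Geeks it nudges 90% up to roughly 99%, and it would reject the majority of capable applicants who never bothered to certify. The filter earns its keep only where the applicant pool resembles the sample the evidence came from.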
but is it good enough
One argument is that even a flawed filter has some value, so even if certification is flawed, it’s still helpful. I’ve already taken a whack at certification, so let’s take a whack at the diametrically opposite filter to give this some balance: let’s talk about asking applicants to solve some sort of problem in an interview.
Why do people administer this kind of problem? Possibly because it makes them feel smarter than the applicant. Possibly because their mentor Joe Furrybeard from MIT did it that way. And possibly because they performed the following simple experiment: they asked everyone in their company to solve the problem at an off-site retreat, and 90% got it right. Meanwhile, they have observed interviewees struggle with it, and fewer than 50% get it right.
They know that Sturgeon’s Law applies to applicants, so it looks like a winner: apply the test and throw the losers out on the spot.
Well, this is contentious, but let’s start with an easy issue. Remember that we tested people at a retreat? We could have the following phenomenon: most of the people at the company got it right because they were asked the problem in a relaxed setting, while many interviewees blow it under pressure. Our “evidence” that 90% of our staffers get it right is biased.
Selection bias creeps into hiring in other ways: we could select by lifestyle (golfers need not apply at our startup unless they are disc golfers). True, everyone at the company is good at this problem. But that’s only because we hire people who are good at problems like this, not because there is a correlation between this specific talent and someone’s job performance.
Even if everyone at the company is extremely talented, that may say nothing about the value of this interview problem. Let’s see how. Let’s apply two filters to everyone we interview: one that is very good at predicting job performance and one that is poor. Perhaps the good one is something obvious, like previous performance under similar circumstances, while the bad one is whether they play a musical instrument.
If we use both filters, we will select far fewer people than if we just use the good one. But when we survey our employees, we discover that they are good and that they play musical instruments. So why don’t we drop the long interviews asking about past performance and simply pick those who play a musical instrument?
Let’s think about a spam filter: it gets thousands of emails, we classify them by hand into spam and not-spam, and it learns the relevance of various pieces of evidence like words.
In our case, we want to know whether “plays a musical instrument” is significant. We measure that everyone in the company plays a musical instrument. Very good. So we have a measure that 100% of the people we hired play musical instruments. But did we measure how many of the people we turned down for jobs played musical instruments? No. Maybe 50% or more of the rejects play musical instruments.
Well, if 100% of the hires play an instrument and 50% of the no-hires play an instrument, that’s still pretty useful, isn’t it? No way! Because, as we pointed out, our sample of employees is contaminated by the fact that we only hire musicians. Those statistics would only be relevant if there were no hidden correlation with our existing selection. But since our selection is contaminated, playing a musical instrument has about as much value as belonging to your college club has for fitness to rule the nation.
The root of both of our sampling-bias problems is simple: although what we did looked a little like a Bayesian filter, we did not sample real job applicants for our company and train our filters on trusted classifications. We trained our filters on inappropriate data sets, like the population at large or our existing employees.
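Here is a rough simulation of that contamination, with invented numbers: in the applicant pool, skill and instrument-playing are independent, but our historical hiring only ever admitted skilled musicians, so a survey of our own staff makes the musical filter look perfect.

```python
# A sketch with invented numbers: a filter "trained" on our own hires,
# instead of on real applicants with a trusted classification.
import random

random.seed(1)

def applicant():
    # Skill and instrument-playing are independent in the applicant pool.
    return {"skilled": random.random() < 0.3,
            "plays_instrument": random.random() < 0.5}

applicants = [applicant() for _ in range(100_000)]

# Our contaminated selection: we only ever hired skilled musicians.
hires   = [a for a in applicants if a["skilled"] and a["plays_instrument"]]
rejects = [a for a in applicants if not (a["skilled"] and a["plays_instrument"])]

def rate(group, key):
    return sum(a[key] for a in group) / len(group)

# Surveying our own staff makes the bad filter look spectacular...
print("instrument rate among hires:  ", rate(hires, "plays_instrument"))    # 1.0, by construction
print("instrument rate among rejects:", rate(rejects, "plays_instrument"))  # roughly 0.4

# ...but in the pool we actually have to filter, it predicts nothing.
musicians     = [a for a in applicants if a["plays_instrument"]]
non_musicians = [a for a in applicants if not a["plays_instrument"]]
print("skill rate among musician applicants:    ", rate(musicians, "skilled"))      # about 0.3
print("skill rate among non-musician applicants:", rate(non_musicians, "skilled"))  # about 0.3
```

The last two numbers are the ones a properly trained filter would care about, and they are identical: in the population we actually have to filter, “plays a musical instrument” tells us nothing about skill.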
little lies
Sampling bias is a pervasive problem. Another way it creeps into decision-making is through discounting. Humans have a way of filtering evidence before they use it to make decisions. Consider this situation: a drug company is putting a new heart medication through trials. 1,000 patients try it for two years after having bypass surgery. The company reports that 60% of the patients reported better-than-average blah-blah-blah (cholesterol, mood, blood pressure, take your pick) two years after surgery.
Great medication? It sounds like 600 patients got better. What’s the problem?
Don’t settle for the press release; let’s look at the study parameters. Well, look at that: they threw 200 patients out of the survey. It seems they died before the survey was completed. And what does improved blah-blah-blah mean? Well, there was a control group of 1,000 patients who didn’t take the drug. Five percent of them died, and the company threw them out using the same protocol as the test group. The survivors’ blah-blah-blah was measured, and the company is claiming that 60% of the people who survived the drug experience are better than the median blah-blah-blah of the control group survivors.
But when you copy and paste the tables from the PDF into your calculator and do a quick calculation, you discover that although 60% are better than the median, only 40% of the survivors are better than the average blah-blah-blah of the control group. The data set is strongly skewed.
Well, that’s effing terrible. First, you have four times the mortality rate. True, 60% of the survivors are better than the median, but what appears to be going on is that a lot of them are only a little bit better, while the ones that are worse are much, much worse. That fits with the 150 extra deaths in the test group.
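To see how the press release and the raw tables can both be “true,” here is a back-of-the-envelope sketch. Every number is invented, chosen only to match the shape of the story: 200 deaths against 50, and survivor scores where most of the drug takers are a little better than the control median and the rest are much, much worse.

```python
# Invented numbers that match the shape of the story: throwing out the dead,
# and reporting "better than the median" where the mean tells a different tale.
from statistics import mean, median

# Mortality: 1,000 patients per arm, 200 deaths on the drug vs. 50 in the control group.
test_deaths, control_deaths = 200, 50
print("mortality ratio:", (test_deaths / 1000) / (control_deaths / 1000))  # 4.0
print("extra deaths:   ", test_deaths - control_deaths)                    # 150

# Hypothetical blah-blah-blah scores for the survivors of each arm.
control_survivors = [100] * 500 + [104] * 250 + [140] * 200   # 950 survivors
test_survivors    = [112] * 320 + [103] * 160 + [70] * 320    # 800 survivors

ctrl_median = median(control_survivors)   # 100
ctrl_mean   = mean(control_survivors)     # about 109.5

print("drug survivors above control median:",
      sum(score > ctrl_median for score in test_survivors) / len(test_survivors))  # 0.6
print("drug survivors above control mean:  ",
      sum(score > ctrl_mean for score in test_survivors) / len(test_survivors))    # 0.4
```

Sixty percent of the drug takers clear the control median by a whisker, only forty percent clear the control mean, and 150 extra patients never made it into the tables at all. Both claims can be literally true while the drug remains a disaster.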
What happened here? The drug company discounted the problematic results. It threw out the deaths, and it chose to measure the median instead of the mean. Luckily for us, we don’t do this in software development.
Or do we?
How often do you hear someone report “Operation successful, but the patient died”? I hear it all the time. For example, people report that a project was problematic because the client kept changing requirements. I’m sure the client did change requirements, no word of a lie.
But what if we ask, “What’s your success rate with your software development methodology?” Will the response be, “It almost never works; it depends on fixed requirements, and that’s not what we observe in the field,” or will the response be, “It works just fine”?
What happens here is that people believe it ought to work, so they discount the times it doesn’t work by blaming clients or programmers or managers. In effect, they are throwing the dead patients out of the study! I suspect what happens is that they have heard this worked at another company, with different people, different clients, different skill sets, everything different. But they ignore the obvious sampling bias, they ignore the fact that what works at BigCo may not work for them (or vice versa), and so they discount their failures.
It goes the other way too. Some people throw the survivors out of the study. Have you ever heard someone describe Ruby as a language that won’t work for large teams of mediocre programmers? Because, you know, they aren’t hackers? There may be some truth to it, but you’re also hearing someone discount evidence of success just as others discount evidence of failure.
My summary here is a short one: statistics are only useful for decision-making when rigorously examined for fitness. You must be very, very certain that your evidence sample strongly resembles the decision-making environment, and you must be very careful that you don’t discount significant portions of your evidence.
I am going to give the final word on the subject to the Marketing Product Manager who steered JProbe to eight-figure revenues and a Jolt Award. (If you think software developers make decisions based on bad statistics, you will be amazed at what passes for evidence when marketing is discussed.)
The plural of anecdote is not data.
—Alan Armstrong