Simple Heuristics That Make Algorithms Smart

Gerd Gigerenzer’s debate with Daniel Kahneman, Amos Tversky, and collaborators in the heuristics and biases research program has been one of the more interesting sideshows in psychology. Although the debate often appears to be more a matter of framing than of substance, it largely revolves around the question of whether we should characterize “biased” human decisions as error or “irrationality.”

Kahneman and Tversky made their mark with a series of papers in which they proposed that people make decisions under uncertainty by using a small number of “heuristics,” or rules of thumb. These heuristics reduce otherwise complex calculations to simpler judgments. While often useful, such heuristics, Kahneman and Tversky demonstrated, can lead to systematic errors. In other words, they can lead to “biased” decisions.

An example heuristic is anchoring and adjustment: start an estimate from an initial value (the anchor), then adjust toward a final answer. Although useful in some situations, this heuristic can cause problems, because people often start from an irrelevant or only weakly relevant anchor and then adjust insufficiently from it.

The program of work triggered by Kahneman and Tversky has now grown into a massive catalogue of heuristics and associated biases. Humans are now labelled “predictably irrational.”

Gigerenzer takes a different view of human decision making. He argues that although simple heuristics often yield “biased” decisions, they can deliver better answers than more complex, information-hungry methods. This is particularly the case in uncertain or complex environments, or where there is only a small sample from which to draw conclusions. People implement these heuristics through their gut feelings, with the selection of the appropriate heuristic largely an unconscious process.

Consider the gaze heuristic: when catching a ball, fix your gaze on it and run so that the image of the ball moves in a straight line at constant speed in your gaze. Following this rule will lead you to where the ball will land.

The gaze heuristic can lead someone to run in an indirect or curved line. They don’t run straight to where they should wait for the ball’s arrival. The movement appears inefficient, but it allows a complex calculation to be replaced with something tractable for a human.

Gigerenzer and colleagues have generated a substantial body of evidence that humans use these simple heuristics, often to great effect. Humans, and dogs, use the gaze heuristic. Laypeople and amateur players used the recognition heuristic to pick Wimbledon match winners in 90 percent of the cases in which it could be applied: if you recognize one of two players and not the other, predict that the one you recognize will win. Similarly, people judging the relative size of two cities used the recognition heuristic in 90 percent of the cases in which they could.
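
To make the rule concrete, here is a minimal Python sketch (the recognized set and the player names are placeholders of my own, not data from the studies above): when exactly one of two options is recognized, the heuristic predicts that option; when both or neither is recognized, it simply does not apply and some other strategy is needed.

```python
from typing import Optional

def recognition_heuristic(a: str, b: str, recognized: set[str]) -> Optional[str]:
    """Predict the higher-ranking option, or None if the heuristic can't be applied."""
    a_known, b_known = a in recognized, b in recognized
    if a_known and not b_known:
        return a
    if b_known and not a_known:
        return b
    return None  # both or neither recognized: fall back to another strategy

# A hypothetical layperson who has heard of only one player.
recognized_players = {"Federer"}
print(recognition_heuristic("Federer", "Bedene", recognized_players))  # Federer
print(recognition_heuristic("Bedene", "Struff", recognized_players))   # None
```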

Like Kahneman and Tversky’s, Gigerenzer’s work has generated a legacy of researchers exploring the power of “fast and frugal” heuristics and how they are used by humans. In this view, humans, through their use of simple heuristics, are quite smart.

Replacing humans with algorithms

A shallow reading of the differing positions of Kahneman, Tversky, Gigerenzer, and friends can lead to contrasting perspectives on whether you should use algorithms to replace human judgment. Should we replace the biased humans with unbiased algorithms? Or will the use of fast and frugal heuristics stand the humans in good stead?

As I catalogued in a previous Behavioral Scientist article, the evidence accumulated to date on the decision-making competition between algorithms and humans is relatively one-sided. Since Paul Meehl’s book Clinical Versus Statistical Prediction, published in 1954, a large literature has emerged comparing the two. Spanning areas such as medical and psychiatric diagnosis, job performance, university admissions, and procurement, algorithms typically come out on top.

For example, William Grove and his colleagues looked at 136 studies in medicine and psychiatry in which algorithms had been compared with expert judgment. In 63 of these studies the algorithm was superior, and in 65 there was a tie. That left just 8 studies in which the human was the better option.

So how should we reconcile a view of good human decision-making using simple heuristics with the apparently straightforward picture of the superiority of algorithms?

A good starting point is to recognize that many of these algorithms are simple.

The power of simplicity

Revisiting Meehl’s Clinical Versus Statistical Prediction provides an illustration. Many of the algorithms competing against the clinicians involved a simple tally of the number of factors for or against a certain diagnosis. The most complicated methods involved regression, with the calculations likely done by hand.

We see a similar pattern across most of the literature comparing human judgment to algorithms. It is not state-of-the-art machine learning and artificial intelligence being used to make the algorithmic judgments, although those examples are also becoming more common. Rather, simple actuarial and statistical techniques, and often informal techniques, generate these results.

In a 1979 paper, Robyn Dawes demonstrated why these simple methods can work, showing the power of “improper linear models”: linear models whose weights are not estimated from the data but set by some nonoptimal rule, such as weighting every cue equally. These models can still outperform clinical judgment and, in the areas Dawes examined, performed surprisingly well relative to models with weights derived from the data.
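
As a rough illustration of Dawes’s point, the sketch below pits regression weights estimated from a small training sample against an “improper” model that simply standardizes the cues and adds them up with equal weights, scoring both on held-out data. The data are synthetic and the sample sizes are my own illustrative choices, not Dawes’s datasets; exact numbers will vary with the setup, but with small samples and cues that all point in the same direction the equal-weight model tends to hold its own.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, n_cues=4):
    X = rng.normal(size=(n, n_cues))
    true_w = np.array([0.5, 0.4, 0.3, 0.2])         # every cue points the same way
    y = X @ true_w + rng.normal(scale=1.0, size=n)  # noisy criterion
    return X, y

X_train, y_train = make_data(20)      # small sample, as in many judgment settings
X_test, y_test = make_data(5000)

# "Proper" model: regression weights estimated from the small training sample.
design = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(design, y_train, rcond=None)
pred_regression = np.column_stack([np.ones(len(X_test)), X_test]) @ beta

# "Improper" model: standardize each cue and add them up with equal weights.
mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
pred_equal = ((X_test - mu) / sd).sum(axis=1)

for name, pred in [("estimated weights", pred_regression),
                   ("equal weights", pred_equal)]:
    r = np.corrcoef(pred, y_test)[0, 1]
    print(f"{name:>17}: out-of-sample correlation with criterion = {r:.2f}")
```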

Gigerenzer’s work has also demonstrated the power of simple methods. For instance, in Simple Heuristics That Make Us Smart, Jean Czerlinski, Gigerenzer, and Daniel Goldstein describe a competition between some simple heuristics and the more complex multiple regression; both were to predict outcomes across 20 environments, such as school dropout rates and fish fertility.

One competitor in their competitions was “Take the Best,” which works through cues in order of their validity in predicting the outcome. For example, suppose you want to know which of two schools has the higher dropout rate, and attendance rate is the cue with the highest validity. If one school has lower attendance than the other, infer that it has the higher dropout rate. If the attendance rates are the same, move on to the next cue.
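
A minimal sketch of that procedure, with made-up cue names, cue ordering, and school profiles (not Czerlinski and colleagues’ data): cues are checked in order of validity, and the first cue that discriminates between the two options decides; every remaining cue is ignored.

```python
# Cues ordered from highest to lowest validity (a hypothetical ordering).
# Each cue is coded 1 if it points toward a higher dropout rate, 0 otherwise.
CUES = ["low_attendance", "high_staff_turnover", "large_class_size"]

def take_the_best(school_a: dict, school_b: dict) -> str:
    """Infer which school has the higher dropout rate."""
    for cue in CUES:
        a, b = school_a[cue], school_b[cue]
        if a != b:                       # the cue discriminates: stop searching
            return school_a["name"] if a > b else school_b["name"]
    return "no inference"                # no cue discriminates

school_a = {"name": "School A", "low_attendance": 1,
            "high_staff_turnover": 0, "large_class_size": 1}
school_b = {"name": "School B", "low_attendance": 1,
            "high_staff_turnover": 1, "large_class_size": 0}

# Attendance does not discriminate here, so the second cue decides.
print(take_the_best(school_a, school_b))   # School B
```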

Depending on the precise specifications, the result of the competition was either a victory for Take the Best or at least equal performance with multiple regression. This is impressive for something that is less computationally expensive and ignores much of the data (in other words, is biased).

There are some differences between the algorithms typically involved in the comparisons with humans and the heuristics explored by Gigerenzer and friends. The primary one is that some of Gigerenzer and friends’ heuristics are even simpler and often exclude information. Rather than tallying all cues, as in Dawes’s approach, Take the Best looks at cues only until it finds one that discriminates. The recognition heuristic relies on a lack of knowledge to work. In some cases, “less is more.” But despite these differences, the simplicity of both is a stark contrast to the advanced methods that get many people excited today.

So what is it about simple algorithms that gives them an edge over humans using simple heuristics? At least part of the answer lies in what Kahneman and friends have recently labelled “noise.”

Noise

Although humans have been labelled as predictably irrational, much human decision-making can also be characterized as inconsistent. There is noise in the decisions.

When presented with the same information on different occasions, people often draw different conclusions. For instance, nine radiologists who judged the malignancy of the same cases of gastric ulcers on separate occasions had correlations of between 0.60 and 0.92 with their own earlier judgments, contradicting themselves around 20 percent of the time. A group of experienced software professionals asked to estimate the effort required for a software task differed by as much as 71 percent when estimating the same task a second time, with a correlation of 0.7 between their own estimates. This inconsistency tends to increase when we examine decisions across different people.
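
To see what these figures are measuring, here is a small simulation (invented ratings, not the radiology or software-estimation data): one judge rates the same thirty cases on two occasions, and we compute the test-retest correlation along with how often the ordering of a pair of cases flips between occasions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases = 30
true_severity = rng.normal(size=n_cases)      # the cases' underlying ordering
noise = 0.6                                   # judgment noise on each occasion
ratings_1 = true_severity + rng.normal(scale=noise, size=n_cases)
ratings_2 = true_severity + rng.normal(scale=noise, size=n_cases)

test_retest_r = np.corrcoef(ratings_1, ratings_2)[0, 1]

# How often does the judge reverse the ordering of a pair of cases?
flips = pairs = 0
for i in range(n_cases):
    for j in range(i + 1, n_cases):
        pairs += 1
        if (ratings_1[i] - ratings_1[j]) * (ratings_2[i] - ratings_2[j]) < 0:
            flips += 1

print(f"test-retest correlation: {test_retest_r:.2f}")
print(f"pairwise orderings reversed: {flips / pairs:.0%}")
```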

Algorithms, in contrast to humans, are typically consistent, returning the same decision each time. This difference in consistency is so marked that models of human decision makers, developed through a method called bootstrapping (not to be confused with the statistical technique of the same name), typically outperform the decision makers they are modeled upon. For example, in one study, models developed from the decisions of clinical psychologists tended to outperform most of those same psychologists in differentiating psychotic from neurotic patients.
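
A toy version of that bootstrapping idea, using a simulated judge rather than the clinical data described above: regress the judge’s own ratings on the cues, then apply the resulting “model of the judge” consistently and compare both against the true outcome. Because the model applies the judge’s implicit policy without the case-to-case noise, it typically tracks the outcome at least as well as the judge it was estimated from.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, n_cues = 200, 2000, 3

true_w = np.array([1.0, 0.8, 0.5])    # how the outcome actually depends on the cues
judge_w = np.array([0.9, 0.9, 0.3])   # the judge's (imperfect) weighting of the cues

def judge(X):
    # The judge applies a sensible policy, but inconsistently (added noise).
    return X @ judge_w + rng.normal(scale=1.0, size=len(X))

X_train = rng.normal(size=(n_train, n_cues))
X_test = rng.normal(size=(n_test, n_cues))
y_test = X_test @ true_w + rng.normal(scale=1.0, size=n_test)

# "Bootstrap" the judge: regress the judge's ratings on the cues.
design = np.column_stack([np.ones(n_train), X_train])
beta, *_ = np.linalg.lstsq(design, judge(X_train), rcond=None)
model_of_judge = np.column_stack([np.ones(n_test), X_test]) @ beta

for name, pred in [("judge", judge(X_test)), ("model of judge", model_of_judge)]:
    r = np.corrcoef(pred, y_test)[0, 1]
    print(f"{name:>15}: correlation with outcome = {r:.2f}")
```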

This evidence of the superiority of the mechanical application of simple models, even models constructed from the decisions of the experts themselves, suggests that humans don’t use these simple models for many decisions, or at best use them inconsistently in the environments in which these judgments are typically made.

What might we take from the above?

Modern discussions of whether humans will be replaced by algorithms typically frame the problem as a choice between humans on one hand or complex statistical and machine learning models on the other. For problems such as image recognition, this is probably the right frame. Yet much of the past success of algorithms relative to human judgment points us to a third option: the mechanical application of simple models and heuristics.

Simple models appear more powerful when removed from the minds of humans and implemented in a consistent way. The chain of evidence that simple heuristics are powerful tools, that humans use these heuristics, and that these heuristics can make us smart does not bring us to a point where humans outperform simple heuristics or models consistently applied by an algorithm.

Humans are inextricably entwined in developing these algorithms, and in many cases provide the expert knowledge of what cues should be used. But when it comes to execution, taking the outputs of the model gives us a better outcome.

There is one thread from the work on simple heuristics that suggests we might be able to improve algorithms’ performance even further: considering whether the algorithms could be even simpler. While unweighted tallying of all cues can be effective, simpler approaches that exclude cues, such as Take the Best, might perform better still in some circumstances. This is an empirical question worth examining.

I must now admit that there is one territory in which we need to further explore this question before giving all credit to the algorithms. This is one of the territories in which the simple heuristics that humans use work best: unstable environments. Most of the competitions between algorithms and humans involve stable environments with at least a modicum of data; there can be no structural change in the environment to throw the algorithm off. In an uncertain world, do simple heuristics in the minds of humans outperform more mechanical approaches?

As I suggested at the end of my previous article, that needs to be the subject of another discussion. But even with that territory underexplored, we should learn from our use of simple heuristics and their power in the environments in which we make many of our most important decisions. We can also learn from the benefits of a more mechanical application. We can use simple heuristics to make algorithms smart.