Wherefore Art Thou?

Book Review:

Judea Pearl and Dana Mackenzie, "The Book of Why: The New Science of Cause and Effect", Penguin Books, 2019.

If you have read the LessWrong Sequences, you know that Eliezer Yudkowsky is a big fan of Judea Pearl -- he references Pearl in this, this, and this article among others. Someone even made this picture of Eliezer, holding Judea Pearl's book Causality:

If you're unfamiliar with what I'm talking about, let's just say that Pearl's work can be used to understand how people irrationally update their beliefs, and what they should do instead. His book Probabilistic Reasoning in Intelligent Systems is referenced by folks like MIRI in the context of getting AI to learn from probabilistic models of the world. He has won the Turing Award, the most prestigious award in computer science. Sadly, he is also known as the father of Daniel Pearl, a journalist who was killed by terrorists in Pakistan in 2002.

Judea Pearl's latest book, and the one I'm reviewing here, is titled The Book of Why. Unlike his previous books, it is aimed at a general audience and is relatively lighter on the math. It is written with a co-author, Dana Mackenzie, who is a mathematician and science writer. However, for the most part it is written in the voice of Pearl himself, so without intending to minimize the contributions of Mackenzie, I shall henceforth refer to Pearl as the author.

The Preface of the book declares three purposes: (i) to explain the so-called "Causal Revolution" in layman's terms; (ii) to discuss the history of cause-and-effect questions in science; and (iii) to suggest ways in which robots could one day talk to us in the language of cause and effect. The rest of the book consists of an introductory chapter and ten more chapters spread across 370 pages, followed by the usual acknowledgements, notes, bibliography and index. Allow me to summarize.

***

Introduction

Causal inference is something we humans naturally do, yet until the "Causal Revolution" of the last two or three decades, scientists lacked a mathematical framework to talk about even simple cause-and-effect relationships. Students of classical statistics were discouraged from asking causal questions (unless they were dealing with a randomized controlled trial), because "correlation is not causation". A large part of the problem was that a symbolic expression like P(L|D) only summarizes observations, e.g. the probability that a patient reaches a certain lifespan L given that they were seen to take a drug D. It cannot distinguish seeing from active intervention -- and therefore it cannot tell us whether a falling barometer reading causes the atmospheric pressure to change or the other way around. Judea Pearl solves this with a new operator, do(): P(L|do(D)) is the probability of L when we intervene to make the patient take the drug. He pairs it with causal diagrams. These tools allow us to ask counterfactual questions as well: for instance, "if we had not vaccinated our children, how many would have died?" This kind of reasoning is important to Pearl, because he believes that a strong artificial intelligence will need to understand causality.

Chapter 1

These three steps of learning -- association, intervention, and imagining counterfactuals -- make up what Pearl refers to as the Ladder of Causation. Today's AI, along with most animals, is stuck on the bottom rung of the ladder, i.e. it learns only by seeing how variables are related. Human toddlers, as well as early human ancestors who used tools, can experiment and plan -- they can do X to make Y happen. But if you can imagine worlds that don't exist, where you had acted differently than you did, you are at the top of the ladder. As a counterfactual learner you can understand the reasons behind phenomena by forming models of the underlying processes. According to the authors, this emphasis on mental models fits nicely with Yuval Noah Harari's theory that imagination helped Homo sapiens to conquer the planet. As an example of causal inference using diagrams, consider the following:

The diagram at the top represents the information that the court has ordered the execution of a prisoner, that the captain of a firing squad has received the order and instructed his two soldiers to shoot, and that the outcome is a dead prisoner. The bottom diagram asks a counterfactual question: what if Soldier A decided not to fire? Well, in this case the prisoner would still be dead due to Soldier B. While this is a simple example, causal diagrams can be applied to complex questions like the effect of a higher minimum wage on unemployment, or the harms and benefits of vaccination against smallpox. One important thing to note about causality is that, while it may involve probabilities, it is not reducible to probabilities.
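The firing-squad story can be written down as a tiny structural model. The following is a minimal sketch, assuming every arrow in the diagram is a deterministic "obeys" relationship; the function name and its parameters are hypothetical, chosen only for this illustration.

```python
def firing_squad(court_order, a_fires=None, b_fires=None):
    """Return True if the prisoner dies.

    Each soldier normally follows the captain, who follows the court.
    Passing a_fires or b_fires overrides that soldier's decision, which
    models an intervention that cuts the arrow from the captain.
    """
    captain_signals = court_order
    a = captain_signals if a_fires is None else a_fires
    b = captain_signals if b_fires is None else b_fires
    return a or b


# What actually happened: the court gave the order.
observed = firing_squad(court_order=True)

# Counterfactual: same court order, but Soldier A decides not to fire.
counterfactual = firing_squad(court_order=True, a_fires=False)

print(observed, counterfactual)  # True True
```

Overriding Soldier A leaves the captain and Soldier B untouched, which is why the prisoner still dies in the counterfactual world.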

Chapter 2

Unfortunately, the history of statistics has been dominated by people who tried to express the concept of causation purely in the language of probability. Some folks, including Francis Galton, Karl Pearson and Ronald A. Fisher, have even argued that causation is nothing more than a special case of correlation, and that tables of data are all there is to science. Of course, they struggled to deal with spurious correlations and confounders. Meanwhile, Sewall Wright's breakthrough work on path diagrams in the 1920s was neglected for fifty years.

Part of the reluctance toward path analysis came from the idea that it was too "subjective" (a similar critique was leveled at Bayesian statistics). But as Pearl hinted at in the introduction to the book, it was also because statisticians lacked a language to specify the assumptions behind causal diagrams.

Chapter 3

Bayesian networks have many practical applications, including spam filters, DNA-matching software, speech recognition, weather forecasting, and decoding cell phone transmissions. These tools work by induction from evidence (effect) to hypothesis (cause). Back in the 18th century a Presbyterian minister, the Reverend Thomas Bayes, laid the foundations for Bayesian networks by trying to solve the problem of "inverse probability". If the "forward probability" of a billiard ball stopping within x feet of the end of a pool table of length L is P(x|L) then the inverse probability is P(L|x), which is the harder question of assessing the length of the table given that the ball stopped x feet from the end. Bayes's rule says that P(A|B) P(B) = P(B|A) P(A). This famous equation tells us how to update our belief in a hypothesis while taking the context into account. (See also Eliezer Yudkowsky's explanation.) With Bayesian networks we can create causal diagrams, based on relationships such as A -> B -> C (a "chain"), A <- B -> C (a "fork") or A -> B <- C (a "collider"). These junctions are very important building blocks for moving up the Ladder of Causation, as we shall see.
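To make the inverse-probability question concrete, here is a small sketch of Bayes's rule applied to the table-length puzzle. It rests on assumptions that are not in the book summary above: the ball is taken to stop uniformly at random along the table, the prior over lengths is uniform, and the candidate lengths (6, 8 and 10 feet) are made up for illustration.

```python
from fractions import Fraction


def posterior_over_lengths(x, candidate_lengths):
    """P(L | x) for each candidate table length L, given the stopping point x.

    Assumptions (illustrative only): uniform prior over candidate_lengths,
    and a ball that stops uniformly along the table, so the likelihood of
    stopping at x is 1/L when x <= L and 0 otherwise.
    """
    prior = Fraction(1, len(candidate_lengths))
    joint = {
        L: (Fraction(1, L) * prior if x <= L else Fraction(0))
        for L in candidate_lengths
    }
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {L: p / evidence for L, p in joint.items()}


post = posterior_over_lengths(x=7, candidate_lengths=[6, 8, 10])
for length, prob in post.items():
    print(length, prob)
# 6 0
# 8 5/9
# 10 4/9
```

A 6-foot table is ruled out because the ball stopped 7 feet along, and the shorter of the remaining tables is favored because it makes that particular stopping point more likely.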

Chapter 4

Statisticians have always been plagued by the problem of confounding variables: when a supposed causal relationship between X and Y is biased by a "lurking" third variable Z which influences both X and Y. The traditional approach to deconfounding the measurements is to set up a randomized controlled trial; and if that is infeasible, to "adjust" for Z (by stratifying the data). Naturally, Judea Pearl believes that causal diagrams provide a superior approach. Firstly, they provide a much clearer definition of confounding -- a confounder is anything leading to a discrepancy between the observed P(Y|X) and the interventional P(Y|do(X)). Secondly, they give a solution in the form of the back-door criterion: just block every path between X and Y that starts with an arrow pointing into X. For example:

In the first setup, there is a "back-door path" between X and Y passing through B, so you want to control for B. You don't need to control for A, because it is a collider (i.e. no info flows through the junction). In the second graph, there are two back-door paths, but controlling for C would be sufficient to deconfound X and Y. In any of these diagrams, you need to be able to observe the variables in order to control for them.

Chapter 5

One case study that shows the importance of the confounding problem is the debate about whether smoking causes lung cancer. Esteemed statisticians like R.A. Fisher (whom we met in Chapter 2, and who also made valuable contributions to evolutionary biology) denied that there was a causal link between smoking and lung cancer, on the basis that no randomized controlled trial had been done, and that the association could be attributed to confounding variables such as a gene that makes people crave cigarettes and also somehow makes them more likely to get lung cancer. Of course, many scientists and doctors in the late 1950s and early 60s were smokers themselves (and some got paid consulting fees by tobacco companies). It was only after the U.S. surgeon general published a report in 1964 saying that smoking causes cancer (in men) that smoking rates began to decrease. So, how did they reach the conclusion? Well, in 1959 J. Cornfield and A. Lilienfeld published a paper arguing that if smokers have nine times the risk of developing lung cancer, and if a "smoking gene" was present among just 12 percent of nonsmokers, then it had to be nine times as common among smokers, i.e. present in 108 percent of them, to fully account for the association -- mathematically impossible. Then in 1963 the surgeon general's advisory committee listed a number of qualitative criteria that helped them judge the causal significance: consistency of results; strength of association; specificity of the effect; temporal relationship; and coherence with other types of evidence. While each of these criteria is flawed, the committee's report at least had an impact on public health. Naturally, Pearl suggests that if the language of causal diagrams had been available back then, the debate might have been shorter.
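The Cornfield argument boils down to one multiplication, sketched below. The function name is hypothetical, and the 11 percent case is an extra data point added only to show where the bound stops being feasible.

```python
from fractions import Fraction


def required_prevalence_in_smokers(risk_ratio, prevalence_in_nonsmokers):
    """Prevalence a hidden 'smoking gene' would need among smokers
    in order to explain away the whole observed risk ratio."""
    return risk_ratio * prevalence_in_nonsmokers


for percent in (11, 12):
    needed = required_prevalence_in_smokers(9, Fraction(percent, 100))
    feasible = needed <= 1  # a prevalence cannot exceed 100 percent
    print(percent, float(needed), feasible)
# 11 0.99 True
# 12 1.08 False
```

In other words, a confounder can only explain a ninefold risk if it is rare among nonsmokers and nearly universal among smokers, which is a much stronger claim than "some gene might be involved".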

Chapter 6

When we apply our causal intuitions to data, many quirky paradoxes arise. One of the more famous ones is the Monty Hall problem: suppose you partake in a contest where you can choose between three doors, behind one of which is a prize; when you pick a door, the host (who knows where the prize is) will open one of the other two doors, always one that does not lead to the prize. You then get the chance to switch or stay with your initial pick. What do you do? The intuition that many people share is that it shouldn't make a difference... yet if you switch doors, you win two-thirds of the time! The reason for this is that, although the host cannot read your mind, the rules of the game entail that he or she must open a door that is neither your door nor the one in front of the prize. The fact that the host opens one door and not the other is therefore evidence that the unopened door is more likely to lead to the prize. The lesson is that we need to take into account not just the data, but the data-generating process too. Another well-known puzzle is Simpson's paradox, which is illustrated in the following table:

What you can see is that the hypothetical drug increases the risk of heart attack for women and for men, but when looking at the total row it appears to decrease the risk of heart attack! Another example is as follows:

The situation here is that exercise seems harmful to the population as a whole, but paradoxically beneficial when we separate the data by age group. The solution to the first problem is to condition on gender because it is a confounder of the drug -> heart attack relationship (since men are at greater risk for heart attack and also may prefer not to take the drug). Therefore we first take the rate of heart attack for men and women separately and then take their average, weighted by each group's share of the population, to find that the drug is bad for everyone. For the second problem we take age as a confounder of exercise and cholesterol (since older folks may exercise more yet have more cholesterol). We therefore stratify the data by age and conclude that exercise is beneficial to everyone. As Judea Pearl emphasizes, the correct approach will depend on the causal story behind the data. Sometimes it may be more appropriate to aggregate the data, at other times to partition it. Once again the structure of causal diagrams is illuminating.
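The drug example can be reproduced with a few lines of arithmetic. The counts below are illustrative assumptions, picked so that the reversal shows up; they are not taken from the table referred to above. The adjusted rate is the back-door formula with gender as the adjustment variable: the sum over genders of P(heart attack | arm, gender) times P(gender).

```python
from fractions import Fraction

# (heart attacks, group size) per gender and arm.
# Illustrative counts, assumed for this sketch only.
counts = {
    "women": {"control": (1, 20), "drug": (3, 40)},
    "men": {"control": (12, 40), "drug": (8, 20)},
}


def naive_rate(arm):
    """Heart-attack rate in one arm, ignoring gender (the 'total' row)."""
    attacks = sum(counts[g][arm][0] for g in counts)
    total = sum(counts[g][arm][1] for g in counts)
    return Fraction(attacks, total)


def adjusted_rate(arm):
    """Gender-specific rates, weighted by each gender's population share."""
    population = sum(n for g in counts for (_, n) in counts[g].values())
    rate = Fraction(0)
    for g in counts:
        group_size = sum(n for (_, n) in counts[g].values())
        attacks, n = counts[g][arm]
        rate += Fraction(attacks, n) * Fraction(group_size, population)
    return rate


print(float(naive_rate("control")), float(naive_rate("drug")))
print(float(adjusted_rate("control")), float(adjusted_rate("drug")))  # 0.175 0.2375
```

With these numbers the pooled data make the drug look protective (11 in 60 versus 13 in 60), while the adjusted rates show it raising the risk from 17.5 to 23.75 percent. Whether adjusting is the right move still depends on the causal diagram: it is appropriate here only because gender is assumed to be a confounder.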

Chapter 7

Here we get to the real meat of The Book of Why. This chapter explains three shortcuts that help us answer questions about intervention: (i) back-door adjustment; (ii) front-door adjustment; and (iii) instrumental variables. The back-door method is the simplest, and it is basically what we did above with the drug example. However, it requires that you have data about deconfounders. An alternative is the front-door criterion, which uses observational data to estimate the effect of variable X on Y in the following situation:

It is important that the mediator be "shielded" from the confounder. For example, imagine that X represents whether people signed up for a job-training program, mediator M indicates whether they actually showed up, Y refers to their earnings after the program, and the confounder C is their motivation. To use front-door adjustment, you simply combine the average causal effect of X -> M and M -> Y. But you need to ensure that people only failed to show up for reasons unrelated to their motivation.
The third method is to introduce an instrumental variable into the diagram, as illustrated below:

Assuming that there is no confounder of Z and X (and that Z has no direct arrow to U or Y), the observed effect of Z on X is the path coefficient a, and the observed effect of Z on Y is the product ab, since Z can only reach Y through X. Dividing ab by a gives b, the causal effect of X on Y.
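A small simulation can show why the division works. This is a sketch under assumptions of my own: all relationships are linear with Gaussian noise, the coefficient values (a = 2, b = 0.5, and a confounder strength of 3) are arbitrary, and the helper names are hypothetical.

```python
import random


def simulate(n=50_000, a=2.0, b=0.5, seed=0):
    """Draw (z, x, y) samples from Z -> X -> Y with a hidden confounder U."""
    rng = random.Random(seed)
    zs, xs, ys = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)  # unobserved confounder of X and Y
        z = rng.gauss(0, 1)  # instrument: affects Y only through X
        x = a * z + u + rng.gauss(0, 1)
        y = b * x + 3.0 * u + rng.gauss(0, 1)
        zs.append(z)
        xs.append(x)
        ys.append(y)
    return zs, xs, ys


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (covariance / variance)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


zs, xs, ys = simulate()
naive = slope(xs, ys)                 # biased upward by the confounder U
iv = slope(zs, ys) / slope(zs, xs)    # (a * b) / a, an estimate of b
print(round(naive, 2), round(iv, 2))
```

With these settings the naive regression of Y on X lands near 1.0, about double the true effect, while the instrumental-variable ratio recovers a value close to the true b = 0.5.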
These shortcuts let us estimate causal effects without running experiments. Using the do-calculus, we can verify mathematically whether these procedures are applicable. If you can reduce P(Y|do(X)) to an expression that contains no do-operators (P(Y|X) in the simplest case) via legitimate manipulations, then you can estimate the effect of the intervention from observational data alone, and so climb up the Ladder of Causation. What counts as "legitimate" depends on a set of technical rules, which I shall not elaborate on, but I leave them here in case you are curious:

P(Y|do(X), Z, W) = P(Y|do(X), Z) if W is irrelevant to Y
P(Y|do(X), Z) = P(Y|X, Z) if Z blocks all back-door paths from X to Y
P(Y|do(X)) = P(Y) if there is no causal path from X to Y

Chapter 8

Now we have finally arrived at the issue of counterfactuals, which Pearl considers to lie at the heart of causality. As he writes:

"Responsibility and blame, regret and credit: these concepts are the currency of a causal mind. To make any sense of them, we must be able to compare what did happen with what would have happened under some alternative hypothesis." (p. 260)

The Scottish philosopher David Hume suggested in 1748 that a cause is "an object followed by another [...] where, if the first object had not been, the second never had existed". But how do people generate alternative worlds in their heads? Judea Pearl thinks our minds probably use a shortcut akin to structural models. Consider the following causal diagram:

Imagine that Alice has 6 years of experience, a high school degree, and a salary of $81,000. We could ask: what would her salary be if she had a college degree? If we have data about other employees at the firm, and if we assume a linear relationship between the variables, we can use a pair of functions (one for salary and one for experience) to estimate a would-be salary of $76,000. The estimate comes out lower than her actual salary because, in the model, the years spent in college would have cost her work experience. This is similar to linear regression, with the crucial difference that the structural causal model stipulates that salary "listens to" education and experience, and experience listens to education, but experience does not listen to salary.
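Here is a sketch of how such a counterfactual could be computed in three steps: infer Alice's individual quirks from what we observed, change her education, then recompute everything downstream. The coefficients in the two equations are assumptions of mine, chosen so that the result agrees with the figures quoted above; education is coded as 0 for high school and 1 for college.

```python
# Assumed structural equations (illustrative coefficients):
#   experience = 10 - 4 * education + u_ex
#   salary     = 65_000 + 2_500 * experience + 5_000 * education + u_s


def experience_eq(education, u_ex):
    return 10 - 4 * education + u_ex


def salary_eq(experience, education, u_s):
    return 65_000 + 2_500 * experience + 5_000 * education + u_s


def counterfactual_salary(education, experience, salary, new_education):
    # Step 1 (abduction): recover this person's idiosyncratic terms.
    u_ex = experience - experience_eq(education, 0)
    u_s = salary - salary_eq(experience, education, 0)
    # Step 2 (action): set education to its counterfactual value.
    # Step 3 (prediction): experience "listens to" education, so update it
    # first, then recompute salary with the same idiosyncratic terms.
    new_experience = experience_eq(new_education, u_ex)
    return salary_eq(new_experience, new_education, u_s)


alice = counterfactual_salary(
    education=0, experience=6, salary=81_000, new_education=1
)
print(alice)  # 76000
```

A plain regression would hold experience fixed at 6 years and simply add the education bonus; the structural model instead lets the change in education propagate to experience before salary is recomputed.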

Chapter 9

One application of counterfactuals is mediation analysis. An example is the question, "why do citrus fruits prevent scurvy?" Here, we are looking for a mechanism, such as vitamin C. But mediation analysis also lets us distinguish between the direct and indirect effects of an intervention. For example, consider a situation where a person is deciding whether or not to take a job offer, and will only accept it if the salary exceeds a certain threshold. In the following causal diagram with path coefficients, we assume linear effects, such that a one-unit increase in Education implies a two-unit increase in Skill (the mediator), and so on.

For Education = 1, the Salary is determined as (7 * 1) + (3 * 2) = 13. Suppose our candidate accepts offers above 10; then the total effect of Education on Outcome in this case is one (since Outcome is either one or zero, and with Education = 0 the Salary is zero). Now, if we set Education to zero but keep Skill = 2 as if Education were one (thus, a counterfactual) then Salary goes to 2 * 3 = 6 and the offer gets turned down, for a natural indirect effect of zero. (Pearl uses the word "natural" to clarify that this is not an experiment.) If we set Education to one but hold Skill at zero, then Salary = 7 and the controlled direct effect on Outcome is zero. Alternatively, holding Skill constant at 2 would give us the same result as the total effect, namely a Salary of 13 and Outcome one. So the question is: what value should we hold the mediator at? Pearl argues that it is more natural and intuitive to use the former (Skill = 0), i.e. keeping Skill at the level it was before we changed Education from zero to one.
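The arithmetic above can be checked with a short sketch. The path coefficients (2, 3 and 7) and the threshold of 10 are the ones given in the text; the function names are mine.

```python
THRESHOLD = 10  # the candidate accepts offers strictly above this salary


def skill(education):
    return 2 * education


def salary(education, skill_level):
    return 7 * education + 3 * skill_level


def outcome(education, skill_level):
    """1 if the offer is accepted, 0 otherwise."""
    return 1 if salary(education, skill_level) > THRESHOLD else 0


baseline = outcome(0, skill(0))

# Total effect: change Education and let Skill respond.
total_effect = outcome(1, skill(1)) - baseline
# Natural direct effect: change Education, keep Skill at its Education=0 level.
natural_direct = outcome(1, skill(0)) - baseline
# Natural indirect effect: keep Education at 0, give Skill its Education=1 level.
natural_indirect = outcome(0, skill(1)) - baseline
# Controlled direct effect with Skill held at 2 for both Education levels.
controlled_direct_at_2 = outcome(1, 2) - outcome(0, 2)

print(total_effect, natural_direct, natural_indirect, controlled_direct_at_2)
# 1 0 0 1
```

Because of the threshold, neither path is enough on its own to push the salary above 10, so the direct and indirect effects are both zero even though the total effect is one.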

Chapter 10

The final chapter is titled "Big Data, Artificial Intelligence, and the Big Questions". According to the author, AI is the reason he got into causation in the first place. The problem with Big Data is that it cannot answer causal questions by itself -- we also need a model to interpret the data and data-generating process. AI could potentially help us with this, but it would need to understand causes and effects. A truly general AI should be able to hold meaningful conversations with us humans: learning from us, teaching us, and motivating us. Furthermore, it should be able to reflect on its actions and learn from past mistakes. But would such a machine have "free will"? Well, Pearl considers himself a compatibilist, meaning that he thinks the apparent clash between free will and determinism is a confusion between two different levels of description: the neural level and the cognitive level. Even if free will is an illusion, it still provides us with computational benefits, such as the ability to talk about intentions and agency. So, perhaps we should program future computers to have this illusion. Pearl ends on an optimistic note: he suggests that a moral robot will be "... better able to distinguish good from evil than we are, better able to resist temptation, better able to assign guilt and credit" (p. 370). Perhaps he makes it sound too easy.

***

The Bibliography at the end of The Book of Why includes not just a list of references, but also an annotated bibliography, ordered by chapter. The Book of Why is rich in illustrations and is well-structured -- two factors that I always like in nonfiction books. It is clear that Judea Pearl and Dana Mackenzie put a lot of consideration into the book. My biggest quibble would be the prose itself: it can sometimes be dry, and at times it is hard to follow the authors' train of thought. To be fair, the technical nature of the subject means that it was never going to be the easiest book, and I doubt I could have done better. I'd just suggest that if you pick this book up, you read it carefully and take your time with it. And I say this even though The Book of Why already simplifies the content of Pearl's other work! After this book, I'm more keen on reading Causality, but I'm also quite intimidated.

Overall, I think Pearl and Mackenzie succeed in their mission to explain the benefits of structural causal diagrams, the importance of causal reasoning in our understanding of the world, and the pitfalls that statisticians ran into when they abandoned causality. Furthermore, this is a message that more people need to hear, because people still have it drilled into their heads that "correlation is not causation", without being offered alternative tools to answer causal questions (other than RCTs and a bag of data-mining tricks).

I personally gave The Book of Why 5/5 stars on Goodreads, but other reviewers were disappointed by what they perceived to be an arrogant tone, or by a lack of real-world examples of Pearl's methods being used to solve problems. In his defense, Pearl does give credit to his influences, and the bibliography lists some contemporary texts in causal inference. Of course, you should make up your own mind. Peter McCluskey (aka "Bayesian Investor") has a review of the book on LessWrong, concluding that "Pearl is great at evaluating what constitutes clear thinking about causality".
