Dangers of Monocausality
“Monocausality…has always been a seductive way of looking at the world. It has always been a simplistic one, too. The world is complex. So are people and their motives. The job of journalism is to take account of that complexity, not simplify it out of existence through the adoption of some ideological orthodoxy.”
-Bret Stephens, journalist at The New York Times, on the proper role of journalism
In this post, I’ll argue the job of a good analyst is to build algorithms that make accurate judgments amidst complexity. Doing so requires a full characterization of the environment, not simply a distillation of its complexity down to a single correlated, and often convenient, element.
Correlation does not imply causation.
Those even remotely familiar with data analytics have heard that phrase, yet it’s so easy to fall victim to it! That is, to think we’ve found a causal relationship when really it’s just a coincidence. So what is correlation? Statistical correlation is the degree to which two variable quantities move together, and its strength is measured by a correlation coefficient.
Causation, on the other hand, is that which wholly or partially catalyzes, or causes, a phenomenon. Correctly identifying causation is where the money’s made.
Science uses correlation when studying the natural world; the advancement of mankind has relied upon the rigorous testing of these correlations to glean the truth about our environment. Humans are pattern-seekers, and it’s in our nature to search for the truth in the patterns we notice, but in doing so we are predisposed to claiming the pattern (i.e., the correlation) we’ve observed is the true cause of an outcome. Assuming causation from correlation is the error of hastily stating that an effect (the dependent variable) is the direct result of a stated cause (the independent variable).
To determine whether the pattern you’ve observed has any predictive power in determining the event outcome, you must calculate the correlation coefficient (for details check out this video). For the humble purposes of this post, let’s only consider positive correlation coefficients between 0 (no correlation) and 1 (perfect correlation). A correlation coefficient of 0.75 says only that the two variables have a strong linear association: when one sits above its average, the other tends to sit above its average as well. It says nothing about which variable drives the other, or whether some third factor drives both. Only when it can be safely assumed all elements of a system are included in the model can we consider causality.
We need to be discerning when empirically observing a patterned outcome. Do not make a causal rush to judgment.
We’re primed to treat context as causal.
Distilling the supposed cause of a complex outcome down to a single input variable (monocausality, to use Mr. Stephens’ term) is almost always incorrect. And even when this oversimplification seems like a good fit for the environment under observation (whether that be society at large or the NFL landscape), the “goodness of fit” is more likely a mirage created by inadequate sampling or some other statistical fallacy.
A better use of correlation is to combine multiple elements of the system, each with a sufficiently high correlation to your desired outcome, into a single model. But unfortunately, that’s not what the average decision-maker does. Rather, as the introduction to our Week 9 2020 picks outlined, overweighting salient contextual elements steers even the savviest bettor away from following the process.
So then, how can one resist the influence of salient contextual information? By using the base rate to advise the likelihood of an outcome occurring.
The base rate is the measure of past outcomes unconditional on…well…salient contextual information! In other words, it answers the question, “How often does that happen?” absent the specific context. Using contextual information without a thorough understanding of its causal link to the outcome in question is where casual decision-makers go wrong.
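In code, a base rate is nothing more than an unconditional frequency. A minimal sketch, using a hypothetical game log (the numbers are invented for illustration):

```python
# Minimal sketch of a base rate: the unconditional frequency of an outcome,
# ignoring all contextual information. The game log below is hypothetical.
passing_yards = [276, 331, 385, 421, 302, 295, 360, 249, 404, 315]

threshold = 400
base_rate = sum(1 for y in passing_yards if y >= threshold) / len(passing_yards)
print(base_rate)  # 0.2 -> over 400 yards in 2 of 10 games
```

No opponent rankings, no weather, no injury reports: just “how often has that happened?”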
A good analyst charged with forecasting the likelihood of an outcome would benefit greatly from using the base rate to assign a probability rather than getting distracted by contextual information. For example, would you take the following bet:
+200 odds (meaning you win $200 for every $100 wagered), Patrick Mahomes OVER 400 passing yards against the 32nd-ranked pass defense?
On one hand, someone without knowledge of the base rate would likely be tempted to take this wager, especially if they assume it is a 50/50 outcome. On the other hand, maybe they wouldn’t take the bet if armed with the following base rate information:
Mahomes has averaged 330 passing yards per game this season, and his game-by-game totals are approximately normally distributed with a standard deviation of 70 yards.
A quick calculation on the base rate information suggests Mahomes reaches 400 passing yards in roughly 16% of games (400 yards sits one standard deviation above his 330-yard average). Meanwhile, +200 odds imply a breakeven probability of about 33%. However, since he’s attained his season-long passing metrics against better-ranked pass defenses, the contextual information might reasonably raise the 16% base rate to, say, 33%. Now the +200 odds seem fairly priced, if not slightly underpriced, for the base rate-adjusted likelihood, a far cry from the 50/50 coin flip a casual bettor might assume.
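That arithmetic is easy to verify with Python’s standard library (the NormalDist class, available since Python 3.8), using the mean and standard deviation stated above:

```python
from statistics import NormalDist

# Base rate model from the post: mean 330 passing yards, SD 70, ~normal.
mahomes = NormalDist(mu=330, sigma=70)

# P(passing yards > 400): 400 is one SD above the mean.
p_over_400 = 1 - mahomes.cdf(400)
print(round(p_over_400, 3))  # 0.159, i.e. roughly 16%

# Breakeven probability implied by +200 American odds:
# you risk 100 to win 200, so you must win 100 / (200 + 100) of the time.
implied = 100 / (200 + 100)
print(round(implied, 3))  # 0.333, i.e. roughly 33%
```

The gap between those two numbers, 16% versus 33%, is exactly the space the salient contextual information has to fill before the bet breaks even.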
To summarize: avoid the temptation to assume a single contextual element is the cause of an outcome unless it can be statistically proven, and always use the base rate to inform your prior estimation of an outcome.