A while back, I wrote a piece where I took Tyler Cowen’s side in a debate with Scott Alexander [Siskind][1] about AI doomerism. I admit it’s not my best work; it didn’t quite come together the way I wanted. The point that I was most interested in making there was that Siskind was wrong to demand that Cowen put a probability number on his prediction that AI will not doom us all. Putting probability numbers on predictions makes them worse, not better. Cowen stuck his neck out: he told us that a certain event wouldn’t happen. Future events will either confirm or refute his prediction. Siskind claimed that there was a 1/3 chance that AI would doom us all. Whether the AI apocalypse happens or not, Siskind could claim to have been correct, since he assigned a non-zero probability to both outcomes. Cowen made a prediction. Siskind waffled, with numbers.
In general, the whole idea of “probabilistic forecasting” is just total horseshit, for this very reason. If you assign a non-zero probability to all outcomes, you can never be wrong. Perhaps the most well-known and controversial example of this is Nate Silver’s famous prediction (or “prediction”) that Donald Trump had a 29% chance to win in 2016. This was a substantially higher number than most other probabilistic forecasters gave Trump, but still sub-50%. So did Silver get it wrong? He argued he didn’t get it wrong: “Things that are 29% probable happen 29% of the time.” But if Trump had lost, he wouldn’t have gotten it wrong either. As a Popperian would say: probabilistic forecasts (or “forecasts”) are unfalsifiable.
Now, some sorts of probabilistic forecasts are falsifiable. If you are not talking about a particular event, but instead about a repeatable event type, then you can make a prediction about what proportion of events of that type will have a certain outcome. Coin flipping is the usual example here. You can predict that the coin will land heads with 50% probability if what you mean is that, over a sufficiently long run of flips, the coin will land heads (approximately) 50% of the time. But it makes no sense to make probabilistic forecasts for one-off events, like an AI apocalypse or the 2016 presidential election.
Silver and Siskind have a response ready to this argument, though. While we can’t evaluate the probabilities of any particular one-off event, we can evaluate probabilities as a whole by seeing whether or not we are well-calibrated. But “calibration” is not a particularly useful metric. In this post, I’ll explain why I don’t like it.
The basic idea is simple. First, make a large number of probabilistic predictions at all different levels of probability. Then organize them by how probable you considered each outcome to be: make a basket of “90% probable predictions,” a basket of “80% probable predictions,” and so on. Then see how often your predictions in each basket are correct. If 90% of the things that you say are 90% probable happen, then you are well-calibrated. If only 70% of the things that you say are 90% probable happen, then you are not well-calibrated. You’re overconfident at the 90% level. This is a really appealing way to attempt to validate your probabilistic forecasts. I have two major problems with it.
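To make the bookkeeping concrete, here’s a minimal sketch of the procedure in Python. The check_calibration helper and the toy forecasts are my own invention for illustration; they aren’t taken from anything Silver or Siskind actually publishes.

```python
from collections import defaultdict

def check_calibration(forecasts):
    """Group (stated probability, outcome) pairs by stated probability
    and compare each basket's hit rate to that probability."""
    baskets = defaultdict(list)
    for prob, happened in forecasts:
        baskets[prob].append(happened)
    return {prob: (sum(outcomes) / len(outcomes), len(outcomes))
            for prob, outcomes in sorted(baskets.items())}

# Toy example: True means the predicted event happened.
forecasts = ([(0.9, True)] * 9 + [(0.9, False)]
             + [(0.8, True)] * 4 + [(0.8, False)]
             + [(0.6, True)] * 3 + [(0.6, False)] * 2)
for prob, (hit_rate, n) in check_calibration(forecasts).items():
    print(f"stated {prob:.0%}: {hit_rate:.0%} of {n} predictions came true")
```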
Exogenous Dependencies
Start with a thought experiment. Let’s say that you have nine events that you assign a 90% probability to. As it turns out, all of those events will occur: 100% of the 90% probable events happen, so you’re underconfident at the 90% level. But before any of these outcomes are revealed, and before you’re in a position to realize that you’re underconfident at the 90% level, you consider a new proposition: the proposition that Leonard Nimoy will return from the grave to win the 2024 presidential election. You decide that this is 90% probable, so into the “90% probable” basket it goes. Nimoy, I’m sorry to say, does not return from the grave to win the 2024 presidential election. So now 90% of the events in your 90% basket happen. You are perfectly calibrated at 90%.
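Here’s the same arithmetic rendered with the check_calibration helper from the sketch above (again, just my own illustration):

```python
# Nine 90% predictions, all of which come true: 9/9 = 100% of the basket
# happens, so you look underconfident at the 90% level.
basket = [(0.9, True)] * 9
print(check_calibration(basket))    # {0.9: (1.0, 9)}

# Add the doomed Nimoy prediction: now 9/10 = 90% of the basket came true,
# and you are "perfectly calibrated" at the 90% level.
basket.append((0.9, False))
print(check_calibration(basket))    # {0.9: (0.9, 10)}
```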
This is weird for two reasons. First, by the logic of calibration, this is a vindication of your probabilistic forecast regarding Nimoy. Sure, he didn’t return from the grave, but things that are 90% probable fail to happen 10% of the time, and this was one of those times. We can’t evaluate individual events, sure, but we can evaluate classes of events, and this belongs to a perfectly-calibrated class of 90%-probable events. If we don’t want that forecast to be vindicated — and we don’t! — then the logic of calibration is defective. Assigning a high probability to impossible events is irrational in itself; we shouldn’t be evaluating that assignment by reference to how well you did with the probability assignments of a bunch of other, unrelated events.
Second, this is weird because assigning a 90% probability to a clearly false proposition constituted a strict improvement in your epistemic situation, according to the logic of calibration. You were underconfident at the 90% level before you made your insane Nimoy prediction — that is, you were epistemically defective in how you assigned probabilities to those nine true propositions. But once you made your insane Nimoy prediction, you became perfectly calibrated — that is, you were epistemically flawless in how you assigned probabilities to those ten predictions, of which nine were true and one false. In other words: making a bad prediction about one proposition improved your epistemic standing with regard to nine other unrelated propositions. That’s also crazy.
This example might be far-fetched, but it simply highlights a fundamental flaw in the idea of calibration: it makes how good a prediction is about one event depend on how good your predictions are about a class of other events. And those other events have nothing in common with the first event or with each other, other than the fact that you made similar probabilistic predictions regarding those events. That is weird and bad.
That is my main objection to calibration. Here’s another thing I worry about.
Basketing
The other problem with calibration is that it assumes that we can construct meaningfully large baskets of events with identical probabilities. When Siskind does his yearly calibration exercises (e.g.), all of his predictions are nice, round numbers: 60%, 70%, 80%… Yet does he really think that everything that he assigns a 70% probability to is 70% probable? None of them are 71% probable? 71.5%? 71.5235691619361%?
Assume that Siskind really thinks all of the events he lists as 70% probable are exactly 70% probable. If so, are we talking about objective or subjective probability? Either way, it’s weird! If we’re talking about objective probability, then Siskind is saying that all objective probabilities come in nice, even numbers, and nothing (or at least nothing that he considers) is 71% probable. If we’re talking about subjective probability, then Siskind is saying that he himself doesn’t make fine distinctions in probability: for him, if something is more than 70% probable, then it must be 80% probable; there’s no in between. Not only is this psychologically implausible, it’s inconsistent with Bayesianism, and Bayesianism is the system of choice for everyone who is interested in subjective probabilities. This very much includes Siskind (and Silver).
So assume that probabilities really can take a range of intermediate values; there are some things that are 71.5235691619361% probable. But how many things does Siskind think are 71.5235691619361% probable? It would be weird if it was more than one; that’s a super precise number! (If you think it might be more than one, add a few more decimal places to the example). But then we can’t evaluate whether or not 71.5235691619361% of the predictions made at the 71.5235691619361% level are true. There’s only one prediction, and it either happens or it doesn’t.
Perhaps we should round off and aggregate. Sure, some things are 71.5235691619361% probable, but that’s close enough to 70%, so it goes in the 70% basket. Yet if that’s what we’re doing, then I have more questions. How many baskets do you have, and why that many baskets? What ranges do they cover, and why those ranges? I struggle to see how anyone could have a principled answer to those questions. Yet if this process of rounding off and basketing is fundamentally arbitrary, then the calibration scores that result from evaluating those baskets will be equally arbitrary.
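To see how much the choice of baskets can matter, here’s a small sketch; the forecasts, the bin edges, and the calibration_error summary (a weighted average gap between each basket’s midpoint and its hit rate) are all invented for illustration. The same set of forecasts gets a noticeably different calibration score depending on how coarsely you round.

```python
import numpy as np

rng = np.random.default_rng(0)
# Invented example: 200 forecasts with stated probabilities spread between
# 0.55 and 0.95, and outcomes drawn so the forecaster is honest on average.
probs = rng.uniform(0.55, 0.95, size=200)
outcomes = rng.random(200) < probs

def calibration_error(probs, outcomes, n_bins):
    """Round forecasts into n_bins equal-width baskets over [0.5, 1.0] and
    return the average gap between each basket's midpoint and its observed
    hit rate, weighted by basket size."""
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    gaps, weights = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            midpoint = (edges[b] + edges[b + 1]) / 2
            gaps.append(abs(outcomes[mask].mean() - midpoint))
            weights.append(mask.sum())
    return np.average(gaps, weights=weights)

for n_bins in (2, 5, 10, 25):
    print(n_bins, "baskets:", round(float(calibration_error(probs, outcomes, n_bins)), 4))
```

Same forecasts, different basketing, different verdict.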
I don’t want to say that neither of these problems can be solved. But these look like really big problems. And until they are solved, the whole idea of calibration is suspect. And since calibration is our best account of how to evaluate “probabilistic forecasts” of one-off events, probabilistic forecasting is suspect as well.
Erratum: It was pointed out in the comments that I completely overlooked Brier scores as a way of measuring the quality of probabilistic forecasts. Oops! I still have problems with Brier scores, and thus with probabilistic forecasts, but they’re not nearly as foundational as my objections to calibration. So I concede that my ultimate conclusions about probabilistic forecasts are stronger than my arguments here support. I’m still right about calibration, though.
[1] Siskind is usually known by his pen name, “Scott Alexander,” but I feel awkward referring to him by his pen name when I know his full name after the NYT-doxxing scandal.
This reminds me a bit of the old joke, usually targeting economists, about people who ask: "it works in practice, but does it work in theory?" I think it's hard to look at Tetlock's forecasting tournaments, where some people are clearly reliably outperforming others, and so clearly have skills/habits that it's worth trying to understand/evaluate/reproduce, and to think there's no there there. Likewise with the difference between Nate Silver-style poll aggregation and its associated probabilistic forecasts vs. pundit-style one-off predictions, or even just plain old weather forecasting. Would you prefer your local meteorologist to just say "rain tomorrow" or "no rain tomorrow" to the probabilistic forecasts they do provide? My strong guess is that a hedge fund that took on board as an internal rule--"no probabilistic forecasting"--would go broke fast.
Big picture, I think it makes more sense to accept that there's clearly a robust, well-justified practice of probabilistic forecasting, and to then ask how we can evaluate probabilistic forecasts in light of the sorts of puzzles you raise above, rather than to treat those puzzles as genuinely threatening the coherence/justifiability of the practice.
On the dependency point, one natural move is to point out that nobody should think that calibration is the only desideratum in a set of probabilistic forecasts; accuracy matters too, and it's often easy to achieve calibration at the price of accuracy. E.g., suppose I'm trying to tell you how a series of coin tosses resulted, and it's not prediction; rather, I get to see how the coin landed, albeit from far away, so it's a demanding test of visual acuity. It's easy to achieve excellent calibration by just assigning each toss 50% probability of landing heads, 50% tails. But in that case I'm throwing out all the info I get from my eyes. If I try to use that info, I may end up straying from perfect calibration. There's a good chance that my best strategy if I care about accuracy (i.e., I don't care about calibration at all) is to aim to be as accurate as I can, and then, if I get feedback, see where I'm departing from perfect calibration and use that to improve my accuracy even more.
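To put numbers on that: always answering 50% is perfectly calibrated but throws away the visual information, and it loses on accuracy as measured by the Brier score (the mean squared error between the forecast probabilities and the 0/1 outcomes). The sketch below is just illustrative; the 80% visual accuracy is an invented figure.

```python
import random

random.seed(1)
tosses = [random.random() < 0.5 for _ in range(10_000)]   # True = heads

# Forecaster A ignores their eyes and says 50% every time:
# perfectly calibrated, completely uninformative.
forecasts_a = [0.5] * len(tosses)

# Forecaster B squints and correctly sees the outcome 80% of the time,
# reporting 80% confidence in whatever they think they saw.
forecasts_b = []
for heads in tosses:
    saw_heads = heads if random.random() < 0.8 else not heads
    forecasts_b.append(0.8 if saw_heads else 0.2)

def brier(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(outcomes)

print("always 50%:   ", round(brier(forecasts_a, tosses), 3))   # about 0.25
print("use your eyes:", round(brier(forecasts_b, tosses), 3))   # about 0.16
```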
I think that's pretty similar to what's going on in your Nimoy case; there's a way to get perfect calibration in that case, but it involves giving up on accuracy. In general, I think you should think of calibration as a means to accuracy, rather than as an end in its own right; if you know a set of forecasts is uncalibrated, and how, then you can produce a set of forecasts that is more accurate than it. So aiming for calibration is best understood as aiming for a set of forecasts that isn't obviously less accurate than some identifiable other set of forecasts.
When we assess the accuracy of probabilistic predictions we usually use a scoring rule like the Brier score. If you use such a rule in your Nimoy example, the score of your predictions worsens.
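Concretely, reusing the brier helper from the sketch above: the nine true 90% forecasts alone score a very good 0.01, and adding the failed 90% Nimoy forecast drags the score to 0.09, even though the 90% basket is now perfectly calibrated.

```python
# Nine 90% forecasts that all come true: each contributes (0.9 - 1)^2 = 0.01.
print(round(brier([0.9] * 9, [1] * 9), 4))           # 0.01

# Add the failed 90% Nimoy forecast, which contributes (0.9 - 0)^2 = 0.81:
# the Brier score gets nine times worse despite the "perfect" calibration.
print(round(brier([0.9] * 10, [1] * 9 + [0]), 4))    # 0.09
```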