9 Comments
Daniel Greco:

This reminds me a bit of the old joke, usually targeting economists, about people who ask: "it works in practice, but does it work in theory?" I think it's hard to look at Tetlock's forecasting tournaments, where some people are clearly and reliably outperforming others, and so clearly have skills/habits that are worth trying to understand/evaluate/reproduce, and to conclude there's no there there. Likewise with the difference between Nate Silver-style poll aggregation and its associated probabilistic forecasts vs. pundit-style one-off predictions, or even just plain old weather forecasting. Would you prefer your local meteorologist to just say "rain tomorrow" or "no rain tomorrow" rather than the probabilistic forecasts they actually provide? My strong guess is that a hedge fund that took on board an internal rule--"no probabilistic forecasting"--would go broke fast.

Big picture, I think it makes more sense to accept that there's clearly a robust, well-justified practice of probabilistic forecasting, and then to ask how we can evaluate probabilistic forecasts in light of the sorts of puzzles you raise above, rather than to treat those puzzles as genuinely threatening the coherence/justifiability of the practice.

On the dependency point, one natural move is to point out that nobody should think calibration is the only desideratum in a set of probabilistic forecasts; accuracy matters too, and it's often easy to achieve calibration at the price of accuracy. E.g., suppose I'm trying to tell you how a series of coin tosses resulted, and it's not prediction; rather, I get to see how each coin landed, albeit from far away, so it's a demanding test of visual acuity. It's easy to achieve excellent calibration by just assigning each toss 50% probability of landing heads, 50% tails. But in that case I'm throwing out all the info I get from my eyes. If I try to use that info, I may end up straying from perfect calibration. There's a good chance that my best strategy, if accuracy is all I care about (and calibration not at all), is to aim to be as accurate as I can and then, if I get feedback, see where I'm departing from perfect calibration and use that to improve my accuracy even more.
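A quick sketch of that coin-toss point in code (the 80% reliability of the far-away look is just an assumed figure): both strategies below come out roughly perfectly calibrated, but the one that uses the visual evidence does far better on accuracy.

```python
import random

random.seed(0)
N = 10_000

# Fair coin tosses, plus a noisy "seen from far away" observation that matches
# the true outcome 80% of the time (the 80% figure is an assumption).
tosses = [random.random() < 0.5 for _ in range(N)]
observations = [t if random.random() < 0.8 else not t for t in tosses]

# Strategy A: ignore the evidence and call every toss 50% heads.
blind = [0.5] * N
# Strategy B: trust the noisy observation and report 80% for whatever was seen.
eyes = [0.8 if obs else 0.2 for obs in observations]

def brier(forecasts, outcomes):
    """Mean squared distance from the truth; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

print("Always 50%:   ", brier(blind, tosses))  # ~0.25
print("Use your eyes:", brier(eyes, tosses))   # ~0.16, despite similar calibration
```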

I think that's pretty similar to what's going on in your Nimoy case; there's a way to get perfect calibration in that case, but it involves giving up on accuracy. In general, I think you should think of calibration as a means to accuracy, rather than as an end in its own right; if you know a set of forecasts is uncalibrated, and how, then you can produce a set of forecasts that is more accurate than it. So aiming for calibration is best understood as aiming for a set of forecasts that isn't obviously less accurate than some identifiable other set of forecasts.

Matt Lutz:

Thanks for such a thoughtful comment, Dan. Few philosophical objections are decisive, so I don't think I've conclusively refuted calibration or anything like that here. If I give calibrationists pause, then that's sufficient philosophical progress for me.

The best I can say in calibration's favor is that I can't help but feel that someone who is well-calibrated has attained an epistemic achievement of some kind. I'm inclined to question that feeling because I have larger conceptual worries about how the notion of probability is used.

I don't think that probabilities are particularly useful in forecasting, but a kind of causal analysis is. I'm certainly not in favor of pure vibe-based punditry, and I think that some of the things that Silver has introduced into election analysis are genuinely very useful, like the fact that he treats national polls as worthless because it doesn't matter how much you run up the score in safe states, or his analysis of "house effects" for various polling firms. But these virtues aren't Bayesian virtues, they're explanationist virtues: controlling for house effects is a way of trying to isolate the causal impact that actual public opinion has on polling data, and state-by-state analysis is just a way of making your model match the actual causal structure of what determines the winner of the election.
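To illustrate what I mean by isolating the causal signal, here's a toy sketch of a house-effect adjustment (the pollster names, margins, and the crude averaging estimator are all made up for illustration; this isn't Silver's actual model):

```python
from statistics import mean

# Hypothetical polls: (pollster, reported margin for candidate A, in points).
polls = [
    ("Acme Polling", 3.0), ("Acme Polling", 4.0),
    ("Beta Surveys", -1.0), ("Beta Surveys", 0.0),
    ("Gamma Research", 1.5), ("Gamma Research", 2.0),
]

overall = mean(margin for _, margin in polls)

# Crude "house effect": each firm's average deviation from the overall average.
house_effect = {
    firm: mean(m for name, m in polls if name == firm) - overall
    for firm in {name for name, _ in polls}
}

# Subtract the house effect to isolate the signal that tracks actual public
# opinion rather than a firm's methodological lean.
adjusted = [(name, round(m - house_effect[name], 2)) for name, m in polls]
print(house_effect)
print(adjusted)
```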

Tetlock's superforecasters aren't making probabilistic forecasts. They make actual predictions; the "superforecasters" are the ones who stick their necks out and get it right a huge amount of the time. Based on subsequent interviews with these superforecasters, Tetlock has said that what they do looks like Bayesian reasoning. But good causal analysis and Bayesian analysis have a lot in common (as many explanationists have argued; see, e.g., Ted Poston's book on IBE), and I think superforecasters are better understood as doing causal analyses. That's what I'd recommend for hedge funds.

As for your idea that we need to supplement calibration with some sort of accuracy measure: Yes! Absolutely! But how do we measure accuracy for a probabilistic forecast? The basic problem is that there is no accuracy measure for probabilistic forecasts of one-off events. (This is related to the "problem of the single case.") The obvious way to gin up an accuracy measure for single-case probabilistic forecasts is to say that if we give something a low probability but it happens anyway (or vice versa), the forecast was inaccurate. But this is precisely what Silver et al. want to claim is naive. "Things that are 29% probable happen 29% of the time"; the 538 forecast for 2016 wasn't inaccurate. Why not? Because 538 is well-calibrated. Calibration is proposed AS AN ACCURACY MEASURE. And my point is that it's not an accuracy measure. It's something else, something a lot weirder. Maybe it's still useful in some cases, for some purposes. But it's not the accuracy measure we're looking for. We still need one of those.
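To make that concrete, here's a rough sketch (with toy numbers) of what a calibration check actually computes. The thing to notice is that it's a property of a binned collection of forecasts, not of any single one, which is part of why it can't grade a one-off forecast like 538's 2016 number.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Bin forecasts by rounding to the nearest 0.1, then compare each bin's
    stated probability with the observed frequency of the event in that bin.
    Note: this describes the whole collection, not any one forecast."""
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        bins[round(p, 1)].append(o)
    return {b: (len(os), sum(os) / len(os)) for b, os in sorted(bins.items())}

# Toy numbers: a single ~30% forecast that "comes true" can't be graded on its
# own; calibration only says something about the pile of ~30% forecasts together.
forecasts = [0.3, 0.3, 0.3, 0.3, 0.7, 0.7, 0.7, 0.9, 0.9, 0.9]
outcomes  = [1,   0,   0,   0,   1,   1,   0,   1,   1,   1  ]
print(calibration_table(forecasts, outcomes))  # bin -> (count, observed frequency)
```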

Daniel Greco:

Thanks! I just came across your blog recently but I've really been enjoying it, btw.

On the hedge fund idea, call it causal analysis if you like, but they need to come up with judgments that play the decision-theoretic role of probabilities. The basic problem with making only binary predictions, if you need to be making bets, is that you can't decide when a bet with uneven odds is a good one. If a bet pays $500 if I win and only costs $100 to make, then even if we all make the binary prediction that it won't pay off, we need probabilistic forecasting to decide whether it's a wise bet nonetheless. This is the situation venture capital firms are in. If you want to make binary predictions about whether their potential acquisitions will reach a $1 billion valuation, you should predict "no" every time. Very few startups hit it big. But VC firms are looking to spot the next Amazon, Facebook, or Google. If one does blow up, then their investment will grow exponentially in value. The difference between a 0.0001% chance of a firm being the next Amazon and a 0.5% chance really matters; one isn't worth acquiring a stake in, and the other is.
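The arithmetic is trivial, but it only runs on probabilities. A minimal sketch (the VC stake and payout figures below are invented purely for illustration):

```python
def expected_profit(p_win, payout, cost):
    """Expected profit of a bet that costs `cost` up front and pays `payout` if it wins."""
    return p_win * payout - cost

def break_even_probability(payout, cost):
    """Win probability at which the bet is exactly fair."""
    return cost / payout

# The $100 bet that pays $500: worth taking above a 20% chance of winning,
# even though "it won't pay off" is the right binary prediction.
print(break_even_probability(500, 100))   # 0.2
print(expected_profit(0.3, 500, 100))     # +50.0

# The VC case, with made-up stakes: $1M for a position worth $1B if the
# startup turns out to be the next Amazon.
print(expected_profit(0.000001, 1_000_000_000, 1_000_000))  # 0.0001% chance: -999,000
print(expected_profit(0.005,    1_000_000_000, 1_000_000))  # 0.5% chance: +4,000,000
```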

As for accuracy vs calibration, in the lingo people distinguish the two. (In)accuracy is generally measured as distance from the truth, usually squared (that's called the "Brier Score"). So if you predicted something would happen with .8 probability, and then it happened, your inaccuracy for that prediction is .2 squared, so .04. While you can measure (in)accuracy for one-off predictions, the general sense is that it's not that meaningful or interesting. But having people come up with a whole bunch of probabilistic predictions on a menu of events, and then comparing their (in)accuracy across the whole bunch, is much more interesting/informative. That's how the Tetlock tournaments are scored. Which brings us back to the idea of superforecaster predictions vs probabilistic forecasts. I'm not sure what you mean by saying they make actual predictions rather than forecasts--they're asked to come up with probabilities, and then those probabilities are scored for accuracy as described above: https://www.gjopen.com/faq
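Here's roughly what that scoring looks like in code (the little menu of forecasts at the end is toy data, not real tournament numbers):

```python
def brier_score(forecasts, outcomes):
    """Mean squared distance between probabilistic forecasts and what happened
    (0 = perfect, 0.25 = always saying 50%, 1 = confidently wrong every time)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# The single-forecast example: 0.8 on an event that happened -> (1 - 0.8)^2 = 0.04.
print(brier_score([0.8], [1]))  # 0.04

# The more informative use: score a whole menu of forecasts, as in the
# Tetlock tournaments (toy numbers).
forecasts = [0.8, 0.6, 0.1, 0.95, 0.3]
outcomes  = [1,   0,   0,   1,    1  ]
print(brier_score(forecasts, outcomes))
```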

Note that they are not measured by how well calibrated they are. And I think that fits with the point I made in the earlier comment; calibration is a means to accuracy, rather than an end in itself.

I think what you're really hankering after, which I don't think you can get, is a really airtight way of evaluating one-off forecasts after the fact, one that would let you decide whether Nate Silver was vindicated or not in 2016. He'll say he was using a well-calibrated method that occasionally goes wrong. Other people will say they knew Trump was going to win, and that they were thus vindicated over Silver. I guess I think he's reasonable in saying: "if you really knew Trump was going to win, then show me a bunch of other predictions you've made/are making, and let's score the bunch, to see whether you just got lucky. That's what I'm doing, and taken holistically, my record is pretty good." I think people should get *credit* for having predicted a Trump victory, but I also think that absent something like a solid accuracy/calibration score over a whole bunch of similar predictions, it's reasonable to think they were probably just lucky.

The situation is a bit like reliabilism in epistemology; reliabilists think you can only evaluate individual beliefs for justification by subsuming them under a more general type (i.e., belief produced by such-and-such process) and then evaluating the process (i.e., what proportion of its outputs are true?). Similarly, people who like probabilistic forecasting will generally say you can only meaningfully evaluate forecasts by subsuming them under more general types, and evaluating the types (e.g., with accuracy and calibration methods). Doing so leads to well-known generality/reference class problems--what counts as a "similar" prediction? This is part of what's going on in your Nimoy case. But even though there's no theoretically satisfying solution, I think we need to just accept that this is a place where epistemology requires judgment/phronesis, rather than trying to do without anything like probabilistic forecasting.

Matt Lutz:

Glad you like the blog! Some scattered replies:

Somehow I forgot that Brier scores are referred to as "accuracy." Of course, of course, the Joyce "accuracy dominance" argument... we're on the same page now. I still have some issues with that way of talking about probabilistic correctness, but those are relatively minor compared to my worries about calibration. I think Brier scores are much better than calibration in basically every way.

I didn't know that Tetlock's superforecasters made probabilistic rather than binary predictions. Not sure why I thought otherwise. Thanks for setting me straight on that.

I don't think maximizing expected utility is a good way to run a hedge fund. I'm with my (former) colleague Brad Monton on this: maximizing expected utility is dumb, as a number of paradoxes have shown. (https://philpapers.org/rec/MONHTA-2) Sam Bankman-Fried and Caroline Ellison famously argued for biting the bullet in the St. Petersburg Paradox. Look where that landed Alameda and FTX. Perhaps a case in point? In general, I'm skeptical of any formal decision theory. Predicting the future is hard, acting under uncertainty is hard as well, and people who do well are mostly just getting lucky.
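For anyone who hasn't seen the paradox, a minimal sketch of why expected utility maximization chokes on the St. Petersburg game: each extra flip you allow adds another dollar of expected payout, so the game's expected value is unbounded and a naive expected-utility maximizer should pay any finite price to play.

```python
# St. Petersburg game: flip a fair coin until it lands heads; if the first heads
# comes on flip n, the payout is 2**n dollars.
def truncated_expected_payout(max_flips):
    # Each possible flip contributes (1/2)**n * 2**n = 1 dollar of expected payout.
    return sum((0.5 ** n) * (2 ** n) for n in range(1, max_flips + 1))

for flips in (10, 100, 1000):
    print(flips, truncated_expected_payout(flips))  # 10.0, 100.0, 1000.0 -- no upper bound
```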

The connection between calibration and reliabilism is apt and useful. I guess one way you could put the point I'm making with both of my objections is that calibration suffers from a version of the generality problem. You're going to get bad results whenever your epistemic status regarding P depends on your epistemic status regarding a host of other propositions. Why should P's status depend on those other things, and how are we populating the list of other things to begin with?

In case it's not clear, I'm an explanationist evidentialist in the Conee/Feldman/McCain model. See my published works in epistemology for more detail about my particular commitments... =)

Philippe Bélanger:

When we assess the accuracy of probabilistic predictions, we usually use a scoring rule like the Brier score. If you use such a rule in your Nimoy example, the score of your predictions worsens.

Matt Lutz:

Yes! Brier scores are better than calibration.

Philippe Bélanger:

Has anyone (well, anyone worth taking seriously) ever suggested using this calibration method to evaluate the accuracy of probabilities? The idea seems fairly nonsensical.

Matt Lutz:

Nate Silver! Scott Alexander! These are very popular mainstream thinkers about probability and forecasting. I mean, you could "No True Scotsman" this and say that anyone who takes calibration seriously should not themselves be taken seriously, but these really are popular and influential ideas among the "educated public."

Prado:

"If we don’t want that forecast to be vindicated — and we don’t! — then the logic of calibration is defective."

But the logic of calibration is precisely to vindicate not individual forecasts, but the forecaster herself. The idea here is that since there is no way to evaluate the precision of a specific forecast, we can at least know how good that person is at forecasting.

"In other words: making an bad prediction about one proposition improved your epistemic standing with regard to nine other unrelated propositions."

Only because you crafted the example so that the mistakes compensate. If you make a mistake in a math problem, the only way to get back to the right answer is to make another mistake. Making another mistake improves your math score, but this does not invalidate the test. If you consistently use calibration to test your forecasting skills, this type of coincidence, where crazy predictions make your score better, will happen rarely.

"Not only is this psychologicaly implausible, it’s inconsistent with Bayesianism"

I don't see how that's the case. Humans obviously use heuristics when making estimations. Estimating in round numbers is what you would expect even in non-probabilistic guesses--the number of M&Ms in a jar, for instance. The fact that we can't tell there are 812 M&Ms instead of 800 doesn't undermine the fact that we can evaluate who is good and who is bad at these types of estimations. It's hard to tell whether turquoise is green or blue; that doesn't mean we should call the sky green or the grass blue.
