9 Comments
Daniel Greco:

This reminds me a bit of the old joke, usually targeting economists, about people who ask: "it works in practice, but does it work in theory?" I think it's hard to look at Tetlock's forecasting tournaments, where some people are clearly and reliably outperforming others, and so clearly have skills/habits that are worth trying to understand/evaluate/reproduce, and to conclude there's no there there. Likewise with the difference between Nate Silver-style poll aggregation and its associated probabilistic forecasts vs. pundit-style one-off predictions, or even just plain old weather forecasting. Would you prefer your local meteorologist to just say "rain tomorrow" or "no rain tomorrow" rather than the probabilistic forecasts they actually provide? My strong guess is that a hedge fund that took on board an internal rule--"no probabilistic forecasting"--would go broke fast.

Big picture, I think it makes more sense to accept that there's clearly a robust, well-justified practice of probabilistic forecasting, and then to ask how we can evaluate probabilistic forecasts in light of the sorts of puzzles you raise above, rather than to treat those puzzles as genuinely threatening the coherence/justifiability of the practice.

On the dependency point, one natural move is to point out that nobody should think calibration is the only desideratum in a set of probabilistic forecasts; accuracy matters too, and it's often easy to achieve calibration at the price of accuracy. E.g., suppose I'm trying to tell you how a series of coin tosses resulted, and it's not prediction; rather, I get to see how each coin landed, albeit from far away, so it's a demanding test of visual acuity. It's easy to achieve excellent calibration by just assigning each toss 50% probability of landing heads, 50% tails. But in that case I'm throwing out all the info I get from my eyes. If I try to use that info, I may end up straying from perfect calibration. There's a good chance that my best strategy, if accuracy is all I care about (and calibration not at all), is to aim to be as accurate as I can and then, if I get feedback, see where I'm departing from perfect calibration and use that to improve my accuracy even more.
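A quick sketch of that coin-toss point in code (the 80% reliability of the far-away look is just an assumed figure): both strategies below come out roughly perfectly calibrated, but the one that uses the visual evidence does far better on accuracy.

```python
import random

random.seed(0)
N = 10_000

# Fair coin tosses, plus a noisy "seen from far away" observation that matches
# the true outcome 80% of the time (the 80% figure is an assumption).
tosses = [random.random() < 0.5 for _ in range(N)]
observations = [t if random.random() < 0.8 else not t for t in tosses]

# Strategy A: ignore the evidence and call every toss 50% heads.
blind = [0.5] * N
# Strategy B: trust the noisy observation and report 80% for whatever was seen.
eyes = [0.8 if obs else 0.2 for obs in observations]

def brier(forecasts, outcomes):
    """Mean squared distance from the truth; lower is better."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

print("Always 50%:   ", brier(blind, tosses))  # ~0.25
print("Use your eyes:", brier(eyes, tosses))   # ~0.16, despite similar calibration
```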

I think that's pretty similar to what's going on in your Nimoy case; there's a way to get perfect calibration in that case, but it involves giving up on accuracy. In general, I think you should think of calibration as a means to accuracy, rather than as an end in its own right; if you know a set of forecasts is uncalibrated, and how, then you can produce a set of forecasts that is more accurate than it. So aiming for calibration is best understood as aiming for a set of forecasts that isn't obviously less accurate than some identifiable other set of forecasts.

Matt Lutz:

Thanks for such a thoughtful comment, Dan. Few philosophical objections are decisive, so I don't think I've conclusively refuted calibration or anything like that here. If I give calibrationists pause, then that's sufficient philosophical progress for me.

The best I can say in calibration's favor is that I can't help but feel that someone who is well-calibrated has attained an epistemic achievement of some kind. I'm inclined to question that feeling because I have larger conceptual worries about how the notion of probability is used.

I don't think that probabilities are particularly useful in forecasting, but a kind of causal analysis is. I'm certainly not in favor of pure vibe-based punditry, and I think that some of the things that Silver has introduced into election analysis are genuinely very useful, like the fact that he treats national polls as worthless because it doesn't matter how much you run up the score in safe states, or his analysis of "house effects" for various polling firms. But these virtues aren't Bayesian virtues, they're explanationist virtues: controlling for house effects is a way of trying to isolate the causal impact that actual public opinion has on polling data, and state-by-state analysis is just a way of making your model match the actual causal structure of what determines the winner of the election.
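To illustrate what I mean by isolating the causal signal, here's a toy sketch of a house-effect adjustment (the pollster names, margins, and the crude averaging estimator are all made up for illustration; this isn't Silver's actual model):

```python
from statistics import mean

# Hypothetical polls: (pollster, reported margin for candidate A, in points).
polls = [
    ("Acme Polling", 3.0), ("Acme Polling", 4.0),
    ("Beta Surveys", -1.0), ("Beta Surveys", 0.0),
    ("Gamma Research", 1.5), ("Gamma Research", 2.0),
]

overall = mean(margin for _, margin in polls)

# Crude "house effect": each firm's average deviation from the overall average.
house_effect = {
    firm: mean(m for name, m in polls if name == firm) - overall
    for firm in {name for name, _ in polls}
}

# Subtract the house effect to isolate the signal that tracks actual public
# opinion rather than a firm's methodological lean.
adjusted = [(name, round(m - house_effect[name], 2)) for name, m in polls]
print(house_effect)
print(adjusted)
```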

Tetlock's superforecasters aren't making probabilistic forecasts. They make actual predictions; the "superforecasters" are the ones who stick their necks out and get it right a huge amount of the time. Based on subsequent interviews with these superforecasters, Tetlock has said that what they do looks like Bayesian reasoning. But good causal analysis and Bayesian analysis have a lot in common (as many explanationists have argued; see, e.g., Ted Poston's book on IBE), and I think superforecasters are better understood as doing causal analyses. That's what I'd recommend for hedge funds.

As for your idea that we need to supplement calibration with some sort of accuracy measure: Yes! Absolutely! But how do we measure accuracy for a probabilistic forecast? The basic problem is that there is no accuracy measure for probabilistic forecasts of one-off events. (This is related to the "problem of the single case.") The obvious way to gin up an accuracy measure for single-case probabilistic forecasts is to say that if we give something a low probability but it happens anyway (or vice versa), the forecast was inaccurate. But this is precisely what Silver et al. want to claim is naive. "Things that are 29% probable happen 29% of the time"; the 538 forecast for 2016 wasn't inaccurate. Why not? Because 538 is well-calibrated. Calibration is proposed AS AN ACCURACY MEASURE. And my point is that it's not an accuracy measure. It's something else, something a lot weirder. Maybe it's still useful in some cases, for some purposes. But it's not the accuracy measure we're looking for. We still need one of those.
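To make that concrete, here's a rough sketch (with toy numbers) of what a calibration check actually computes. The thing to notice is that it's a property of a binned collection of forecasts, not of any single one, which is part of why it can't grade a one-off forecast like 538's 2016 number.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Bin forecasts by rounding to the nearest 0.1, then compare each bin's
    stated probability with the observed frequency of the event in that bin.
    Note: this describes the whole collection, not any one forecast."""
    bins = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        bins[round(p, 1)].append(o)
    return {b: (len(os), sum(os) / len(os)) for b, os in sorted(bins.items())}

# Toy numbers: a single ~30% forecast that "comes true" can't be graded on its
# own; calibration only says something about the pile of ~30% forecasts together.
forecasts = [0.3, 0.3, 0.3, 0.3, 0.7, 0.7, 0.7, 0.9, 0.9, 0.9]
outcomes  = [1,   0,   0,   0,   1,   1,   0,   1,   1,   1  ]
print(calibration_table(forecasts, outcomes))  # bin -> (count, observed frequency)
```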

Daniel Greco:

Thanks! I just came across your blog recently but I've really been enjoying it, btw.

On the hedge fund idea, call it causal analysis if you like, but they need to come up with judgments that play the decision-theoretic role of probabilities. The basic problem with making only binary predictions, if you need to be making bets, is that you can't decide when a bet with uneven odds is a good one. If a bet pays $500 if I win and only costs $100 to make, then even if we all make the binary prediction that it won't pay off, we need probabilistic forecasting to decide whether it's a wise bet nonetheless. This is the situation venture capital firms are in. If you want to make binary predictions about whether their potential acquisitions will reach a $1 billion valuation, you should predict "no" every time. Very few startups hit it big. But VC firms are looking to spot the next Amazon, Facebook, or Google. If one does blow up, then their investment will grow exponentially in value. The difference between a 0.0001% chance of a firm being the next Amazon and a 0.5% chance really matters; one isn't worth acquiring a stake in, and the other is.
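The arithmetic is trivial, but it only runs on probabilities. A minimal sketch (the VC stake and payout figures below are invented purely for illustration):

```python
def expected_profit(p_win, payout, cost):
    """Expected profit of a bet that costs `cost` up front and pays `payout` if it wins."""
    return p_win * payout - cost

def break_even_probability(payout, cost):
    """Win probability at which the bet is exactly fair."""
    return cost / payout

# The $100 bet that pays $500: worth taking above a 20% chance of winning,
# even though "it won't pay off" is the right binary prediction.
print(break_even_probability(500, 100))   # 0.2
print(expected_profit(0.3, 500, 100))     # +50.0

# The VC case, with made-up stakes: $1M for a position worth $1B if the
# startup turns out to be the next Amazon.
print(expected_profit(0.000001, 1_000_000_000, 1_000_000))  # 0.0001% chance: -999,000
print(expected_profit(0.005,    1_000_000_000, 1_000_000))  # 0.5% chance: +4,000,000
```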

As for accuracy vs calibration, in the lingo people distinguish the two. (In)accuracy is generally measured as distance from the truth, usually squared (that's called the "Brier Score"). So if you predicted something would happen with .8 probability, and then it happened, your inaccuracy for that prediction is .2 squared, so .04. While you can measure (in)accuracy for one-off predictions, the general sense is that it's not that meaningful or interesting. But having people come up with a whole bunch of probabilistic predictions on a menu of events, and then comparing their (in)accuracy across the whole bunch, is much more interesting/informative. That's how the Tetlock tournaments are scored. Which brings us back to the idea of superforecaster predictions vs probabilistic forecasts. I'm not sure what you mean by saying they make actual predictions rather than forecasts--they're asked to come up with probabilities, and then those probabilities are scored for accuracy as described above: https://www.gjopen.com/faq
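Here's roughly what that scoring looks like in code (the little menu of forecasts at the end is toy data, not real tournament numbers):

```python
def brier_score(forecasts, outcomes):
    """Mean squared distance between probabilistic forecasts and what happened
    (0 = perfect, 0.25 = always saying 50%, 1 = confidently wrong every time)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# The single-forecast example: 0.8 on an event that happened -> (1 - 0.8)^2 = 0.04.
print(brier_score([0.8], [1]))  # 0.04

# The more informative use: score a whole menu of forecasts, as in the
# Tetlock tournaments (toy numbers).
forecasts = [0.8, 0.6, 0.1, 0.95, 0.3]
outcomes  = [1,   0,   0,   1,    1  ]
print(brier_score(forecasts, outcomes))
```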

Note that they are not measured by how well calibrated they are. And I think that fits with the point I made in the earlier comment; calibration is a means to accuracy, rather than an end in itself.

I think what you're really hankering after, which I don't think you can get, is a really airtight way of evaluating one-off forecasts after the fact, one that would let you decide whether Nate Silver was vindicated or not in 2016. He'll say he was using a well-calibrated method that occasionally goes wrong. Other people will say they knew Trump was going to win, and that they were thus vindicated over Silver. I guess I think he's reasonable in saying: "if you really knew Trump was going to win, then show me a bunch of other predictions you've made/are making, and let's score the bunch, to see whether you just got lucky. That's what I'm doing, and taken holistically, my record is pretty good." I think people should get *credit* for having predicted a Trump victory, but I also think that absent something like a solid accuracy/calibration score over a whole bunch of similar predictions, it's reasonable to think they were probably just lucky.

The situation is a bit like reliabilism in epistemology; reliabilists think you can only evaluate individual beliefs for justification by subsuming them under a more general type (i.e., belief produced by such-and-such process) and then evaluating the process (i.e., what proportion of its outputs are true?). Similarly, people who like probabilistic forecasting will generally say you can only meaningfully evaluate forecasts by subsuming them under more general types, and evaluating the types (e.g., with accuracy and calibration methods). Doing so leads to well-known generality/reference class problems--what counts as a "similar" prediction? This is part of what's going on in your Nimoy case. But even though there's no theoretically satisfying solution, I think we need to just accept that this is a place where epistemology requires judgment/phronesis, rather than trying to do without anything like probabilistic forecasting.

Matt Lutz:

Glad you like the blog! Some scattered replies:

Somehow I forgot that Brier scores are referred to as "accuracy." Of course, of course, the Joyce "accuracy dominance" argument... we're on the same page now. I still have some issues with that way of talking about probabilistic correctness, but those are relatively minor compared to my worries about calibration. I think Brier scores are much better than calibration in basically every way.

I didn't know that Tetlock's superforecasters made probabilistic rather than binary predictions. Not sure why I thought otherwise. Thanks for setting me straight on that.

I don't think maximizing expected utility is a good way to run a hedge fund. I'm with my (former) colleague Brad Monton on this: maximizing expected utility is dumb, as a number of paradoxes have shown. (https://philpapers.org/rec/MONHTA-2) Sam Bankman-Fried and Caroline Ellison famously argued for biting the bullet in the St. Petersburg Paradox. Look where that landed Alameda and FTX. Perhaps a case in point? In general, I'm skeptical of any formal decision theory. Predicting the future is hard, acting under uncertainty is hard as well, and people who do well are mostly just getting lucky.
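For anyone who hasn't seen the paradox, a minimal sketch of why expected utility maximization chokes on the St. Petersburg game: each extra flip you allow adds another dollar of expected payout, so the game's expected value is unbounded and a naive expected-utility maximizer should pay any finite price to play.

```python
# St. Petersburg game: flip a fair coin until it lands heads; if the first heads
# comes on flip n, the payout is 2**n dollars.
def truncated_expected_payout(max_flips):
    # Each possible flip contributes (1/2)**n * 2**n = 1 dollar of expected payout.
    return sum((0.5 ** n) * (2 ** n) for n in range(1, max_flips + 1))

for flips in (10, 100, 1000):
    print(flips, truncated_expected_payout(flips))  # 10.0, 100.0, 1000.0 -- no upper bound
```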

The connection between calibration and reliabilism is apt and useful. I guess one way you could put the point I'm making with both of my objections is that calibration suffers from a version of the generality problem. You're going to get bad results whenever your epistemic status regarding P depends on your epistemic status regarding a host of other propositions. Why should P's status depend on those other things, and how are we populating the list of other things to begin with?

In case it's not clear, I'm an explanationist evidentialist in the Conee/Feldman/McCain model. See my published works in epistemology for more detail about my particular commitments... =)

Philippe Bélanger:

When we assess the accuracy of probabilistic predictions, we usually use a scoring rule like the Brier score. If you use such a rule in your Nimoy example, the score of your predictions worsens.

Matt Lutz:

Yes! Brier scores are better than calibration.

Philippe Bélanger:

Has anyone (well, anyone worth taking seriously) ever suggested using this calibration method to evaluate the accuracy of probabilities? The idea seems fairly nonsensical.

Matt Lutz:

Nate Silver! Scott Alexander! These are very popular mainstream thinkers about probability and forecasting. I mean, you could "No True Scotsman" this and say that anyone who takes calibration seriously should not themselves be taken seriously, but these really are popular and influential ideas among the "educated public."

Prado:

"If we don’t want that forecast to be vindicated — and we don’t! — then the logic of calibration is defective."

But the logic of calibration is precisely to vindicate not individual forecasts, but the forecaster herself. The idea here is that since there is no way to evaluate the precision of a specific forecast, we can at least know how good that person is at forecasting.

"In other words: making an bad prediction about one proposition improved your epistemic standing with regard to nine other unrelated propositions."

Only because you crafted the example so that the mistakes compensate. If you make a mistake in a math problem, the only way to get back to the right answer is to make another mistake. Making another mistake improves your math score, but this does not invalidate the test. If you consistently use calibration to test your forecasting skills, this type of coincidence, where crazy predictions make your score better, will happen rarely.

"Not only is this psychologicaly implausible, it’s inconsistent with Bayesianism"

I don't see how that's the case. Humans obviously use heuristics when making estimations. Estimating in round numbers is what you would expect even in non-probabilistic guesses--the number of M&Ms in a jar, for instance. The fact that we can't tell there are 812 M&Ms instead of 800 doesn't undermine the fact that we can evaluate who is good and who is bad at these types of estimations. It's hard to tell whether turquoise is green or blue; that doesn't mean we should call the sky green or the grass blue.
