[Peers] Testing by betting

A year ago, on September 9th 2020, we experienced a landmark moment for our new 'statistical paradigm': Professor Glenn Shafer presented 'Testing by betting' at a Royal Statistical Society (RSS) discussion session. A recording is available on YouTube. My colleagues at CWI and I were lucky that the Covid situation at the time allowed us to attend the virtual discussion session together while physically in the same room. You can see us for a brief moment in the video at [1:19:41], with me in yellow.

Since the beginning of my Ph.D., the worldwide group of researchers contributing to novel statistical techniques based on betting scores, e-values, and test martingales has grown. The RSS discussion features, for example, Dr. Aaditya Ramdas from Carnegie Mellon University [1:12:36 in the video], whose work was not familiar to us – not even to my supervisor – when I started my Ph.D. Just like Shafer and Vovk, Ramdas and his collaborators demonstrate the approach's impressive mathematical results, and I like his angle in this RSS discussion: it creates new statistical possibilities that are valuable even if you do not see the point of thinking about statistics in the language of betting. For me, however, that language of betting is the most important part, because statistics needs to improve its track record of communication.

Like Professor Shafer, my supervisor Peter Grünwald and I believe that statistical communication will benefit from the notion of a betting score or e-value. In the acknowledgments of the accompanying paper, Shafer mentions that his discussion paper "was inspired by conversations about game-theoretic testing and meta-analysis with Peter Grünwald, Wouter Koolen, and Judith ter Schure at the Centrum Wiskunde & Informatica in Amsterdam in December 2018." It is very nice that he mentions that visit, since for me it really was a highlight of my Ph.D. We discussed why the p-value is so often misunderstood and – with Shafer's wonderful historical knowledge – why it was introduced in the first place. His working papers on probabilityandfinance.com also go back to the early days of probability theory to show that thinking in terms of betting was very important for the groundwork of statistics.

The empirical evidence that the p-value is misunderstood by doctors as well as by writers of introductory textbooks served as a great introduction to Shafer's paper. Some of the discussants participating in the RSS discussion do not believe that betting scores will fare any better, but I am still optimistic: I do not want to accept yet that statistical inference is just too difficult. I have had good experiences piloting some meta-analysis-casino analogies in front of epidemiologists and qualitative researchers at the Research Integrity Amsterdam Lunch Meeting, which Shafer mentions in his reply [starting at 1:35:59]. It would be great to go beyond such anecdotal evidence. As I state in my written contribution [p. 360-361]:

"I am already convinced that it will serve statistics well to replace p-values by bets, and power analyses by implied targets. But how do we know whether practitioners actually find this more intuitive? As an applied statistician, I will keep that question from the chat in mind as well, and see if I can design an experiment to test it."

In his written reply [p. 473] Shafer adds:

"Perhaps the most crucial choice will be the selection of participants. Will they be highly trained statisticians? Poker players, as Philip Dawid playfully suggests? Scientists, who use statistical tests? Students in statistics classes? Or perhaps teenagers? I did not play poker as a teenager, but I remember that my classmates, when disputing each other's predictions, readily used betting taunts: 'Wanna bet?', 'Put your money where your mouth is'. Imagine making these reports to teenagers:
• Prof Shafer tested your app's rainfall predictions over the past year by betting against them and turned $1 into $10. He concluded that the app is not doing a good job.
• Prof Shafer constructed a statistical model for your app's rainfall predictions over the past year. The app's predictions were inaccurate by an amount he would have expected only 1% of the time. He concluded that if his model is right, the app is not doing a good job.

Which report would the teenager be more likely to remember? Which would she be able to repeat to a classmate? If you teach statistics, which do you think your students would be able to repeat accurately?"
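Shafer's first report – betting against the app's predictions and turning $1 into $10 – can be made concrete with a toy simulation. The sketch below is purely illustrative (all numbers are hypothetical, not from Shafer's example): a sceptic tests an app that claims rain falls with probability 0.8 each day, while in truth it rains half the time. Each day the sceptic multiplies their capital by the likelihood ratio of the observed outcome under an alternative versus the app's claim, so under the claim the running capital is a nonnegative test martingale with expected value 1.

```python
import random

random.seed(1)

# Hypothetical testing-by-betting sketch (numbers are assumptions, not
# Shafer's experiment): an app claims it rains with probability 0.8 each
# day; the true daily rain rate is 0.5.
CLAIMED_P = 0.8  # the app's claimed daily rain probability (assumed)
ALT_Q = 0.5      # the sceptic's alternative probability (assumed)
TRUE_P = 0.5     # ground truth, used only to simulate the weather

capital = 1.0    # start with $1; the running product is a test martingale
for day in range(365):
    rain = random.random() < TRUE_P
    # Multiply capital by the likelihood ratio of the day's outcome.
    # Under the app's claim this bet has expected payoff exactly 1:
    # CLAIMED_P * (ALT_Q / CLAIMED_P) + (1 - CLAIMED_P) * ((1 - ALT_Q) / (1 - CLAIMED_P)) = 1.
    if rain:
        capital *= ALT_Q / CLAIMED_P
    else:
        capital *= (1 - ALT_Q) / (1 - CLAIMED_P)

print(f"Final capital after one year: ${capital:,.2f}")
```

By Ville's inequality, a sceptic betting against a true claim turns $1 into $10 or more with probability at most 1/10, which is what lets the final capital double as a betting score, or e-value: the larger it grows, the stronger the evidence against the app's claim.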