One of the biggest problems with the modern approach to scientific statistics is that it fundamentally does not tell us what we want to know. We try to use it to infer the future from past behavior, when we should be using it to better understand the (extended) present.
When we flip a fair coin, which outcome is more likely?
(1) H(eads)-T(ails)-H-T-H-T-H-T-H-T (alternating)
(2) H-H-H-H-H-H-H-H-H-H (all heads)
(3) T-T-T-T-T-T-T-T-T-T (all tails)
(4) T-H-H-T-H-T-T-T-H-T (some random looking combination)
Of course, all of these outcomes are equally likely. After all, any given pattern of events is exactly as likely as any other given pattern of events.
But the real question we want to answer is: given a sequence of results, and only that sequence of results, how sure can we be that the coin is fair?
Now, the first notion we must disabuse ourselves of is that we can tell the future. Past results do not guarantee future returns, after all. But having conceded that, we can still ask what our four types of statistical inference tell us.
Bayesian inference can tell us whether to expect a heads or a tails after some prior pattern of heads and tails. If we take a God’s-eye view of the matter, we can quickly convince ourselves that the correct answer would be 50% in every case. But Bayes’ rule doesn’t tell us how we could arrive at that number to begin with.
We could measure the coin very carefully and conclude that no side has a greater intrinsic propensity to show itself than the other. Unfortunately, not only is this against the rules (remember, we are only given the sequence of results), but it also neglects the fact that a coin toss is subject not only to the intrinsic factors of the coin itself, but also to extrinsic factors relating to the toss. It is, after all, possible to construct a clever device that catches a coin in just such a way that the result is perfectly predictable.
Frequentist statistics, meanwhile, can tell us what distribution of results to expect, assuming we already know the coin is fair and can repeat the experiment infinitely many times. More precisely, it can tell us how often a particular set of outcomes would result from a given number of fair coin tosses. We can then decide whether an observed distribution is rare enough for us to say it did not arise by chance.
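As a rough sketch of that frequentist calculation (in Python, with the fair-coin assumption p = 0.5 baked in and the binomial distribution doing the counting):

```python
from math import comb

def prob_k_heads(k, n, p=0.5):
    """Probability of exactly k heads in n tosses of a coin whose heads-probability is p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# With the fair-coin assumption (p = 0.5) built in:
print(prob_k_heads(10, 10))  # ten heads in ten tosses: 1/1024, about 0.001
print(prob_k_heads(5, 10))   # five heads in ten tosses: about 0.246

# The frequentist move is then to ask whether the observed count is rare enough,
# under that assumption, to reject the claim that the coin is fair.
```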
But notice, this all stems from the assumption that the coin is fair to begin with, which was not given. What we wanted to know was whether the coin was fair based solely on our observation of coin tosses.
This is why the gambler’s fallacy is so enticing. The gambler’s fallacy arises when we try to make sense of a random distribution of outcomes from a coin that we believe to be fair. But in a world where fairness cannot be guaranteed, it’s not really a fallacy at all. This can be easily illustrated by considering a coin toss where we don’t know that the coin is fair. After the first toss comes up heads, what should we bet on next?
If the coin is truly fair, then it wouldn’t matter: frequentist reasoning tells us that the outcome of the next toss will be just as random as the first. If, however, the coin is even a little unfair, that will bias the result. In the real world, propensity theory informs us that every physical object is practically guaranteed to have some bias. From here we can use Bayesian reasoning to determine that the chance of the first toss being heads, given that the coin is biased towards heads, is much higher than if the coin were fair or biased towards tails.
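A minimal sketch of that Bayesian step, using an entirely made-up three-hypothesis prior (a heads-biased coin, a fair coin, and a tails-biased coin, with illustrative heads-probabilities):

```python
# Hypothetical heads-probabilities and a uniform prior, chosen only for illustration.
hypotheses = {
    "biased-heads": {"prior": 1 / 3, "p_heads": 0.9},
    "fair":         {"prior": 1 / 3, "p_heads": 0.5},
    "biased-tails": {"prior": 1 / 3, "p_heads": 0.1},
}

def posterior_after_heads(hyps):
    """Apply Bayes' rule after observing a single heads."""
    evidence = sum(h["prior"] * h["p_heads"] for h in hyps.values())
    return {name: h["prior"] * h["p_heads"] / evidence for name, h in hyps.items()}

for name, post in posterior_after_heads(hypotheses).items():
    print(name, round(post, 3))
# biased-heads 0.6
# fair 0.333
# biased-tails 0.067
```

Even one observed heads shifts belief towards the heads-biased hypothesis, which is all the argument above needs.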
Thus, if we didn’t know anything about the coin being tossed, we should still be more surprised (even if only a little) by a Heads-Tails result than by a Heads-Heads result. And the more heads we get in a row to start, the more surprised we should be once the first tails appears.
This chain of reasoning can be summed up in the information-theoretic construct of surprisal, which can be formalized using a slight mangling of Claude Shannon’s idea of self-information. The formula is -log(P), where the base of the logarithm can be anything and P is the probability of the event.
In the context we are dealing with, we can only describe the probability after the fact, since we don’t actually know whether the coin is fair or not. So, using base 2, the first toss yields an observed probability of 100% of landing heads if it does indeed land heads. The surprise embodied by this event is 0, since (assuming we are certain the coin must land either heads or tails) there is no surprise when it does so.
If we toss the coin again and it lands heads again, this adds no surprise. The observed probability of the coin landing heads is still 100%, and -log(1) = 0.
But if the coin lands tails instead, the observed frequency of heads and of tails is now 50% each. Using 2 as the base for the logarithm, -log(0.5) = 1.
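A minimal sketch of this after-the-fact bookkeeping, reading the probability of each toss as the observed frequency of whichever face actually came up (which is one way of cashing out the description above):

```python
from math import log2

def surprisal(p):
    """Shannon-style self-information, in bits, of an event with probability p."""
    return 0.0 if p == 1 else -log2(p)

def observed_surprisals(tosses):
    """Surprisal of each toss, measured against the frequencies observed so far."""
    counts = {"H": 0, "T": 0}
    result = []
    for i, face in enumerate(tosses, start=1):
        counts[face] += 1
        result.append(surprisal(counts[face] / i))
    return result

print(observed_surprisals("HH"))  # [0.0, 0.0]  heads again: still no surprise
print(observed_surprisals("HT"))  # [0.0, 1.0]  the observed frequency of tails is now 0.5
```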
The nice thing about this approach is that the information content can be averaged. Say we flip a coin four times. Then:
HHHH and TTTT yield 0 surprise.
HHTT, TTHH, THTH, HTHT, HTTH, and THHT can each be given a value of 1, while
HTTT, THTT, TTHT, TTTH, THHH, HTHH, HHTH, and HHHT all have an average value of about 1.2.
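One way to reproduce those numbers (a reading of the averaging, not necessarily the only one) is to take each face’s observed frequency over the four tosses and average the surprisals of the faces that actually appeared:

```python
from math import log2

def average_surprisal(tosses):
    """Average the surprisal of each distinct face that appeared,
    at its observed frequency over the whole sequence."""
    n = len(tosses)
    freqs = [tosses.count(face) / n for face in set(tosses)]
    return sum(-log2(f) for f in freqs) / len(freqs)

print(average_surprisal("HHHH"))  # 0.0
print(average_surprisal("HHTT"))  # 1.0
print(average_surprisal("HTTT"))  # about 1.21
```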
This neatly captures the basic insight that an event which occurs more rarely in some set of events should surprise us more when it happens. It also captures the idea that the increase in surprise shrinks the more attempts we make. So the increase in surprise in going from a tails after five heads to a tails after six heads is greater than the increase in going from a tails after six heads to a tails after seven heads. This would be counterintuitive if we assumed the coin was fair, in which case we would be less surprised to get a tails somewhere in a million tosses than to get one in three tosses.
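Under that same reading, the shrinking increments can be checked directly for a run of heads followed by a single tails:

```python
from math import log2

def run_then_tails(n_heads):
    """Averaged surprisal (same averaging as above) of n_heads heads followed by one tails."""
    total = n_heads + 1
    return (-log2(n_heads / total) - log2(1 / total)) / 2

for n in (5, 6, 7):
    print(n, round(run_then_tails(n), 3))
# 5 1.424
# 6 1.515
# 7 1.596
# The jump from five to six heads (~0.091) is larger than the jump from six to seven (~0.081).
```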
While variations of this concept are already used extensively as a scientific tool, the real challenge is to develop this notion into a principle that can serve as a scientific paradigm on par with frequentism, Bayesianism, and propensity.