A mathematical proof of Ockham’s razor?

Ockham’s razor is a principle often used to dismiss out of hand alleged phenomena deemed to be too complex. In the philosophy of religion, it is often invoked to argue that God’s existence is extremely unlikely to begin with, owing to his alleged incredible complexity.

[Image: A geeky brain is desperately required before entering this sinister realm.]

In an earlier post I dealt with some of the most popular justifications for the razor and made the following distinction:

Methodological Razor: if theory A and theory B do the same job of describing all known facts C, it is preferable to use the simpler theory for the next investigations.

Epistemological Razor: if theory A and theory B do the same job of describing all known facts C, the simpler theory is ALWAYS more likely.

Like last time, I won’t address the validity of the Methodological Razor (MR), which might be a useful tool in many situations.

My attention will be focused on the epistemological blade and its alleged mathematical grounding.

Example: prior probabilities of models having discrete variables

To illustrate how this is supposed to work, I built the following example. Let us consider the result Y of a random experiment depending on a measured random variable X. We are now searching for a good model (i.e. a function f(X)) such that the distance d = Y - f(X) is minimized with respect to the constant parameters appearing in f. Let us consider the following functions:

f1(X,a1), f2(X,a1,a2), f3(X,a1,a2,a3) and f4(X,a1,a2,a3,a4),

which are the only possible models for representing the relation between Y and X. Let n1 = 1, n2 = 2, n3 = 3 and n4 = 4 be their numbers of parameters. In what follows, I will neutrally describe how objective Bayesians justify Ockham’s razor in that situation.

The objective Bayesian reasoning

Objective Bayesians apply the principle of indifference, according to which in utterly unknown situations every rational agent assigns the same probability to each possibility.

Let pi_{total} = p(fi) be the probability that the function fi is the correct description of reality. It follows from that assumption that p1_{total} = p2_{total} = p3_{total} = p4_{total} = p = \frac{1}{4}, owing to the additivity of probabilities (the four values must sum to 1).

Let us assume that each constant coefficient ai can only take on five discrete values: 1, 2, 3, 4 and 5. Let us call p1, p2, p3 and p4 the probabilities that one of the four models is right with very specific values of the coefficients (a1, a2, a3, a4). By applying once again the principle of indifference, one gets:

p1(1) = p1(2) = p1(3) = p1(4) = p1(5) = \frac{1}{5} p1_{total} = 5^{-n1} p

In the case of the second function, which depends on two parameters, 5*5 pairs of values are possible: (1,1), (1,2), ..., (3,4), ..., (5,5). From indifference, it follows that

p2(1,1) = p2(1,2) = ... = p2(3,4) = ... = p2(5,5) = \frac{1}{25} p2_{total} = 5^{-n2} p

There are 5*5*5 possible triplets of values for f3. Indifference entails that

p3(1,1,1) = p3(1,1,2) = ... = p3(3,2,4) = ... = p3(5,5,5) = \frac{1}{125} p3_{total} = 5^{-n3} p

f4 is characterized by four parameters, so that a similar procedure leads to

p4(1,1,1,1) = p4(1,1,1,2) = ... = p4(3,2,1,4) = ... = p4(5,5,5,5) = \frac{1}{625} p4_{total} = 5^{-n4} p

Let us now consider four candidate solutions to the parameter identification problem:

S1 = {a1}
S2 = {b1, b2}
S3 = {c1, c2, c3}
S4 = {d1, d2, d3, d4}

each member being an integer between 1 and 5. The prior probabilities of these solutions are equal to the quantities we have just calculated above. Thus

p(S1) = 5^{-n1} p
p(S2) = 5^{-n2} p
p(S3) = 5^{-n3} p
p(S4) = 5^{-n4} p

From this, it follows that

O(i,j) = \frac{p(Si)}{p(Sj)} = 5^{nj - ni}

If one compares the first and the second model, O(1,2) = 5^{2-1} = 5, which means that the fit with the first model is (a priori) 5 times as likely as that with the second one.

Likewise, O(1,3) = 25 and O(1,4) = 125, showing that the first model is (a priori) 25 and 125 times as likely as the third and fourth models, respectively. If the four models fit the data equally well (for example, if each fi(X, ai) is perfectly identical to Y), Bayes’ theorem will preserve these ratios in the computation of the posterior probabilities.

In other words, all things being equal, the simplest model f1(X,a1) is five times more likely than f2(X,a1,a2), 25 times more likely than f3(X,a1,a2,a3) and 125 times more likely than f4(X,a1,a2,a3,a4) because the others contain a greater number of parameters.
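The arithmetic of this example can be checked with a short Python sketch. It assumes exactly the setup described above (four models with 1 to 4 parameters, each parameter restricted to the values 1 to 5, indifference applied both across models and across parameter values). The function names are my own and purely illustrative; they do not come from any library.

```python
from fractions import Fraction
from itertools import product

# Assumptions taken from the example: five admissible values per parameter,
# and model i has N_PARAMS[i] parameters.
VALUES = (1, 2, 3, 4, 5)
N_PARAMS = {1: 1, 2: 2, 3: 3, 4: 4}


def model_prior():
    """Indifference across the four models: each gets p = 1/4."""
    return Fraction(1, len(N_PARAMS))


def solution_prior(model):
    """Prior of one specific parameter combination of the given model.

    Indifference spreads the model's prior evenly over all of its
    5**n parameter combinations, which are enumerated explicitly here.
    """
    n_combinations = sum(1 for _ in product(VALUES, repeat=N_PARAMS[model]))
    return model_prior() / n_combinations


def ockham_factor(i, j):
    """Prior odds O(i,j) = p(Si) / p(Sj) = 5**(nj - ni)."""
    return solution_prior(i) / solution_prior(j)


def posterior_odds(i, j, likelihood_i, likelihood_j):
    """Posterior odds via Bayes' theorem; the evidence term cancels out."""
    return (likelihood_i * solution_prior(i)) / (likelihood_j * solution_prior(j))


if __name__ == "__main__":
    print(ockham_factor(1, 2))  # 5
    print(ockham_factor(1, 3))  # 25
    print(ockham_factor(1, 4))  # 125
    # Equal likelihoods (all models fit equally well) leave the odds unchanged.
    print(posterior_odds(1, 2, Fraction(1), Fraction(1)))  # 5
```

Exact fractions are used instead of floats so that the ratios come out as the integers 5, 25 and 125 without rounding noise. Note that the whole penalty is produced by the two calls to indifference, not by any data.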

For this reason O(i,j) is usually referred to as an Ockham’s factor, because it penalizes the likelihood of complex models. If you are interested in the case of models with continuous real parameters, you can take a look at this publication. The sticking point of the whole demonstration is its heavy reliance on the principle of indifference.

The trouble with the principle of indifference

I already argued against the principle of indifference in an older post. Here I will repeat and reformulate my criticism.

Turning ignorance into knowledge

The principle of indifference is not only unproven but also often leads to absurd consequences. Let us suppose that I want to know the probability that certain coins land odd. After having carried out 10000 trials, I find that the relative frequency tends to converge towards a given value, which was 0.35, 0.43, 0.72 and 0.93 for the last four coins I investigated. Let us now suppose that I find a new coin that I’ll never have the opportunity to test more than once. According to the principle of indifference, before having even started the trial, I should think something like this:

Since I know absolutely nothing about this coin, I know (or consider here extremely plausible) it is as likely to land odd as even.

I think this is magical thinking in its purest form. I am not alone in that assessment.

The great philosopher of science Wesley Salmon (who was himself a Bayesian) wrote the following: “Knowledge of probabilities is concrete knowledge about occurrences; otherwise it is useless for prediction and action. According to the principle of indifference, this kind of knowledge can result immediately from our ignorance of reasons to regard one occurrence as more probable than another. This is epistemological magic. Of course, there are ways of transforming ignorance into knowledge – by further investigation and the accumulation of more information. It is the same with all “magic”: to get the rabbit out of the hat you first have to put him in. The principle of indifference tries to perform “real magic”.”

Objective Bayesians often use the following syllogism for grounding the principle of indifference.

1) If we have no reason for favoring one of the outcomes, we should assign the same probability to each of them

2) In an utterly unknown situation, we have no reason for favoring one of the outcomes

3) Thus all of them have the same probability.

The problem is that (in a situation of utter ignorance) we have not only no reason for favoring one of the outcomes, but also no grounds for thinking that they are equally probable.

The necessary condition in proposition 1) is obviously not sufficient.

This absurdity (and other paradoxes) led philosopher of mathematics John Norton to conclude:

“The epistemic state of complete ignorance is not a probability distribution.”

The Dempster-Shafer theory of evidence offers us an elegant way to express indifference while avoiding absurdities and self-contradictions. According to it, a conviction is not represented by a probability (a real value between 0 and 1) but by an uncertainty interval [ belief(h) ; 1 – belief(non h) ], belief(h) and belief(non h) being the degrees of trust one has in the hypothesis h and in its negation.

For an unknown coin, indifference according to this epistemology would entail  belief(odd) = belief(even) = 0, leading to the probability interval [0 ; 1].
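As a minimal sketch of the interval described above (not a full implementation of Dempster-Shafer theory, which also involves mass functions and a combination rule), the following Python function computes [ belief(h) ; 1 – belief(non h) ]. The name belief_interval and the validity check are my own assumptions.

```python
def belief_interval(belief_h, belief_not_h):
    """Return the uncertainty interval (belief(h), 1 - belief(non h)).

    Both degrees of belief must be non-negative and may not sum to more
    than 1; whatever is left over is uncommitted (plain ignorance).
    """
    if belief_h < 0 or belief_not_h < 0 or belief_h + belief_not_h > 1:
        raise ValueError("beliefs must be >= 0 and sum to at most 1")
    return (belief_h, 1 - belief_not_h)


if __name__ == "__main__":
    # Unknown coin: no trust in "odd" and none in "even" either.
    print(belief_interval(0, 0))      # (0, 1)
    # Fully committed beliefs collapse the interval to a single probability.
    print(belief_interval(0.5, 0.5))  # (0.5, 0.5)
```

The two calls show the contrast at stake here: total ignorance yields the whole interval [0 ; 1], whereas the principle of indifference would silently replace it with the point value 0.5.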

Non-existing prior probabilities

Philosophically speaking, it is controversial to speak of the probability of a theory before any observation has been taken into account. The great philosopher of evolutionary biology Elliott Sober has a nice way to put it: “Newton’s universal law of gravitation, when suitably supplemented with plausible background assumptions, can be said to confer probabilities on observations. But what does it mean to say that the law has a probability in the light of those observations? More puzzling still is the idea that it has a probability before any observations are taken into account. If God chose the laws of nature by drawing slips of paper from an urn, it would make sense to say that Newton’s law has an objective prior. But no one believes this process model, and nothing similar seems remotely plausible.”

It is hard to see how prior probabilities of theories can be something more than just subjective brain states.


The alleged mathematical demonstration of Ockham’s razor rests on extremely shaky ground because:

1) it relies on the principle of indifference which is not only unproven but leads to absurd and unreliable results as well

2) it assumes that a model already has a probability before any observation.

Philosophically this is very questionable. Now if you are aware of other justifications for Ockham’s razor, I would be very glad if you were to mention them.


Why probabilities matter



In real life, it’s pretty rare (some would even say utterly impossible) to be sure of anything at all, like knowing it’s going to rain in one hour, that a conservative president is going to be elected, that you will be happily married in two years and so on and so forth.

We all recognize that it is only meaningful to speak of the probability or likelihood of each of these events.

The question of how to interpret their profound nature (ontology) is, however, far from being an easy one.

I will use the basic proposition “if I roll a die, there is a probability of 1/6 that I will get a 3” in order to illustrate the two main interpretations of the probability concept out there.

1. Frequentism

According to this interpretation, the probability of an event equals its limiting relative frequency when the experiment is repeated an infinite number of times. If you roll a die a great number of times, the frequency of the event (that is, the number of 3s divided by the total number of rolls) will converge towards 1/6.
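The frequentist picture can be illustrated with a small simulation. This is only a sketch: it assumes a fair six-sided die modelled by Python’s pseudo-random generator, and the function name relative_frequency and the fixed seed are my own choices.

```python
import random


def relative_frequency(n_rolls, target=3, seed=0):
    """Roll a simulated fair die n_rolls times and return the share of
    rolls that show `target` (number of hits divided by number of rolls)."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls


if __name__ == "__main__":
    # The printed frequencies should drift towards 1/6 (about 0.1667)
    # as the number of rolls grows.
    for n in (100, 10_000, 1_000_000):
        print(n, relative_frequency(n))
```

A finite simulation can of course only approximate the limit; the frequentist definition itself refers to an infinite sequence of rolls.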

Mathematically it is a well-defined concept, and in many cases it can be approximated relatively easily. One of the main difficulties is that it apparently fails to account for the likelihood of unique situations, such as the claim (as far as we know in 2013) that the Republicans are going to win the next American elections.

This brings us to the next popular interpretation of probability.

2. Bayesianism

For Bayesians, probabilities are degrees of belief and each degree of belief is a probability.

My degree of belief that the die will land on 3 is 1/6.

But what, then, is a “degree of belief”? It is a psychological state which is correlated with a certain readiness for action.

According to many proponents of Bayesianism, degrees of belief are objective insofar as every rational creature having the same set of information at its disposal would hold exactly the same ones.

While such a claim is largely defensible for many situations such as the rolling of dice, the spread of a disease or the results of the next elections, there are cases where it does not seem to make any sense at all.

Take for example the young Isaac Newton, who was considering his newly developed theory of universal gravitation. What value should his degree of belief have taken on BEFORE he had begun to consider the first data of the real world?


And what would it mean ontologically to say that we have a degree of belief of 60% that the theory is true? What is the relation (in that particular situation) between the intensity of certain brain processes and the objective reality?

Such considerations have led other Bayesians to give up objectivity and define “degrees of belief” as subjective states of mind, which might however be objectively constrained in many situations.

Another criticism of (strong) Bayesianism is that it ties the concept of probability to the beliefs of intelligent creatures. Yet it is clear that even in a universe lacking conscious beings, the probability of the decay of an atom and of more fundamental quantum processes would still exist and be meaningful.

For completeness, I should mention the propensity interpretation of Karl Popper, who viewed the likelihood of an event as an intrinsic tendency of a physical system to produce a certain state of affairs.


So these were my completely unbiased (pun intended!) views on probabilities.

When debating (and fighting!) each other, theists and atheists tend to take their own epistemology (theory of knowledge) for granted.

This often leads to fruitless and idle discussions.

This is why I want to take the time to examine how we can know, what it means to know, before discussing what we can (and cannot) know.





Next episode: Naked Bayesianism.