Defining Entropy with a Composition Law

The mathematical definition of entropy is often hard to motivate, coming across as rather mysterious. But it can be uniquely characterised with a very natural and pretty ‘composition law’, which anyone could come up with.


Theorem: The formula for information entropy

is the unique real-valued function of a finite probability distribution which:

  1. is continuous with respect to ,
  2. obeys the Composition Law,
  3. increases for uniform distributions as the width increases,
  4. has some fixed value for a fair coin toss .

The last assumption is merely a choice of units—without it, the function is defined up to an overall scaling factor.

Motivating the Composition Law

If we wanted to define the ‘uncertainty’ associated with a distribution, what properties should our measure have?

You can picture a finite probability distribution as a decision tree with weighted branches:

The same distribution can also be expressed by dividing the decision tree up into a composition of sub-trees:

Notice that each final outcome still has the same overall probability — all we’ve done is add an extra ‘step’ in the random process.

Since these two pictures represent the same scenario, we want our uncertainty measure to be the same for both. But the second picture has many parts. How do you measure the total uncertainty? By taking a weighted sum of the uncertainties of each sub-tree.

Why a weighted sum? Consider what should happen when one of the branches’ probabilities goes to zero: the total uncertainty should not include the uncertainty of the sub-tree under that branch, since it is certain that its outcomes won’t happen.

For example, the probability of is in the picture, so the total uncertainty contains only a small contribution from the green sub-tree.

Expressing this mathematically, the uncertainty measure should satisfy

where the left hand side is the uncertainty of the distribution in the first picture, and the right hand side is the total uncertainty in the second picture.

Note that the probabilities on each branch are still important, but are just too small to draw.

Formally

Now we can codify this composition law more formally. Let

be a finite probability distribution which is normalised so that . Let be some equivalence relation on the set of outcomes, . Define the quotient distribution

where is the equivalence class of . This corresponds to taking a distribution and binning the outcomes. Also define the restricted distributions

which correspond to keeping only a subset of outcomes and renormalising the probabilities.

These fit into the example above where like so:

Expressed in this language, the composition law is

where .

Proof of Uniqueness

We will prove that any ‘uncertainty’ function satisfying

  1. continuity
  2. the composition law
  3. the property that increases with

must be equal to the Shannon entropy

where . The proof is in three steps, one for each property above:

  1. Show that is uniquely defined for all distributions if it is known for all rational distributions .
  2. Show that is uniquely defined for all rational distributions if the entropy of the uniform distribution is known for all .
  3. Show that is uniquely defined for all if we fix .

Step 1.

This follows by the assumption of continuity. If is the limit a sequence of rational distributions , then by continuity .

Step 2.

Let be a rational distribution, and let be the lowest common denominator of all the probabilities , so that

where are non-negative integers. Now consider the set , and let be the uniform distribution on . Define an equivalence relation which partitions into different sets , where the th set contains elements. The size of the th equivalence group, as a fraction of the whole, is given by , and we have by construction.

From the composition law, we have

since and . This shows that is uniquely defined if is known for all .

Step 3.

We will now show that is uniquely defined by by showing that the only possible functions are

where is a free parameter.

Consider a uniform distribution on partitioned by into groups of , so that the th equivalence class is . Writing down the composition law for this partition yields

and hence . The only increasing functions satisfying this property are multiples of . By choosing , we fix .

This completes the proof!

In summary, under the three assumptions, knowing is enough to uniquely define for all , which is in turn enough to uniquely define for all rational distributions , which is enough to define completely.