A Mathematical Theory of Communication by C. E. Shannon

The entropy in the case of two possibilities with probabilities p and q = 1 − p, namely

$$H = -(p \log p + q \log q)$$

is plotted in Fig. 7 as a function of p.

[Fig. 7 — Entropy H (in bits) in the case of two possibilities with probabilities p and 1 − p, plotted as a function of p.]
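As a quick numerical companion to Fig. 7, the following Python sketch (not part of the original paper) evaluates the binary entropy at a few values of p; the symmetric maximum of 1 bit at p = 0.5 corresponds to the peak of the plotted curve.

```python
# A minimal sketch: the binary entropy H = -(p log p + q log q), q = 1 - p, in bits.
import math

def binary_entropy(p: float) -> float:
    """H(p) in bits for two possibilities with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0          # the limit of -x log x as x -> 0 is 0
    q = 1.0 - p
    return -(p * math.log2(p) + q * math.log2(q))

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p = {p:.1f}  H = {binary_entropy(p):.3f} bits")
# The curve is symmetric about p = 0.5, where H attains its maximum of 1 bit.
```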

The quantity H has a number of interesting properties which further substantiate it as a reasonable measure of choice or information.

1. H = 0 if and only if all the p_i but one are zero, this one having the value unity. Thus only when we are certain of the outcome does H vanish. Otherwise H is positive.

2. For a given n, H is a maximum and equal to log n when all the p_i are equal (i.e., 1/n). This is also intuitively the most uncertain situation.

⁸ See, for example, R. C. Tolman, Principles of Statistical Mechanics, Oxford, Clarendon, 1938.


3. Suppose there are two events, x and y, in question with m possibilities for the first and n for the second. Let p(i, j) be the probability of the joint occurrence of i for the first and j for the second. The entropy of the joint event is

$$H(x, y) = -\sum_{i,j} p(i, j) \log p(i, j)$$

while

$$H(x) = -\sum_{i,j} p(i, j) \log \sum_{j} p(i, j)$$

$$H(y) = -\sum_{i,j} p(i, j) \log \sum_{i} p(i, j).$$

It is easily shown that

$$H(x, y) \le H(x) + H(y)$$

with equality only if the events are independent (i.e., p(i, j) = p(i)p(j)). The uncertainty of a joint event is less than or equal to the sum of the individual uncertainties. (A numerical check of properties 3–6 appears after this list.)

4. Any change toward equalization of the probabilities p_1, p_2, …, p_n increases H. Thus if p_1 < p_2 and we increase p_1, decreasing p_2 an equal amount so that p_1 and p_2 are more nearly equal, then H increases. More generally, if we perform any “averaging” operation on the p_i of the form

$$p_i' = \sum_{j} a_{ij} p_j$$

where $\sum_i a_{ij} = \sum_j a_{ij} = 1$, and all $a_{ij} \ge 0$, then H increases (except in the special case where this transformation amounts to no more than a permutation of the p_j with H of course remaining the same).

5. Suppose there are two chance events x and y as in 3, not necessarily independent. For any particular value i that x can assume there is a conditional probability p_i(j) that y has the value j. This is given by

$$p_i(j) = \frac{p(i, j)}{\sum_{j} p(i, j)}.$$

We define the conditional entropy of y, H_x(y), as the average of the entropy of y for each value of x, weighted according to the probability of getting that particular x. That is

$$H_x(y) = -\sum_{i,j} p(i, j) \log p_i(j).$$

This quantity measures how uncertain we are of y on the average when we know x. Substituting the value of p_i(j) we obtain

$$H_x(y) = -\sum_{i,j} p(i, j) \log p(i, j) + \sum_{i,j} p(i, j) \log \sum_{j} p(i, j)$$

$$= H(x, y) - H(x)$$

or

$$H(x, y) = H(x) + H_x(y).$$

The uncertainty (or entropy) of the joint event x, y is the uncertainty of x plus the uncertainty of y when x is known.

6. From 3 and 5 we have

$$H(x) + H(y) \ge H(x, y) = H(x) + H_x(y).$$

Hence

$$H(y) \ge H_x(y).$$

The uncertainty of y is never increased by knowledge of x. It will be decreased unless x and y are independent events, in which case it is not changed.
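The following Python sketch (not part of the original paper) checks properties 3–6 numerically on an arbitrary example joint distribution p(i, j) and an example doubly stochastic averaging matrix a_ij; the particular numbers are illustrative assumptions, not data from the paper.

```python
# Numerical check of properties 3-6 on an assumed example distribution (base-2 logs).
import math

def H(probs):
    """Entropy -sum p log p of a list of probabilities (zero terms dropped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Example joint distribution p(i, j) over a 2 x 3 table (rows: x, columns: y).
p = [[0.10, 0.20, 0.10],
     [0.25, 0.05, 0.30]]

p_x = [sum(row) for row in p]                                   # marginal of x
p_y = [sum(p[i][j] for i in range(2)) for j in range(3)]        # marginal of y

H_xy = H([p[i][j] for i in range(2) for j in range(3)])
H_x, H_y = H(p_x), H(p_y)

# Conditional entropy H_x(y) = -sum p(i,j) log p_i(j), with p_i(j) = p(i,j)/p(i).
H_x_y = -sum(p[i][j] * math.log2(p[i][j] / p_x[i])
             for i in range(2) for j in range(3) if p[i][j] > 0)

print(f"H(x,y) = {H_xy:.4f}  <=  H(x) + H(y) = {H_x + H_y:.4f}")   # property 3
print(f"H(x) + H_x(y) = {H_x + H_x_y:.4f}  =  H(x,y)")             # property 5
print(f"H(y) = {H_y:.4f}  >=  H_x(y) = {H_x_y:.4f}")               # property 6

# Property 4: an "averaging" operation with a_ij whose rows and columns each
# sum to 1 (and a_ij >= 0) increases H.
a = [[0.7, 0.2, 0.1],
     [0.2, 0.6, 0.2],
     [0.1, 0.2, 0.7]]
q = [0.7, 0.2, 0.1]
q_avg = [sum(a[i][j] * q[j] for j in range(3)) for i in range(3)]
print(f"H before averaging = {H(q):.4f}, after = {H(q_avg):.4f}")
```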


7. THE ENTROPY OF AN INFORMATION SOURCE

Consider a discrete source of the finite state type considered above. For each possible state i there will be a set of probabilities p_i(j) of producing the various possible symbols j. Thus there is an entropy H_i for each state. The entropy of the source will be defined as the average of these H_i weighted in accordance with the probability of occurrence of the states in question:

$$H = \sum_{i} P_i H_i = -\sum_{i,j} P_i \, p_i(j) \log p_i(j).$$

This is the entropy of the source per symbol of text. If the Markoff process is proceeding at a definite time rate there is also an entropy per second

$$H' = \sum_{i} f_i H_i$$

where f_i is the average frequency (occurrences per second) of state i. Clearly

$$H' = mH$$

where m is the average number of symbols produced per second. H or H' measures the amount of information generated by the source per symbol or per second. If the logarithmic base is 2, they will represent bits per symbol or per second.
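As an illustration (not from the paper), the sketch below computes H = Σ P_i H_i for an assumed two-state Markoff source in which the emitted symbol determines the next state; the transition probabilities are arbitrary example values.

```python
# Entropy per symbol of an assumed two-state Markoff source.
import math

# In this toy source the emitted symbol also fixes the next state, so the
# transition matrix doubles as the symbol probabilities p_i(j) in state i.
p = [[0.8, 0.2],     # p_0(j): symbol probabilities in state 0
     [0.4, 0.6]]     # p_1(j): symbol probabilities in state 1

# Stationary state probabilities P_i, obtained by iterating P <- P p.
P = [0.5, 0.5]
for _ in range(1000):
    P = [sum(P[i] * p[i][j] for i in range(2)) for j in range(2)]

H_i = [-sum(pij * math.log2(pij) for pij in row if pij > 0) for row in p]
H = sum(P[i] * H_i[i] for i in range(2))
print(f"P = {P}, H = {H:.4f} bits per symbol")
# If the source emits m symbols per second, the entropy per second is H' = m H.
```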

If successive symbols are independent then H is simply −∑ p_i log p_i where p_i is the probability of symbol i. Suppose in this case we consider a long message of N symbols. It will contain with high probability about p_1 N occurrences of the first symbol, p_2 N occurrences of the second, etc. Hence the probability of this particular message will be roughly

$$p = p_1^{p_1 N} p_2^{p_2 N} \cdots p_n^{p_n N}$$

or

$$\log p \doteq N \sum_{i} p_i \log p_i$$

$$\log p \doteq -NH$$

$$H \doteq \frac{\log 1/p}{N}.$$

H is thus approximately the logarithm of the reciprocal probability of a typical long sequence divided by the number of symbols in the sequence. The same result holds for any source. Stated more precisely we have (see Appendix 3):

Theorem 3: Given any ε > 0 and δ > 0, we can find an N_0 such that the sequences of any length N ≥ N_0 fall into two classes:

1. A set whose total probability is less than ε.

2. The remainder, all of whose members have probabilities satisfying the inequality

$$\left| \frac{\log p^{-1}}{N} - H \right| < \delta.$$

In other words we are almost certain to have log p^{−1}/N very close to H when N is large.
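The following sketch (not from the paper) illustrates Theorem 3 for an independent-symbol source with an assumed three-symbol distribution: sampled values of log p^{−1}/N cluster ever more tightly around H as N grows.

```python
# Empirical illustration of Theorem 3 for an assumed independent-symbol source.
import math, random

random.seed(0)
probs = [0.5, 0.3, 0.2]                    # assumed example distribution
H = -sum(p * math.log2(p) for p in probs)

def sample_log_inv_prob_per_symbol(N):
    """Draw one sequence of length N and return log2(1/p(sequence)) / N."""
    seq = random.choices(range(3), weights=probs, k=N)
    return sum(-math.log2(probs[x]) for x in seq) / N

for N in (10, 100, 1000, 10000):
    vals = [sample_log_inv_prob_per_symbol(N) for _ in range(200)]
    spread = max(vals) - min(vals)
    print(f"N={N:>5}  mean={sum(vals)/len(vals):.3f}  spread={spread:.3f}  (H={H:.3f})")
# The spread around H shrinks as N grows: all but a small-probability set of
# sequences satisfy |log(1/p)/N - H| < delta.
```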

A closely related result deals with the number of sequences of various probabilities. Consider again the sequences of length N and let them be arranged in order of decreasing probability. We define n(q) to be the number we must take from this set starting with the most probable one in order to accumulate a total probability q for those taken.


Theorem 4:

$$\lim_{N \to \infty} \frac{\log n(q)}{N} = H$$

when q does not equal 0 or 1.

We may interpret log n(q) as the number of bits required to specify the sequence when we consider only the most probable sequences with a total probability q. Then log n(q)/N is the number of bits per symbol for the specification. The theorem says that for large N this will be independent of q and equal to H. The rate of growth of the logarithm of the number of reasonably probable sequences is given by H, regardless of our interpretation of “reasonably probable.” Due to these results, which are proved in Appendix 3, it is possible for most purposes to treat the long sequences as though there were just 2^{HN} of them, each with a probability 2^{−HN}.
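The sketch below (not from the paper) illustrates Theorem 4 by brute force for the same assumed three-symbol independent source: all sequences of length N are sorted by probability, n(q) is counted directly, and log n(q)/N drifts toward H for any fixed q as N increases (the convergence is slow at these small N).

```python
# Brute-force illustration of Theorem 4 for an assumed independent source.
import math
from itertools import product

probs = [0.5, 0.3, 0.2]                    # assumed example distribution
H = -sum(p * math.log2(p) for p in probs)

for N in (4, 8, 12):
    # Probabilities of all 3^N sequences, sorted in decreasing order.
    seq_probs = sorted(
        (math.prod(probs[s] for s in seq) for seq in product(range(3), repeat=N)),
        reverse=True)
    results = []
    for q in (0.5, 0.9):
        total, n = 0.0, 0
        while total < q:                   # count n(q): most probable sequences first
            total += seq_probs[n]
            n += 1
        results.append(math.log2(n) / N)
    print(f"N={N:>2}  log n(q)/N: q=0.5 -> {results[0]:.3f}, "
          f"q=0.9 -> {results[1]:.3f}   (H={H:.3f})")
```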

The next two theorems show that H and H' can be determined by limiting operations directly from the statistics of the message sequences, without reference to the states and transition probabilities between states.

Theorem 5: Let p(B_i) be the probability of a sequence B_i of symbols from the source. Let

$$G_N = -\frac{1}{N} \sum_{i} p(B_i) \log p(B_i)$$

where the sum is over all sequences B_i containing N symbols. Then G_N is a monotonic decreasing function of N and

$$\lim_{N \to \infty} G_N = H.$$

Theorem 6: Let p(B_i, S_j) be the probability of sequence B_i followed by symbol S_j and p_{B_i}(S_j) = p(B_i, S_j)/p(B_i) be the conditional probability of S_j after B_i. Let

$$F_N = -\sum_{i,j} p(B_i, S_j) \log p_{B_i}(S_j)$$

where the sum is over all blocks B_i of N − 1 symbols and over all symbols S_j. Then F_N is a monotonic decreasing function of N,

$$F_N = N G_N - (N - 1) G_{N-1},$$

$$G_N = \frac{1}{N} \sum_{n=1}^{N} F_n,$$

$$F_N \le G_N,$$

and lim_{N→∞} F_N = H.

These results are derived in Appendix 3. They show that a series of approximations to H can be obtained by considering only the statistical structure of the sequences extending over 1, 2, …, N symbols. F_N is the better approximation. In fact F_N is the entropy of the N-th order approximation to the source of the type discussed above. If there are no statistical influences extending over more than N symbols, that is if the conditional probability of the next symbol knowing the preceding N − 1 is not changed by a knowledge of any before that, then F_N = H. F_N of course is the conditional entropy of the next symbol when the N − 1 preceding ones are known, while G_N is the entropy per symbol of blocks of N symbols.
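As a numerical illustration (not from the paper), the sketch below estimates G_N and F_N from the block statistics of a long sample drawn from an assumed two-symbol Markoff source and compares them with that source's true entropy; both decrease toward H, with F_N the better approximation.

```python
# Estimating G_N and F_N from sample block statistics of an assumed source.
import math, random
from collections import Counter

random.seed(0)
# Assumed source: the next symbol depends only on the previous one.
trans = {'a': {'a': 0.9, 'b': 0.1}, 'b': {'a': 0.4, 'b': 0.6}}

# True entropy H = sum_i P_i H_i, with stationary probabilities P = (0.8, 0.2).
P = {'a': 0.8, 'b': 0.2}
H_true = -sum(P[i] * p * math.log2(p) for i in trans for p in trans[i].values())

# Generate a long sample text emitted by the source.
chars, state = [], 'a'
for _ in range(200000):
    state = random.choices(list(trans[state]), weights=list(trans[state].values()))[0]
    chars.append(state)
text = "".join(chars)

def G(N):
    """G_N = -(1/N) sum_B p(B) log p(B), estimated over all N-symbol blocks."""
    counts = Counter(text[k:k + N] for k in range(len(text) - N + 1))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values()) / N

for N in (1, 2, 3, 4):
    F_N = N * G(N) - (N - 1) * G(N - 1) if N > 1 else G(1)
    print(f"N={N}  G_N={G(N):.4f}  F_N={F_N:.4f}   (H={H_true:.4f})")
# Both sequences decrease toward H; F_N, the conditional entropy of the next
# symbol given the preceding N-1, satisfies F_N <= G_N and converges faster.
```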

The ratio of the entropy of a source to the maximum value it could have while still restricted to the same

symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%. This

means that when we write English half of what we write is determined by the structure of the language and

half is chosen freely. The figure 50% was found by several independent methods which all gave results in this neighborhood. One is by calculation of the entropy of the approximations to English. A second method

is to delete a certain fraction of the letters from a sample of English text and then let someone attempt to

restore them. If they can be restored when 50% are deleted the redundancy must be greater than 50%. A

third method depends on certain known results in cryptography.
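A minimal sketch (not from the paper) of the relative entropy and redundancy calculation, using an assumed four-symbol toy source; the frequencies are illustrative and are not measured English statistics.

```python
# Relative entropy and redundancy for an assumed toy alphabet of four symbols.
import math

probs = [0.5, 0.25, 0.15, 0.10]            # assumed single-symbol frequencies
n = len(probs)

H = -sum(p * math.log2(p) for p in probs)  # entropy of the source (bits/symbol)
H_max = math.log2(n)                       # maximum possible with n symbols

relative_entropy = H / H_max
redundancy = 1 - relative_entropy
print(f"H = {H:.3f} bits, H_max = {H_max:.3f} bits")
print(f"relative entropy = {relative_entropy:.2f}, redundancy = {redundancy:.2f}")
```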

Two extremes of redundancy in English prose are represented by Basic English and by James Joyce’s

book “Finnegans Wake”. The Basic English vocabulary is limited to 850 words and the redundancy is very

high. This is reflected in the expansion that occurs when a passage is translated into Basic English. Joyce

on the other hand enlarges the vocabulary and is alleged to achieve a compression of semantic content.

The redundancy of a language is related to the existence of crossword puzzles. If the redundancy is

zero any sequence of letters is a reasonable text in the language and any two-dimensional array of letters

forms a crossword puzzle. If the redundancy is too high the language imposes too many constraints for large

crossword puzzles to be possible. A more detailed analysis shows that if we assume the constraints imposed

by the language are of a rather chaotic and random nature, large crossword puzzles are just possible when

the redundancy is 50%. If the redundancy is 33%, three-dimensional crossword puzzles should be possible,

etc.

8. REPRESENTATION OF THE ENCODING AND DECODING OPERATIONS

We have yet to represent mathematically the operations performed by the transmitter and receiver in en-

coding and decoding the information. Either of these will be called a discrete transducer. The input to the

transducer is a sequence of input symbols and its output a sequence of output symbols. The transducer may

have an internal memory so that its output depends not only on the present input symbol but also on the past

history. We assume that the internal memory is finite, i.e., there exist a finite number m of possible states of the transducer and that its output is a function of the present state and the present input symbol. The next

state will be a second function of these two quantities. Thus a transducer can be described by two functions:

$$y_n = f(x_n, \alpha_n)$$

$$\alpha_{n+1} = g(x_n, \alpha_n)$$

where

x_n is the n-th input symbol,

α_n is the state of the transducer when the n-th input symbol is introduced,

y_n is the output symbol (or sequence of output symbols) produced when x_n is introduced if the state is α_n.
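A small Python sketch (not from the paper) of a discrete transducer in this form: the output function f and the next-state function g describe it completely. The particular transducer below is an assumed example whose output for some inputs is a block of two symbols, illustrating that y_n may be a sequence.

```python
# A finite-state transducer defined by y_n = f(x_n, state), state' = g(x_n, state).
from typing import Iterable, Iterator

class Transducer:
    def __init__(self, f, g, initial_state):
        self.f, self.g = f, g                   # output and next-state functions
        self.state = initial_state

    def run(self, xs: Iterable[str]) -> Iterator[str]:
        for x in xs:                            # finite memory: only (x_n, state) matter
            y = self.f(x, self.state)           # y_n = f(x_n, alpha_n)
            self.state = self.g(x, self.state)  # alpha_{n+1} = g(x_n, alpha_n)
            yield y

# Example: the state remembers whether the previous symbol was 'a';
# every symbol immediately following an 'a' is doubled in the output.
double_after_a = Transducer(
    f=lambda x, s: x * 2 if s == "after_a" else x,
    g=lambda x, s: "after_a" if x == "a" else "other",
    initial_state="other")

print("".join(double_after_a.run("abcab")))     # prints "abbcabb"
```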

If the output symbols of one transducer can be identified with the input symbols of a second, they can be

connected in tandem and the result is also a transducer. If there exists a second transducer which operates

on the output of the first and recovers the original input, the first transducer will be called non-singular and the second will be called its inverse.

Theorem 7: The output of a finite state transducer driven by a finite state statistical source is a finite state statistical source, with entropy (per unit time) less than or equal to that of the input. If the transducer is non-singular they are equal.

Let α represent the state of the source, which produces a sequence of symbols x_i; and let β be the state of the transducer, which produces, in its output, blocks of symbols y_j. The combined system can be represented by the “product state space” of pairs (α, β). Two points in the space, (α_1, β_1) and (α_2, β_2), are connected by