Claude Shannon is properly described as "the father of information theory," although he described his own work as "communication theory." While others had loosely connected the idea of information to its opposite, entropy, it was Shannon who put the communication of signals in the presence of noise on a sound mathematical basis. In 1871, James Clerk Maxwell showed how an intelligent being could in principle sort out the disorder in a gas of randomly moving molecules by gathering information about their speeds and sorting them into hot and cold gases, in apparent violation of the second law of thermodynamics. William Thomson (Lord Kelvin) called this being "Maxwell's intelligent demon." As early as the 1890s, Ludwig Boltzmann, who established the statistical physics foundation of thermodynamics, had described entropy as "missing information." Boltzmann chose the logarithm of the number W of equiprobable microstates as the measure for his entropy because he wanted entropy to be an additive quantity.
S = k log W

where k is Boltzmann's constant. If one system can be in one thousand possible states and another system can also be in a thousand possible states, the combined system has a million possible states. In base 10, log₁₀ 1000 = 3, and 3 + 3 = 6 = log₁₀ 1,000,000.
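As a quick check on the additivity argument, here is a minimal Python sketch (using the illustrative state counts from the text) showing that the logarithm turns the product of state counts into a sum:

```python
import math

# Two independent systems: combining them multiplies their microstate
# counts, and the logarithm turns that product into a sum. This is why
# Boltzmann chose S = k log W, making entropy additive.
W1 = 1000                # microstates of the first system
W2 = 1000                # microstates of the second system
W_combined = W1 * W2     # 1,000,000 microstates for the joint system

# Base-10 logs, as in the text (any base works up to a constant factor):
print(math.log10(W1), math.log10(W2), math.log10(W_combined))
# -> 3.0, 3.0, 6.0 : log W1 + log W2 == log (W1 * W2)
```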
In 1929, Leo Szilard imagined a gas with but a single molecule in a container. He then devised a mechanism that could behave like Maxwell's demon. It would insert a partition into the middle of the container, then gather the information about which of the two sides of the partition the molecule was in. This binary decision allowed Szilard to develop the mathematical form for the amount of entropy S produced by a one-bit measurement, which Szilard identified as the acquisition of information and its storage in the "memory" of a physical device or of a human observer.

S = k log 2

The base-2 logarithm reflects the binary decision. The amount of entropy generated by the measurement may, of course, always be greater than this fundamental amount of negative entropy (information) created, but not smaller, or the second law - that overall entropy must increase - would be violated.

The earlier work of Maxwell, Boltzmann, and Szilard did not figure directly in Shannon's work. Shannon studied the design of early analog computers (specifically Vannevar Bush's differential analyzer at MIT, which Coolidge and James used in 1936 to calculate the wave functions of the hydrogen molecule). Then, with John von Neumann and Alan Turing, he helped design the first digital computers, based on the Boolean logic of 1's and 0's and binary arithmetic. Shannon analyzed telephone switching circuits that used electromagnetic relay switches, then realized that the switches could solve some problems in Boolean algebra. During World War II, Shannon worked at Bell Labs on cryptography and on sending control signals in the presence of noise. Alan Turing visited the labs for a couple of months and showed Shannon his 1936 ideas for a universal computer (the "Turing Machine"). Shannon's work on communications, control systems, and cryptography was initially classified, but it contained almost all of the mathematics that eventually appeared in his landmark 1948 article "A Mathematical Theory of Communication," which is the basis for modern information theory. Norbert Wiener's work on probability theory in Cybernetics had an important influence on Shannon. There can be no new information in a world of certainty. Probability and statistics are at the heart of both information theory and quantum theory. Shannon developed his expression for an information (Shannon) entropy, which he showed has the same mathematical form as thermodynamic (Boltzmann) entropy. He wrote:
Suppose we have a set of possible events whose probabilities of occurrence are p1, p2, . . . , pn. These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how much "choice" is involved in the selection of the event or of how uncertain we are of the outcome? If there is such a measure, say H(p1, p2, . . . , pn), it is reasonable to require of it the following properties:

1. H should be continuous in the pi.

2. If all the pi are equal, pi = 1/n, then H should be a monotonic increasing function of n. With equally likely events there is more choice, or uncertainty, when there are more possible events.

3. If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H. The meaning of this is illustrated in Fig. 6.
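Property 3 is the decomposition rule. In Shannon's paper, Fig. 6 breaks a choice among probabilities 1/2, 1/3, 1/6 into a fair first choice followed, half the time, by a second choice between 2/3 and 1/3. A short Python sketch can verify the weighted-sum requirement numerically:

```python
import math

def H(*probs):
    """Shannon's H in bits: H = -sum(p * log2 p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fig. 6 example: choosing directly among 1/2, 1/3, 1/6 should carry the
# same H as a fair first choice plus, with weight 1/2, a second choice
# with probabilities 2/3 and 1/3.
direct   = H(1/2, 1/3, 1/6)
two_step = H(1/2, 1/2) + (1/2) * H(2/3, 1/3)

print(round(direct, 6), round(two_step, 6))   # both ~1.459148 bits
```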
Shannon Entropy and Boltzmann Entropy

Shannon entropy is the average (expected) value of the information contained in a received message. If there are many possible messages, we get much more information than when there are only two possibilities (one bit of information). For equally likely messages, the information is the base-2 logarithm of the number of possibilities. Entropy thus characterizes our uncertainty about the information in an incoming message, and it increases with the number of possibilities and with greater randomness. The less likely an event is, the more information it provides when it occurs. Shannon defined the information in an event as the negative of the logarithm of its probability, and his entropy as the average of this quantity over the probability distribution. One bit of information is also known as one "shannon."

Boltzmann entropy is maximized when the particle distribution is maximally random among positions in phase space, that is, when the number of microstates W corresponding to a given macrostate is as large as possible. An improbable macrostate is one that corresponds to only a few microstates. Finding all the particles in a corner of the possible volume is information in the same sense as receiving one of the possible messages. An equilibrium macrostate is one in which the particles are as randomly distributed as possible; any such information is gone. Counterintuitively, maximum Boltzmann entropy means no information, while maximum Shannon entropy means maximal uncertainty before a message is received and therefore maximal information once it arrives, which makes the two entropies hard to compare.
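To make "less likely events carry more information" concrete, here is a minimal Python sketch (the message probabilities are invented for illustration) computing the surprisal -log₂ p of individual messages and the Shannon entropy as their expected value:

```python
import math

def surprisal(p):
    """Information, in bits, conveyed by an event of probability p."""
    return -math.log2(p)

def shannon_entropy(probs):
    """Expected information (bits per message) over a distribution."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# A rare message is more informative than a common one:
print(surprisal(0.5))               # 1.0 bit
print(surprisal(0.01))              # ~6.64 bits

# Two equally likely messages -> exactly one bit (one "shannon"):
print(shannon_entropy([0.5, 0.5]))  # 1.0
# Szilard's thermodynamic entropy for that one bit is k log 2:
print(1.380649e-23 * math.log(2))   # ~9.57e-24 J/K
```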
Historical Background

Information in physical systems was connected to a measure of the structural order in a system as early as the nineteenth century by William Thomson (Lord Kelvin) and Ludwig Boltzmann, who described an increase in the thermodynamic entropy as “lost information.” In 1877, Boltzmann proved his “H-Theorem” that the entropy or disorder in the universe always increases. He defined entropy S as the logarithm of the number W of possible states of a physical system, an equation now known as Boltzmann’s Principle, S = k log W. In 1929, Leo Szilard showed the mean value of the quantity of information produced by a 1-bit, two-possibility (“yes/no”) measurement to be S = k log 2, where k is Boltzmann’s constant, connecting information directly to entropy.

Following Szilard, Ludwig von Bertalanffy, Erwin Schrödinger, Norbert Wiener, Claude Shannon, Warren Weaver, John von Neumann, and Leon Brillouin all expressed similar views on the connection between physical entropy and abstract “bits” of information. Schrödinger said the information in a living organism is the result of “feeding on negative entropy” from the sun. Wiener said, “The quantity we define as amount of information is the negative of the quantity usually defined as entropy in similar situations.” Brillouin created the term “negentropy” because, he said, “One of the most interesting parts in Wiener’s Cybernetics is the discussion on ‘Time series, information, and communication,’ in which he specifies that a certain ‘amount of information is the negative of the quantity usually defined as entropy in similar situations.’”

Shannon, with a nudge from von Neumann, used the term entropy to describe his estimate of the amount of information that can be communicated over a channel, because his mathematical theory of the communication of information produced a mathematical formula identical to Boltzmann’s equation for entropy, except for a minus sign (the negative in negative entropy). Shannon described a set of n possible messages, the i-th occurring with probability pi. He then defined a quantity H, H = k Σ pi log pi, where k is a positive constant. Since H looked like the H in Boltzmann’s H-Theorem, Shannon called it the entropy of the set of probabilities p1, p2, . . . , pn.

To see the connection between the two entropies, note that Boltzmann assumed that all his probabilities were equal. For n equal states, the probability of each state is p = 1/n. The sum over n states, Σ pi log pi, is then n × (1/n) × log (1/n) = log (1/n) = - log n. If we set Shannon's number of possible messages n equal to Boltzmann's number of possible microstates W, we get Boltzmann’s entropy with a minus sign, H = - k log W. Shannon’s entropy H is the negative of Boltzmann’s S. Shannon showed that a communication that is certain to tell you something you already know (one of the messages has probability unity) contains no new information.
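The equal-probability step above is easy to verify numerically. Here is a minimal Python sketch (the value of n is arbitrary) checking that, for a uniform distribution, Σ pi log pi collapses to - log n, which is log W up to the sign and the constant k discussed in the text:

```python
import math

n = 1000          # number of equally likely messages (or microstates W)
p = 1.0 / n       # each probability pi = 1/n

# Sum over the n equal states: n * (1/n) * log(1/n) = log(1/n) = -log n
sum_p_log_p = sum(p * math.log(p) for _ in range(n))

print(sum_p_log_p)       # ~ -6.9078
print(-math.log(n))      # ~ -6.9078, i.e., -log W
```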
The Mathematical Theory of Communication (excerpts)
Introduction

The recent development of various methods of modulation such as PCM and PPM which exchange bandwidth for signal-to-noise ratio has intensified the interest in a general theory of communication. A basis for such a theory is contained in the important papers of Nyquist1 and Hartley2 on this subject. In the present paper we will extend the theory to include a number of new factors, in particular the effect of noise in the channel, and the savings possible due to the statistical structure of the original message and due to the nature of the final destination of the information.