What is Information?
This is a very fundamental question that promises insight into all our intellectual activities and beyond.
The word information is ubiquitous, and of course we know what it means. We would describe it as something that reduces our uncertainty. Are we happy with such descriptions? They use words that again need clarification and that can again be decomposed into smaller parts. As long as you can still abstract away, you are not done.
Note
Still, we need to start with the natural usage of the word. We talk about information in newspapers, in books, in sentences, in words.
What is involved?
- a sender: a person, an animal, our environment, ...
- a message: written or spoken text, signs, ...
- a receiver: again a person, a computer, a radio, ...
This is communication. Is Communication = Information? No, it's communication of information.
We need to decompose more: A person has a brain with a map of our world, with concepts. A message consists of words of a natural language, but it can also consist of signs, Chinese pictographic characters, hieroglyphs, bits or bytes....
These and other things we can think of are all quite different. Still, they must have something in common.
A person as a sender consists of many parts, these parts all communicate as well.
What is the smallest part of a communication process?
Selection from a multitude
Note
First a multitude, then the selection. That's evolution. Evolution is how all dynamic systems work.
- This is how biological evolution works. Selection by environment.
- Economy: supply/demand = creation of multitude; publicity, offers, comparison = selection
- The brain: ideas = creating multitude = creativity; selection via comparison with sensory input and experience
- Sciences: papers = multitude; research, studying = selection
- Memetics: ideas communicated to others
- Culture in societies
- ...
We have abstracted everything away and remain with a multitude. In mathematics the foundation is set theory. The set is what we so far called multitude.
Note
Where are we now? We have
- a set of elements
- a way to choose or select
One instance of choosing means to choose exactly one element. By selecting one element we de-select the others. The elements are exclusive.
Regarding choosing or selecting in relation to set theory, one might want to check out the axiom of choice.
With this exclusiveness added, the best words I came up with are variable and value.
Note
set --> variable
element --> value
Value and variable turn up everywhere: in programming, in physics, in statistics,..., because they are the building blocks of everything: They are information.
Note
building block for information = variable = set of exclusive values
Value and variable are quite ubiquitous and thus appropriate for something as fundamental as information.
Let's get quantitative: What is the smallest number of values we can choose from? The answer: 2. Quantitatively, the only important aspect is the size of the variable, also called cardinality. For a variable V I shall denote the size with |V|. It makes sense to use the smallest variable, the bit (B of size |B| = 2), as the unit of measurement for information. Values of bigger variables can be mapped to combinations of bits, which are elements of the Cartesian product. n bits can create $|B^n| = 2^n$ combinations.
Note
The information of V is the number of bits we need to select a value of V.
Because $2^{\log_2|V|} = |V|$ we get $I_V = \log_2|V|$.
Note
Information as a measure is a property of a variable, not of its values.
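The counting above can be sketched in a few lines. This is only an illustration: the function name `information_bits` and the example variables are assumptions, not part of the text. It computes the information of a variable from its size alone and enumerates bit combinations as a Cartesian product.

```python
import math
from itertools import product

def information_bits(variable):
    """Bits needed to select one value: log2 of the variable's size |V|."""
    return math.log2(len(variable))

# A variable is a set of exclusive values (the example values are made up).
bit = {0, 1}
compass = {"north", "east", "south", "west"}

# n bits span the Cartesian product B^n, which has 2**n elements.
three_bits = list(product(sorted(bit), repeat=3))

print(information_bits(bit))      # size 2
print(information_bits(compass))  # size 4
print(len(three_bits))            # number of 3-bit combinations
```

Note that the function only looks at `len(variable)`: which value is eventually selected plays no role.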
The occurrences of values of a variable need to be distinguished from the variable itself. They make another variable. When the focus is on the occurrences, one can use the word variate. Let's denote this variable with $\hat{V}$. To select an occurrence of any value of V we need $\log_2|\hat{V}|$ bits.
The frequency of occurrence is yet another variable. Here we look at $p_i = |\hat{V}_i| / |\hat{V}|$, where $\hat{V}_i$ are the occurrences of the value $V_i$, with $\sum_i p_i = 1$. Something like this is called a probability. The exclusiveness is essential to it. It is a measure.
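As a small sketch of this step, the relative frequencies can be computed by counting. The list `occurrences` is invented example data, and exact fractions are used only to avoid rounding noise.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical occurrences of the values of a variable V = {"a", "b", "c", "d"}.
occurrences = list("aaaabbcd")

counts = Counter(occurrences)   # number of occurrences per value
total = len(occurrences)        # number of all occurrences

# p_i = (occurrences of value i) / (all occurrences)
p = {value: Fraction(n, total) for value, n in counts.items()}

for value in sorted(p):
    print(value, p[value])
print(sum(p.values()))  # relative frequencies of exclusive values sum to 1
```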
Note
Add probabilities for or (∨), but only for exclusive values of the same variable; otherwise, in general, p(a∨b) = p(a) + p(b) − p(a∧b) holds. Note that p(a∧b) = 0 for exclusive values, which is why plain addition works for them.
Multiply probabilities for and (∧), but only for values belonging to independent variables; otherwise p(a∧b) = p(a)p(b|a) holds. See below.
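The two rules can be checked on a toy case. The fair die and the events chosen here are assumptions made purely for illustration.

```python
from fractions import Fraction

# Hypothetical fair die: one variable with six exclusive values.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# "or" over exclusive values of the same variable: just add.
p_one_or_two = die[1] + die[2]

# "or" over overlapping events: the overlap must be subtracted once.
even = {2, 4, 6}
at_least_four = {4, 5, 6}
p_even = sum(die[f] for f in even)
p_high = sum(die[f] for f in at_least_four)
p_both = sum(die[f] for f in even & at_least_four)
p_either = sum(die[f] for f in even | at_least_four)
print(p_either == p_even + p_high - p_both)

# "and" over independent variables (two separate dice): multiply.
p_double_six = die[6] * die[6]
print(p_one_or_two, p_double_six)
```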
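One can select values of V, i.e. $V_i$, by first selecting an occurrence of any value ($\log_2|\hat{V}|$ bits) and then subtracting the excess selection done ($\log_2|\hat{V}_i|$ bits): $\log_2|\hat{V}| - \log_2|\hat{V}_i| = -\log_2 p_i$. Here $\hat{V}$ stands for all occurrences and $\hat{V}_i$ for the occurrences of $V_i$. Via this selection path one can do with fewer bits to reference the total amount of occurrences: $-\sum_i |\hat{V}_i| \log_2 p_i \le |\hat{V}| \log_2|V|$. Dividing both sides by $|\hat{V}|$ we get the average number of bits needed to select a value of V. It is called entropy (S): $S_V = -\sum_i p_i \log_2 p_i \le \log_2|V| = I_V$.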
$I_V$ is the upper limit of $S_V$. $I_V$ can also be expressed with the $S_V$ formula by using a uniform $p_i = 1/|V|$.
Considering that a variable is always embedded in a context of other variables, this limit case can vaguely be interpreted as loss of structure.
Note
The entropy is the average number of bits needed to select a value of the variable.
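A minimal sketch of this average, assuming the occurrences are available as a plain list (the data and the name `entropy_bits` are illustrative): each occurrence of a value with relative frequency p costs log2(1/p) bits, and the entropy is the mean over all occurrences.

```python
import math
from collections import Counter

def entropy_bits(occurrences):
    """Average bits per selected value, from a list of observed occurrences."""
    counts = Counter(occurrences)
    total = len(occurrences)
    # p = n / total, and each term is p * log2(1 / p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

skewed = list("aaaabbcd")   # frequencies 1/2, 1/4, 1/8, 1/8
uniform = list("abcd")      # frequencies 1/4 each

upper_limit = math.log2(len(set(skewed)))  # information of the variable, |V| = 4

print(entropy_bits(skewed), entropy_bits(uniform), upper_limit)
```

With the uniform frequencies the entropy reaches the upper limit; with the skewed ones it stays below it.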
The occurrences of values of a variable are due to the variable's environment (= context), i.e. its relation to other variables. This context is often unknown or too complex (= too much information) to describe. It might also be known and used to derive the frequency of occurrences of values (a priori probability).
Physical systems can be regarded as a huge set of variables.
There is an even deeper relation between energy and information. Check out these links: Physics of information, Statistical physics, Statistical thermodynamics.
So far information was regarded as inherent to a variable (= a priori).
I continue with the perspective of a learning system, like humans and other animals or learning machines (computers).
A learning system needs to provide as much memory as there is in the observed system to fix one state.
But often the learning system is confronted with a system too complex (= too much information) to create a complete map of it. It doesn't see all the variables. Most of them stay hidden (latent variables). In this situation including the frequency of occurrences of values, i.e. the probability, is often the best thing one can do. The probability distribution is a result of hidden dependencies from hidden variables.
Or it is due to the limited information content of nature itself, i.e. due to quantum mechanics. With probability distributions, quantum mechanics can continue to use the well-known continuum mathematics (real numbers, ...), although the major statement of quantum mechanics is that there is no continuum.
In such a complex system it is normally not possible to select a single value of a variable. Then one resorts to a partial selection via a probability distribution for the variable. This method also includes the case of selecting one value via a Dirac distribution, the entropy of which is 0. A larger entropy expresses a more imprecise selection or, when talking about predictions, more uncertainty about what to expect.
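To make the idea of a partial selection concrete, here is a sketch that compares three distributions over the same four values. The distributions and the function name `distribution_entropy_bits` are assumptions for illustration only.

```python
import math

def distribution_entropy_bits(p):
    """Entropy in bits of a distribution given as a list of probabilities."""
    # Values with probability 0 contribute nothing to the sum.
    return sum(p_i * math.log2(1 / p_i) for p_i in p if p_i > 0)

dirac = [0.0, 1.0, 0.0, 0.0]          # exactly one value selected
peaked = [0.125, 0.5, 0.25, 0.125]    # partial selection
flat = [0.25, 0.25, 0.25, 0.25]       # nothing selected beyond the variable itself

for name, p in [("dirac", dirac), ("peaked", peaked), ("flat", flat)]:
    print(name, distribution_entropy_bits(p))
```

The sharper the distribution, the smaller the entropy, down to 0 for the exact selection.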
For an exactly fixed value of one variable we can often predict the value of another variable via a functional dependency.
Note
Functions are important, because there is no information needed to fix the value of the dependent variable.
With an imprecise selection and an imprecise relation, analytical expressions for dependencies are often not feasible. Then one resorts to conditional probability: p(a|b) (probability of a given b), a ∈ A and b ∈ B.
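One way to illustrate this statement, using the standard conditional entropy as the measure (a choice made here, not something the text names), is to compare observed pairs (a, b) with and without a functional dependency. The data and the helper name are hypothetical.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy_bits(pairs):
    """Average bits needed to select b once a is fixed, from (a, b) observations."""
    by_a = defaultdict(Counter)
    for a, b in pairs:
        by_a[a][b] += 1
    total = len(pairs)
    h = 0.0
    for counts in by_a.values():
        n_a = sum(counts.values())
        # Entropy of b among the observations that share this a.
        h_given_a = sum((n / n_a) * math.log2(n_a / n) for n in counts.values())
        h += (n_a / total) * h_given_a
    return h

# b = f(a): every a occurs with exactly one b.
functional = [(a, a % 2) for a in range(8)]
# No dependency: every a occurs with both values of b equally often.
unrelated = [(a, b) for a in range(8) for b in (0, 1)]

print(conditional_entropy_bits(functional))
print(conditional_entropy_bits(unrelated))
```

In the functional case no further bits are needed for b; in the unrelated case the full bit of b remains to be selected.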
It is p(a∧b) = p(a)p(b|a) = p(b)p(a|b).
Note
p(a)p(a|b) does not make sense.
This is Bayes' theorem. It allows one to derive p(b|a) from p(a|b), i.e. the inverse dependency (corresponding to the inverse function in exact mathematics). The starting variable's distribution is the prior and the dependent variable's distribution is the posterior.
There is a probability distribution for B for every value of A ($p(b|a) \in p(B|a)$): $\sum_b p(b|a) = 1$. But for p(b|a) as a function of a (likelihood function) there is no such normalization. In general $p(B|a_i)$ and $p(B|a_j)$ can actually be completely unrelated.
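A small numeric sketch may help here. The variables (weather as A, street state as B) and all numbers are invented for illustration. It inverts the dependency with Bayes' theorem and shows that the conditional distributions are normalized over b, while the likelihood as a function of a is not.

```python
from fractions import Fraction as F

# Hypothetical prior p(a) for A = {"rain", "dry"}.
prior = {"rain": F(1, 5), "dry": F(4, 5)}

# Hypothetical conditional distributions p(b|a) for B = {"wet", "not_wet"}.
cond = {
    "rain": {"wet": F(9, 10), "not_wet": F(1, 10)},
    "dry": {"wet": F(1, 5), "not_wet": F(4, 5)},
}

# For every fixed a, p(b|a) is a distribution over b.
normalized = all(sum(cond[a].values()) == 1 for a in prior)

def posterior(b):
    """p(a|b) = p(a) p(b|a) / p(b), with p(b) = sum over a of p(a) p(b|a)."""
    joint = {a: prior[a] * cond[a][b] for a in prior}
    evidence = sum(joint.values())
    return {a: joint[a] / evidence for a in joint}

post_wet = posterior("wet")

# The likelihood p("wet"|a), read as a function of a, need not sum to 1.
likelihood_sum = sum(cond[a]["wet"] for a in prior)

print(normalized, post_wet["rain"], post_wet["dry"], likelihood_sum)
```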
Based on theoretical reasoning and assumptions, i.e. by creating a model (= finding variables and their dependency), one normally ends up using a predefined distribution, like the normal distribution. The parameters of such a distribution often need to be estimated based on measurements. There are two related methods to do this:
There is the fundamental principle of maximum entropy (ME). By maximizing the entropy one does not assume more or make a more precise selection than what is actually known.
Via Bayesian inference one gets $p(a|\{b_i\}) = p(a)\,p(\{b_i\}|a) / p(\{b_i\})$, with $p(\{b_i\}|a) = \prod_i p(b_i|a)$, for the (parameter) variable A, and then one can maximize the entropy of this distribution and take the a ∈ A with maximum $p(a|\{b_i\})$. For a uniform p(a) this yields the same result as ...
Maximum likelihood (ML). It selects a ∈ A such that $\prod_i p(b_i|a)$ becomes a maximum.
The product (∏) is an assumption that the observed $b_i$ values are independent and identically distributed (i.i.d.).
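The relation between the two estimates can be sketched on a toy problem. Everything specific here is an assumption for illustration: the coin model, the observations, and the coarse parameter grid. The sketch picks the parameter with maximum likelihood and the parameter with maximum posterior under a uniform prior; it does not attempt the maximum entropy step.

```python
from fractions import Fraction
from math import prod

# Hypothetical i.i.d. observations b_i of a coin (1 = heads, 0 = tails).
observations = [1, 1, 0, 1]

# Candidate values of the parameter a = probability of heads, on a grid.
grid = [Fraction(k, 20) for k in range(21)]

def likelihood(a, data):
    """Product over i of p(b_i|a); the product encodes the i.i.d. assumption."""
    return prod(a if b == 1 else 1 - a for b in data)

# Maximum likelihood: the a that maximizes the product.
ml_estimate = max(grid, key=lambda a: likelihood(a, observations))

# Bayesian posterior p(a|{b_i}) with a uniform prior p(a).
prior = {a: Fraction(1, len(grid)) for a in grid}
evidence = sum(prior[a] * likelihood(a, observations) for a in grid)
posterior = {a: prior[a] * likelihood(a, observations) / evidence for a in grid}
map_estimate = max(posterior, key=posterior.get)

print(ml_estimate, map_estimate)
```

With a uniform prior the posterior is just the rescaled likelihood, so both selections pick the same grid point.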
Here are some links regarding ML and ME: 1, 2, 3.
For more on information and probability I recommend these great books:
Whether values are exclusive in the first place, i.e. whether they form a variable, depends on the context, i.e. on the values of other variables. An intelligent system can find out the exclusiveness by observation. Only then can it observe the frequency of values with respect to other values of the same variable.