A variable, the unit of information, is basically finite. But when thinking, i.e. simulating with generated variables that are mappable to all kinds of real variables, it is best not to fix size and precision until really needed. So one works with arbitrary precision, which, practically, is the meaning of infinity.

A flexible generated variable, like \(ℝ\), has infinite size. But by using it, one maps it to a specific, finite real variable. How can the finite variable be measured with the infinite \(ℝ\)?

This is the topic of integration and measure theory. We have a variable or space and a quantity associated to subsets of the space. This quantity adds up for the union of disjoint sets.

In this blog I follow this path a little further, with special interest in the probability measure.

Measure

Basically we can model an infinite space with a formal concept lattice (FCA) that is algorithmically refined ad infinitum. The nodes in the FCA lattice consist of intent and extent. We only look at one of these dual sets: the intent.

In the finite case, for each intent we can ask

  • How many elements are there?
  • How many elements are there that belong to some special variable(s)?

If such a question can be answered in an unambiguous way, then it defines a measure function on the (intent) set. The intents may contain elements not of interest to our question, i.e. to our measure. These do not count: their measure is 0. They are not part of the support of the measure.
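As a minimal sketch (the intents and the "interesting" variable below are my own toy sets, not from the post), both counting questions can be phrased as measure functions on Python sets::

    def count(intent):
        """How many elements are there? (counting measure)"""
        return len(intent)

    def count_of_interest(intent, interesting):
        """How many elements belong to the special variable(s)?
        Elements outside `interesting` have measure 0: they are not in the support."""
        return len(intent & interesting)

    intent = {"red", "round", "sweet", "heavy"}
    interesting = {"red", "green", "blue"}         # a hypothetical color variable
    print(count(intent))                           # 4
    print(count_of_interest(intent, interesting))  # 1 -> only "red" counts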

For infinite variables we start with a measure function.

Many quantities encountered when describing the physical world are values of a measure (function) \(μ\) on a measure space \((V,Σ)\) (\((V,Σ,μ)\)):

\[\begin{split}q = μ(σ) ∈ ℝ^+,\quad σ ∈ Σ\\ μ(σ_1 ∪ σ_2) = μ(σ_1)+μ(σ_2)\quad\text{if } σ_1 ∩ σ_2 = Ø\end{split}\]

The σ-algebra \(Σ ⊂ 2^V\) is a set algebra with the additional requirement of being closed under countably infinite unions and under complementation. Two sets of \(Σ\) can overlap, but the measure counts the overlap only once, by summing only over the disjoint sets resulting from intersection and complementation.
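For a finite \(V\) this can be checked directly; in the sketch below (my own toy example) the power set serves as \(Σ\) and the counting measure as \(μ\)::

    from itertools import chain, combinations

    V = {1, 2, 3, 4}

    def sigma_algebra(V):
        """On a finite V the power set is the largest σ-algebra."""
        s = list(V)
        return [frozenset(c) for c in
                chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))]

    mu = len                                   # counting measure

    s1, s2 = frozenset({1, 2}), frozenset({3})
    assert not (s1 & s2)                       # disjoint ...
    assert mu(s1 | s2) == mu(s1) + mu(s2)      # ... so the measure adds up

    t1, t2 = frozenset({1, 2}), frozenset({2, 3})
    # overlapping sets: the shared element is counted only once in the union
    assert mu(t1 | t2) == mu(t1) + mu(t2) - mu(t1 & t2)
    assert len(sigma_algebra(V)) == 2 ** len(V)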

Extensive variable

The value of the measure \(μ(σ)\) is determined by \(σ ∈ Σ ⊂ 2^V\), i.e. \(μ(σ)\) is determined by a set. Although the \(σ\)'s are not exclusive, \(μ\) contains the intelligence not to count overlapping parts twice. If \(μ(σ_1)+μ(σ_2) > μ(σ_1∪σ_2)\), then there is an overlap. If the exclusive values of a variable are sets, I’ve called it an `extensive variable`_. A measure on an infinite variable \(V\) splits \(V\) into disjoint sets and thus fits into this definition. But many different \(σ_i,σ_j\) yield the same \(μ(σ_i)+μ(σ_j)\), i.e. the measure loses information. What remains is the measure value of the underlying `extensive values`_ (tautology).

Examples of measures are

  • The count. This exists only for finite variables. The count \(C\) of a variable is a measure of its inherent information: one needs \(\text{lb}\,C\) binary \(\{0,1\}\) variables combined to reproduce the same count, which is why the count is of relevance to information processing systems (see the sketch after this list). Nature, itself a big computer, also combines small variables to create big ones.
  • The analogue of the count for infinite variables is the σ-finite Lebesgue measure (length, area, volume).
  • One can generalize a measure to have direction: complex measure and vector measure.
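A one-liner illustrating the first bullet (the count \(C\) below is my own example number): \(\text{lb}\,C\) binary variables suffice to rebuild a variable with \(C\) values::

    from math import ceil, log2

    C = 1000                 # count of a finite variable
    bits = log2(C)           # lb C ≈ 9.97
    print(bits, ceil(bits))  # 10 binary {0,1}-variables: 2**10 = 1024 >= 1000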

Measure is the integral already

The summing over \(Σ\) is already in the measure: \(μ=∫_Σdμ\)

\(μ(σ) = ∫_Σ μ'(σ)dσ\) corresponds to the Riemann integral, if the \(σ\)'s are adjacent infinitesimal sets with a set size (length) defined (\(dx\)), e.g. via a metric. The measure, though, can be applied to a broader range of sets, i.e. to all sets of the σ-algebra.
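A sketch of this correspondence (the density and the interval are my own choices): \(μ(σ) = ∫_σ μ'(x)\,dx\), approximated by summing over adjacent small intervals of length \(dx\)::

    def measure(mu_prime, a, b, n=100000):
        """Riemann-style sum of the density mu_prime over [a, b]."""
        dx = (b - a) / n
        return sum(mu_prime(a + (i + 0.5) * dx) * dx for i in range(n))

    mu_prime = lambda x: 2 * x           # a density on [0, 1]
    print(measure(mu_prime, 0.0, 1.0))   # ≈ 1.0, the measure of the whole interval
    print(measure(mu_prime, 0.0, 0.5))   # ≈ 0.25, the measure of [0, 0.5]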

Radon-Nikodym derivative

When comparing two measures \(ν(σ)\) and \(μ(σ)\) for infinitesimal changes to \(σ\), ideally by one \(x∈V\), which entails infinitesimal changes \(dν=ν(\{x\})\) and \(dμ=μ(\{x\})\), there is a function expressing the ratio of the change:

\(f=dν/dμ\)

\(ν(σ)=∫_σ f\,dμ=∫_σ \frac{dν}{dμ}\,dμ = ∫_σ dν\)
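On a discrete space the derivative is just the ratio of point weights; a sketch with my own toy weights::

    mu = {"a": 1.0, "b": 2.0, "c": 3.0}          # measure μ as point weights
    nu = {"a": 0.5, "b": 4.0, "c": 1.5}          # measure ν as point weights

    f = {x: nu[x] / mu[x] for x in mu}           # f = dν/dμ pointwise

    sigma = {"a", "b"}
    nu_via_f = sum(f[x] * mu[x] for x in sigma)  # ∫_σ f dμ
    assert abs(nu_via_f - sum(nu[x] for x in sigma)) < 1e-12   # equals ν(σ)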

Probability

Probability \(P\) is a measure with the extra requirement \(P(V)= ∫_V dP(v) = ∫_V P'(v)dv = 1\).

The domain of \(P\) is \(Σ ⊂ 2^V\), not singular values of \(V\). There is no topology required for \(V\): probability arises by not looking at the details, i.e. at the topology. \(Σ\) is there only to satisfy the requirements of a measure. But a topology can induce the Borel σ-algebra.

\(P\) is the CDF (cumulative distribution function) and \(P'\) is the PDF (probability density function). A PDF only makes sense if there is a metric or at least a topology on \(V\) and \(dv\) is a measure on the `neighborhood`_ of \(v\). Note that there is no uniform PDF on an infinite \(V\).
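A sketch relating PDF and CDF for a density on \([0,2]\) (the triangular density below is my own example)::

    def pdf(v):                        # P'
        return v / 2.0                 # integrates to 1 on [0, 2]

    def cdf(v, n=10000):               # P([0, v]) = ∫_0^v P'(u) du
        du = v / n
        return sum(pdf((i + 0.5) * du) * du for i in range(n))

    print(cdf(2.0))    # ≈ 1.0  -> P(V) = 1
    print(cdf(1.0))    # ≈ 0.25 -> probability of the event [0, 1]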

  • \(V\) in probability jargon is called sample space or set of outcomes.
  • \(v ∈ Σ\) are called events.
  • The average bits or nats needed to reference a value of a variable \(V\) is the information of the variable \(V\). With the measure we can define this for the infinite case: \(I(V)=\sum N_i(\log N - \log N_i)/N = -\sum p_i\log p_i \rightarrow I(V) = -∫ P'\log P'\,dv = -∫\log P'\,dP\). In physics this is called entropy.
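A quick numeric check of the finite formula (the counts are my own toy numbers), in nats::

    from math import log

    N_i = [5, 3, 2]                    # occurrences per value
    N = sum(N_i)
    I1 = sum(n * (log(N) - log(n)) for n in N_i) / N
    I2 = -sum((n / N) * log(n / N) for n in N_i)
    print(I1, I2)                      # both ≈ 1.0297 nats: the same entropy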

Random variable

A measurable function \(X:V→ℝ\) (also written \((V,Σ,X)\)) is a random variable on \(V\), if there is also a probability measure \(P:Σ→[0,1]\) (\((V,Σ,P)\)) defined. A finite numeric sketch of the following formulas comes after this list.

  • \(P(x)≡P(X^{-1}(x))\)
  • \(E(XY) = (X,Y) = ∫_V XYdP\) defines an inner product
  • \(E(X) = (X,1) = ∫_V XdP\); the constant function \(E(X)·1\) is the projection of \(X\) onto \(1\)
  • \(E((X-E(X))^2)\) is the variance and its square root is the standard deviation, which corresponds to the Euclidean distance \(d(X,E(X))\).
  • \((1,X-E(X)) = 0\), i.e. \(X_o=X-E(X)\) and the constant \(1\) (and thus \(E(X)·1\)) are orthogonal
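A finite sketch of these formulas (outcomes, probabilities and values of \(X\) are my own toy numbers)::

    outcomes = ["v1", "v2", "v3"]
    P = {"v1": 0.2, "v2": 0.5, "v3": 0.3}        # probability measure
    X = {"v1": 1.0, "v2": 4.0, "v3": 10.0}       # random variable

    def inner(F, G):                             # (F, G) = ∫_V F G dP
        return sum(F[v] * G[v] * P[v] for v in outcomes)

    one = {v: 1.0 for v in outcomes}
    EX = inner(X, one)                           # E(X)
    Xo = {v: X[v] - EX for v in outcomes}        # centered X
    var = inner(Xo, Xo)                          # E((X - E(X))^2)
    print(EX, var, var ** 0.5)                   # mean, variance, standard deviation
    assert abs(inner(one, Xo)) < 1e-12           # (1, X - E(X)) = 0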

Occurrences and events

In the information blog I’ve avoided the name event. It sounds like occurrence, but what is meant is very different.

Occurrences:Sets that can be intersected to contain exclusively one value \(v\) of the variable \(V\). These are the FCA intents mentioned above, which correspond to a topology on \(V\) that defines the values \(v\). Borel sets augment the topology by countable intersections and complements, giving rise to the Borel σ-algebra of events.
Events:Sets that only consist of values of \(V\), combining alternative occurrences and thus forming the basis for the probability summation.

Occurrences lead to the definition of probability by just counting them, without looking at the details of other elements in the occurrence. Let \(O_v\) be the set of occurrences of \(v\) and \(O_V=\bigcup_v O_v\), then \(P(v) = |O_v| / |O_V|\). The **exclusiveness is necessary** for \(\sum_v P(v) = 1\) (\(∫ dP(v)=1\)), else \(|\bigcup_v O_v| < \sum_v |O_v|\). \(O_V=\bigcup_v O_v\) requires the \(O_v\) to be finite (a finite measure on \(V\)), while \(v\) itself might be an infinite set. In the latter case, in order for \(O_V\) to be finite, most \(|O_v|=0\), which also entails \(P(v)=0\) for most \(v\).

Events summarize occurrences, i.e. probabilities, over \(v\), by counting all occurrences that contain either this value of \(V\) or that value of \(V\) (summarized in the event). This gives rise to probability \(P\). \(P\) adds up for alternatives of the same variable.
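A counting sketch (the occurrences and values are made up): each occurrence contains exactly one of the exclusive values of \(V\), and an event combines alternative values::

    occurrences = [
        {"red", "round"}, {"red", "heavy"}, {"green", "round"},
        {"green", "light"}, {"green", "sweet"}, {"blue", "round"},
    ]
    values = ["red", "green", "blue"]            # exclusive values of V

    O = {v: [o for o in occurrences if v in o] for v in values}
    P = {v: len(O[v]) / len(occurrences) for v in values}
    print(P)                                     # red: 1/3, green: 1/2, blue: 1/6
    assert abs(sum(P.values()) - 1.0) < 1e-12    # exclusiveness -> Σ P(v) = 1

    # the event "red or blue" sums the probabilities of its alternatives
    print(P["red"] + P["blue"])                  # 0.5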

Fisher information is not information

Distinguish entropy, which is information, from Fisher information, which is not information and should rather be called Fisher [score/inverse] variance. The score \(∂_θ\ell(x;θ)=∂_θ\log p(x;θ)\) is the change of information at \(x ∈ X\) when modifying \(θ\). The first moment vanishes, \(E(∂_θ\ell)=∫∂_θ\ell\, p(x;θ)dx=0\); the second moment \(F(θ)=E((∂_θ\ell)^2)=∫ ∂_θ\ell\,∂_θ\ell\, p(x;θ)dx\) is the Fisher information. \(F(θ)\) has as lower bound the inverse variance of the variable \(θ\) (not the value \(θ\)), i.e. of an unbiased estimator of \(θ\) (Cramér-Rao): \(F(θ) \geq 1/\text{Var}(θ)\). A \(θ\) where \(\text{Var}(θ)\) is small makes \(F(θ)\) big. The biggest \(F(θ)\) marks the best \(θ\), because it has the biggest influence on \(p(x;θ)\). If we have two parameters \(θ\) and \(φ\) (or more), then \(F(θ,φ)=∫ ∂_θ\ell\,∂_φ\ell\, p(x;θ,φ)dx\) is an inner product for the scores (= `tangent space`_ on \((x;θ,φ)\)) that does not depend on the choice of parametrization by \(θ\) and \(φ\).
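A sketch for a single Bernoulli(\(θ\)) observation (my own toy setup, not from the post): the score’s second moment gives \(F(θ)\), and for the maximum-likelihood estimator \(\hatθ = x\) the Cramér-Rao bound is attained::

    theta = 0.3

    def score(x, theta):                     # ∂_θ log p(x; θ) for Bernoulli
        return x / theta - (1 - x) / (1 - theta)

    p = {1: theta, 0: 1 - theta}
    F = sum(score(x, theta) ** 2 * p[x] for x in (0, 1))   # E((∂_θ ℓ)^2)
    var_mle = theta * (1 - theta)            # Var(x), the variance of θ̂ for n = 1
    print(F, 1 / var_mle)                    # both ≈ 4.76: F(θ) = 1/Var(θ̂) here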

Product Measure

The product measure is defined as

\[\begin{split}(μ_A×μ_B)(A×B)=μ_A(A)μ_B(B)\\ A ∈ Σ_A\\ B ∈ Σ_B\end{split}\]

All combinations in \(A×B\) do occur. This is complete independence of the two variables \(A\) and \(B\).

  • If values are exclusive, then they belong to the same variable.
  • For a functional dependence only the couples that are elements of the function do occur.
  • We have a relation if only part of the \(A×B\) space is filled. With a probability measure this means that certain \(P(a×b)=0\) and other \(P(a×b)\neq P(a)P(b)\), but rather \(P(a×b) = P(a|b)P(b) = P(b|a)P(a)\), i.e. the second event’s probability depends on the first one (\(P(a|b)\) or \(P(b|a)\)). If \(P(a×b)=0\), then \(P(a|b)=P(b|a)=0\): \(a\) does not occur if \(b\) occurs. But for another \(b_1\) that might be different, i.e. we cannot add \(a\) as another value to the \(B\) variable, but we can still say that on \(A×B\) we have less than maximum information \(I\): \(I(A×B) = -∫ \log P'(a×b)\,dP(a×b) \leq -∫ \log(P'(a)P'(b))\,P'(a)P'(b)\,da\,db = I(A)+I(B)\) (checked numerically in the sketch after this list)
  • For complete independence we have the above product space. This might happen in a common context, but also if the contexts are completely unrelated, e.g. by taking variables from different, unrelated times and/or from unrelated places; then the value couples will fill all of \(A×B\), and specifically for probability \(P(a×b) = ∫ dP(a×b) = ∫ da\, P'(a)P(b|a) = ∫ db\, P'(b)P(a|b) = P(a)P(b)\), with \(a ⊂ A\) and \(b ⊂ B\) and \(P(b|a) = ∫_{\beta ∈ b} dP(\beta|a)\).
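A discrete sketch of the information inequality from the list above (entropy in nats; the two example joints are mine): an independent joint attains \(I(A)+I(B)\), a dependent one stays below::

    from math import log

    def I(p):                               # -Σ p log p over positive entries
        return -sum(q * log(q) for q in p if q > 0)

    def marginals(joint):
        pa = [sum(row) for row in joint]
        pb = [sum(col) for col in zip(*joint)]
        return pa, pb

    for joint in ([[0.25, 0.25], [0.25, 0.25]],   # independent: P(a×b) = P(a)P(b)
                  [[0.45, 0.05], [0.05, 0.45]]):  # dependent joint
        pa, pb = marginals(joint)
        flat = [q for row in joint for q in row]
        print(I(flat), I(pa) + I(pb))             # I(A×B) <= I(A) + I(B)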