Wednesday, April 19, 2017

Complex Inner Product

Along the line of information, variable and measure, there are two products that make sense on extensive values. Each such value can be regarded as a variable of elements.

First the complex numbers are revisited, then the actually more fundamental dot product and exterior product are motivated via the projection of a vector, and finally the complex inner product is considered.

Complex Numbers

The real numbers only allow us to compare two extensive values of the same kind (addable and subtractable) by their measure:

\[\frac{v_2}{v_1} ∈ ℝ\]

In a vector space of any dimension, two subspaces \(V_1\) and \(V_2\) span a plane of \((v_1,v_2)\) combinations (\(v_1 ∈ V_1\) and \(v_2 ∈ V_2\)). In this plane two vectors can have the same direction or be orthogonal or something in between. To express this, in addition to the size ratio we need an angle (the word direction builds on angle, too). Or we can make a model of reality where the \(v_1\) direction is placed in the \(1\) axis (the real axis) and \(v_2/v_1\) has

  • one component in the \(1\) axis and
  • one component that is orthogonal to the \(1\) axis, by convention turned counterclockwise.

This orthogonal axis is nothing esoteric: it only expresses that there is a component of \(v_2\) not pointing to the direction of \(v_1\), i.e. not adding to \(v_1\). By naming the orthogonal direction \(i\) (imaginary unit) we keep addition separate.

Next

\[z = \frac{v_2}{v_1} = \frac{|v_2|}{|v_1|}(a+bi) = r(a+bi) ∈ ℂ\]

with \(a^2+b^2=1\), is a way to keep direction change and size ratio separate.

As with real numbers we can think of the fraction \(\frac{v_2}{v_1}\) as the definition of the complex numbers, in the sense that the complex number depicts the relation between \(v_2\) and \(v_1\), or better, the operation that turns \(v_1\) into \(v_2\).

One can express any real world \(v_1\) and \(v_2\) by ratios with some unit \(e\):

  • \(v_1=z_1e\) and \(v_2=z_2e\).

This way \(z_1\) and \(z_2\) stand for \(v_1\) and \(v_2\) in the same way as real numbers stand for quantities not compared to others of a different kind (not addable).

Multiplying \(v_1\) by \(z\) produces \(v_2\), i.e. it also produces a rotation. Another \(z\) will start from the last state. Specifically, if \(z=i\), reapplying it yields \(ii=-1\), because in a plane, orthogonal to orthogonal is the opposite direction.
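
As a quick sketch with hypothetical values, Python's built-in complex type shows \(z=v_2/v_1\) acting as a combined scaling and rotation, and \(ii=-1\):

```python
import cmath

# Hypothetical planar values encoded as complex numbers.
v1 = 2 + 0j          # placed on the real ("1") axis
v2 = 1 + 1j

z = v2 / v1          # encodes size ratio r and rotation angle phi
r, phi = cmath.polar(z)

assert abs(z * v1 - v2) < 1e-12     # applying z to v1 reproduces v2
assert (1j * 1j) == -1              # orthogonal to orthogonal: opposite direction
```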

Real ratio \(r\) and angle are related to the complex number \(z\) via

\[z = r (\cos φ + i \sinφ)\]

To reverse the \(z\) operation we take the inverse of \(z\):

\[\begin{split}\frac{1}{z} &= \frac{\bar z}{| z | ^2} \\ &= \frac{1}{r} (\cos φ - i \sin φ) \\ &= \frac{1}{r} (\cos(-φ) + i \sin(-φ))\end{split}\]

To get an intuitive understanding of why we can also write

\[z = e^{iφ}\]

we can

  • think that by multiplying with \(z\) the phase is added and by dividing the phase is subtracted

  • grow towards the direction of \(z\) from \(1\) by infinite infinitesimal changes

    \[(\cos\frac{φ}{∞}+i\sin\frac{φ}{∞})^∞ = (1+\frac{iφ}{∞})^∞ = e^{iφ}\]
  • derive \(z = r (\cos φ + i \sin φ)\) by \(φ\)

    \[\frac{∂ z}{∂φ} = i z\]

    and see that the solution of the differential equation is \(z=re^{iφ}\) (for \(r=1\): \(e^{iφ}\))
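
The "infinite infinitesimal changes" picture can be checked numerically, with a large \(n\) standing in for \(∞\) (illustrative angle):

```python
import cmath
import math

phi = 0.7            # arbitrary test angle
n = 10**6            # a large n standing in for "infinity"

# Grow from 1 towards e^{i phi} by n infinitesimal steps (1 + i phi/n).
grown = (1 + 1j * phi / n) ** n
assert abs(grown - cmath.exp(1j * phi)) < 1e-5

# The trigonometric form of the same limit.
also = complex(math.cos(phi / n), math.sin(phi / n)) ** n
assert abs(also - cmath.exp(1j * phi)) < 1e-7
```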

numbers vs geometric algebra

If we had given \(1\) a more concrete dimension \(e_1\) and had named the unit of the orthogonal dimension \(e_2\), then with \(I = e_1 e_2 = e_1 ⋅ e_2 + e_1 ∧ e_2 = e_1 ∧ e_2\) we would have got \(e_1 I = - I e_1 = e_2\). This is the geometric algebra approach, which does not abstract the unit: Then \(I=e_1e_2\) is different from a \(J=u_1u_2\). \(Iu_1\) doesn’t have any meaning. For (complex) numbers, on the other hand, we store separately, what they refer to, i.e. whether we can add or multiply two of them or not.

\(z_1\bar{z_2}=r_1r_2(\cos(φ_1-φ_2)+i\sin(φ_1-φ_2))\) is the area (\(r_1r_2\)) projected onto the \(1\) and the \(i\) axis respectively:

  • all on \(i\) means \(φ_1=φ_2+π/2\), i.e. different kind, fully combinable, combination of elements = enclosed area
  • all on \(1\) means \(φ_1=φ_2\), i.e. same kind, not combinable, zero enclosed area

For a \(z\) alone,

  • \(i r \sin φ\) gives the projection onto \(i\) orthogonal to \(1\), i.e. the part of \(z\) combinable with \(1\), but not addable to \(1\) (enclosed area, exterior product)
  • \(r \cos φ\) gives the projection of \(r\) onto \(1\), i.e. the part of \(z\) not combinable with \(1\), but addable to the \(1\) direction. And (repeated) adding is given by multiplication with a real number (dot product)
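
The decomposition of \(z_1\bar{z_2}\) into a dot (projection) part and a wedge (area) part can be checked with illustrative values:

```python
import cmath
import math

# Illustrative planar vectors encoded as complex numbers.
z1 = complex(2, 1)
z2 = complex(1, 3)

p = z1 * z2.conjugate()

# Real part: the dot product of the two 2D vectors (projection part).
assert abs(p.real - (z1.real * z2.real + z1.imag * z2.imag)) < 1e-12

# Imaginary part: the signed area r1*r2*sin(phi1 - phi2) (wedge part).
r1, phi1 = cmath.polar(z1)
r2, phi2 = cmath.polar(z2)
assert abs(p.imag - r1 * r2 * math.sin(phi1 - phi2)) < 1e-12
```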

Further down the idea to combine dot product and exterior product will turn up again.

Dot Product

I take a step back and assume there is no dot product yet.

A measure function \(μ\) is additive for the union of disjoint subsets of its support variable. With this addition, one or more such variables form a vector space. So a vector space is like a collection of extensive variables that are more or less orthogonal and added separately for each variable.

Two extensive values \(v_1\) and \(v_2\) are considered orthogonal, if the mutual σ-algebras form all possible combinations.

example

\(v_1\) could be a certain number of green balls and \(v_2\) a certain number of red balls. The respective σ-algebra consists of subsets of either green balls or red balls. An element of the combined \(Σ_1×Σ_2\) would be a certain number of red balls together with a certain number of green balls.

Orthogonality with probability

Two variables \(V_1\) and \(V_2\) are independent, if \(P_{12}(σ_1×σ_2) = P_1(σ_1)P_2(σ_2)\) (product measure), \(σ_1⊂V_1\), \(σ_2⊂V_2\). Let \(t_i∈T_i\) name the experiments for \(V_i\): \(x_i=x_i(t_i)\). The PDF is then \(p_i(x_i) = |x_i^{-1}(x_i)| / |T_i|\).

Every \(t_1\) combines with every \(t_2\). \(|T_1| |T_2|\) is the size of the rectangle in the product experiment space. \(p_1(x_1)p_2(x_2)∈[0,1]\) is a fraction of that rectangle.

For independent and identically distributed: \(P_1 = P_2 = P\).

The product measure \(|v_1||v_2|\) corresponds to the enclosed area.

locality

The subsets of \(v_1\) and \(v_2\) combine not because they are from unrelated contexts but due to the local topology or metric.

We can think of starting from a point and going a number of neighborhoods (distance) along one variable \(v_1\) (in one direction) and a number of neighborhoods along the other variable \(v_2\) (the other direction).

We can make all linear combinations of steps in either direction to reach all possible points in the rectangle spanned by the two orthogonal variables.

Two non-orthogonal vectors that lead away from a starting point can be decomposed into parallel and orthogonal set components via a projection.

example

A mixture of red and green balls (=vector) would be projected on the red balls variable by removing the green balls.

  • The projection \(\mathcal{P}_{12}\) from \(v_1\) onto \(v_2\) defines the dot product via \(\mathcal{P}_{12}v_1v_2 = μ(\mathcal{P}_{12}Σ_1×Σ_2)\).
  • The projection \(\mathcal{I}-\mathcal{P}_{12}\) from \(v_1\) orthogonal to \(v_2\) defines the exterior product via \((\mathcal{I}-\mathcal{P}_{12})v_1v_2 = μ((\mathcal{I}-\mathcal{P}_{12})Σ_1×Σ_2)\)

Only with the dot product defined can one use Gram-Schmidt orthonormalisation.

Due to additivity a vector component \(a_1\) can be expressed by a number \(a^1\) and a unit vector \(e_1\): \(a_1 = a^1e_1\).

A general vector is a linear combination of the unit vectors.

\[\begin{split}a&=&a_1+a_2 = &a^1e_1+a^2e_2 \\ b&=&b_1+b_2 = &b^1e_1+b^2e_2\end{split}\]

Let’s multiply the two vectors by making all combinations of their orthogonal components:

\(ab=a_1b_1+a_1b_2+a_2b_1+a_2b_2 = a^1b^1e_1e_1+a^1b^2e_1e_2+a^2b^1e_2e_1+a^2b^2e_2e_2\)

Dot product and exterior product are complementary due to the mentioned projection, and it is a good idea to combine them in the geometric product. But if one defines the dot product \(e_1·e_2=e_2·e_1=0\) for orthogonal \(e_1\) and \(e_2\), and \(e_1∧e_1=e_2∧e_2=0\) for the same direction, then one can handle the operations separately. This is what is normally done, but it lets one easily forget their complementarity. The geometric product can then be expressed as \(ab=a·b+a∧b\).

The exterior product is quite intuitive via the area. It also makes sense that \(e_1∧e_2=-e_2∧e_1\), because then it gives the enclosed area when multiplying two general vectors \(a\) and \(b\).
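
In 2D the component formulas for dot and wedge, and their combination in one complex number, can be sketched with illustrative vectors:

```python
# Illustrative 2D vectors a = a1 e1 + a2 e2, b = b1 e1 + b2 e2.
a1, a2 = 3.0, 1.0
b1, b2 = 2.0, 4.0

dot   = a1 * b1 + a2 * b2      # uses e1·e1 = e2·e2 = 1, e1·e2 = 0
wedge = a1 * b2 - a2 * b1      # coefficient of e1∧e2, with e1∧e2 = -e2∧e1

# Read a and b as complex numbers: conj(a) * b combines both parts,
# dot as real part and wedge as imaginary part.
p = complex(a1, a2).conjugate() * complex(b1, b2)
assert abs(p.real - dot) < 1e-12
assert abs(p.imag - wedge) < 1e-12
```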

But why is the dot product a scalar?

  • The multiplication is over the same value and when a \(σ_1∈Σ_1\) is chosen, we have no further freedom to choose it again. So \(e_1e_1 = \mathcal{P}_{12}e_1e_1 = μ(σ_1)μ(σ_1)=1\) is a mere number.
  • It can be used to define a norm \(∥a∥=\sqrt{aa}\), the measure we talked about so far. Because \(∥e_1∥=1\), the square follows from the bilinearity.
  • The dot product with the unit vector gives the projection (\(v_1 · e_2 = \mathcal{P}_{12}(v_1)\)) and we can write \((v_1 · e_2)e_2 = v_1^2e_2\), so \(v_1^2=v_1·e_2\) had better be a scalar. The norm can also be seen as a projection: \(∥a∥=a·e_a=\sqrt{aa}\).
  • Adding extensive values of different kind (\(c=a+b\)) we get the Pythagorean theorem from the dot product: \(|c|^2=c·c=(a+b)·(a+b)=a^2+b^2\)

In geometric algebra, for 2D, the exterior product behaves like a scalar: it is called a pseudoscalar. Dot product and exterior product are mutually complementary, and in the complex numbers they are combined into one scalar.

The Complex Inner Product

For complex numbers the product \(\bar{z_1}z_2\) gives

  • a wedge (=area) part for the orthogonal projection multiplication (imaginary part)
  • a dot part for the projection multiplication (real part)

A complex number is regarded as a 2D vector and \(i\) transforms one component to the other, i.e. rotates by the right angle.

A complex vector \(v\) consisting of \(n\) complex numbers is isomorphic to a real vector of dimension \(2n\), because the real and imaginary parts are added independently.

The inner product of two complex vectors, defined as \(<v_1|v_2> = v_1·v_2 = Σ\bar{v_1^k}v_2^k\), accounts only for \(4n\) combinations (\(2n\) dot combinations: \(n\) {1,1}, \(n\) {i,i}; and \(2n\) wedge combinations {1,i}), and not for the \(2n·2n = 4n^2\) possible dot and wedge products between the components. When keeping dot and wedge separate, with \(e_k·e_l=0\) for \(k≠l\) it actually accounts for all dot combinations (\(2n+2n(2n-1)=4n^2\)). Even with \(e_k∧e_k=0\), though, it misses \(2n(2n-1)-2n=4n^2-4n=4n(n-1)\) of the wedge combinations.

This complex inner product is thus only applicable to cases that can be decomposed into \(n\) 2D spaces. Basically we are in a 2D space where the dot and wedge parts get accumulated separately.
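
The counting argument can be checked numerically: the real part of \(<v_1|v_2>\) is the full \(2n\)-dimensional real dot product, while the imaginary part collects only the \(n\) per-component wedges (illustrative vectors, \(n=2\)):

```python
import numpy as np

# Two illustrative complex vectors with n = 2 components (2n = 4 real dims).
v1 = np.array([1 + 2j, 3 - 1j])
v2 = np.array([2 - 1j, 1 + 1j])

inner = np.sum(np.conj(v1) * v2)

# Real part: the full 4-dimensional real dot product.
r1 = np.concatenate([v1.real, v1.imag])
r2 = np.concatenate([v2.real, v2.imag])
assert abs(inner.real - np.dot(r1, r2)) < 1e-12

# Imaginary part: only the n per-component {1, i} wedges, not all
# 4n(n-1) cross-component wedge combinations.
wedges = np.sum(v1.real * v2.imag - v1.imag * v2.real)
assert abs(inner.imag - wedges) < 1e-12
```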

This can be extended to square integrable complex function spaces (\(L^2\))

\[<φ|ψ> = ∫ \bar{φ}(x)ψ(x) dx\]

Replace \(x\) with \(x,t\) and \(dx\) with \(dxdt\) for time dependence.

Here every function value is regarded as an independent component. But then we have infinite dimension, which is only tractable with an approximation algorithm to get arbitrarily close (Cauchy). The approximation works best with orthonormal function components instead of value by value components:

\[<ψ_n(x)|ψ_m(x')> → δ_{mn}\,δ(x-x')\]

This condition combines orthonormality \(<ψ_n|ψ_m> = δ_{mn}\) with completeness \(Σ_n<ψ_n(x)|ψ_n(x')> → δ(x-x')\). The latter is the ability to approximate all functions in the \(L^2\) sense. \(δ(x-x')\) represents a point \(x\) in the \(L^2\) sense.

\[<x|ψ> = ∫δ(x-x')ψ(x')dx' = ψ(x)\]

It is not that \(ψ_n(x)\) is orthogonal to \(ψ_m(x)\) at any given \(x\); rather, the dot and the wedge part of \(\bar{ψ_n}(x)ψ_m(x)\) vanish through summation over the range of \(x\).
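
Orthonormality and this vanishing under summation can be checked numerically for the Fourier basis \(ψ_n(x)=e^{inx}/\sqrt{2π}\) on \([0,2π)\) (a numerical sketch, not a proof):

```python
import numpy as np

# Numerical sketch: Fourier basis psi_n(x) = exp(i n x)/sqrt(2 pi) on [0, 2 pi).
x = np.linspace(0, 2 * np.pi, 4096, endpoint=False)
dx = x[1] - x[0]

def psi(n):
    return np.exp(1j * n * x) / np.sqrt(2 * np.pi)

def inner(n, m):
    # <psi_n|psi_m> = integral of conj(psi_n) * psi_m dx (Riemann sum)
    return np.sum(np.conj(psi(n)) * psi(m)) * dx

assert abs(inner(3, 3) - 1) < 1e-9    # normalization
assert abs(inner(3, 5)) < 1e-9        # dot and wedge parts both vanish
```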

Quantum Mechanics

A particle is an identity defined by its linking (selection) of values from variables.

In quantum mechanics this selection is inherently imprecise. A selection is a finite interval of the infinitely close, infinitely precise values generated by our mind. An \(x∈ℝ\) is replaced by a bell shaped function whose foremost purpose is to introduce the imprecision that is inherent to the atomic scale. Note that this imprecision also motivates approximating physical (differential) equations.

The bell shaped function \(ψ(x)\) describes a state that holds together a neighborhood around a specific \(x\), i.e. a state is one imprecise value of a variable.

The development of the state (in time) is derived from the current localized state with differential operators. One shifts one’s attention from the values of the state function to the derivatives. Every one of these derivatives is an independent variable and together they make a vector space. A linear differential operator can now be described as a matrix linearly combining components of such a vector of derivatives. Via eigenvalue equations an operator gives rise to orthogonal functions best suited to approximate the state function through their superposition. I.e. instead of the vector of derivatives \((ψ', ψ'', ψ''',..)\) we can now use the vector of orthogonal eigenfunctions \((ψ_1, ψ_2, ψ_3,...)\) like the \(e^{ikx}\) eigenfunctions of the operator \(∂_x=\frac{∂}{∂x}\) (Fourier Transform). The spectral theorem says that all states can be approximated with the eigenfunctions. With orthonormal eigenfunctions as basis the state becomes a vector and the according operator is a diagonal matrix of the eigenvalues. An arbitrary operator is a non-diagonal matrix.

For \(∂_x\) the eigenfunctions are \(e^{ikx}\) and the eigenvalues are \(ik\). Multiplying with \(-iħ\) we get a real eigenvalue with physical content: the de Broglie momentum \(p=ħk=h/λ\). \(∂_t\) similarly leads to eigenfunctions \(e^{iωt}\) with eigenvalues \(iω\), but made real and physical through \(-iħ∂_t\): \(E=ħω=hν=h/T\).

Classical \(p^2/2m=H\) becomes the Schrödinger equation \(iħ∂_tψ=-\frac{ħ^2}{2m}Δ_xψ\). But classical mechanics is a macroscopic theory. It is better to start from the relativistically corrected wave equation \((∂_t^2/c^2-Δ_x+(mc/ħ)^2)ψ=0\) (Klein-Gordon equation) and approximate to first order in \(∂_t\). Making \(Δ_x\) first order (\(∂_x\)) as well yields the Dirac equation.

The Schrödinger equation leads to a continuity equation: \(∂_t|φ|^2=\bar{φ}∂_tφ+φ∂_t\bar{φ}=(\bar{φ}Δφ-φΔ\bar{φ})iħ/2m=\text{div}(\bar{φ}∂_xφ-φ∂_x\bar{φ})iħ/2m= -\text{div}\, J\)

Note that \(t\), other than \(x\), cannot be an operator in quantum mechanics. In quantum field theory this different treatment, which does not conform to special relativity, is resolved by making \(x\) a parameter like \(t\). String theory goes the other way and makes \(t\) a real variable.

probability amplitude

Normal probabilities are defined over one variable with exclusive values. Two variables give rise to the product probability. The variables are independent if \(P_{12}=P_1P_2\). This means that the variables combine to span a 2D space. Such a space is modeled with complex numbers. With complex \(P_1\) and \(P_2\) they don't need to coincide with the axes, but can point in any direction. Their enclosed angle can range from a right angle (orthogonal = completely independent) to zero (same direction = same variable). These complex probabilities are still the usual probabilities via their length (amplitude). The normalization is done on the product space, though, because the objective is to describe the relation between two states, whether independent or exclusive. The latter is the one-variable probability: \(∫\bar{φ}φdx=1\).

The probability can be interpreted via ensembles, but practically it follows from the equations of quantum mechanics, like the Schrödinger equation.

The function values of \(ψ(x)\) in quantum mechanics are probability amplitudes: a probability with direction, coded as a complex number and normally changing along \(x\). The values of two functions \(φ\) and \(ψ\) have a relative direction, and this too changes along \(x\). Generally the summation \(<φ|ψ>=∫\bar{φ}(x)ψ(x)dx\) results in a complex number, the sum of all the value projection multiplications, dot and wedge. If we set \(|ψ>=Q|φ>\), how much dot and how much wedge remains after summation is determined by the operator \(Q\), because \(<φ|φ>=1\) by itself.

  • A real expectation value \(<Q>\) means \(\bar{<Q>}=<Q> \equiv <φ|Qφ>=<Qφ|φ>\), i.e. each state can be projected onto the other with same result. With orthonormal \(Q\) eigenfunctions this can be expressed as \(<φ|q_n><q_n|φ>=\bar{q^n}q^n\). \(Q\) is called hermitian.
  • \(<Q>=0 \equiv <φ|Qφ>=<Qφ|φ>=0\). \(Q\) produces an orthogonal \(|ψ>=Q|φ>\). Because of the spectral theorem, there is an approximation for any such \(|ψ>=Σ_n|q_n><q_n|ψ>\).
  • An imaginary \(<Q>\) means \(\bar{<Q>}=-<Q> \equiv <φ|Qφ>=<-Qφ|φ>\). \(Q\) is antihermitian. \(|ψ>=Q|φ>\) is not compatible with \(|φ>\). In this case \(i\) or \(-i\) makes \(Q\) compatible again (e.g. \(ħ∂_x→-iħ∂_x\)).

In the Heisenberg picture states are fixed and the operators carry the time dependence. One works with operators instead of state functions. The product of two operators \(AB\) generally results in a complex expectation value \(<AB>\), fully or partially imaginary. The imaginary part, the summation of the wedge parts of the function values, can be extracted with the commutator: \(<[A,B]>/2i\). This imaginary part's expectation value is the joint uncertainty

\(ΔA^2ΔB^2=<(A-<A>)^2><(B-<B>)^2>≥<[A,B]/2i>^2\)

For \(A=x\) and \(B=p\), with \([x,p]=iħ\), this gives \(ΔxΔp≥ħ/2\).

The uncertainty is the interval of \(x\) or \(p\) values expressed by the bell shaped probability amplitude. The summation over the probability amplitudes preserves the idea of projecting either into each other (hermitian) or onto the complementary direction (antihermitian, uncertainty, e.g. \(ΔxΔp\)).
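
A numerical sketch of the minimal uncertainty product for a Gaussian amplitude (with \(ħ=1\) and an arbitrary width \(σ\)):

```python
import numpy as np

# Gaussian probability amplitude; hbar set to 1, sigma an arbitrary width.
hbar, sigma = 1.0, 0.7
x = np.linspace(-20, 20, 40001)
dx = x[1] - x[0]

psi = np.exp(-x**2 / (4 * sigma**2))
psi /= np.sqrt(np.sum(psi**2) * dx)          # normalize

var_x = np.sum(x**2 * psi**2) * dx           # <x^2> (here <x> = 0)

dpsi = np.gradient(psi, dx)
var_p = hbar**2 * np.sum(dpsi**2) * dx       # <p^2> = hbar^2 * integral psi'^2

# Gaussians saturate the bound: Delta x * Delta p = hbar / 2.
assert abs(np.sqrt(var_x * var_p) - hbar / 2) < 1e-3
```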

Measure


A variable, the unit of information, is basically finite. But when thinking, i.e. simulating with generated variables mappable to all kinds of real variables, it is best not to fix the size and precision until really needed. So one works with arbitrary precision, which practically is the meaning of infinity.

Having a flexible generated variable, like \(ℝ\), the size is infinite. But by using it, one maps it to a specific real finite variable. How to measure the finite variable with the infinite \(ℝ\)?

This is the topic of integration and measure theory. We have a variable or space and a quantity associated to subsets of the space. This quantity adds up for the union of disjoint sets.

In this blog I follow this path a little further, with special interest in the probability measure.

Measure

Basically we can model an infinite space with a (formal concept) lattice FCA algorithmically refined ad infinitum. The nodes in the FCA consist of intent and extent. We only look at one of these dual sets: the intent.

In the finite case, for each intent we can ask

  • How many elements are there?
  • How many elements are there, that belong to some special variable(s)?

If such a question can be answered in an unambiguous way, then it defines a measure function on the (intent) set. The intents may contain elements not of interest to our question, to our measure. Then they do not count, i.e. their measure is 0. They are not part of the support of the measure.

For infinite variables we start with a measure function.

Many quantities encountered when describing the physical world are values of a measure (function) \(μ\) on a measure space \((V,Σ)\) (\((V,Σ,μ)\)):

\[\begin{split}q = μ(σ) ∈ ℝ^+\\ σ ∈ Σ\\ μ(σ_1 ∪ σ_2) = μ(σ_1)+μ(σ_2)\\ σ_1 ∩ σ_2 = Ø\end{split}\]

The σ-algebra \(Σ ⊂ 2^V\) is a set algebra with the additional requirement of being closed under countably infinite unions and complementation. Two sets of \(Σ\) can overlap, but the measure is only counted once, by summing only over the disjoint sets resulting from intersection and complementation.
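
A minimal sketch with a counting measure on a finite set (hypothetical elements) illustrates additivity and the overlap rule:

```python
# Hypothetical finite variable V with the counting measure.
V = {'a', 'b', 'c', 'd', 'e'}

def mu(sigma):
    # count only elements of V; anything else has measure 0 (no support)
    return len(sigma & V)

s1, s2 = {'a', 'b'}, {'c', 'd'}
assert s1 & s2 == set()                   # disjoint
assert mu(s1 | s2) == mu(s1) + mu(s2)     # additivity

s3 = {'b', 'c'}                           # overlaps s1
assert mu(s1) + mu(s3) > mu(s1 | s3)      # overlap is counted only once
```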

Extensive variable

The value of the measure \(μ(σ)\) is determined by \(σ ∈ Σ ⊂ 2^V\), i.e. \(μ(σ)\) is determined by a set. Although the \(σ\)'s are not exclusive, \(μ\) contains the intelligence not to count overlapping parts twice. If \(μ(σ_1)+μ(σ_2) > μ(σ_1∪σ_2)\), then there is an overlap. If the exclusive values of a variable are sets, I've called it an extensive variable. A measure on an infinite variable \(V\) splits \(V\) into disjoint sets and thus fits into this definition. But there can be several \(i,j\) with the same \(μ(σ_i)+μ(σ_j)\), i.e. the measure loses information. What remains is the measure value of the underlying extensive values (tautology).

Examples of measures are

  • The count. This exists only for finite variables. The count \(C\) of a variable is a measure for its inherent information: \(\text{lb}\,C\) is the number of \(\{0,1\}\) combinations (bits) one needs to reproduce the same count, and is thus of relevance to information processing systems. Nature, itself a big computer, also combines small variables to create big ones.
  • The analogue to the count for infinite variables is the σ-finite Lebesgue measure (length, area, volume).
  • One can generalize a measure to have direction: complex measure and vector measure.

Measure is the integral already

The summing over \(Σ\) is already in the measure: \(μ=∫_Σdμ\)

\(μ(σ) = ∫_σ μ'(σ)dσ\) corresponds to the Riemann integral, if the \(σ\)'s are adjacent infinitesimal sets with a set size (length) defined (\(dx\)), e.g. via a metric. The measure, though, can be applied to a broader range of sets, i.e. to all sets satisfying the σ-algebra.

Radon-Nikodym derivative

When comparing two measures \(ν(σ)\) and \(μ(σ)\) for infinitesimal changes to \(σ\), ideally by one \(x∈V\), which entails infinitesimal changes \(dν=ν(\{x\})\) and \(dμ=μ(\{x\})\), then there is a function expressing the ratio of the change:

\(f=dν/dμ\)

\(ν(σ)=∫_σfdμ=∫_σdν/dμ dμ = ∫_σdν\)
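
A discrete sketch of the Radon-Nikodym derivative, with both measures given as point masses (illustrative values):

```python
# Two measures given as point masses on a finite V (illustrative values).
V = ['a', 'b', 'c']
dmu = {'a': 1.0, 'b': 2.0, 'c': 4.0}
dnu = {'a': 0.5, 'b': 3.0, 'c': 2.0}

f = {v: dnu[v] / dmu[v] for v in V}       # f = dnu/dmu pointwise

def nu(sigma):
    return sum(dnu[v] for v in sigma)

def integral_f_dmu(sigma):
    # nu(sigma) = integral over sigma of f dmu
    return sum(f[v] * dmu[v] for v in sigma)

sigma = {'a', 'c'}
assert abs(nu(sigma) - integral_f_dmu(sigma)) < 1e-12
```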

Probability

Probability \(P\) is a measure with the extra requirement \(P(V)= ∫_V dP(v) = ∫_V P'(v)dv = 1\).

The domain of \(P\) is \(Σ ⊂ 2^V\), not single values of \(V\). There is no topology required for \(V\); probability arises by not looking at the details, i.e. at the topology. \(Σ\) is there only to satisfy the requirements of a measure. But a topology can induce the Borel σ-algebra.

\(P\) is the CDF (cumulative distribution function) and \(P'\) is the PDF (probability density function). A PDF only makes sense if there is a metric or at least a topology on \(V\) and \(dv\) is a measure on the neighborhood of \(v\). Note that there is no uniform PDF on an infinite \(V\).

  • \(V\) in probability jargon is called sample space or set of outcomes.
  • \(σ ∈ Σ\) are called events.
  • The average bits or nats needed to reference a value of a variable \(V\) is the information of the variable \(V\). With the measure we can define this for the infinite case: \(I(V)=\sum N_i(\log N - \log N_i)/N = -\sum p_i\log p_i \rightarrow I(V) = -∫ P'\log P'dv= -∫\log P' dP\). In physics this is called entropy.
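
A finite-case sketch of the information formula, with hypothetical counts \(N_i\):

```python
import math

# Hypothetical occurrence counts N_i for the values of a finite variable.
counts = [5, 3, 2]
N = sum(counts)
p = [n / N for n in counts]

# I(V) = sum N_i (log N - log N_i) / N = -sum p_i log p_i   (here in bits)
I = -sum(pi * math.log2(pi) for pi in p)

assert I > 0
assert I <= math.log2(len(counts)) + 1e-12   # the uniform case is maximal
```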

Random variable

A measure \(X:Σ→ℝ\) (also \((V,Σ,X)\)) is a random variable on \(V\), if there is also a probability measure \(P:Σ→[0,1]\) (\((V,Σ,P)\)) defined.

  • \(P(x)≡P(X^{-1}(x))\)
  • \(E(XY) = (X,Y) = ∫_V XYdP\) defines an inner product
  • \(E(X) = (X,1) = ∫_V XdP\); the projection of \(X\) onto the constant \(1\) is \((X,1)1\)
  • \(E((X-E(X))^2)\)
    is the variance, and its square root is the standard deviation, which corresponds to the Euclidean distance \(d(X,E(X))\).
  • \((1,X-E(X)) = 0\), i.e. \(X_o=X-E(X)\) is orthogonal to the constant \(1\)
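
A discrete sketch of these inner-product relations, with illustrative probabilities and values: under \((A,B)=E(AB)\), the centered \(X-E(X)\) is orthogonal to the constant \(1\), and the variance is the squared distance to \(E(X)\):

```python
# Illustrative discrete probability space and random variable.
P = [0.2, 0.5, 0.3]             # probabilities, sum to 1
X = [1.0, 2.0, 4.0]             # values of X

def inner(A, B):
    # (A, B) = E(AB) = sum A*B dP
    return sum(a * b * p for a, b, p in zip(A, B, P))

one = [1.0] * len(X)
EX = inner(X, one)              # E(X) = (X, 1)
Xo = [x - EX for x in X]        # centered X

assert abs(inner(Xo, one)) < 1e-12       # X - E(X) is orthogonal to 1
var = inner(Xo, Xo)                      # variance = squared distance to E(X)
assert abs(var - (inner(X, X) - EX**2)) < 1e-12
```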

Occurrences and events

In the information blog I've avoided the name event, because it sounds like occurrence, but what is meant is very different.

Occurrences: Sets that can be intersected so as to contain exclusively one value \(v\) of the variable \(V\). These are the FCA intents mentioned above, which correspond to a topology on \(V\) that defines the values \(v\). Borel sets augment the topology by countable intersections and complements, giving rise to the Borel σ-algebra of events.
Events: Sets that only consist of values of \(V\), combining alternative occurrences and thus forming the basis for the probability summation.

Occurrences lead to the definition of probability by just counting them, without looking at the details of other elements in the occurrence. Let \(O_v\) be the set of occurrences of \(v\) and \(O_V=\bigcup_v O_v\), then \(P(v) = |O_v| / |O_V|\). The **exclusiveness is necessary** for \(\sum_v P(v) = 1\) (\(∫ dP(v)=1\)), else \(|\bigcup_v O_v| < \sum_v |O_v|\). \(O_V=\bigcup_v O_v\) requires the \(O_v\) to be finite (a finite measure on \(V\)), while \(v\) itself might be an infinite set. In the latter case, in order for \(O_V\) to be finite, most \(|O_v|=0\), which also entails \(P(v)=0\) for most \(v\).

Events summarize occurrences, i.e. probabilities, over \(v\), by counting all occurrences that contain either this value of \(V\) or that value of \(V\) (summarized in the event). This gives rise to the probability \(P\). \(P\) adds for alternatives of the same variable.

Fisher information is not information

Distinguish entropy, which is information, from Fisher information, which is not information and should rather be called Fisher [score/inverse] variance. The score \(∂_θ\ell(x;θ)=∂_θ\log p(x;θ)\) is the change of information at \(x ∈ X\) when modifying \(θ\). The first moment vanishes, \(E(∂_θ\ell)=∫∂_θ\ell\, p(x;θ)dx=0\); the second moment \(F(θ)=E((∂_θ\ell)^2)=∫ ∂_θ\ell\,∂_θ\ell\, p(x;θ)dx\) is the Fisher information. \(F(θ)\) has as lower bound the inverse variance of the variable \(θ\) (not the value \(θ\)) (Cramér-Rao): \(F(θ) \geq 1/\text{Var}(θ)\). A \(θ\) where \(\text{Var}(θ)\) is small makes \(F(θ)\) big. The biggest \(F(θ)\) marks the best \(θ\), because it has the biggest influence on \(p(x;θ)\). If we have two parameters \(θ\) and \(φ\) (or more), then \(F(θ,φ)=∫ ∂_θ\ell\,∂_φ\ell\, p(x;θ,φ)dx\) is an inner product for the scores (= tangent space at \((x;θ,φ)\)) that does not depend on the parameters \(θ\) and \(φ\).
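
As a numerical sketch (a hypothetical Gaussian model with unknown mean \(θ\)), the two moments of the score can be checked on a grid:

```python
import numpy as np

# Gaussian model p(x; theta) with unknown mean theta and fixed sigma.
sigma, theta = 2.0, 0.5
x = np.linspace(theta - 12 * sigma, theta + 12 * sigma, 100001)
dx = x[1] - x[0]

p = np.exp(-(x - theta)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
score = (x - theta) / sigma**2                  # d/dtheta log p(x; theta)

first = np.sum(score * p) * dx                  # E(score) = 0
F = np.sum(score**2 * p) * dx                   # Fisher information

assert abs(first) < 1e-9
assert abs(F - 1 / sigma**2) < 1e-6             # for this model F = 1/sigma^2
```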

Product Measure

The product measure is defined as

\[\begin{split}(μ_A×μ_B)(A×B)=μ_A(A)μ_B(B)\\ A ∈ Σ_A\\ B ∈ Σ_B\end{split}\]

All combinations of \(A×B\) do occur. This is a complete independence of the two variables \(A\) and \(B\).

  • If values are exclusive, then they belong to the same variable.
  • For a functional dependence only the couples that are elements of the function do occur.
  • We have a relation if only part of the \(A×B\) space is filled. With a probability measure this means that certain \(P(a×b)=0\) and other \(P(a×b)\neq P(a)P(b)\), but rather \(P(a×b) = P(a|b)P(b) = P(b|a)P(a)\), i.e. the second event's probability depends on the first one (\(P(a|b)\) or \(P(b|a)\)). If \(P(a×b)=0\), then \(P(a|b)=P(b|a)=0\): \(a\) does not occur if \(b\) occurs. But for another \(b_1\) that might be different, i.e. we cannot add \(a\) as another value to the \(B\) variable, but we can still say that on \(A×B\) we have less than maximum information: \(I(A×B) = -∫ \log P'(a×b)dP(a×b) \leq -∫ \log(P'(a)P'(b))P'(a)P'(b)dadb = I(A)+I(B)\)
  • For complete independence we have the above product space. This might be in a common context, but also if the contexts are completely unrelated, e.g. by taking variables from different unrelated times and/or from unrelated places, the value couples will fill all of \(A×B\) and specifically for probability \(P(a×b) = ∫ dP(a×b) = ∫ da P'(a)P(b|a) = ∫ db P'(b)P(a|b) = P(a)P(b)\), with \(a ⊂ A\) and \(b ⊂ B\) and \(P(b|a) = ∫_{\beta ∈ b} dP(\beta|a)\).
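
A minimal sketch of the product measure for two independent finite variables (illustrative probability tables):

```python
# Illustrative probability tables for two independent finite variables.
P_A = {'x': 0.3, 'y': 0.7}
P_B = {'u': 0.6, 'v': 0.4}

# Product measure: every combination occurs, with P(a x b) = P(a) P(b).
P_AB = {(a, b): P_A[a] * P_B[b] for a in P_A for b in P_B}

assert abs(sum(P_AB.values()) - 1.0) < 1e-12    # still a probability measure

# Independence: the conditional equals the marginal, P(a|b) = P(a).
for (a, b), pab in P_AB.items():
    P_b = sum(q for (_, b2), q in P_AB.items() if b2 == b)
    assert abs(pab / P_b - P_A[a]) < 1e-12
```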

Infinity


So far I’ve blogged only about the finite variable and a topological structure (FCA) on it:

  • a variable is a set of exclusive values, created by a selection process
  • variables live in a lattice of set containment (topology, FCA)
  • an extensive value is a variable by itself

The last kind I explore in a separate measure blog.

Here I want to ponder over infinity, how it can be integrated into the current understanding and what good it is for.

What is infinity?

Finite variables have finite information. Do infinite variables have infinite information? That would mean that a computer taking up all the universe would not suffice to store the values of an infinite variable or to select a value thereof. Infinity in this sense does not exist, and therefore one can also not make any statements or conclusions that involve infinity.

The universe can be regarded as a parallel computer consisting of a myriad of, but still finitely many, selection processes happening in parallel. Each of these selections is from a finite variable.

We, as an organism that survived by evolving adaptability through a brain able to map the world and simulate it, need to be able to generate internal variables that can be mapped to the real ones.

Paradigm change to algorithms: generated variables

Instead of regarding actual variables with a finite number of values, we now start to look at algorithms that generate variables. E.g. \(ℕ\) is generated by looping +1 (=(+1)*). With this, the size of a variable (space complexity) can be moved to the time domain (time complexity). This is not only a feature of the information processing brain but also of the world itself (\(ΔEΔt≥h\)) (action, \(h\)).

A generated variable is versatile. It can be generated to the appropriate size, whatever the physical variable asks for.

Infinity is a construct of the mind and not of physical reality. Nothing in reality is infinite, neither time nor space nor anything else. Infinity is not applicable to real variables, but only to the generation of variables, i.e. to the generation algorithm. The infinity that is meant in mathematics is a loop in an algorithm that is ended when the wanted magnitude or precision is reached.

Our numbers are such a generated variable: an algorithm to create a multitude mappable to all kinds of physical variables. They are infinite, but this only means that our only limitation is the available time or space:

  • numbers of \(ℕ\), generated by \((+1)*\), can be selected (written down) only if not too large
  • additionally, a number in the continuum, the real numbers \(ℝ\), can be selected only with a limited precision.

Infinity

Infinity is a practical notation (for the deferred decision about the stop time) of an algorithm with at least one loop.

The continuum \(ℝ\) is not only generated by an algorithm but also consists of algorithms. Operations on extensive physical values (quantities) are mapped to operations on the numbers and then made part of the numbers to form reusable algebraic structures like groups (addition) and fields (addition and multiplication).

  • Addition adds to the size of an extensive value.
  • Multiplication: independent extensive values are variables that can be selected from independently. One can form the cartesian product; multiplication gives the size of \(A×B\).

So numbers are algorithms, and it turns out that some of them do not have finite time complexity. For example, \(√2\), the diagonal of a unit square, is an algorithm which involves an infinite (open) loop. That the algorithm never ends is synonymous with: \(√2\) does not exist in \(ℚ\). But by including such algorithms we make the completion of \(ℚ\), which is then called the real numbers \(ℝ\): \(ℝ=ℚ∪𝕁\). The irrational numbers \(𝕁\) and the rational numbers \(ℚ\) are dense in \(ℝ\).

Why include infinite loops?

In principle they never end, and thus they need to be aborted; then we are still in \(ℚ\), i.e. the limit points practically don't exist. We can get out of this dead end by not talking any more about the limit points but rather about the algorithm: \(√2\) as an algorithm is different from \(1.4142135623730951\).
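
The view of \(√2\) as an algorithm with an open loop can be sketched with a Newton iteration that stays in \(ℚ\) and is aborted once the wanted precision is reached:

```python
from fractions import Fraction

# sqrt(2) as an algorithm: Newton iteration x -> (x + 2/x) / 2 over Q.
# The loop is in principle endless; it is aborted once the wanted
# precision is reached, which is all that "infinity" means here.
def sqrt2(precision):
    x = Fraction(1)
    while abs(x * x - 2) > precision:
        x = (x + 2 / x) / 2
    return x

approx = sqrt2(Fraction(1, 10**20))
assert isinstance(approx, Fraction)          # every iterate stays rational
assert abs(approx * approx - 2) < Fraction(1, 10**20)
```

The "limit point" itself is never produced; the algorithm only yields ever better rationals, which is exactly the point made above.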

By including the algorithms the statements given by the algebraic structures become more general. We have less limitations when making new calculations (closure).

Because the limit points can actually never be reached, i.e. do not exist, one must be careful about calculating with "algorithms", i.e. in \(ℝ\), when dealing with \(0\) and \(∞\). \(0/0\) can be any number, yet it is the foundation of calculus: it matters how one approaches 0. L'Hôpital's rule, and asymptotics in general, then helps to find the limit. The same is true for \(∞\) (the infinitely large): \(∞/∞\) can be any number as well. By a linear definition \(0 = \lim_{n→0} n\) and \(∞ = \lim_{n→∞} n\) one could write \(∞/∞=1\), \(0/0=1\) and \((1+1/∞)^∞ = e\). Because \(1/∞\) approaches \(0\) slower than linearly, we would have \(0⋅∞=0\), and \(∞+1=∞\), \(3∞=(3+0)∞=3∞\). But to acknowledge L'Hôpital's rule we would need \(3∞ ≠ ∞\) and \(3\cdot 0≠0\), which is not consistent with \(ℚ\); thus \(0/0\) and \(∞/∞\) are not defined in \(ℝ\).

\(ℚ\) vs \(ℝ\)

In \(ℚ=\{a/b|a,b∈ℤ,b≠0\}\), if one allows \(a_i∈ℤ\) and \(b_i∈ℤ\) to go to \(±∞\) in any possible way, also nonlinearly, i.e. if one admits sequences in this way, then one defines only a subset of \(ℝ\). The reason is not so much proofs like

\(√2=a/b\) (in lowest terms) \(⇒ 2b^2=a^2 ⇒ a^2\) even \(⇒ a\) even \(⇒ b\) even \(⇒\) contradiction, so \(√2\not\in ℚ\)

which hinges on

if \(a^2\) is even then \(a\) is even

and it is questionable whether that step can be carried out if we can never reach \(∞\). Remember: \(∞+1=∞\).

It is rather the comprehensive definition of \(ℝ\), which comprises, as one equivalence class, all thinkable sequences that approach the same number.

Infinity and Information

The information of a variable is basically the number of its values. If the values of the variable are generated by an algorithm, then the complexity (the size, the information) moves to the number of time steps needed to generate the values.

algorithmic complexity

Kolmogorov complexity looks only at the length of the algorithm and neglects time. Descriptive complexity theory in general does take time into account, though, via complexity classes.

The information of an infinite variable in bits is always infinite, but such variables can be further classified via their algorithmic complexity, i.e. via their number of nested endless loops.

  • \(ℕ\) has one loop and so do \(ℤ\) and \(ℚ\), because they have a bijection to \(ℕ\).
  • \(ℝ\) has two nested loops: infinitely dense-in-itself and unbounded. One can think of writing a real number with infinitely many but countable binary digits and thus conclude that the cardinality of the real numbers is \(2^{|ℕ|}\). Every ever so small interval in \(ℝ\) also has this cardinality. See also the continuum hypothesis.
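The claim that \(ℚ\) has "one loop" (a bijection to \(ℕ\)) can be made concrete with a single generator. The sketch below uses the Calkin–Wilf sequence (my choice of enumeration, not the text's), which visits every positive rational exactly once in one endless loop:

```python
from fractions import Fraction
from itertools import islice

def rationals():
    """One infinite loop enumerating every positive rational exactly
    once (Calkin-Wilf sequence): q -> 1 / (2*floor(q) + 1 - q)."""
    q = Fraction(1)
    while True:
        yield q
        q = 1 / (2 * (q.numerator // q.denominator) + 1 - q)

print(list(islice(rationals(), 6)))
# [Fraction(1, 1), Fraction(1, 2), Fraction(2, 1), Fraction(1, 3), Fraction(3, 2), Fraction(2, 3)]
```

No such single loop exists for \(ℝ\): that is precisely the extra nested loop in the bullet above.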

axiom of choice

It is not possible to choose all elements from an infinite variable, because that would need infinite information and/or infinite time. One therefore makes choice an axiom, to still be able to reason about such sets.

Practically, the information of an infinite variable depends on when one chooses to stop the infinite loop, i.e. on the precision. When modelling reality in computer software, an integer type or a floating point type is chosen according to the needed precision. If they are not enough, one can use arbitrary precision libraries.

Infinity and Topology

In \(ℝ\) even a plain \(1\) means \(1.\bar{0}\). The latter is an algorithm that never ends; the short \(1\) chooses an element of \(ℕ\) with variable length coding. The presence of infinitely close other numbers in \(ℝ\) asks for a method to distinguish the \(1\) from them. This is done by an algorithm that produces numbers ever closer to \(1\): at a certain step, e.g. \(1.000n\), the \(n\) is chosen to be \(0\) instead of \(1\) to \(9\). Every step generates a variable to allow a further choice. This infinite loop is called open.

open

open can be interpreted as an open-ended infinite loop: open-ended in the sense that we decide later when to get out of it.

The objective of an open loop is to define an element \(x\) by approaching it. The intermediate steps of the loop form a neighborhood of \(x\). \(x\) is defined by the algorithm and is included in the set: the topological closure.

The definition of elements is given by the separation axioms:

  • T0 defines a point \(x\) (element, value) via neighborhoods as described
  • T1 is when each of any two points has a neighborhood not containing the other
  • T2 is when any two points can be distinguished by disjoint neighborhoods. Every convergent filter and every convergent net then has a unique limit.

If a point lies outside a closed set and the two can be separated by disjoint neighborhoods, then this is a regular space. A regular space with a countable basis is metrizable (Urysohn's metrization theorem).

Cauchy convergence and completeness are generalized with the uniform space, which is built on axioms equivalent to a pseudometric, where two distinct elements do not necessarily need to have a nonzero distance.

If the topology comes from a metric, a neighborhood is conveniently defined as an open ball.

All these concepts to define closeness can be visualized with a finite FCA lattice and then be refined to ever finer ones ad infinitum.

space vs time

Neighborhood is normally mapped to our sense of physical space, but this provokes the misleading idea that more is selected at the same time. It is better to map a neighborhood to our sense of time, because that better depicts the fact that the points’ only reason to occur together is the selection process itself. We look at one selection process at a time. open makes this selection process an open-ended loop. It stands for the older notation \(x = \lim_{n→∞}x_n\). “\(f: A→B\) is continuous if every open \(X⊂B\) has an open \(f^{-1}(X)⊂A\)” is the same as \(\lim_{n→∞}f(x_n)=f(x)\). With a metric one can also say: for every \(ε>0\) there is a \(δ>0\) such that \(|x_n-x|<δ\) implies \(|f(x_n)-f(x)|<ε\). Note also that neighborhood does not imply nearness in the metric sense, but rather in the set containment sense.
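The sequence formulation \(\lim_{n→∞}f(x_n)=f(x)\) can be checked numerically. The sketch below (the helper `sequential_limit` is a made-up name) follows \(f\) along a sequence \(x_n → x\), truncating the open loop after finitely many steps; a discontinuous function is caught because the limit disagrees with the value:

```python
import math

def sequential_limit(f, x, steps=30):
    """Follow f along a sequence x_n -> x (from the right) and return
    f(x_n) for the last n; continuity at x means this approaches f(x)."""
    xn = x + 1.0
    for _ in range(steps):
        xn = x + (xn - x) / 2    # halve the distance to x each loop step
    return f(xn)

# continuous at 0: lim f(x_n) equals f(0)
print(abs(sequential_limit(math.cos, 0.0) - math.cos(0.0)) < 1e-6)  # True

# discontinuous at 0: the sign function has limit 1 from the right, f(0) = 0
sign = lambda t: (t > 0) - (t < 0)
print(sequential_limit(sign, 0.0), sign(0.0))  # 1 0
```

One selection process, one open loop, one limit at a time.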

compact

Every open cover has a finite subcover.

This property connects the infinite with the finite and thus allows global statements about the set which, at the infinitely close (local) scale, wouldn’t have much meaning, because infinity can never be reached.

That a closed and bounded subset of \(ℝ^n\) is compact, and vice versa (the Heine–Borel theorem), can be proved with this property.

Usefulness of Infinity

  • One such use, mentioned already, is to have a versatile multitude to map all kinds of real variables to.

  • Variables do not exist alone; they exist because of other variables. The functional dependence is a general characterization of the system, not so much the size of the variables (information), which changes from system to system. With infinity as defined here one makes a simulation with variables generated to the wanted precision. The algorithm needs less memory and simulations can be done with minimally generated variables. So this analytic description altogether saves a lot of memory.

  • One does not need to use arbitrary (“by chance”) numbers

    When a physicist liberally uses infinity in his description of the world, this is an idealization justified by the wanted precision. For example, an infinite distance could be a few centimeters when describing an atomic scale phenomenon. It is this idea that makes him use \(∞\) instead of an arbitrary distance like 2 cm or 3 cm.

  • One can make more general statements. Such statements are shorter, i.e. need less space (information, complexity).

    In a general statement the precision is unknown and so the decision about it needs to be deferred.

    For example, the diagonal in a NaCl crystal will need one precision, in a KBr crystal another one.

  • There is often no finite algorithm to describe certain things, like the length of the diagonal of a square (\(√2\)).

  • Trial and error is a basic principle because it follows from selection. This is an infinite iterative algorithm that is stopped when one is content with the result. Obviously this has many applications.
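Trial and error as an in-principle endless loop, aborted when one is content, can be sketched with bisection (my choice of example; the tolerance parameter `good_enough` is a made-up name):

```python
def bisect(f, lo, hi, good_enough=1e-9):
    """Trial and error: halve the interval around a root of f and stop
    the (in principle endless) loop when content with the result."""
    while hi - lo > good_enough:
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:   # sign change: the root is in the left half
            hi = mid
        else:                     # otherwise try the right half
            lo = mid
    return (lo + hi) / 2

# Find sqrt(2) as the root of x^2 - 2 by pure trial and error:
print(bisect(lambda x: x * x - 2, 1.0, 2.0))  # ~1.41421356...
```

Tightening `good_enough` resumes the loop; the algorithm itself, not any finite result, stands for the limit point.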