Wednesday, April 19, 2017

Complex Inner Product

Along the line of information, variable and measure, there are two products that make sense on extensive values. Each such value can be regarded as a variable of elements.

First the complex numbers are revisited, then the actually more fundamental dot product and exterior product are motivated via the projection of a vector, and finally the complex inner product is considered.

Complex Numbers

The real numbers only allow comparing two extensive values of the same kind (addable and subtractable) by their measure:

\[\frac{v_2}{v_1} ∈ ℝ\]

In a vector space of any dimension, two subspaces \(V_1\) and \(V_2\) span a plane of \((v_1,v_2)\) combinations (\(v_1 ∈ V_1\) and \(v_2 ∈ V_2\)). In this plane two vectors can have the same direction, be orthogonal, or anything in between. To express this, in addition to the size ratio we need an angle (the word direction builds on angle, too). Or we can make a model of reality where the \(v_1\) direction is placed on the \(1\) axis (the real axis) and \(v_2/v_1\) has

  • one component in the \(1\) axis and
  • one component that is orthogonal to the \(1\) axis, by convention turned counterclockwise.

This orthogonal axis is nothing esoteric: it only expresses that there is a component of \(v_2\) not pointing in the direction of \(v_1\), i.e. not adding to \(v_1\). By naming the orthogonal direction \(i\) (imaginary unit) we keep the additions separate.

Next

\[z = \frac{v_2}{v_1} = \frac{|v_2|}{|v_1|}(a+bi) = r(a+bi) ∈ ℂ\]

with \(a^2+b^2=1\), is a way to keep direction change and size ratio separate.

As with real numbers, we can think of the fraction \(\frac{v_2}{v_1}\) as the definition of the complex numbers, in the sense that the complex number depicts the relation between \(v_2\) and \(v_1\), or better, the operation that turns \(v_1\) into \(v_2\).

One can express any real world \(v_1\) and \(v_2\) by ratios with some unit \(e\):

  • \(v_1=z_1e\) and \(v_2=z_2e\).

This way \(z_1\) and \(z_2\) stand for \(v_1\) and \(v_2\), analogously to how real numbers stand for quantities not compared to others of a different kind (not addable).

Multiplying a \(v_1\) by \(z\) produces \(v_2\), i.e. it also produces a rotation. Another \(z\) will start from the last state. Specifically, if \(z=i\), reapplying it yields \(ii=-1\), because in a plane, orthogonal to orthogonal is the opposite direction.

Real ratio \(r\) and angle are related to the complex number \(z\) via

\[z = r (\cos φ + i \sinφ)\]

To reverse the \(z\) operation we take the inverse of \(z\):

\[\begin{split}\frac{1}{z} &= \frac{\bar z}{| z | ^2} \\ &= \frac{1}{r} (\cos φ - i \sin φ) \\ &= \frac{1}{r} (\cos(-φ) + i \sin(-φ))\end{split}\]

To get an intuitive understanding of why we can also write

\[z = e^{iφ}\]

we can

  • think that multiplying by \(z\) adds the phase and dividing by \(z\) subtracts it

  • grow towards the direction of \(z\) from \(1\) by infinitely many infinitesimal changes (see the numeric sketch after this list)

    \[(\cos\frac{φ}{∞}+i\sin\frac{φ}{∞})^∞ = (1+\frac{iφ}{∞})^∞ = e^{iφ}\]
  • differentiate \(z = r (\cos φ + i \sin φ)\) with respect to \(φ\)

    \[\frac{∂ z}{∂φ} = i z\]

    and see that the solution of the differential equation is \(z=e^{iφ}\)
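
A minimal numeric check of the limit in the second bullet (NumPy; the variable names and the angle are mine):

```python
import numpy as np

phi = 0.75                   # an arbitrary angle in radians
n = 10**6                    # "infinitely many" infinitesimal steps

# grow from 1 towards e^{i*phi} by n tiny rotations (1 + i*phi/n)
z_steps = (1 + 1j * phi / n) ** n

print(z_steps)                            # ~ (0.7317+0.6816j)
print(np.exp(1j * phi))                   # same up to O(1/n)
print(np.cos(phi) + 1j * np.sin(phi))     # identical to exp(i*phi)
```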

numbers vs geometric algebra

If we had given \(1\) a more concrete dimension \(e_1\) and had named the unit of the orthogonal dimension \(e_2\), then with \(I = e_1 e_2 = e_1 ⋅ e_2 + e_1 ∧ e_2 = e_1 ∧ e_2\) we would have got \(e_1 I = - I e_1 = e_2\). This is the geometric algebra approach, which does not abstract the unit: there \(I=e_1e_2\) is different from a \(J=u_1u_2\), and \(Iu_1\) doesn’t have any meaning. For (complex) numbers, on the other hand, we store separately what they refer to, i.e. whether we can add or multiply two of them or not.

\(z_1\bar{z_2}=r_1r_2(\cos(φ_1-φ_2)+i\sin(φ_1-φ_2))\) is the area (\(r_1r_2\)) projected onto the \(1\) and the \(i\) axis respectively:

  • all on \(i\) means \(φ_1=φ_2+π/2\), i.e. different kind, fully combinable, combination of elements = enclosed area
  • all on \(1\) means \(φ_1=φ_2\), i.e. same kind, not combinable, zero enclosed area

For a \(z\) alone,

  • \(i r \sin φ\) gives the projection onto \(i\) orthogonal to \(1\), i.e. the part of \(z\) combinable with \(1\), but not addable to \(1\) (enclosed area, exterior product)
  • \(r \cos φ\) gives the projection of \(r\) onto \(1\), i.e. the part of \(z\) not combinable with \(1\), but addable to the \(1\) direction. And (repeated) adding is given by multiplication with a real number (dot product). Both parts are computed in the sketch after this list.
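
A small numeric sketch (NumPy; the numbers are mine) of \(\bar{z_1}z_2\), the form used in the complex inner product section below: its real part reproduces the 2D dot product (projection) and its imaginary part the enclosed area (wedge):

```python
import numpy as np

z1 = 2.0 * np.exp(1j * 0.3)    # r1 = 2,   phi1 = 0.3
z2 = 1.5 * np.exp(1j * 1.1)    # r2 = 1.5, phi2 = 1.1

w = np.conj(z1) * z2           # r1*r2*(cos(phi2-phi1) + i*sin(phi2-phi1))
dot, wedge = w.real, w.imag

# the same two numbers from the 2D vectors (Re z, Im z)
a = np.array([z1.real, z1.imag])
b = np.array([z2.real, z2.imag])
print(np.isclose(dot,   a @ b))                   # True: projection part
print(np.isclose(wedge, a[0]*b[1] - a[1]*b[0]))   # True: enclosed area part
```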

Further down, the idea of combining dot product and exterior product will turn up again.

Dot Product

I take a step back and assume there is no dot product yet.

A measure function \(μ\) is additive for the union of disjoint subsets of its support variable. With this addition, one or more such variables form a vector space. So a vector space is like a bunch of extensive variables that are more or less orthogonal and are added separately per variable.

Two extensive values \(v_1\) and \(v_2\) are considered orthogonal, if the mutual σ-algebras form all possible combinations.

example

\(v_1\) could be a certain number of green balls and \(v_2\) a certain number of red balls. The respective σ-algebra consists of subsets of either green balls or red balls. An element of the combined \(Σ_1×Σ_2\) would be a certain number of red balls together with a certain number of green balls.

Orthogonality with probability

Two variables \(V_1\) and \(V_2\) are independent, if \(P_{12}(σ_1×σ_2) = P_1(σ_1)P_2(σ_2)\) (product measure), \(σ_1⊂V_1\), \(σ_2⊂V_2\). Let \(t_i∈T_i\) name the experiments for \(V_i\): \(x_i=x_i(t_i)\). The PDF is then \(p_i(x) = |x_i^{-1}(x)| / |T_i|\).

Every \(t_1\) combines with every \(t_2\). \(|T_1| |T_2|\) is the size of the rectangle in the product experiment space. \(p_1(x_1)p_2(x_2)∈[0,1]\) is a fraction of that rectangle.

For independent and identically distributed: \(P_1 = P_2 = P\).

The product measure \(|v_1||v_2|\) corresponds to the enclosed area.

locality

The subsets of \(v_1\) and \(v_2\) combine not because they are from unrelated contexts but due to the local topology or metric.

We can think of starting from a point and going a number of neighborhoods (distance) along one variable \(v_1\) (in one direction) and a number of neighborhoods along the other variable \(v_2\) (the other direction).

We can make all linear combinations of steps in either direction to reach all possible points in the rectangle spanned by the two orthogonal variables.

Two non-orthogonal vectors that lead away from a starting point can be decomposed into parallel and orthogonal set components via a projection.

example

A mixture of red and green balls (=vector) would be projected on the red balls variable by removing the green balls.

  • The projection \(\mathcal{P}_{12}\) from \(v_1\) onto \(v_2\) defines the dot product via \(\mathcal{P}_{12}v_1v_2 = μ(\mathcal{P}_{12}Σ_1×Σ_2)\).
  • The projection \(\mathcal{I}-\mathcal{P}_{12}\) from \(v_1\) orthogonal to \(v_2\) defines the exterior product via \((\mathcal{I}-\mathcal{P}_{12})v_1v_2 = μ((\mathcal{I}-\mathcal{P}_{12})Σ_1×Σ_2)\)

Only with the dot product defined can one use Gram-Schmidt orthonormalisation.

Due to additivity a vector component \(a_1\) can be expressed by a number \(a^1\) and a unit vector \(e_1\): \(a_1 = a^1e_1\).

A general vector is a linear combination of the unit vectors.

\[\begin{split}a&=&a_1+a_2 = &a^1e_1+a^2e_2 \\ b&=&b_1+b_2 = &b^1e_1+b^2e_2\end{split}\]

Let’s multiply the two vectors by making all combinations of their orthogonal components:

\[ab=a_1b_1+a_1b_2+a_2b_1+a_2b_2 = a^1b^1e_1e_1+a^1b^2e_1e_2+a^2b^1e_2e_1+a^2b^2e_2e_2\]

Dot product and exterior product are complementary due to the mentioned projection, and it is a good idea to combine them in the geometric product. But if one defines the dot product \(e_1·e_2=e_2·e_1=0\) for orthogonal \(e_1\) and \(e_2\), and \(e_1∧e_1=e_2∧e_2=0\) for the same direction, then one can handle the operations separately. This is what is normally done, but it lets one easily forget about their complementarity. The geometric product can then be expressed as \(ab=a·b+a∧b\).

The exterior product is quite intuitive via the area. It also makes sense that \(e_1∧e_2=-e_2∧e_1\), because then it gives the enclosed (signed) area when multiplying two general vectors \(a\) and \(b\), as in the sketch below.
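
A sketch of the 2D geometric product just described, with the dot part from \(e_1e_1=e_2e_2=1\) and the wedge part from \(e_1e_2=-e_2e_1\) (the function name and vectors are mine):

```python
import numpy as np

def geometric_product_2d(a, b):
    # ab = a.b + a^b for 2D vectors: returns (scalar part, pseudoscalar part)
    dot   = a[0]*b[0] + a[1]*b[1]     # e1e1 = e2e2 = 1 terms
    wedge = a[0]*b[1] - a[1]*b[0]     # e1e2 = -e2e1 terms
    return dot, wedge

a = np.array([3.0, 1.0])
b = np.array([1.0, 2.0])
print(geometric_product_2d(a, b))     # (5.0, 5.0): a.b and the signed area a^b
```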

But why is the dot product a scalar?

  • The multiplication is over the same value and when a \(σ_1∈Σ_1\) is chosen, we have no further freedom to choose it again. So \(e_1e_1 = \mathcal{P}_{12}e_1e_1 = μ(σ_1)μ(σ_1)=1\) is a mere number.
  • It can be used to define a norm \(∥a∥=\sqrt{aa}\), the measure we talked about so far. Because \(∥e_1∥=1\) the square follows from the bilinearity.
  • The dot product with a unit vector gives the projection (\(v_1 · e_2 = \mathcal{P}_{12}(v_1)\) in size) and we can write \((v_1 · e_2)e_2 = v_1^2e_2\), so the component \(v_1^2=v_1·e_2\) better be a scalar. The norm can also be seen as a projection: \(∥a∥=a·e_a=\sqrt{aa}\).
  • Adding extensive values of different kind (\(c=a+b\) with \(a·b=0\)) we get the Pythagorean theorem from the dot product: \(|c|^2=c·c=(a+b)·(a+b)=a^2+b^2\)

In geometric algebra, for 2D, the exterior product behaves like a scalar: it is called a pseudoscalar. Dot product and exterior product are mutually complementary, and in the complex numbers they are combined into one scalar.

The Complex Inner Product

For complex numbers the product \(\bar{z_1}z_2\) gives

  • a wedge (=area) part for the orthogonal projection multiplication (imaginary part)
  • a dot part for the projection multiplication (real part)

A complex number is regarded as a 2D vector and \(i\) transforms one component to the other, i.e. rotates by the right angle.

A complex vector \(v\) consisting of \(n\) complex numbers is isomorphic to a real vector of dimension \(2n\), because the real and imaginary parts are added independently.

The inner product of two complex vectors defined as \(<v_1|v_2> = v_1·v_2 = Σ\bar{v_1^k}v_2^k\) accounts only for \(4n\) combinations (\(2n\) dot combinations: \(n\) {1,1}, \(n\) {i,i}; and \(2n\) wedge combinations {1,i}), and not for the \(2n·2n = 4n^2\) possible dot and wedge products between the components. When keeping dot and wedge separate, with \(e_k·e_l=0\) for \(k≠l\), it actually accounts for all dot combinations (\(2n+2n(2n-1)=4n^2\)). Even with \(e_k∧e_k=0\), though, it misses \(2n(2n-1)-2n=4n^2-4n=4n(n-1)\) of the wedge combinations.

This complex inner product is thus only applicable to cases that can be decomposed into \(n\) 2D spaces. Basically we are in a 2D space where the dot and wedge parts get accumulated separately, as the sketch below checks.
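
A numeric sketch of this accumulation (NumPy; names and the random vectors are mine): the real part of \(<v_1|v_2>\) collects the \(2n\) in-plane dot combinations, the imaginary part the \(2n\) in-plane wedge combinations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
v1 = rng.normal(size=n) + 1j * rng.normal(size=n)
v2 = rng.normal(size=n) + 1j * rng.normal(size=n)

inner = np.sum(np.conj(v1) * v2)      # <v1|v2>

# accumulate dot and wedge per 2D plane (Re, Im of each component)
dot   = sum(a.real*b.real + a.imag*b.imag for a, b in zip(v1, v2))
wedge = sum(a.real*b.imag - a.imag*b.real for a, b in zip(v1, v2))

print(np.isclose(inner.real, dot))    # True: dot parts accumulated
print(np.isclose(inner.imag, wedge))  # True: wedge parts accumulated
```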

The complex inner product can be extended to square integrable complex function spaces (\(L^2\))

\[<φ|ψ> = ∫ \bar{φ}(x)ψ(x) dx\]

Replace \(x\) with \(x,t\) and \(dx\) with \(dxdt\) for time dependence.

Here every function value is regarded as an independent component. But then we have infinite dimension, which is only tractable with an approximation algorithm to get arbitrarily close (Cauchy). The approximation works best with orthonormal function components instead of value by value components:

\[<ψ_n|ψ_m> = δ_{mn}, \qquad Σ_n \bar{ψ_n}(x)ψ_n(x') → δ(x-x')\]

The first condition is orthonormality, the second completeness, i.e. the ability to approximate all functions in the \(L^2\) sense. \(δ(x-x')\) represents a point \(x\) in the \(L^2\) sense.

\[<x|ψ> = ∫δ(x-x')ψ(x')dx' = ψ(x)\]

It is not that \(ψ_n(x)\) is orthogonal to \(ψ_m(x)\) at every single \(x\); rather, the dot and the wedge part of \(\bar{ψ_n}(x)ψ_m(x)\) vanish through summation over the range of \(x\), as the sketch below shows.
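
This can be checked numerically with Fourier modes on \([0,2π)\) (a sketch; the grid size and normalization are my choices):

```python
import numpy as np

x = np.linspace(0, 2*np.pi, 2000, endpoint=False)
dx = x[1] - x[0]
psi = lambda n: np.exp(1j * n * x) / np.sqrt(2*np.pi)   # psi_n(x)

for n, m in [(1, 1), (1, 2), (2, 5)]:
    ip = np.sum(np.conj(psi(n)) * psi(m)) * dx          # <psi_n|psi_m>
    print(n, m, np.round(ip, 6))
# (1,1) -> 1; (1,2), (2,5) -> 0: the pointwise products do not vanish,
# only their summation over the range of x does
```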

Quantum Mechanics

A particle is an identity defined by its linking (selection) of values from variables.

In quantum mechanics this selection is inherently imprecise. A selection is a finite interval of the infinitely close, infinitely precise values generated by our mind. An \(x∈ℝ\) is replaced by a bell-shaped function whose foremost purpose is to introduce the imprecision that is inherent to the atomic scale. Note that this imprecision also motivates approximating physical (differential) equations.

The bell-shaped function \(ψ(x)\) describes a state that holds together a neighborhood around a special \(x\), i.e. a state is one imprecise value of a variable.

The development of the state (in time) is derived from the current localized state with differential operators. One shifts one’s attention from the values of the state function to the derivatives. Every one of these derivatives is an independent variable and together they make a vector space. A linear differential operator can now be described as a matrix linearly combining components of such a vector of derivatives. Via eigenvalue equations an operator gives rise to orthogonal functions best suited to approximate the state function through their superposition, i.e. instead of the vector of derivatives \((ψ', ψ'', ψ''',...)\) we can now use the vector of orthogonal eigenfunctions \((ψ_1, ψ_2, ψ_3,...)\), like the \(e^{ikx}\) eigenfunctions of the operator \(∂_x=\frac{∂}{∂x}\) (Fourier transform). The spectral theorem says that all states can be approximated with the eigenfunctions. With orthonormal eigenfunctions as basis the state becomes a vector and the corresponding operator is a diagonal matrix of the eigenvalues. An arbitrary operator is a non-diagonal matrix.

For \(∂_x\) the eigenfunctions are \(e^{ikx}\) and the eigenvalues are \(ik\). Multiplying with \(-iħ\) we get a real eigenvalue with physical content: the de Broglie momentum \(p=ħk=h/λ\). \(∂_t\) similarly leads to eigenfunctions \(e^{iωt}\) with eigenvalues \(iω\), but made real and physical through \(-iħ∂_t\): \(E=ħω=hν=h/T\).
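
A quick numeric confirmation of the eigenvalue (finite differences; \(ħ=1\), \(k\) and the grid are my choices):

```python
import numpy as np

hbar = 1.0
k = 2.5
x = np.linspace(0, 2*np.pi, 4001)
psi = np.exp(1j * k * x)                   # eigenfunction of d/dx

p_psi = -1j * hbar * np.gradient(psi, x)   # momentum operator -i*hbar*d/dx

# eigenvalue p = hbar*k recovered in the interior of the interval
print(np.round((p_psi / psi)[100:110].real, 3))   # [2.5 2.5 ...]
```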

Classical \(p^2/2m=H\) becomes the Schrödinger equation \(iħ∂_tψ=-\frac{ħ^2}{2m}Δ_xψ\). But classical mechanics is a macroscopic theory. It is better to start from the relativistically corrected wave equation \((Δ_t/c^2-Δ_x+(mc/ħ)^2)ψ=0\) (Klein-Gordon equation) and approximate the time dependence to first order \(∂_t\). Making \(Δ_x\) first order \(∂_x\) as well yields the Dirac equation.

The Schrödinger equation leads to a continuity equation: \(∂_t|φ|^2=\bar{φ}∂_tφ+φ∂_t\bar{φ}=(\bar{φ}Δφ-φΔ\bar{φ})iħ/2m=\operatorname{div}((\bar{φ}∂_xφ-φ∂_x\bar{φ})iħ/2m)=\operatorname{div} J\)

Note that \(t\), unlike \(x\), cannot be an operator in quantum mechanics. In quantum field theory this unequal treatment, which does not conform to special relativity, is resolved by making \(x\) a parameter like \(t\). String theory goes the other way and makes \(t\) a real variable.

probability amplitude

Normal probabilities are defined over one variable with exclusive values. Two variables give rise to the product probability. The variables are independent if \(P_{12}=P_1P_2\). This means that the variables combine to span a 2D space. Such a space is modeled with complex numbers. With complex \(P_1\) and \(P_2\) they don’t need to coincide with the axes, but can point in any direction. Their enclosed angle can range from a right angle (orthogonal = completely independent) to zero (same direction = same variable). These complex probabilities are still the usual probabilities via their length (amplitude). The normalization is done on the product space, though, because the objective is to describe the relation between two states, whether independent or exclusive. The latter is the one-variable probability: \(∫\bar{φ}φdx=1\).

The probabilities can be interpreted via ensembles, but practically they follow from the equations of quantum mechanics, like the Schrödinger equation.

The function values of \(ψ(x)\) in quantum mechanics are probability amplitudes, a probability with direction coded as a complex number, normally changing along \(x\). The values of two functions \(φ\) and \(ψ\) have a relative direction, and this also changes along \(x\). Generally the summation \(<φ|ψ>=∫\bar{φ}(x)ψ(x)dx\) results in a complex number, the sum of all the value projection multiplications, dot and wedge. If we set \(|ψ>=Q|φ>\), how much dot and how much wedge remains after summation is determined by the operator \(Q\), because \(<φ|φ>=1\) by itself.

  • A real expectation value \(<Q>\) means \(\bar{<Q>}=<Q> \equiv <φ|Qφ>=<Qφ|φ>\), i.e. each state can be projected onto the other with same result. With orthonormal \(Q\) eigenfunctions this can be expressed as \(<φ|q_n><q_n|φ>=\bar{q^n}q^n\). \(Q\) is called hermitian.
  • \(<Q>=0 \equiv <φ|Qφ>=<Qφ|φ>=0\). \(Q\) produces an orthogonal \(|ψ>=Q|φ>\). Because of the spectral theorem, there is an approximation for any such \(|ψ>\): \(|ψ>=Σ_n|q_n><q_n|ψ>\).
  • An imaginary \(<Q>\) means \(\bar{<Q>}=-<Q> \equiv <φ|Qφ>=<-Qφ|φ>\). \(Q\) is antihermitian. \(|ψ>=Q|φ>\) is not compatible with \(|φ>\). In this case \(i\) or \(-i\) makes \(Q\) compatible again (e.g. \(ħ∂_x→-iħ∂_x\)).

In the Heisenberg picture states are fixed and the operators carry the time dependence. One works with operators instead of state functions. The product of two operators \(AB\) generally results in a complex expectation value \(<AB>\), fully or partially imaginary. The imaginary part, the summation of the wedge parts of the function values, can be extracted with the commutator \(<[A,B]>/2i\). This imaginary part’s expectation value is the joint uncertainty

\(ΔA^2ΔB^2=<(A-<A>)^2><(B-<B>)^2> \;≥\; <[A,B]/2i>^2\)

\(ΔAΔB≥|<[A,B]>|/2\), e.g. for \(x\) and \(p\) (\([x,p]=iħ\)): \(ΔxΔp≥ħ/2\)

The uncertainty is the interval of \(x\) or \(p\) values expressed by the bell-shaped probability amplitude. The summation over the probability amplitudes preserves the idea of projecting either into each other (hermitian) or onto the complementary direction (antihermitian, uncertainty, e.g. \(ΔxΔp\)). The sketch below checks the bound for a Gaussian amplitude.
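
A numeric sketch (NumPy; \(ħ=1\), \(σ\) and the grid are my choices) for a Gaussian amplitude, which saturates the bound \(ΔxΔp=ħ/2\):

```python
import numpy as np

hbar, sigma = 1.0, 0.7
x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]

# normalized bell-shaped (Gaussian) probability amplitude
psi = (1/(2*np.pi*sigma**2))**0.25 * np.exp(-x**2 / (4*sigma**2))

prob = np.abs(psi)**2
ex = np.sum(x * prob) * dx                  # <x> = 0
var_x = np.sum((x - ex)**2 * prob) * dx     # <(x-<x>)^2> = sigma^2

# <p^2> = integral |(-i hbar d/dx) psi|^2 dx   (psi real, so <p> = 0)
var_p = hbar**2 * np.sum(np.abs(np.gradient(psi, x))**2) * dx

print(np.sqrt(var_x) * np.sqrt(var_p))      # ~ 0.5 = hbar/2
```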

Measure


A variable, the unit of information, is basically finite. But when thinking, i.e. simulating with generated variables mappable to all kinds of real variables, it is best not to fix the size and precision until really needed. So one works with arbitrary precision, which practically is the meaning of infinity.

Having a flexible generated variable, like \(ℝ\), the size is infinite. But by using it, one maps it to a specific real finite variable. How to measure the finite variable with the infinite \(ℝ\)?

This is the topic of integration and measure theory. We have a variable or space and a quantity associated to subsets of the space. This quantity adds up for the union of disjoint sets.

In this blog I follow this path a little further, with special interest in the probability measure.

Measure

Basically we can model an infinite space with a formal concept analysis (FCA) lattice, algorithmically refined ad infinitum. The nodes of the FCA lattice consist of intent and extent. We only look at one of these dual sets: the intent.

In the finite case, for each intent we can ask

  • How many elements are there?
  • How many elements are there that belong to some special variable(s)?

If such a question can be answered in an unambiguous way, then it defines a measure function on the (intent) set. The intents may contain elements not of interest to our question, to our measure. These do not count, i.e. their measure is 0. They are not part of the support of the measure.

For infinite variables we start with a measure function.

Many quantities encountered when describing the physical world are values of a measure (function) \(μ\) on a measure space \((V,Σ)\) (\((V,Σ,μ)\)):

\[\begin{split}q = μ(σ) ∈ ℝ^+\\ σ ∈ Σ\\ μ(σ_1 ∪ σ_2) = μ(σ_1)+μ(σ_2)\\ σ_1 ∩ σ_2 = Ø\end{split}\]

The σ-algebra \(Σ ⊂ 2^V\) is a set algebra that additionally is closed under countably infinite unions and complementation. Two sets of \(Σ\) can overlap, but the measure is counted only once, by summing only over the disjoint sets resulting from intersection and complementation.

Extensive variable

The value of the measure \(μ(σ)\) is determined by \(σ ∈ Σ ⊂ 2^V\), i.e. \(μ(σ)\) is determined by a set. Although the \(σ\)‘s are not exclusive, \(μ\) contains the intelligence not to count overlapping parts twice. If \(μ(σ_1)+μ(σ_2) > μ(σ_1∪σ_2)\), then there is an overlap. If the exclusive values of a variable are sets, I’ve called it an extensive variable. A measure on an infinite variable \(V\) splits \(V\) into disjoint sets and thus fits into this definition. But several \(σ_i,σ_j\) can have the same \(μ(σ_i)+μ(σ_j)\), i.e. the measure loses information. What remains is the measure value of the underlying extensive values (tautology).

Examples of measures are

  • The count. This exists only for finite variables. The count \(C\) of a variable is a measure for its inherent information: \(\text{lb}\,C\) is the number of \(\{0,1\}\) combinations (bits) one needs to reproduce the same count, and is thus of relevance to information processing systems. Nature, itself a big computer, also combines small variables to create big ones.
  • The analogue of the count for infinite variables is the σ-finite Lebesgue measure (length, area, volume).
  • One can generalize a measure to have direction: complex measure and vector measure.

Measure is the integral already

The summing over \(Σ\) is already in the measure: \(μ(σ)=∫_σ dμ\)

\(μ(σ) = ∫_σ μ'(x)dx\) corresponds to the Riemann integral, if the \(σ\)‘s are adjacent infinitesimal sets with a set size (length) defined (\(dx\)), e.g. via a metric. The measure, though, can be applied to a broader range of sets, i.e. to all sets satisfying the σ-algebra.

Radon-Nikodym derivative

When comparing two measures \(ν(σ)\) and \(μ(σ)\) for infinitesimal changes of \(σ\), ideally by one \(x∈V\), which entails infinitesimal changes \(dν=ν(\{x\})\) and \(dμ=μ(\{x\})\), then there is a function expressing the ratio of the change:

\(f=dν/dμ\)

\(ν(σ)=∫_σ f\,dμ=∫_σ \frac{dν}{dμ}\,dμ = ∫_σ dν\)
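
A numeric sketch (NumPy; the two Gaussian measures are my example) of \(f=dν/dμ\) as a pointwise density ratio:

```python
import numpy as np

x = np.linspace(-8, 8, 16001)
dx = x[1] - x[0]

mu_density = np.exp(-x**2 / 2) / np.sqrt(2*np.pi)       # mu: N(0,1)
nu_density = np.exp(-(x-1)**2 / 2) / np.sqrt(2*np.pi)   # nu: N(1,1)

f = nu_density / mu_density        # Radon-Nikodym derivative d(nu)/d(mu)

sigma = (x > 0) & (x < 2)          # some set sigma
nu_direct = np.sum(nu_density[sigma]) * dx              # nu(sigma)
nu_via_f  = np.sum(f[sigma] * mu_density[sigma]) * dx   # integral of f d(mu)
print(np.isclose(nu_direct, nu_via_f))                  # True
```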

Probability

Probability \(P\) is a measure with the extra requirement \(P(V)= ∫_V dP(v) = ∫_V P'(v)dv = 1\).

The domain of \(P\) is \(Σ ⊂ 2^V\), not single values of \(V\). There is no topology required for \(V\): probability arises by not looking at the details, i.e. not at the topology. \(Σ\) is there only to satisfy the requirements of a measure. But a topology can induce the Borel σ-algebra.

\(P\) is the CDF (cumulative distribution function) and \(P'\) is the PDF (probability density function). A PDF only makes sense if there is a metric or at least a topology on \(V\), with \(dv\) a measure on the neighborhood of \(v\). Note that there is no uniform PDF on an infinite \(V\).

  • \(V\) in probability jargon is called sample space or set of outcomes.
  • \(v ∈ Σ\) are called events.
  • The average bits or nats needed to reference a value of a variable \(V\) is the information of the variable \(V\). With the measure we can define this for the infinite case: \(I(V)=\sum N_i(\log N - \log N_i)/N = -\sum p_i\log p_i \rightarrow I(V) = -∫ P'\log P'\,dv = -∫\log P'\,dP\). In physics this is called entropy (see the sketch after this list).
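
A sketch of the finite-case formula from the last bullet (NumPy; the counts are arbitrary):

```python
import numpy as np

N_i = np.array([5, 3, 2])     # counts of the values of a finite variable
N = N_i.sum()
p = N_i / N

# I(V) = sum N_i (log N - log N_i) / N = -sum p_i log p_i
via_counts = np.sum(N_i * (np.log(N) - np.log(N_i))) / N
via_probs  = -np.sum(p * np.log(p))
print(via_counts, via_probs)  # equal: ~1.0297 nats
```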

Random variable

A measure \(X:Σ→ℝ\) (also \((V,Σ,X)\)) is a random variable on \(V\), if a probability measure \(P:Σ→[0,1]\) (\((V,Σ,P)\)) is also defined.

  • \(P(x)≡P(X^{-1}(x))\)
  • \(E(XY) = (X,Y) = ∫_V XYdP\) defines an inner product
  • \(E(X)1 = (X,1)1\), with \(E(X) = ∫_V XdP\), is the projection of \(X\) onto the constant \(1\)
  • \(E((X-E(X))^2)\)
    is the variance and the square root of it is the standard deviation, which corresponds to the Euclidean distance \(d(X,E(X))\).
  • \((1,X-E(X)) = 0\), i.e. \(X_o=X-E(X)\) and the constant \(1\) are orthogonal (see the sketch after this list)
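
A numeric sketch of the last two bullets (NumPy; the die variable is my example):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.integers(1, 7, size=100_000).astype(float)   # a die: X in {1..6}

E_X = X.mean()              # (X,1) = integral of X dP
X_o = X - E_X               # centered part

print(np.mean(1.0 * X_o))   # ~0: X_o is orthogonal to the constant 1
print(np.mean(X_o * X_o))   # variance ~ 35/12, i.e. d(X, E(X))^2
```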

Occurrences and events

In the information blog I’ve avoided the name event. It sounds like occurrences, but what is meant is very different.

Occurrences: Sets that can be intersected to contain exclusively one value \(v\) of the variable \(V\). These are the FCA intents mentioned above, which correspond to a topology on \(V\) that defines the values \(v\). Borel sets augment the topology by countable intersections and complements, giving rise to the Borel σ-algebra of events.
Events: Sets that only consist of values of \(V\), combining alternative occurrences and thus forming the basis for the probability summation.

Occurrences lead to the definition of probability by just counting them, without looking at the details of other elements in the occurrence. Let \(O_v\) be the set of occurrences of \(v\) and \(O_V=\bigcup_v O_v\), then \(P(v) = |O_v| / |O_V|\). The exclusiveness is necessary for \(\sum_v P(v) = 1\) (\(∫ dP(v)=1\)), else \(|\bigcup_v O_v| < \sum_v |O_v|\). \(O_V=\bigcup_v O_v\) requires \(O_v\) to be finite (a finite measure on \(V\)), while \(v\) itself might be an infinite set. In the latter case, in order for \(O_V\) to be finite, most \(|O_v|=0\), which also entails \(P(v)=0\) for most \(v\).

Events summarize occurrences, i.e. probabilities, over \(v\), by counting all occurrences that contain either this value of \(V\) or that value of \(V\) (summarized in the event \(v\)). This gives rise to the probability \(P\). \(P\) adds for alternatives of the same variable.

Fisher information is not information

Distinguish entropy, which is information, from Fisher information, which is not information and should rather be called Fisher [score/inverse] variance. The score \(∂_θ\ell(x;θ)=∂_θ\log p(x;θ)\) is the change of information at \(x ∈ X\) when modifying \(θ\). The first moment is \(E(∂_θ\ell)=∫∂_θ\ell\, p(x;θ)dx=0\); the second moment \(F(θ)=E((∂_θ\ell)^2)=∫ ∂_θ\ell\,∂_θ\ell\, p(x;θ)dx\) is the Fisher information. \(F(θ)\) has the inverse variance of the variable \(θ\) (not the value \(θ\)) as a lower bound (Cramér-Rao): \(F(θ) \geq 1/\text{Var}(θ)\). A \(θ\) where \(\text{Var}(θ)\) is small makes \(F(θ)\) big. The biggest \(F(θ)\) marks the best \(θ\), because it has the biggest influence on \(p(x;θ)\). If we have two parameters \(θ\) and \(φ\) (or more), then \(F(θ,φ)=∫ ∂_θ\ell\,∂_φ\ell\, p(x;θ,φ)dx\) is an inner product for the scores (= tangent space at \((x;θ,φ)\)) that does not depend on the parameters \(θ\) and \(φ\).
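
A sketch for a Bernoulli variable (my example), checking the zero first moment of the score and \(F(θ)=1/(θ(1-θ))\):

```python
import numpy as np

theta = 0.3
x = np.array([0.0, 1.0])          # outcomes
p = np.array([1-theta, theta])    # p(x; theta)

# score = d/d(theta) log p(x; theta): -1/(1-theta) at x=0, 1/theta at x=1
score = np.array([-1/(1-theta), 1/theta])

print(np.sum(score * p))          # first moment: 0
F = np.sum(score**2 * p)          # Fisher information
print(F, 1/(theta*(1-theta)))     # both ~4.7619
```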

Product Measure

The product measure is defined as

\[\begin{split}(μ_A×μ_B)(A×B)=μ_A(A)μ_B(B)\\ A ∈ Σ_A\\ B ∈ Σ_B\end{split}\]

All combinations of \(A×B\) do occur. This is a complete independence of the two variables \(A\) and \(B\).

  • If values are exclusive, then they belong to the same variable.
  • For a functional dependence only the couples that are elements of the function do occur.
  • We have a relation if only part of the \(A×B\) space is filled. With a probability measure this means that certain \(P(a×b)=0\) and other \(P(a×b)\neq P(a)P(b)\), but rather \(P(a×b) = P(a|b)P(b) = P(b|a)P(a)\), i.e. the second event’s probability depends on the first one (\(P(a|b)\) or \(P(b|a)\)). If \(P(a×b)=0\), then \(P(a|b)=P(b|a)=0\): \(a\) does not occur if \(b\) occurs. But for another \(b_1\) that might be different, i.e. we cannot add \(a\) as another value to the \(B\) variable, but we can still say that on \(A×B\) we have less than the maximum information: \(I(A×B) = -∫ \log P'(a×b)\,dP(a×b) \leq -∫ \log(P'(a)P'(b))\,P'(a)P'(b)\,da\,db = I(A)+I(B)\)
  • For complete independence we have the above product space. This might be within a common context, but also when the contexts are completely unrelated: e.g. by taking variables from different unrelated times and/or from unrelated places, the value couples will fill all of \(A×B\), and specifically for probability \(P(a×b) = ∫ dP(a×b) = ∫ da P'(a)P(b|a) = ∫ db P'(b)P(a|b) = P(a)P(b)\), with \(a ⊂ A\) and \(b ⊂ B\) and \(P(b|a) = ∫_{\beta ∈ b} dP(\beta|a)\). Both cases are checked in the sketch after this list.
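
A discrete numeric sketch of both cases (NumPy; the distributions are arbitrary): for the product measure the information is exactly \(I(A)+I(B)\), while a relation stays below that bound:

```python
import numpy as np

def I(p):                          # information (entropy) of a distribution
    p = p[p > 0]
    return -np.sum(p * np.log(p))

pa = np.array([0.5, 0.5])
pb = np.array([0.3, 0.7])

p_indep = np.outer(pa, pb)         # P(a x b) = P(a) P(b)
print(np.isclose(I(p_indep.ravel()), I(pa) + I(pb)))    # True

p_rel = np.array([[0.3, 0.2],      # same marginals, but P(a2 x b1) = 0
                  [0.0, 0.5]])
print(I(p_rel.ravel()) <= I(pa) + I(pb))                # True
```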