Probabilistic optimization algebra

This page introduces the probabilistic optimization fragment of daitai-algebra: how training, inference and classical gradient descent are unified under a single primitive — Step — and a small set of rewrite rules.

For the underlying state-transition relation see Step semantics in 5 minutes. This page extends Step to the probabilistic setting.

The position

Classical machine learning writes:

θ_{t+1} = θ_t − η · ∇L(θ_t)

daitai writes:

θ is a RandomVar. Optimization is the update of its distribution given evidence.

This is optimization as inference. In daitai-algebra it is not a perspective — it is the only definition. Deterministic gradient descent falls out as a special case where the distribution is a Dirac delta.

Sorts

Three sorts are added to the core:

| Sort | Meaning | |------|---------| | Param | a parameter, typed as RandomVar | | ParamDist | a distribution over Param | | Loss | Param → Scalar |

Primitive operators

E       : ParamDist × Expr           → Scalar
KL      : ParamDist × ParamDist      → Scalar
Observe : Loss × Scalar → ParamDist  → ParamDist
Step    : ParamDist × Loss × Scalar  → ParamDist

E(q, f) is the expected value of f under q. KL(q, q') ≥ 0 with KL(q, q) = 0; no symmetry is assumed.

Observe(L, β)(q) induces a posterior proportional to q(θ) · exp(−β · L(θ)). Normalization is implicit.

The Step axiom

Step(q, L, η)
  := argmin_{q'}  KL(q' || q) + η · E(q', L)

This is the definition, not an algorithm. A backend may realize it by SGD, natural gradient, mirror descent, variational inference, or sampling — the algebra does not care, only the result must satisfy the equation up to the backend's tolerance.

The stability law

For all valid q, L, η:

KL(Step(q, L, η), q)  ≤  η · E(q, L)

A step cannot move the distribution further (in KL) than the energy it consumes. The compiler reads this; you don't write it.

Degeneration to gradient descent

When ParamDist ≡ Dirac(θ):

Step(Dirac(θ), L, η)  ⇒  Dirac(θ − η · ∇L(θ))

So θ_{t+1} = θ_t − η ∇L(θ_t) is a theorem, not a primitive.

Rewrite rules

Every backend is allowed to apply these rewrites because they preserve semantics.

R1 — Zero step

Step(q, L, 0)  ⇒  q

R2 — Zero loss

Step(q, L, η)  ⇒  q       if ∀θ. L(θ) = 0

R3 — Additivity (same loss)

Step(Step(q, L, η₁), L, η₂)  ⇒  Step(q, L, η₁ + η₂)

R4 — Composition (different loss)

Step(Step(q, L₁, η₁), L₂, η₂)  ⇒  Step(q, L₁ + L₂, η*)

η* is a backend-chosen combinator. The core makes no linearity or symmetry assumptions.

R5 — Fixed point

Step(q, L, η)  ⇒  q
  if  argmin_{q'} KL(q' || q) + η · E(q', L)  =  q

R6 — Dirac degeneration

The classical gradient-descent rewrite above. Requires that ∇L(θ) exists and that the backend allows a deterministic representation.

R7 — Natural gradient via Moore–Penrose

If q ≡ N(μ, Σ) and H_L(μ) exists:

Step(q, L, η)
  ⇒  N(
       μ − η · pinv(H_L(μ)) ∘ ∇L(μ),
       pinv(H_L(μ))
     )

pinv is the Moore–Penrose pseudoinverse. H_L is allowed to be singular — that is precisely why the pseudoinverse appears.

R8 — Information-driven rewrite

maximize MI(A, B) using Step rewrites to minimize L using Step with L := −MI(A, B). Mutual-information maximization is therefore not a new primitive; it is a loss.

What this buys you

The same Step operator covers:

  • stochastic gradient descent
  • natural gradient
  • mirror descent
  • variational inference
  • probabilistic PCA, ICA, CCA
  • mutual-information maximization

All as instances of one definition, with stability and composition laws the backend can rely on without re-deriving them per algorithm.

Reading next