Probabilistic optimization algebra
This page introduces the probabilistic optimization fragment of daitai-algebra: how training, inference and classical gradient descent are unified under a single primitive — Step — and a small set of rewrite rules.
For the underlying state-transition relation see Step semantics in 5 minutes. This page extends Step to the probabilistic setting.
The position
Classical machine learning writes:
θ_{t+1} = θ_t − η · ∇L(θ_t)
daitai writes:
θis aRandomVar. Optimization is the update of its distribution given evidence.
This is optimization as inference. In daitai-algebra it is not a perspective — it is the only definition. Deterministic gradient descent falls out as a special case where the distribution is a Dirac delta.
Sorts
Three sorts are added to the core:
| Sort | Meaning |
|------|---------|
| Param | a parameter, typed as RandomVar |
| ParamDist | a distribution over Param |
| Loss | Param → Scalar |
Primitive operators
E : ParamDist × Expr → Scalar
KL : ParamDist × ParamDist → Scalar
Observe : Loss × Scalar → ParamDist → ParamDist
Step : ParamDist × Loss × Scalar → ParamDist
E(q, f) is the expected value of f under q. KL(q, q') ≥ 0 with KL(q, q) = 0; no symmetry is assumed.
Observe(L, β)(q) induces a posterior proportional to q(θ) · exp(−β · L(θ)). Normalization is implicit.
The Step axiom
Step(q, L, η)
:= argmin_{q'} KL(q' || q) + η · E(q', L)
This is the definition, not an algorithm. A backend may realize it by SGD, natural gradient, mirror descent, variational inference, or sampling — the algebra does not care, only the result must satisfy the equation up to the backend's tolerance.
The stability law
For all valid q, L, η:
KL(Step(q, L, η), q) ≤ η · E(q, L)
A step cannot move the distribution further (in KL) than the energy it consumes. The compiler reads this; you don't write it.
Degeneration to gradient descent
When ParamDist ≡ Dirac(θ):
Step(Dirac(θ), L, η) ⇒ Dirac(θ − η · ∇L(θ))
So θ_{t+1} = θ_t − η ∇L(θ_t) is a theorem, not a primitive.
Rewrite rules
Every backend is allowed to apply these rewrites because they preserve semantics.
R1 — Zero step
Step(q, L, 0) ⇒ q
R2 — Zero loss
Step(q, L, η) ⇒ q if ∀θ. L(θ) = 0
R3 — Additivity (same loss)
Step(Step(q, L, η₁), L, η₂) ⇒ Step(q, L, η₁ + η₂)
R4 — Composition (different loss)
Step(Step(q, L₁, η₁), L₂, η₂) ⇒ Step(q, L₁ + L₂, η*)
η* is a backend-chosen combinator. The core makes no linearity or symmetry assumptions.
R5 — Fixed point
Step(q, L, η) ⇒ q
if argmin_{q'} KL(q' || q) + η · E(q', L) = q
R6 — Dirac degeneration
The classical gradient-descent rewrite above. Requires that ∇L(θ) exists and that the backend allows a deterministic representation.
R7 — Natural gradient via Moore–Penrose
If q ≡ N(μ, Σ) and H_L(μ) exists:
Step(q, L, η)
⇒ N(
μ − η · pinv(H_L(μ)) ∘ ∇L(μ),
pinv(H_L(μ))
)
pinv is the Moore–Penrose pseudoinverse. H_L is allowed to be singular — that is precisely why the pseudoinverse appears.
R8 — Information-driven rewrite
maximize MI(A, B) using Step rewrites to minimize L using Step with L := −MI(A, B). Mutual-information maximization is therefore not a new primitive; it is a loss.
What this buys you
The same Step operator covers:
- stochastic gradient descent
- natural gradient
- mirror descent
- variational inference
- probabilistic PCA, ICA, CCA
- mutual-information maximization
All as instances of one definition, with stability and composition laws the backend can rely on without re-deriving them per algorithm.
Reading next
- Step semantics in 5 minutes — the classical state relation that this extends.
- The bra-ket primitives — the quantum/classical state vocabulary
ParamandParamDistplug into.