Categorical Stochastic Processes and Likelihood

In this work we take a category-theoretic perspective on the relationship between probabilistic modeling and function approximation. We begin by defining two extensions of function composition to stochastic process subordination: one based on the co-Kleisli category under the comonad (Ω × −) and one based on the parameterization of a category with a Lawvere theory. We show how these extensions relate to the category Stoch and other Markov categories. Next, we apply the Para construction to extend stochastic processes to parameterized statistical models, and we define a way to compose the likelihood functions of these models. We conclude with a demonstration of how the maximum likelihood estimation procedure defines an identity-on-objects functor from the category of statistical models to the category of learners. Code to accompany this paper can be found at https://github.com/dshieble/Categorical_Stochastic_Processes_and_Likelihood


Introduction
The explosive success of machine learning over the last two decades has inspired theoretical work aimed at developing rigorous frameworks for reasoning about and extending machine learning algorithms. For example, inspired by the inherent compositional structure at the heart of gradient based optimization, several authors have developed category theoretic frameworks for reasoning about neural networks and automatic differentiation [5; 9; 11; 12]. Separately, one of the most active areas of applied category theory focuses on building a categorical framework for probability theory and statistics. Researchers like Fritz [14], Cho and Jacobs [4], and Culbertson and Sturtz [6; 7] have developed strategies for describing the construction of probabilistic models from data in categorical terms. We aim to bridge these streams of research by using a probabilistic construction to define an optimization objective.
Cho and Jacobs [4] and Culbertson and Sturtz [6; 7] explore how new data points affect their models' epistemic uncertainty, or uncertainty due to limited data or knowledge. For example, a simple model of a complex nonlinear system is likely to have high epistemic uncertainty. Another form of uncertainty is aleatoric uncertainty, or inherent uncertainty in a system that will cause results to differ each time we run the same experiment. For example, if we aim to predict the output of a system that includes a non-deterministic stage (such as a coin toss), we will need to cope with aleatoric uncertainty.
Aleatoric uncertainty is common in physical systems. For example, many biological processes will produce slightly different results based on randomness in turbulent fluid flows. For this reason, models that approximate physical systems often implicitly or explicitly produce a probability distribution over the possible outputs conditioned on some input [25].
Even models that produce point estimates, such as the ones described by Fong et al. [12], can be viewed as predicting the expected value of some unknown probability distribution. For example, suppose we have some system X → y that contains a degree of aleatoric uncertainty such that P (y|X) is Gaussian. Now suppose we train a point estimate model that predicts y from X such that the mean square error between the model's predictions and the observations from the execution of this system is minimized. This is approximately equivalent to minimizing the Kullback-Leibler (KL) divergence (which measures how one probability distribution is different from a second, reference probability distribution) between a distribution with expected value given by the model's output and P (y|X). In this way the structure of the model's aleatoric uncertainty is captured in its loss function (mean square error in this case). Now consider a physical system which has several components, each of which has some degree of aleatoric uncertainty. Suppose we want to build a compositional model for this system. If we use the neural network-like composition of Fong et al. [12], then we can only represent the full model's uncertainty with the loss function that parameterizes the backpropagation functor. As a result, we cannot characterize the interactions between the uncertainty in the different parts of the system.
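The approximate equivalence between minimizing mean square error and minimizing KL divergence can be made explicit for a fixed-variance Gaussian model with mean m(X) (writing H for differential entropy):

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(P(y \mid X)\,\big\|\,\mathcal{N}(m(X), \sigma^2)\big)
  &= \mathbb{E}_{y \sim P(y \mid X)}\big[\log P(y \mid X) - \log \mathcal{N}(y;\, m(X), \sigma^2)\big] \\
  &= -H\big(P(y \mid X)\big) + \tfrac{1}{2}\log(2\pi\sigma^2)
     + \frac{\mathbb{E}_{y \sim P(y \mid X)}\big[(y - m(X))^2\big]}{2\sigma^2}
\end{aligned}
```

Only the final term depends on the model m, so for fixed σ minimizing this KL divergence over m is the same as minimizing the expected square error.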
For example, Eberhardt et al. [8] build a convolutional neural network model to assess how the visual cortex performs a rapid stimulus categorization task. Their model includes multiple layers which represent the hierarchy within the central nervous system, from photoreceptors in the eye, to edge-detecting neurons in the primary visual cortex, to higher-order feature detectors in the later stages of the visual cortex. Although there is aleatoric uncertainty at each layer of this biological system, Eberhardt et al. use a standard composition of neural network layers and therefore can only represent this uncertainty with a cross-entropy loss over the model's final output.
In this paper we describe an alternative strategy for constructing and composing parametric models such that we can explicitly characterize how different subsystems' uncertainties interact. We use this strategy to build a generalized framework for training neural networks that have stochastic processes as layers. To do this, we replace the domain of Fong et al.'s [12] Backpropagation functor (Para, also written as Para(Euc) [16]) with a probabilistically motivated category over which we can define the error function er : R × R → R through the maximum likelihood procedure. Our specific contributions are to:
• Develop a strategy for composing stochastic processes that is compatible with both subordination [20] and parametric function composition [12].
• Introduce two categories with this compositional structure, one based on Para(Euc) [16] and one based on the co-Kleisli category of the comonad (Ω ⊗ −), and explore their relationships with each other and with the category Stoch of Markov kernels.
• Extend the category of stochastic processes to a category of parametric statistical models.
• Demonstrate that the Radon-Nikodym derivative with respect to the Lebesgue measure acts as a semifunctor from a sub-semicategory of parametric statistical models into a semicategory of likelihood functions.
• Define a family of subcategories of parametric statistical models over which we can use the maximum likelihood procedure to define a backpropagation functor into the category Learn of learning algorithms [12].

Probability Measures, Random Variables and Markov Kernels
A probability space is a triple (Ω, Σ, µ) where (Ω, Σ) is a measurable space and µ is a probability measure over (Ω, Σ). That is, µ is a countably additive function over the σ-algebra Σ that returns results in the unit interval [0, 1] such that µ(Ω) = 1 and µ(∅) = 0. Recall that Σ is a set of subsets of Ω. For a topological space Ω, we will write B(Ω) for the Borel algebra of Ω, or the smallest σ-algebra that contains all open sets. A random variable defined on the probability space (Ω, Σ, µ) is a measurable function from (Ω, Σ) to (R, B(R)). We will sometimes use the term "random variable" to refer to measurable functions into (R^n, B(R^n)) as well; these are also called multivariate random variables or random vectors. While some authors use uppercase letters like X to denote random variables, we will use lowercase letters like f, g to emphasize that random variables are functions. Given a probability space (Ω, B(Ω), µ) and a random variable f : Ω → R, the pushforward f_*µ of µ along f is a probability measure over (R, B(R)) defined for σ ∈ B(R) to be:

f_*µ(σ) = µ(f^{-1}(σ))

A Markov kernel between the measurable space (A, Σ_A) and the measurable space (B, Σ_B) is a function k : A × Σ_B → [0, 1] such that for each x_a ∈ A, the function k(x_a, −) : Σ_B → [0, 1] is a probability measure on (B, Σ_B). In particular, for each σ_b ∈ Σ_B the function k(−, σ_b) : A → [0, 1] is measurable. For example, a Markov kernel between the one-point set and the measurable space (A, Σ_A) is just a probability measure over (A, Σ_A).
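For finite spaces the two defining conditions of a Markov kernel are easy to state in code. The following sketch (illustrative, not from the paper's accompanying repository; `is_markov_kernel` and `kernel_measure` are hypothetical names) represents a kernel as a row-stochastic matrix:

```python
# A minimal sketch: a Markov kernel between finite measurable spaces,
# represented as a row-stochastic matrix. Row a is the probability
# measure k(a, -) on B; measurability is automatic for finite spaces.

def is_markov_kernel(k):
    """Check that every row of the matrix k is a probability measure."""
    return all(abs(sum(row) - 1.0) < 1e-9 and all(p >= 0 for p in row) for row in k)

def kernel_measure(k, a, event):
    """Evaluate k(a, event) for an event given as a set of column indices."""
    return sum(k[a][b] for b in event)

# A kernel from a two-point space A to a three-point space B.
k = [[0.2, 0.3, 0.5],
     [0.0, 1.0, 0.0]]

assert is_markov_kernel(k)
# k(0, -) assigns probability 0.8 to the event {1, 2}.
assert abs(kernel_measure(k, 0, {1, 2}) - 0.8) < 1e-9

# A kernel out of the one-point space is just a probability measure on B.
point_kernel = [[0.25, 0.25, 0.5]]
assert is_markov_kernel(point_kernel)
```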
A stochastic process defined on the probability space (Ω, Σ, µ) is a family of random variables indexed by some set T. That is, we can write a stochastic process as a function f : Ω × T → R. In this paper we will limit our study to stochastic processes that are jointly Borel-measurable. We can define the pushforward of µ along such a stochastic process f to be the Markov kernel f_*µ : T × B(R) → [0, 1] given by f_*µ(t, σ) = µ(f(−, t)^{-1}(σ)).

Categories
A central category that we will work in is the symmetric monoidal category Meas of measurable spaces and measurable functions. The objects in Meas are pairs (X, Σ_X) of a set X and a σ-algebra Σ_X over X, and the morphisms are measurable functions between them. Note that Meas is not cartesian closed. Staton et al. [19] introduce a similar category QBS that is cartesian closed. The objects in QBS are quasi-Borel spaces, or tuples (X, M_X) where X is a set and M_X is a set of functions from R into X that contains all constant functions, is closed under precomposition with Borel-measurable functions R → R, and is closed under gluing along countable Borel partitions of R. We will generally work in the following subcategory of Meas:
Definition 2.1. Euc is the strict Cartesian monoidal subcategory of Meas where objects are restricted to be (R^n, B(R^n)) for some n ∈ N and morphisms are restricted to be continuously differentiable.
Note that in Euc the tensor product of the objects (R^a, B(R^a)) and (R^b, B(R^b)) is (R^{a+b}, B(R^{a+b})). Another important category that we will consider is Stoch [18; 21], which has measurable spaces as objects and Markov kernels as morphisms. We define the composition of the Markov kernels µ : A × Σ_B → [0, 1] and µ′ : B × Σ_C → [0, 1] to be the following, where x_a ∈ A and σ_c ∈ Σ_C:

(µ′ ∘ µ)(x_a, σ_c) = ∫_B µ′(x_b, σ_c) µ(x_a, dx_b)

The identity morphism at (A, Σ_A) is δ, where for x_a ∈ A, σ_a ∈ Σ_A:

δ(x_a, σ_a) = 1 if x_a ∈ σ_a, and 0 otherwise

The tensor product of the Markov kernels µ : A × Σ_B → [0, 1] and µ′ : C × Σ_D → [0, 1] is determined by:

(µ ⊗ µ′)((x_a, x_c), σ_b × σ_d) = µ(x_a, σ_b) µ′(x_c, σ_d)

The objects in Stoch are also equipped with a commutative comonoidal structure that is compatible with the monoidal product in Stoch. Fritz [14] dubs categories with this structure Markov categories.
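For finite spaces the composition and tensor formulas above reduce to familiar matrix operations. A minimal sketch in FinStoch (illustrative names, not the paper's code):

```python
# A sketch of composition and tensor in FinStoch, where Markov kernels are
# row-stochastic matrices and the Chapman-Kolmogorov formula becomes a
# matrix product.

def compose(k2, k1):
    """(k2 . k1)(a, {c}) = sum_b k2(b, {c}) * k1(a, {b})."""
    return [[sum(k1[a][b] * k2[b][c] for b in range(len(k2)))
             for c in range(len(k2[0]))]
            for a in range(len(k1))]

def tensor(k1, k2):
    """(k1 (x) k2)((a, c), {(b, d)}) = k1(a, {b}) * k2(c, {d})."""
    return [[k1[a][b] * k2[c][d]
             for b in range(len(k1[0])) for d in range(len(k2[0]))]
            for a in range(len(k1)) for c in range(len(k2))]

def identity(n):
    """The Dirac kernel delta(a, {b}) = 1 iff a = b."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

f = [[0.5, 0.5], [0.1, 0.9]]
g = [[1.0, 0.0], [0.25, 0.75]]

# delta is the identity for kernel composition.
assert compose(identity(2), f) == f
# Rows of a composite are still probability measures.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in compose(g, f))
```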

Definition 2.2.
A Markov category is a semicartesian symmetric monoidal category (C, ⊗, 1) in which every object X is equipped with a comultiplication map cp : X → X ⊗ X and a counit map del : X → 1 that satisfy the commutative comonoid equations, naturality of del, and:

cp_{X⊗Y} = (id_X ⊗ σ_{Y,X} ⊗ id_Y) ∘ (cp_X ⊗ cp_Y)

where σ_{Y,X} is the symmetric monoidal swap map in C.
Stoch naturally arises as the Kleisli category of the Giry Monad, which is an affine symmetric monoidal monad that sends a measurable space to the space of probability measures over that space [18].
Stoch has many notable subcategories based on restrictions of these measurable spaces. For example, the category FinStoch consists of finite measurable spaces and Markov Kernels between them. In order to be able to define regular conditional probabilities, Fong [10] and Culbertson et al. [7] restrict to countably generated measurable spaces (CGStoch), whereas Fritz et al. [15] restrict to standard Borel spaces (BorelStoch), which are the Borel spaces associated with Polish spaces.

Random Variables and Independence in BorelStoch
In any categorical presentation of probability, a natural question is how to reason about the notion of independence of random variables [13; 14; 17].
Since BorelStoch is the Kleisli category of the restriction of the Giry monad [18] to the Meas-subcategory of standard Borel spaces, we can define an embedding functor from this subcategory into BorelStoch that acts as the identity on objects and sends the measurable function f : A → B to the Dirac kernel δ_f : A × Σ_B → [0, 1], where δ_f(x_a, σ_b) = 1 if f(x_a) ∈ σ_b and 0 otherwise. This formalizes the intuition that Markov kernels are a generalization of both measurable functions and probability measures, and provides an avenue to directly study random variables and their independence in BorelStoch. Now suppose we have a probability space (Ω, Σ, µ) such that (Ω, Σ) is standard Borel, and two real-valued random variables f, f′ defined on this space. We can think of these random variables as morphisms in Meas from (Ω, Σ) to (R, B(R)). We can represent this probability space as a morphism in BorelStoch between 1 and (Ω, Σ): that is, a Markov kernel µ : 1 × Σ → [0, 1]. Going forward we will write the type signature 1 × Σ → [0, 1] as Σ → [0, 1] for convenience.
We can then represent f and f′ with their embeddings into BorelStoch: the Dirac Markov kernels δ_f, δ_{f′}. If we compose δ_f and µ in BorelStoch, we form a new probability measure (δ_f ∘ µ) over (R, B(R)): the pushforward of µ along f. We now have a hint of how we can reason about the independence or dependence of random variables in BorelStoch. First, consider the probability measure:

(δ_f ⊗ δ_{f′}) ∘ (µ ⊗ µ)

Accepted in Compositionality on 2021-02-02. Click on the title to verify.

This is simply the product measure over (R × R, B(R × R)) of the probability measures (δ_f ∘ µ) and (δ_{f′} ∘ µ) over (R, B(R)). It is completely determined by the marginal distributions of f and f′ over the probability space (Ω, Σ, µ), and it is agnostic to the independence or dependence structure of f and f′. The reason for this is that the measure µ is essentially "duplicated", and the random variables f and f′ are not actually compared over the same probability space.
In contrast, consider instead the probability measure (δ_f ⊗ δ_{f′}) ∘ cp ∘ µ, where cp : Ω → Ω ⊗ Ω is the comonoidal copy map at Ω in BorelStoch [14]. We can see that for σ × σ′ ∈ B(R × R):

((δ_f ⊗ δ_{f′}) ∘ cp ∘ µ)(σ × σ′) = µ(f^{-1}(σ) ∩ f′^{-1}(σ′))

This is the probability measure over (R × R, B(R × R)) associated with the joint distribution of the random variables f and f′ over (Ω, Σ, µ).
Therefore, the random variables f and f′ are independent over the probability space (Ω, Σ, µ) if and only if the probability measures (δ_f ⊗ δ_{f′}) ∘ cp ∘ µ and (δ_f ⊗ δ_{f′}) ∘ (µ ⊗ µ) are equal.
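This distinction between (δ_f ⊗ δ_{f′}) ∘ (µ ⊗ µ) and (δ_f ⊗ δ_{f′}) ∘ cp ∘ µ can be checked concretely on a finite sample space. The following sketch (hypothetical names, not from the paper's repository) compares the two measures for a pair of dependent random variables:

```python
# A finite sketch of the two measures on R x R discussed above: the product
# of marginals versus the joint obtained by copying the sample space with cp
# before applying the random variables.
from itertools import product

omega = [0, 1, 2, 3]             # a four-point sample space
mu = {w: 0.25 for w in omega}    # uniform probability measure

f  = lambda w: w % 2             # two dependent random variables:
fp = lambda w: (w % 2) ^ 1       # fp is determined by f

def product_measure(event):
    """((delta_f (x) delta_fp) . (mu (x) mu))(event): mu is duplicated."""
    return sum(mu[w1] * mu[w2] for w1, w2 in product(omega, omega)
               if (f(w1), fp(w2)) in event)

def joint_measure(event):
    """((delta_f (x) delta_fp) . cp . mu)(event): one shared sample."""
    return sum(mu[w] for w in omega if (f(w), fp(w)) in event)

event = {(0, 0)}
# The product measure factors: P(f=0) * P(fp=0) = 0.5 * 0.5 = 0.25.
assert abs(product_measure(event) - 0.25) < 1e-9
# The joint measure sees the dependence: f=0 forces fp=1, so P(f=0, fp=0) = 0.
assert abs(joint_measure(event)) < 1e-9
```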

The co-Kleisli Construction
Fong et al. [12] and Gavranović [16] build their characterization of machine learning optimization problems on top of the category Para(Euc) of Euclidean spaces and parameterized differentiable maps between them. Rather than represent the loss function itself categorically, the authors treat it as an externally-provided hyperparameter.
However, in practice the loss function is usually implied by the problem. A common problem statement is as follows: given some parameterized random variable, derive the parameters that maximize the likelihood of some observed data being drawn from the distribution of this random variable. A natural question is therefore whether it is possible to replace the parameterized differentiable maps in Para(Euc) with parameterized random variables.
Before moving to Para(Euc), we will start with the category Euc of Euclidean spaces and differentiable maps between them. Our first step will be to replace the morphisms in Euc with stochastic processes, or indexed families of random variables. We start with the following definition:
Definition 3.1. Given a measurable space (Ω, B(Ω)) whose underlying object lies in a strict Cartesian monoidal category C, the category CoKl_{(Ω,B(Ω))}(C) is the co-Kleisli category of the comonad ((Ω, B(Ω)) ⊗ −) on C, whose counit is the projection Ω ⊗ A → A and whose comultiplication is induced by copying Ω.
For example, if Ω is R^n for some n ∈ N, the category CoKl_{(Ω,B(Ω))}(Euc) (which we will hereafter abbreviate CEuc, see Table 1) has the same objects as Euc, and the morphisms between R^a and R^b are continuously differentiable (and therefore Borel-measurable) functions of the form f : Ω × R^a → R^b. In CEuc, the composition of f : Ω × R^a → R^b and f′ : Ω × R^b → R^c is:

(f′ ∘ f)(ω, x_a) = f′(ω, f(ω, x_a))

And the tensor of f : Ω × R^a → R^b and f′ : Ω × R^c → R^d is:

(f ⊗ f′)(ω, (x_a, x_c)) = (f(ω, x_a), f′(ω, x_c))

One important thing to note is that ω is reused when we compose or tensor f and f′. This allows us to make the following claim:
Proposition 1. For any ω ∈ Ω, the identity-on-objects map that sends the function f : Ω × R^a → R^b in CEuc to the function f(ω, −) : R^a → R^b in Euc is a strict monoidal functor R_ω : CEuc → Euc, which we call the realization functor.
Proof. First, if f is the identity map in CEuc then f(ω, −) is by definition the identity function.
Next, consider f : Ω × R^a → R^b, f′ : Ω × R^b → R^c in CEuc and any x_a ∈ R^a. Then:

R_ω(f′ ∘ f)(x_a) = (f′ ∘ f)(ω, x_a) = f′(ω, f(ω, x_a)) = R_ω(f′)(R_ω(f)(x_a))

so composition is preserved. Finally, consider g : Ω × R^a → R^b, g′ : Ω × R^c → R^d in CEuc and any x_a ∈ R^a, x_c ∈ R^c. Then:

R_ω(g ⊗ g′)(x_a, x_c) = (g ⊗ g′)(ω, (x_a, x_c)) = (g(ω, x_a), g′(ω, x_c)) = (R_ω(g) ⊗ R_ω(g′))(x_a, x_c)

so the monoidal tensor is preserved.
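The composition rule and the realization functor above can be sketched directly, assuming Ω = R and writing CEuc arrows as plain functions of (ω, x) (hypothetical helper names):

```python
# A sketch of composition in CEuc and the realization functor R_omega.
# Arrows are Python functions of (omega, x); omega is reused, not split.

def ckl_compose(fp, f):
    """(fp . f)(omega, x) = fp(omega, f(omega, x))."""
    return lambda omega, x: fp(omega, f(omega, x))

def realize(omega):
    """R_omega sends f : Omega x R^a -> R^b to the function f(omega, -)."""
    return lambda f: (lambda x: f(omega, x))

f  = lambda omega, x: x + omega          # drift by the sample omega
fp = lambda omega, x: omega * x          # scale by the same sample omega

h = ckl_compose(fp, f)                   # h(omega, x) = omega * (x + omega)

# R_omega is functorial: realizing the composite equals composing realizations.
omega = 2.0
lhs = realize(omega)(h)(3.0)
rhs = realize(omega)(fp)(realize(omega)(f)(3.0))
assert lhs == rhs == 10.0
```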
Given a probability measure µ : B(Ω) → [0, 1], we can think of CEuc as a category of differentiable stochastic processes defined on the probability space (Ω, B(Ω), µ). One particularly important kind of stochastic process is a Lévy process. We can view Lévy processes as continuous-time generalizations of random walks, or as Brownian motions with drift. Formally, a Lévy process is a one-dimensional stochastic process f : Ω × R → R defined on the probability space (Ω, B(Ω), µ) such that:
• f(ω, 0) = 0 for µ-almost all ω ∈ Ω.
• For any t_1 < t_2 < · · · < t_n, the increments f(−, t_2) − f(−, t_1), . . . , f(−, t_n) − f(−, t_{n−1}) are independent random variables.
• The distribution of the increment f(−, t_2) − f(−, t_1) depends only on t_2 − t_1.
• For any ω ∈ Ω the function f(ω, −) is continuous.
A subordinator is a non-decreasing Lévy process. That is, for any fixed ω ∈ Ω the function f(ω, −) is non-decreasing.

Proposition 2. Continuously differentiable subordinators form a single-object subcategory of CEuc.
Proof. First, note that the identity arrow on R is trivially a subordinator. Next, suppose f and g are subordinators. By Lalley [20] we have that g ∘ f is a Lévy process. Since both f and g are non-decreasing, for t_2 > t_1 we have for any fixed ω ∈ Ω that:

g(ω, f(ω, t_2)) ≥ g(ω, f(ω, t_1))

Therefore, g ∘ f is a subordinator as well.

Independence and Dependence in CEuc
Since all of the stochastic processes in CEuc are defined over the same probability space (Ω, B(Ω), µ), there is a major difference between how CEuc and BorelStoch represent independence and dependence. Given the arrows f : Ω × R^a → R^b and f′ : Ω × R^c → R^d in CEuc and the vectors x_a ∈ R^a, x_c ∈ R^c, the random variables f(−, x_a) and f′(−, x_c) may be either dependent or independent.
In order to see how this differs from the situation in BorelStoch, recall that the pushforward of µ along the stochastic process f : Ω × R^a → R^b is the Markov kernel f_*µ(x_a, σ_b) = µ(f(−, x_a)^{-1}(σ_b)). However, this mapping does not form a functor. We see that for f : Ω × R^a → R^b and f′ : Ω × R^c → R^d:

(f_*µ ⊗ f′_*µ)((x_a, x_c), σ_b × σ_d) = µ(f(−, x_a)^{-1}(σ_b)) µ(f′(−, x_c)^{-1}(σ_d))

whereas:

(f ⊗ f′)_*µ((x_a, x_c), σ_b × σ_d) = µ(f(−, x_a)^{-1}(σ_b) ∩ f′(−, x_c)^{-1}(σ_d))
These are not necessarily equivalent if the random variables f(−, x_a) and f′(−, x_c) are not independent. The reason for this mismatch comes down to the fact that tensor and composition in BorelStoch are based on the Markov property. We can slightly modify CEuc to define a new category of stochastic processes that exhibits this independence behavior.

Table 1: Abbreviations for the categories used in this paper.

Shorthand Name | Full Name
CEuc           | CoKl_{(Ω,B(Ω))}(Euc)
PEuc           | Para_{(Ω,B(Ω))*}(Euc)
DF             | Para_{(Ω,B(Ω))*}(Para_{Euc}(Euc))

The Parameterization Construction
In order to reason about the behavior of a system of stochastic processes, it is useful to study them in a simpler setting. There are two simple ways to do this: take pushforwards and study stochastic processes as Markov Kernels, or take expectations and study stochastic processes as functions. In order to make these lines of study rigorous, we first need to establish the functoriality of these transformations. To this end, in this section we build a new category of stochastic processes such that the map f → f * µ described in Section 3.1 is functorial. In Sections 5.2 and 6 we will explore the functoriality of the expectation. In order to elevate the pushforward to a functor, we need to modify the definition of how stochastic processes compose. Unlike in CEuc, where we treat all stochastic processes as if they were defined over the same probability space, the category in this section will consist of stochastic processes defined over different, non-interacting probability spaces. The composition or tensor of two stochastic processes in this new category will produce a stochastic process over the product of those processes' associated probability spaces. This will allow us to treat all of the stochastic processes in this category as if they were mutually independent.
We note that this strategy of expanding the probability space each time we introduce a new source of randomness is commonly used by probability theorists [1; 2; 24].

An extension of Para
We will begin by slightly modifying Gavranović's [16] Para construction, which is itself a generalization of Para from Fong et al. [12].
Consider the small symmetric strict monoidal categories C and D such that there exists a faithful identity-on-objects monoidal functor ι : D → C. That is, we can think of D as a subcategory of C. Then write (− ⊗ A) ∘ ι : D → C to denote the functor that sends the object P in D to ι(P) ⊗ A in C, and write c_B : D → C for the constant functor that sends all objects in D to B.
The objects of Para_D(C) are the objects of C, and an arrow from A to B is an object P in D together with an arrow f : P ⊗ A → B in C. The composition of the arrows f : P ⊗ A → B and g : Q ⊗ B → C in Para_D(C) is then as follows, where we write ∘_C and ⊗_C for the composition and tensor of arrows in C respectively:

g ∘ f = g ∘_C (id_Q ⊗_C f) : (Q ⊗ P) ⊗ A → C
And the tensor of arrows g : P ⊗ A → B and g′ : Q ⊗ C → D in Para_D(C) is:

g ⊗ g′ = (g ⊗_C g′) ∘_C (id_P ⊗_C σ_{Q,A} ⊗_C id_C) : (P ⊗ Q) ⊗ (A ⊗ C) → B ⊗ D

Note that unlike Gavranović [16], we require C to be strict monoidal in order to ensure that composition is associative without resorting to equivalence classes. Proposition 3. Suppose C and C′ are small symmetric strict monoidal categories with a strict monoidal functor F : C → C′ between them. Suppose D is a small symmetric strict monoidal category equipped with a faithful identity-on-objects strict monoidal functor ι : D → C and that the image of F ∘ ι is a subcategory D′ of C′. Then the map F_p : Para_D(C) → Para_{D′}(C′) that applies the same actions on objects and arrows as F is a strict monoidal functor.
Proof. We will first show that F_p is a functor, and then we will show that it is strict monoidal. Like above, we write ∘_C, ⊗_C, and σ_{Q,A} for the composition, tensor, and symmetric monoidal swap of arrows in C, and similarly for C′.
First note that since F_p : Para_D(C) → Para_{D′}(C′) applies the same actions on objects and arrows as F : C → C′, it trivially preserves identity morphisms. Next, we will show that F_p preserves composition. Suppose f : P ⊗ A → B, g : Q ⊗ B → C are arrows in Para_D(C). Then we have that:

F_p(g ∘ f) = F(g ∘_C (id_Q ⊗_C f)) = F(g) ∘_{C′} (id_{F(Q)} ⊗_{C′} F(f)) = F_p(g) ∘ F_p(f)

Next, we will show that F_p is strict monoidal. We first note that F_p trivially preserves the monoidal unit, since the monoidal unit is the same in C and Para_D(C). Next, suppose f : P ⊗ A → B and g : Q ⊗ C → D are arrows in Para_D(C). Then we have that:

F_p(f ⊗ g) = F((f ⊗_C g) ∘_C (id_P ⊗_C σ_{Q,A} ⊗_C id_C)) = (F(f) ⊗_{C′} F(g)) ∘_{C′} (id_{F(P)} ⊗_{C′} σ_{F(Q),F(A)} ⊗_{C′} id_{F(C)}) = F_p(f) ⊗ F_p(g)
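The Para composition and tensor described above can be sketched concretely, taking C to be sets and functions and representing a parameterized arrow as a function of (parameter, input) (illustrative names, not the paper's code):

```python
# A sketch of composition in Para_D(C) for C = sets and functions. The
# composite of f : P x A -> B and g : Q x B -> C has the product parameter
# space Q x P; the tensor pairs parameters and inputs componentwise.

def para_compose(g, f):
    """(g . f)((q, p), a) = g(q, f(p, a))."""
    return lambda qp, a: g(qp[0], f(qp[1], a))

def para_tensor(f, g):
    """(f (x) g)((p, q), (a, c)) = (f(p, a), g(q, c))."""
    return lambda pq, ac: (f(pq[0], ac[0]), g(pq[1], ac[1]))

f = lambda p, a: p * a           # parameter p scales the input
g = lambda q, b: b + q           # parameter q shifts the input

h = para_compose(g, f)           # h((q, p), a) = p * a + q
assert h((1.0, 2.0), 3.0) == 7.0

t = para_tensor(f, g)
assert t((2.0, 1.0), (3.0, 4.0)) == (6.0, 5.0)
```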

A Category of Parametric Measurable Maps
In this Section, we will use the Para construction to build a new category of stochastic processes over which the mapping f → f * µ is functorial. In this category the tensor and composition will have the same independence structure that they have in Stoch.

Lawvere Parameterization
We begin with the following definition:
Definition 4.1. Given a strict Cartesian monoidal category C and an object O of C, the Lawvere theory O* is the full monoidal subcategory of C whose objects are the tensor powers of O. We call the parameterization Para_{O*}(C), taken along the inclusion ι : O* → C, a Lawvere parameterization of C.
Note that the objects in O* are of the form O ⊗ O ⊗ · · · ⊗ O. When the tensor is repeated n times we will write this as O^n. For any strict Cartesian monoidal category C with a Lawvere parameterization we can define a mapping Copy : Para_{O*}(C) → CoKl_O(C). This mapping acts as identity-on-objects and sends the arrow f : O^n ⊗ A → B to the arrow f ∘ (cp_n ⊗ id_A) : O ⊗ A → B, where cp_n : O → O^n is the n-fold copy map given by the Cartesian structure of C.

Proposition 4. Copy is a full identity-on-objects strict monoidal functor.
Proof. First, we note that Copy is identity-on-objects by definition.
Next, consider any object A in C. The identity on A in Para_{O*}(C) is the arrow id_A : O^0 ⊗ A → A, and Copy sends it to id_A ∘ (cp_0 ⊗ id_A) : O ⊗ A → A, which is the projection onto A: the identity on A in CoKl_O(C). Therefore, Copy preserves identity morphisms. Next, we will show Copy preserves composition. Given f : O^n ⊗ A → B and g : O^m ⊗ B → C, both Copy(g ∘ f) and Copy(g) ∘ Copy(f) send (ω, x_a) to g(cp_m(ω), f(cp_n(ω), x_a)), since cp_{m+n}(ω) = (cp_m(ω), cp_n(ω)).
Finally, we will show that Copy preserves tensor. Given f : O^n ⊗ A → B and g : O^m ⊗ C → D, both Copy(f ⊗ g) and Copy(f) ⊗ Copy(g) send (ω, (x_a, x_c)) to (f(cp_n(ω), x_a), g(cp_m(ω), x_c)).

Applying Para to Euc
Now suppose we have a probability space (Ω, B(Ω), µ) where Ω is R^k, k ∈ N. We can form the Lawvere theory (Ω, B(Ω))* with generating object (Ω, B(Ω)) and tuples (Ω, B(Ω))^n = (Ω^n, B(Ω^n)) as objects. We can also form the faithful identity-on-objects functor ι : (Ω, B(Ω))* → Euc. Then for any (Ω^n, B(Ω^n)) ∈ (Ω, B(Ω))*, we can create the probability space (Ω^n, B(Ω^n), µ^n) where µ^n is the n-fold product measure µ ⊗ µ ⊗ · · · ⊗ µ. Now consider the Lawvere parameterization Para_{(Ω,B(Ω))*}(Euc) (which we will hereafter abbreviate PEuc). Intuitively, PEuc allows us to reason about probabilistic relationships in terms of measurable functions rather than probability measures. We can make this probabilistic intuition more formal. First, PEuc behaves similarly to a category of Markov kernels and we can show the following:
Proposition 5. We can construct a Markov category [14] on top of PEuc by equipping each object R^a with the comultiplication map cp : Ω^0 × R^a → R^a × R^a given by cp(x_a) = (x_a, x_a) and the counit map del : Ω^0 × R^a → R^0 that sends every x_a to the unique point of R^0.
We can view each arrow f : Ω^n × R^a → R^b in PEuc as a stochastic process over (Ω^n, B(Ω^n), µ^n). However, unlike in CEuc, if we compose or tensor f with another arrow in PEuc, we do not get another stochastic process over (Ω^n, B(Ω^n), µ^n). Instead, we get a stochastic process over some other probability space. Intuitively, we can think of the stochastic processes in PEuc as being defined over different, non-interacting probability spaces. Now given some arrow f : Ω^n × R^a → R in PEuc and x_a ∈ R^a, the measurable function f(−, x_a) is a real-valued random variable over the probability space (Ω^n, B(Ω^n), µ^n). The pushforward of µ^n along this random variable, f(−, x_a)_*µ^n, is then a probability measure over the space (R, B(R)).
In general, we can extend this pushforward procedure to define a mapping between parametric families of measurable maps and Markov kernels. Given some f : Ω^n × R^a → R^b, we define the Markov kernel f_*µ^n : R^a × B(R^b) → [0, 1] by f_*µ^n(x_a, σ_b) = µ^n(f(−, x_a)^{-1}(σ_b)).
Proposition 6. The mapping Push_µ that takes a parametric family f : Ω^n × R^a → R^b of measurable maps to the Markov kernel f_*µ^n is an identity-on-objects strict monoidal functor from PEuc to BorelStoch.
Proof. We first note that for any R^a, Push_µ trivially maps the identity at R^a in PEuc to its identity in BorelStoch. Next, we will demonstrate that Push_µ preserves composition. Suppose we have some f : Ω^n × R^a → R^b and g : Ω^m × R^b → R^c. The composite in PEuc satisfies (g ∘ f)((ω_m, ω_n), x_a) = g(ω_m, f(ω_n, x_a)), so for x_a ∈ R^a, σ_c ∈ B(R^c), by Fubini's theorem:

(g ∘ f)_*µ^{m+n}(x_a, σ_c) = ∫_{Ω^n} g_*µ^m(f(ω_n, x_a), σ_c) dµ^n(ω_n) = ∫_{R^b} g_*µ^m(x_b, σ_c) f_*µ^n(x_a, dx_b) = (g_*µ^m ∘ f_*µ^n)(x_a, σ_c)

Finally, we will demonstrate that Push_µ preserves tensor. Suppose we have some f : Ω^n × R^a → R^b and g : Ω^m × R^c → R^d. Then for x_a ∈ R^a, x_c ∈ R^c and σ_b × σ_d ∈ B(R^b × R^d):

(f ⊗ g)_*µ^{n+m}((x_a, x_c), σ_b × σ_d) = µ^n(f(−, x_a)^{-1}(σ_b)) µ^m(g(−, x_c)^{-1}(σ_d)) = (f_*µ^n ⊗ g_*µ^m)((x_a, x_c), σ_b × σ_d)
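The content of Proposition 6 can be checked numerically: pushing µ^{m+n} forward along a PEuc composite, in which f and g draw independent samples, reproduces the chained kernel. A Monte Carlo sketch with µ taken to be a standard normal (illustrative, not from the paper's repository):

```python
# A Monte Carlo sketch of Push_mu on a composite: sampling (omega_m, omega_n)
# independently and running g(omega_m, f(omega_n, x)) gives the pushforward
# of the PEuc composite, which we compare against the analytic answer for
# the composed Gaussian kernels.
import random

random.seed(0)
N = 200_000

f = lambda omega, x: x + omega        # arrow over (Omega, B(Omega), mu)
g = lambda omega, x: 2 * x + omega    # arrow over an independent copy of Omega

def sample_omega():
    return random.gauss(0.0, 1.0)     # mu taken to be a standard normal

x = 1.0
# Pushforward of the PEuc composite: fresh, independent omegas for f and g.
composite_samples = [g(sample_omega(), f(sample_omega(), x)) for _ in range(N)]
mean = sum(composite_samples) / N

# Analytically, (g . f)((omega_m, omega_n), x) = 2(x + omega_n) + omega_m has
# mean 2x = 2.0 and variance 4 + 1 = 5 under mu (x) mu.
assert abs(mean - 2.0) < 0.05
var = sum((s - mean) ** 2 for s in composite_samples) / N
assert abs(var - 5.0) < 0.1
```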

Parameterized Statistical Models
We have been discussing the arrows in PEuc as parameterized random variables, or stochastic processes, but we can also think of them as Euc arrows with an element of randomness that is dictated by the probability measure µ. One of the primary goals of this work is to replace the domain of Fong et al.'s [12] Backpropagation functor, Para(Euc), with a probabilistically motivated category over which we can define the error function er : R × R → R through maximum likelihood. Therefore, a natural next step is to extend PEuc to a category in which we can instead think of the arrows as Para(Euc) arrows with an element of randomness added. In order to do this, we will replace the stochastic processes in PEuc with parameterized stochastic processes, which we will also refer to as parametric statistical models. That is, the arrows in this category will consist of families of random variables that have two layers of parameterization: one layer acts as the model input (e.g. the independent variable in a linear regression model) and one layer acts as the model parameters (e.g. the slope, intercept and variance terms).

The Category DF
Given a probability space (Ω, B(Ω), µ) where Ω = R^k, k ∈ N, any stochastic process f : Ω^n × R^a → R^b in PEuc defines a stochastic relationship between values in R^a and R^b. A parametric statistical model is a parameterized family of such relationships. For example, consider a univariate linear regression model l : Ω^n × R^3 × R → R, where:

l(ω_n, [a, b, s], x) = ax + b + f_{N(0,s²)}(ω_n)

and f_{N(0,s²)} is a normally distributed random variable with variance s². Any value [a, b, s] ∈ R^3 defines the stochastic process, or PEuc arrow, l(−, [a, b, s], −) : Ω^n × R → R. For any model input value x ∈ R, the function l(−, [a, b, s], x) is then a random variable defined on the probability space (Ω^n, B(Ω^n), µ^n). Like with any ordinary univariate linear regression model, this random variable is normally distributed on the real line.
We can define a category of such models by applying Para_{(Ω,B(Ω))*} to Para_{Euc}(Euc) to form the category Para_{(Ω,B(Ω))*}(Para_{Euc}(Euc)), which we will rename DF for brevity (see Table 1 for a list of all such abbreviations). This naming derives from the fact that the arrows in this category are Discriminative and Frequentist statistical models. That is, each arrow operates as if both the parameters and input values are fixed and only the output value is probabilistic. For example, the homset DF[R, R] includes the linear regression model above. In contrast, generative models and Bayesian models assume a probability distribution over the input and parameter values respectively.
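The linear regression arrow l can be sketched directly; here µ is taken to be a standard normal on Ω = R and the noise term f_{N(0,s²)} is realized as s·ω (illustrative code, not the paper's implementation):

```python
# A sketch of the univariate linear regression arrow l as a DF-style model:
# the outer parameters [a, b, s] and the input x are fixed, and only omega
# is random.
import random

def l(omega, params, x):
    a, b, s = params
    return a * x + b + s * omega      # f_{N(0, s^2)}(omega) realized as s * omega

random.seed(1)
params, x = (2.0, 1.0, 0.5), 3.0

# Fixing params and x yields a random variable l(-, params, x); it is normal
# with mean a*x + b = 7.0 and standard deviation s = 0.5.
samples = [l(random.gauss(0.0, 1.0), params, x) for _ in range(100_000)]
mean = sum(samples) / len(samples)
assert abs(mean - 7.0) < 0.02
```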

A subcategory of Gaussian-preserving transformations
A Gaussian-preserving transformation is a Borel-measurable function T : R^a → R^b such that for any multivariate normal random variable G, the random variable T(G) is also multivariate normal. For example, any linear function is Gaussian-preserving. Now for some probability space (Ω, B(Ω), µ) where Ω = R^k, k ∈ N, we can construct a set of DF-arrows N_µ such that for any f ∈ N_µ with the signature f : Ω^n × R^p × R^a → R^b and ω_n ∈ Ω^n, x_p ∈ R^p, x_a ∈ R^a:

f(ω_n, x_p, x_a) = T(x_p, x_a) + G(ω_n)

where T(x_p, −) : R^a → R^b is a Gaussian-preserving transformation and G : Ω^n → R^b is a multivariate normal random variable defined on the probability space (Ω^n, B(Ω^n), µ^n). Note that this includes the univariate linear regression model l, as well as the identity arrow, since constant distributions are multivariate normal with variance 0.
Note that N_µ is closed under the tensor in DF, since given the maps f : Ω^n × R^p × R^a → R^b and f′ : Ω^m × R^q × R^c → R^d in N_µ with f = T + G and f′ = T′ + G′, we have:

(f′ ⊗ f)((ω_m, ω_n), (x_q, x_p), (x_c, x_a)) = (T′(x_q, x_c), T(x_p, x_a)) + (G′(ω_m), G(ω_n))

where the first summand is a Gaussian-preserving transformation of (x_c, x_a) and (G′, G) : Ω^{m+n} → R^{d+b} is multivariate normal.
Next, we will define DF Nµ to be the category with the same objects as DF and arrows generated by the composition of arrows in N µ .

Proposition 7. DF Nµ is a strict symmetric monoidal subcategory of DF.
Proof. Since DF_{N_µ} contains the identities and is closed under composition by definition, we only need to demonstrate that DF_{N_µ} is closed under the monoidal product on arrows. We will demonstrate that for any f, g in Ar(DF_{N_µ}) we can write g ⊗ f as a composition of arrows in N_µ. First note that:

f = f_n ∘ f_{n−1} ∘ · · · ∘ f_1    g = g_m ∘ g_{m−1} ∘ · · · ∘ g_1

where for all i ≤ n, j ≤ m, f_i and g_j are arrows in N_µ. Without loss of generality, we will assume that n ≤ m, which implies that:

f = id ∘ · · · ∘ id ∘ f_n ∘ · · · ∘ f_1

where the identity arrow is repeated m − n times. We can now write the following:

g ⊗ f = (g_m ⊗ id) ∘ · · · ∘ (g_{n+1} ⊗ id) ∘ (g_n ⊗ f_n) ∘ · · · ∘ (g_1 ⊗ f_1)

Since each factor is a tensor of arrows in N_µ and is therefore in N_µ, g ⊗ f is a composition of arrows in N_µ and is therefore in DF_{N_µ}.

Proposition 8. Given any arrow f : Ω^n × R^p × R^a → R^b in DF_{N_µ} and any x_p ∈ R^p, x_a ∈ R^a, the function f(−, x_p, x_a) : Ω^n → R^b is a multivariate normal random variable defined on the probability space (Ω^n, B(Ω^n), µ^n).
Proof. We will show that this property holds for the arrows in N µ and that it is preserved by composition.
To begin, note that for any n, m, the pushforward of µ^m along f : Ω^m → R^a is equivalent to the pushforward of µ^{m+n} along the random variable f_l(ω_m, ω_n) = f(ω_m), where ω_m ∈ Ω^m, ω_n ∈ Ω^n. For σ_a ∈ B(R^a):

(f_l)_*µ^{m+n}(σ_a) = µ^{m+n}(f^{-1}(σ_a) × Ω^n) = µ^m(f^{-1}(σ_a)) µ^n(Ω^n) = f_*µ^m(σ_a)

By a similar argument we have that the pushforward of µ^m along f : Ω^m → R^a is equivalent to the pushforward of µ^{n+m} along the random variable f_r(ω_n, ω_m) = f(ω_m). Next, we note that for any x_p ∈ R^p, x_a ∈ R^a and arrow f : Ω^n × R^p × R^a → R^b in N_µ, the random variable f(−, x_p, x_a) : Ω^n → R^b is multivariate normal and defined on the probability space (Ω^n, B(Ω^n), µ^n). This follows from the fact that for ω_n ∈ Ω^n:

f(ω_n, x_p, x_a) = T(x_p, x_a) + G(ω_n)

where T(x_p, x_a) is a constant and G : Ω^n → R^b is multivariate normal. Next, we show that for any f : Ω^n × R^p × R^a → R^b in DF such that the random variable f(−, x_p, x_a) : Ω^n → R^b is multivariate normal, and any f′ = T′ + G′ : Ω^m × R^q × R^b → R^c in N_µ, the random variable:

(f′ ∘ f)(−, (x_q, x_p), x_a) : Ω^{m+n} → R^c

is multivariate normal over (Ω^{m+n}, B(Ω^{m+n}), µ^{m+n}), since:

(f′ ∘ f)((ω_m, ω_n), (x_q, x_p), x_a) = T′(x_q, f(ω_n, x_p, x_a)) + G′(ω_m)

Since the random variable f(−, x_p, x_a) : Ω^n → R^b is multivariate normal over (Ω^n, B(Ω^n), µ^n), by the note above we have that the random variable f_r((ω_m, ω_n), x_p, x_a) = f(ω_n, x_p, x_a) defined over (Ω^{m+n}, B(Ω^{m+n}), µ^{m+n}) is multivariate normal. Since x_q is constant and T′(x_q, −) is Gaussian-preserving, this implies that the following random variable is also multivariate normal:

T′(x_q, f_r(−, x_p, x_a)) : Ω^{m+n} → R^c

Similarly, the random variable G′_l(ω_m, ω_n) = G′(ω_m) is also multivariate normal and independent of T′(x_q, f_r(−, x_p, x_a)). Therefore, we can write:

(f′ ∘ f)(−, (x_q, x_p), x_a) = T′(x_q, f_r(−, x_p, x_a)) + G′_l

Since this is a sum of independent normally distributed random variables, the composite random variable is also multivariate normal. As an aside, note that N_µ itself is not closed under composition. Suppose f : Ω^n × R^p × R^a → R^b is in N_µ with f(ω_n, x_p, x_a) = T(x_p, x_a) + G(ω_n), and f′ : Ω^m × R^q × R^b → R^b is in N_µ with f′(ω_m, x_q, x_b) = T′(x_q, x_b) + G′(ω_m), where T′ multiplies its parameter and input componentwise. Note that T′ is Gaussian-preserving since the product of a constant and a Gaussian is Gaussian. Now we see that:

(f′ ∘ f)((ω_m, ω_n), (x_q, x_p), x_a) = x_q T(x_p, x_a) + x_q G(ω_n) + G′(ω_m)

The term x_q G(ω_n) depends on both the parameters and the sample, so we cannot express this composite as a sum of a Gaussian-preserving transformation over R^{q+p} × R^a → R^b and a multivariate normal random variable defined on (Ω^{n+m}, B(Ω^{n+m}), µ^{n+m}).

Relationship to Gauss
DF Nµ is similar to the category Gauss from Section 6 of Fritz et al. [14], with a few key differences. In Gauss, objects are natural numbers and morphisms a → b are tuples (M, C, s) where M is a matrix in R b×a , C is a positive semidefinite matrix in R b×b and s is a vector in R b .
Intuitively, the morphisms in Gauss represent transformations of random variables. That is, (M, C, s) implicitly represents the following transformation of random variables:

g(f) = Mf + ξ_{s,C}

where ξ_{s,C} is a multivariate normal random variable with mean s and covariance matrix C that is independent of f. If the random variable f is normally distributed, then g(f) is as well.
A primary difference between Gauss and DF_{N_µ} is that the morphisms in DF_{N_µ} explicitly include the functional form of ξ_{s,C} in the morphism itself. For any arrow (M, C, s) : a → b in Gauss and a choice of such a ξ_{s,C} over (Ω, B(Ω), µ), we can form the DF_{N_µ} arrow f : Ω × R^0 × R^a → R^b given by f(ω, x_a) = Mx_a + ξ_{s,C}(ω). However, since this arrow is dependent on the choice of ξ_{s,C}, this mapping is not functorial.
Accepted in Compositionality on 2021-02-02.

Proposition 9. DF Nµ is an Expectation Composition category.
Proof. We will use a proof by induction. By the definition of DF Nµ , there exists some k ∈ N such that we can express f as a composition of k arrows in N µ . First note that if k = 1, then f is in N µ , and the statement must hold since for x q ∈ R q , x p ∈ R p , x a ∈ R a : Without loss of generality we will assume f k−1 and h have the following signatures: Note that q ′ + q ′′ = q and m ′ + m ′′ = m. Now we can show the following, where the step marked * holds by induction and x q′ ∈ R q′ , x q′′ ∈ R q′′ , x p ∈ R p , x a ∈ R a , (ω m′ , ω m′′ , ω n ) ∈ Ω m′+m′′+n : By induction we have that the original statement holds for all f ∈ DF Nµ .
For f : Ω n × R p × R a → R b in an Expectation Composition category C and x p ∈ R p , x a ∈ R a , the following function must be differentiable by the Leibniz integration rule: f E (x p , x a ) = ∫ Ω n f (ω n , x p , x a ) dµ n . We can therefore define a functor Exp : C → Para(Euc) that acts as the identity on objects and sends the arrow f to f E .
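Concretely, the action of Exp on an arrow can be approximated by Monte Carlo integration. The sketch below assumes, purely for illustration, that µ n is the standard Gaussian measure on R n (the paper does not fix a particular µ); it sends a stochastic arrow f to the deterministic parameterized map f_E:

```python
import numpy as np

def expectation_arrow(f, n_omega, n_samples=50_000, seed=0):
    """Monte Carlo sketch of the Exp functor's action on an arrow.

    f(omega, x_p, x_a) is a stochastic process with noise space Omega^n.
    Returns the deterministic Para(Euc) arrow f_E(x_p, x_a) = E[f(-, x_p, x_a)].
    Assumption (ours): mu^n is standard Gaussian on R^n.
    """
    def f_E(x_p, x_a):
        rng = np.random.default_rng(seed)
        omegas = rng.standard_normal((n_samples, n_omega))
        # average the stochastic arrow over samples of omega
        return np.mean([f(w, x_p, x_a) for w in omegas], axis=0)
    return f_E
```

For an arrow with additive zero-mean noise, such as f(ω, x_p, x_a) = x_p · x_a + ω, the expectation arrow recovers the deterministic part x_p · x_a.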

Likelihood and Learning
In this section we will apply the maximum likelihood procedure to the arrows in DF to derive the error function er : R × R → R. We will then use this error function to define a modification of Fong et al.'s [12] backpropagation functor. However, since different arrows in DF have likelihood functions of different forms, we will not define a single backpropagation functor out of DF. Instead, we will define multiple functors from subcategories of DF into Learn.
To do this, we will first define a substructure of DF with well-defined likelihood functions. Then, we will describe a class of subcategories of DF derived from this substructure. Finally, we will define two backpropagation functors for any subcategory in this class.

Conditional Likelihood
The conditional likelihood is a general measure of the goodness of fit of a set of parameters and observed data for a given parametric statistical model. We can define the conditional likelihood of a parametric statistical model f : Ω n × R p × R a → R b over the probability space (Ω n , B(Ω n ), µ n ) at the points x p ∈ R p , x a ∈ R a , x b ∈ R b in terms of the pushforward measure of µ n along the random variable f ( , x p , x a ). To do this, we evaluate the Radon-Nikodym derivative of f ( , x p , x a ) * µ n = µ n (f ( , x p , x a ) −1 ) with respect to a reference measure at the point x b . In this work we select the Lebesgue measure over R b , λ b , as the reference measure. Note that the Radon-Nikodym derivative with respect to the Lebesgue measure is not defined for all measures. For example, no discrete measure has a Radon-Nikodym derivative with respect to the Lebesgue measure, since any finite collection of points has Lebesgue measure zero. For example, the conditional likelihood function for the univariate linear regression model l that we introduced in Section 5.1 is L l :

Definition 6.1. An abstract conditional likelihood from R a to R b is a Borel-measurable and Lebesgue-integrable function of the form L : R p × R a × R b → R. We can define the composition of the abstract conditional likelihoods L : R p × R a × R b → R and L ′ : R q × R b × R c → R to be (L ′ • L)((x q , x p ), x a , x c ) = ∫ R b L ′ (x q , x b , x c ) L(x p , x a , x b ) dλ b . Similarly, we can define a tensor product of abstract conditional likelihoods: the tensor of L : R p × R a × R b → R and L ′ : R q × R c × R d → R is (L ′ ⊗ L)((x q , x p ), (x c , x a ), (x d , x b )) = L ′ (x q , x c , x d ) L(x p , x a , x b ).
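For a linear model with additive Gaussian noise, the pushforward of µ n along f ( , x p , x a ) is a normal distribution, so its Radon-Nikodym derivative with respect to Lebesgue measure is just the normal density. The sketch below uses an illustrative parameterization (slope a, intercept c, unit noise variance) that may differ from the Section 5.1 model:

```python
import math

def normal_pdf(x, mean, var):
    """Density of N(mean, var) with respect to Lebesgue measure."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def linear_likelihood(x_p, x_a, x_b, noise_var=1.0):
    """Conditional likelihood of a univariate linear-Gaussian model.

    Illustrative model (our parameterization, not necessarily the paper's):
    f(omega, (a, c), x_a) = a * x_a + c + omega with omega ~ N(0, noise_var).
    The pushforward of mu along f(-, x_p, x_a) is N(a * x_a + c, noise_var),
    and its Radon-Nikodym derivative w.r.t. Lebesgue measure is the pdf below.
    """
    a, c = x_p
    return normal_pdf(x_b, a * x_a + c, noise_var)
```

As a sanity check, the likelihood in x_b integrates to one (it is a probability density) and peaks at the model's predicted mean.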
We can define a monoidal semicategory of abstract conditional likelihoods, which we name CondLikelihood. Monoidal semicategories are similar to monoidal categories, but lack identity morphisms.

Definition 6.2. A monoidal semicategory is a monoid object in SemiCat, the monoidal category of semicategories.
The objects in CondLikelihood are spaces of the form R n for some n ∈ N. The tensor of the objects R a and R b in CondLikelihood is defined to be R a+b . The unit of this tensor is R 0 .
The morphisms between R a and R b are equivalence classes of abstract conditional likelihood functions, where L and L * are equivalent when for all x p ∈ R p , x a ∈ R a the functions L(x p , x a , ) and L * (x p , x a , ) : R b → R are λ b -a.e. equivalent. We define the composition and tensor of these equivalence classes in terms of their representatives. That is, consider the equivalence classes L and L ′ and suppose L i ∈ L, L j ∈ L ′ . Note that for any x q ∈ R q , x p ∈ R p , x a ∈ R a , the functions (L j • L i )((x q , x p ), x a , ) : R c → R for all L i ∈ L, L j ∈ L ′ are λ c -a.e. equivalent, so CondLikelihood is closed under composition. The tensor of equivalence classes is defined similarly.
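Assuming composition integrates out the intermediate variable (Chapman-Kolmogorov style), composing two linear-Gaussian likelihoods should again be linear-Gaussian, with the means composed and the variances transformed accordingly. The numerical sketch below (our own helper names) checks this:

```python
import math

def npdf(x, mean, var):
    """Density of N(mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def compose_likelihoods(L2, L1, x_a, x_c, lo=-50.0, hi=50.0, steps=20_000):
    """Numerically compose two conditional likelihoods by integrating out
    the intermediate variable x_b (midpoint rule; parameters suppressed):
        (L2 . L1)(x_a, x_c) = int L2(x_b, x_c) * L1(x_a, x_b) dx_b
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x_b = lo + (i + 0.5) * h
        total += L2(x_b, x_c) * L1(x_a, x_b) * h
    return total
```

Composing x_b ~ N(2 x_a, 1) with x_c ~ N(3 x_b, 0.5) should give x_c ~ N(6 x_a, 9 · 1 + 0.5), which the integration reproduces.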
However, CondLikelihood does not form a category, because objects in CondLikelihood do not necessarily have identities. For example, for b > 0 there is no abstract conditional likelihood I from R b to R b such that I • L = L for every abstract conditional likelihood L into R b : such an identity would need to behave like a Dirac delta, which is not a function.

Proposition 10. CondLikelihood is a monoidal semicategory.
Proof. We will first show that CondLikelihood is a semicategory. We have already shown that CondLikelihood is closed under composition, so we simply need to show that composition is associative. Suppose the following are representatives of three arrows in CondLikelihood: Now consider the representatives of their compositions ((g 3 • g 2 ) • g 1 ) and (g 3 • (g 2 • g 1 )): Therefore, composition in CondLikelihood is associative, so CondLikelihood is a semicategory. Next, we will show that CondLikelihood is a monoid object in SemiCat. Note that: Now suppose the following are representatives of three arrows in CondLikelihood: Consider the representatives of their tensors ((g 3 ⊗ g 2 ) ⊗ g 1 ) and (g 3 ⊗ (g 2 ⊗ g 1 )). For x a3 ∈ R a3 , x a2 ∈ R a2 , and x a1 ∈ R a1 :
Therefore, ⊗ satisfies the associative law as well as the left and right unit laws.
If we extend from functions to generalized functions (distributions) we can form a category similar to CondLikelihood. For example, Blute et al. [3] define a category DRel of tame distributions in which the Dirac delta δ exists as a singular distribution. The semicategory CondLikelihood is similar in spirit to the nuclear ideal of DRel that Blute et al. describe. However, we will use conditional likelihood functions to define optimization objectives, and there is no obvious way to do this with a singular distribution. For this reason we will keep CondLikelihood as a monoidal semicategory. Next, given a probability space (Ω, B(Ω), µ), define DF Rµ to be the substructure of DF with the same objects, but with morphisms between R a and R b limited to those f : Ω n × R p × R a → R b for which the following Borel-measurable and Lebesgue-integrable function exists:

Proposition 11. DF Rµ is a monoidal semicategory.
Proof. We will first show that DF Rµ is closed under composition. Suppose f : Ω n × R p × R a → R b and f ′ : Ω m × R q × R b → R c are arrows in DF Rµ . We can show that for all x a ∈ R a , x p ∈ R p , x q ∈ R q there exists some Borel-measurable and Lebesgue-integrable g : R c → R such that for σ c ∈ B(R c ), where λ c is the Lebesgue measure over R c : Next, we will show that DF Rµ is closed under tensor. Suppose f : Ω n × R p × R a → R b and f ′ : Ω m × R q × R c → R d are arrows in DF Rµ . We can show that for all x q ∈ R q , x p ∈ R p , x c ∈ R c , x a ∈ R a there exists some measurable g : R d+b → R such that, where λ d+b is the Lebesgue measure over R d+b :

Next, we can define the mapping RN µ : DF Rµ → CondLikelihood that acts as the identity on objects and sends any morphism f : Ω n × R p × R a → R b in DF Rµ to the equivalence class that contains the function RN µ f : R p × R a × R b → R. Note that Proposition 11 implies that this function exists.

Given a probability measure τ over (R a × R b , B(R a × R b )), the maximum expected log-likelihood estimator for f with respect to τ is the vector x p ∈ R p that maximizes the following function: That is, the maximum expected log-likelihood estimator for f with respect to τ is the vector x p that maximizes the expected value of log RN µ f . Equivalently, x p minimizes the weighted sum over x a of the KL-divergences between f ( , x p , x a ) * µ n and τ (x a , ), where the weight of each x a is determined by τ [23]. Now suppose that instead of observing a probability space (R a × R b , B(R a × R b ), τ ) we observe a finite dataset S n of samples (x a1 , x b1 ), . . . , (x an , x bn ) in R a × R b . The maximum log-likelihood estimator for f with respect to this dataset is the vector x p ∈ R p that maximizes the function: Note that if we assume the samples in S n are drawn from (R a × R b , B(R a × R b ), τ ), then by the weak law of large numbers (1/n) L Sn converges to L τ in probability as n → ∞. However, it will be challenging to derive an objective function for Fong et al.'s [12] backpropagation functor from L Sn directly, since their construction assumes that the error function has the signature er : R × R → R and has an invertible derivative.
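The maximum log-likelihood estimator can be computed numerically by gradient ascent on L_Sn. The sketch below works through the one-parameter linear-Gaussian case (model f = w · x_a + noise, a choice of ours for illustration; the paper derives the objective, not any particular optimizer):

```python
import math
import random

def log_likelihood(w, data, noise_var=1.0):
    """Gaussian log-likelihood L_Sn(w) for the model x_b = w * x_a + noise."""
    return sum(
        -((x_b - w * x_a) ** 2) / (2 * noise_var)
        - 0.5 * math.log(2 * math.pi * noise_var)
        for x_a, x_b in data
    )

def mle(data, lr=0.01, steps=500):
    """Maximum log-likelihood estimation by gradient ascent (sketch)."""
    w = 0.0
    for _ in range(steps):
        # gradient of the Gaussian log-likelihood with respect to w
        grad = sum((x_b - w * x_a) * x_a for x_a, x_b in data)
        w += lr * grad / len(data)
    return w
```

On synthetic data generated with a true slope of 1.5 and small noise, the estimator recovers a value near 1.5, and no parameter achieves a higher log-likelihood than it up to optimization tolerance.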
We will slightly modify L Sn to make this easier. For f : Ω n × R p × R a → R b , the jth component f ( , x p , x a )[j] is a real-valued random variable over Ω n , and the marginal likelihood at x p ∈ R p of this component for some sample (x ai , x bi ) ∈ S n is: where we write x bi [j] for the jth component of x bi . The maximum log-marginal-likelihood estimator for f with respect to this dataset is then the vector x p ∈ R p that maximizes the function: Note that M Sn (x p ) = L Sn (x p ) when the real-valued random variables f ( , x p , x ai )[j] are mutually independent for all x ai . This suggests a criterion for an error function er : R × R → R over which we can define Fong et al.'s [12] backpropagation functor: we want the following two real-valued functions of R p to move in tandem for any fixed (x a , y) ∈ R a × R and j ≤ b: We will now make this formal.
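The condition M_Sn = L_Sn under component independence is easy to verify in the Gaussian case: for a diagonal-covariance Gaussian, the joint log-density is exactly the sum of the per-component marginal log-densities. A minimal sketch (helper names are ours):

```python
import math

def log_joint_diag_gaussian(x_b, means, variances):
    """Joint log-density of a Gaussian with independent components
    (diagonal covariance) at the point x_b."""
    k = len(x_b)
    quad = sum((x_b[j] - means[j]) ** 2 / variances[j] for j in range(k))
    norm = sum(math.log(2 * math.pi * variances[j]) for j in range(k))
    return -0.5 * (quad + norm)

def log_marginals_sum(x_b, means, variances):
    """Sum over j of the log marginal density of component j."""
    return sum(
        -0.5 * ((x_b[j] - means[j]) ** 2 / variances[j]
                + math.log(2 * math.pi * variances[j]))
        for j in range(len(x_b))
    )
```

The two quantities agree exactly, which is the independent-components case in which the marginal objective coincides with the joint likelihood.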

Learning from Likelihoods
Suppose we have a real-valued random variable f over the probability space (Ω n , B(Ω n ), µ n ). Write E µ n [f ] ∈ R for the expectation of f over µ n , E µ n [f ] = ∫ Ω n f dµ n , and define f 0 to be the centered random variable f 0 = f − E µ n [f ]. Next, suppose U : Cat → SemiCat is the forgetful functor.
h r −→ DF Rµ that satisfies the following property. There exists:
• a differentiable function with invertible derivative er : R × R → R;
• for each n ∈ N, a function α n : (Ω n → R) → R;
• for each n ∈ N, a non-negative function β n : (Ω n → R) → R;
such that for any x p ∈ R p , x a ∈ R a , j ≤ b and arrow in the semicategory C Rµ whose image under inc • h l : C Rµ → U (DF) has the signature f : Ω n × R p × R a → R b , we can write: We will refer to er as a marginal error function of C.

Proposition 13. DF

Proof. To begin, consider the structure C Rµ that has the same objects as DF Rµ and: Since U (DF Nµ ) and DF Rµ are small, this intersection is well-defined and C Rµ is a semicategory. Now note that there exist identity-on-objects and identity-on-morphisms inclusion semifunctors such that the following diagram commutes: Now consider any other semicategory C ′ equipped with monic semifunctors l and r such that the following diagram commutes: Since inc and inc ′ are inclusion maps, l and r must act identically on objects and morphisms. Therefore, any object or morphism in the image of l or r must also be in C Rµ , so we can define the unique semifunctor h : C ′ → C Rµ that has the same action on objects and morphisms as l and r. This implies that C Rµ is the pullback of the diagram: Next, consider some f : Ω n × R p × R a → R b in C Rµ , and note that for any x p ∈ R p , x a ∈ R a , j ≤ b, the random variable f ( , x p , x a )[j] is univariate normal. For each n ∈ N we also define the standard deviation function s n : (Ω n → R) → R, where for g : Ω n → R: Now for any x p ∈ R p , x a ∈ R a , y ∈ R, j ≤ b we can write: Therefore:

Backpropagation Functors
For any Marginal Likelihood Factorization Category C and choice of learning rate we can define two kinds of backpropagation functors: one into Fong et al.'s category Learn [12] and one into a probabilistic analog of Learn. We will first show the functor that maps C into Learn. Write F er for Fong et al.'s backpropagation functor with this learning rate under the marginal error function er of C. Then we can define the functor E er = F er • Exp, which maps a parametric statistical model in C to a learning algorithm. For example, this functor sends parametric statistical models in DF Nµ to learning algorithms that minimize the square error function with gradient descent. We can think of E er as a point estimation functor: it sends an arrow f in C to a learner whose inference function is formed from f 's expectation. The higher-order moments of the pushforward distributions of the arrows in C are only used to define the loss function er.
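A morphism in Learn is a tuple of an inference, update, and request function. The sketch below builds such a tuple for a one-dimensional expectation model under square error and gradient descent; it follows the general shape of Fong et al.'s construction rather than reproducing it exactly, and all names are illustrative:

```python
def point_estimation_learner(f_E, lr=0.1):
    """Learn-style morphism (sketch) for the model f_E(w, x) under
    square error er(y_hat, y) = (y_hat - y)^2."""
    def inference(w, x):
        # run the deterministic expectation model
        return f_E(w, x)
    def update(w, x, y):
        # one gradient-descent step on the square error, w.r.t. w
        grad_w = 2 * (f_E(w, x) - y) * x
        return w - lr * grad_w
    def request(w, x, y):
        # gradient step on the input, used to chain learners backwards
        grad_x = 2 * (f_E(w, x) - y) * w
        return x - lr * grad_x
    return inference, update, request
```

Iterating the update function on a fixed training pair drives the parameter to the least-squares solution, which is the behavior E er assigns to models in DF Nµ.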
Next, consider the strict symmetric monoidal subcategory Learn R of Learn where objects are restricted to be R n , n ∈ N and the tensor of objects is R n ⊗ R m = R n+m . Now given the probability space (Ω, B(Ω), µ) where Ω = R k , k ∈ N, we can form the category Para (Ω,B(Ω)) * (Learn R ). A morphism between R a and R b in Para (Ω,B(Ω)) * (Learn R ) is a tuple (I, U, r) where I, U, r are functions of types:
Intuitively, we can think of such a morphism as a statistical learner in which each of the inference, update and request functions are stochastic processes over (Ω n , B(Ω n ), µ n ). Now since DF = Para (Ω,B(Ω)) * (Para(Euc)), by Proposition 3 the mapping P er : DF → Para (Ω,B(Ω)) * (Learn R ) that applies the same actions on objects and arrows as F er is a strict monoidal functor. Unlike E er , however, this functor does not define the gradient update for the statistical model f in terms of its expectation. Instead, given a parameter vector x p ∈ R p , input vector x a ∈ R a and output vector x b ∈ R b , the update function U in the image of P er will generate different updates for different samples of ω n from (Ω n , B(Ω n ), µ n ). This is similar to how TensorFlow Probability [22] defines the update step for Distribution layers.
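To make the contrast with E er concrete, here is a sketch of a stochastic update step for the illustrative model f(ω, w, x) = w · x + ω with ω ~ N(0, 1) (our choice of model and noise, for demonstration): each call draws a fresh ω, so different samples yield different updates, while the updates average out to the deterministic one because the noise has mean zero.

```python
import random

def stochastic_update(w, x, y, lr=0.1, seed=None):
    """Update step of a stochastic learner (sketch).

    Model assumption (ours): f(omega, w, x) = w * x + omega, omega ~ N(0, 1).
    Each update samples a fresh omega before taking a gradient step on the
    square error at that sample.
    """
    rng = random.Random(seed)
    omega = rng.gauss(0.0, 1.0)
    # gradient of (w*x + omega - y)^2 with respect to w, at this sample
    grad_w = 2 * (w * x + omega - y) * x
    return w - lr * grad_w
```

Averaged over many samples of ω, the stochastic update recovers the point-estimation update, since E[ω] = 0.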

Discussion and Future Work
Consider once again a physical system that is composed of several components, each of which has some degree of aleatoric uncertainty. If we construct a neural network model for this system like we describe in Section 1, we cannot characterize the interactions between the uncertainty in the different parts of the system. However, if we model the components of the system as stochastic processes and apply DF composition, we can capture how the uncertainty of the component parts combines. For example, given estimates of the kind of uncertainty inherent to the photoreceptors in the eye, edge-detecting neurons in primary visual cortex, and higher-order feature detectors in the later stages of visual cortex, we may be able to build a more realistic model of how these sources of uncertainty interact than the one that Eberhardt et al. [8] use to assess how the visual cortex performs a rapid stimulus categorization task.
Once we build such a model, we can use either E er or P er to derive a Learner with a structure that incorporates this combined uncertainty. The functor E er will convert the model to a point estimator and bundle the combined uncertainty into a loss function. In contrast, P er will preserve the uncertainty and produce a learning algorithm where both forward and backward passes are stochastic.
One of the largest differences between this construction and those of Cho and Jacobs [4] and Culbertson and Sturtz [6] is the treatment of model updates in the face of new data. While these authors also describe categorical frameworks in which we can model how a new observation updates the parameters of a statistical model, they primarily study Bayesian algorithms in which the model parameters are represented with a probability distribution.
In contrast, our construction is inherently frequentist. While the backpropagation functors above aim to find an optimal parameter value given the data we have seen, they make no assumptions about what that value may be. Although uncertainty motivates the objective that our parameter estimation procedure aims to optimize, the optimization algorithm does not use it directly. Therefore, a potential future direction for this work is to extend the category DF of deterministic and frequentist models to handle generative algorithms that model uncertainty in the input vector and Bayesian algorithms that model uncertainty in the parameter vector.
Furthermore, our current definition of Marginal Likelihood Factorization Categories may be overly restrictive. For example, our definition specifies that each category is characterized by a single marginal error function er. This makes it challenging to build a theory for how we could compose Marginal Likelihood Factorization Categories with different marginal error functions. Another potential future direction would be to relax the restrictions on these categories or prove that they are necessary.

Proof. First, note that PEuc is semicartesian because the monoidal unit (R 0 , B(R 0 )) is the terminal object. Next, we will show that cp and dc satisfy the conditions in Definition 2.1 of Fritz [14]. Note that we write the symmetric swap map as σ :