## Likelihood calculus paper series review part 3 – Distributed control of uncertain systems using superpositions of linear operators

The third (and final, at the moment) paper in the likelihood calculus series from Dr. Terrence Sanger is Distributed control of uncertain systems using superpositions of linear operators. Carrying the torch for the series right along, here Dr. Sanger continues investigating the development of an effective, general method of controlling systems operating under uncertainty. This is the paper that delivers on all the promises of building a controller out of a system described by the stochastic differential operators we’ve been learning about in the previous papers. In addition to describing the theory, there are examples of system simulation with code provided! That’s a wonderful, and sadly uncommon, thing in academic papers, so I’m excited. We’ll go through a comparison of Bayes’ rule and Markov processes (described by our stochastic differential equations), go quickly over the stochastic differential operator description, and then dive into the control of systems. The examples and code run-through I’m going to have to save for another post, though, just to keep the size of this post reasonable.

The form our system model equation will take is the very general

$dx = f(x)dt + \sum g_i(x)u_i dt + \sum h_i(x, u_i)dB_i$,

where $f(x)$ represents the environment dynamics, previously also referred to as the unforced or passive dynamics of the system, $g_i(x)$ describes how the control signal $u_i$ affects the system state $x$, $h_i(x, u_i)$ describes the state and control dependent additive noise, and $dB_i$ is a set of independent noise processes such as Brownian motion. Our control goal is to find a set of functions of time, $u_i(t)$, such that the dynamics of our system behave according to a desired set of dynamics that we have specified.
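To make this system model concrete, here's a minimal Euler-Maruyama simulation of it. The particular choices of $f$, $g_i$, and $h_i$ below are my own toy assumptions for illustration, not from the paper:

```python
import numpy as np

def simulate_sde(f, gs, hs, u, x0, dt=1e-3, T=1.0, seed=0):
    """Euler-Maruyama integration of
    dx = f(x)dt + sum_i g_i(x) u_i dt + sum_i h_i(x, u_i) dB_i."""
    rng = np.random.default_rng(seed)
    n_steps = int(T / dt)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        drift = f(x[k]) + sum(g(x[k]) * ui for g, ui in zip(gs, u))
        noise = sum(h(x[k], ui) * rng.normal(0.0, np.sqrt(dt))
                    for h, ui in zip(hs, u))
        x[k + 1] = x[k] + drift * dt + noise
    return x

# Toy one-dimensional system (my own choices): leaky passive dynamics,
# a single additive controller, and small constant-variance noise
f = lambda x: -x                 # passive dynamics pull the state toward 0
gs = [lambda x: 1.0]             # the control signal enters additively
hs = [lambda x, u: 0.1]          # noise independent of state and control here
trajectory = simulate_sde(f, gs, hs, u=[0.5], x0=2.0)
```

With a constant control signal $u = 0.5$, the state decays from $x_0 = 2$ toward the equilibrium the control sets, jittered along the way by the noise term.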

In prevalent control theory methods, uncertainty is a difficult problem that often can only be handled effectively under a number of simplifications, such as assuming linear systems with additive Gaussian noise. In the biological systems that we want to model, however, uncertainty is ubiquitous: there is noise on outgoing and incoming signals, there are unobserved controllers simultaneously exerting influence over the body, and there are complicated, often unmodeled, highly nonlinear dynamics of the system and its interactions with the environment. In the brain especially, the effect of unobserved controllers is a particular problem. Multiple areas of the brain will simultaneously send control signals to the body, and the results of these signals tend to be only partially known, or known only after a delay, to the other areas. So for modeling, we need an effective means of controlling distributed systems operating under uncertainty. And that’s exactly what this paper presents: ‘a mathematical framework that allows modeling of the effect of actions deep within a system on the stochastic behaviour of the global system.’ Importantly, the stochastic differential operators that Dr. Sanger uses to do this are linear operators, which opens up a whole world of linear methods to us.

### Bayes’ rule and Markov processes

Bayesian estimation is often used in sensory processing to model the effects of state uncertainty, combining prior knowledge of state with a measurement update. Because we’re dealing with modeling various types of system uncertainty, it seems like a good idea to consider Bayes’ rule. Here, Dr. Sanger shows that Bayes’ rule is in fact insufficient in this application, and Markov processes must be used. There are a number of neat insights that come from this comparison, so it’s worth going through.

Let’s start by writing Bayes’ rule and Markov processes using matrix equations. Bayes’ rule is the familiar equation

$p(x|y) = \frac{p(y|x)}{p(y)}p(x)$,

where $p(x)$ represents our probability density or distribution, so $p(x) \geq 0$ and $\sum_i p(x = i) = 1$. This equation maps a prior density $p(x)$ to the posterior density $p(x|y)$. Essentially, this tells us the effect a given measurement of $y$ has on the probability of being in state $x$. To write this in matrix notation, we assume that $x$ takes on a finite number of states, so $p(x)$ is a vector, which gives

$p_x' = Ap_x$,

where $p_x$ and $p_x'$ are the prior and posterior distributions, respectively, and $A$ is a diagonal matrix with elements $A_{ii} = \frac{p(y|x=i)}{p(y)}$.

Now, the matrix equation for a discrete-time, finite-state Markov process is written

$p_x(t+1) = Mp_x(t)$.

So where in Bayes’ rule the matrix (our linear operator) $A$ maps the prior distribution into the posterior distribution, in Markov processes the linear operator $M$ maps the probability density of the state at time $t$ into the density at time $t+1$. The differences come in through the form of the linear operators: the major one is that $A$ is a diagonal matrix, while there is no such restriction on $M$. The implication of this is that in Bayes’ rule the update of a given state $x$ depends only on the prior likelihood of being in that state, whereas in Markov processes the likelihood of a given state at time $t+1$ can depend on the probability of being in other states at time $t$. The off-diagonal elements of $M$ allow us to represent the probability of state transition, which is critical for capturing the behavior of dynamical systems. This is the reason why Bayes’ method is insufficient for our current application.
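Here's a tiny numerical sketch of this difference; the three-state density and matrices below are made-up numbers for illustration:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])           # prior density over three states

# Bayes' rule: A is diagonal, A_ii = p(y|x=i)/p(y), so each state's
# probability is rescaled independently; nothing flows between states.
likelihood = np.array([0.9, 0.5, 0.1])  # p(y|x=i), made-up numbers
p_y = likelihood @ p                    # p(y) = sum_i p(y|x=i) p(x=i)
A = np.diag(likelihood / p_y)
posterior = A @ p

# Markov process: off-diagonal entries of M move probability between
# states; each column sums to 1, so the result is still a valid density.
M = np.array([[0.8, 0.1, 0.0],
              [0.2, 0.8, 0.3],
              [0.0, 0.1, 0.7]])
p_next = M @ p

print(posterior, p_next)   # both still sum to 1
```

Note that the Markov update moves probability from the first state into the second even though nothing about the first state was measured; the diagonal Bayes update can never do that.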

### Stochastic differential operators

This derivation of stochastic differential operators is by now relatively familiar ground, so I’ll be quick here. We start with a stochastic differential equation

$dx = f(x)dt + g(x)dB$,

where $dB$ is our noise source, the differential of unit variance Brownian motion. The noise term introduces randomness into the state variable $x$, so we describe $x$ with a probability density $p(x)$ that evolves through time. This change of the probability density through time is captured by the Fokker-Planck partial differential equation

$\frac{\partial}{\partial t}p(x,t) = - \frac{\partial}{\partial x}(f(x)p(x,t)) + \frac{1}{2} \frac{\partial^2}{\partial x^2}(g(x)g^T(x)p(x,t))$,

which can be rewritten as

$\dot{p} = \mathcal{L}p$,

where $\mathcal{L}$ is the linear operator

$\mathcal{L} = - \frac{\partial}{\partial x}f(x) + \frac{1}{2} \frac{\partial^2}{\partial x^2}g(x)g^T(x)$.

$\mathcal{L}$ is referred to as our stochastic differential operator. Because the Fokker-Planck equation is proven to preserve probability densities (non-negativity and sum to 1), applying $\mathcal{L}$ to update our density will maintain its validity.

What is so exciting about these operators is that even though they describe nonlinear systems, they themselves are linear operators. What this means is that if we have two independent components of a system that affect its dynamics, described by $\mathcal{L}_1$ and $\mathcal{L}_2$, we can determine their combined effects on the overall system dynamics through a simple summation, i.e. $\dot{p} = (\mathcal{L}_1 + \mathcal{L}_2)p$.
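As a sketch of what this looks like numerically, here's one way to discretize $\mathcal{L}$ on a grid, using a conservative finite-volume scheme so that total probability is preserved, and then evolve a density under the superposition of two operators. The discretization scheme, grid, and example dynamics are all my own choices, not the paper's:

```python
import numpy as np

def fokker_planck_operator(f, g, x):
    """Finite-volume discretization of L p = -(f p)' + 0.5 (g^2 p)''
    with zero-flux (reflecting) boundaries, so that applying the
    resulting matrix conserves total probability."""
    n, dx = len(x), x[1] - x[0]
    fx, g2 = f(x), g(x) ** 2
    L = np.zeros((n, n))
    for j in range(n):
        p = np.zeros(n); p[j] = 1.0      # column j = response to a point mass
        fp, gp = fx * p, g2 * p
        # probability flux F = f p - 0.5 d/dx(g^2 p) at interior interfaces
        F = 0.5 * (fp[:-1] + fp[1:]) - 0.5 * (gp[1:] - gp[:-1]) / dx
        F = np.concatenate(([0.0], F, [0.0]))   # zero flux at the boundaries
        L[:, j] = -(F[1:] - F[:-1]) / dx
    return L

x = np.linspace(-3, 3, 121)
dx = x[1] - x[0]
L1 = fokker_planck_operator(lambda x: -x, lambda x: 0.5 * np.ones_like(x), x)
L2 = fokker_planck_operator(lambda x: np.ones_like(x), lambda x: 0.2 * np.ones_like(x), x)

# Superposition: the combined effect of the two components is just L1 + L2
L = L1 + L2
p = np.exp(-x ** 2); p /= p.sum() * dx   # initial density, normalized
dt = 1e-3
for _ in range(1000):
    p = p + dt * L @ p                   # forward-Euler update of the density
print(p.sum() * dx)                      # total probability stays ~1.0
```

The density drifts under the summed dynamics (here, toward the equilibrium of the combined drift $-x + 1$) while the total probability is conserved, as the Fokker-Planck equation guarantees.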

### Controlling with stochastic differential operators

Last time, we saw that control can be introduced by attaching a weighting term to the superposition of controller dynamics, giving

$\dot{p} = \sum_i u_i \mathcal{L}_i p$,

where $\mathcal{L}_i$ is the stochastic differential operator of controller $i$, and $u_i$ is the input control signal to that controller. In the context of a neural system, this equation describes a set of subsystems whose weighted dynamics give rise to the overall behavior of the system. By introducing our control signals $u_i$, we’ve made the dynamics of the overall system flexible. As mentioned in the previous review post, our control goal is to drive the system to behave according to a desired set of dynamics. Formally, we want to specify $u_i$ such that the actual system dynamics, $\hat{\mathcal{L}}$, match some desired set of dynamics, $\mathcal{L}^*$. In equation form, we want $u_i$ such that

$\mathcal{L}^* \approx \hat{\mathcal{L}} = \sum_i u_i \mathcal{L}_i$.

It’s worth noting also here that the resulting $\hat{\mathcal{L}}$ still preserves the validity of densities that it is applied to.

### How well can the system approximate a set of desired dynamics?

In this next section of the paper, Dr. Sanger talks about reworking the stochastic operators of a system into an orthogonal set, which can then be used to easily approximate a desired set of dynamics. My guess is that the motivation behind doing this is to see how close the given system can come to reproducing the desired dynamics. I say this because the exercise doesn’t really generate control information that can be used to directly control the system, unless we translate the calculated weights back into terms of the actual set of actions that we have available. But it can help you understand what your system is capable of.

To do this, we’ll use Gram-Schmidt orthogonalization, which I describe in a recent post. To actually follow this orthogonalization process we’ll need to define an inner product and normalization operator appropriate for our task. A suitable inner product will be one that lets us compare the similarity between two of our operators, $L_1$ and $L_2$, in terms of their effects on an initial state probability density, $p_0$. So define

$\langle L_1, L_2 \rangle_{p_0} = \langle L_1 p_0, L_2 p_0 \rangle = \langle \dot{p}_{L_1} , \dot{p}_{L_2} \rangle$

for the discrete-space case, and similarly

$\langle \mathcal{L}_1, \mathcal{L}_2 \rangle_{p_0} = \int (\mathcal{L}_1 p_0)(\mathcal{L}_2 p_0)dx = \int \dot{p}_{L_1} \dot{p}_{L_2} dx$

in the continuous-state space. So this inner product calculates the change in probability density resulting from applying each of the two operators to the initial condition, and uses the degree to which they move the system in the same direction as the measure of similarity.
The induced norm that we’ll use is the 2-norm,

$||L||_{p_0} = \frac{||L p_0||_2}{||p_0||_2}.$

With the above inner product and normalization operators, we can now take our initial state, $p_0$, and create a new orthogonal set of stochastic differential operators that span the same space as the original set through the Gram-Schmidt orthogonalization method. Let’s denote the orthonormal basis set vectors as $\Lambda_i$. Now, to approximate a desired operator $L^*$, generate a set of weights, $\alpha$, over our orthonormal basis set using our inner product: $\alpha_i = \langle L^*, \Lambda_i \rangle_{p_0}$. Once we have the $\alpha_i$, the desired operator can be recreated (as best as possible given this basis set),

$L^* \approx \hat{L} = \sum \alpha_i \Lambda_i.$

This $\hat{L}$ could then be used as a comparison measure: the best approximation to a desired set of dynamics that a given system can achieve with its set of operators.
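Here's a rough sketch of the orthogonalization procedure, with random matrices standing in for the system's operators; everything below (sizes, number of operators, the random stand-ins) is an illustrative assumption, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
p0 = rng.random(n); p0 /= p0.sum()       # an arbitrary initial density

# Random matrices stand in for the system's operators and the desired one
Ls = [rng.standard_normal((n, n)) for _ in range(4)]
L_star = rng.standard_normal((n, n))

def inner(La, Lb):
    # <La, Lb>_{p0} = <La p0, Lb p0>: compare operators by their effect on p0
    return (La @ p0) @ (Lb @ p0)

# Gram-Schmidt on the operators under the p0-weighted inner product
basis = []
for L in Ls:
    for Lam in basis:
        L = L - inner(L, Lam) * Lam
    norm = np.sqrt(inner(L, L))
    if norm > 1e-12:                     # drop operators with no new effect
        basis.append(L / norm)

# Best approximation of L_star within the span of the available operators
alphas = [inner(L_star, Lam) for Lam in basis]
L_hat = sum(a * Lam for a, Lam in zip(alphas, basis))

# The residual effect (L_star - L_hat) p0 is orthogonal to every basis element
residual = max(abs(((L_star - L_hat) @ p0) @ (Lam @ p0)) for Lam in basis)
print(residual)   # ~0, up to floating-point error
```

Note that the orthogonalization here is really on the vectors $\mathcal{L}_i p_0$, the effects of the operators on this particular initial density, so the resulting basis (and the quality of the approximation) depends on the choice of $p_0$.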

### Calculating a control signal using feedback

Up to now, there’s been a lot of dancing around the control signal, including it in equations and discussing the types of control a neural system could plausibly implement. Here, finally, we actually get into how to go about generating this control signal. Let’s start with a basic case where we have the system

$\dot{p} = (\mathcal{L}_1 + u\mathcal{L}_2)p,$

where $\mathcal{L}_1$ describes the unforced/passive dynamics of the system, $\mathcal{L}_2$ describes the control-dependent dynamics, and $u$ is our control signal.

Define a cost function $V(x)$ that we wish to minimize. The expected value of this cost function at a given point in time is

$E[V] = \int V(x)p(x)dx,$

which can be read as the cost of each state weighted by the current likelihood of being in that state.
To reduce the cost over time, we want the time derivative of our expected value to be negative. Written as an equation, we want

$\frac{d}{dt}E[V] = \int V(x)\dot{p}(x)dx < 0.$

Note that we can sub in for $\dot{p}$ to give

$\frac{d}{dt}E[V] = \int V(x)[\mathcal{L}_1p(x) + u\mathcal{L}_2p(x)]dx.$

Since our control is effected through $u$, at a given point in time where we have a fixed and known $p(x,t)$, we can calculate the effect of our control signal on the change in expected value over time, $\frac{d}{dt}E[V]$, by taking the partial differential with respect to $u$. This gives

$\frac{\partial}{\partial u}\left[\frac{d}{dt}E[V]\right] = \int V(x)\mathcal{L}_2p(x)dx,$

which is intuitively read: The effect that the control signal has on the instantaneous change in expected value over time is equal to the change in probability of each state $x$ weighted by the cost of that state. To reduce $\frac{d}{dt}E[V]$, all we need to know now is the sign of the right-hand side of this equation, which tells us if we should increase or decrease $u$. Neat!

Although we only need to know the sign, it’s nice to include slope information that gives some idea of how far away the minimum might be. At this point, we can simply calculate our control signal in a gradient descent fashion, by setting $u = - k \int V(x)\mathcal{L}_2p(x)dx$. The standard gradient descent interpretation of this is that we’re calculating the effect that $u$ has on our function $\frac{d}{dt}E[V]$, and assuming that for some small range, $k$, our function is approximately linear. So we can follow the negative of the function’s slope at that point to find a new point on that function that evaluates to a smaller value.
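As a tiny worked example of this gradient computation (the three-state cost, density, and operator below are my own toy numbers):

```python
import numpy as np

# Tiny 3-state illustration (all numbers are my own): the sign of the
# gradient tells us whether to increase or decrease u.
V = np.array([1.0, 0.5, 0.0])         # cost of each state; state 2 is cheapest
p = np.array([0.7, 0.2, 0.1])         # current density over states
L2 = np.array([[-1.0,  0.0, 0.0],     # control-dependent dynamics: shifts
               [ 1.0, -1.0, 0.0],     # probability toward higher-index
               [ 0.0,  1.0, 0.0]])    # (cheaper) states; columns sum to 0

grad = V @ (L2 @ p)    # d/du [d/dt E[V]], the discrete form of int V(x) L2 p(x) dx
k = 2.0
u = -k * grad          # gradient-descent control signal
print(grad, u)         # grad < 0, so increasing u lowers expected cost
```

Because this controller shifts probability toward the cheap state, the gradient comes out negative and the resulting $u$ is positive: turning the controller up reduces the expected cost.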

This similarly extends to multiple controllers, where if there is a system of a large number of controllers, described by

$\dot{p} = \sum_i u_i \mathcal{L}_i p,$

then we can set

$u_i = - k_i \int V(x)\mathcal{L}_ip(x)dx.$

Dr. Sanger notes that, as mentioned before, neural systems will often not permit negative control signals, so where $u_i < 0$ we set $u_i = 0$. The value of $u_i$ for a given controller is proportional to the ability of that controller to reduce the expected cost at that point in time. If all of the $u_i = 0$, then it is not possible for the available controllers to reduce the expected cost of the system.
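Putting the pieces together, here's a sketch of a full closed-loop simulation with two opposing controllers and the nonnegativity clipping applied. The discrete-state generators, gains, and cost function are all my own illustrative choices:

```python
import numpy as np

n = 51
x = np.linspace(-1, 1, n)
V = (x - 0.5) ** 2                    # cost is lowest near x = 0.5

def generator(rate_right, rate_left):
    """Markov generator for a birth-death chain; columns sum to zero,
    so applying it preserves total probability."""
    L = np.zeros((n, n))
    for j in range(n):
        if j < n - 1:
            L[j + 1, j] += rate_right; L[j, j] -= rate_right
        if j > 0:
            L[j - 1, j] += rate_left;  L[j, j] -= rate_left
    return L

L_passive = generator(2.0, 2.0)       # unforced dynamics: pure diffusion
Ls = [generator(4.0, 0.0),            # controller 1 pushes right
      generator(0.0, 4.0)]            # controller 2 pushes left

p = np.zeros(n); p[10] = 1.0          # start well left of the target
dt, k = 1e-3, 50.0
for _ in range(3000):
    # u_i = -k_i * int V(x) L_i p(x) dx, clipped to be nonnegative
    us = [max(0.0, -k * (V @ (L @ p))) for L in Ls]
    p = p + dt * (L_passive + sum(u * L for u, L in zip(us, Ls))) @ p
print(x @ p)    # the mean state should settle near the low-cost region
```

Note that only the controller that can currently reduce the expected cost receives a positive $u_i$; the opposing controller's gradient is positive and gets clipped to zero, exactly the situation described above.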

### Comparing classical control and stochastic operator control

Now that we finally have a means of generating a control signal with our stochastic differential operators, let’s compare the structure of our stochastic operator control with classical control. Here’s a picture:

The most obvious differences are that a cost function $V(x)$ has replaced the desired trajectory $x^*$, and the state $x$ has been replaced by a probability density over the possible states, $p(x)$. Additionally, the feedback in classical control is used to find the difference between the desired and actual state, which is then multiplied by a gain to generate a corrective signal, i.e. $u = k (x^* - x)$, whereas in stochastic operator control the signal is calculated as specified above, by following the gradient of the expected value of the cost function, i.e. $u = -k \int V(x)\mathcal{L}_2 p(x)dx$.

Right away there is a crazy difference between our two control systems. In the classical control case, where we’re following a desired trajectory, several things are implied. First, we’ve somehow come up with a desired trajectory. Second, we’re necessarily assuming that regardless of what is actually going on in the system, this is the trajectory that we want to follow. This means that the system is not robust to changes in the dynamics, such as outside forces or perturbations applied during movement. In the stochastic operator control case, we’re not following a desired path; instead, the system looks to minimize the cost function at every point in time. This means that it doesn’t matter where we start from or where we’re going: the stochastic operator controller looks at the effect that each controller will have on the expected value of the cost function and generates a control signal accordingly. If the system is thrown off course to the target, it will recover by itself, making it far more robust than classical control. Additionally, we can easily change the cost function input to the system and see a change in the behaviour of the system, whereas in classical control a change in the cost function requires that we regenerate our desired trajectory $x^*$ before our controller will act appropriately.

While these are impressive points, it should also be pointed out that stochastic operator controllers are not the first to attack these issues. The robustness and behaviour business is similarly handled very well, for specific system forms (linear, or affine, meaning linear in terms of the applied control signal) and cost function forms (usually quadratic), by optimal feedback controllers. Optimal feedback controllers regenerate the desired trajectory online, based on system feedback, which leads to a far more robust control system than classical control provides. However, as mentioned, this only holds for specific system and cost function forms. In stochastic operator control, any type of cost function can be applied, and the controller dynamics described by the $\mathcal{L}_i$ can be linear or nonlinear. This is a very important difference, making stochastic operator control far more powerful.

Additionally, stochastic operator controllers operate under uncertainty, by employing a probability density to generate a control signal. All in all, stochastic operator controllers provide an impressive and novel amount of flexibility in control.

### Conclusions

Here, Dr. Sanger has taken stochastic differential operators, looked at their relation to Bayes’ rule, and developed a method for controlling uncertain systems robustly. This is done through the observation that these operators have linear properties that lets the effects of distributed controllers on a stochastic system be described through a simple superposition of terms. By introducing a cost function, the effect of each controller on the expected cost of the system at each point in time can be calculated, which can then be used to generate a control signal that robustly minimizes the cost of the system through time.

Stochastic differential operators can often be inferred from the problem description; their generation is something that I’ll examine more closely in the next post going through examples. Using these operators, the time-varying expected costs associated with state-dependent cost functions can be calculated. Stochastic operator controllers allow significantly more freedom in the choice of cost function than has previously been available in control. Dr. Sanger notes that an important area for future research will be the development of methods for optimal control of systems described by stochastic differential operators.

The downside of stochastic operator controllers is that they are very computationally intensive, due to the fact that they must propagate a joint density function forward in time at each timestep, rather than a single system state. One of the areas Dr. Sanger notes as being of particular importance for future work is the development of parallel algorithms, both for increasing simulation speed and for examining possible neural implementations of such algorithms.

And finally, stochastic differential operators demand a paradigm shift from classical control in the way control is considered. Instead of driving the system to a certain target state, the goal is to have the system behave according to a desired set of dynamics. The effect of a controller is then the difference between the behavior of the system with and without the activity of the controller.