## Likelihood calculus paper series review part 2 – Neuro-mechanical control using differential stochastic operators

The second paper put out by Dr. Terence Sanger in the likelihood calculus paper series is Neuro-mechanical control using differential stochastic operators. Building on the probabilistic representation of systems through differential stochastic operators presented in the last paper (Controlling variability, which I review here), Dr. Sanger starts exploring how one could effect control over a system whose dynamics are described in terms of these operators. Here, he specifically looks at driving a population of neurons described by differential stochastic operators to generate the desired system dynamics. Neural control of a system requires that several phenomena outside the realm of classical control theory be addressed, including the effects of variability in control due to stochastic firing, large partially unlabeled cooperative controllers, bandlimited control due to finite neural resources, and variation in the number of available neurons.

The function of a neuron in a control system can be completely described in terms of 1) the probability of it spiking due to the current system state, $p(s=1|x)$, and 2) the effect of its response on the change in the state. Due to the inherent uncertainty in these systems, each individual neuron’s effect on the change in state is captured by a distribution, $p(\dot{x}|s)$. And because the effect of each neuron is only a small part of a large dynamical system that includes the dynamics of the system being controlled and the effects of all the other neurons, these distributions tend to be very broad.

Rephrasing the above description, neurons are mapping a given state $x$ to a change in state $\dot{x}$. Instead of using two conditional densities to describe this mapping, $p(s|x)$ and $p(\dot{x}|s)$, we can rewrite this more compactly as

$p(\dot{x}|x) = p(\dot{x}|s=1)\;p(s=1|x) + p(\dot{x}|s=0)\;p(s=0|x)$,

which can be read as the probability of a change in state $\dot{x}$ given the current state $x$ is equal to the probability of that change in state occurring if the neuron spikes, $p(\dot{x}|s=1)$, multiplied by the probability of that neuron spiking given the current state, $p(s=1|x)$, plus the probability of that state occurring if the neuron doesn’t spike, $p(\dot{x}|s=0)$, multiplied by the probability of that neuron not spiking, $p(s=0|x)$.
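To make the mixture concrete, here is a minimal numerical sketch. The sigmoidal tuning curve and the two unit-width Gaussians are assumptions chosen purely for illustration; nothing here comes from the paper itself.

```python
import numpy as np

# Grid of candidate state changes xdot (range chosen wide enough to hold the densities).
xdot = np.linspace(-8.0, 8.0, 1601)
step = xdot[1] - xdot[0]

def gaussian(v, mean, std):
    return np.exp(-0.5 * ((v - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def p_spike_given_x(x):
    # Assumed tuning curve p(s=1|x): a simple sigmoid.
    return 1.0 / (1.0 + np.exp(-x))

def p_xdot_given_x(x):
    p1 = p_spike_given_x(x)
    p_fire = gaussian(xdot, 1.0, 1.0)    # assumed p(xdot | s=1)
    p_quiet = gaussian(xdot, 0.0, 1.0)   # assumed p(xdot | s=0)
    # p(xdot|x) = p(xdot|s=1) p(s=1|x) + p(xdot|s=0) p(s=0|x)
    return p_fire * p1 + p_quiet * (1.0 - p1)

density = p_xdot_given_x(0.0)
total_mass = density.sum() * step          # should be close to 1
mean_change = (xdot * density).sum() * step
print(total_mass, mean_change)
```

At $x = 0$ the assumed neuron spikes half the time, so the expected change in state sits halfway between the two component means.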

Differential stochastic operators

We want to capture the mapping $p(\dot{x}|x)$ in such a way that if we have a description of a bunch of controllers (such as neurons) and the change in system state they effect individually, we can combine them in a straightforward way to get the overall change in state resulting from all the controllers operating in parallel. To do this we can use the linear operators developed in the previous paper, which allow us to combine the effects of multiple components through simple summation to determine the overall change in system state. We’ll go over it again here, as I find reading through several different versions of an idea very helpful for solidifying understanding.

Let $\mathcal{L}$ denote a class of linear operators that act on time-varying probability densities, $p(x,t)$, such that $\frac{\partial}{\partial t}p(x,t) = \mathcal{L}p(x,t)$. Because these operators need to preserve the properties of valid probability density (specifically that $\int p(x)dx = 1$ and $p(x) \geq 0$), for a given operator $L$ where $\dot{p}(x) = \int L(x,y) p(y) dy$ we require that:

• 1) $\int L(x,y)dx = 0$ for all $y$,
• 2) $L(x,y) \geq 0$ whenever $x \neq y$,

which respectively enforce the aforementioned constraints.

So, first things first. Let’s read out $\dot{p}(x) = \int L(x,y) p(y) dy$. This says that our change in the probability density, $\dot{p}(x)$, is found by taking our function that tells us what the change in density is for system state $x$ given our current state $y$, which is $L(x,y)$, and weighting that by the probability of currently being in state $y$, which is $p(y)$, then summing that all up, which is the integral.

Now the constraints. The first constraint reads out as the integral of the changes of the probability density at each point $x$ for a given state $y$ must be equal to 0. This means that the area of the probability density over the states after updating them is the same. So, assuming we start out with a valid density whose sum equals 1, we always have a density whose sum equals 1.
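Written out, the argument is a single line, assuming we are allowed to swap the order of integration:

$\frac{d}{dt}\int p(x)\,dx = \int \dot{p}(x)\,dx = \int \left( \int L(x,y)\,dx \right) p(y)\,dy = \int 0 \cdot p(y)\,dy = 0$.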

The second constraint reads out as the change in probability density for state $x$ given a current state $y$ must be greater than or equal to zero whenever $x \neq y$. This means the only time that the change in the probability density can be negative is if there is a probability of being in that state; it enforces that all $p(x) \geq 0$, because $\dot{p}$ can’t be negative when $p(x)$ is zero.

Dr. Sanger defines the linear operators that satisfy these two conditions to be “differential stochastic operators”. The discrete time versions are matrices, dubbed “difference stochastic operators”.
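For a discretized state space the two conditions become simple checks on a matrix. The sketch below uses a hypothetical 4-state operator of my own making and a plain forward-Euler update (an assumption, not the paper's method) to show that a valid operator keeps the density valid.

```python
import numpy as np

def is_differential_stochastic(L, tol=1e-12):
    """Check the two constraints on a discretized operator L[x, y]."""
    columns_sum_to_zero = np.allclose(L.sum(axis=0), 0.0, atol=tol)
    off_diagonal = L[~np.eye(L.shape[0], dtype=bool)]
    return bool(columns_sum_to_zero and (off_diagonal >= -tol).all())

# Hypothetical example: L[x, y] is the rate of probability flow from state y to state x.
L = np.array([
    [-1.0,  0.5,  0.0,  0.0],
    [ 1.0, -1.5,  0.5,  0.0],
    [ 0.0,  1.0, -1.5,  0.5],
    [ 0.0,  0.0,  1.0, -0.5],
])

# Breaking the first constraint: column 0 no longer sums to zero.
L_bad = L.copy()
L_bad[0, 0] = -0.5

# Forward-Euler integration of p_dot = L p, starting with all mass in state 0.
p = np.array([1.0, 0.0, 0.0, 0.0])
dt = 0.01
for _ in range(2000):
    p = p + dt * (L @ p)

print(is_differential_stochastic(L), is_differential_stochastic(L_bad))
print(p, p.sum())
```

The step size is small enough that $I + dt\,L$ has no negative entries, which is why the Euler update cannot push any probability below zero here.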

Superposition of differential stochastic operators

Differential stochastic operators can be derived in different ways. Here we’ll go through the derivation from the ‘master’ equation defining $p(\dot{x}|x)$, and from a stochastic differential equation. They each have their insights, so it’s helpful to work through both.

Derivation from master equation

The equation for $p(\dot{x}|x)$ written out above,

$p(\dot{x}|x) = p(\dot{x}|s=1)\;p(s=1|x) + p(\dot{x}|s=0)\;p(s=0|x)$,

determines the probability flow on the state. By employing the Kramers-Moyal expansion we can capture this probability flow through a partial differential equation describing change in probability density. In other words, instead of capturing the time evolution of the system state with a probability density over possible changes in state, $\dot{x}$, we capture it through the change in probability of each state, $\dot{p}(x)$. The Kramers-Moyal expansion looks like:

$\frac{\partial}{\partial t}p(x,t) = -\frac{\partial}{\partial x}(D_1(x)p(x,t)) + \frac{1}{2}\frac{\partial^2}{\partial x^2}(D_2(x)p(x,t)) + ...$,

where $D_k(x) = E[(\dot{x})^k] = \int \dot{x}^k p(\dot{x}|x) d\dot{x}$. Truncating this expansion at the first two terms we get the Fokker-Planck equation, where the first term describes the drift of the density, and the second term the diffusion. These two terms are sufficient for describing Gaussian conditional densities, which capture many physical phenomena. In the case where $p(\dot{x}|x)$ does not have a Gaussian distribution, higher-order terms from the Kramers-Moyal expansion will need to be included.
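The coefficients $D_k(x)$ are just moments of $p(\dot{x}|x)$, so they are easy to compute numerically. A small sketch, assuming a Gaussian $p(\dot{x}|x)$ at one fixed $x$ with a mean and standard deviation I made up:

```python
import numpy as np

xdot = np.linspace(-10.0, 10.0, 4001)
step = xdot[1] - xdot[0]

def kramers_moyal_coefficient(k, density):
    """D_k = integral of xdot**k * p(xdot|x) d(xdot), as a Riemann sum."""
    return float(np.sum(xdot ** k * density) * step)

# Hypothetical conditional density at a single state x.
mean, std = -0.5, 0.8
density = np.exp(-0.5 * ((xdot - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

D1 = kramers_moyal_coefficient(1, density)   # first moment: the mean of xdot
D2 = kramers_moyal_coefficient(2, density)   # second moment: mean**2 + std**2
print(D1, D2)
```

Note that $D_2$ is the raw second moment, not the variance, so it includes the squared mean.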

Now, imagine we have a simple system of two neurons, where the change in state is defined as $\dot{x} = \dot{x}_1 + \dot{x}_2$. If these neurons have conditionally independent variability, i.e. $p(\dot{x}_1, \dot{x}_2 | x) = p(\dot{x}_1|x)p(\dot{x}_2|x)$, then we can sum the Kramers-Moyal expansion of each of these terms to describe the evolution of the overall system state:

$\frac{\partial}{\partial t}p(x,t) = - \sum_i \frac{\partial}{\partial x}(D_{1i}(x)p(x,t)) + \frac{1}{2} \sum_i \frac{\partial^2}{\partial x^2}(D_{2i}(x)p(x,t)) + ...$,

as long as the neurons have conditionally independent variability. This means that they can’t be connected (directly or indirectly) such that a spike in one neuron causes a spike in the other. While this might seem a poor assumption for modeling networks of spiking neurons, in large populations with many inputs, the effects of any single input neuron tend to be small enough that the assumption holds approximately.

We can rewrite the previous equation now, taking advantage of the fact that every term is linear in $p(x,t)$ and in the Kramers-Moyal coefficients, to get

$\dot{p}(x,t) = \mathcal{L}p = \sum_i \mathcal{L}_i p$,

which means that by describing neurons with the differential stochastic operators, $\mathcal{L}_i$, we can determine the cumulative effect on the dynamics of the system through simple summation. Which we all remember from the last paper, but hey, good to review.
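Here is a discrete-state sketch of that summation. The two "neurons" are hypothetical nearest-neighbour jump operators that I invented for illustration: one pushes the state up, the other pushes it down.

```python
import numpy as np

def jump_operator(n, right_rate, left_rate):
    """Operator L[x, y] for nearest-neighbour jumps on n discrete states."""
    L = np.zeros((n, n))
    for y in range(n):
        if y + 1 < n:
            L[y + 1, y] = right_rate
        if y > 0:
            L[y - 1, y] = left_rate
        L[y, y] = -L[:, y].sum()   # diagonal makes each column sum to zero
    return L

n = 11
L1 = jump_operator(n, right_rate=2.0, left_rate=0.0)   # hypothetical neuron 1
L2 = jump_operator(n, right_rate=0.0, left_rate=0.5)   # hypothetical neuron 2
L_total = L1 + L2                                      # superposition

states = np.arange(n)
p = np.zeros(n)
p[n // 2] = 1.0   # all probability mass in the middle state

def mean_velocity(L, p):
    """Rate of change of the expected state, sum_x x * p_dot(x)."""
    return float(states @ (L @ p))

v1, v2, v_total = mean_velocity(L1, p), mean_velocity(L2, p), mean_velocity(L_total, p)
print(v1, v2, v_total)
```

The drift produced by the summed operator is the sum of the individual drifts, and the summed operator still satisfies both constraints.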

Now, in the event that we want to model the effect of a group of strongly interconnected neurons, we can instead consider the effect of the group as a function of the $2^n$ possible firing patterns (spike or no spike from each of the neurons). So where before $p(\dot{x}| x)$ was written in terms of the two cases $s = 0$ and $s = 1$, it would now be written:

$p(\dot{x}|x) = \sum_{i=1}^{2^n} p(\dot{x}|x,i)p(i)$,

where each $i$ is a different spike pattern. This group of neurons and their effect on the system dynamics is then considered as a single unit, and the differential stochastic operator describing them can then be combined with the operator from another group of neurons, provided that the two groups exhibit conditionally independent variability.

If there are no independent groups in the network then it’s fully connected and this is not for you; go home.
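To get a feel for the bookkeeping, here is a toy enumeration for $n = 3$. The pattern probabilities and per-neuron effects are invented for illustration, and only the mean of each $p(\dot{x}|x,i)$ is tracked rather than the full density.

```python
from itertools import product

n = 3
patterns = list(product([0, 1], repeat=n))   # all 2**n spike patterns

# Hypothetical effect of each neuron's spike on the mean change in state.
effect = [0.5, -0.2, 0.3]

def pattern_score(pattern):
    # Assumed coupling: the neurons tend to fire (or stay silent) together.
    return 3.0 if len(set(pattern)) == 1 else 1.0

total = sum(pattern_score(s) for s in patterns)
prob = {s: pattern_score(s) / total for s in patterns}   # p(i)

def mean_xdot(pattern):
    # Mean of p(xdot | x, i) for this pattern under the assumed effects.
    return sum(w * s for w, s in zip(effect, pattern))

# Mean of the mixture p(xdot|x) = sum_i p(xdot|x,i) p(i).
expected_xdot = sum(prob[s] * mean_xdot(s) for s in patterns)
print(len(patterns), expected_xdot)
```

The number of patterns doubles with every neuron added, which is why this treatment only makes sense for small, tightly coupled groups.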

Derivation from a stochastic differential equation

For this derivation, imagine a set of controllers operating in parallel. Each controller has a stochastic differential equation that defines how a control signal input affects the system dynamics,

$dx = f_i(x)dt + g_i(x)dB_i$,

where $f_i$ and $g_i$ are arbitrary equations and $dB_i$ are random functions of time, or noise terms. Let $f_i(x)dt$ for $i > 0$ be the controller equations, and $f_0(x)dt$ be the unforced, or passive, dynamics of the system, which define how the system behaves without controller input. We can write the equation for the whole system as

$dx = f_0(x)dt + \sum_{i>0}u_i(f_i(x)dt + g_i(x)dB_i)$,

where $u_i$ are constant or slowly varying control inputs. The reason we would choose to denote the unforced (passive) dynamics of the system as $f_0$ is because we can now define $u_0 = 1$, and rewrite the above equation as

$dx = \sum_{i}u_i(f_i(x)dt + g_i(x)dB_i)$.

The corresponding Fokker-Planck equation for the evolution of the state probability density is

$\frac{\partial}{\partial t}p(x,t) = - \sum_i u_i \frac{\partial}{\partial x}(f_i(x)p(x,t)) + \frac{1}{2} \sum_i u_i \frac{\partial^2}{\partial x^2}(g_i(x)p(x,t))$.

Look familiar? We can rewrite this as a superposition of linear operators

$\dot{p}(x,t) = \mathcal{L}p = \sum_i u_i \mathcal{L}_i p$,

and there you go.
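If you’d like to see the parallel-controller equation in action, the sketch below simulates it with the Euler-Maruyama method (my choice of integrator, not something from the paper). The passive dynamics $f_0$, controller terms $f_1$ and $g_1$, and the control input $u_1$ are all assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f0(x):
    return -x                       # assumed passive dynamics: decay toward zero

def f1(x):
    return np.ones_like(x)          # assumed controller drift: constant push

def g1(x):
    return 0.5 * np.ones_like(x)    # assumed controller noise gain

u1 = 2.0                            # constant control input
dt, steps, n_particles = 0.01, 500, 2000

# Each particle is one sample path; together they approximate p(x, t).
x = np.zeros(n_particles)
for _ in range(steps):
    dB = rng.normal(0.0, np.sqrt(dt), size=n_particles)
    x = x + f0(x) * dt + u1 * (f1(x) * dt + g1(x) * dB)

sample_mean = x.mean()
sample_var = x.var()
print(sample_mean, sample_var)
```

With these choices the combined equation is $dx = (2 - x)dt + dB$, so the cloud of particles should settle around a mean near 2 with a variance near 0.5.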

Population model

So, now we can apply this superposition of differential stochastic operators to describe the effect of a population of neurons on a given system. Dr. Sanger lists several ways that this model can be controlled: 1) modifying the tuning curves of the neurons, which specify how they respond to stimulus; 2) modifying the output functions that determine the effect that a neuron has on the dynamics of the system; and 3) modifying the firing threshold of the neurons.

I found the difference between 1 and 3 can be a little confusing, so let’s look at an example tuning curve in 2D space to try to make this clear. Imagine a neuron sensitive to 2-dimensional input signals, and that its tuning curve looks like this:

If we’re changing the tuning curve, then how this neuron responds to its input stimulus will change. For example, in this diagram we show changing the tuning curve of a neuron:

Here, the neuron no longer responds to the same type of input that it responded to previously. We have made a qualitative change to the type of stimulus this neuron responds to.

If we change the firing threshold, however, then what we’re changing is when the neuron starts responding to stimulus that it is sensitive to.

Here we show the neuron becoming more and more sensitive to its stimulus, responding more strongly, sooner and sooner. So the type of signal that the neuron responds to isn’t changing, but rather when the neuron starts responding.

Alright, now that we’ve got that sorted out, let’s move on.
Tuning curves (1) and output functions (2) are both modifiable through learning, by changing the incoming and outgoing connection weights, respectively, but for controlling systems on the fly this is going to be too slow, i.e. slower than the speed at which the system moves. So what’s left is (3), modifying the firing threshold of the neurons. So the model then looks like:

where $p(x)$ is projected into a population of neurons, each with a differential stochastic operator, and these operators sum together to generate $\dot{p}(x)$. In this diagram, $\lambda_i$ is the firing threshold of neuron $i$, and $\lambda_i(x)$ denotes the modulation of the firing rate of neuron $i$ as a function of the current system state. When the modulation is dependent on the system state we have a feedback, or closed-loop, control system. Dr. Sanger notes that in the case that $\lambda_i$ is heavily dependent on $x$, modulating the firing threshold is indistinguishable from modifying the tuning curve, meaning that we can get some pretty powerful control out of this.

Conclusions

‘The theory of differential stochastic operators links the dynamics of individual neurons to the dynamics of a full neuro-mechanical system. The control system is a set of reflex elements whose gain is modulated in order to produce a desired dynamics of the overall system.’

This paper presents a very different formulation of control than classical control theory. Here, the goal is to modulate the dynamics of the system to closely match a desired set of dynamics that achieve some task, rather than to minimize the deviation from some prespecified trajectory through state space. Dr. Sanger notes that this description of control matches well to behavioral descriptions of neural control systems, where there are numerous subsystems and circuits that have reflexive responses to external stimuli which must be modulated to achieve desired behavior. The goal of control is to fully define the dynamics of the system’s reaction to the environment.

What comes to mind first for me, in terms of using the modulation of reflex elements to effect a desired set of dynamics, is modeling the spinocerebellum. With a ton of projections directly to the spinal cord, and strong implications in locomotor and balance systems, it seems like it would be a very good candidate for being modeled with this type of control. The idea being that the cerebellum is projecting modulatory values to the different spinal circuits (reflex networks and central pattern generators, for example) that specify how to respond to changes in the environment to maintain our balance or the rhythm of our walk. How we go about specifying exactly what those modulatory terms need to be is something that Dr. Sanger tackles in the last paper of this series, which I’ll be reviewing in the next couple of months. I’m looking forward to it.

On another note, in my lab we all work with the Neural Engineering Framework, in which populations of neurons are taken to represent vectors, and to perform transformations on these vectors to relay information and perform various functions. To this end, something that interests me about likelihood calculus is its application to populations of neurons representing vectors. Instead of finding $p(\dot{x}|x)$ by summing the effects of all of the neurons in a population, or defining it in terms of the population spiking patterns, we’re looking at it in terms of the different vectors this population can represent, and the effect of that vector on the system. So we can still have spiking neuron based models, but we can do all the likelihood calculus business one level removed, simplifying the calculation and reducing the number of differential stochastic operators needed.

There are a ton of things going on in this paper, and lots to think about. At several points I deviated from the notation used in this paper because I found it unclear, but aside from that I’ve really enjoyed reading through it and writing it up. Very interesting work.

Sanger TD (2010). Neuro-mechanical control using differential stochastic operators. Conference proceedings : … Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, 2010, 4494-7 PMID: 21095779

## Likelihood calculus paper series review part 1 – Controlling variability

Dr. Terry Sanger has a series of papers that have come out in the last few years describing what he has named ‘likelihood calculus’. The goal of these papers is to develop ‘a theory of optimal control for variable, uncertain, and noisy systems that nevertheless accomplish real-world tasks reliably.’ The idea being that successful performance can be thought of as modulating variance of movement, allocating resources to tightly control motions when required and allowing variability in task-irrelevant dimensions. To perform variability modulation, we first need a means of capturing mathematically how the features of an uncertain controller affect variability in system movement. Defining terms quickly, the features of a controller are the different components that produce signals resulting in movement, variability is taken here to be the trial-to-trial variation in movements, and uncertainty means that the available sensory feedback does not uniquely determine the true state of the world, where uncertainty can arise from noise on sensory feedback signals, unmodeled dynamics, and/or quantization of sensory feedback. To capture all this uncertainty and variability, probability theory will naturally be employed. In this post I will review the paper ‘Controlling variability’ (2010) by Dr. Sanger, which sets up the framework for describing the time course of uncertainty during movement.

Using probability in system representations

So, here’s a picture of the system our controller (the brain) is in:

There’s the input initial state $x$, and the output change in state, $\dot{x}$, which is generated as a combination of the unforced dynamics of the world and the control dynamics effected by the brain. But since we’re dealing with uncertainty and variability, we’re going to rewrite this such that given an initial state $x$, we get a probability distribution over potential changes in state, $p(\dot{x}|x)$, which specifies the likelihood of each change in state $\dot{x}$, given our initial state probability distribution $p(x)$. So in our system diagram, the world and the brain both define probability distributions over the possible changes in state, $p_0(\dot{x}|x)$ and $p_1(\dot{x}|x)$, respectively, which then combine to create the overall system dynamics $p(\dot{x}|x)$. Redrawing our picture to incorporate the probabilities, we get:

One may ask: how do these probabilities combine? Good question! What we’d like to be able to do is combine them through simple linear operators, because they afford us massive simplifications in our calculations, but the combination of $p_0(\dot{x}|x)$ and $p_1(\dot{x}|x)$ isn’t as simple as summing and normalizing. The reason why can be a little tricky to tease out if you’re unfamiliar with probability, but will become clear with some thought. Consider what it means to go about combining the probabilities in this way. Basically, if you sum and normalize, then the result is saying that there is a 50% chance of doing what the brain says to do, and a 50% chance of doing what the world says to do, and doesn’t capture what is actually going to happen, which is an interaction between the effects of the brain and the world. A good example for thinking about this is rolling dice. If you roll each die individually, you have an equal chance of the result being each number 1-6, but if you roll two dice and add them, the overall system probability of rolling numbers changes in a highly nonlinear fashion:

Suddenly there is a 0% chance of the result being a 1, and the likelihood of each total increases until you get to 7, at which point the likelihood decreases with ascending numbers; there is a nonlinear interaction at play that can’t be captured by summing the probabilities of rolling a number on each die individually and normalizing.
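The dice example is small enough to work out exactly. This sketch compares the true distribution of the total of two fair dice with the naive "sum and normalize" combination:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

faces = range(1, 7)
one_die = {face: Fraction(1, 6) for face in faces}

# Distribution of the total of two independent dice: every pair (a, b)
# contributes its joint probability to the outcome a + b.
two_dice = Counter()
for a, b in product(faces, repeat=2):
    two_dice[a + b] += one_die[a] * one_die[b]

# Naive combination: add the two single-die distributions and renormalize.
naive = {face: (one_die[face] + one_die[face]) / 2 for face in faces}

for total in sorted(two_dice):
    print(total, two_dice[total])
print(naive == one_die)
```

The naive combination just hands back the single-die distribution, while the true distribution of the total peaks at 7 and assigns nothing to 1.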

So summing the probability distributions over the possible changes in state, $\dot{x}$, isn’t going to work, but is there another way to combine these through linear operators? The answer is yes, but it’s going to require us to undergo a bit of a paradigm shift and expand our minds. What if, instead of capturing the change in the system by looking at $p(\dot{x}|x)$, the probability distribution over possible changes in state given the current state, we instead capture the dynamics of the system by defining $\dot{p}(x)$, the change in the probability of states through time? Instead of describing how the system evolves through the likelihood of different state changes, the dynamics are captured by defining the change in likelihood of different states; we are capturing the effect of the brain and the world on the temporal evolution of state probability. Does that freak you out?

Reworking the problem

Hopefully you’re not too freaked out. Now that you’ve worked your head around this concept, let’s look at $\dot{p}(x)$ a little more closely. Specifically, let’s look at the Kramers-Moyal expansion (for a one-dimensional system):

$\frac{\partial p(x,t)}{\partial t} = \sum_{k=1}^{\infty} \left( - \frac{\partial}{\partial x} \right)^k \{a_k(x) p(x,t)\} / k!$,

$a_k(x) = \int \dot{x}^k p(\dot{x}|x) d\dot{x}$.

As Dr. Sanger notes, this is a daunting equation, but it can be understood relatively easily. The left side, $\frac{\partial p(x,t)}{\partial t} = \dot{p}_t(x)$, is the rate of change of probability at each point $x$ at time $t$. The right side is just the Taylor series expansion. If we take the first two terms of the Taylor series expansion, we get:

$\frac{\partial p(x,t)}{\partial t} = - a_1 \frac{\partial}{\partial x} p(x) + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} p(x)$,

where the first term describes how the probability drifts (or shifts / translates), $a_1$ being the average value of $\dot{x}$ for each value of $x$. The second term relates the rate of diffusion, $a_2$ being the second moment of the speed $\dot{x}$, describing the amount of spread in different possible speeds, where greater variability in speed leads to an increased spread of the probability. This is the Fokker-Planck equation, which describes the evolution of a physical process with constant drift and diffusion. At this point we make the assumption that our probability distributions are all going to be in the form of Gaussians (for which the Fokker-Planck equation exactly describes evolution of the system through time, and can arguably act as a good approximation to neural control systems where movement is based on the average activity of populations of neurons).

As an example of this, think of a 1-dimensional system, and a Gaussian probability distribution describing what state the system is likely to be in. The first term $a_1$ is the average rate of change $\dot{x}$ across the states $x$ the system could be in. The probability shifts through state space as specified by $a_1$. Intuitively, if you have a distribution with mean around position 1 and the system velocity is 4, then after one unit of time the mean of your probability distribution, $p(x)$, should have shifted to position 5. The second term, $a_2$, is a measure of how wide the range of different possible speeds $\dot{x}$ is. The larger the range of possible values, the less certain we become about the location of the system as it moves forward in time; the greater the range of possible states the system might end up in at the next time step. This is reflected by the rate of diffusion of the probability distribution. If we know for sure the speed the system moved at (i.e. all possible states will move with a specific $\dot{x}$), then we simply translate the mean of the probability distribution. If however there’s uncertainty in the speed at which the system is moving, then the correct location (reflecting the actual system position) for the mean of the probability distribution could be one of a number of values. This is captured by increasing the width of (diffusing) the Gaussian.
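The mean-1, velocity-4 example can be checked numerically. The sketch below integrates the two-term equation with constant $a_1$ and $a_2$ using a basic explicit finite-difference scheme; the grid, step sizes, initial variance, and the value of $a_2$ are all assumptions of mine.

```python
import numpy as np

x = np.linspace(-10.0, 20.0, 601)
dx = x[1] - x[0]
a1, a2 = 4.0, 0.5          # assumed constant drift and diffusion coefficients
mean0, var0 = 1.0, 0.25    # initial Gaussian: mean at position 1

p = np.exp(-0.5 * (x - mean0) ** 2 / var0)
p /= p.sum() * dx          # normalize so the density integrates to 1

def fokker_planck_rate(p):
    """p_dot = -a1 dp/dx + (a2 / 2) d2p/dx2, using central differences."""
    dp = np.zeros_like(p)
    first = (p[2:] - p[:-2]) / (2.0 * dx)
    second = (p[2:] - 2.0 * p[1:-1] + p[:-2]) / dx ** 2
    dp[1:-1] = -a1 * first + 0.5 * a2 * second
    return dp

dt, steps = 0.0005, 2000   # integrate for one unit of time
for _ in range(steps):
    p = p + dt * fokker_planck_rate(p)

mean = (x * p).sum() * dx
var = ((x - mean) ** 2 * p).sum() * dx
print(mean, var)
```

After one unit of time the mean has drifted from 1 to about 5, and the variance has grown from 0.25 to roughly 0.75, that is, by about $a_2$.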

Linear operators

Importantly, the equations above are linear in terms of $p(x)$. This means we can rearrange the above equation:

$\frac{\partial p(x,t)}{\partial t} = \left( -a_1 \frac{\partial}{\partial x} + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} \right) p(x)$,

letting $\mathcal{L} = \left( -a_1 \frac{\partial}{\partial x} + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} \right)$, we have

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L} p(x)$.

Now we can redraw our system above as

where

$\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_1$,

which is the straightforward combination of the different contributions of each of the brain and the world to the overall system state probability. How cool is that?

Alright, calm down. Time to look at using these operators. Let’s assume that the overall system dynamics $\mathcal{L}$ hold constant for some period of time (taking particular care to note that ‘constant dynamics’ does not mean that a single constant output is produced irrespective of the input state, but rather that a given input state $x$ always produces the same result while the dynamics are held constant), and that we have discretized our representation of $x$ (to be a range of some values, e.g. -100 to 100). Then we can find the state probability distribution at time $T$ by calculating

$p(x, T) = A^T p(x,0)$,

where $A^T = e^{T\mathcal{L}}$.
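With a discretized state the operator is a matrix and $e^{T\mathcal{L}}$ is a matrix exponential. The sketch below uses a small hand-rolled Taylor-series exponential so that it only needs NumPy (in practice `scipy.linalg.expm` is the usual choice), and a hypothetical 21-state jump operator standing in for $\mathcal{L}$.

```python
import numpy as np

def expm_taylor(M, squarings=10, terms=12):
    """Matrix exponential via a truncated Taylor series with scaling and squaring."""
    scaled = M / (2 ** squarings)
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result

# Hypothetical operator: jumps up at rate 1.0 and down at rate 0.5.
n = 21
L = np.zeros((n, n))
for y in range(n):
    if y + 1 < n:
        L[y + 1, y] = 1.0
    if y > 0:
        L[y - 1, y] = 0.5
    L[y, y] = -L[:, y].sum()   # columns sum to zero

p0 = np.zeros(n)
p0[n // 2] = 1.0               # start with certainty in the middle state

T = 3.0
A_T = expm_taylor(T * L)       # A^T = exp(T L)
p_T = A_T @ p0                 # p(x, T) = A^T p(x, 0)

mean_T = float(np.arange(n) @ p_T)
print(p_T.sum(), mean_T)
```

Each column of `A_T` is itself a probability distribution, so applying it to any valid density gives back a valid density.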

When combining these $\mathcal{L}$ operators, if we sum them, i.e. overall system dynamics

$\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_1$,

and then apply them to the probability

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}p(x)$

this is saying that the dynamics provided by $\mathcal{L}_0$ and $\mathcal{L}_1$ are applied at the same time. But if we multiply the component dynamics operators,

$\mathcal{L} = \mathcal{L}_1 \mathcal{L}_0$

then when we apply them we have

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}p(x) = \mathcal{L}_1 \mathcal{L}_0 p(x)$,

which is interpreted as applying $\mathcal{L}_0$ to the system, and then applying $\mathcal{L}_1$. Just basic algebra, but it allows us to apply simultaneously and sequentially the dynamics generated by our contributing system components (i.e. the brain and the world).

Capturing the effects of control

So now we have a representation of the system dynamics operating with variability and under uncertainty. We’re talking about building a tool to use for controlling these systems, though, so where does the control signal $u$ fit in? The $\mathcal{L}$ operator is made to be a function of the control signal, describing the probabilistic effect of the control signal given the possible initial states in the current state probability distribution $p(x)$. Thus the change in state probability is now written

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}(u)p(x)$.

Suppose that we drive a system with constant dynamics $\mathcal{L}(u_1)$ for a period of time $T_1$, at which point we change the control signal and drive the system with constant dynamics $\mathcal{L}(u_2)$ for another period of time $T_2$. The state of the system can now be calculated as

$p(x, T_1 + T_2) = e^{T_2\mathcal{L}(u_2)}e^{T_1\mathcal{L}(u_1)}p(x,0) = A_{u_2}^{T_2}A_{u_1}^{T_1}p(x,0)$

using the sequential application of dynamics operators discussed above.

Conclusion

And that is the essence of the paper ‘Controlling variability’. There is an additional discussion about the relationship to Bayes’ rule, which I will save for another post, and an example, but this is plenty for this post.

The main point from this paper is that we shouldn’t be focusing on the values of states as the object of control, but rather the probability densities of states. By doing this, we can capture the uncertainty in systems and work towards devising an effective means of control. So, although the paper is called ‘Controlling variability’, the discussion of how to actually control variability is saved for later papers. All the same, I thought this was a very interesting paper, enjoyed working through it, and am looking forward to the rest of the series.

Sanger TD (2010). Controlling variability. Journal of motor behavior, 42 (6), 401-7 PMID: 21184358