Likelihood calculus paper series review part 1 – Controlling variability

Dr. Terry Sanger has a series of papers that have come out in the last few years describing what he has named ‘likelihood calculus’. The goal of these papers is to develop ‘a theory of optimal control for variable, uncertain, and noisy systems that nevertheless accomplish real-world tasks reliably.’ The idea is that successful performance can be thought of as modulating the variance of movement: allocating resources to tightly control motions when required, and allowing variability in task-irrelevant dimensions. To perform variability modulation, we first need a means of capturing mathematically how the features of a controller operating under uncertainty affect variability in system movement. Defining terms quickly: the features of a controller are the different components that produce signals resulting in movement; variability is taken here to be the trial-to-trial variation in movements; and uncertainty means that the available sensory feedback does not uniquely determine the true state of the world. Uncertainty can arise from noise on sensory feedback signals, unmodeled dynamics, and/or quantization of sensory feedback. To capture all this uncertainty and variability, probability theory will naturally be employed. In this post I will review the paper ‘Controlling variability’ (2010) by Dr. Sanger, which sets up the framework for describing the time course of uncertainty during movement.

Using probability in system representations

So, here’s a picture of the system our controller (the brain) is in:

There’s the input initial state $x$, and the output change in state, $\dot{x}$, which is generated as a combination of the unforced dynamics of the world and the control dynamics effected by the brain. But since we’re dealing with uncertainty and variability, we’re going to rewrite this such that given an initial state $x$, we get a probability distribution over potential changes in state, $p(\dot{x}|x)$, which specifies the likelihood of each change in state $\dot{x}$, given our initial state probability distribution $p(x)$. So in our system diagram, the world and the brain both define probability distributions over the possible changes in state, $p_0(\dot{x}|x)$ and $p_1(\dot{x}|x)$, respectively, which then combine to create the overall system dynamics $p(\dot{x}|x)$. Redrawing our picture to incorporate the probabilities, we get:

One may ask: how do these probabilities combine? Good question! What we’d like to be able to do is combine them through simple linear operators, because they afford us massive simplifications in our calculations, but the combination of $p_0(\dot{x}|x)$ and $p_1(\dot{x}|x)$ isn’t as simple as summing and normalizing. The reason why can be a little tricky to tease out if you’re unfamiliar with probability, but will become clear with some thought. Consider what it means to go about combining the probabilities in this way. Basically, if you sum and normalize, then the result is saying that there is a 50% chance of doing what the brain says to do, and a 50% chance of doing what the world says to do. That doesn’t capture what is actually going to happen, which is an interaction between the effects of the brain and the world. A good example for thinking about this is rolling dice. If you roll a single die, you have an equal chance of the result being each number 1-6, but if you roll two dice and add them up, the overall system probability of rolling each total changes in a highly nonlinear fashion:

Suddenly there is a 0% chance of the result being a 1, and the probability of rolling a number increases in likelihood until you get to 7, at which point the likelihood decreases with ascending numbers; there is a nonlinear interaction at play that can’t be captured by summing the probabilities of rolling a number on each die individually and normalizing.
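To make the dice example concrete, here is a minimal sketch (assuming NumPy is available; the variable names are just mine) comparing the ‘sum and normalize’ combination against the actual distribution over the total of two dice, which is the convolution of the two individual distributions:

```python
import numpy as np

# Probability of each face (1-6) for a single fair die.
die = np.full(6, 1 / 6)

# "Sum and normalize": average the two distributions. The result is
# still uniform over 1-6, which is not what rolling two dice produces.
averaged = (die + die) / 2

# The distribution over the total of two independent dice is the
# convolution of the individual distributions, covering totals 2-12.
total = np.convolve(die, die)
totals = np.arange(2, 13)

for value, prob in zip(totals, total):
    print(f"{value:2d}: {prob:.4f}")
```

The averaged distribution never assigns any probability to totals above 6, while the convolved distribution peaks at 7 with probability 6/36 and falls off on either side.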

So summing the probability distributions over the possible changes in state, $\dot{x}$, isn’t going to work, but is there another way to combine these through linear operators? The answer is yes, but it’s going to require us to undergo a bit of a paradigm shift and expand our minds. What if, instead of capturing the change in the system by looking at $p(\dot{x}|x)$, the probability distribution over possible changes in state given the current state, we instead capture the dynamics of the system by defining $\dot{p}(x)$, the change in the probability of states through time? Instead of describing how the system evolves through the likelihood of different state changes, the dynamics are captured by defining the change in likelihood of different states; we are capturing the effect of the brain and the world on the temporal evolution of state probability. Does that freak you out?

Reworking the problem

Hopefully you’re not too freaked out. Now that you’ve worked your head around this concept, let’s look at $\dot{p}(x)$ a little more closely. Specifically, let’s look at the Kramers-Moyal expansion (for a one-dimensional system):

$\frac{\partial p(x,t)}{\partial t} = \sum_{k=1}^{\infty} \left( - \frac{\partial}{\partial x} \right)^k \{a_k(x) p(x)\} / k!$,

$a_k(x) = \int \dot{x}^k p(\dot{x}|x) d\dot{x}$.

As Dr. Sanger notes, this is a daunting equation, but it can be understood relatively easily. The left side, $\frac{\partial p(x,t)}{\partial t} = \dot{p}_t(x)$, is the rate of change of probability at each point $x$ at time $t$. The right side is just the Taylor series expansion. If we take the first two terms of the Taylor series expansion, we get:

$\frac{\partial p(x,t)}{\partial t} = - a_1 \frac{\partial}{\partial x} p(x) + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} p(x)$,

where the first term describes how the probability drifts (or shifts / translates), $a_1$ being the average value of $\dot{x}$ for each value of $x$. The second term describes the rate of diffusion, $a_2$ being the second moment of the speed $\dot{x}$, describing the amount of spread in different possible speeds, where greater variability in speed leads to an increased spread of the probability. This is the Fokker-Planck equation, which describes the evolution of a physical process with constant drift and diffusion. At this point we make the assumption that our probability distributions are all going to be in the form of Gaussians (for which the Fokker-Planck equation exactly describes evolution of the system through time, and can arguably act as a good approximation to neural control systems where movement is based on the average activity of populations of neurons).

As an example of this, think of a 1-dimensional system, and a Gaussian probability distribution describing what state the system is likely to be in. The first term, $a_1$, is the average rate of change $\dot{x}$ across the states $x$ the system could be in. The probability shifts through state space as specified by $a_1$. Intuitively, if you have a distribution with mean around position 1 and the system velocity is 4, then after one unit of time the mean of your probability distribution, $p(x)$, should have shifted to position 5. The second term, $a_2$, is a measure of how wide the range of different possible speeds $\dot{x}$ is. The larger the range of possible values, the less certain we become about the location of the system as it moves forward in time; that is, the greater the range of possible states the system might end up in at the next time step. This is reflected by the rate of diffusion of the probability distribution. If we know for sure the speed the system moved at (i.e. all possible states will move with a specific $\dot{x}$), then we simply translate the mean of the probability distribution. If, however, there’s uncertainty in the speed at which the system is moving, then the correct location (reflecting the actual system position) for the mean of the probability distribution could be one of a number of values. This is captured by increasing the width of (diffusing) the Gaussian.
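The drift and diffusion picture can be checked numerically. Below is a rough sketch (not from the paper) that steps the two-term equation forward with simple finite differences. The grid, time step, and the values $a_1 = 4$, $a_2 = 0.5$ are assumptions I picked for illustration, and the grid wraps around at the edges purely for convenience:

```python
import numpy as np

a1, a2 = 4.0, 0.5      # assumed constant drift and diffusion coefficients
T, dt = 1.0, 2e-4      # assumed total duration and Euler time step

# Discretize x on a grid wide enough that p stays far from the edges.
x = np.linspace(-5.0, 15.0, 201)
dx = x[1] - x[0]

# Initial Gaussian with mean 1 and standard deviation 0.5.
p = np.exp(-(x - 1.0) ** 2 / (2 * 0.5 ** 2))
p /= p.sum() * dx

for _ in range(int(round(T / dt))):
    # Central differences; np.roll gives periodic boundaries.
    dp_dx = (np.roll(p, -1) - np.roll(p, 1)) / (2 * dx)
    d2p_dx2 = (np.roll(p, -1) - 2 * p + np.roll(p, 1)) / dx ** 2
    # Forward Euler step of dp/dt = -a1 dp/dx + (a2 / 2) d2p/dx2.
    p = p + dt * (-a1 * dp_dx + 0.5 * a2 * d2p_dx2)

mean = np.sum(x * p) * dx
var = np.sum((x - mean) ** 2 * p) * dx
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

With these settings the mean should land close to 5 (the starting mean of 1 plus $a_1 T$) and the variance close to 0.75 (the starting 0.25 plus $a_2 T$), up to a small discretization error from the Euler steps.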

Linear operators

Importantly, the equations above are linear in terms of $p(x)$. This means we can rearrange the above equation:

$\frac{\partial p(x,t)}{\partial t} = \left( -a_1 \frac{\partial}{\partial x} + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} \right) p(x)$,

letting $\mathcal{L} = \left( -a_1 \frac{\partial}{\partial x} + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} \right)$, we have

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L} p(x)$.

Now we can redraw our system above as

where

$\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_1$,

which is the straightforward combination of the different contributions of each of the brain and the world to the overall system state probability. How cool is that?

Alright, calm down. Time to look at using these operators. Let’s assume that the overall system dynamics $\mathcal{L}$ hold constant for some period of time. Take particular care to note that ‘constant dynamics’ does not mean that a single constant output is produced irrespective of the input state, but rather that a given input state $x$ always produces the same result while the dynamics are held constant. If we have also discretized our representation of $x$ (to be a range of some values, e.g. -100 to 100), then we can find the state probability distribution at time $T$ by calculating

$p(x, T) = A^T p(x,0)$,

where $A^T = e^{T\mathcal{L}}$.
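Here is one way this could look in practice, as a sketch under my own assumptions rather than the paper’s implementation. With $x$ discretized to the integers -100 to 100, $\mathcal{L}$ becomes a matrix built from finite differences (with wrap-around edges for simplicity), and $A^T$ is a matrix exponential. The helper names `expm` and `fokker_planck_operator`, and the drift / diffusion values for the world and the brain, are all hypothetical; the exponential is a basic scaling-and-squaring routine so that only NumPy is needed.

```python
import numpy as np

def expm(M, terms=20):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    norm = np.linalg.norm(M, 1)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    scaled = M / 2 ** s
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ scaled / k
        result = result + term
    for _ in range(s):
        result = result @ result
    return result

def fokker_planck_operator(a1, a2, n, dx):
    """Finite-difference matrix for -a1 d/dx + (a2/2) d^2/dx^2 on a periodic grid."""
    identity = np.eye(n)
    forward = np.roll(identity, 1, axis=1)    # (forward @ p)[i] = p[i+1]
    backward = np.roll(identity, -1, axis=1)  # (backward @ p)[i] = p[i-1]
    d_dx = (forward - backward) / (2 * dx)
    d2_dx2 = (forward - 2 * identity + backward) / dx ** 2
    return -a1 * d_dx + 0.5 * a2 * d2_dx2

x = np.arange(-100, 101, dtype=float)  # discretized state space
dx, n = 1.0, len(x)

# Hypothetical component dynamics: the world (L0) and the brain (L1).
L0 = fokker_planck_operator(a1=-1.0, a2=3.0, n=n, dx=dx)
L1 = fokker_planck_operator(a1=2.0, a2=1.0, n=n, dx=dx)
L = L0 + L1  # overall dynamics: net drift 1, net diffusion 4

# Initial state probability: Gaussian with mean -20, standard deviation 5.
p0 = np.exp(-(x + 20.0) ** 2 / (2 * 5.0 ** 2))
p0 /= p0.sum()

T = 10.0
A_T = expm(T * L)  # A^T = e^{T L}
pT = A_T @ p0      # p(x, T) = A^T p(x, 0)

mean_T = np.sum(x * pT)
var_T = np.sum((x - mean_T) ** 2 * pT)
print(f"mean = {mean_T:.3f}, variance = {var_T:.3f}")
```

Because the drift and diffusion coefficients of the two components simply add, the mean should move from -20 to about -10 and the variance should grow from 25 to about 65 over the ten time units.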

When combining these $\mathcal{L}$ operators, if we sum them, i.e. overall system dynamics

$\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_1$,

and then apply them to the probability

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}p(x)$

this is saying that the dynamics provided by $\mathcal{L}_0$ and $\mathcal{L}_1$ are applied at the same time. But if we multiply the component dynamics operators,

$\mathcal{L} = \mathcal{L}_1 \mathcal{L}_0$

then when we apply them we have

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}p(x) = \mathcal{L}_1 \mathcal{L}_0 p(x)$,

which is interpreted as applying $\mathcal{L}_0$ to the system, and then applying $\mathcal{L}_1$. Just basic algebra, but it allows us to apply simultaneously and sequentially the dynamics generated by our contributing system components (i.e. the brain and the world).

Capturing the effects of control

So now we have a representation of the system dynamics operating with variability and under uncertainty. We’re talking about building a tool to use for controlling these systems though, so where does the control signal $u$ fit in? The $\mathcal{L}$ operator is made to be a function of the control signal, describing the probabilistic effect of the control signal given the possible initial states in the current state probability distribution $p(x)$. Thus the change in state probability is now written

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}(u)p(x)$.

Suppose that we drive a system with constant dynamics $\mathcal{L}(u_1)$ for a period of time $T_1$, at which point we change the control signal and drive the system with constant dynamics $\mathcal{L}(u_2)$ for another period of time $T_2$. The state probability distribution of the system can now be calculated as

$p(x, T_1 + T_2) = e^{T_2\mathcal{L}(u_2)}e^{T_1\mathcal{L}(u_1)}p(x,0) = A_{u_2}^{T_2}A_{u_1}^{T_1}p(x,0)$

using the sequential application of dynamics operators discussed above.
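As a sketch of the two-phase calculation above, suppose (purely as an assumption for illustration) that the control signal $u$ sets the drift while the diffusion stays fixed, so that $\mathcal{L}(u) = -u \frac{\partial}{\partial x} + \frac{a_2}{2} \frac{\partial^2}{\partial x^2}$. The helper names, the grid, and the values of $u_1$, $u_2$, $T_1$, $T_2$ are hypothetical, and the matrix exponential is again a simple NumPy-only routine:

```python
import numpy as np

def expm(M, terms=20):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    norm = np.linalg.norm(M, 1)
    s = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    scaled = M / 2 ** s
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms + 1):
        term = term @ scaled / k
        result = result + term
    for _ in range(s):
        result = result @ result
    return result

def L_of_u(u, n, dx, a2=4.0):
    """Assumed control-dependent operator: drift u, fixed diffusion a2 (periodic grid)."""
    identity = np.eye(n)
    forward = np.roll(identity, 1, axis=1)    # (forward @ p)[i] = p[i+1]
    backward = np.roll(identity, -1, axis=1)  # (backward @ p)[i] = p[i-1]
    d_dx = (forward - backward) / (2 * dx)
    d2_dx2 = (forward - 2 * identity + backward) / dx ** 2
    return -u * d_dx + 0.5 * a2 * d2_dx2

x = np.arange(-100, 101, dtype=float)
dx, n = 1.0, len(x)

# Initial state probability: Gaussian with mean -20, standard deviation 5.
p0 = np.exp(-(x + 20.0) ** 2 / (2 * 5.0 ** 2))
p0 /= p0.sum()

u1, T1 = 1.0, 10.0    # first control signal and how long it is applied
u2, T2 = -0.5, 10.0   # second control signal and how long it is applied

A1 = expm(T1 * L_of_u(u1, n, dx))  # e^{T1 L(u1)}
A2 = expm(T2 * L_of_u(u2, n, dx))  # e^{T2 L(u2)}

# The operator applied first sits closest to p(x, 0).
p_final = A2 @ (A1 @ p0)

mean_final = np.sum(x * p_final)
var_final = np.sum((x - mean_final) ** 2 * p_final)
print(f"mean = {mean_final:.3f}, variance = {var_final:.3f}")
```

The mean should end up near -15 (that is, -20 + 1 × 10 - 0.5 × 10) and the variance near 105 (25 plus 4 per unit time over 20 time units). In this particular sketch the two operators happen to commute, since both are built from the same constant-coefficient differences, so swapping the order gives the same answer; that won’t hold in general, for example when the drift depends on the state.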

Conclusion

And that is the essence of the paper ‘Controlling variability’. There is an additional discussion about the relationship to Bayes’ rule, which I will save for another post, and an example, but this is plenty for this post.

The main point from this paper is that we shouldn’t be focusing on the values of states as the object of control, but rather the probability densities of states. By doing this, we can capture the uncertainty in systems and work towards devising an effective means of control. So, although the paper is called ‘Controlling variability’, the discussion of how to actually control variability is saved for later papers. All the same, I thought this was a very interesting paper, enjoyed working through it, and am looking forward to the rest of the series.

Sanger TD (2010). Controlling variability. Journal of motor behavior, 42 (6), 401-7 PMID: 21184358