Simultaneous perturbation vs finite differences for linear dynamics estimation and control signal optimization

Recently in my posts I’ve been using finite differences to approximate the gradient of loss functions and dynamical systems, with the intention of creating generalizable controllers that can be run on any system without having to calculate out derivatives beforehand. Finite differences is pretty much the most straight-forward way of approximating a gradient that there is: vary each parameter up and down (assuming we’re doing central differencing), one at a time, run it through your function and estimate the parameters effect on the system by calculating the difference between resulting function output. To do this requires 2 samples of the function for each parameter.

But there’s always more than one way to peel an avocado, and another approach that’s been used with much success is the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm, which was developed by Dr. James Spall (link to overview paper). SPSA is a method of gradient approximation, like finite differences, but, critically, the difference is that it varies all of the parameters at once, rather than one at a time. As a result, you can get an approximation of the gradient with far fewer samples from the system, and also when you don’t have explicit control over your samples (i.e. the ability to vary each parameter one at a time).

Given some function $\textbf{f}$ dependent on some set of parameters $\textbf{x}$, we’re used to finding the gradient of $\textbf{f}(\textbf{x})$ using FDSA (finite differences stochastic approximation) written in this form:

$\textbf{f}_x = \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}) - \textbf{f}(\textbf{x} - \Delta \textbf{x})}{2 \Delta \textbf{x}} = \Delta x^{-1} \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}) - \textbf{f}(\textbf{x} - \Delta \textbf{x})}{2},$

where $\Delta\textbf{x}$ is a perturbation to the parameter set, and the subscript on the left-hand side denotes the derivative of $\textbf{f}(\textbf{x})$ with respect to $\textbf{x}$.

And that’s how we’ve calculated it before, estimating the gradient of a single parameter at a time. But, we can rewrite this for a set perturbations $\Delta \textbf{X} = [\Delta\textbf{x}_0, ... , \Delta\textbf{x}_N]^T$:

$\textbf{f}_\textbf{x} = \Delta \textbf{X}^{-1} \textbf{F},$

where

$\textbf{F} = [\frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}_0) - \textbf{f}(\textbf{x} - \Delta \textbf{x}_0)}{2}, ... , \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}_0) - \textbf{f}(\textbf{x}_N - \Delta \textbf{x}_N)}{2}]^T$,

which works as long as $\Delta\textbf{X}$ is square. When it’s not square (i.e. we have don’t have the same number of samples as we have parameters), we run into problems, because we can’t calculate $\Delta\textbf{X}^{-1}$ directly. To address this, let’s take a step back and then work forward again to get a more general form that works for non-square $\Delta \textbf{X}$ too.

By rewriting the above, and getting rid of the inverse by moving $\Delta\textbf{x}$ back to the other side, we have:

$\Delta\textbf{X} \; \textbf{f}_\textbf{x} = \textbf{F}$

Now, the standard trick to move a matrix that’s not square is to just make it square by multiplying it by its transpose to get a square matrix, and then the whole thing by the inverse:

$\Delta\textbf{X}^T \Delta\textbf{X} \; \textbf{f}_\textbf{x} = \Delta\textbf{X}^T \textbf{F}$

$(\Delta\textbf{X}^T \Delta\textbf{X})^{-1} (\Delta\textbf{X}^T \Delta\textbf{X}) \textbf{f}_\textbf{x} = (\Delta\textbf{X}^T \Delta\textbf{X})^{-1} \Delta\textbf{X}^T \textbf{F}$

$\textbf{f}_\textbf{x} = (\Delta\textbf{X}^T \Delta\textbf{X})^{-1} \Delta\textbf{X}^T \textbf{F}$

Alright! Now we’re comfortable with this characterization of gradient approximation using a form that works with non-square perturbation matrices.

Again, in FDSA, we only vary one parameter at a time. This means that there will only ever be one non-zero entry per row of $\Delta \textbf{X}$. By contrast, in SPSA, we vary multiple parameters, and so rows of $\Delta\textbf{X}$ will be just chalk full of non-zero entries.

Gradient approximation to estimate $\textbf{f}_\textbf{x}$ and $\textbf{f}_\textbf{u}$ for LQR control

Implementing this was pretty simple. Just had to modify the calc_derivs function, which I use to estimate the derivative of the arm with respect to the state and control signal, in my LQR controller code by changing from standard finite differences to simultaneous perturbation:

def calc_derivs(self, x, u):
"""" calculate gradient of plant dynamics using Simultaneous
Perturbation Stochastic Approximation (SPSA). Implemented
based on (Peters & Schaal, 2008).

x np.array: the state of the system
u np.array: the control signal
"""
# Initialization and coefficient selection
num_iters = 20

eps = 1e-4
delta_K = None
delta_J = None
for ii in range(num_iters):
# Generation of simultaneous perturbation vector
# choose each component from a Bernoulli +-1 distribution
# with probability of .5 for each +-1 outcome.
delta_k = np.random.choice([-1,1],
size=len(x) + len(u),
p=[.5, .5])
# Function evaluations
inc_x = np.copy(x) + eps * delta_k[:len(x)]
inc_u = np.copy(u) + eps * delta_k[len(x):]
state_inc = self.plant_dynamics(inc_x, inc_u)
dec_x = np.copy(x) - eps * delta_k[:len(x)]
dec_u = np.copy(u) - eps * delta_k[len(x):]
state_dec = self.plant_dynamics(dec_x, dec_u)

delta_j = ((state_inc - state_dec) /
(2.0 * eps)).reshape(-1)

# Track delta_k and delta_j
delta_K = delta_k if delta_K is None else \
np.vstack([delta_K, delta_k])
delta_J =  delta_j if delta_J is None else \
np.vstack([delta_J, delta_j])

f_xu = np.dot(np.linalg.pinv(np.dot(delta_K.T, delta_K)),
np.dot(delta_K.T, delta_J))
f_x = f_xu[:len(x)]
f_u = f_xu[len(x):]

return f_x.T , f_u.T


A couple notes about the above code. First, you’ll notice that the f_x and f_b matrices are both calculated at the same time. That’s pretty slick! And that calculation for f_xu is just a straight implementation of the matrix form of gradient approximation, where I’ve arranged things so that f_x is in the top part and f_u is in the lower part.

The second thing is that the perturbation vector delta_k is generated from a Bernoulli distribution. The reason behind this is that we want to have a bunch of different samples that pretty reasonably spread the state space and move all the parameters independently. Making each perturbation some distance times -1 or 1 is an easy way to achieve this.

Thirdly, there’s the num_iters variable. This is a very important variable, as it dictates how many random samples of our system we take before we estimate the gradient. I’ve found that to get this to work for both the 2-link arm and the more complex 3-link arm, it needs to be at least 20. Or else things explode and die horribly. Just…horribly.

OK let’s look at the results:

The first thing to notice is that I’ve finally discovered the Seaborn plotting package. The second is that SPSA does as well as FDSA.

You may ask: Is there any difference? Well, if we time these functions, on my lil’ laptop, for the 2-link arm it takes SPSA approximately 2.0ms, but it takes FDSA only 0.8ms. So for the same performance the SPSA is taking almost 3 times as long to run. Why? This boils down to how many times the system dynamics need to be sampled by each algorithm to get a good approximation of the gradient. For a 2-link arm, FDSA has 6 parameters ($\textbf{q}, \dot{\textbf{q}},$ and $\textbf{u}$) that it needs to sample twice (we’re doing central differencing), for a total of 12 samples. And as I mentioned above, the SPSA algorithm needs 20 samples to be stable.

For the 3-link arm, SPSA took about 3.1ms on average and FDSA (which must now perform 18 samples of the dynamics) still only 2.1ms. So number of samples isn’t the only cause of time difference between these two algorithms. SPSA needs to perform that a few more matrix operations, including a matrix inverse, which is expensive, while FDSA can calculate the gradient of each parameter individually, which is much less expensive.

OK so SPSA not really impressive here. BUT! As I discovered, there are other means of employing SPSA.

Gradient approximation to optimize the control signal directly

In the previous set up we were using SPSA to estimate the gradient of the system under control, and then we used that gradient to calculate a control signal that minimized the loss function (as specified inside the LQR). This is one way to use gradient approximation methods. Another way to use these methods is approximate the gradient of the loss function directly, and use that information to iteratively calculate a control signal that minimizes the loss function. This second application is the primary use of the SPSA algorithm, and is what’s described by Dr. Spall in his overview paper.

In this application, the algorithm works like this:

2. perturb input and simulate results
3. observe loss function and calculate gradient
4. update input to system
5. repeat to convergence

Because in this approach we’re iteratively optimizing the input using our gradient estimation, having a noisy estimate is no longer a death sentence, as it was in the LQR. If we update our input to the system with several noisy gradient estimates the noise will essentially just cancel itself out. This means that SPSA now has a powerful advantage over FDSA: Since in SPSA we vary all parameters at once, only 2 samples of the loss function are used to estimate the gradient, regardless of the number of parameters. In contrast, FDSA needs to sample the loss function twice for every input parameter. Here’s a picture from (Spall, 1998) that shows the two running against each other to optimize a 2D problem:

This gets across that even though SPSA bounces around more, they both reach the solution in the same number of steps. And, in general, this is the case, as Dr. Spall talks about in the paper. There’s also a couple more details of the algorithm, so let’s look at it in detail. Here’s the code, which is just a straight translation into Python out of the description in Dr. Spall’s paper:

# Step 1: Initialization and coefficient selection
max_iters = 5
converge_thresh = 1e-5

alpha = 0.602 # from (Spall, 1998)
gamma = 0.101
a = .101 # found empirically using HyperOpt
A = .193
c = .0277

delta_K = None
delta_J = None
u = np.copy(self.u) if self.u is not None \
else np.zeros(self.arm.DOF)
for k in range(max_iters):
ak = a / (A + k + 1)**alpha
ck = c / (k + 1)**gamma

# Step 2: Generation of simultaneous perturbation vector
# choose each component from a bernoulli +-1 distribution with
# probability of .5 for each +-1 outcome.
delta_k = np.random.choice([-1,1], size=arm.DOF, p=[.5, .5])

# Step 3: Function evaluations
inc_u = np.copy(u) + ck * delta_k
cost_inc = self.cost(np.copy(state), inc_u)
dec_u = np.copy(u) - ck * delta_k
cost_dec = self.cost(np.copy(state), dec_u)

gk = np.dot((cost_inc - cost_dec) / (2.0*ck), delta_k)

# Step 5: Update u estimate
old_u = np.copy(u)
u -= ak * gk

# Step 6: Check for convergence
if np.sum(abs(u - old_u)) < converge_thresh:
break


The main as-of-yet-unexplained parts of this code are the alpha, gamma, a, A, and c variables. What’s their deal?

Looking inside the loop, we can see that ck controls the magnitude of our perturbations. Looking a little further down, ak is just the learning rate. And all of those other parameters are just involved in shaping the trajectories that ak and ck follow through iterations, which is a path towards zero. So the first steps and perturbations are the biggest, and each successively becomes smaller as the iteration count increases.

There are a few heuristics that Dr. Spall goes over, but there aren’t any hard and fast rules for setting a, A, and c. Here, I just used HyperOpt to find some values that worked pretty well for this particular problem.

The FDSA version of this is also very straight-forward:

# Step 1: Initialization and coefficient selection
max_iters = 10
converge_thresh = 1e-5

eps = 1e-4
u = np.copy(self.u) if self.u is not None \
else np.zeros(self.arm.DOF)
for k in range(max_iters):

gk = np.zeros(u.shape)
for ii in range(gk.shape[0]):
# Step 2: Generate perturbations one parameter at a time
inc_u = np.copy(u)
inc_u[ii] += eps
dec_u = np.copy(u)
dec_u -= eps

# Step 3: Function evaluation
cost_inc = self.cost(np.copy(state), inc_u)
cost_dec = self.cost(np.copy(state), dec_u)

gk[ii] = (cost_inc - cost_dec) / (2.0 * eps)

old_u = np.copy(u)
# Step 5: Update u estimate
u -= 1e-5 * gk

# Step 6: Check for convergence
if np.sum(abs(u - old_u)) < converge_thresh:
break


You’ll notice that in both the SPSA and FDSA code we’re no longer sampling plant_dynamics, we’re instead sampling cost, a loss function I defined. From just my experience playing around with these algorithms a bit, getting the loss function to be appropriate and give the desired behaviour is definitely a bit of an art. It feels like much more of an art than in other controllers I’ve coded, but that could just be me.

The cost function that I’m using is pretty much the first thing you’d think of. It penalizes distance to target and having non-zero velocity. Getting the weighting between distance to target and velocity set up so that the arm moves to the target but also doesn’t overshoot definitely took a bit of trial and error, er, I mean empirical analysis. Here’s the cost function that I found worked pretty well, note that I had to make special cases for the different arms:

 def cost(self, x, u):
dt = .1 if self.arm.DOF == 3 else .01
next_x = self.plant_dynamics(x, u, dt=dt)
vel_gain = 100 if self.arm.DOF == 3 else 10
return (np.sqrt(np.sum((self.arm.x - self.target)**2)) * 1000 \
+ np.sum((next_x[self.arm.DOF:])**2) * vel_gain)


So that’s all the code, let’s look at the results!

For these results, I used a max of 10 iterations for optimizing the control signal. I was definitely surprised by the quality of the results, especially for the 3-link arm, compared to the results generated by a standard LQR controller. Although I need to note, again, that it was a fair bit of me playing around with the exact cost function to get these results. Lots of empirical analysis.

The two controllers generate results that are identical to visual inspection. However, especially in the 3-link arm, the time required to run the FDSA was significantly longer than the SPSA controller. It took approximately 140ms for the SPSA controller to run a single loop, but took FDSA on average 700ms for a single loop of calculating the control signal. Almost 5 times as long! For the same results! In directly optimizing the control signal, SPSA gets a big win over standard FDSA. So, if you’re looking to directly optimize over a loss function, SPSA is probably the way you want to go.

Conclusions

First off, I thought it was really neat to directly apply gradient approximation methods to optimizing the control signal. It’s something I haven’t tried before, but definitely makes sense, and can generate some really nice results when tuned properly. Automating the tuning is definitely I’ll be discussing in future posts, because doing it by hand takes a long time and is annoying.

In the LQR, the gradient approximation was best done by the FDSA. I think the main reasons for this is that in solving for the control signal the LQR algorithm uses matrix inverses, and any errors in the linear approximations to the dynamics are going to be amplified quite a bit. If I did anything less than 10-15 iterations (20 for the 3-link arm) in the SPSA approximation then things exploded. Also, here the SPSA algorithm required a matrix inverse, where the FDSA didn’t. This is because we only varied one parameter at a time in FDSA, and the effects of changing each was isolated. In the SPSA case, we had to consider the changes across all the variables and the resulting effects all at once, essentially noting which variables changed by how much and the changes in each case, and averaging. Here, even with the more complex 3-link arm, FDSA was faster, so I’m going to stick with it in my LQR and iLQR implementations.

In the direct control signal optimization SPSA beat the pants off of FDSA. It was almost 5 times faster for control of the 3-link arm. This was, again, because in this case we could use noisy samples of the gradient of the loss function and relied on noise to cancel itself out as we iterated. So we only needed 2 samples of the loss function in SPSA, where in FDSA we needed 2*num_parameters. And although this generated pretty good results I would definitely be hesitant against using this for any more complicated systems, because tuning that cost function to get out a good trajectory was a pain. If you’re interested in playing around with this, you can check out the code for the gradient controllers up on my GitHub.

The iterative Linear Quadratic Regulator algorithm

A few months ago I posted on Linear Quadratic Regulators (LQRs) for control of non-linear systems using finite-differences. The gist of it was at every time step linearize the dynamics, quadratize (it could be a word) the cost function around the current point in state space and compute your feedback gain off of that, as though the dynamics were both linear and consistent (i.e. didn’t change in different states). And that was pretty cool because you didn’t need all the equations of motion and inertia matrices etc to generate a control signal. You could just use the simulation you had, sample it a bunch to estimate the dynamics and value function, and go off of that.

The LQR, however, operates with maverick disregard for changes in the future. Careless of the consequences, it optimizes assuming the linear dynamics approximated at the current time step hold for all time. It would be really great to have an algorithm that was able to plan out and optimize a sequence, mindful of the changing dynamics of the system.

This is exactly the iterative Linear Quadratic Regulator method (iLQR) was designed for. iLQR is an extension of LQR control, and the idea here is basically to optimize a whole control sequence rather than just the control signal for the current point in time. The basic flow of the algorithm is:

1. Initialize with initial state $x_0$ and initial control sequence $\textbf{U} = [u_{t_0}, u_{t_1}, ..., u_{t_{N-1}}]$.
2. Do a forward pass, i.e. simulate the system using $(x_0, \textbf{U})$ to get the trajectory through state space, $\textbf{X}$, that results from applying the control sequence $\textbf{U}$ starting in $x_0$.
3. Do a backward pass, estimate the value function and dynamics for each $(\textbf{x}, \textbf{u})$ in the state-space and control signal trajectories.
4. Calculate an updated control signal $\hat{\textbf{U}}$ and evaluate cost of trajectory resulting from $(x_0, \hat{\textbf{U}})$.
1. If $|(\textrm{cost}(x_0, \hat{\textbf{U}}) - \textrm{cost}(x_0, \textbf{U})| < \textrm{threshold}$ then we've converged and exit.
2. If $\textrm{cost}(x_0, \hat{\textbf{U}}) < \textrm{cost}(x_0, \textbf{U})$, then set $\textbf{U} = \hat{\textbf{U}}$, and change the update size to be more aggressive. Go back to step 2.
3. If $\textrm{cost}(x_0, \hat{\textbf{U}}) \geq$ $\textrm{cost}(x_0, \textbf{U})$ change the update size to be more modest. Go back to step 3.

There are a bunch of descriptions of iLQR, and it also goes by names like ‘the sequential linear quadratic algorithm’. The paper that I’m going to be working off of is by Yuval Tassa out of Emo Todorov’s lab, called Control-limited differential dynamic programming. And the Python implementation of this can be found up on my github in my Control repo. Also, a big thank you to Dr. Emo Todorov who provided Matlab code for the iLQG algorithm, which was super helpful.

Defining things

So let’s dive in. Formally defining things, we have our system $\textbf{x}$, and dynamics described with the function $\textbf{f}$, such that

$\textbf{x}_{t+1} = \textbf{f}(\textbf{x}_t, \textbf{u}_t),$

where $\textbf{u}$ is the input control signal. The trajectory $\{\textbf{X}, \textbf{U}\}$ is the sequence of states $\textbf{X} = \{\textbf{x}_0, \textbf{x}_1, ..., \textbf{x}_N\}$ that result from applying the control sequence $\textbf{U} = \{\textbf{u}_0, \textbf{u}_1, ..., \textbf{u}_{N-1}\}$ starting in the initial state $\textbf{x}_0$.

Now we need to define all of our cost related equations, so we know exactly what we’re dealing with.

Define the total cost function $J$, which is the sum of the immediate cost, $\ell$, from each state in the trajectory plus the final cost, $\ell_f$:

$J(\textbf{x}_0, \textbf{U}) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{x}_t, \textbf{u}_t) + \ell_f(\textbf{x}_N).$

Letting $\textbf{U}_t = \{\textbf{u}_t, \textbf{u}_{t+1}, ..., \textbf{U}_{N-1}\}$, we define the cost-to-go as the sum of costs from time $t$ to $N$:

$J_t(\textbf{x}, \textbf{U}_t) = \sum\limits^{N-1}\limits_{i=t} \ell(\textbf{x}_i, \textbf{u}_i) + \ell_f(\textbf{x}_N).$

The value function $V$ at time $t$ is the optimal cost-to-go from a given state:

$V_t(\textbf{x}) = \min\limits_{\textbf{U}_t} J_t(\textbf{x}, \textbf{U}_t),$

where the above equation just says that the optimal cost-to-go is found by using the control sequence $\textbf{U}_t$ that minimizes $J_t$.

At the final time step, $N$, the value function is simply

$V(\textbf{x}_N) = \ell_f(\textbf{x}_N).$

For all preceding time steps, we can write the value function as a function of the immediate cost $\ell(\textbf{x}, \textbf{u})$ and the value function at the next time step:

$V(\textbf{x}) = \min\limits_{\textbf{u}} \left[ \ell(\textbf{x}, \textbf{u}) + V(\textbf{f}(\textbf{x}, \textbf{u})) \right].$

NOTE: In the paper, they use the notation $V'(\textbf{f}(\textbf{x}, \textbf{u}))$ to denote the value function at the next time step, which is redundant since $\textbf{x}_{t+1} = \textbf{f}(\textbf{x}_t, \textbf{u}_t)$, but it comes in handy later when they drop the dependencies to simplify notation. So, heads up: $V' = V(\textbf{f}(\textbf{x}, \textbf{u})$.

Forward rollout

The forward rollout consists of two parts. The first part is to simulating things to generate the $(\textbf{X}, \textbf{U}),$ from which we can calculate the overall cost of the trajectory, and find out the path that the arm will take. To improve things though we’ll need a lot of information about the partial derivatives of the system, calculating these is the second part of the forward rollout phase.

To calculate all these partial derivatives we’ll use $(\textbf{X}, \textbf{U})$. For each $(\textbf{x}_t, \textbf{u}_t)$ we’ll calculate the derivatives of $\textbf{f}(\textbf{x}_t, \textbf{u}_t)$ with respect to $\textbf{x}_t$ and $\textbf{u}_t$, which will give us what we need for our linear approximation of the system dynamics.

To get the information we need about the value function, we’ll need the first and second derivatives of $\ell(\textbf{x}_t, \textbf{u}_t)$ and $\ell_f(\textbf{x}_t, \textbf{x}_t)$ with respect to $\textbf{x}_t$ and $\textbf{u}_t$.

So all in all, we need to calculate $\textbf{f}_\textbf{x}$, $\textbf{f}_\textbf{u}$, $\ell_\textbf{x}$, $\ell_\textbf{u}$, $\ell_\textbf{xx}$, $\ell_\textbf{ux}$, $\ell_\textbf{uu}$, where the subscripts denote a partial derivative, so $\ell_\textbf{x}$ is the partial derivative of $\ell$ with respect to $\textbf{x}$, $\ell_\textbf{xx}$ is the second derivative of $\ell$ with respect to $\textbf{x}$, etc. And to calculate all of these partial derivatives, we’re going to use finite differences! Just like in the LQR with finite differences post. Long story short, load up the simulation for every time step, slightly vary one of the parameters, and measure the resulting change.

Once we have all of these, we’re ready to move on to the backward pass.

Backward pass

Now, we started out with an initial trajectory, but that was just a guess. We want our algorithm to take it and then converge to a local minimum. To do this, we’re going to add some perturbing values and use them to minimize the value function. Specifically, we’re going to compute a local solution to our value function using a quadratic Taylor expansion. So let’s define $Q(\delta \textbf{x}, \delta \textbf{u})$ to be the change in our value function at $(\textbf{x}, \textbf{u})$ as a result of small perturbations $(\delta \textbf{x}, \delta \textbf{u})$:

$Q(\delta \textbf{x}, \delta \textbf{u}) = \ell (\textbf{x} + \delta \textbf{x}, \textbf{u} + \delta \textbf{u}) + V(\textbf{f}(\textbf{x} + \delta\textbf{x}, \textbf{u} + \delta \textbf{u})).$

The second-order expansion of $Q$ is given by:

$Q_\textbf{x} = \ell_\textbf{x} + \textbf{f}_\textbf{x}^T V'_\textbf{x},$

$Q_\textbf{u} = \ell_\textbf{u} + \textbf{f}_\textbf{u}^T V'_\textbf{x},$

$Q_\textbf{xx} = \ell_\textbf{xx} + \textbf{f}_\textbf{x}^T V'_\textbf{xx} \textbf{f}_\textbf{x} + V'_\textbf{x} \cdot \textbf{f}_\textbf{xx},$

$Q_\textbf{ux} = \ell_\textbf{ux} + \textbf{f}_\textbf{u}^T V'_\textbf{xx} \textbf{f}_\textbf{x}+ V'_\textbf{x} \cdot \textbf{f}_\textbf{ux},$

$Q_\textbf{uu} = \ell_\textbf{uu} + \textbf{f}_\textbf{u}^T V'_\textbf{xx} \textbf{f}_\textbf{u}+ V'_\textbf{x} \cdot \textbf{f}_\textbf{uu}.$

Remember that $V' = V(\textbf{f}(\textbf{x}, \textbf{u}))$, which is the value function at the next time step. NOTE: All of the second derivatives of $\textbf{f}$ are zero in the systems we’re controlling here, so when we calculate the second derivatives we don’t need to worry about doing any tensor math, yay!

Given the second-order expansion of $Q$, we can to compute the optimal modification to the control signal, $\delta \textbf{u}^*$. This control signal update has two parts, a feedforward term, $\textbf{k}$, and a feedback term $\textbf{K} \delta\textbf{x}$. The optimal update is the $\delta\textbf{u}$ that minimizes the cost of $Q$:

$\delta\textbf{u}^*(\delta \textbf{x}) = \min\limits_{\delta\textbf{u}}Q(\delta\textbf{x}, \delta\textbf{u}) = \textbf{k} + \textbf{K}\delta\textbf{x},$

where $\textbf{k} = -Q^{-1}_\textbf{uu} Q_\textbf{u}$ and $\textbf{K} = -Q^{-1}_\textbf{uu} Q_\textbf{ux}.$

Derivation can be found in this earlier paper by Li and Todorov. By then substituting this policy into the expansion of $Q$ we get a quadratic model of $V$. They do some mathamagics and come out with:

$V_\textbf{x} = Q_\textbf{x} - \textbf{K}^T Q_\textbf{uu} \textbf{k},$

$V_\textbf{xx} = Q_\textbf{xx} - \textbf{K}^T Q_\textbf{uu} \textbf{K}.$

So now we have all of the terms that we need, and they’re defined in terms of the values at the next time step. We know the value of the value function at the final time step $V_N = \ell_f(\textbf{x}_N)$, and so we’ll simply plug this value in and work backwards in time recursively computing the partial derivatives of $Q$ and $V$.

Calculate control signal update

Once those are all calculated, we can calculate the gain matrices, $\textbf{k}$ and $\textbf{K}$, for our control signal update. Huzzah! Now all that’s left to do is evaluate this new trajectory. So we set up our system

$\hat{\textbf{x}}_0 = \textbf{x}_0,$

$\hat{\textbf{u}}_t = \textbf{u}_t + \textbf{k}_t + \textbf{K}_t (\hat{\textbf{x}}_t - \textbf{x}_t),$

$\hat{\textbf{x}}_{t+1} = \textbf{f}(\hat{\textbf{x}}_t, \hat{\textbf{u}}_t),$

and record the cost. Now if the cost of the new trajectory $(\hat{\textbf{X}}, \hat{\textbf{U}})$ is less than the cost of $(\textbf{X}, \textbf{U})$ then we set $\textbf{U} = \hat{\textbf{U}}$ and go do it all again! And when the cost from an update becomes less than a threshold value, call it done. In code this looks like:

if costnew < cost:
sim_new_trajectory = True

if (abs(costnew - cost)/cost) < self.converge_thresh:
break


Of course, another option we need to account for is when costnew > cost. What do we do in this case? Our control update hasn’t worked, do we just exit?

The Levenberg-Marquardt heuristic
No! Phew.

The control signal update in iLQR is calculated in such a way that it can behave like Gauss-Newton optimization (which uses second-order derivative information) or like gradient descent (which only uses first-order derivative information). The is that if the updates are going well, then lets include curvature information in our update to help optimize things faster. If the updates aren’t going well let’s dial back towards gradient descent, stick to first-order derivative information and use smaller steps. This wizardry is known as the Levenberg-Marquardt heuristic. So how does it work?

Something we skimmed over in the iLQR description was that we need to calculate $Q^{-1}_\textbf{uu}$ to get the $\textbf{k}$ and $\textbf{K}$ matrices. Instead of using np.linalg.pinv or somesuch, we’re going to calculate the inverse ourselves after finding the eigenvalues and eigenvectors, so that we can regularize it. This will let us do a couple of things. First, we’ll be able to make sure that our estimate of curvature ($Q_\textbf{uu}^{-1}$) stays positive definite, which is important to make sure that we always have a descent direction. Second, we’re going to add a regularization term to the eigenvalues to prevent them from exploding when we take their inverse. Here’s our regularization implemented in Python:


Q_uu_evals, Q_uu_evecs = np.linalg.eig(Q_uu)
Q_uu_evals[Q_uu_evals < 0] = 0.0
Q_uu_evals += lamb
Q_uu_inv = np.dot(Q_uu_evecs,
np.dot(np.diag(1.0/Q_uu_evals), Q_uu_evecs.T))


Now, what happens when we change lamb? The eigenvalues represent the magnitude of each of the eigenvectors, and by taking their reciprocal we flip the contributions of the vectors. So the ones that were contributing the least now have the largest singular values, and the ones that contributed the most now have the smallest eigenvalues. By adding a regularization term we ensure that the inverted eigenvalues can never be larger than 1/lamb. So essentially we throw out information.

In the case where we’ve got a really good approximation of the system dynamics and value function, we don’t want to do this. We want to use all of the information available because it’s accurate, so make lamb small and get a more accurate inverse. In the case where we have a bad approximation of the dynamics we want to be more conservative, which means not having those large singular values. Smaller singular values give a smaller $Q_\textbf{uu}^{-1}$ estimate, which then gives smaller gain matrices and control signal update, which is what we want to do when our control signal updates are going poorly.

How do you know if they’re going poorly or not, you now surely ask! Clever as always, we’re going to use the result of the previous iteration to update lamb. So adding to the code from just above, the end of our control update loop is going to look like:

lamb = 1.0 # initial value of lambda
...
if costnew < cost:
lamb /= self.lamb_factor
sim_new_trajectory = True

if (abs(costnew - cost)/cost) < self.converge_thresh:
break
else:
lamb *= self.lamb_factor
if lamb > self.max_lamb:
break


And that is pretty much everything! OK let’s see how this runs!

Simulation results

If you want to run this and see for yourself, you can go copy my Control repo, navigate to the main directory, and run

python run.py arm2 reach


or substitute in arm3. If you’re having trouble getting the arm2 simulation to run, try arm2_python, which is a straight Python implementation of the arm dynamics, and should work no sweat for Windows and Mac.

Below you can see results from the iLQR controller controlling the 2 and 3 link arms (click on the figures to see full sized versions, they got distorted a bit in the shrinking to fit on the page), using immediate and final state cost functions defined as:

l = np.sum(u**2)


and

pos_err = np.array([self.arm.x[0] - self.target[0],
self.arm.x[1] - self.target[1]])
l = (wp * np.sum(pos_err**2) + # pos error
wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error


where wp and wv are just gain values, x is the state of the system, and self.arm.x is the $(x,y)$ position of the hand. These read as “during movement, penalize large control signals, and at the final state, have a big penalty on not being at the target.”

So let’s give it up for iLQR, this is awesome! How much of a crazy improvement is that over LQR? And with all knowledge of the system through finite differences, and with the full movements in exactly 1 second! (Note: The simulation speeds look different because of my editing to keep the gif sizes small, they both take the same amount of time for each movement.)

Changing cost functions
Something that you may notice is that the control of the 3 link is actually straighter than the 2 link. I thought that this might be just an issue with the gain values, since the scale of movement is smaller for the 2 link arm than the 3 link there might have been less of a penalty for not moving in a straight line, BUT this was wrong. You can crank the gains and still get the same movement. The actual reason is that this is what the cost function specifies, if you look in the code, only $\ell_f(\textbf{x}_N)$ penalizes the distance from the target, and the cost function during movement is strictly to minimize the control signal, i.e. $\ell(\textbf{x}_t, \textbf{u}_t) = \textbf{u}_t^2$.

Well that’s a lot of talk, you say, like the incorrigible antagonist we both know you to be, prove it. Alright, fine! Here’s iLQR running with an updated cost function that includes the end-effector’s distance from the target in the immediate cost:

All that I had to do to get this was change the immediate cost from

l = np.sum(u**2)


to

l = np.sum(u**2)
pos_err = np.array([self.arm.x[0] - self.target[0],
self.arm.x[1] - self.target[1]])
l += (wp * np.sum(pos_err**2) + # pos error
wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error


where all I had to do was include the position penalty term from the final state cost into the immediate state cost.

Changing sequence length
In these simulations the system is simulating at .01 time step, and I gave it 100 time steps to reach the target. What if I give it only 50 time steps?

It looks pretty much the same! It’s just now twice as fast, which is of course achieved by using larger control signals, which we don’t see, but dang awesome.

What if we try to make it there in 10 time steps??

OK well that does not look good. So what’s going on in this case? Basically we’ve given the algorithm an impossible task. It can’t make it to the target location in 10 time steps. In the implementation I wrote here, if it hits the end of it’s control sequence and it hasn’t reached the target yet, the control sequence starts over back at t=0. Remember that part of the target state is also velocity, so basically it moves for 10 time steps to try to minimize $(x,y)$ distance, and then slows down to minimize final state cost in the velocity term.

In conclusion

This algorithm has been used in a ton of things, for controlling robots and simulations, and is an important part of guided policy search, which has been used to very successfully train deep networks in control problems. It’s getting really impressive results for controlling the arm models that I’ve built here, and using finite differences should easily generalize to other systems.

iLQR is very computationally expensive, though, so that’s definitely a downside. It’s definitely less expensive if you have the equations of your system, or at least a decent approximation of them, and you don’t need to use finite differences. But you pay for the efficiency with a loss in generality.

There are also a bunch of parameters to play around with that I haven’t explored at all here, like the weights in the cost function penalizing the magnitude of the cost function and the final state position error. I showed a basic example of changing the cost function, which hopefully gets across just how easy changing these things out can be when you’re using finite differences, and there’s a lot to play around with there too.

Implementation note
In the Yuval and Todorov paper, they talked about using backtracking line search when generating the control signal. So the algorithm they had when generating the new control signal was actually:

$\hat{\textbf{u}}_t = \hat{\textbf{u}}_t + \alpha\textbf{k}_t + \textbf{K}_t(\hat{\textbf{x}}_t - \textbf{x}_t)$

where $\alpha$ was the backtracking search parameter, which gets set to one initially and then reduced. It’s very possible I didn’t implement it as intended, but I found consistently that $\alpha = 1$ always generated the best results, so it was just adding computation time. So I left it out of my implementation. If anyone has insights on an implementation that improves results, please let me know!

And then finally, another thank you to Dr. Emo Todorov for providing Matlab code for the iLQG algorithm, which was very helpful, especially for getting the Levenberg-Marquardt heuristic implemented properly.

Linear-Quadratic Regulation for non-linear systems using finite differences

One of the standard controllers in basic control theory is the linear-quadratic regulator (LQR). There is a finite-horizon case (where you have a limited amount of time), and an infinite-horizon case (where you don’t); in this post, for simplicity, we’re only going to be dealing with the infinite-horizon case.

The LQR is designed to handle a very specific kind of problem. First, it assumes you are controlling a system with linear dynamics, which means you can express them as

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u}$,

where $\textbf{x}$ and $\dot{\textbf{x}}$ are the state and its time derivative, $\textbf{u}$ is the input, and $\textbf{A}$ and $\textbf{B}$ capture the effects of the state and input on the derivative. And second, it assumes that the cost function, denoted $J$, is a quadratic of the form

$J = \int_0^{\infty} \left( (\textbf{x} - \textbf{x}^*)^T \textbf{Q} (\textbf{x} - \textbf{x}^*) + \textbf{u}^T \textbf{R} \textbf{u} \right) dt$

where $\textbf{x}^*$ is the target state, and $\textbf{Q} = \textbf{Q}^T \geq 0$ and $\textbf{R} = \textbf{R}^T \geq 0$ are weights on the cost of not being at the target state and applying a control signal. The higher $\textbf{Q}$ is, the more important it is to get to the target state asap, the higher $\textbf{R}$ is, the more important it is to keep the control signal small as you go to the target state.

The goal of the LQR is to calculate a feedback gain matrix $\textbf{K}$ such that

$\textbf{u} = -\textbf{K} \textbf{x},$

drives the system to the target. When the system is a linear system with a quadratic cost function, this can be done optimally. There is lots of discussion elsewhere about LQRs and their derivation, so I’m not going to go into that with this post. Instead, I’m going to talk about applying LQRs to non-linear systems, and using finite differences to do it, which works when you have a readily accessible simulation of the system on hand. The fun part is that by using finite differences you can get this to work without working out the dynamics equations yourself.

Using LQRs on non-linear systems

As you may have noticed, non-linear systems violate the first assumption of a linear quadratic regulator; that the system is linear. That doesn’t mean that we can’t apply it, it just means that it’s not going to be optimal. How poorly the LQR will perform depends on a few things, two important factors being how non-linear the system dynamics actually are, and how often you’re able to update the feedback gain matrix $\textbf{K}$. To apply LQR to non-linear systems we’re just going to close our eyes and pretend that the system dynamics are linear, i.e. they fit the form

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u}.$

We’ll do this by approximating the actual dynamics of the system linearly. We’ll then solve for our gain value $\textbf{K}$, generate our control signal for this timestep, and then re-approximate the dynamics again at the next time step and solve for $\textbf{K}$ from the new state. The more non-linear the system dynamics are, the less appropriate $\textbf{K}$ will be for generating our control signal $\textbf{u}$ as we move away from the state $\textbf{K}$ was calculated in; this is why update time of the LQR can become an important factor.

Using finite-differences to approximate system dynamics

An important question, then, is how do we find this system approximation? How can we calculate the $\textbf{A}$ and $\textbf{B}$ matrices that we then use to solve for $\textbf{K}$? If we know the dynamics of the system to be

$\dot{\textbf{x}} = f(\textbf{x}, \textbf{u})$,

then we can calculate

$\textbf{A} = \frac{\partial f(\textbf{x}, \textbf{u})}{\partial \textbf{x}}, \;\;\;\; \textbf{B} = \frac{\partial f(\textbf{x}, \textbf{u})}{\partial \textbf{u}}$.

If you’re going to try this for the 3-link arm, though, get out Mathematica. Do not try this by hand. If you disregard my warning and foolhardily attempt such a derivation you will regret, repent, and then appeal to Wolfram Alpha for salvation. These equations quickly become terrible and long even for seemingly not-so-complicated systems.

There are a few ways to skirt this. Here we’re going to assume that the system under control is a simulation, or that we at least have access to an accurate model, and use the finite differences method to compute these values. The idea behind finite differences is to approximate the rate of change of the function $f$ at the point $x$ by sampling $f$ near $x$ and using the difference to calculate $\dot{f}(x)$. Here’s a picture for a 1D system:

So here, our current state $x$ is the blue dot, and the red dots represent the sample points $x + \Delta x$ and $x - \Delta x$. We can then calculate

$\dot{f}(x) \approx \frac{f(x+\Delta x) - f(x-\Delta x)}{2\Delta x},$

and you can see the actual rate of change of $f$ at $x$ plotted in the blue dashed line, and the approximated rate of change calculated using finite differences plotted in the red dashed line. We can also see that the approximated derivative is only accurate near $x$ (the blue dot).

Back in our multi-dimensional system, to use finite differences to calculate the derivative with respect to the state and the input we’re going to vary each of the dimensions of the state and input by some small amount one at a time, calculating the effects of each one by one. Here’s a chunk of pseudo-code to hopefully clarify this idea:

eps = 1e-5
A = np.zeros((len(current_state), len(current_state))
for ii in range(len(current_state)):
x = current_state.copy()
x[ii] += eps
x_inc = simulate_system(state=x, input=control_signal)
x = current_state.copy()
x[ii] -= eps
x_dec = simulate_system(state=x, input=control_signal)
A[:,ii] = (x_inc - x_dec) / (2 * eps)

B = np.zeros((len(current_state), len(control_signal))
for ii in range(len(control_signal)):
u = control_signal.copy()
u[ii] += eps
x_inc = simulate_system(state=current_state, input=u)
u = control_signal.copy()
u[ii] -= eps
x_dec = simulate_system(state=current_state, input=u)
B[:,ii] = (x_inc - x_dec) / (2 * eps)


Now we’re able to generate our $\textbf{A}$ and $\textbf{B}$ matrices we have everything we need to solve for our feedback gain matrix $\textbf{K}$! Which is great.

Note on using finite differences in continuous vs discrete setup

Something that’s important to straighten out too is what exactly is returned by the simulate_system function in the code above. In the continuous case, your system is captured as

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u},$,

where in the discrete case your system is defined

$\textbf{x}(t+1) = \textbf{A}\textbf{x}(t) + \textbf{B}\textbf{u}(t).$

If you are calculating your feedback gain matrix $\textbf{K}$ using the continuous solution to the algebraic Riccati equation, then you need to be returning $\dot{\textbf{x}}(t)$. If you’re solving for $\textbf{K}$ using the discrete solution to the algebraic Riccati equation you need to return $\textbf{x}(t+1)$. This was just something that I came across as I was coding and so I wanted to mention it here in case anyone else stumbled across it!

Applying LQR to 2 and 3 link arm control

Alright! Let’s have a look at how the LQR does controlling non-linear systems. Below we have the control of a 2-link arm compared to a 3-link arm, and you can see the control of the 2-link arm is better. This is a direct result of the dynamics of a 3-link arm being significantly more complex.

Note on controlling at different timesteps

When I was first testing the LQR controller I expected the effects of different control update times to be a lot more significant than it was. As it turns out, for controlling a 3-link arm, there’s not really a visible difference in a controller that is updating every .01 seconds vs every .001 seconds vs every .0001 seconds. Let’s have a look:

Can’t even tell, eh? Fun fact, the simulation took 1 minute 30 seconds at .01 seconds time step and 45 minutes at .0001 seconds time step. The left-most animation is the .01 seconds and the right-most the .0001 seconds. But why is there seemingly so little difference? Well, this boils down to the dynamics of the 3-link arm changing actually pretty slowly. Below I’ve plotted just a few of the elements from the $\textbf{A}$, $\textbf{B}$, and $\textbf{K}$ matrices over .5 seconds of simulation time:

So, there are some obvious points where sampling the dynamics at a .01 time step is noticeably less accurate, but all in all there’s not a huuuggge difference between sampling at .01 and .0001 seconds. If you’re just watching the end-effector path it’s really not very noticeable. You can see how the elements of $\textbf{A}$ and $\textbf{B}$ are changing fairly slowly; this means that $\textbf{K}$ is going to be an effective feedback gain for a fair chunk of time. And the computational savings you get by sampling the dynamics and regenerating $\textbf{K}$ every .01 seconds instead of every .0001 seconds are pretty big. This was just another thing that I came across when playing around with the LQR, the take away being don’t just assume you need to update your system crazy often. You might get very comparable performance for much less computational cost.

Conclusions

All in all, the LQR controller is pretty neat! It’s really simple to set up, and generic. We don’t need any specific information about the system dynamics, like we do for effective operational space control (OSC). When we estimate the dynamics with finite differences, all need is a decent system model that we can sample. Again, the more non-linear the system, of course, the less effective a LQR will be. If you’re interested in playing around with one, or generating the figures that I show above, the code is all up and running on my Github for you to explore.

Dynamic movement primitives part 3: Rhythmic movements

So far we’ve looked at using DMPs for discrete movements, but as I briefly mentioned it’s also possible to use them to perform rhythmic movements. In this post we’re going to look at the implementation of rhythmic DMPs and see exactly how it’s done. It’s actually pretty straightforward, which is always nich. We’ll go through everything that has to change and then see a couple of different applications of rhythmic DMPs. Again, all the code for the DMPs and arm control implementations can be found up on my Github.

Obtaining consistent and repeatable basis function activation

We’ll start with the canonical system. In the discrete case our canonical system (which drives the activation of the basis functions) decayed from $1$ to $0$ and then we were all done. In the rhythmic case we want our system to repeat indefinitely, so we need a reliable way of continuously activating the basis functions in the same order. One function that may come to mind that reliably repeats is the cosine function. But how to use the cosine function, exactly?

The idea is that we’ll lay out our basis functions in the range from $0$ to $2\pi$, and we’ll set the canonical system to be ever increasing linearly. Then we’ll use the difference between the canonical system state and each of the center points as the value we pass in to our cosine function. Because the cosine function repeats at $2\pi$ we’ll get a nice repeating spread of basis function activations. We’ll then throw these values through our Gaussian equation (with the gain and bias terms) and we’ll be done! Here’s the equation:

$\psi = \textrm{exp}(h * \textrm{cos}(x - c) - 1),$
where $x$ is the state of the canonical system, $c$ is the basis function center point, and $h$ is a gain term controlling the variance of the Gaussian. And here’s a picture of the activations of three basis functions with centers spread out evenly between $0$ and $2\pi$:

Here’s a picture of the basis function activations on a longer time scale:

As long as our canonical system increases at a steady pace the basis functions will continue to activate in a reliable and repeatable way. As mentioned, to get the canonical system to increase at a steady pace we simply give it linear dynamics:

$\dot{x} = 1$
The placement of the center points of the rhythmic system is a lot easier than in the discrete case because the rhythmic canonical system dynamics are linear.

Other differences between discrete and rhythmic

In the actual implementation of the rhythmic system there are a couple of other differences that you need to be aware of. The first is that there is no diminishing term in the rhythmic system, whereas there is in the discrete case.

The second is how to go about establishing goal states when imitating an path. We’re going to assume that whatever path is passed in to us is going to be a rhythmic pattern, and then goal state is going to be set as the center point of the desired trajectory. So for rhythmic systems the ‘goal’ is actually more like a center point for the trajectory.

And that’s pretty much it! If you look through the code that I have up for the rhythmic DMP subclass you’ll see that only a handful of functions needed to be redefined, the rest is the same! Well that’s great, let’s look at some applications.

Example

I was thinking of different examples that would be interesting to see of the rhythmic DMPs, and after a little consideration the most obvious application of rhythmic DMPs is really to a system that does something like walking. In the spinal cord of animals there are circuits which are known as central pattern generators (CPGs), and when stimulated these have been shown to generate rhythmic, repeated movements (as described in Principles of rhythmic motor pattern generation and other papers). Let’s go ahead and build our own CPG here using DMPs. I don’t have a simulation of a set of legs, however, so instead we’ll look at getting a single 3-link arm to reproduce a pattern of behaviour that you’d expect in a leg that is walking.

To do this, I’m going to need to specify the trajectories for the three joints of the arm, and I’ll be controlling the system in joint space, using the generalized coordinates controller, rather than the operational space controller used in the discrete DMP examples for controlling the end-effector position. So that’s one thing. Another thing is that I’m going to need the kinematics of leg joints during motion. After a quick search of the web I found this presentation from Simon Fraser University which has the following images:

Which is some very useful information! I just need to transfer this into a Numpy vector that I can feed into the system and convert degrees to radians and we should be good to go! To get the information out of the graph, use your favorite data stripper, I used http://arohatgi.info/WebPlotDigitizer/app/.

I also had to do a bit of a shuffle with the trajectories provided to make it look right on the arm. The hip to shoulder joint maps fine, but for the knee to elbow the flexion and extension are reversed, and for the wrist to foot the $0$ angle for the foot is straight out. Once those have been accounted for, though, everything works great. Here is an animation of a single leg walking along, if you use your imagination you can just picture a happy-go-lucky torso attached to this leg whistling while he gets his very human-like gait on:

And there you go! It’s neat how straightforward that was, once you have the data of the system you want to model. It shows the power of DMPs for coordinating multiple degrees of freedom quickly and easily. Also something that’s worth pointing out once again is that this is a force based controller. We were able to go from a kinematic description of the movement we wanted to a reliable system generating torques. It’s another good example of the dexterity of dynamic movement primitives.

You can see the code for everything here up online on my github control repo on the walking_demo branch. To run the code you should run:

python run.py


Dynamic movement primitives part 2: Controlling end-effector trajectories

The dynamic movement primitive (DMP) framework was designed for trajectory control. It so happens that in previous posts we’ve built up to having several arm simulations that are ripe for throwing a trajectory controller on top, and that’s what we’ll do in this post. The system that we will be controlling here is the 3 link arm model with an operational space controller (OSC) that translates end-effector forces into joint torques. The DMPs here will be controlling the $(x,y)$ trajectory of the hand, and the OSC will take care of turning the desired hand forces into torques that can be applied to the arm. All of the code used to generate the animations throughout this post can of course be found up on my github (to run play around with variants of python run.py arm3 dmp write).

Controlling a 3 link arm with DMPs
We have our 3 link arm and its OSC controller; this whole setup we’ll collectively refer to as ‘the plant’ throughout this post. We are going to pass in some $(x,y)$ force signal to the plant, and the plant will carry it out. Additionally we’ll get a feedback signal with the $(x,y)$ position of the hand. At the same time, we also have a DMP system that’s doing its own thing, tracing out a desired trajectory in $(x,y)$ space. We have to tie these two systems together.

To do this is easy, we’ll generate the control signal for the plant from our DMP system simply by measuring the difference between the state of our DMP system and the plant state, use that to drive the plant to the state of the DMP system. Formally,

$u = k_p(y_{\textrm{DMP}} - y)$

where $y_{\textrm{DMP}}$ is the state of the DMP system, $y$ is the state of the plant, and $k_p$ and is the position error gain term.

Once we have this, we just go ahead and step our DMP system forward and make sure the gain values on the control signal are high enough that the plant follows the DMP’s trajectory. And that’s pretty much it, just run the DMP system to the end of the trajectory and then stop your simulation.

To give a demonstration of DMP control I’ve set up the DMP system to follow the same number trajectories that the SPAUN arm followed. As you can see the combination of DMPs and operational space control is much more effective than my previous implementation.

Incorporating system feedback

One of the issues in implementing the control above is that we have to be careful about how quickly the DMP trajectory moves, because while the DMP system isn’t constrained by any physical dynamics, the plant is. Depending on the size of the movement the DMP trajectory may be moving a foot a second or an inch a second. You can see above that the arm doesn’t fully draw out the desired trajectories in places where the DMP system moved too quickly in and out and sharp corners. The only way to remedy this without feedback is to have the DMP system move more slowly throughout the entire trajectory. What would be nice, instead, would be to just say ‘go as fast as you can, as long as the plant state is within some threshold distance of you’, and this is where system feedback comes in.

It’s actually very straightforward to implement this using system feedback: If the plant state drifts away from the state of the DMPs, slow down the execution speed of the DMP to allow the plant time to catch up. The do this we just have to multiply the DMP timestep $dt$ by a new term:

$1 / (1 + \alpha_{\textrm{err}}(y_{\textrm{plant}} - y_{\textrm{DMP}}))$.

All this new term does is slow down the canonical system when there’s an error, you can think of it as a scaling on the time step. Additionally, the sensitivity of this term can be modulated the scaling term $\alpha_{\textrm{err}}$ on the difference between the plant and DMP states.

We can get an idea of how this affects the system by looking at the dynamics of the canonical system when an error term is introduced mid-run:

When the error is introduced the dynamics of the system slow down, great! Lets look at an example comparing execution with and without this feedback term. Here’s the system drawing the number 3 without any feedback incorporation:

and here’s the system drawing the number 3 with the feedback term included:

These two examples are a pretty good case for including the feedback term into your DMP system. You can still see in the second case that the specified trajectory isn’t traced out exactly, but if that’s what you’re shooting for you can just crank up the $\alpha_{\textrm{err}}$ to make the DMP timestep really slow down whenever the DMP gets ahead of the plant at all.

Interpolating trajectories for imitation

When imitating trajectories there can be some issues with having enough sample points and how to fit them to the canonical system’s timeframe, they’re not difficult to get around but I thought I would go over what I did here. The approach I took was to always run the canonical system for 1 second, and whenever a trajectory is passed in that should be imitated to scale the x-axis of the trajectory such that it’s between 0 and 1. Then I shove it into an interpolator and use the resulting function to generate an abundance of nicely spaced sample points for the DMP imitator to match. Here’s the code for that:

# generate function to interpolate the desired trajectory
import scipy.interpolate

path = np.zeros((self.dmps, self.timesteps))
x = np.linspace(0, self.cs.run_time, y_des.shape[1])

for d in range(self.dmps):

# create interpolation function
path_gen = scipy.interpolate.interp1d(x, y_des[d])

# create evenly spaced sample points of desired trajectory
for t in range(self.timesteps):
path[d, t] = path_gen(t * self.dt)

y_des = path


Direct trajectory control vs DMP based control

Now, using the above described interpolation function we can just directly use its output to guide our system. And, in fact, when we do this we get very precise control of the end-effector, more precise than the DMP control, as it happens. The reason for this is because our DMP system is approximating the desired trajectory and with a set of basis functions, and some accuracy is being lost in this approximation.

So if we instead use the interpolation function to drive the plant we can get exactly the points that we specified. The times when this comes up especially are when the trajectories that you’re trying to imitate are especially complicated. There are ways to address this with DMPs by placing your basis functions more appropriately, but if you’re just looking for the exact replication of an input trajectory (as often people are) this is a simpler way to go. You can see the execution of this in the trace.py code up on my github. Here’s a comparison of a single word drawn using the interpolation function:

and here’s the same word drawn using a DMP system with 1,000 basis function per DOF:

We can see that just using the interpolation function here gives us the exact path that we specified, where using DMPs we have some error, and this error increases with the size of the desired trajectory. But! That’s for exactly following a given trajectory, which is often not the case. The strength of the DMP framework is that the trajectory is a dynamical system. This lets us do simple things to get really neat performance, like scale the trajectory spatially on the fly simply by changing the goal, rather than rescaling the entire trajectory:

Conclusions

Some basic examples of using DMPs to control the end-effector trajectory of an arm with operational space control were gone over here, and you can see that they work really nicely together. I like when things build like this. We also saw that power of DMPs in this situation is in their generalizability, and not in exact reproduction of a given path. If I have a single complex trajectory that I only want the end-effector to follow once then I’m going to be better off just interpolating that trajectory and feeding the coordinates into the arm controller rather than go through the whole process of setting up the DMPs.

Drawing words, though, is just one basic example of using the DMP framework. It’s a very simple application and really doesn’t do justice to the flexibility and power of DMPs. Other example applications include things like playing ping pong. This is done by creating a desired trajectory showing the robot how to swing a ping pong paddle, and then using a vision system to track the current location of the incoming ping pong ball and changing the target of the movement to compensate dynamically. There’s also some really awesome stuff with object avoidance, that is implemented by adding another term with some simple dynamics to the DMP. Discussed here, basically you just have another system that moves you away from the object with a strength relative to your distance from the object. You can also use DMPs to control gain terms on your PD control signal, which is useful for things like object manipulation.

And of course I haven’t touched on rhythmic DMPs or learning with DMPs at all, and those are both also really interesting topics! But this serves as a decent introduction to the whole area, which has been developed in the Schaal lab over the last decade or so. I recommend further reading with some of these papers if you’re interested, there are a ton of neat ways to apply the DMP framework! And, again, the code for everything here is up on my github in the control and pydmps repositories.

Robot control part 6: Handling singularities

We’re back! Another exciting post about robotic control theory, but don’t worry, it’s short and ends with simulation code. The subject of today’s post is handling singularities.

What is a singularity

This came up recently when I had build this beautiful controller for a simple two link arm that would occasionally go nuts. After looking at it for a while it became obvious this was happening whenever the elbow angle reached or got close to 0 or $\pi$. Here’s an animation:

What’s going on here? Here’s what. The Jacobian has dropped rank and become singular (i.e. non-invertible), and when we try to calculate our mass matrix for operational space

$\textbf{M}_\textbf{x}(\textbf{q}) = (\textbf{J} (\textbf{q}) \; \textbf{M}^{-1} (\textbf{q}) \; \textbf{J}^T(\textbf{q}))^{-1},$

the values explode in the inverse calculation. Dropping rank means that the rows of the Jacobian are no longer linearly independent, which means that the matrix can be rotated such that it gives a matrix with a row of zeros. This row of zeros is the degenerate direction, and the problems come from trying to send forces in that direction.

To determine when the Jacobian becomes singular its determinant can be examined; if the determinant of the matrix is zero, then it is singular. Looking the Jacobian for the end-effector:

$\textbf{J}(\textbf{q}) = \left[ \begin{array}{cc} -L_0 sin(q_0) - L_1 sin(q_0 + q_1) & -L_1 sin(q_0 + q_1) \\ L_0 cos(q_0) + L_1 cos(q_0 + q_1) & L_1 cos(q_0 + q_1) \end{array} \right].$

When $q_1 = 0$ it can be that $sin(q_0 + 0) = sin(q_0)$, so the Jacobian becomes

$\textbf{J}(\textbf{q}) = \left[ \begin{array}{cc} (L_0 + L_1)(-sin(q_0)) & -L_1 sin(q_0) \\ (L_0 + L_1) cos(q_0) & L_1 cos(q_0) \end{array} \right],$

which gives a determinant of

$(L_0 + L_1)(-sin(q_0))(L_1)(cos(q_0)) - (L_1)(-sin(q_0))(L_0 + L_1)(cos(q_0)) = 0.$

Similarly, when $q_1 = \pi$, where $sin(q_0 + \pi) = -sin(q_0)$ and $cos(q_0 + \pi) = -cos(q_0)$, the Jacobian is

$\textbf{J}(\textbf{q}) = \left[ \begin{array}{cc} -(L_0 - L_1) sin(q_0) & L_1 sin(q_0) \\ (L_0 + L_1) cos(q_0) & - L_1 cos(q_0) \end{array} \right].$

Calculating the determinant of this we get

$(L_0 + L_1)(-sin(q_0))(L_1)(-cos(q_0)) - (L_1)(sin(q_0))(L_0 + L_1)(-cos(q_0)) = 0.$

Note that while in these cases the Jacobian is a square matrix in the event that it is not a square matrix, the determinant of $\textbf{J}(\textbf{q})\;\textbf{J}^T(\textbf{q})$ can be found instead.

Fixing the problem

When a singularity is occurring it can be detected, but now it must be handled such that the controller behaves appropriately. This can be done by identifying the degenerate dimensions and setting the force in those directions to zero.

First the SVD decomposition of $\textbf{M}_\textbf{x}^{-1}(\textbf{q}) = \textbf{V}\textbf{S}\textbf{U}^T$ is found. To get the inverse of this matrix (i.e. to find $\textbf{M}_\textbf{x}(\textbf{q})$) from the returned $\textbf{V}, \textbf{S}$ and $\textbf{U}$ matrices is a matter of inverting the matrix $\textbf{S}$:

$\textbf{M}_\textbf{x}(\textbf{q}) = \textbf{V} \textbf{S}^{-1} \textbf{U}^T,$

where $\textbf{S}$ is a diagonal matrix of singular values.

Because $\textbf{S}$ is diagonal it is very easy to find its inverse, which is calculated by taking the reciprocal of each of the diagonal elements.

Whenever the system approaches a singularity some of the values of $\textbf{S}$ will start to get very small, and when we take the reciprocal of them we start getting huge numbers, which is where the value explosion comes from. Instead of allowing this to happen, a check for approaching the singularity can be implemented, which then sets the singular values entries smaller than the threshold equal to zero, canceling out any forces that would be sent in that direction.

Here’s the code:

Mx_inv = np.dot(JEE, np.dot(np.linalg.inv(Mq), JEE.T))
if abs(np.linalg.det(np.dot(JEE,JEE.T))) > .005**2:
# if we're not near a singularity
Mx = np.linalg.inv(Mx_inv)
else:
# in the case that the robot is entering near singularity
u,s,v = np.linalg.svd(Mx_inv)
for i in range(len(s)):
if s[i] < .005: s[i] = 0
else: s[i] = 1.0/float(s[i])
Mx = np.dot(v, np.dot(np.diag(s), u.T))


And here’s an animation of the controlled arm now that we’ve accounted for movement when near singular configurations:

As always, the code for this can be found up on my Github. The default is to run using a two link arm simulator written in Python. To run, simply download everything and run the run_this.py file.

Everything is also included required to run the MapleSim arm simulator. To do this, go into the TwoLinkArm folder, and run python setup.py build_ext -i. This should compile the arm simulation to a shared object library that Python can now access on your system. To use it, edit the run_this.py file to import from TwoLinkArm/arm_python to TwoLinkArm/arm and you should be good to go!
More details on getting the MapleSim arm to run can be found in this post.

Likelihood calculus paper series review part 2 – Neuro-mechanical control using differential stochastic operators

The second paper put out by Dr. Terence Sanger in the likelihood calculus paper series is Neuro-mechanical control using differential stochastic operators. Building on the probabalistic representation of systems through differential stochastic operators presented in the last paper (Controlling variability, which I review here) Dr. Sanger starts exploring how one could effect control over a system whose dynamics are described in terms of these operators. Here, he specifically looks at driving a population of neurons described by differential stochastic operators to generate the desired system dynamics. Neural control of a system requires that several phenomena outside the realm of classical control theory be addressed, including the effects of variability in control due to stochastic firing, large partially unlabeled cooperative controllers, bandlimited control due to finite neural resources, and variation in the number of available neurons.

The function of a neuron in a control system can be completely described in terms of 1) the probability of it spiking due to the current system state, $p(s=1|x)$, and 2) the effect of its response on the change in the state. Due to the inherent uncertainty in these systems, each individual neuron’s effect on the change in state is captured by a distribution, $p(\dot{x}|s)$. And because the effect of each neuron is only a small part of a large dynamical system that includes the dynamics of the system being controlled and the effects of all the other neurons, these distributions tend to be very broad.

Rephrasing the above description, neurons are mapping a given state $x$ to a change in state $\dot{x}$. Instead of using two conditional densities to describe this mapping, $p(s|x)$ and $p(\dot{x}|s)$, we can rewrite this more compactly as

$p(\dot{x}|x) = p(\dot{x}|s=1)\;p(s=1|x) + p(\dot{x}|s=0)\;p(s=0|x)$,

which can be read as the probability of a change in state $\dot{x}$ given the current state $x$ is equal to the probability of that change in state occurring if the neuron spikes, $p(\dot{x}|s=1)$, multiplied by the probability of that neuron spiking given the current state, $p(s=1|x)$, plus the probability of that state occurring if the neuron doesn’t spike, $p(\dot{x}|s=0)$, multiplied by the probability of that neuron not spiking, $p(s=0|x)$.

Differential stochastic operators

We want to capture the mapping $p(\dot{x}|x)$ in such a way that if we have a description of a bunch of controllers (such as neurons) and the change in system state they effect individually, we can combine them in a straightforward way to get the overall change in state resulting from all the controllers operating in parallel. To do this we can use the linear operators developed in the previous paper, which allows us combine the effects of multiple components through simple summation to determine the overall change in system state. We’ll go over it again here, as I find reading through several different versions of an idea very helpful for solidifying understanding.

Let $\mathcal{L}$ denote a class of linear operators that act on time-varying probability densities, $p(x,t)$, such that $\frac{\partial}{\partial t}p(x,t) = \mathcal{L}p(x,t)$. Because these operators need to preserve the properties of valid probability density (specifically that $\int p(x)dx = 1$ and $p(x) \geq 0$), for a given operator $L$ where $\dot{p}(x) = \int L(x,y) p(y) dy$ we require that:

• 1) $\int L(x,y)dx = 0$ for all $y$,
• 2) $L(x,y) \geq 0$ whenever $x \neq y$,

which respectively enforce the aforementioned constraints.

So, first thing’s first. Let’s read out $\dot{p}(x) = \int L(x,y) p(y) dy$. This says that our change in the probability density, $\dot{p}(x)$, is found by taking our function that tells us what the change in density is for system state $x$ given our current state $y$, which is $L(x,y)$, and weighting that by the probability of currently being in state $y$, which is $p(y)$, then summing that all up, which is the integral.

Now the constraints. The first constraint reads out as the integral of the changes of the probability density at each point $x$ for a given state $y$ must be equal to 0. This means that the area of the probability density over the states after updating them is the same. So, assuming we start out with a valid density whose sum equals 1, we always have a density whose sum equals 1.

The second constraint reads out as the change in probability density for state $x$ given a current state $y$ must be greater than zero whenever $x \neq y$. This means the only time that the change in the probability density can be negative is if there is a probability of being in that state; it enforces that all $p(x) \geq 0$, because $\dot{p}$ can’t be negative when $p(x)$ is zero.

Dr. Sanger defines the linear operators that satisfy these two conditions to be “differential stochastic operators”. The discrete time versions are matrices, dubbed “difference stochastic operators”.

Superposition of differential stochastic operators

Differential stochastic operators can be derived in different ways, here we’ll go through the derivation from the ‘master’ equation defining $p(\dot{x}|x)$, and from a stochastic differential equation. They each have their insights, so it’s helpful to work through both.

Derivation from master equation

The equation for $p(\dot{x}|x)$ written out above,

$p(\dot{x}|x) = p(\dot{x}|s=1)\;p(s=1|x) + p(\dot{x}|s=0)\;p(s=0|x)$,

determines the probability flow on the state. By employing the Kramers-Moyal expansion we can capture this probability flow through a partial differential equation describing change in probability density. In other words, instead of capturing the time evolution of the system state with a probability density over possible changes in state, $\dot{x}$, we capture it through the changing in probability of each state, $\dot{p}(x)$. The Kramers-Moyal expansion looks like:

$\frac{\partial}{\partial t}p(x,t) = -\frac{\partial}{\partial x}(D_1(x)p(x,t)) + \frac{1}{2}\frac{\partial^2}{\partial x^2}(D_1(x)p(x,t)) + ...$,

where $D_k(x) = E[(\dot{x})^k] = \int \dot{x}^k p(\dot{x}|x) d\dot{x}$. Truncating this expansion at the first two terms we get the Fokker-Planck equation, where the first term describes the drift of the density, and the second term the diffusion. These two terms are sufficient for describing Gaussian conditional densities, which capture many physical phenomena. In the case where $p(\dot{x}|x)$ does not have a Gaussian distribution, higher-order terms from the Kramers-Moyal expansion will need to be included.

Now, imagine we have a simple system of two neurons, where the change in state is defined as $\dot{x} = \dot{x}_1 + \dot{x}_2$. If these neurons have conditionally independent variability, i.e. $p(\dot{x}_1 \dot{x}_2 | x) = p(\dot{x}_1|x)p(\dot{x}_2|x)$, then we can sum the Kramers-Moyal expansion of each of these terms to describe the evolution of the overall system state:

$\frac{\partial}{\partial t}p(x,t) = - \sum_i \frac{\partial}{\partial x}(D_{1i}(x)p(x,t)) + \frac{1}{2} \sum_i \frac{\partial^2}{\partial x^2}(D_{2i}(x)p(x,t)) + ...$,

as long as the neurons have conditionally independent variability. This means that they can’t be connected (directly or indirectly) such that a spike in one neuron causes a spike in the other. While this might seem a poor assumption for modeling networks of spiking neurons, in large populations with many input, the effects of any single input neuron tends to be small enough that the assumption holds approximately.

We can rewrite the previous equation now, taking advantage of linearity of $p(x,t)$ and the Kramers-Moyal coefficients, to get

$\dot{p}(x,t) = \mathcal{L}p = \sum_i \mathcal{L}_i p$,

which means that by describing neurons with the differential stochastic operators, $\mathcal{L}_i$, we can determine the cumulative effect on the dynamics of the system through simple summation. Which we all remember from the last paper, but hey, good to review.

Now, in the event that we want to model the effect of a group of strongly interconnected neurons, we can instead consider the effect of the group as a function of the $2^n$ possible firing patterns (spike or no spike from each of the neurons). So where before $p(\dot{x}| x)$ was written in terms of the two cases $s = 0$ and $s = 1$, it would now be written:

$p(\dot{x}|x) = \sum_{i=1}^{2^n} p(\dot{x}|x,i)p(i)$,

where each $i$ is a different spike pattern. This group on neurons and their effect on the system dynamics is then considered as a single unit, and the differential stochastic operator describing them can then be combined with the operator from another group of neurons, provided that the two groups exhibit conditionally independent variability.

If there are no independent groups in the network then it’s fully connected and this is not for you go home.

Derivation from a stochastic differential equation

For this derivation, imagine a set of controllers operating in parallel. Each controller has a stochastic differential equation that defines how a control signal input affects the system dynamics,

$dx = f_i(x)dt + g_i(x)dB_i$,

where $f_i$ and $g_i$ are arbitrary equations and $dB_i$ are random functions of time, or noise terms. Let $f_i(x)dt$ for $i > 0$ be the controller equations, and $f_0(x)dt$ be the unforced, or passive, dynamics of the system, which define how the system behaves without controller input. We can write the equation for the whole system as

$dx = f_0(x)dt + \sum_{i>0}u_i(f_i(x)dt + g_i(x)dB_i)$,

where $u_i$ are constant or slowly varying control inputs. The reason we would choose to denote the unforced (passive) dynamics of the system as $f_0$ is because we can now define $u_0 = 1$, and rewrite the above equation as

$dx = \sum_{i}u_i(f_i(x)dt + g_i(x)dB_i)$.

The corresponding Fokker-Planck equation for the evolution of the state probability density is

$\frac{\partial}{\partial t}p(x,t) = - \sum_i u_i \frac{\partial}{\partial x}(f_i(x)p(x,t)) + \frac{1}{2} \sum_i u_i \frac{\partial^2}{\partial x^2}(g_i(x)p(x,t))$.

Look familiar? We can rewrite this as a superposition of linear operators

$\dot{p}(x,t) = \mathcal{L}p = \sum_i u_i \mathcal{L}_i p$,

and there you go.

Population model

So, now we can apply this superposition of differential stochastic equations to describe the effect of a population of neurons on a given system. Dr. Sanger lists several ways that this model can go about being controlled; 1) modifying the tuning curves of the neurons, which specifies how they respond to stimulus; 2) modify the output functions that determines the effect that a neuron has on the dynamics of the system; and 3) modifying the firing threshold of the neurons.

I found the difference between 1 and 3 can be a little confusing, so let’s look at an example tuning curve in 2D space to try to make this clear. Imagine a neuron sensitive to -dimensional input signals, and that it’s tuning curve looks like this:

If we’re changing the tuning curve, then how this neuron responds to its input stimulus will change. For example, in this diagram we show changing the tuning curve of a neuron:

Here, the neuron no longer responds to the same type of input that it responded to previously. We have made a qualitative change to the type of stimulus this neuron responds to.

If we change the firing threshold, however, then what we’re changing is when the neuron starts responding to stimulus that it is sensitive to.

Here we show the neuron becoming more and more sensitive to a its stimulus, respond stronger sooner and sooner. So the type of signal that the neuron responds to isn’t changing, but rather when the neuron starts responding.

Alright, now that we’ve got that sorted out, let’s move on.
Tuning curves (1) and output functions (2) are both modifiable through learning, by changing the incoming and outgoing connection weights, respectively, but for controlling systems on the fly this is going to be too slow, i.e. slower than the speed at which the system moves. So what’s left is (3), modifying the firing threshold of the neurons. So the model then looks like:

where $p(x)$ is projected in to a population of neurons, each with a stochastic differential operator that sum together to generate $\dot{p}(x)$. In this diagram, $\lambda_i$ is the firing threshold of neuron $i$, and $\lambda_i(x)$ denotes the modulation of the firing rate of neuron $i$ as a function of the current system state. When the modulation is dependent on the system state we have a feedback, or closed-loop, control system. Dr. Sanger notes that in the case that $\lambda_i$ is heavily dependent on $x$, modulating the firing threshold is indistinguishable from modifying the tuning curve, meaning that we can get some pretty powerful control out of this.

Conclusions

‘The theory of differential stochastic operators links the dynamics of individual neurons to the dynamics of a full neuro-mechanical system. The control system is a set of reflex elements whose gain is modulated in order to produce a desired dynamics of the overall system.’

This paper presents a very different formulation of control than classical control theory. Here, the goal is to modulate the dynamics of the system to closely match a desired set of dynamics that achieve some task, rather than to minimize the deviation from some prespecified trajectory through state space. Dr. Sanger notes that this description of control matches well to behavioral descriptions of neural control system, where there are numerous subsystems and circuits that have reflexive responses to external stimuli which must be modulated to achieve desired behavior. The goal of control is to fully define the dynamics of the system’s reaction to the environment.

What comes to mind first for me, in terms of using the modulation of reflex elements to effect a desired set of dynamics, is modeling the spinocerebellum. With a ton of projections directly to the spinal cord, and strong implications in locomotor and balance system, it seems like it would be a very good candidate for being modeled with this type of control. The idea being that the cerebellum is projecting modulatory values to the different spinal circuits (reflex networks and central pattern generators, for example) that specify how to respond to changes in the environment to maintain our balance or the rhythm of our walk. How we go about specifying exactly what those modulatory terms need to be is something that Dr. Sanger tackles in the last paper of this series, which I’ll be reviewing in the next couple of months. I’m looking forward to it.

On another note, in my lab we all work with the Neural Engineering Framework, in which populations of neurons are taken to represent vectors, and to perform transformations on these vectors to relay information an perform various functions. To this end, something that interests me about likelihood calculus is its application to populations of neurons representing vectors. Instead of finding $p(\dot{x}|x)$ by summing the effects of all of the neurons in a population, or defining it in terms of the population spiking patterns, we’re looking at it in terms of the different vectors this population can represent, and the effect of that vector on the system. So we can still have spiking neuron based models, but we can do all the likelihood calculus business one level removed, simplifying the calculation and reducing the number of differential stochastic operators needed.

There are a ton of things going on in this paper, and lots to think about. At several points I deviated from the notation used in this paper because I found it unclear, but aside from that I’ve really enjoyed reading through it and writing it up. Very interesting work.

Sanger TD (2010). Neuro-mechanical control using differential stochastic operators. Conference proceedings : … Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Conference, 2010, 4494-7 PMID: 21095779

Likelihood calculus paper series review part 1 – Controlling variability

Dr. Terry Sanger has a series of papers that have come out in the last few years describing what he has named ‘likelihood calculus’. The goal of these papers is to develop a ‘a theory of optimal control for variable, uncertain, and noisy systems that nevertheless accomplish real-world tasks reliably.’ The idea being that successful performance can be thought of as modulating variance of movement, allocating resources to tightly control motions when required and allowing variability in task-irrelevant dimensions. To perform variability modulation, we first need a means of capturing mathematically how the features of an uncertain controller operating affect variability in system movement. Defining terms quickly, the features of a controller are the different components that produce signals resulting in movement, variability is taken here to be the trial-to-trial variation in movements, and uncertainty means that the available sensory feedback does not uniquely determine the true state of the world, where uncertainty can arise from noise on sensory feedback signals, unmodeled dynamics, and/or quantitization of sensory feedback. To capture all this uncertainty and variability, probability theory will naturally be employed. In this post I will review the paper ‘Controlling variability’ (2010) by Dr. Sanger, which sets up the framework for describing the time course of uncertainty during movement.

Using probability in system representations

So, here’s a picture of the system our controller (the brain) is in:

There’s the input initial state $x$, and the output change in state, $\dot{x}$, which is generated as a combination of the unforced dynamics of the world and the control dynamics effected by the brain. But since we’re dealing with uncertainty and variability, we’re going to rewrite this such that given an initial state $x$, we get a probability distribution over potential changes in state, $p(\dot{x}|x)$, which specifies the likelihood of each change in state $\dot{x}$, given our initial state probability distribution $p(x)$. So in our system diagram, the word and the brain both define probability distributions over the possible changes in state, $p_1(\dot{x}|x)$ and $p_1(\dot{x}|x)$, respectively, which then combine to create the overall system dynamics $p(\dot{x}|x)$. Redrawing our picture to incorporate the probabilities, we get:

One may ask: how do these probabilities combine? Good question! What we’d like to be able to do is combine them through simple linear operators, because they afford us massive simplifications in our calculations, but the combination of $p_0(\dot{x}|x)$ and $p_1(\dot{x}|x)$ isn’t as simple as summing and normalizing. The reason why can be a little tricky to tease out if you’re unfamiliar with probability, but will becomes clear with some thought. Consider what it means to go about combining the probabilities in this way. Basically, if you sum and normalize, then the result is saying that there is a 50% chance of doing what the brain says to do, and a 50% chance of doing what the world says to do, and doesn’t capture what is actually going to happen, which is an interaction between the effects of the brain and the world. A good example for thinking about this is rolling dice. If you roll each die individually, you have an equal chance of the result being each number 1-6, but if you roll two dice, the overall system probability of rolling numbers changes in a highly nonlinear fashion:

Suddenly there is a 0% chance of the result being a 1, and the probability of rolling a number increases in likelihood until you get to 7, at which point the likelihood decreases with ascending numbers; there is a nonlinear interaction at play that can’t be captured by summing the probabilities of rolling a number on each die individually and normalizing.

So summing the probability distributions over the possible changes in state, $\dot{x}$, isn’t going to work, but is there another way to combine these through linear operators? The answer is yes, but it’s going to require us to undergo a bit of a paradigm shift and expand our minds. What if, instead of capturing the change in the system by looking at $p(\dot{x}|x)$, the probability distribution over possible changes in state given the current state, we instead capture the dynamics of the system by defining $\dot{p}(x)$, the change in the probability of states through time? Instead of describing how the system evolves through the likelihood of different state changes, the dynamics are captured by defining the change in likelihood of different states; we are capturing the effect of the brain and the world on the temporal evolution of state probability. Does that freak you out?

Reworking the problem

Hopefully you’re not too freaked out. Now that you’ve worked your head around this concept, let’s look at $\dot{p}(x)$ a little more closely. Specifically, let’s look at the Kramers-Moyal expansion (for a one-dimensional system):

$\frac{\partial p(x,t)}{\partial t} = \sum_{k=1}^{\inf} \left( - \frac{\partial}{\partial x} \right)^k \{a_k(x) p(x)\} / k!$,

$a_k(x) = \int \dot{x}^k p(\dot{x}|x) d\dot{x}$.

As Dr. Sanger notes, this is a daunting equation, but it can be understood relatively easily. The left side, $\frac{\partial p(x,t)}{\partial t} = \dot{p}_t(x)$, is the rate of change of probability at each point $x$ at time $t$. The right side is just the Taylor series expansion. If we take the first two terms of the Taylor series expansion, we get:

$\frac{\partial p(x,t)}{\partial t} = - a_1 \frac{\partial}{\partial x} p(x) + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} p(x)$,

where the first describes how the probability drifts (or shifts / translates), $a_1$ being the average value of $\dot{x}$ for each value of $x$. The second term relates the rate of diffusion, $a_2$ being the second moment of the speed $\dot{x}$, describing the amount of spread in different possible speeds, where greater variability in speed leads to an increased spread of the probability. This is the Fokker-Planck equation, which describes the evolution of a physical process with constant drift and diffusion. At this point we make the assumption that our probability distributions are all going to be in the form of Gaussians (for which the Fokker-Planck equation exactly describes evolution of the system through time, and can arguably act as a good approximation to neural control systems where movement is based on the average activity of populations of neurons).

As an example of this, think of a 1-dimensional system, and a Gaussian probability distribution describing what state the system is likely to be in. The first term $a_1$ is the average rate of change $\dot{x}$ across the states $x$ the system could be in. The probability shifts through state space as specified by $a_1$. Intuitively, if you have a distribution with mean around position 1 and the system velocity is 4, then the change in your probability distribution, $p(x)$, should shift the mean to position 5. The second term, $a_2$ is a measure of how wide the range of different possible speeds $\dot{x}$ is. The larger the range of possible values, the less certain we become about the location of the system as it moves forward in time; the greater the range of possible states the system might end up in in the next time step. This is reflected by the rate of diffusion of the probability distribution. If we know for sure the speed the system moved at (i.e. all possible states will move with a specific $\dot{x}$), then we simply translate the mean of probability distribution. If however there’s uncertainty in the speed at which the system is moving, then the correct location (reflecting the actual system position) for the mean of the probability distribution could be one of a number of values. This is captured by increasing the width of (diffusing) the Gaussian.

Linear operators

Importantly, the equations above are linear in terms of $p(x)$. This means we can rearrange the above equation:

$\frac{\partial p(x,t)}{\partial t} = \left( -a_1 \frac{\partial}{\partial x} + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} \right) p(x)$,

letting $\mathcal{L} = \left( -a_1 \frac{\partial}{\partial x} + \frac{a_2}{2} \frac{\partial^2}{\partial x^2} \right)$, we have

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L} p(x)$.

Now we can redraw our system above as

where

$\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_1$,

which is the straightforward combination of the different contributions of each of the brain and the world to the overall system state probability. How cool is that?

Alright, calm down. Time to look at using these operators. Let’s assume that the overall system dynamics $\mathcal{L}$ hold constant for some period of time (taking particular care to note that ‘constant dynamics’ does not mean that a single constant output is produced irrespective of the input state, but rather that a given input state $x$ always produces the same result while the dynamics are held constant), and we have discretized our representation of $x$ (to be a range of some values, i.e. -100 to 100) then we can find the state probability distribution at time $T$ by calculating

$p(x, T) = A^T p(x,0)$,

where $A^T = e^{T\mathcal{L}}$.

When combining these $\mathcal{L}$ operators, if we sum them, i.e. overall system dynamics

$\mathcal{L} = \mathcal{L}_0 + \mathcal{L}_1$,

and then apply them to the probability

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}p(x)$

this is saying that the dynamics provided by $\mathcal{L}_0$ and $\mathcal{L}_1$ are applied at the same time. But if we multiply the component dynamics operators,

$\mathcal{L} = \mathcal{L}_1 \mathcal{L}_0$

then when we apply them we have

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}p(x) = \mathcal{L}_1 \mathcal{L}_0 p(x)$,

which is interpreted as applying $\mathcal{L}_0$ to the system, and then applying $\mathcal{L}_1$. Just basic algebra, but it allows us to apply simultaneously and sequentially the dynamics generated by our contributing system components (i.e. the brain and the world).

Capturing the effects of control

So now we have a representation of the system dynamics operating with variability and under uncertainty, we’re talking about building a tool to use for controlling these systems though, so where does the control signal $u$ fit in? The $\mathcal{L}$ operator is made to be a function of the control signal, describing the probablistic effect of the control signal given the possible initial states in the current state probability distribution $p(x)$. Thus the change in state probability is now written

$\frac{\partial p(x,t)}{\partial t} = \mathcal{L}(u)p(x)$.

Suppose that we drive a system with constant dynamics $\mathcal{L}(u_1)$ for a period of time $T_1$, at which point we change the control signal and drive the system with constant dynamics $\mathcal{L}(u_2)$ for another period of time $T_2$. The state of the system now be calculated

$p(x, T_1 + T_2) = e^{T_2\mathcal{L}(u_2)}e^{T_1\mathcal{L}(u_1)}p(x,0) = A_{u_2}^{T_2}A_{u_1}^{T_1}p(x,0)$

using the sequential application of dynamics operators discussed above.

Conclusion

And that is the essence of the paper ‘Controlling variability’. There is an additional discussion about the relationship to Bayes’ rule, which I will save for another post, and an example, but this is plenty for this post.

The main point from this paper is that we shouldn’t be focusing on the values of states as the object of control, but rather the probability densities of states. By doing this, we can capture the uncertainty in systems and work towards devising an effecting means of control. So, although the paper is called ‘Controlling variability’, the discussion of how to actually control variability is saved for later papers. All the same, I thought this was a very interesting paper, enjoyed working through it, and am looking forward to the rest of the series.

Sanger TD (2010). Controlling variability. Journal of motor behavior, 42 (6), 401-7 PMID: 21184358