## Deep learning for control using augmented Hessian-free optimization

Traditionally, deep learning is applied to feed-forward tasks, like classification, where the output of the network doesn’t affect the input to the network. It is a decidedly harder problem when the output is recurrently connected such that network output affects the input. Because of this application of deep learning methods to control was largely unexplored until a few years ago. Recently, however, there’s been a lot of progress and research in this area. In this post I’m going to talk about an implementation of deep learning for control presented by Dr. Ilya Sutskever in his thesis Training Recurrent Neural Networks.

In his thesis, Dr. Sutskever uses augmented Hessian-free (AHF) optimization for learning. There are a bunch of papers and posts that go into details about AHF, here’s a good one by Andrew Gibiansky up on his blog, that I recommend you check out. I’m not going to really talk much here about what AHF is specifically, or how it differs from other methods, if you’re unfamiliar there are lots of places you can read up on it. Quickly, though, AHF is kind of a bag of tricks you can use with a fast method for estimating the curvature of the loss function with respect to the weights of a neural network, as well as the gradient, which allows it to make larger updates in directions where the loss function doesn’t change quickly. So rather than estimating the gradient and then doing a small update along each dimension, you can make the size of your update large in directions that change slowly and small along dimensions where things change quickly. And now that’s enough about that.

In this post I’m going to walk through using a Hessian-free optimization library (version 0.3.8) written by my compadre Dr. Daniel Rasmussen to train up a neural network to train up a 2-link arm, and talk about the various hellish gauntlets you need run to get something that works. Whooo! The first thing to do is install this Hessian-free library, linked above.

I’ll be working through code edited a bit for readability, to find the code in full you can check out the files up on my GitHub.

Build the network

Dr. Sutskever specified the structure of the network in his thesis to be 4 layers: 1) a linear input layer, 2) 100 Tanh nodes, 3) 100 Tanh nodes, 4) linear output layer. The network is connected up with the standard feedforward connections from 1 to 2 to 3 to 4, plus recurrent connections on 2 and 3 to themselves, plus a ‘skip’ connection from layer 1 to layer 3. Finally, the input to the network is the target state for the plant and the current state of the plant. So, lots of recursion! Here’s a picture:

The output layer connects in to the plant, and, for those unfamiliar with control theory terminology, ‘plant’ just means the system that you’re controlling. In this case an arm simulation.

Before we can go ahead and set up the network that we want to train, we also need to specify the loss function that we’re going to be using during training. The loss function in Ilya’s thesis is a standard one:

$L(\theta) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{u}_t) + \ell_f(\textbf{x}_N),$
$\ell(\textbf{u}_t) = \alpha \frac{||\textbf{u}_t||^2}{2},$
$\ell_f(\textbf{x}_N) = \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2}$

where $L(\theta)$ is the total cost of the trajectory generated with $\theta$, the set of network parameters, $\ell(\textbf{u})$ is the immediate state cost, $\ell_f(\textbf{x})$ is the final state cost, $\textbf{x}$ is the state of the arm, $\textbf{x}^*$ is the target state of the arm, $\textbf{u}$ is the control signal (torques) that drives the arm, and $\alpha$ is a gain value.

To code this up using the hessianfree library we do:

from hessianfree import RNNet
from hessianfree.nonlinearities import (Tanh, Linear, Plant)
from hessianfree.loss_funcs import SquaredError, SparseL2

l2gain = 10e-3 * dt # gain on control signal loss
rnn = RNNet(
# specify the number of nodes in each layer
shape=[num_states * 2, 96, 96, num_states, num_states],
# specify the function of the nodes in each layer
layers=[Linear(), Tanh(), Tanh(), Linear(), plant],
# specify the layers that have recurrent connections
rec_layers=[1,2],
# specify the connections between layers
conns={0:[1, 2], 1:[2], 2:[3], 3:[4]},
# specify the loss function
loss_type=[
# squared error between plant output and targets
SquaredError(),
# penalize magnitude of control signal (output of layer 3)
SparseL2(l2gain, layers=[3])],
use_GPU=True)


Note that if you want to run it on your GPU you’ll need PyCuda and sklearn installed. And a GPU.

An important thing to note as well is that in Dr. Sustkever’s thesis when we’re calculating the squared error of the distance from the arm state to the target, this is measured in joint angles. So it’s kind of a weird set up to be looking at the movement of the hand but have your cost function in joint-space instead of end-effector space, but it definitely simplifies training by making the cost more directly relatable to the control signal. So we need to calculate the joint angles of the arm that will have the hand at different targets around a circle. To do this we’ll take advantage of our inverse kinematics solver from way back when, and use the following code:

def gen_targets(arm, n_targets=8, sig_len=100):
#Generate target angles corresponding to target
#(x,y) coordinates around a circle
import scipy.optimize

x_bias = 0
if arm.DOF == 2:
y_bias = .35
dist = .075
elif arm.DOF == 3:
y_bias = .5
dist = .2

# set up the reaching trajectories around circle
targets_x = [dist * np.cos(theta) + x_bias \
for theta in np.linspace(0, np.pi*2, 65)][:-1]
targets_y = [dist * np.sin(theta) + y_bias \
for theta in np.linspace(0, np.pi*2, 65)][:-1]

joint_targets = []
for ii in range(len(targets_x)):
joint_targets.append(arm.inv_kinematics(xy=(targets_x[ii],
targets_y[ii])))
targs = np.asarray(joint_targets)

# repeat the targets over time
for ii in range(targs.shape[1]-1):
targets = np.concatenate(
(np.outer(targs[:, ii], np.ones(sig_len))[:, :, None],
np.outer(targs[:, ii+1], np.ones(sig_len))[:, :, None]), axis=-1)
targets = np.concatenate((targets, np.zeros(targets.shape)), axis=-1)
# only want to penalize the system for not being at the
# target at the final state, set everything before to np.nan
targets[:, :-1] = np.nan

return targets


And you can see in the last couple lines that to implement the distance to target as a final state cost penalty only we just set all of the targets before the final time step equal to np.nan. If we wanted to penalize distance to target throughout the whole trajectory we would just comment that line out.

Create the plant

You’ll notice in the code that defines our RNN I set the last layer of the network to be plant, but that that’s not defined anywhere. Let’s talk. There are a couple of things that we’re going to need to incorporate our plant into this network and be able to use any deep learning method to train it. We need to be able to:

1. Simulate the plant forward; i.e. pass in input and get back the resulting plant state at the next timestep.
2. Calculate the derivative of the plant state with respect to the input; i.e. how do small changes in the input affect the state.
3. Calculate the derivative of the plant state with respect to the previous state; i.e. how do small changes in the plant state affect the state at the next timestep.
4. Calculate the derivative of the plant output with respect to its state; i.e. how do small changes in the current position of the state affect the output of the plant.

So 1 is easy, we have the arm simulations that we want already, they’re up on my GitHub. Number 4 is actually trivial too, because the output of our plant is going to be the state itself, so the derivative of the output with respect to the state is just the identity matrix.

For 2 and 3, we’re going to need to calculate some derivatives. If you’ve read the last few posts you’ll note that I’m on a finite differences kick. So let’s get that going! Because no one wants to calculate derivatives!

Important note, the notation in these next couple pieces of code is going to be a bit different from my normal notation because they’re matching with the hessianfree library notation, which is coming from a reinforcement learning literature background instead of a control theory background. So, s is the state of the plant, and x is the input to the plant. I know, I know. All the same, make sure to keep that in mind.

# calculate ds0/dx0 with finite differences
d_input_FD = np.zeros((x.shape[0], # number of trials
x.shape[1], # number of inputs
self.state.shape[1])) # number of states
for ii in range(x.shape[1]):
# calculate state adding eps to x[ii]
self.reset_plant(self.prev_state)
inc_x = x.copy()
inc_x[:, ii] += self.eps
self.activation(inc_x)
state_inc = self.state.copy()
# calculate state subtracting eps from x[ii]
self.reset_plant(self.prev_state)
dec_x = x.copy()
dec_x[:, ii] -= self.eps
self.activation(dec_x)
state_dec = self.state.copy()

d_input_FD[:, :, ii] = \
(state_inc - state_dec) / (2 * self.eps)
d_input_FD = d_input_FD[..., None]


Alrighty. First we create a tensor to store the results. Why is it a tensor? Because we’re going to be doing a bunch of runs at once. So our state dimensions are actually trials x size_input. When we then take the partial derivative, we end up with trials many size_input x size_state matrices. Then we increase each of the parameters of the input slightly one at a time and store the results, decrease them all one at a time and store the results, and compute our approximation of the gradient.

Next we’ll do the same for calculating the derivative of the state with respect to the previous state.

# calculate ds1/ds0
d_state_FD = np.zeros((x.shape[0], # number of trials
self.state.shape[1], # number of states
self.state.shape[1])) # number of states
for ii in range(self.state.shape[1]):
# calculate state adding eps to self.state[ii]
state = np.copy(self.prev_state)
state[:, ii] += self.eps
self.reset_plant(state)
self.activation(x)
state_inc = self.state.copy()
# calculate state subtracting eps from self.state[ii]
state = np.copy(self.prev_state)
state[:, ii] -= self.eps
self.reset_plant(state)
self.activation(x)
state_dec = self.state.copy()

d_state_FD[:, :, ii] = \
(state_inc - state_dec) / (2 * self.eps)
d_state_FD = d_state_FD[..., None]


Great! We’re getting closer to having everything we need. Another thing we need is a wrapper for running our arm simulation. It’s going to look like this:

def activation(self, x):
state = []
# iterate through and simulate the plant forward
# for each trial
for ii in range(x.shape[0]):
self.arm.reset(q=self.state[ii, :self.arm.DOF],
dq=self.state[ii, self.arm.DOF:])
self.arm.apply_torque(u[ii])
state.append(np.hstack([self.arm.q, self.arm.dq]))
state = np.asarray(state)

self.state = self.squashing(state)


This is definitely not the fastest code to run. Much more ideally we would put the state and input into vectors and do a single set of computations for each call to activation rather than having that for loop in there. Unfortunately, though, we’re not assuming that we have access to the dynamics equations / will be able to pass in vector states and inputs.

Squashing
Looking at the above code that seems pretty clear what’s going on, except you might notice that last line calling self.squashing. What’s going on there?

The squashing function looks like this:

def squashing(self, x):
index_below = np.where(x < -2*np.pi)
x[index_below] = np.tanh(x[index_below]+2*np.pi) - 2*np.pi
index_above = np.where(x > 2*np.pi)
x[index_above] = np.tanh(x[index_above]-2*np.pi) + 2*np.pi
return x


All that’s happening here is that we’re taking our input, and doing nothing to it as long as it doesn’t start to get too positive or too negative. If it does then we just taper it off and prevent it from going off to infinity. So running a 1D vector through this function we get:

This ends up being a pretty important piece of the code here. Basically it prevents wild changes to the weights during learning to result in the system breaking down. So the state of the plant can’t go off to infinity and cause an error to be thrown, stopping our simulation. But because the target state is well within the bounds of where the squashing function does nothing, post-training we’ll still be able to use the resulting network to control a system that doesn’t have this fail safe built in. Think of this function as training wheels that catch you only if you start falling.

With that, we no have pretty much all of the parts necessary to begin training our network!

Training the network

We’re going to be training this network on the centre-out reaching task, where you start at a centre point and reach to a bunch of target locations around a circle. I’m just going to be re-implementing the task as it was done in Dr. Sutskever’s thesis, so we’ll have 64 targets around the circle, and train using a 2-link arm. Here’s the code that we’ll use to actually run the training:

for ii in range(last_trial+1, num_batches):
# train a bunch of batches using the same input every time
# to allow the network a chance to minimize things with
# stable input (speeds up training)
err = rnn.run_batches(plant, targets=None,
max_epochs=batch_size,
optimizer=HessianFree(CG_iter=96, init_damping=100))

# save the weights to file, track trial and error
# err = rnn.error(inputs)
err = rnn.best_error
name = 'weights/rnn_weights-trial%04i-err%.3f'%(ii, err)
np.savez_compressed(name, rnn.W)


A quick aside: if you want to run this code yourself, get a real good computer, have an arm simulation ready, the hessianfree Python library installed, and download and run this train_hf.py file. (Note: I used version 0.3.8 of the hessianfree library, which you can install using pip install hessianfree==0.3.8) This will start training and save the weights into a weights/ folder, so make sure that that exists in the same folder as train_hf.py. If you want to view the results of the training at any point run the plot_error.py file, which will load in the most recent version of the weights and plot the error so far. If you want to generate an animated plot like I have below run gen_animation_plots.py and then the last command from my post on generating animated gifs.

Another means of seeing the results of your trained up network is to use the controller I’ve implemented in my controls benchmarking suite, which looks for a set of saved weights in the controllers/weights folder, and will load it in and use it to generate command signals for the arm by running it with

python run.py arm2_python ahf reach --dt=1e-2


where you replace arm2_python with whatever arm model you trained your model on. Note the --dt=1e-2 flag, that is important because the model was trained with a .01 timestep and things get a bit weird if you suddenly change the dynamics on the controller.

OK let’s look at some results!

Results

OK full discretion, these results are not optimizing the cost function we discussed above. They’re implementing a simpler cost function that only looks at the final state, i.e. it doesn’t penalize the magnitude of the control signal. I did this because Dr. Sutskever says in his thesis he was able to optimize with just the final state cost using much smaller networks. I originally looked at neurons with 96 neurons in each layer, and it just took forgoddamnedever to run. So after running for 4 weeks (not joking) and needing to make some more changes I dropped the number of neurons and simplified the task.

The results below are from running a network with 32 neurons in each layer controlling this 2-link arm, and took another 4-5 weeks to train up.

Hey that looks good! Not bad, augmented Hessian-free learning, not bad. It had pretty consistent (if slow) decline in the error rate, with a few crazy bumps from which it quickly recovered. Also take note that each training iteration is actually 32 runs, so it’s not 12,50-ish runs it’s closer to 400,000 training runs that it took to get here.

One biggish thing that was a pain was that it turns out that I only trained the neural network for reaching in the one direction, and when you only train it to reach one way it doesn’t generalize to reaching back to the starting point (which, fair enough). But, I didn’t realize this until I was took the trained network and ran it in the benchmarking code, at which point I was not keen to redo all of the training it took to get the neural network to the level of accuracy it was at under a more complicated training set. The downside of this is that even though I’ve implemented a controller that takes in the trained network and uses it to control the arm, to do the reaching task I have to just do a hard reset after the arm reaches the target, because it can’t reach back to the center, like all the other controllers. All the same, here’s an animation of the trained up AHF controller reaching to 8 targets (it was trained on all 64 above though):

Things don’t always go so smoothly, though. Here’s results from another training run that took around 2-3 weeks, and uses a different 2-link arm model (translated from Matlab code written by Dr. Emo Todorov):

What I found frustrating about this was that if you look at the error over time then this arm is doing as well or better than the previous arm at a lot of points. But the corresponding trajectories look terrible, like something you would see in a horror movie based around getting good machine learning results. This of course comes down to how I specified the cost function, and when I looked at the trajectories plotted over time the velocity of the arm is right at zero at the final time step, which it is not quiiiitte the case for the first controller. So this second network has found a workaround to minimize the cost function I specified in a way I did not intend. To prevent this, doing something like weighting the distance to target heavier than non-zero velocity would probably work. Or possibly just rerunning the training with a different random starting point you could get out a better controller, I don’t have a great feel for how important the random initialization is, but I’m hoping that it’s not all too important and its effects go to zero with enough training. Also, it should be noted I’ve run the first network for 12,500 iterations and the second for less than 6,000, so I’ll keep letting them run and maybe it will come around. The first one looked pretty messy too until about 4,000 iterations in.

Training regimes

Frustratingly, the way that you train deep networks is very important. So, very much like the naive deep learning network trainer that I am, I tried the first thing that pretty much anyone would try:

• run the network,
• update the weights,
• repeat.

This is what I’ve done in the results above. And it worked well enough in that case.

If you remember back to the iLQR I made a little while ago, I was able to change the cost function to be

$L(\theta) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{u}_t) + \ell_f(\textbf{x}_N),$
$\ell(\textbf{u}_t, \textbf{x}_t) = \alpha \frac{||\textbf{u}_t||^2}{2} + \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2},$
$\ell_f(\textbf{x}_N) = \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2}$

(i.e. to include a penalty for distance to target throughout the trajectory and not just at the final time step) which resulted in straighter trajectories when controlling the 2-link arm. So I thought I would try this here as well. Sadly (incredibly sadly), this was fairly fruitless. The network didn’t really learn or improve much at all.

After much consideration and quandary on my part, I talked with Dr. Dan and he suggested that I try another method:

• run the network,
• record the input,
• hold the input constant for a few batches of weight updating,
• repeat.

This method gave much better results. BUT WHY? I hear you ask! Good question. Let me give giving explanation a go.

Essentially, it’s because the cost function is more complex now. In the first training method, the output from the plant is fed back into the network as input at every time step. When the cost function was simpler this was OK, but now we’re getting very different input to train on at every iteration. So the system is being pulled in different directions back and forth at every iteration. In the second training regime, the same input is given several times in a row, which let’s the system follow the same gradient for a few training iterations before things change again. In my head I picture this as giving the algorithm a couple seconds to catch its breath dunking it back underwater.

This is a method that’s been used in a bunch of places recently. One of the more high-profile instances is in the results published from DeepMind on deep RL for control and for playing Go. And indeed, it also works well here.

To implement this training regime, we set up the following code:

for ii in range(last_trial+1, num_batches):

# run the plant forward once
rnn.forward(input=plant, params=rnn.W)

# get the input and targets from above rollout
inputs = plant.get_vecs()[0].astype(np.float32)
targets = np.asarray(plant.get_vecs()[1], dtype=np.float32)

# train a bunch of batches using the same input every time
# to allow the network a chance to minimize things with
# stable input (speeds up training)
err = rnn.run_batches(inputs, targets, max_epochs=batch_size,
optimizer=HessianFree(CG_iter=96, init_damping=100))

# save the weights to file, track trial and error
# err = rnn.error(inputs)
err = rnn.best_error
name = 'weights/rnn_weights-trial%04i-err%.3f'%(ii, err)
np.savez_compressed(name, rnn.W)


So you can see that we do one rollout with the weights, then go in and get the inputs and targets that were used in that rollout, and start training the network while holding those constant for batch_size epochs (training sessions). From a little bit of messing around I’ve found batch_size=32 to be a pretty good number. So then it runs 32 training iterations where it’s updating the weights, and then saves those weights (because we want a loooottttt of check-points) and then restarts the loop.

Embarrassingly, I’ve lost my simulation results from this trial, somehow…so I don’t have any nice plots to back up the above, unfortunately. But since this is just a blog post I figured I would at least talk about it a little bit, since people might still find it useful if they’re just getting into the field like me. and just update this post whenever I re-run them. If I rerun them.

What I do have, however, are results where this method doesn’t work! I tried this with the simpler cost function, that only looks at the final state distance from the target, and it did not go so well. Let’s look at that one!

My guess here is basically that the system has gotten to a point where it’s narrowed things down in the parameter space and now when you run 32 batches it’s overshooting. It needs feedback about its updates after every update at this point. That’s my guess, at least. So it could be the case that for more complex cost functions you’d want to train it while holding the input constant for a while, and then when the error starts to plateau switch to updating the input after every parameter update.

Conclusions

All in all, AHF for training neural networks in control is pretty awesome. There are of course still some major hold-backs, mostly related to how long it takes to train up a network, and having to guess at effective training regimes and network structures etc. But! It was able to train up a relatively small neural network to move an arm model from a center point to 64 targets around a circle, with no knowledge of the system under control at all. In Dr. Sutskever’s thesis he goes on to use the same set up under more complicated circumstances, such as when there’s a feedback delay, or a delay on the outgoing control signal, and unexpected noise etc, so it is able to learn under a number of different, fairly complex situations. Which is pretty slick.

Related to the insane training time required, I very easily could be missing some basic thing that would help speed things up. If you, reader, get ambitious and run the code on your own machine and find out useful methods for speeding up the training please let me know! Personally, my plan is to next investigate guided policy search, which seems like it’s found a way around this crazy training time.

## Simultaneous perturbation vs finite differences for linear dynamics estimation and control signal optimization

Recently in my posts I’ve been using finite differences to approximate the gradient of loss functions and dynamical systems, with the intention of creating generalizable controllers that can be run on any system without having to calculate out derivatives beforehand. Finite differences is pretty much the most straight-forward way of approximating a gradient that there is: vary each parameter up and down (assuming we’re doing central differencing), one at a time, run it through your function and estimate the parameters effect on the system by calculating the difference between resulting function output. To do this requires 2 samples of the function for each parameter.

But there’s always more than one way to peel an avocado, and another approach that’s been used with much success is the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm, which was developed by Dr. James Spall (link to overview paper). SPSA is a method of gradient approximation, like finite differences, but, critically, the difference is that it varies all of the parameters at once, rather than one at a time. As a result, you can get an approximation of the gradient with far fewer samples from the system, and also when you don’t have explicit control over your samples (i.e. the ability to vary each parameter one at a time).

Given some function $\textbf{f}$ dependent on some set of parameters $\textbf{x}$, we’re used to finding the gradient of $\textbf{f}(\textbf{x})$ using FDSA (finite differences stochastic approximation) written in this form:

$\textbf{f}_x = \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}) - \textbf{f}(\textbf{x} - \Delta \textbf{x})}{2 \Delta \textbf{x}} = \Delta x^{-1} \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}) - \textbf{f}(\textbf{x} - \Delta \textbf{x})}{2},$

where $\Delta\textbf{x}$ is a perturbation to the parameter set, and the subscript on the left-hand side denotes the derivative of $\textbf{f}(\textbf{x})$ with respect to $\textbf{x}$.

And that’s how we’ve calculated it before, estimating the gradient of a single parameter at a time. But, we can rewrite this for a set perturbations $\Delta \textbf{X} = [\Delta\textbf{x}_0, ... , \Delta\textbf{x}_N]^T$:

$\textbf{f}_\textbf{x} = \Delta \textbf{X}^{-1} \textbf{F},$

where

$\textbf{F} = [\frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}_0) - \textbf{f}(\textbf{x} - \Delta \textbf{x}_0)}{2}, ... , \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}_0) - \textbf{f}(\textbf{x}_N - \Delta \textbf{x}_N)}{2}]^T$,

which works as long as $\Delta\textbf{X}$ is square. When it’s not square (i.e. we have don’t have the same number of samples as we have parameters), we run into problems, because we can’t calculate $\Delta\textbf{X}^{-1}$ directly. To address this, let’s take a step back and then work forward again to get a more general form that works for non-square $\Delta \textbf{X}$ too.

By rewriting the above, and getting rid of the inverse by moving $\Delta\textbf{x}$ back to the other side, we have:

$\Delta\textbf{X} \; \textbf{f}_\textbf{x} = \textbf{F}$

Now, the standard trick to move a matrix that’s not square is to just make it square by multiplying it by its transpose to get a square matrix, and then the whole thing by the inverse:

$\Delta\textbf{X}^T \Delta\textbf{X} \; \textbf{f}_\textbf{x} = \Delta\textbf{X}^T \textbf{F}$

$(\Delta\textbf{X}^T \Delta\textbf{X})^{-1} (\Delta\textbf{X}^T \Delta\textbf{X}) \textbf{f}_\textbf{x} = (\Delta\textbf{X}^T \Delta\textbf{X})^{-1} \Delta\textbf{X}^T \textbf{F}$

$\textbf{f}_\textbf{x} = (\Delta\textbf{X}^T \Delta\textbf{X})^{-1} \Delta\textbf{X}^T \textbf{F}$

Alright! Now we’re comfortable with this characterization of gradient approximation using a form that works with non-square perturbation matrices.

Again, in FDSA, we only vary one parameter at a time. This means that there will only ever be one non-zero entry per row of $\Delta \textbf{X}$. By contrast, in SPSA, we vary multiple parameters, and so rows of $\Delta\textbf{X}$ will be just chalk full of non-zero entries.

Gradient approximation to estimate $\textbf{f}_\textbf{x}$ and $\textbf{f}_\textbf{u}$ for LQR control

Implementing this was pretty simple. Just had to modify the calc_derivs function, which I use to estimate the derivative of the arm with respect to the state and control signal, in my LQR controller code by changing from standard finite differences to simultaneous perturbation:

def calc_derivs(self, x, u):
"""" calculate gradient of plant dynamics using Simultaneous
Perturbation Stochastic Approximation (SPSA). Implemented
based on (Peters & Schaal, 2008).

x np.array: the state of the system
u np.array: the control signal
"""
# Initialization and coefficient selection
num_iters = 20

eps = 1e-4
delta_K = None
delta_J = None
for ii in range(num_iters):
# Generation of simultaneous perturbation vector
# choose each component from a Bernoulli +-1 distribution
# with probability of .5 for each +-1 outcome.
delta_k = np.random.choice([-1,1],
size=len(x) + len(u),
p=[.5, .5])
# Function evaluations
inc_x = np.copy(x) + eps * delta_k[:len(x)]
inc_u = np.copy(u) + eps * delta_k[len(x):]
state_inc = self.plant_dynamics(inc_x, inc_u)
dec_x = np.copy(x) - eps * delta_k[:len(x)]
dec_u = np.copy(u) - eps * delta_k[len(x):]
state_dec = self.plant_dynamics(dec_x, dec_u)

delta_j = ((state_inc - state_dec) /
(2.0 * eps)).reshape(-1)

# Track delta_k and delta_j
delta_K = delta_k if delta_K is None else \
np.vstack([delta_K, delta_k])
delta_J =  delta_j if delta_J is None else \
np.vstack([delta_J, delta_j])

f_xu = np.dot(np.linalg.pinv(np.dot(delta_K.T, delta_K)),
np.dot(delta_K.T, delta_J))
f_x = f_xu[:len(x)]
f_u = f_xu[len(x):]

return f_x.T , f_u.T


A couple notes about the above code. First, you’ll notice that the f_x and f_b matrices are both calculated at the same time. That’s pretty slick! And that calculation for f_xu is just a straight implementation of the matrix form of gradient approximation, where I’ve arranged things so that f_x is in the top part and f_u is in the lower part.

The second thing is that the perturbation vector delta_k is generated from a Bernoulli distribution. The reason behind this is that we want to have a bunch of different samples that pretty reasonably spread the state space and move all the parameters independently. Making each perturbation some distance times -1 or 1 is an easy way to achieve this.

Thirdly, there’s the num_iters variable. This is a very important variable, as it dictates how many random samples of our system we take before we estimate the gradient. I’ve found that to get this to work for both the 2-link arm and the more complex 3-link arm, it needs to be at least 20. Or else things explode and die horribly. Just…horribly.

OK let’s look at the results:

The first thing to notice is that I’ve finally discovered the Seaborn plotting package. The second is that SPSA does as well as FDSA.

You may ask: Is there any difference? Well, if we time these functions, on my lil’ laptop, for the 2-link arm it takes SPSA approximately 2.0ms, but it takes FDSA only 0.8ms. So for the same performance the SPSA is taking almost 3 times as long to run. Why? This boils down to how many times the system dynamics need to be sampled by each algorithm to get a good approximation of the gradient. For a 2-link arm, FDSA has 6 parameters ($\textbf{q}, \dot{\textbf{q}},$ and $\textbf{u}$) that it needs to sample twice (we’re doing central differencing), for a total of 12 samples. And as I mentioned above, the SPSA algorithm needs 20 samples to be stable.

For the 3-link arm, SPSA took about 3.1ms on average and FDSA (which must now perform 18 samples of the dynamics) still only 2.1ms. So number of samples isn’t the only cause of time difference between these two algorithms. SPSA needs to perform that a few more matrix operations, including a matrix inverse, which is expensive, while FDSA can calculate the gradient of each parameter individually, which is much less expensive.

OK so SPSA not really impressive here. BUT! As I discovered, there are other means of employing SPSA.

Gradient approximation to optimize the control signal directly

In the previous set up we were using SPSA to estimate the gradient of the system under control, and then we used that gradient to calculate a control signal that minimized the loss function (as specified inside the LQR). This is one way to use gradient approximation methods. Another way to use these methods is approximate the gradient of the loss function directly, and use that information to iteratively calculate a control signal that minimizes the loss function. This second application is the primary use of the SPSA algorithm, and is what’s described by Dr. Spall in his overview paper.

In this application, the algorithm works like this:

2. perturb input and simulate results
3. observe loss function and calculate gradient
4. update input to system
5. repeat to convergence

Because in this approach we’re iteratively optimizing the input using our gradient estimation, having a noisy estimate is no longer a death sentence, as it was in the LQR. If we update our input to the system with several noisy gradient estimates the noise will essentially just cancel itself out. This means that SPSA now has a powerful advantage over FDSA: Since in SPSA we vary all parameters at once, only 2 samples of the loss function are used to estimate the gradient, regardless of the number of parameters. In contrast, FDSA needs to sample the loss function twice for every input parameter. Here’s a picture from (Spall, 1998) that shows the two running against each other to optimize a 2D problem:

This gets across that even though SPSA bounces around more, they both reach the solution in the same number of steps. And, in general, this is the case, as Dr. Spall talks about in the paper. There’s also a couple more details of the algorithm, so let’s look at it in detail. Here’s the code, which is just a straight translation into Python out of the description in Dr. Spall’s paper:

# Step 1: Initialization and coefficient selection
max_iters = 5
converge_thresh = 1e-5

alpha = 0.602 # from (Spall, 1998)
gamma = 0.101
a = .101 # found empirically using HyperOpt
A = .193
c = .0277

delta_K = None
delta_J = None
u = np.copy(self.u) if self.u is not None \
else np.zeros(self.arm.DOF)
for k in range(max_iters):
ak = a / (A + k + 1)**alpha
ck = c / (k + 1)**gamma

# Step 2: Generation of simultaneous perturbation vector
# choose each component from a bernoulli +-1 distribution with
# probability of .5 for each +-1 outcome.
delta_k = np.random.choice([-1,1], size=arm.DOF, p=[.5, .5])

# Step 3: Function evaluations
inc_u = np.copy(u) + ck * delta_k
cost_inc = self.cost(np.copy(state), inc_u)
dec_u = np.copy(u) - ck * delta_k
cost_dec = self.cost(np.copy(state), dec_u)

gk = np.dot((cost_inc - cost_dec) / (2.0*ck), delta_k)

# Step 5: Update u estimate
old_u = np.copy(u)
u -= ak * gk

# Step 6: Check for convergence
if np.sum(abs(u - old_u)) < converge_thresh:
break


The main as-of-yet-unexplained parts of this code are the alpha, gamma, a, A, and c variables. What’s their deal?

Looking inside the loop, we can see that ck controls the magnitude of our perturbations. Looking a little further down, ak is just the learning rate. And all of those other parameters are just involved in shaping the trajectories that ak and ck follow through iterations, which is a path towards zero. So the first steps and perturbations are the biggest, and each successively becomes smaller as the iteration count increases.

There are a few heuristics that Dr. Spall goes over, but there aren’t any hard and fast rules for setting a, A, and c. Here, I just used HyperOpt to find some values that worked pretty well for this particular problem.

The FDSA version of this is also very straight-forward:

# Step 1: Initialization and coefficient selection
max_iters = 10
converge_thresh = 1e-5

eps = 1e-4
u = np.copy(self.u) if self.u is not None \
else np.zeros(self.arm.DOF)
for k in range(max_iters):

gk = np.zeros(u.shape)
for ii in range(gk.shape[0]):
# Step 2: Generate perturbations one parameter at a time
inc_u = np.copy(u)
inc_u[ii] += eps
dec_u = np.copy(u)
dec_u -= eps

# Step 3: Function evaluation
cost_inc = self.cost(np.copy(state), inc_u)
cost_dec = self.cost(np.copy(state), dec_u)

gk[ii] = (cost_inc - cost_dec) / (2.0 * eps)

old_u = np.copy(u)
# Step 5: Update u estimate
u -= 1e-5 * gk

# Step 6: Check for convergence
if np.sum(abs(u - old_u)) < converge_thresh:
break


You’ll notice that in both the SPSA and FDSA code we’re no longer sampling plant_dynamics, we’re instead sampling cost, a loss function I defined. From just my experience playing around with these algorithms a bit, getting the loss function to be appropriate and give the desired behaviour is definitely a bit of an art. It feels like much more of an art than in other controllers I’ve coded, but that could just be me.

The cost function that I’m using is pretty much the first thing you’d think of. It penalizes distance to target and having non-zero velocity. Getting the weighting between distance to target and velocity set up so that the arm moves to the target but also doesn’t overshoot definitely took a bit of trial and error, er, I mean empirical analysis. Here’s the cost function that I found worked pretty well, note that I had to make special cases for the different arms:

 def cost(self, x, u):
dt = .1 if self.arm.DOF == 3 else .01
next_x = self.plant_dynamics(x, u, dt=dt)
vel_gain = 100 if self.arm.DOF == 3 else 10
return (np.sqrt(np.sum((self.arm.x - self.target)**2)) * 1000 \
+ np.sum((next_x[self.arm.DOF:])**2) * vel_gain)


So that’s all the code, let’s look at the results!

For these results, I used a max of 10 iterations for optimizing the control signal. I was definitely surprised by the quality of the results, especially for the 3-link arm, compared to the results generated by a standard LQR controller. Although I need to note, again, that it was a fair bit of me playing around with the exact cost function to get these results. Lots of empirical analysis.

The two controllers generate results that are identical to visual inspection. However, especially in the 3-link arm, the time required to run the FDSA was significantly longer than the SPSA controller. It took approximately 140ms for the SPSA controller to run a single loop, but took FDSA on average 700ms for a single loop of calculating the control signal. Almost 5 times as long! For the same results! In directly optimizing the control signal, SPSA gets a big win over standard FDSA. So, if you’re looking to directly optimize over a loss function, SPSA is probably the way you want to go.

Conclusions

First off, I thought it was really neat to directly apply gradient approximation methods to optimizing the control signal. It’s something I haven’t tried before, but definitely makes sense, and can generate some really nice results when tuned properly. Automating the tuning is definitely I’ll be discussing in future posts, because doing it by hand takes a long time and is annoying.

In the LQR, the gradient approximation was best done by the FDSA. I think the main reasons for this is that in solving for the control signal the LQR algorithm uses matrix inverses, and any errors in the linear approximations to the dynamics are going to be amplified quite a bit. If I did anything less than 10-15 iterations (20 for the 3-link arm) in the SPSA approximation then things exploded. Also, here the SPSA algorithm required a matrix inverse, where the FDSA didn’t. This is because we only varied one parameter at a time in FDSA, and the effects of changing each was isolated. In the SPSA case, we had to consider the changes across all the variables and the resulting effects all at once, essentially noting which variables changed by how much and the changes in each case, and averaging. Here, even with the more complex 3-link arm, FDSA was faster, so I’m going to stick with it in my LQR and iLQR implementations.

In the direct control signal optimization SPSA beat the pants off of FDSA. It was almost 5 times faster for control of the 3-link arm. This was, again, because in this case we could use noisy samples of the gradient of the loss function and relied on noise to cancel itself out as we iterated. So we only needed 2 samples of the loss function in SPSA, where in FDSA we needed 2*num_parameters. And although this generated pretty good results I would definitely be hesitant against using this for any more complicated systems, because tuning that cost function to get out a good trajectory was a pain. If you’re interested in playing around with this, you can check out the code for the gradient controllers up on my GitHub.

## The iterative Linear Quadratic Regulator algorithm

A few months ago I posted on Linear Quadratic Regulators (LQRs) for control of non-linear systems using finite-differences. The gist of it was at every time step linearize the dynamics, quadratize (it could be a word) the cost function around the current point in state space and compute your feedback gain off of that, as though the dynamics were both linear and consistent (i.e. didn’t change in different states). And that was pretty cool because you didn’t need all the equations of motion and inertia matrices etc to generate a control signal. You could just use the simulation you had, sample it a bunch to estimate the dynamics and value function, and go off of that.

The LQR, however, operates with maverick disregard for changes in the future. Careless of the consequences, it optimizes assuming the linear dynamics approximated at the current time step hold for all time. It would be really great to have an algorithm that was able to plan out and optimize a sequence, mindful of the changing dynamics of the system.

This is exactly the iterative Linear Quadratic Regulator method (iLQR) was designed for. iLQR is an extension of LQR control, and the idea here is basically to optimize a whole control sequence rather than just the control signal for the current point in time. The basic flow of the algorithm is:

1. Initialize with initial state $x_0$ and initial control sequence $\textbf{U} = [u_{t_0}, u_{t_1}, ..., u_{t_{N-1}}]$.
2. Do a forward pass, i.e. simulate the system using $(x_0, \textbf{U})$ to get the trajectory through state space, $\textbf{X}$, that results from applying the control sequence $\textbf{U}$ starting in $x_0$.
3. Do a backward pass, estimate the value function and dynamics for each $(\textbf{x}, \textbf{u})$ in the state-space and control signal trajectories.
4. Calculate an updated control signal $\hat{\textbf{U}}$ and evaluate cost of trajectory resulting from $(x_0, \hat{\textbf{U}})$.
1. If $|(\textrm{cost}(x_0, \hat{\textbf{U}}) - \textrm{cost}(x_0, \textbf{U})| < \textrm{threshold}$ then we've converged and exit.
2. If $\textrm{cost}(x_0, \hat{\textbf{U}}) < \textrm{cost}(x_0, \textbf{U})$, then set $\textbf{U} = \hat{\textbf{U}}$, and change the update size to be more aggressive. Go back to step 2.
3. If $\textrm{cost}(x_0, \hat{\textbf{U}}) \geq$ $\textrm{cost}(x_0, \textbf{U})$ change the update size to be more modest. Go back to step 3.

There are a bunch of descriptions of iLQR, and it also goes by names like ‘the sequential linear quadratic algorithm’. The paper that I’m going to be working off of is by Yuval Tassa out of Emo Todorov’s lab, called Control-limited differential dynamic programming. And the Python implementation of this can be found up on my github in my Control repo. Also, a big thank you to Dr. Emo Todorov who provided Matlab code for the iLQG algorithm, which was super helpful.

Defining things

So let’s dive in. Formally defining things, we have our system $\textbf{x}$, and dynamics described with the function $\textbf{f}$, such that

$\textbf{x}_{t+1} = \textbf{f}(\textbf{x}_t, \textbf{u}_t),$

where $\textbf{u}$ is the input control signal. The trajectory $\{\textbf{X}, \textbf{U}\}$ is the sequence of states $\textbf{X} = \{\textbf{x}_0, \textbf{x}_1, ..., \textbf{x}_N\}$ that result from applying the control sequence $\textbf{U} = \{\textbf{u}_0, \textbf{u}_1, ..., \textbf{u}_{N-1}\}$ starting in the initial state $\textbf{x}_0$.

Now we need to define all of our cost related equations, so we know exactly what we’re dealing with.

Define the total cost function $J$, which is the sum of the immediate cost, $\ell$, from each state in the trajectory plus the final cost, $\ell_f$:

$J(\textbf{x}_0, \textbf{U}) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{x}_t, \textbf{u}_t) + \ell_f(\textbf{x}_N).$

Letting $\textbf{U}_t = \{\textbf{u}_t, \textbf{u}_{t+1}, ..., \textbf{U}_{N-1}\}$, we define the cost-to-go as the sum of costs from time $t$ to $N$:

$J_t(\textbf{x}, \textbf{U}_t) = \sum\limits^{N-1}\limits_{i=t} \ell(\textbf{x}_i, \textbf{u}_i) + \ell_f(\textbf{x}_N).$

The value function $V$ at time $t$ is the optimal cost-to-go from a given state:

$V_t(\textbf{x}) = \min\limits_{\textbf{U}_t} J_t(\textbf{x}, \textbf{U}_t),$

where the above equation just says that the optimal cost-to-go is found by using the control sequence $\textbf{U}_t$ that minimizes $J_t$.

At the final time step, $N$, the value function is simply

$V(\textbf{x}_N) = \ell_f(\textbf{x}_N).$

For all preceding time steps, we can write the value function as a function of the immediate cost $\ell(\textbf{x}, \textbf{u})$ and the value function at the next time step:

$V(\textbf{x}) = \min\limits_{\textbf{u}} \left[ \ell(\textbf{x}, \textbf{u}) + V(\textbf{f}(\textbf{x}, \textbf{u})) \right].$

NOTE: In the paper, they use the notation $V'(\textbf{f}(\textbf{x}, \textbf{u}))$ to denote the value function at the next time step, which is redundant since $\textbf{x}_{t+1} = \textbf{f}(\textbf{x}_t, \textbf{u}_t)$, but it comes in handy later when they drop the dependencies to simplify notation. So, heads up: $V' = V(\textbf{f}(\textbf{x}, \textbf{u})$.

Forward rollout

The forward rollout consists of two parts. The first part is to simulating things to generate the $(\textbf{X}, \textbf{U}),$ from which we can calculate the overall cost of the trajectory, and find out the path that the arm will take. To improve things though we’ll need a lot of information about the partial derivatives of the system, calculating these is the second part of the forward rollout phase.

To calculate all these partial derivatives we’ll use $(\textbf{X}, \textbf{U})$. For each $(\textbf{x}_t, \textbf{u}_t)$ we’ll calculate the derivatives of $\textbf{f}(\textbf{x}_t, \textbf{u}_t)$ with respect to $\textbf{x}_t$ and $\textbf{u}_t$, which will give us what we need for our linear approximation of the system dynamics.

To get the information we need about the value function, we’ll need the first and second derivatives of $\ell(\textbf{x}_t, \textbf{u}_t)$ and $\ell_f(\textbf{x}_t, \textbf{x}_t)$ with respect to $\textbf{x}_t$ and $\textbf{u}_t$.

So all in all, we need to calculate $\textbf{f}_\textbf{x}$, $\textbf{f}_\textbf{u}$, $\ell_\textbf{x}$, $\ell_\textbf{u}$, $\ell_\textbf{xx}$, $\ell_\textbf{ux}$, $\ell_\textbf{uu}$, where the subscripts denote a partial derivative, so $\ell_\textbf{x}$ is the partial derivative of $\ell$ with respect to $\textbf{x}$, $\ell_\textbf{xx}$ is the second derivative of $\ell$ with respect to $\textbf{x}$, etc. And to calculate all of these partial derivatives, we’re going to use finite differences! Just like in the LQR with finite differences post. Long story short, load up the simulation for every time step, slightly vary one of the parameters, and measure the resulting change.

Once we have all of these, we’re ready to move on to the backward pass.

Backward pass

Now, we started out with an initial trajectory, but that was just a guess. We want our algorithm to take it and then converge to a local minimum. To do this, we’re going to add some perturbing values and use them to minimize the value function. Specifically, we’re going to compute a local solution to our value function using a quadratic Taylor expansion. So let’s define $Q(\delta \textbf{x}, \delta \textbf{u})$ to be the change in our value function at $(\textbf{x}, \textbf{u})$ as a result of small perturbations $(\delta \textbf{x}, \delta \textbf{u})$:

$Q(\delta \textbf{x}, \delta \textbf{u}) = \ell (\textbf{x} + \delta \textbf{x}, \textbf{u} + \delta \textbf{u}) + V(\textbf{f}(\textbf{x} + \delta\textbf{x}, \textbf{u} + \delta \textbf{u})).$

The second-order expansion of $Q$ is given by:

$Q_\textbf{x} = \ell_\textbf{x} + \textbf{f}_\textbf{x}^T V'_\textbf{x},$

$Q_\textbf{u} = \ell_\textbf{u} + \textbf{f}_\textbf{u}^T V'_\textbf{x},$

$Q_\textbf{xx} = \ell_\textbf{xx} + \textbf{f}_\textbf{x}^T V'_\textbf{xx} \textbf{f}_\textbf{x} + V'_\textbf{x} \cdot \textbf{f}_\textbf{xx},$

$Q_\textbf{ux} = \ell_\textbf{ux} + \textbf{f}_\textbf{u}^T V'_\textbf{xx} \textbf{f}_\textbf{x}+ V'_\textbf{x} \cdot \textbf{f}_\textbf{ux},$

$Q_\textbf{uu} = \ell_\textbf{uu} + \textbf{f}_\textbf{u}^T V'_\textbf{xx} \textbf{f}_\textbf{u}+ V'_\textbf{x} \cdot \textbf{f}_\textbf{uu}.$

Remember that $V' = V(\textbf{f}(\textbf{x}, \textbf{u}))$, which is the value function at the next time step. NOTE: All of the second derivatives of $\textbf{f}$ are zero in the systems we’re controlling here, so when we calculate the second derivatives we don’t need to worry about doing any tensor math, yay!

Given the second-order expansion of $Q$, we can to compute the optimal modification to the control signal, $\delta \textbf{u}^*$. This control signal update has two parts, a feedforward term, $\textbf{k}$, and a feedback term $\textbf{K} \delta\textbf{x}$. The optimal update is the $\delta\textbf{u}$ that minimizes the cost of $Q$:

$\delta\textbf{u}^*(\delta \textbf{x}) = \min\limits_{\delta\textbf{u}}Q(\delta\textbf{x}, \delta\textbf{u}) = \textbf{k} + \textbf{K}\delta\textbf{x},$

where $\textbf{k} = -Q^{-1}_\textbf{uu} Q_\textbf{u}$ and $\textbf{K} = -Q^{-1}_\textbf{uu} Q_\textbf{ux}.$

Derivation can be found in this earlier paper by Li and Todorov. By then substituting this policy into the expansion of $Q$ we get a quadratic model of $V$. They do some mathamagics and come out with:

$V_\textbf{x} = Q_\textbf{x} - \textbf{K}^T Q_\textbf{uu} \textbf{k},$

$V_\textbf{xx} = Q_\textbf{xx} - \textbf{K}^T Q_\textbf{uu} \textbf{K}.$

So now we have all of the terms that we need, and they’re defined in terms of the values at the next time step. We know the value of the value function at the final time step $V_N = \ell_f(\textbf{x}_N)$, and so we’ll simply plug this value in and work backwards in time recursively computing the partial derivatives of $Q$ and $V$.

Calculate control signal update

Once those are all calculated, we can calculate the gain matrices, $\textbf{k}$ and $\textbf{K}$, for our control signal update. Huzzah! Now all that’s left to do is evaluate this new trajectory. So we set up our system

$\hat{\textbf{x}}_0 = \textbf{x}_0,$

$\hat{\textbf{u}}_t = \textbf{u}_t + \textbf{k}_t + \textbf{K}_t (\hat{\textbf{x}}_t - \textbf{x}_t),$

$\hat{\textbf{x}}_{t+1} = \textbf{f}(\hat{\textbf{x}}_t, \hat{\textbf{u}}_t),$

and record the cost. Now if the cost of the new trajectory $(\hat{\textbf{X}}, \hat{\textbf{U}})$ is less than the cost of $(\textbf{X}, \textbf{U})$ then we set $\textbf{U} = \hat{\textbf{U}}$ and go do it all again! And when the cost from an update becomes less than a threshold value, call it done. In code this looks like:

if costnew < cost:
sim_new_trajectory = True

if (abs(costnew - cost)/cost) < self.converge_thresh:
break


Of course, another option we need to account for is when costnew > cost. What do we do in this case? Our control update hasn’t worked, do we just exit?

The Levenberg-Marquardt heuristic
No! Phew.

The control signal update in iLQR is calculated in such a way that it can behave like Gauss-Newton optimization (which uses second-order derivative information) or like gradient descent (which only uses first-order derivative information). The is that if the updates are going well, then lets include curvature information in our update to help optimize things faster. If the updates aren’t going well let’s dial back towards gradient descent, stick to first-order derivative information and use smaller steps. This wizardry is known as the Levenberg-Marquardt heuristic. So how does it work?

Something we skimmed over in the iLQR description was that we need to calculate $Q^{-1}_\textbf{uu}$ to get the $\textbf{k}$ and $\textbf{K}$ matrices. Instead of using np.linalg.pinv or somesuch, we’re going to calculate the inverse ourselves after finding the eigenvalues and eigenvectors, so that we can regularize it. This will let us do a couple of things. First, we’ll be able to make sure that our estimate of curvature ($Q_\textbf{uu}^{-1}$) stays positive definite, which is important to make sure that we always have a descent direction. Second, we’re going to add a regularization term to the eigenvalues to prevent them from exploding when we take their inverse. Here’s our regularization implemented in Python:


Q_uu_evals, Q_uu_evecs = np.linalg.eig(Q_uu)
Q_uu_evals[Q_uu_evals < 0] = 0.0
Q_uu_evals += lamb
Q_uu_inv = np.dot(Q_uu_evecs,
np.dot(np.diag(1.0/Q_uu_evals), Q_uu_evecs.T))


Now, what happens when we change lamb? The eigenvalues represent the magnitude of each of the eigenvectors, and by taking their reciprocal we flip the contributions of the vectors. So the ones that were contributing the least now have the largest singular values, and the ones that contributed the most now have the smallest eigenvalues. By adding a regularization term we ensure that the inverted eigenvalues can never be larger than 1/lamb. So essentially we throw out information.

In the case where we’ve got a really good approximation of the system dynamics and value function, we don’t want to do this. We want to use all of the information available because it’s accurate, so make lamb small and get a more accurate inverse. In the case where we have a bad approximation of the dynamics we want to be more conservative, which means not having those large singular values. Smaller singular values give a smaller $Q_\textbf{uu}^{-1}$ estimate, which then gives smaller gain matrices and control signal update, which is what we want to do when our control signal updates are going poorly.

How do you know if they’re going poorly or not, you now surely ask! Clever as always, we’re going to use the result of the previous iteration to update lamb. So adding to the code from just above, the end of our control update loop is going to look like:

lamb = 1.0 # initial value of lambda
...
if costnew < cost:
lamb /= self.lamb_factor
sim_new_trajectory = True

if (abs(costnew - cost)/cost) < self.converge_thresh:
break
else:
lamb *= self.lamb_factor
if lamb > self.max_lamb:
break


And that is pretty much everything! OK let’s see how this runs!

Simulation results

If you want to run this and see for yourself, you can go copy my Control repo, navigate to the main directory, and run

python run.py arm2 reach


or substitute in arm3. If you’re having trouble getting the arm2 simulation to run, try arm2_python, which is a straight Python implementation of the arm dynamics, and should work no sweat for Windows and Mac.

Below you can see results from the iLQR controller controlling the 2 and 3 link arms (click on the figures to see full sized versions, they got distorted a bit in the shrinking to fit on the page), using immediate and final state cost functions defined as:

l = np.sum(u**2)


and

pos_err = np.array([self.arm.x[0] - self.target[0],
self.arm.x[1] - self.target[1]])
l = (wp * np.sum(pos_err**2) + # pos error
wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error


where wp and wv are just gain values, x is the state of the system, and self.arm.x is the $(x,y)$ position of the hand. These read as “during movement, penalize large control signals, and at the final state, have a big penalty on not being at the target.”

So let’s give it up for iLQR, this is awesome! How much of a crazy improvement is that over LQR? And with all knowledge of the system through finite differences, and with the full movements in exactly 1 second! (Note: The simulation speeds look different because of my editing to keep the gif sizes small, they both take the same amount of time for each movement.)

Changing cost functions
Something that you may notice is that the control of the 3 link is actually straighter than the 2 link. I thought that this might be just an issue with the gain values, since the scale of movement is smaller for the 2 link arm than the 3 link there might have been less of a penalty for not moving in a straight line, BUT this was wrong. You can crank the gains and still get the same movement. The actual reason is that this is what the cost function specifies, if you look in the code, only $\ell_f(\textbf{x}_N)$ penalizes the distance from the target, and the cost function during movement is strictly to minimize the control signal, i.e. $\ell(\textbf{x}_t, \textbf{u}_t) = \textbf{u}_t^2$.

Well that’s a lot of talk, you say, like the incorrigible antagonist we both know you to be, prove it. Alright, fine! Here’s iLQR running with an updated cost function that includes the end-effector’s distance from the target in the immediate cost:

All that I had to do to get this was change the immediate cost from

l = np.sum(u**2)


to

l = np.sum(u**2)
pos_err = np.array([self.arm.x[0] - self.target[0],
self.arm.x[1] - self.target[1]])
l += (wp * np.sum(pos_err**2) + # pos error
wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error


where all I had to do was include the position penalty term from the final state cost into the immediate state cost.

Changing sequence length
In these simulations the system is simulating at .01 time step, and I gave it 100 time steps to reach the target. What if I give it only 50 time steps?

It looks pretty much the same! It’s just now twice as fast, which is of course achieved by using larger control signals, which we don’t see, but dang awesome.

What if we try to make it there in 10 time steps??

OK well that does not look good. So what’s going on in this case? Basically we’ve given the algorithm an impossible task. It can’t make it to the target location in 10 time steps. In the implementation I wrote here, if it hits the end of it’s control sequence and it hasn’t reached the target yet, the control sequence starts over back at t=0. Remember that part of the target state is also velocity, so basically it moves for 10 time steps to try to minimize $(x,y)$ distance, and then slows down to minimize final state cost in the velocity term.

In conclusion

This algorithm has been used in a ton of things, for controlling robots and simulations, and is an important part of guided policy search, which has been used to very successfully train deep networks in control problems. It’s getting really impressive results for controlling the arm models that I’ve built here, and using finite differences should easily generalize to other systems.

iLQR is very computationally expensive, though, so that’s definitely a downside. It’s definitely less expensive if you have the equations of your system, or at least a decent approximation of them, and you don’t need to use finite differences. But you pay for the efficiency with a loss in generality.

There are also a bunch of parameters to play around with that I haven’t explored at all here, like the weights in the cost function penalizing the magnitude of the cost function and the final state position error. I showed a basic example of changing the cost function, which hopefully gets across just how easy changing these things out can be when you’re using finite differences, and there’s a lot to play around with there too.

Implementation note
In the Yuval and Todorov paper, they talked about using backtracking line search when generating the control signal. So the algorithm they had when generating the new control signal was actually:

$\hat{\textbf{u}}_t = \hat{\textbf{u}}_t + \alpha\textbf{k}_t + \textbf{K}_t(\hat{\textbf{x}}_t - \textbf{x}_t)$

where $\alpha$ was the backtracking search parameter, which gets set to one initially and then reduced. It’s very possible I didn’t implement it as intended, but I found consistently that $\alpha = 1$ always generated the best results, so it was just adding computation time. So I left it out of my implementation. If anyone has insights on an implementation that improves results, please let me know!

And then finally, another thank you to Dr. Emo Todorov for providing Matlab code for the iLQG algorithm, which was very helpful, especially for getting the Levenberg-Marquardt heuristic implemented properly.

## Linear-Quadratic Regulation for non-linear systems using finite differences

One of the standard controllers in basic control theory is the linear-quadratic regulator (LQR). There is a finite-horizon case (where you have a limited amount of time), and an infinite-horizon case (where you don’t); in this post, for simplicity, we’re only going to be dealing with the infinite-horizon case.

The LQR is designed to handle a very specific kind of problem. First, it assumes you are controlling a system with linear dynamics, which means you can express them as

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u}$,

where $\textbf{x}$ and $\dot{\textbf{x}}$ are the state and its time derivative, $\textbf{u}$ is the input, and $\textbf{A}$ and $\textbf{B}$ capture the effects of the state and input on the derivative. And second, it assumes that the cost function, denoted $J$, is a quadratic of the form

$J = \int_0^{\infty} \left( (\textbf{x} - \textbf{x}^*)^T \textbf{Q} (\textbf{x} - \textbf{x}^*) + \textbf{u}^T \textbf{R} \textbf{u} \right) dt$

where $\textbf{x}^*$ is the target state, and $\textbf{Q} = \textbf{Q}^T \geq 0$ and $\textbf{R} = \textbf{R}^T \geq 0$ are weights on the cost of not being at the target state and applying a control signal. The higher $\textbf{Q}$ is, the more important it is to get to the target state asap, the higher $\textbf{R}$ is, the more important it is to keep the control signal small as you go to the target state.

The goal of the LQR is to calculate a feedback gain matrix $\textbf{K}$ such that

$\textbf{u} = -\textbf{K} \textbf{x},$

drives the system to the target. When the system is a linear system with a quadratic cost function, this can be done optimally. There is lots of discussion elsewhere about LQRs and their derivation, so I’m not going to go into that with this post. Instead, I’m going to talk about applying LQRs to non-linear systems, and using finite differences to do it, which works when you have a readily accessible simulation of the system on hand. The fun part is that by using finite differences you can get this to work without working out the dynamics equations yourself.

Using LQRs on non-linear systems

As you may have noticed, non-linear systems violate the first assumption of a linear quadratic regulator; that the system is linear. That doesn’t mean that we can’t apply it, it just means that it’s not going to be optimal. How poorly the LQR will perform depends on a few things, two important factors being how non-linear the system dynamics actually are, and how often you’re able to update the feedback gain matrix $\textbf{K}$. To apply LQR to non-linear systems we’re just going to close our eyes and pretend that the system dynamics are linear, i.e. they fit the form

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u}.$

We’ll do this by approximating the actual dynamics of the system linearly. We’ll then solve for our gain value $\textbf{K}$, generate our control signal for this timestep, and then re-approximate the dynamics again at the next time step and solve for $\textbf{K}$ from the new state. The more non-linear the system dynamics are, the less appropriate $\textbf{K}$ will be for generating our control signal $\textbf{u}$ as we move away from the state $\textbf{K}$ was calculated in; this is why update time of the LQR can become an important factor.

Using finite-differences to approximate system dynamics

An important question, then, is how do we find this system approximation? How can we calculate the $\textbf{A}$ and $\textbf{B}$ matrices that we then use to solve for $\textbf{K}$? If we know the dynamics of the system to be

$\dot{\textbf{x}} = f(\textbf{x}, \textbf{u})$,

then we can calculate

$\textbf{A} = \frac{\partial f(\textbf{x}, \textbf{u})}{\partial \textbf{x}}, \;\;\;\; \textbf{B} = \frac{\partial f(\textbf{x}, \textbf{u})}{\partial \textbf{u}}$.

If you’re going to try this for the 3-link arm, though, get out Mathematica. Do not try this by hand. If you disregard my warning and foolhardily attempt such a derivation you will regret, repent, and then appeal to Wolfram Alpha for salvation. These equations quickly become terrible and long even for seemingly not-so-complicated systems.

There are a few ways to skirt this. Here we’re going to assume that the system under control is a simulation, or that we at least have access to an accurate model, and use the finite differences method to compute these values. The idea behind finite differences is to approximate the rate of change of the function $f$ at the point $x$ by sampling $f$ near $x$ and using the difference to calculate $\dot{f}(x)$. Here’s a picture for a 1D system:

So here, our current state $x$ is the blue dot, and the red dots represent the sample points $x + \Delta x$ and $x - \Delta x$. We can then calculate

$\dot{f}(x) \approx \frac{f(x+\Delta x) - f(x-\Delta x)}{2\Delta x},$

and you can see the actual rate of change of $f$ at $x$ plotted in the blue dashed line, and the approximated rate of change calculated using finite differences plotted in the red dashed line. We can also see that the approximated derivative is only accurate near $x$ (the blue dot).

Back in our multi-dimensional system, to use finite differences to calculate the derivative with respect to the state and the input we’re going to vary each of the dimensions of the state and input by some small amount one at a time, calculating the effects of each one by one. Here’s a chunk of pseudo-code to hopefully clarify this idea:

eps = 1e-5
A = np.zeros((len(current_state), len(current_state))
for ii in range(len(current_state)):
x = current_state.copy()
x[ii] += eps
x_inc = simulate_system(state=x, input=control_signal)
x = current_state.copy()
x[ii] -= eps
x_dec = simulate_system(state=x, input=control_signal)
A[:,ii] = (x_inc - x_dec) / (2 * eps)

B = np.zeros((len(current_state), len(control_signal))
for ii in range(len(control_signal)):
u = control_signal.copy()
u[ii] += eps
x_inc = simulate_system(state=current_state, input=u)
u = control_signal.copy()
u[ii] -= eps
x_dec = simulate_system(state=current_state, input=u)
B[:,ii] = (x_inc - x_dec) / (2 * eps)


Now we’re able to generate our $\textbf{A}$ and $\textbf{B}$ matrices we have everything we need to solve for our feedback gain matrix $\textbf{K}$! Which is great.

Note on using finite differences in continuous vs discrete setup

Something that’s important to straighten out too is what exactly is returned by the simulate_system function in the code above. In the continuous case, your system is captured as

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u},$,

where in the discrete case your system is defined

$\textbf{x}(t+1) = \textbf{A}\textbf{x}(t) + \textbf{B}\textbf{u}(t).$

If you are calculating your feedback gain matrix $\textbf{K}$ using the continuous solution to the algebraic Riccati equation, then you need to be returning $\dot{\textbf{x}}(t)$. If you’re solving for $\textbf{K}$ using the discrete solution to the algebraic Riccati equation you need to return $\textbf{x}(t+1)$. This was just something that I came across as I was coding and so I wanted to mention it here in case anyone else stumbled across it!

Applying LQR to 2 and 3 link arm control

Alright! Let’s have a look at how the LQR does controlling non-linear systems. Below we have the control of a 2-link arm compared to a 3-link arm, and you can see the control of the 2-link arm is better. This is a direct result of the dynamics of a 3-link arm being significantly more complex.

Note on controlling at different timesteps

When I was first testing the LQR controller I expected the effects of different control update times to be a lot more significant than it was. As it turns out, for controlling a 3-link arm, there’s not really a visible difference in a controller that is updating every .01 seconds vs every .001 seconds vs every .0001 seconds. Let’s have a look:

Can’t even tell, eh? Fun fact, the simulation took 1 minute 30 seconds at .01 seconds time step and 45 minutes at .0001 seconds time step. The left-most animation is the .01 seconds and the right-most the .0001 seconds. But why is there seemingly so little difference? Well, this boils down to the dynamics of the 3-link arm changing actually pretty slowly. Below I’ve plotted just a few of the elements from the $\textbf{A}$, $\textbf{B}$, and $\textbf{K}$ matrices over .5 seconds of simulation time:

So, there are some obvious points where sampling the dynamics at a .01 time step is noticeably less accurate, but all in all there’s not a huuuggge difference between sampling at .01 and .0001 seconds. If you’re just watching the end-effector path it’s really not very noticeable. You can see how the elements of $\textbf{A}$ and $\textbf{B}$ are changing fairly slowly; this means that $\textbf{K}$ is going to be an effective feedback gain for a fair chunk of time. And the computational savings you get by sampling the dynamics and regenerating $\textbf{K}$ every .01 seconds instead of every .0001 seconds are pretty big. This was just another thing that I came across when playing around with the LQR, the take away being don’t just assume you need to update your system crazy often. You might get very comparable performance for much less computational cost.

Conclusions

All in all, the LQR controller is pretty neat! It’s really simple to set up, and generic. We don’t need any specific information about the system dynamics, like we do for effective operational space control (OSC). When we estimate the dynamics with finite differences, all need is a decent system model that we can sample. Again, the more non-linear the system, of course, the less effective a LQR will be. If you’re interested in playing around with one, or generating the figures that I show above, the code is all up and running on my Github for you to explore.

## Setting up an arm simulation interface in Nengo 2

I got an email the other day asking about how to set up an arm controller in Nengo, where they had been working from the Spaun code to strip away things until they had just the motor portion left they could play with. I ended up putting together a quick script to get them started and thought I would share it here in case anyone else was interested. It’s kind of fun because it shows off some of the new GUI and node interfacing. Note that you’ll need nengo_gui version .15+ for this to work. In general I recommend getting the dev version installed, as it’s stable and updates are made all the time improving functionality.

Nengo 1.4 core was all written in Java, with Jython and Python scripting thrown in on top, and since then a lot of work has gone into the re-write of the entire code base for Nengo 2. Nengo 2 is now written in Python, all the scripting is in Python, and we have a kickass GUI and support for running neural simulations on CPUs, GPUs, and specialized neuromorphic hardware like SpiNNaKer. I super recommend checking it out if you’re at all interested in neural modelling, we’ve got a bunch of tutorials up and a very active support board to help with any questions or problems. You can find the simulator code for installation here: https://github.com/nengo/nengo and the GUI code here: https://github.com/nengo/nengo_gui, where you can also find installation instructions.

And once you have that up and running, to run an arm simulation you can download and run the following code I have up on my GitHub. When you pop it open at the top is a run_in_GUI boolean, which you can use to open the sim up in the GUI, if you set it to False then it will run in the Nengo simulator and once finished will pop up with some basic graphs. Shout out to Terry Stewart for putting together the arm-visualization. It’s a pretty slick little demo of the extensibility of the Nengo GUI, you can see the code for it all in the <code>arm_func</code> in the <code>nengo_arm.py</code> file.

As it’s set up right now, it uses a 2-link arm, but you can simply swap out the Arm.py file with whatever plant you want to control. And as for the neural model, there isn’t one implemented in here, it’s just a simple input node that runs through a neural population to apply torque to the two joints of the arm. But! It should be a good start for anyone looking to experiment with arm control in Nengo. Here’s what it looks like when you pull it up in the GUI (also note that the arm visualization only appears once you hit the play button!):

## Dynamic movement primitives part 2: Controlling end-effector trajectories

The dynamic movement primitive (DMP) framework was designed for trajectory control. It so happens that in previous posts we’ve built up to having several arm simulations that are ripe for throwing a trajectory controller on top, and that’s what we’ll do in this post. The system that we will be controlling here is the 3 link arm model with an operational space controller (OSC) that translates end-effector forces into joint torques. The DMPs here will be controlling the $(x,y)$ trajectory of the hand, and the OSC will take care of turning the desired hand forces into torques that can be applied to the arm. All of the code used to generate the animations throughout this post can of course be found up on my github (to run play around with variants of python run.py arm3 dmp write).

Controlling a 3 link arm with DMPs
We have our 3 link arm and its OSC controller; this whole setup we’ll collectively refer to as ‘the plant’ throughout this post. We are going to pass in some $(x,y)$ force signal to the plant, and the plant will carry it out. Additionally we’ll get a feedback signal with the $(x,y)$ position of the hand. At the same time, we also have a DMP system that’s doing its own thing, tracing out a desired trajectory in $(x,y)$ space. We have to tie these two systems together.

To do this is easy, we’ll generate the control signal for the plant from our DMP system simply by measuring the difference between the state of our DMP system and the plant state, use that to drive the plant to the state of the DMP system. Formally,

$u = k_p(y_{\textrm{DMP}} - y)$

where $y_{\textrm{DMP}}$ is the state of the DMP system, $y$ is the state of the plant, and $k_p$ and is the position error gain term.

Once we have this, we just go ahead and step our DMP system forward and make sure the gain values on the control signal are high enough that the plant follows the DMP’s trajectory. And that’s pretty much it, just run the DMP system to the end of the trajectory and then stop your simulation.

To give a demonstration of DMP control I’ve set up the DMP system to follow the same number trajectories that the SPAUN arm followed. As you can see the combination of DMPs and operational space control is much more effective than my previous implementation.

Incorporating system feedback

One of the issues in implementing the control above is that we have to be careful about how quickly the DMP trajectory moves, because while the DMP system isn’t constrained by any physical dynamics, the plant is. Depending on the size of the movement the DMP trajectory may be moving a foot a second or an inch a second. You can see above that the arm doesn’t fully draw out the desired trajectories in places where the DMP system moved too quickly in and out and sharp corners. The only way to remedy this without feedback is to have the DMP system move more slowly throughout the entire trajectory. What would be nice, instead, would be to just say ‘go as fast as you can, as long as the plant state is within some threshold distance of you’, and this is where system feedback comes in.

It’s actually very straightforward to implement this using system feedback: If the plant state drifts away from the state of the DMPs, slow down the execution speed of the DMP to allow the plant time to catch up. The do this we just have to multiply the DMP timestep $dt$ by a new term:

$1 / (1 + \alpha_{\textrm{err}}(y_{\textrm{plant}} - y_{\textrm{DMP}}))$.

All this new term does is slow down the canonical system when there’s an error, you can think of it as a scaling on the time step. Additionally, the sensitivity of this term can be modulated the scaling term $\alpha_{\textrm{err}}$ on the difference between the plant and DMP states.

We can get an idea of how this affects the system by looking at the dynamics of the canonical system when an error term is introduced mid-run:

When the error is introduced the dynamics of the system slow down, great! Lets look at an example comparing execution with and without this feedback term. Here’s the system drawing the number 3 without any feedback incorporation:

and here’s the system drawing the number 3 with the feedback term included:

These two examples are a pretty good case for including the feedback term into your DMP system. You can still see in the second case that the specified trajectory isn’t traced out exactly, but if that’s what you’re shooting for you can just crank up the $\alpha_{\textrm{err}}$ to make the DMP timestep really slow down whenever the DMP gets ahead of the plant at all.

Interpolating trajectories for imitation

When imitating trajectories there can be some issues with having enough sample points and how to fit them to the canonical system’s timeframe, they’re not difficult to get around but I thought I would go over what I did here. The approach I took was to always run the canonical system for 1 second, and whenever a trajectory is passed in that should be imitated to scale the x-axis of the trajectory such that it’s between 0 and 1. Then I shove it into an interpolator and use the resulting function to generate an abundance of nicely spaced sample points for the DMP imitator to match. Here’s the code for that:

# generate function to interpolate the desired trajectory
import scipy.interpolate

path = np.zeros((self.dmps, self.timesteps))
x = np.linspace(0, self.cs.run_time, y_des.shape[1])

for d in range(self.dmps):

# create interpolation function
path_gen = scipy.interpolate.interp1d(x, y_des[d])

# create evenly spaced sample points of desired trajectory
for t in range(self.timesteps):
path[d, t] = path_gen(t * self.dt)

y_des = path


Direct trajectory control vs DMP based control

Now, using the above described interpolation function we can just directly use its output to guide our system. And, in fact, when we do this we get very precise control of the end-effector, more precise than the DMP control, as it happens. The reason for this is because our DMP system is approximating the desired trajectory and with a set of basis functions, and some accuracy is being lost in this approximation.

So if we instead use the interpolation function to drive the plant we can get exactly the points that we specified. The times when this comes up especially are when the trajectories that you’re trying to imitate are especially complicated. There are ways to address this with DMPs by placing your basis functions more appropriately, but if you’re just looking for the exact replication of an input trajectory (as often people are) this is a simpler way to go. You can see the execution of this in the trace.py code up on my github. Here’s a comparison of a single word drawn using the interpolation function:

and here’s the same word drawn using a DMP system with 1,000 basis function per DOF:

We can see that just using the interpolation function here gives us the exact path that we specified, where using DMPs we have some error, and this error increases with the size of the desired trajectory. But! That’s for exactly following a given trajectory, which is often not the case. The strength of the DMP framework is that the trajectory is a dynamical system. This lets us do simple things to get really neat performance, like scale the trajectory spatially on the fly simply by changing the goal, rather than rescaling the entire trajectory:

Conclusions

Some basic examples of using DMPs to control the end-effector trajectory of an arm with operational space control were gone over here, and you can see that they work really nicely together. I like when things build like this. We also saw that power of DMPs in this situation is in their generalizability, and not in exact reproduction of a given path. If I have a single complex trajectory that I only want the end-effector to follow once then I’m going to be better off just interpolating that trajectory and feeding the coordinates into the arm controller rather than go through the whole process of setting up the DMPs.

Drawing words, though, is just one basic example of using the DMP framework. It’s a very simple application and really doesn’t do justice to the flexibility and power of DMPs. Other example applications include things like playing ping pong. This is done by creating a desired trajectory showing the robot how to swing a ping pong paddle, and then using a vision system to track the current location of the incoming ping pong ball and changing the target of the movement to compensate dynamically. There’s also some really awesome stuff with object avoidance, that is implemented by adding another term with some simple dynamics to the DMP. Discussed here, basically you just have another system that moves you away from the object with a strength relative to your distance from the object. You can also use DMPs to control gain terms on your PD control signal, which is useful for things like object manipulation.

And of course I haven’t touched on rhythmic DMPs or learning with DMPs at all, and those are both also really interesting topics! But this serves as a decent introduction to the whole area, which has been developed in the Schaal lab over the last decade or so. I recommend further reading with some of these papers if you’re interested, there are a ton of neat ways to apply the DMP framework! And, again, the code for everything here is up on my github in the control and pydmps repositories.

## Inverse kinematics of 3-link arm with constrained minimization in Python

Inverse kinematics is a common topic in robotics control; one that most anyone working with a robotic arm needs to address at some point. The use case I’ll be looking at here is the situation where we want to specify an $(x,y)$ coordinate for the end-effector, and solve for the appropriate joint angle configuration that best achieves this. It’s surprisingly straightforward to handle with the right approach, so it’s a nice little exercise in Python optimization (and we’ll even work in some graphical display!) And per usual, the code is all up on my github.

Most of the time the servos that anyone is working with are position controlled, meaning that you provide them with a set of desired angles for them to move to, rather than force controlled, where you provide them with a torque signal that directly moves them. In this case, inverse kinematics is all you need. You simply work out the set of joint angles you want the servos to move to and send those out directly. In biological systems this clearly isn’t the end of the work, because even if you know where you want to go you still have to provide a force (muscle activation) signal that will move your system there. But! as mentioned in the motor babbling paper review post I wrote a while back, having the target specified in the most basic space (i.e. joint angles instead of $(x,y)$ coordinates for a servo based arm, or muscle lengths instead of joint angles for a muscle based arm) can make calculating, or learning, the appropriate force signal much easier; so something that can do the inverse kinematics calculation is worth having.

Inverse kinematics

Now then, how do we go about finding the joint angle configuration that best achieves a desired $(x,y)$ coordinate? Constrained minimization! Our constraint is that the tip of the arm must be at the specified $(x,y)$ position, and we need to specify some cost function for the system to minimize to choose among all the possible configurations that accomplish this. The cost function that we’re going to use here is a very intuitive one, that I first encountered in the paper ‘Learning operational space control’ by Jan Peters. What we’re going to do is specify a default, or resting, joint state configuration, and minimize our distance from that. Which makes a lot of sense.

I’m going to post the code for calculating the inverse kinematics of a 3-link arm, and then we’ll work through it.

import math
import numpy as np
import scipy.optimize

def __init__(self, q=None, q0=None, L=None):
"""Set up the basic parameters of the arm.
All lists are in order [shoulder, elbow, wrist].

:param list q: the initial joint angles of the arm
:param list q0: the default (resting state) joint configuration
:param list L: the arm segment lengths
"""
# initial joint angles
if q is None: q = [.3, .3, 0]
self.q = q
# some default arm positions
if q0 is None: q0 = np.array([math.pi/4, math.pi/4, math.pi/4])
self.q0 = q0
# arm segment lengths
if L is None: L = np.array([1, 1, 1])
self.L = L

self.max_angles = [math.pi, math.pi, math.pi/4]
self.min_angles = [0, 0, -math.pi/4]

def get_xy(self, q=None):
if q is None: q = self.q

x = self.L[0]*np.cos(q[0]) + \
self.L[1]*np.cos(q[0]+q[1]) + \
self.L[2]*np.cos(np.sum(q))

y = self.L[0]*np.sin(q[0]) + \
self.L[1]*np.sin(q[0]+q[1]) + \
self.L[2]*np.sin(np.sum(q))

return [x, y]

def inv_kin(self, xy):

def distance_to_default(q, *args):
# weights found with trial and error, get some wrist bend, but not much
weight = [1, 1, 1.3]
return np.sqrt(np.sum([(qi - q0i)**2 * wi
for qi,q0i,wi in zip(q, self.q0, weight)]))

def x_constraint(q, xy):
x = ( self.L[0]*np.cos(q[0]) + self.L[1]*np.cos(q[0]+q[1]) +
self.L[2]*np.cos(np.sum(q)) ) - xy[0]
return x

def y_constraint(q, xy):
y = ( self.L[0]*np.sin(q[0]) + self.L[1]*np.sin(q[0]+q[1]) +
self.L[2]*np.sin(np.sum(q)) ) - xy[1]
return y

return scipy.optimize.fmin_slsqp( func=distance_to_default,
x0=self.q, eqcons=[x_constraint, y_constraint],
args=[xy], iprint=0) # iprint=0 suppresses output



I’ve taken out most of the comments for compactness, but there are plenty in the code on github, don’t you worry. Alright, let’s take a look!

First, we go through and set up the parameters of our arm. If nothing was passed in during construction, then we’re going to make an arm that is initially at it’s resting position, with shoulder and elbow joint angles = $\frac{\pi}{4}$, and wrist angle = 0. We also set up the arm segment lengths to all be 1, and the minimum and maximum joint angles.

Next, there’s the get_xy function, which just uses some simple trig to calculate the current $(x,y)$ position of the end-effector. This is going to be a handy thing to have when we want to see how well the joint configurations we calculate achieve our desired hand position.

And then after this, we have the actual inverse kinematics function. inv_kin takes in a desired $(x,y)$ position, and returns to us a list of the joint angles that will best achieve this. We define three methods inside inv_kin that define our constrained optimization problem, distance_to_default, which is the function we are minimizing, and x_constraint and y_constraint, which are the constraints that must be met for a solution to be valid. These are all pretty straight forward functions, the distance_to_default function calculates the Euclidean distance in joint space to the resting state, and the constraint functions calculate the difference between the actual and desired end-effector positions for a given set of joint angles.

There is one point of interest in the distance_to_default method, and that is the use of the weights therein. What is the point of applying a gain to these values, you ask? Why, simply that it lets us weight the relative importance of each of these joints remaining close to their default configuration! If you think about moving your arm on a horizontal plane, the wrist actually bends very little. To mimic this we can set the weight applied to the distance of the wrist from its resting state higher than those of the other joints, so that it becomes a higher priority to keep the wrist near its resting state. I suggest playing around with once the graphical display (below) is built up, it’s fun to see the effects different weightings can give on joint configurations.

With these three functions defined, we can now call up the scipy.optimize.fmin_slsqp function, which performs the sequential least squares method to arrive at a solution.

Really, that’s all there is to it!

To test this now, we have the following method:

def test():
############Test it!##################

# set of desired (x,y) hand positions
x = np.arange(-.75, .75, .05)
y = np.arange(0, .75, .05)

# threshold for printing out information, to find trouble spots
thresh = .025

count = 0
total_error = 0
# test it across the range of specified x and y values
for xi in range(len(x)):
for yi in range(len(y)):
# test the inv_kin function on a range of different targets
xy = [x[xi], y[yi]]
# run the inv_kin function, get the optimal joint angles
q = arm.inv_kin(xy=xy)
# find the (x,y) position of the hand given these angles
actual_xy = arm.get_xy(q)
# calculate the root squared error
error = np.sqrt((np.array(xy) - np.array(actual_xy))**2)
# total the error
total_error += error

if np.sum(error) > thresh:
print '-------------------------'
print 'Initial joint angles', arm.q
print 'Final joint angles: ', q
print 'Desired hand position: ', xy
print 'Actual hand position: ', actual_xy
print 'Error: ', error
print '-------------------------'

count += 1

print '\n---------Results---------'
print 'Total number of trials: ', count
print 'Total error: ', total_error
print '-------------------------'


Which is simply working through a set of target $(x,y)$ points, and calculating the total error across all of them. If any particular point gives an error above the threshold level, print out the information for that point so we can take a closer look if so desired. Fortunately, the threshold error isn’t even approached in this test, and for output we get:

---------Results---------
Total number of trials:  450
Total error:  [  3.33831421e-05   2.89667496e-05]
-------------------------


Visualization

So that’s all great and dandy, but it’s always nice to be able to really visualize these things, so I wrote another method that uses Pyglet to help visualize. This is a really easy to use graphics program that I tested out as a means here, and I was very happy with it. I ended up writing a method that pops up a window with a 3-link arm drawn on it, and the arm uses the above inverse kinematics function in the Arm class written above to calculate the appropriate joint angles for the arm to be at to follow the mouse. Once the joint angles are calculated, then their $(x,y)$ locations are also calculated, and the arm is drawn. Pleasingly, the inv_kin above is fast enough to work in real time, so it’s a kind of fun toy example. Again, I’ll show the code and then we’ll work through it below.

import numpy as np
import pyglet
import time

import Arm

def plot():
"""A function for plotting an arm, and having it calculate the
inverse kinematics such that given the mouse (x, y) position it
finds the appropriate joint angles to reach that point."""

# create an instance of the arm

# make our window for drawin'
window = pyglet.window.Window()

label = pyglet.text.Label('Mouse (x,y)', font_name='Times New Roman',
font_size=36, x=window.width//2, y=window.height//2,
anchor_x='center', anchor_y='center')

def get_joint_positions():
"""This method finds the (x,y) coordinates of each joint"""

x = np.array([ 0,
arm.L[0]*np.cos(arm.q[0]),
arm.L[0]*np.cos(arm.q[0]) + arm.L[1]*np.cos(arm.q[0]+arm.q[1]),
arm.L[0]*np.cos(arm.q[0]) + arm.L[1]*np.cos(arm.q[0]+arm.q[1]) +
arm.L[2]*np.cos(np.sum(arm.q)) ]) + window.width/2

y = np.array([ 0,
arm.L[0]*np.sin(arm.q[0]),
arm.L[0]*np.sin(arm.q[0]) + arm.L[1]*np.sin(arm.q[0]+arm.q[1]),
arm.L[0]*np.sin(arm.q[0]) + arm.L[1]*np.sin(arm.q[0]+arm.q[1]) +
arm.L[2]*np.sin(np.sum(arm.q)) ])

return np.array([x, y]).astype('int')

window.jps = get_joint_positions()

@window.event
def on_draw():
window.clear()
label.draw()
for i in range(3):
pyglet.graphics.draw(2, pyglet.gl.GL_LINES, ('v2i',
(window.jps[0][i], window.jps[1][i],
window.jps[0][i+1], window.jps[1][i+1])))

@window.event
def on_mouse_motion(x, y, dx, dy):
# call the inverse kinematics function of the arm
# to find the joint angles optimal for pointing at
# this position of the mouse
label.text = '(x,y) = (%.3f, %.3f)'%(x,y)
arm.q = arm.inv_kin([x - window.width/2, y]) # get new arm angles
window.jps = get_joint_positions() # get new joint (x,y) positions

pyglet.app.run()

plot()


There are really only a few things that we’re doing in this method: We create an arm with somewhat normal segment lengths and a window for plotting; we have a function for calculating the $(x,y)$ coordinates of each of the joints for plotting; we have a function that tracks the position of the mouse and updates our arm joint angles by calling the arm.inv_kin function we wrote above; and lastly we plot it! Easy. The functions aren’t quite in that order, and that’s because I chose alphabetical ordering over conceptual. Here’s a picture!

I think the only kind of weird thing going on in here is that my variables that I use in both this method and in the overwritten window methods are defined as belonging to window. Aside from that, it’s straightforward trig to calculate the joint angle Cartesian coordinates, and then there’s also a label that displays the mouse $(x,y)$ position for the hell of it. Honestly, the label was in a Pyglet tutorial example and I just left it in.

It’s not going to win any awards for graphics by any means, but it’s simple and it works! And, assuming you have Pyglet installed (which is very easy to do), you should straight up be able to run it and be amazed. Or a little impressed. Or not, whatever. But in any case now you have your very own inverse kinematics example on a 3-link arm! Moving to any other number of joints is exactly the same, you just have to add the corresponding number of default joint configurations and everything should work out as is.

From here, the plan is to look at bringing in an actual force driven arm model to the mix, and then building up a system that can track the mouse, but with a control signal that specifies forces to apply to the arm, rather than positions. Which I think will be pretty neat. And in case you missed it at the top, here’s a link to all the code on my github.