## Building a spiking neural model of adaptive arm control

About a year ago I published the work from my thesis in a paper called ‘A spiking neural model of adaptive arm control’. In this paper I presented the Recurrent Error-driven Adaptive Control Hierarchy (REACH) model. The goal of the model is to begin working towards reproducing behavioural level phenomena of human movement with biologically plausible spiking neural networks.

To do this, I start by using three methods from control literature (operational space control, dynamic movement primitives, and non-linear adaptive control) to create an algorithms level model of the motor control system that captures behavioural level phenomena of human movement. Then I explore how this functionality could be mapped to the primate brain and implemented in spiking neurons. Finally, I look at the data generated by this model on the behavioural level (e.g. kinematics of movement), the systems level (i.e. analysis of populations of neurons), and the single-cell level (e.g. correlating neural activity with movement parameters) and compare/contrast with experimental data.

By having a full model framework (from observable behaviour to neural spikes) is to have a more constrained computational model of the motor control system; adding lower-level biological constraints to behavioural models and higher-level behavioural constraints to neural models.

In general, once you have a model, the critical next step is to generating testable predictions that can be used to discriminate between other models with different implementations or underlying algorithms. Discriminative predictions allow us to design experiments that can gather evidence in favour or against different hypotheses of brain function, and provide clues to useful directions for further research. Which is the real contribution of computational modeling.

So that’s a quick overview of the paper; there are quite a few pages of supplementary information that describe the details of the model implementation, and I provided the code and data used to generate the data analysis figures. However, code to explicitly run the model on your own has been missing. As one of the major points of this blog is to provide code for furthering research, this is pretty embarrassing. So, to begin to remedy this, in this post I’m going to work through a REACH framework for building models to control a two-link arm through reaching in a line, tracing a circle, performing the centre-out reaching task, and adapting online to unexpected perturbations during reaching imposed by a joint-velocity based forcefield.

This post is directed towards those who have already read the paper (although not necessarily the supplementary material). To run the scripts you’ll need Nengo, Nengo GUI, and NengoLib all installed. There’s a description of the theory behind the Neural Engineering Framework, which I use extensively in my Nengo modeling, in the paper. I’m hoping that between that and code readability / my explanations below that most will be comfortable starting to play around with the code. But if you’re not, and would like more resources, you can check out the Getting Started page on the Nengo website, the tutorials from How To Build a Brain, and the examples in the Nengo GUI.

You can find all of the code up on my GitHub.

NOTE: I’m using the two-link arm (which is fully implemented in Python) instead of the three-link arm (which has compile issues for Macs) both so that everyone should be able to run the model arm code and to reduce the number of neurons that are required for control, so that hopefully you can run this on you laptop in the event that you don’t have a super power Ubuntu work station. Scaling this model up to the three-link arm is straight-forward though, and I will work on getting code up (for the three-link arm for non-Mac users) as a next project.

Implementing PMC – the trajectory generation system

I’ve talked at length about dynamic movement primitives (DMPs) in previous posts, so I won’t describe those again here. Instead I will focus on their implementation in neurons.

def generate(y_des, speed=1, alpha=1000.0):
beta = alpha / 4.0

# generate the forcing function
forces, _, goals = forcing_functions.generate(
y_des=y_des, rhythmic=False, alpha=alpha, beta=beta)

# create alpha synapse, which has point attractor dynamics
tau = np.sqrt(1.0 / (alpha*beta))
alpha_synapse = nengolib.Alpha(tau)

net = nengo.Network('PMC')
with net:
net.output = nengo.Node(size_in=2)

# create a start / stop movement signal
time_func = lambda t: min(max(
(t * speed) % 4.5 - 2.5, -1), 1)

def goal_func(t):
t = time_func(t)
if t <= -1:
return goals[0]
return goals[1]
net.goal = nengo.Node(output=goal_func, label='goal')

# -------------------- Ramp ---------------------------
ramp_node = nengo.Node(output=time_func, label='ramp')
net.ramp = nengo.Ensemble(
n_neurons=500, dimensions=1, label='ramp ens')
nengo.Connection(ramp_node, net.ramp)

# ------------------- Forcing Functions ---------------
def relay_func(t, x):
t = time_func(t)
if t <= -1:
return [0, 0]
return x
# the relay prevents forces from being sent on reset
relay = nengo.Node(output=relay_func, size_in=2)

domain = np.linspace(-1, 1, len(forces[0]))
x_func = interpolate.interp1d(domain, forces[0])
y_func = interpolate.interp1d(domain, forces[1])
transform=1.0/alpha/beta
nengo.Connection(net.ramp, relay[0], transform=transform,
function=x_func, synapse=alpha_synapse)
nengo.Connection(net.ramp, relay[1], transform=transform,
function=y_func, synapse=alpha_synapse)
nengo.Connection(relay, net.output)

nengo.Connection(net.goal[0], net.output[0],
synapse=alpha_synapse)
nengo.Connection(net.goal[1], net.output[1],
synapse=alpha_synapse)
return net


The generate method for the PMC takes in a desired trajectory, y_des, as a parameter. The first thing we do (on lines 5-6) is calculate the forcing function that will push the DMP point attractor along the desired trajectory.

The next thing (on lines 9-10) is creating an Alpha (second-order low-pass filter) synapse. By writing out the dynamics of a point attractor in Laplace space, one of the lab members, Aaron Voelker, noticed that the dynamics could be fully implemented by creating an Alpha synapse with an appropriate choice of tau. I walk through all of the math behind this in this post. Here we’ll use that more compact method and project directly to the output node, which improves performance and reduces the number of neurons.

Inside the PMC network we create a time_func node, which is the pace-setter during simulation. It will output a linear ramp from -1 to 1 every few seconds, with the pace set by the speed parameter, and then pause before restarting.

We also have a goal node, which will provide a target starting and ending point for the trajectory. Both the time_func and goal nodes are created and used as a model simulation convenience, and proposed to be generated elsewhere in the brain (the basal ganglia, why not? #igotreasons #provemewrong).

The ramp ensemble is the most important component of the trajectory generation system. It takes the output from the time_func node as input, and generates the forcing function which will guide our little system through the trajectory that was passed in. The ensemble itself is nothing special, but rather the function that it approximates on its outgoing connection. We set up this function approximation with the following code:

        domain = np.linspace(-1, 1, len(forces[0]))
x_func = interpolate.interp1d(domain, forces[0])
y_func = interpolate.interp1d(domain, forces[1])
transform=1.0/alpha/beta
nengo.Connection(net.ramp, relay[0], transform=transform,
function=x_func, synapse=alpha_synapse)
nengo.Connection(net.ramp, relay[1], transform=transform,
function=y_func, synapse=alpha_synapse)
nengo.Connection(relay, net.output)


We want the forcing function be generated as the signals represented in the ramp ensemble moves from -1 to 1. To achieve this, we create interpolation functions, x_func and y_func, which are set up to generate the forcing function values mapped to input values between -1 and 1. We pass these functions into the outgoing connections from the ramp population (one for x and one for y). So now when the ramp ensemble is representing -1, 0, and 1 the output along the two connections will be the starting, middle, and ending x and y points of the forcing function trajectory. The transform and synapse are set on each connection with the appropriate gain values and Alpha synapse, respectively, to implement point attractor dynamics.

NOTE: The above DMP implementation can generate a trajectory signal with as many dimensions as you would like, and all that’s required is adding another outgoing Connection projecting from the ramp ensemble.

The last thing in the code is hooking up the goal node to the output, which completes the point attractor implementation.

Implementing M1 – the kinematics of operational space control

In REACH, we’ve modelled the primary motor cortex (M1) as responsible for taking in a desired hand movement (i.e. target_position - current_hand_position) and calculating a set of joint torques to carry that out. Explicitly, M1 generates the kinematics component of an OSC signal:

$\textbf{u}_\textrm{M1} = \textbf{J}^T \textbf{M}_\textbf{x} (k_p (\textbf{x}^* - \textbf{x}))$

In the paper I did this using several populations of neurons, one to calculate the Jacobian, and then an EnsembleArray to perform the multiplication for the dot product of each dimension separately. Since then I’ve had the chance to rework things and it’s now done entirely in one ensemble.

Now, the set up for the M1 model that computes the above function is to have a single ensemble of neurons that takes in the joint angles, $\textbf{q}$, and control signal $\textbf{u}_\textbf{x} = k_p (\textbf{x}^* - \textbf{x})$, and outputs the function above. Sounds pretty simple, right? Simple is good.

Let’s look at the code (where I’ve stripped out the comments and some debugging code):

def generate(arm, kp=1, operational_space=True,
inertia_compensation=True, means=None, scales=None):

dim = arm.DOF + 2

means = np.zeros(dim) if means is None else means
scales = np.ones(dim) if scales is None else scales
scale_down, scale_up = generate_scaling_functions(
np.asarray(means), np.asarray(scales))

net = nengo.Network('M1')
with net:
# create / connect up M1 ------------------------------
net.M1 = nengo.Ensemble(
n_neurons=1000, dimensions=dim,
intercepts=AreaIntercepts(
dimensions=dim,
base=nengo.dists.Uniform(-1, .1)))

# expecting input in form [q, x_des]
net.input = nengo.Node(output=scale_down, size_in=dim)
net.output = nengo.Node(size_in=arm.DOF)

def M1_func(x, operational_space, inertia_compensation):
""" calculate the kinematics of the OSC signal """
x = scale_up(x)
q = x[:arm.DOF]
x_des = kp * x[arm.DOF:]

# calculate hand (dx, dy)
if operational_space:
J = arm.J(q=q)

if inertia_compensation:
# account for inertia
Mx = arm.Mx(q=q)
x_des = np.dot(Mx, x_des)
# transform to joint torques
u = np.dot(J.T, x_des)
else:
u = x_des

if inertia_compensation:
# account for mass
M = arm.M(q=q)
u = np.dot(M, x_des)

return u

# send in system feedback and target information
# don't account for synapses twice
nengo.Connection(net.input, net.M1, synapse=None)
nengo.Connection(
net.M1, net.output,
function=lambda x: M1_func(
x, operational_space, inertia_compensation),
synapse=None)

return net


The ensemble of neurons itself is created with a few interesting parameters:

        net.M1 = nengo.Ensemble(
n_neurons=1000, dimensions=dim,
intercepts=AreaIntercepts(
dimensions=dim, base=nengo.dists.Uniform(-1, .1)))


Specifically, the radius and intercepts parameters.

Setting the intercepts

First we’ll discuss setting the intercepts using the AreaIntercepts distribution. The intercepts of a neuron determine how much of state space a neuron is active in, which we’ll refer to as the ‘active proportion’. If you don’t know what kind of functions you want to approximate with your neurons, then you having the active proportions for your ensemble chosen from a uniform distribution is a good starting point. This means that you’ll have roughly the same number of neurons active across all of state space as you do neurons that are active over half of state space as you do neurons that are active over very small portions of state space.

By default, Nengo sets the intercepts such that the distribution of active proportions is uniform for lower dimensional spaces. But when you start moving into higher dimensional spaces (into a hypersphere) the default method breaks down and you get mostly neurons that are either active for all of state space or almost none of state space. The AreaIntercepts class calculates how the intercepts should be set to achieve the desire active proportion inside a hypersphere. There are a lot more details here that you can read up on in this IPython notebook by Dr. Terrence C. Stewart.

What you need to know right now is that we’re setting the intercepts of the neurons such that the percentage of state space for which any given neuron is active is chosen from a uniform distribution between 0% and 55%. In other words, a neuron will maximally be active in 55% of state space, no more. This will let us model more nonlinear functions (such as the kinematics of the OSC signal) with fewer neurons. If this description is clear as mud, I really recommend checking out the IPython notebook linked above to get an intuition for what I’m talking about.

Scaling the input signal

The other parameter we set on the M1 ensemble is the radius. The radius scales the range of input values that the ensemble can represent, which is by default anything inside the unit hypersphere (i.e. hypersphere with radius=1). If the radius is left at this default value, the neural activity will saturate for vectors with magnitude greater than 1, leading to inaccurate vector representation and function approximation for input vectors with magnitude > 1. For lower dimensions this isn’t a terrible problem, but as the dimensions of the state space you’re representing grow it becomes more common for input vectors to have a norm greater than 1. Typically, we’d like to be able to, at a minimum, represent vectors with any number of dimensions where any element can be anywhere between -1 and 1. To do this, all we have to do is calculate the norm of the unit vector size dim, which is np.sqrt(dim) (the magnitude of a vector size dim with all elements set to one).

Now that we’re able to represent vectors where the input values are all between -1 and 1, the last part of this sub-network is scaling the input to be between -1 and 1. We use two scaling functions, scale_down and scale_up. The scale_down function subtracts a mean value and scales the input signal to be between -1 and 1. The scale_up function reverts the vector back to it’s original values so that calculations can be carried out normally on the decoding. In choosing mean and scaling values, there are two ways we can set these functions up:

1. Set them generally, based on the upper and lower bounds of the input signal. For M1, the input is $[\textbf{q}, \textbf{u}_\textbf{x}]$ where $\textbf{u}_\textbf{x}$ is the control signal in end-effector space, we know that the joint angles are always in the range 0 to $\pi$ (because that’s how the arm simulation is programmed), so we can set the means and scales to be $\frac{\pi}{2}$ for $\textbf{q}$. For $\textbf{u}$ a mean of zero is reasonable, and we can choose (arbitrarily, empirically, or analytically) the largest task space control signal we want to represent accurately.
2. Or, if we know the model will be performing a specific task, we can look at the range of input values encountered during that task and set the means and scales terms appropriately. For the task of reaching in a straight line, the arm moves in a very limited state space and we can set the mean and we can tune these parameter to be very specific:
                             means=[0.6, 2.2, 0, 0],
scales=[.25, .25, .25, .25]


The benefit of the second method, although one can argue it’s parameter tuning and makes things less biologically plausible, is that it lets us run simulations with fewer neurons. The first method works for all of state space, given enough neurons, but seeing as we don’t always want to be simulating 100k+ neurons we’re using the second method here. By tuning the scaling functions more specifically we’re able to run our model using 1k neurons (and could probably get away with fewer). It’s important to keep in mind though that if the arm moves outside the expected range the control will become unstable.

Implementing CB – the dynamics of operational space control

The cerebellum (CB) sub-network has two components to it: dynamics compensation and dynamics adaptation. First we’ll discuss the dynamics compensation. By which I mean the $k_v \textbf{M} (\textbf{q}^* - \textbf{q})$ term from the OSC signal.

Much like the calculating the kinematics term of the OSC signal in M1, we can calculate the entire dynamics compensation term using a single ensemble with an appropriate radius, scaled inputs, and well chosen intercepts.

def generate(arm, kv=1, learning_rate=None, learned_weights=None,
means=None, scales=None):
dim = arm.DOF * 2

means = np.zeros(dim) if means is None else means
scales = np.ones(dim) if scales is None else scales
scale_down, scale_up = generate_scaling_functions(
np.asarray(means), np.asarray(scales))

net = nengo.Network('CB')
with net:
# create / connect up CB --------------------------------
net.CB = nengo.Ensemble(
n_neurons=1000, dimensions=dim,
intercepts=AreaIntercepts(
dimensions=dim,
base=nengo.dists.Uniform(-1, .1)))
# expecting input in form [q, dq, u]
net.input = nengo.Node(output=scale_down,
size_in=dim+arm.DOF+2)
cb_input = nengo.Node(size_in=dim, label='CB input')
nengo.Connection(net.input[:dim], cb_input)
net.output = nengo.Node(size_in=arm.DOF*2)

def CB_func(x):
""" calculate the dynamic component of OSC signal """
x = scale_up(x)
q = x[:arm.DOF]
dq = x[arm.DOF:arm.DOF*2]

# calculate inertia matrix
M = arm.M(q=q)
return -np.dot(M, kv * dq)
# connect up the input and output
nengo.Connection(net.input[:dim], net.CB)
nengo.Connection(net.CB, net.output[:arm.DOF],
function=CB_func, synapse=None)


I don’t think there’s anything noteworthy going on here, most of the relevant details have already been discussed…so we’ll move on to the adaptation!

Implementing CB – non-linear dynamics adaptation

The final part of the model is the non-linear dynamics adaptation, modelled as a separate ensemble in the cerebellar sub-network (a separate ensemble so that it’s more modular, the learning connection could also come off of the other CB population). I work through the details and proof of the learning rule in the paper, so I’m not going to discuss that here. But I will restate the learning rule here:

$\dot{\textbf{d}} = \textbf{L}_d \textbf{A} \otimes \textbf{u},$

where $\textbf{d}$ are the neural decoders, $\textbf{L}_d$ is the learning rate, $\textbf{A}$ is the neural activity of the ensemble, and $\textbf{u}$ is the joint space control signal sent to the arm. This is a basic delta learning rule, where the decoders of the active neurons are modified to push the decoded function in a direction that reduces the error.

The adaptive ensemble can be initialized either using saved weights (passed in with the learned_weights paramater) or as all zeros. It is important to note that setting decoders to all zeros means that does not mean having zero neural activity, so learning will not be affected by this initialization.

        # dynamics adaptation------------------------------------
if learning_rate is not None:
n_neurons=1000, dimensions=arm.DOF*2,
# enforce spiking neurons
neuron_type=nengo.LIF(),
intercepts=AreaIntercepts(
dimensions=arm.DOF,
base=nengo.dists.Uniform(-.5, .2)))

net.learn_encoders = nengo.Connection(

# if no saved weights were passed in start from zero
weights = (
learned_weights if learned_weights is not None
net.learn_conn = nengo.Connection(
# connect directly to arm so that adaptive signal
# is not included in the training signal
transform=weights,
learning_rule_type=nengo.PES(
learning_rate=learning_rate),
synapse=None)

nengo.Connection(net.input[dim:dim+2],
net.learn_conn.learning_rule,
transform=-1, synapse=.01)
return net


We’re able to implement that learning rule using Nengo’s prescribed error-sensitivity (PES) learning on our connection from the adaptive ensemble to the output. With this set up the system will be able to learn to adapt to perturbations that are functions of the input (set here to be $[\textbf{q}, \dot{\textbf{q}}]$).

The intercepts in this population are set to values I found worked well for adapting to a few different forces, but it’s definitely a parameter to play with in your own scripts if you’re finding that there’s too much or not enough generalization of the decoded function output across the state space.

One other thing to mention is that we need to have a relay node to amalgamate the control signals output from M1 and the dynamics compensation ensemble in the CB. This signal is used to train the adaptive ensemble, and it’s important that the adaptive ensemble’s output is not included in the training signal, or else the system quickly goes off to positive or negative infinity.

Implementing S1 – a placeholder

The last sub-network in the REACH model is a placeholder for a primary sensory cortex (S1) model. This is just a set of ensembles that represents the feedback from the arm and relay it on to the rest of the model.

def generate(arm, direct_mode=False, means=None, scales=None):
dim = arm.DOF*2 + 2  # represents [q, dq, hand_xy]

means = np.zeros(dim) if means is None else means
scales = np.ones(dim) if scales is None else scales
scale_down, scale_up = generate_scaling_functions(
np.asarray(means), np.asarray(scales))

net = nengo.Network('S1')
with net:
# create / connect up S1 --------------------------------
net.S1 = nengo.networks.EnsembleArray(
n_neurons=50, n_ensembles=dim)

# expecting input in form [q, x_des]
net.input = nengo.Node(output=scale_down, size_in=dim)
net.output = nengo.Node(
lambda t, x: scale_up(x), size_in=dim)

# send in system feedback and target information
# don't account for synapses twice
nengo.Connection(net.input, net.S1.input, synapse=None)
nengo.Connection(net.S1.output, net.output, synapse=None)

return net


Since there’s no function that we’re decoding off of the represented variables we can use separate ensembles to represent each dimension with an EnsembleArray. If we were going to decode some function of, for example, q0 and dq0, then we would need an ensemble that represents both variables. But since we’re just decoding out f(x) = x, using an EnsembleArray is a convenient way to decrease the number of neurons needed to accurately represent the input.

Creating a model using the framework

The REACH model has been set up to be as much of a plug and play system as possible. To generate a model you first create the M1, PMC, CB, and S1 networks, and then they’re all hooked up to each other using the framework.py file. Here’s an example script that controls the arm to trace a circle:

def generate():
kp = 200
kv = np.sqrt(kp) * 1.5

center = np.array([0, 1.25])
# set the initial position of the arm
arm_sim.init_q = arm_sim.inv_kinematics(center)
arm_sim.reset()

net = nengo.Network(seed=0)
with net:
net.dim = arm_sim.DOF
net.arm_node = arm_sim.create_nengo_node()
net.error = nengo.Ensemble(1000, 2)
net.xy = nengo.Node(size_in=2)

# create an M1 model -------------------------------------
net.M1 = M1.generate(arm_sim, kp=kp,
operational_space=True,
inertia_compensation=True,
means=[0.6, 2.2, 0, 0],
scales=[.5, .5, .25, .25])

# create an S1 model -------------------------------------
net.S1 = S1.generate(arm_sim,
means=[.6, 2.2, -.5, 0, 0, 1.25],
scales=[.5, .5, 1.7, 1.5, .75, .75])

# subtract current position to get task space direction
nengo.Connection(net.S1.output[net.dim*2:], net.error,
transform=-1)

# create a trajectory for the hand to follow -------------
x = np.linspace(0.0, 2.0*np.pi, 100)
PMC_trajectory = np.vstack([np.cos(x) * .5,
np.sin(x) * .5])
PMC_trajectory += center[:, None]
# create / connect up PMC --------------------------------
net.PMC = PMC.generate(PMC_trajectory, speed=1)
# send target for calculating control signal
nengo.Connection(net.PMC.output, net.error)
# send target (x,y) for plotting
nengo.Connection(net.PMC.output, net.xy)

net.CB = CB.generate(arm_sim, kv=kv,
means=[0.6, 2.2, -.5, 0],
scales=[.5, .5, 1.6, 1.5])

model = framework.generate(net=net, probes_on=True)
return model


In line 50 you can see the call to the framework code, which will hook up the most common connections that don’t vary between the different scripts.

The REACH model has assigned functionality to each area / sub-network, and you can see the expected input / output in the comments at the top of each sub-network file, but the implementations are open. You can create your own M1, PMC, CB, or S1 sub-networks and try them out in the context of a full model that generates high-level movement behaviour.

Running the model

To run the model you’ll need Nengo, Nengo GUI, and NengoLib all installed. You can then pull open Nengo GUI and load any of the a# scripts. In all of these scripts the S1 model is just an ensemble that represents the output from the arm_node. Here’s what each of the scripts does:

1. a01 has a spiking M1 and CB, dynamics adaptation turned off. The model guides the arm in reaching in a straight line to a single target and back.
2. a02 has a spiking M1, PMC, and CB, dynamics adaptation turned off. The PMC generates a path for the hand to follow that traces out a circle.
3. a03 has a spiking M1, PMC, and CB, dynamics adaptation turned off. The PMC generates a path for the joints to follow, which moves the hand in a straight line to a target and back.
4. a04 has a spiking M1 and CB, dynamics adaptation turned off. The model performs the centre-out reaching task, starting at a central point and reaching to 8 points around a circle.
5. a05 has a spiking M1 and CB, and dynamics adaptation turned on. The model performs the centre-out reaching task, starting at a central point and reaching to 8 points around a circle. As the model reaches, a forcefield is applied based on the joint velocities that pushes the arm as it tries to reach the target. After 100-150 seconds of simulation the arm has adapted and learned to reach in a straight line again.

Here’s what it looks like when you pull open a02 in Nengo GUI:

I’m not going to win any awards for arm animation, but! It’s still a useful visualization, and if anyone is proficient in javascript and want’s to improve it, please do! You can see the network architecture in the top left, the spikes generated by M1 and CB to the right of that, the arm in the bottom left, and the path traced out by the hand just to the right of that. On the top right you can see the a02 script code, and below that the Nengo console.

Conclusions

One of the most immediate extensions (aside from any sort of model of S1) that comes to mind here is implementing a more detailed cerebellar model, as there are several anatomically detailed models which have the same supervised learner functionality (for example (Yamazaki and Nagao, 2012)).

Hopefully people find this post and the code useful for their own work, or at least interesting! In the ideal scenario this would be a community project, where researchers add models of different brain areas and we end up with a large library of implementations to build larger models with in a Mr. Potato Head kind of fashion.

You can find all of the code up on my GitHub. And again, this code all should have been publicly available along with the publication. Hopefully the code proves useful! If you have any questions about it please don’t hesitate to make a comment here or contact me through email, and I’ll get back to you as soon as I can.

## Using SymPy’s lambdify for generating transform matrices and Jacobians

I’ve been working in VREP with some of their different robot models and testing out force control, and one of the things that becomes pretty important for efficient workflow is to have a streamlined method for setting up the transform matrices and calculating the Jacobians for different robots. You do not want to be working these all out by hand and then writing them in yourself.

A solution that’s been working well for me (and is fully implemented in Python) is to use SymPy to set up the basic transform matrices, from each joint to the next, and then use it’s derivative function to calculate the Jacobian. Once this is calculated you can then use SymPy’s lambdify function to parameterize this, and off you go! In this post I’m going to work through an example for controlling VREP’s UR5 arm using force control. And if you’re just looking for code examples, you can find it all up on my GitHub.

Edit: Updated the code to be much nicer, added saving of calculated functions, and a null space controller.
Edit 2: Removed the simplify calls from the code, not worth the generation time increase (if execution is a great concern can generate the functions using cython).

Setting up the transform matrices

This is the part of this process that is unique to each arm. The goal is to set up a system so that you can specify your transforms for each joint (and to each centre-of-mass (COM) too, of course) and then be on your way. So here’s the UR5, cute thing that it is:

For the UR5, there are 6 joints, and we’re going to be specifying 9 transform matrices: 6 joints and the COM for three arm segments (the remaining arm segment COMs are centered at their respective joint’s reference frame). The joints are all rotary, with 0 and 4 rotating around the z-axis, and the rest all rotating around x.

So first, we’re going to create a class called robot_config. Then, to create our transform matrices in SymPy first we need to set up the variables we’re going to be using:

# set up our joint angle symbols (6th angle doesn't affect any kinematics)
self.q = [sp.Symbol('q%i'%ii) for ii in range(6)]
# segment lengths associated with each joint
self.L = np.array([0.0935, .13453, .4251, .12, .3921, .0935, .0935, .0935])


where self.q is an array storing all our joint angle symbols, and self.L is an array of all of the offsets for each joint and arm segment lengths.

Using these to create the transform matrices for the joints, we get a set up that looks like this:

# transform matrix from origin to joint 0 reference frame
self.T0org = sp.Matrix([[sp.cos(self.q[0]), -sp.sin(self.q[0]), 0, 0],
[sp.sin(self.q[0]), sp.cos(self.q[0]), 0, 0],
[0, 0, 1, self.L[0]],
[0, 0, 0, 1]])

# transform matrix from joint 0 to joint 1 reference frame
self.T10 = sp.Matrix([[1, 0, 0, -L[1]],
[0, sp.cos(-self.q[1] + sp.pi/2),
-sp.sin(-self.q[1] + sp.pi/2), 0],
[0, sp.sin(-self.q[1] + sp.pi/2),
sp.cos(-self.q[1] + sp.pi/2), 0],
[0, 0, 0, 1]])

# transform matrix from joint 1 to joint 2 reference frame
self.T21 = sp.Matrix([[1, 0, 0, 0],
[0, sp.cos(-self.q[2]),
-sp.sin(-self.q[2]), self.L[2]],
[0, sp.sin(-self.q[2]),
sp.cos(-self.q[2]), 0],
[0, 0, 0, 1]])

# transform matrix from joint 2 to joint 3
self.T32 = sp.Matrix([[1, 0, 0, L[3]],
[0, sp.cos(-self.q[3] - sp.pi/2),
-sp.sin(-self.q[3] - sp.pi/2), self.L[4]],
[0, sp.sin(-self.q[3] - sp.pi/2),
sp.cos(-self.q[3] - sp.pi/2), 0],
[0, 0, 0, 1]])

# transform matrix from joint 3 to joint 4
self.T43 = sp.Matrix([[sp.sin(-self.q[4] - sp.pi/2),
sp.cos(-self.q[4] - sp.pi/2), 0, -self.L[5]],
[sp.cos(-self.q[4] - sp.pi/2),
-sp.sin(-self.q[4] - sp.pi/2), 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]])

# transform matrix from joint 4 to joint 5
self.T54 = sp.Matrix([[1, 0, 0, 0],
[0, sp.cos(self.q[5]), -sp.sin(self.q[5]), 0],
[0, sp.sin(self.q[5]), sp.cos(self.q[5]), self.L[6]],
[0, 0, 0, 1]])

# transform matrix from joint 5 to end-effector
self.TEE5 = sp.Matrix([[1, 0, 0, self.L[7]],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]])


You can see a bunch of offsetting of the joint angles by -sp.pi/2 and this is to account for the expected 0 angle (straight along the reference frame’s x-axis) at those joints being different than the 0 angle defined in the VREP simulation (at a 90 degrees from the x-axis). You can find these by either looking at and finding the joints’ 0 position in VREP or by trial-and-error empirical analysis.

Once you have your transforms, then you have to specify how to move from the origin to each reference frame of interest (the joints and COMs). For that, I’ve just set up a simple function with a switch statement:

# point of interest in the reference frame (right at the origin)
self.x = sp.Matrix([0,0,0,1])

def _calc_T(self, name, lambdify=True):
""" Uses Sympy to generate the transform for a joint or link

name string: name of the joint or link, or end-effector
lambdify boolean: if True returns a function to calculate
the transform. If False returns the Sympy
matrix
"""

# check to see if we have our transformation saved in file
if os.path.isfile('%s/%s.T' % (self.config_folder, name)):
Tx = cloudpickle.load(open('%s/%s.T' % (self.config_folder, name),
'rb'))
else:
if name == 'joint0' or name == 'link0':
T = self.T0org
elif name == 'joint1' or name == 'link1':
T = self.T0org * self.T10
elif name == 'joint2':
T = self.T0org * self.T10 * self.T21
T = self.T0org * self.T10 * self.Tl21
elif name == 'joint3':
T = self.T0org * self.T10 * self.T21 * self.T32
T = self.T0org * self.T10 * self.T21 * self.Tl32
elif name == 'joint4' or name == 'link4':
T = self.T0org * self.T10 * self.T21 * self.T32 * self.T43
elif name == 'joint5' or name == 'link5':
T = self.T0org * self.T10 * self.T21 * self.T32 * self.T43 * \
self.T54
elif name == 'link6' or name == 'EE':
T = self.T0org * self.T10 * self.T21 * self.T32 * self.T43 * \
self.T54 * self.TEE5
Tx = T * self.x  # to convert from transform matrix to (x,y,z)

# save to file
cloudpickle.dump(Tx, open('%s/%s.T' % (self.config_folder, name),
'wb'))

if lambdify is False:
return Tx
return sp.lambdify(self.q, Tx)


So the first part is pretty straight forward, create the transform matrix, and then at the end to get the (x,y,z) position we just multiply by a vector we created that represents a point at the origin of the last reference frame. Some of the transform matrices (the ones to the arm segments) I didn’t specify above just to cut down on space.

The second part is where we use this awesome lambify function, which lets us turn the matrix we’ve defined into a function, so that we can pass in joint angles and get back out the resulting (x,y,z) position. There’s also the option to get the straight up SymPy matrix return, in case you need the symbolic form (which we will!).

NOTE: You can also see that there’s a check built in to look for saved files, and to just load those saved files instead of recalculating things if they’re available. This is because calculating some of these matrices and their derivatives takes a long, long time. I used the cloudpickle module to do this because it’s able to easily handle saving a whole bunch of weird things that makes normal pickle sour.

Calculating the Jacobian

So now that we’re able to quickly generate the transform matrix for each point of interest on the UR5, we simply take the derivative of the equation for each (x,y,z) coordinate with respect to each joint angle to generate our Jacobian.

def _calc_J(self, name, lambdify=True):
""" Uses Sympy to generate the Jacobian for a joint or link

name string: name of the joint or link, or end-effector
lambdify boolean: if True returns a function to calculate
the Jacobian. If False returns the Sympy
matrix
"""

# check to see if we have our Jacobian saved in file
if os.path.isfile('%s/%s.J' % (self.config_folder, name)):
(self.config_folder, name), 'rb'))
else:
Tx = self._calc_T(name, lambdify=False)
J = []
# calculate derivative of (x,y,z) wrt to each joint
for ii in range(self.num_joints):
J.append([])
J[ii].append(Tx[0].diff(self.q[ii]))  # dx/dq[ii]
J[ii].append(Tx[1].diff(self.q[ii]))  # dy/dq[ii]
J[ii].append(Tx[2].diff(self.q[ii]))  # dz/dq[ii]


Here we retrieve the Tx vector from our _calc_T function, and then calculate the derivatives. When calculating the Jacobian for the end-effector, this is all we need! Huzzah!

But to calculate the Jacobian for transforming the inertia matrices of each arm segment into joint space we’re going to need the orientation information added to our Jacobian as well. This we know ahead of time, for each joint it’s a 3D vector with a 1 on the axis being rotated around. So we can predefine this:

# orientation part of the Jacobian (compensating for orientations)
self.J_orientation = [
[0, 0, 1], # joint 0 rotates around z axis
[1, 0, 0], # joint 1 rotates around x axis
[1, 0, 0], # joint 2 rotates around x axis
[1, 0, 0], # joint 3 rotates around x axis
[0, 0, 1], # joint 4 rotates around z axis
[1, 0, 0]] # joint 5 rotates around x axis


And then we just fill in the Jacobians for each reference frame with self.J_orientation up to the last joint, and fill in the rest of the Jacobian with zeros. So e.g. when we’re calculating the Jacobian for the arm segment just past the second joint we’ll use the first two rows of self.J_orientation and the rest of the rows will be 0.

So this leads us to the second half of the _calc_J function:

        end_point = name.strip('link').strip('joint')
if end_point != 'EE':
end_point = min(int(end_point) + 1, self.num_joints)
# add on the orientation information up to the last joint
for ii in range(end_point):
J[ii] = J[ii] + self.J_orientation[ii]
# fill in the rest of the joints orientation info with 0
for ii in range(end_point, self.num_joints):
J[ii] = J[ii] + [0, 0, 0]

# save to file
cloudpickle.dump(J, open('%s/%s.J' %
(self.config_folder, name), 'wb'))

J = sp.Matrix(J).T  # correct the orientation of J
if lambdify is False:
return J
return sp.lambdify(self.q, J)


The orientation information is added in, we save the result to file, and a function that takes in the joint angles and outputs the Jacobian is created (unless lambdify == False in which case the SymPy symbolic form is returned.)

Then finally, two wrapper functions are added in to make creating / accessing these functions easier. First, define a couple of dictionaries

# create function dictionaries
self._T = {}  # for transform calculations
self._J = {}  # for Jacobian calculations


and then our wrapper functions look like this

def T(self, name, q):
""" Calculates the transform for a joint or link
name string: name of the joint or link, or end-effector
q np.array: joint angles
"""
# check for function in dictionary
if self._T.get(name, None) is None:
print('Generating transform function for %s'%name)
self._T[name] = self.calc_T(name)
return self._T[name](*q)[:-1].flatten()

def J(self, name, q):
""" Calculates the transform for a joint or link
name string: name of the joint or link, or end-effector
q np.array: joint angles
"""
# check for function in dictionary
if self._J.get(name, None) is None:
print('Generating Jacobian function for %s'%name)
self._J[name] = self.calc_J(name)
return np.array(self._J[name](*q)).T


So how you use this class (all of this is in a class) is to call these T and J functions with the current joint angles. They’ll check to see if the functions have already be created or stored in file, if they haven’t then the T and / or J functions will be created, then our wrappers do a bit of formatting to get them into the proper form (i.e. transposing or cropping), and return you your (x,y,z) or Jacobian!

NOTE: It’s a bit of a misnomer to have the function be called T and actually return to you Tx, but hey this is a blog. Lay off.

Calculating the inertia matrix in joint-space and force of gravity
Now, since we’re here we might as well also calculate the functions for our inertia matrix in joint space and the effect of gravity. So, define a couple more placeholders in our robot_config class to help us:

self._M = []  # placeholder for (x,y,z) inertia matrices
self._Mq = None  # placeholder for joint space inertia matrix function
self._Mq_g = None  # placeholder for joint space gravity term function


and then add in our inertia matrix information (defined in each link’s centre-of-mass (COM) reference frame)

# create the inertia matrices for each link of the ur5
self._M.append(np.diag([1.0, 1.0, 1.0,
self._M.append(np.diag([2.5, 2.5, 2.5,
self._M.append(np.diag([5.7, 5.7, 5.7,
self._M.append(np.diag([3.9, 3.9, 3.9,
self._M.append(np.diag([0.7, 0.7, 0.7,


and then using our equations for calculating the system’s inertia and gravity we create our _calc_Mq and _calc_Mq_g functions

def _calc_Mq(self, lambdify=True):
""" Uses Sympy to generate the inertia matrix in
joint space for the ur5

lambdify boolean: if True returns a function to calculate
the Jacobian. If False returns the Sympy
matrix
"""

# check to see if we have our inertia matrix saved in file
if os.path.isfile('%s/Mq' % self.config_folder):
Mq = cloudpickle.load(open('%s/Mq' % self.config_folder, 'rb'))
else:
# get the Jacobians for each link's COM
J = [self._calc_J('link%s' % ii, lambdify=False)

# transform each inertia matrix into joint space
# sum together the effects of arm segments' inertia on each motor
Mq = sp.zeros(self.num_joints)
Mq += J[ii].T * self._M[ii] * J[ii]

# save to file
cloudpickle.dump(Mq, open('%s/Mq' % self.config_folder, 'wb'))

if lambdify is False:
return Mq
return sp.lambdify(self.q, Mq)

def _calc_Mq_g(self, lambdify=True):
""" Uses Sympy to generate the force of gravity in
joint space for the ur5

lambdify boolean: if True returns a function to calculate
the Jacobian. If False returns the Sympy
matrix
"""

# check to see if we have our gravity term saved in file
if os.path.isfile('%s/Mq_g' % self.config_folder):
'rb'))
else:
# get the Jacobians for each link's COM
J = [self._calc_J('link%s' % ii, lambdify=False)

# transform each inertia matrix into joint space and
# sum together the effects of arm segments' inertia on each motor
Mq_g = sp.zeros(self.num_joints, 1)
for ii in range(self.num_joints):
Mq_g += J[ii].T * self._M[ii] * self.gravity

# save to file
cloudpickle.dump(Mq_g, open('%s/Mq_g' % self.config_folder,
'wb'))

if lambdify is False:
return Mq_g
return sp.lambdify(self.q, Mq_g)


and wrapper functions

def Mq(self, q):
""" Calculates the joint space inertia matrix for the ur5

q np.array: joint angles
"""
# check for function in dictionary
if self._Mq is None:
print('Generating inertia matrix function')
self._Mq = self._calc_Mq()
return np.array(self._Mq(*q))

def Mq_g(self, q):
""" Calculates the force of gravity in joint space for the ur5

q np.array: joint angles
"""
# check for function in dictionary
if self._Mq_g is None:
print('Generating gravity effects function')
self._Mq_g = self._calc_Mq_g()
return np.array(self._Mq_g(*q)).flatten()


and we’re all set!

Putting it all together

Now we have nice clean code to generate everything we need for our controller. Using the controller developed in this post as a base, we can replace those calculations with the following nice compact code (which also includes a secondary null-space controller to keep the arm near resting joint angles):

# calculate position of the end-effector
# derived in the ur5 calc_TnJ class
xyz = robot_config.T('EE', q)
# calculate the Jacobian for the end effector
JEE = robot_config.J('EE', q)
# calculate the inertia matrix in joint space
Mq = robot_config.Mq(q)
# calculate the effect of gravity in joint space
Mq_g = robot_config.Mq_g(q)

# convert the mass compensation into end effector space
Mx_inv = np.dot(JEE, np.dot(np.linalg.inv(Mq), JEE.T))
svd_u, svd_s, svd_v = np.linalg.svd(Mx_inv)
# cut off any singular values that could cause control problems
singularity_thresh = .00025
for i in range(len(svd_s)):
svd_s[i] = 0 if svd_s[i] < singularity_thresh else \
1./float(svd_s[i])
# numpy returns U,S,V.T, so have to transpose both here
Mx = np.dot(svd_v.T, np.dot(np.diag(svd_s), svd_u.T))

kp = 100
kv = np.sqrt(kp)
# calculate desired force in (x,y,z) space
u_xyz = np.dot(Mx, target_xyz - xyz)
# transform into joint space, add vel and gravity compensation
u = (kp * np.dot(JEE.T, u_xyz) - np.dot(Mq, kv * dq) - Mq_g)

# calculate our secondary control signal
# calculated desired joint angle acceleration
q_des = (((robot_config.rest_angles - q) + np.pi) %
(np.pi*2) - np.pi)
u_null = np.dot(Mq, (kp * q_des - kv * dq))

# calculate the null space filter
Jdyn_inv = np.dot(Mx, np.dot(JEE, np.linalg.inv(Mq)))
null_filter = (np.eye(robot_config.num_joints) -
np.dot(JEE.T, Jdyn_inv))
u_null_filtered = np.dot(null_filter, u_null)

u += u_null_filtered


And there you go!

You can see all of this code up on my GitHub, along a full example controlling a UR5 VREP model though reaching to a series of targets. It looks something pretty much like this (slightly better because this gif was made before adding in the null space controller):

Overhead of using lambdify instead of hard-coding

This was a big question that I had, because when I’m running simulations the time step is on the order of a few milliseconds, with the controller code called at every time step. So I reaaaally can’t afford a crazy overhead for using lambdify.

To test this, I used the handy Python timeit, which requires a bit awkward setup, but quite nicely calls the function a whole bunch of times (1,000,000 by default) and accounts for various other things going on that could affect the execution time.

I tested two sample functions, one simpler than the other. Here’s the code for setting up and testing the first function:

import timeit
import seaborn

# Test 1 ----------------------------------------------------------------------
print('\nTest function 1: ')
time_sympy1 = timeit.timeit(
stmt = 'f(np.random.random(), np.random.random())',
setup = 'import numpy as np;\
import sympy as sp;\
q0 = sp.Symbol("q0");\
l0 = sp.Symbol("l0");\
a = sp.cos(q0) * l0;\
f = sp.lambdify((q0, l0), a, "numpy")')
print('Sympy lambdify function 1 time: ', time_sympy1)

time_hardcoded1 = timeit.timeit(
stmt = 'np.cos(np.random.random())*np.random.random()',
setup = 'import numpy as np')
print('Hard coded function 1 time: ', time_hardcoded1)


Pretty simple, a bit of a pain in the sympy setup, but other than that not bad at all. The second function I tested was just a random collection of cos and sin calls that resemble what gets computed in a Jacobian:

l1*np.sin(q0 - l0*np.sin(q1)*np.cos(q2) - l2*np.sin(q2) - l0*np.sin(q1) + q0*l0)*np.cos(q0) + l2*np.sin(q0)


And here’s the results:

So it’s slower for sure, but again this is the difference in time after 1,000,000 function calls, so until some big optimization needs to be done using the SymPy lambdify function is definitely worth the slight gain in execution time for the insane convenience.

The full code for the timing tests here are also up on my GitHub.

## Using VREP for simulation of force-controlled models

I’ve been playing around a bit with different simulators, and one that we’re a big fan of in the lab is VREP. It’s free for academics and you can talk to them about licences if you’re looking for commercial use. I haven’t actually had much experience with it before myself, so I decided to make a simple force controlled arm model to get experience using it. All in all, there were only a few weird API things that I had to get through, and once you have them down it’s pretty straight forward. This post is going to be about the steps that I needed to take to get things all set up. For a more general start-up on VREP check out All the code in this post and the model I use can be found up on my GitHub.

Getting the right files where you need them

As discussed in the remote API overview, you’ll need three files in whatever folder you want to run your Python script from to be able to hook into VREP remotely:

• remoteApi.dll, remoteApi.dylib or remoteApi.so (depending on what OS you’re using)
• vrep.py
• vrepConstants.py

You can find these files inside your VREP_HOME/programming/remoteApiBindings/python/python and VREP_HOME/programming/remoteApiBindings/lib/lib folders. Make sure that these files are in whatever folder you’re running your Python scripts from.

The model

It’s easy to create a new model to mess around with in VREP, so that’s the route I went, rather than importing one of their pre-made models and having some sneaky parameter setting cause me a bunch of grief. You can just right click->add then go at it. There are a bunch of tutorials so I’m not going to go into detail here. The main things are:

• Make sure joints are in ‘Torque/force’ mode.
• Make sure that joint ‘Motor enabled’ property is checked. The motor enabled property is in the ‘Show dynamic properties dialogue’ menu, which you find when you double click on the joint in the Scene Hierarchy.
• Know the names of the joints as shown in the Scene Hierarchy.

So here’s a picture:

where you can see the names of the objects in the model highlighted in red, the Torque/force selection highlighted in blue, and the Motor enabled option highlighted in green. And of course my beautiful arm model in the background.

Setting up the continuous server

The goal is to connect VREP to Python so that we can send torques to the arm from our Python script and get the feedback necessary to calculate those torques. There are a few ways to set up a remote connection in VREP.

The basic one is they have you add a child script in VREP and attach it to an object in the world that sets up a port and then you hit go on your simulation and can then run your Python script to connect in. This gets old real fast. Fortunately, it’s easy to automate everything from Python so that you can connect in, start the simulation, run it for however long, and then stop the simulation without having to click back and forth.

The first step is to make sure that your remoteApiConnections.txt file in you VREP home directory is set up properly. A continuous server should be set up by default, but doesn’t hurt to double check. And you can take the chance to turn on debugging, which can be pretty helpful. So open up that file and make sure it looks like this:

portIndex1_port                 = 19997
portIndex1_debug                = true
portIndex1_syncSimTrigger       = true


Once that’s set up, when VREP starts we can connect in from Python. In our Python script, first we’ll close any open connections that might be around, and then we’ll set up a new connection in:

import vrep

# close any open connections
vrep.simxFinish(-1)
# Connect to the V-REP continuous server
clientID = vrep.simxStart('127.0.0.1', 19997, True, True, 500, 5)

if clientID != -1: # if we connected successfully
print ('Connected to remote API server')


So once the connection is made, we check the clientID value to make sure that it didn’t fail, and then we carry on with our script.

Synchronizing

By default, VREP will run its simulation in its own thread, and both the simulation and the controller using the remote API will be running simultaneously. This can lead to some weird behaviour as things fall out of synch etc etc, so what we want instead is for the VREP sim to only run one time step for each time step the controller runs. To do that, we need to set VREP to synchronous mode. So the next few lines of our Python script look like:

    # --------------------- Setup the simulation

vrep.simxSynchronous(clientID,True)


and then later, once we’ve calculated our control signal, sent it over, and want the simulation to run one time step forward, we do that by calling

    # move simulation ahead one time step
vrep.simxSynchronousTrigger(clientID)


Get the handles to objects of interest

OK the next chunk of code in our script uses the names of our objects (as specified in the Scene Hierarchy in VREP) to get an ID for each which to identify which object we want to send a command to or receive feedback from:

    joint_names = ['shoulder', 'elbow']
# joint target velocities discussed below
joint_target_velocities = np.ones(len(joint_names)) * 10000.0

# get the handles for each joint and set up streaming
joint_handles = [vrep.simxGetObjectHandle(clientID,
name, vrep.simx_opmode_blocking)[1] for name in joint_names]

# get handle for target and set up streaming
_, target_handle = vrep.simxGetObjectHandle(clientID,
'target', vrep.simx_opmode_blocking)


Set dt and start the simulation

And the final thing that we’re going to do in our simulation set up is specify the timestep that we want to use, and then start the simulation. I found this in a forum post, and I must say whatever VREP lacks in intuitive API their forum moderators are on the ball. NOTE: To use a custom time step you have to also set the dt option in the VREP simulator to ‘custom’. The drop down is to the left of the ‘play’ arrow, and if you don’t have it set to ‘custom’ you won’t be able to set the time step through Python.

    dt = .01
vrep.simxSetFloatingParameter(clientID,
vrep.sim_floatparam_simulation_time_step,
dt, # specify a simulation time step
vrep.simx_opmode_oneshot)

# --------------------- Start the simulation

# start our simulation in lockstep with our code
vrep.simxStartSimulation(clientID,
vrep.simx_opmode_blocking)


Main loop

For this next chunk I’m going to cut out everything that’s not VREP, since I have a bunch of posts explaining the control signal derivation and forward transformation matrices.

So, once we’ve started the simulation, I’ve set things up for the arm to be controlled for 1 second and then for the simulation to stop and everything shut down and disconnect.

    while count < 1: # run for 1 simulated second

# get the (x,y,z) position of the target
_, target_xyz = vrep.simxGetObjectPosition(clientID,
target_handle,
-1, # retrieve absolute, not relative, position
vrep.simx_opmode_blocking)
if _ !=0 : raise Exception()
track_target.append(np.copy(target_xyz)) # store for plotting
target_xyz = np.asarray(target_xyz)

q = np.zeros(len(joint_handles))
dq = np.zeros(len(joint_handles))
for ii,joint_handle in enumerate(joint_handles):
# get the joint angles
_, q[ii] = vrep.simxGetJointPosition(clientID,
joint_handle,
vrep.simx_opmode_blocking)
if _ !=0 : raise Exception()
# get the joint velocity
_, dq[ii] = vrep.simxGetObjectFloatParameter(clientID,
joint_handle,
2012, # ID for angular velocity of the joint
vrep.simx_opmode_blocking)
if _ !=0 : raise Exception()


In the above chunk of code, I think the big thing to point out is that I’m using vrep.simx_opmode_blocking in each call, instead of vrep.simx_opmode_buffer. The difference is that you for sure get the current values of the simulation when you use blocking, and you can be behind a time step using buffer.

Aside from that the other notable things are I raise an exception if the first parameter (which is the return code) is ever not 0, and that I use simxGetObjectFloatParameter to get the joint velocity instead of simxGetObjectVelocity, which has a rotational velocity component. Zero is the return code for ‘everything worked’, and if you don’t check it and have some silly things going on you can be veeerrrrryyy mystified when things don’t work. And what simxGetObjectVelocity returns is the rotational velocity of the joint relative to the world frame, and not the angular velocity of the joint in its own coordinates. That was also a briefly confusing.

So the next thing I do is calculate u, which we’ll skip, and then we need to set the forces for the joint. This part of the API is real screwy. You can’t set the force applied to the joint directly. Instead, you have to set the target velocity of the joint to some really high value (hence that array we made before), and then modulate the maximum force that can be applied to that joint. Also important: When you want to apply a force in the other direction, you change the sign on the target velocity, but keep the force sign positive.

        # ... calculate u ...

for ii,joint_handle in enumerate(joint_handles):
# get the current joint torque
_, torque = \
vrep.simxGetJointForce(clientID,
joint_handle,
vrep.simx_opmode_blocking)
if _ !=0 : raise Exception()

# if force has changed signs,
# we need to change the target velocity sign
if np.sign(torque) * np.sign(u[ii]) < 0:
joint_target_velocities[ii] = \
joint_target_velocities[ii] * -1
vrep.simxSetJointTargetVelocity(clientID,
joint_handle,
joint_target_velocities[ii], # target velocity
vrep.simx_opmode_blocking)
if _ !=0 : raise Exception()

# and now modulate the force
vrep.simxSetJointForce(clientID,
joint_handle,
abs(u[ii]), # force to apply
vrep.simx_opmode_blocking)
if _ !=0 : raise Exception()

# move simulation ahead one time step
vrep.simxSynchronousTrigger(clientID)
count += dt


So as you can see we check the current torque, see if we need to change the sign on the target velocity, modulate the maximum allowed force, and then finally step the VREP simulation forward.

Conclusions

And there you go! Here’s an animation of it in action (note this is a super low quality gif and it looks way better / smoother when actually running it yourself):

All in all, VREP has been enjoyable to work with so far. It didn’t take long to get things moving and off the ground, the visualization is great, and I haven’t even scratched the surface of what you can do with it. Best of all (so far) you can fully automate everything from Python. Hopefully this is enough to help some people get their own models going and save a few hours and headaches! Again, the full code and the model are up on my GitHub.

Nits

• When you’re applying your control signal, make sure you test each joint in isolation, to make sure your torques push things in the direction you think they do. I had checked the rotation direction in VREP, but the control signal for both joints ended up needing to be multiplied by -1.
• Another nit when you’re building your model, if you use the rotate button from the VREP toolbar on your model, wherever that joint rotates to is now 0 degrees. If you want to set the joint to start at 45 degrees, instead double click and change Pos[deg] option inside ‘Joint’ in Scene Object Properties.

## Deep learning for control using augmented Hessian-free optimization

Traditionally, deep learning is applied to feed-forward tasks, like classification, where the output of the network doesn’t affect the input to the network. It is a decidedly harder problem when the output is recurrently connected such that network output affects the input. Because of this application of deep learning methods to control was largely unexplored until a few years ago. Recently, however, there’s been a lot of progress and research in this area. In this post I’m going to talk about an implementation of deep learning for control presented by Dr. Ilya Sutskever in his thesis Training Recurrent Neural Networks.

In his thesis, Dr. Sutskever uses augmented Hessian-free (AHF) optimization for learning. There are a bunch of papers and posts that go into details about AHF, here’s a good one by Andrew Gibiansky up on his blog, that I recommend you check out. I’m not going to really talk much here about what AHF is specifically, or how it differs from other methods, if you’re unfamiliar there are lots of places you can read up on it. Quickly, though, AHF is kind of a bag of tricks you can use with a fast method for estimating the curvature of the loss function with respect to the weights of a neural network, as well as the gradient, which allows it to make larger updates in directions where the loss function doesn’t change quickly. So rather than estimating the gradient and then doing a small update along each dimension, you can make the size of your update large in directions that change slowly and small along dimensions where things change quickly. And now that’s enough about that.

In this post I’m going to walk through using a Hessian-free optimization library (version 0.3.8) written by my compadre Dr. Daniel Rasmussen to train up a neural network to train up a 2-link arm, and talk about the various hellish gauntlets you need run to get something that works. Whooo! The first thing to do is install this Hessian-free library, linked above.

I’ll be working through code edited a bit for readability, to find the code in full you can check out the files up on my GitHub.

Build the network

Dr. Sutskever specified the structure of the network in his thesis to be 4 layers: 1) a linear input layer, 2) 100 Tanh nodes, 3) 100 Tanh nodes, 4) linear output layer. The network is connected up with the standard feedforward connections from 1 to 2 to 3 to 4, plus recurrent connections on 2 and 3 to themselves, plus a ‘skip’ connection from layer 1 to layer 3. Finally, the input to the network is the target state for the plant and the current state of the plant. So, lots of recursion! Here’s a picture:

The output layer connects in to the plant, and, for those unfamiliar with control theory terminology, ‘plant’ just means the system that you’re controlling. In this case an arm simulation.

Before we can go ahead and set up the network that we want to train, we also need to specify the loss function that we’re going to be using during training. The loss function in Ilya’s thesis is a standard one:

$L(\theta) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{u}_t) + \ell_f(\textbf{x}_N),$
$\ell(\textbf{u}_t) = \alpha \frac{||\textbf{u}_t||^2}{2},$
$\ell_f(\textbf{x}_N) = \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2}$

where $L(\theta)$ is the total cost of the trajectory generated with $\theta$, the set of network parameters, $\ell(\textbf{u})$ is the immediate state cost, $\ell_f(\textbf{x})$ is the final state cost, $\textbf{x}$ is the state of the arm, $\textbf{x}^*$ is the target state of the arm, $\textbf{u}$ is the control signal (torques) that drives the arm, and $\alpha$ is a gain value.

To code this up using the hessianfree library we do:

from hessianfree import RNNet
from hessianfree.nonlinearities import (Tanh, Linear, Plant)
from hessianfree.loss_funcs import SquaredError, SparseL2

l2gain = 10e-3 * dt # gain on control signal loss
rnn = RNNet(
# specify the number of nodes in each layer
shape=[num_states * 2, 96, 96, num_states, num_states],
# specify the function of the nodes in each layer
layers=[Linear(), Tanh(), Tanh(), Linear(), plant],
# specify the layers that have recurrent connections
rec_layers=[1,2],
# specify the connections between layers
conns={0:[1, 2], 1:[2], 2:[3], 3:[4]},
# specify the loss function
loss_type=[
# squared error between plant output and targets
SquaredError(),
# penalize magnitude of control signal (output of layer 3)
SparseL2(l2gain, layers=[3])],
use_GPU=True)


Note that if you want to run it on your GPU you’ll need PyCuda and sklearn installed. And a GPU.

An important thing to note as well is that in Dr. Sustkever’s thesis when we’re calculating the squared error of the distance from the arm state to the target, this is measured in joint angles. So it’s kind of a weird set up to be looking at the movement of the hand but have your cost function in joint-space instead of end-effector space, but it definitely simplifies training by making the cost more directly relatable to the control signal. So we need to calculate the joint angles of the arm that will have the hand at different targets around a circle. To do this we’ll take advantage of our inverse kinematics solver from way back when, and use the following code:

def gen_targets(arm, n_targets=8, sig_len=100):
#Generate target angles corresponding to target
#(x,y) coordinates around a circle
import scipy.optimize

x_bias = 0
if arm.DOF == 2:
y_bias = .35
dist = .075
elif arm.DOF == 3:
y_bias = .5
dist = .2

# set up the reaching trajectories around circle
targets_x = [dist * np.cos(theta) + x_bias \
for theta in np.linspace(0, np.pi*2, 65)][:-1]
targets_y = [dist * np.sin(theta) + y_bias \
for theta in np.linspace(0, np.pi*2, 65)][:-1]

joint_targets = []
for ii in range(len(targets_x)):
joint_targets.append(arm.inv_kinematics(xy=(targets_x[ii],
targets_y[ii])))
targs = np.asarray(joint_targets)

# repeat the targets over time
for ii in range(targs.shape[1]-1):
targets = np.concatenate(
(np.outer(targs[:, ii], np.ones(sig_len))[:, :, None],
np.outer(targs[:, ii+1], np.ones(sig_len))[:, :, None]), axis=-1)
targets = np.concatenate((targets, np.zeros(targets.shape)), axis=-1)
# only want to penalize the system for not being at the
# target at the final state, set everything before to np.nan
targets[:, :-1] = np.nan

return targets


And you can see in the last couple lines that to implement the distance to target as a final state cost penalty only we just set all of the targets before the final time step equal to np.nan. If we wanted to penalize distance to target throughout the whole trajectory we would just comment that line out.

Create the plant

You’ll notice in the code that defines our RNN I set the last layer of the network to be plant, but that that’s not defined anywhere. Let’s talk. There are a couple of things that we’re going to need to incorporate our plant into this network and be able to use any deep learning method to train it. We need to be able to:

1. Simulate the plant forward; i.e. pass in input and get back the resulting plant state at the next timestep.
2. Calculate the derivative of the plant state with respect to the input; i.e. how do small changes in the input affect the state.
3. Calculate the derivative of the plant state with respect to the previous state; i.e. how do small changes in the plant state affect the state at the next timestep.
4. Calculate the derivative of the plant output with respect to its state; i.e. how do small changes in the current position of the state affect the output of the plant.

So 1 is easy, we have the arm simulations that we want already, they’re up on my GitHub. Number 4 is actually trivial too, because the output of our plant is going to be the state itself, so the derivative of the output with respect to the state is just the identity matrix.

For 2 and 3, we’re going to need to calculate some derivatives. If you’ve read the last few posts you’ll note that I’m on a finite differences kick. So let’s get that going! Because no one wants to calculate derivatives!

Important note, the notation in these next couple pieces of code is going to be a bit different from my normal notation because they’re matching with the hessianfree library notation, which is coming from a reinforcement learning literature background instead of a control theory background. So, s is the state of the plant, and x is the input to the plant. I know, I know. All the same, make sure to keep that in mind.

# calculate ds0/dx0 with finite differences
d_input_FD = np.zeros((x.shape[0], # number of trials
x.shape[1], # number of inputs
self.state.shape[1])) # number of states
for ii in range(x.shape[1]):
# calculate state adding eps to x[ii]
self.reset_plant(self.prev_state)
inc_x = x.copy()
inc_x[:, ii] += self.eps
self.activation(inc_x)
state_inc = self.state.copy()
# calculate state subtracting eps from x[ii]
self.reset_plant(self.prev_state)
dec_x = x.copy()
dec_x[:, ii] -= self.eps
self.activation(dec_x)
state_dec = self.state.copy()

d_input_FD[:, :, ii] = \
(state_inc - state_dec) / (2 * self.eps)
d_input_FD = d_input_FD[..., None]


Alrighty. First we create a tensor to store the results. Why is it a tensor? Because we’re going to be doing a bunch of runs at once. So our state dimensions are actually trials x size_input. When we then take the partial derivative, we end up with trials many size_input x size_state matrices. Then we increase each of the parameters of the input slightly one at a time and store the results, decrease them all one at a time and store the results, and compute our approximation of the gradient.

Next we’ll do the same for calculating the derivative of the state with respect to the previous state.

# calculate ds1/ds0
d_state_FD = np.zeros((x.shape[0], # number of trials
self.state.shape[1], # number of states
self.state.shape[1])) # number of states
for ii in range(self.state.shape[1]):
# calculate state adding eps to self.state[ii]
state = np.copy(self.prev_state)
state[:, ii] += self.eps
self.reset_plant(state)
self.activation(x)
state_inc = self.state.copy()
# calculate state subtracting eps from self.state[ii]
state = np.copy(self.prev_state)
state[:, ii] -= self.eps
self.reset_plant(state)
self.activation(x)
state_dec = self.state.copy()

d_state_FD[:, :, ii] = \
(state_inc - state_dec) / (2 * self.eps)
d_state_FD = d_state_FD[..., None]


Great! We’re getting closer to having everything we need. Another thing we need is a wrapper for running our arm simulation. It’s going to look like this:

def activation(self, x):
state = []
# iterate through and simulate the plant forward
# for each trial
for ii in range(x.shape[0]):
self.arm.reset(q=self.state[ii, :self.arm.DOF],
dq=self.state[ii, self.arm.DOF:])
self.arm.apply_torque(u[ii])
state.append(np.hstack([self.arm.q, self.arm.dq]))
state = np.asarray(state)

self.state = self.squashing(state)


This is definitely not the fastest code to run. Much more ideally we would put the state and input into vectors and do a single set of computations for each call to activation rather than having that for loop in there. Unfortunately, though, we’re not assuming that we have access to the dynamics equations / will be able to pass in vector states and inputs.

Squashing
Looking at the above code that seems pretty clear what’s going on, except you might notice that last line calling self.squashing. What’s going on there?

The squashing function looks like this:

def squashing(self, x):
index_below = np.where(x < -2*np.pi)
x[index_below] = np.tanh(x[index_below]+2*np.pi) - 2*np.pi
index_above = np.where(x > 2*np.pi)
x[index_above] = np.tanh(x[index_above]-2*np.pi) + 2*np.pi
return x


All that’s happening here is that we’re taking our input, and doing nothing to it as long as it doesn’t start to get too positive or too negative. If it does then we just taper it off and prevent it from going off to infinity. So running a 1D vector through this function we get:

This ends up being a pretty important piece of the code here. Basically it prevents wild changes to the weights during learning to result in the system breaking down. So the state of the plant can’t go off to infinity and cause an error to be thrown, stopping our simulation. But because the target state is well within the bounds of where the squashing function does nothing, post-training we’ll still be able to use the resulting network to control a system that doesn’t have this fail safe built in. Think of this function as training wheels that catch you only if you start falling.

With that, we no have pretty much all of the parts necessary to begin training our network!

Training the network

We’re going to be training this network on the centre-out reaching task, where you start at a centre point and reach to a bunch of target locations around a circle. I’m just going to be re-implementing the task as it was done in Dr. Sutskever’s thesis, so we’ll have 64 targets around the circle, and train using a 2-link arm. Here’s the code that we’ll use to actually run the training:

for ii in range(last_trial+1, num_batches):
# train a bunch of batches using the same input every time
# to allow the network a chance to minimize things with
# stable input (speeds up training)
err = rnn.run_batches(plant, targets=None,
max_epochs=batch_size,
optimizer=HessianFree(CG_iter=96, init_damping=100))

# save the weights to file, track trial and error
# err = rnn.error(inputs)
err = rnn.best_error
name = 'weights/rnn_weights-trial%04i-err%.3f'%(ii, err)
np.savez_compressed(name, rnn.W)


A quick aside: if you want to run this code yourself, get a real good computer, have an arm simulation ready, the hessianfree Python library installed, and download and run this train_hf.py file. (Note: I used version 0.3.8 of the hessianfree library, which you can install using pip install hessianfree==0.3.8) This will start training and save the weights into a weights/ folder, so make sure that that exists in the same folder as train_hf.py. If you want to view the results of the training at any point run the plot_error.py file, which will load in the most recent version of the weights and plot the error so far. If you want to generate an animated plot like I have below run gen_animation_plots.py and then the last command from my post on generating animated gifs.

Another means of seeing the results of your trained up network is to use the controller I’ve implemented in my controls benchmarking suite, which looks for a set of saved weights in the controllers/weights folder, and will load it in and use it to generate command signals for the arm by running it with

python run.py arm2_python ahf reach --dt=1e-2


where you replace arm2_python with whatever arm model you trained your model on. Note the --dt=1e-2 flag, that is important because the model was trained with a .01 timestep and things get a bit weird if you suddenly change the dynamics on the controller.

OK let’s look at some results!

Results

OK full discretion, these results are not optimizing the cost function we discussed above. They’re implementing a simpler cost function that only looks at the final state, i.e. it doesn’t penalize the magnitude of the control signal. I did this because Dr. Sutskever says in his thesis he was able to optimize with just the final state cost using much smaller networks. I originally looked at neurons with 96 neurons in each layer, and it just took forgoddamnedever to run. So after running for 4 weeks (not joking) and needing to make some more changes I dropped the number of neurons and simplified the task.

The results below are from running a network with 32 neurons in each layer controlling this 2-link arm, and took another 4-5 weeks to train up.

Hey that looks good! Not bad, augmented Hessian-free learning, not bad. It had pretty consistent (if slow) decline in the error rate, with a few crazy bumps from which it quickly recovered. Also take note that each training iteration is actually 32 runs, so it’s not 12,50-ish runs it’s closer to 400,000 training runs that it took to get here.

One biggish thing that was a pain was that it turns out that I only trained the neural network for reaching in the one direction, and when you only train it to reach one way it doesn’t generalize to reaching back to the starting point (which, fair enough). But, I didn’t realize this until I was took the trained network and ran it in the benchmarking code, at which point I was not keen to redo all of the training it took to get the neural network to the level of accuracy it was at under a more complicated training set. The downside of this is that even though I’ve implemented a controller that takes in the trained network and uses it to control the arm, to do the reaching task I have to just do a hard reset after the arm reaches the target, because it can’t reach back to the center, like all the other controllers. All the same, here’s an animation of the trained up AHF controller reaching to 8 targets (it was trained on all 64 above though):

Things don’t always go so smoothly, though. Here’s results from another training run that took around 2-3 weeks, and uses a different 2-link arm model (translated from Matlab code written by Dr. Emo Todorov):

What I found frustrating about this was that if you look at the error over time then this arm is doing as well or better than the previous arm at a lot of points. But the corresponding trajectories look terrible, like something you would see in a horror movie based around getting good machine learning results. This of course comes down to how I specified the cost function, and when I looked at the trajectories plotted over time the velocity of the arm is right at zero at the final time step, which it is not quiiiitte the case for the first controller. So this second network has found a workaround to minimize the cost function I specified in a way I did not intend. To prevent this, doing something like weighting the distance to target heavier than non-zero velocity would probably work. Or possibly just rerunning the training with a different random starting point you could get out a better controller, I don’t have a great feel for how important the random initialization is, but I’m hoping that it’s not all too important and its effects go to zero with enough training. Also, it should be noted I’ve run the first network for 12,500 iterations and the second for less than 6,000, so I’ll keep letting them run and maybe it will come around. The first one looked pretty messy too until about 4,000 iterations in.

Training regimes

Frustratingly, the way that you train deep networks is very important. So, very much like the naive deep learning network trainer that I am, I tried the first thing that pretty much anyone would try:

• run the network,
• update the weights,
• repeat.

This is what I’ve done in the results above. And it worked well enough in that case.

If you remember back to the iLQR I made a little while ago, I was able to change the cost function to be

$L(\theta) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{u}_t) + \ell_f(\textbf{x}_N),$
$\ell(\textbf{u}_t, \textbf{x}_t) = \alpha \frac{||\textbf{u}_t||^2}{2} + \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2},$
$\ell_f(\textbf{x}_N) = \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2}$

(i.e. to include a penalty for distance to target throughout the trajectory and not just at the final time step) which resulted in straighter trajectories when controlling the 2-link arm. So I thought I would try this here as well. Sadly (incredibly sadly), this was fairly fruitless. The network didn’t really learn or improve much at all.

After much consideration and quandary on my part, I talked with Dr. Dan and he suggested that I try another method:

• run the network,
• record the input,
• hold the input constant for a few batches of weight updating,
• repeat.

This method gave much better results. BUT WHY? I hear you ask! Good question. Let me give giving explanation a go.

Essentially, it’s because the cost function is more complex now. In the first training method, the output from the plant is fed back into the network as input at every time step. When the cost function was simpler this was OK, but now we’re getting very different input to train on at every iteration. So the system is being pulled in different directions back and forth at every iteration. In the second training regime, the same input is given several times in a row, which let’s the system follow the same gradient for a few training iterations before things change again. In my head I picture this as giving the algorithm a couple seconds to catch its breath dunking it back underwater.

This is a method that’s been used in a bunch of places recently. One of the more high-profile instances is in the results published from DeepMind on deep RL for control and for playing Go. And indeed, it also works well here.

To implement this training regime, we set up the following code:

for ii in range(last_trial+1, num_batches):

# run the plant forward once
rnn.forward(input=plant, params=rnn.W)

# get the input and targets from above rollout
inputs = plant.get_vecs()[0].astype(np.float32)
targets = np.asarray(plant.get_vecs()[1], dtype=np.float32)

# train a bunch of batches using the same input every time
# to allow the network a chance to minimize things with
# stable input (speeds up training)
err = rnn.run_batches(inputs, targets, max_epochs=batch_size,
optimizer=HessianFree(CG_iter=96, init_damping=100))

# save the weights to file, track trial and error
# err = rnn.error(inputs)
err = rnn.best_error
name = 'weights/rnn_weights-trial%04i-err%.3f'%(ii, err)
np.savez_compressed(name, rnn.W)


So you can see that we do one rollout with the weights, then go in and get the inputs and targets that were used in that rollout, and start training the network while holding those constant for batch_size epochs (training sessions). From a little bit of messing around I’ve found batch_size=32 to be a pretty good number. So then it runs 32 training iterations where it’s updating the weights, and then saves those weights (because we want a loooottttt of check-points) and then restarts the loop.

Embarrassingly, I’ve lost my simulation results from this trial, somehow…so I don’t have any nice plots to back up the above, unfortunately. But since this is just a blog post I figured I would at least talk about it a little bit, since people might still find it useful if they’re just getting into the field like me. and just update this post whenever I re-run them. If I rerun them.

What I do have, however, are results where this method doesn’t work! I tried this with the simpler cost function, that only looks at the final state distance from the target, and it did not go so well. Let’s look at that one!

My guess here is basically that the system has gotten to a point where it’s narrowed things down in the parameter space and now when you run 32 batches it’s overshooting. It needs feedback about its updates after every update at this point. That’s my guess, at least. So it could be the case that for more complex cost functions you’d want to train it while holding the input constant for a while, and then when the error starts to plateau switch to updating the input after every parameter update.

Conclusions

All in all, AHF for training neural networks in control is pretty awesome. There are of course still some major hold-backs, mostly related to how long it takes to train up a network, and having to guess at effective training regimes and network structures etc. But! It was able to train up a relatively small neural network to move an arm model from a center point to 64 targets around a circle, with no knowledge of the system under control at all. In Dr. Sutskever’s thesis he goes on to use the same set up under more complicated circumstances, such as when there’s a feedback delay, or a delay on the outgoing control signal, and unexpected noise etc, so it is able to learn under a number of different, fairly complex situations. Which is pretty slick.

Related to the insane training time required, I very easily could be missing some basic thing that would help speed things up. If you, reader, get ambitious and run the code on your own machine and find out useful methods for speeding up the training please let me know! Personally, my plan is to next investigate guided policy search, which seems like it’s found a way around this crazy training time.

## Simultaneous perturbation vs finite differences for linear dynamics estimation and control signal optimization

Recently in my posts I’ve been using finite differences to approximate the gradient of loss functions and dynamical systems, with the intention of creating generalizable controllers that can be run on any system without having to calculate out derivatives beforehand. Finite differences is pretty much the most straight-forward way of approximating a gradient that there is: vary each parameter up and down (assuming we’re doing central differencing), one at a time, run it through your function and estimate the parameters effect on the system by calculating the difference between resulting function output. To do this requires 2 samples of the function for each parameter.

But there’s always more than one way to peel an avocado, and another approach that’s been used with much success is the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm, which was developed by Dr. James Spall (link to overview paper). SPSA is a method of gradient approximation, like finite differences, but, critically, the difference is that it varies all of the parameters at once, rather than one at a time. As a result, you can get an approximation of the gradient with far fewer samples from the system, and also when you don’t have explicit control over your samples (i.e. the ability to vary each parameter one at a time).

Given some function $\textbf{f}$ dependent on some set of parameters $\textbf{x}$, we’re used to finding the gradient of $\textbf{f}(\textbf{x})$ using FDSA (finite differences stochastic approximation) written in this form:

$\textbf{f}_x = \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}) - \textbf{f}(\textbf{x} - \Delta \textbf{x})}{2 \Delta \textbf{x}} = \Delta x^{-1} \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}) - \textbf{f}(\textbf{x} - \Delta \textbf{x})}{2},$

where $\Delta\textbf{x}$ is a perturbation to the parameter set, and the subscript on the left-hand side denotes the derivative of $\textbf{f}(\textbf{x})$ with respect to $\textbf{x}$.

And that’s how we’ve calculated it before, estimating the gradient of a single parameter at a time. But, we can rewrite this for a set perturbations $\Delta \textbf{X} = [\Delta\textbf{x}_0, ... , \Delta\textbf{x}_N]^T$:

$\textbf{f}_\textbf{x} = \Delta \textbf{X}^{-1} \textbf{F},$

where

$\textbf{F} = [\frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}_0) - \textbf{f}(\textbf{x} - \Delta \textbf{x}_0)}{2}, ... , \frac{\textbf{f}(\textbf{x} + \Delta \textbf{x}_0) - \textbf{f}(\textbf{x}_N - \Delta \textbf{x}_N)}{2}]^T$,

which works as long as $\Delta\textbf{X}$ is square. When it’s not square (i.e. we have don’t have the same number of samples as we have parameters), we run into problems, because we can’t calculate $\Delta\textbf{X}^{-1}$ directly. To address this, let’s take a step back and then work forward again to get a more general form that works for non-square $\Delta \textbf{X}$ too.

By rewriting the above, and getting rid of the inverse by moving $\Delta\textbf{x}$ back to the other side, we have:

$\Delta\textbf{X} \; \textbf{f}_\textbf{x} = \textbf{F}$

Now, the standard trick to move a matrix that’s not square is to just make it square by multiplying it by its transpose to get a square matrix, and then the whole thing by the inverse:

$\Delta\textbf{X}^T \Delta\textbf{X} \; \textbf{f}_\textbf{x} = \Delta\textbf{X}^T \textbf{F}$

$(\Delta\textbf{X}^T \Delta\textbf{X})^{-1} (\Delta\textbf{X}^T \Delta\textbf{X}) \textbf{f}_\textbf{x} = (\Delta\textbf{X}^T \Delta\textbf{X})^{-1} \Delta\textbf{X}^T \textbf{F}$

$\textbf{f}_\textbf{x} = (\Delta\textbf{X}^T \Delta\textbf{X})^{-1} \Delta\textbf{X}^T \textbf{F}$

Alright! Now we’re comfortable with this characterization of gradient approximation using a form that works with non-square perturbation matrices.

Again, in FDSA, we only vary one parameter at a time. This means that there will only ever be one non-zero entry per row of $\Delta \textbf{X}$. By contrast, in SPSA, we vary multiple parameters, and so rows of $\Delta\textbf{X}$ will be just chalk full of non-zero entries.

Gradient approximation to estimate $\textbf{f}_\textbf{x}$ and $\textbf{f}_\textbf{u}$ for LQR control

Implementing this was pretty simple. Just had to modify the calc_derivs function, which I use to estimate the derivative of the arm with respect to the state and control signal, in my LQR controller code by changing from standard finite differences to simultaneous perturbation:

def calc_derivs(self, x, u):
"""" calculate gradient of plant dynamics using Simultaneous
Perturbation Stochastic Approximation (SPSA). Implemented
based on (Peters & Schaal, 2008).

x np.array: the state of the system
u np.array: the control signal
"""
# Initialization and coefficient selection
num_iters = 20

eps = 1e-4
delta_K = None
delta_J = None
for ii in range(num_iters):
# Generation of simultaneous perturbation vector
# choose each component from a Bernoulli +-1 distribution
# with probability of .5 for each +-1 outcome.
delta_k = np.random.choice([-1,1],
size=len(x) + len(u),
p=[.5, .5])
# Function evaluations
inc_x = np.copy(x) + eps * delta_k[:len(x)]
inc_u = np.copy(u) + eps * delta_k[len(x):]
state_inc = self.plant_dynamics(inc_x, inc_u)
dec_x = np.copy(x) - eps * delta_k[:len(x)]
dec_u = np.copy(u) - eps * delta_k[len(x):]
state_dec = self.plant_dynamics(dec_x, dec_u)

delta_j = ((state_inc - state_dec) /
(2.0 * eps)).reshape(-1)

# Track delta_k and delta_j
delta_K = delta_k if delta_K is None else \
np.vstack([delta_K, delta_k])
delta_J =  delta_j if delta_J is None else \
np.vstack([delta_J, delta_j])

f_xu = np.dot(np.linalg.pinv(np.dot(delta_K.T, delta_K)),
np.dot(delta_K.T, delta_J))
f_x = f_xu[:len(x)]
f_u = f_xu[len(x):]

return f_x.T , f_u.T


A couple notes about the above code. First, you’ll notice that the f_x and f_b matrices are both calculated at the same time. That’s pretty slick! And that calculation for f_xu is just a straight implementation of the matrix form of gradient approximation, where I’ve arranged things so that f_x is in the top part and f_u is in the lower part.

The second thing is that the perturbation vector delta_k is generated from a Bernoulli distribution. The reason behind this is that we want to have a bunch of different samples that pretty reasonably spread the state space and move all the parameters independently. Making each perturbation some distance times -1 or 1 is an easy way to achieve this.

Thirdly, there’s the num_iters variable. This is a very important variable, as it dictates how many random samples of our system we take before we estimate the gradient. I’ve found that to get this to work for both the 2-link arm and the more complex 3-link arm, it needs to be at least 20. Or else things explode and die horribly. Just…horribly.

OK let’s look at the results:

The first thing to notice is that I’ve finally discovered the Seaborn plotting package. The second is that SPSA does as well as FDSA.

You may ask: Is there any difference? Well, if we time these functions, on my lil’ laptop, for the 2-link arm it takes SPSA approximately 2.0ms, but it takes FDSA only 0.8ms. So for the same performance the SPSA is taking almost 3 times as long to run. Why? This boils down to how many times the system dynamics need to be sampled by each algorithm to get a good approximation of the gradient. For a 2-link arm, FDSA has 6 parameters ($\textbf{q}, \dot{\textbf{q}},$ and $\textbf{u}$) that it needs to sample twice (we’re doing central differencing), for a total of 12 samples. And as I mentioned above, the SPSA algorithm needs 20 samples to be stable.

For the 3-link arm, SPSA took about 3.1ms on average and FDSA (which must now perform 18 samples of the dynamics) still only 2.1ms. So number of samples isn’t the only cause of time difference between these two algorithms. SPSA needs to perform that a few more matrix operations, including a matrix inverse, which is expensive, while FDSA can calculate the gradient of each parameter individually, which is much less expensive.

OK so SPSA not really impressive here. BUT! As I discovered, there are other means of employing SPSA.

Gradient approximation to optimize the control signal directly

In the previous set up we were using SPSA to estimate the gradient of the system under control, and then we used that gradient to calculate a control signal that minimized the loss function (as specified inside the LQR). This is one way to use gradient approximation methods. Another way to use these methods is approximate the gradient of the loss function directly, and use that information to iteratively calculate a control signal that minimizes the loss function. This second application is the primary use of the SPSA algorithm, and is what’s described by Dr. Spall in his overview paper.

In this application, the algorithm works like this:

2. perturb input and simulate results
3. observe loss function and calculate gradient
4. update input to system
5. repeat to convergence

Because in this approach we’re iteratively optimizing the input using our gradient estimation, having a noisy estimate is no longer a death sentence, as it was in the LQR. If we update our input to the system with several noisy gradient estimates the noise will essentially just cancel itself out. This means that SPSA now has a powerful advantage over FDSA: Since in SPSA we vary all parameters at once, only 2 samples of the loss function are used to estimate the gradient, regardless of the number of parameters. In contrast, FDSA needs to sample the loss function twice for every input parameter. Here’s a picture from (Spall, 1998) that shows the two running against each other to optimize a 2D problem:

This gets across that even though SPSA bounces around more, they both reach the solution in the same number of steps. And, in general, this is the case, as Dr. Spall talks about in the paper. There’s also a couple more details of the algorithm, so let’s look at it in detail. Here’s the code, which is just a straight translation into Python out of the description in Dr. Spall’s paper:

# Step 1: Initialization and coefficient selection
max_iters = 5
converge_thresh = 1e-5

alpha = 0.602 # from (Spall, 1998)
gamma = 0.101
a = .101 # found empirically using HyperOpt
A = .193
c = .0277

delta_K = None
delta_J = None
u = np.copy(self.u) if self.u is not None \
else np.zeros(self.arm.DOF)
for k in range(max_iters):
ak = a / (A + k + 1)**alpha
ck = c / (k + 1)**gamma

# Step 2: Generation of simultaneous perturbation vector
# choose each component from a bernoulli +-1 distribution with
# probability of .5 for each +-1 outcome.
delta_k = np.random.choice([-1,1], size=arm.DOF, p=[.5, .5])

# Step 3: Function evaluations
inc_u = np.copy(u) + ck * delta_k
cost_inc = self.cost(np.copy(state), inc_u)
dec_u = np.copy(u) - ck * delta_k
cost_dec = self.cost(np.copy(state), dec_u)

gk = np.dot((cost_inc - cost_dec) / (2.0*ck), delta_k)

# Step 5: Update u estimate
old_u = np.copy(u)
u -= ak * gk

# Step 6: Check for convergence
if np.sum(abs(u - old_u)) < converge_thresh:
break


The main as-of-yet-unexplained parts of this code are the alpha, gamma, a, A, and c variables. What’s their deal?

Looking inside the loop, we can see that ck controls the magnitude of our perturbations. Looking a little further down, ak is just the learning rate. And all of those other parameters are just involved in shaping the trajectories that ak and ck follow through iterations, which is a path towards zero. So the first steps and perturbations are the biggest, and each successively becomes smaller as the iteration count increases.

There are a few heuristics that Dr. Spall goes over, but there aren’t any hard and fast rules for setting a, A, and c. Here, I just used HyperOpt to find some values that worked pretty well for this particular problem.

The FDSA version of this is also very straight-forward:

# Step 1: Initialization and coefficient selection
max_iters = 10
converge_thresh = 1e-5

eps = 1e-4
u = np.copy(self.u) if self.u is not None \
else np.zeros(self.arm.DOF)
for k in range(max_iters):

gk = np.zeros(u.shape)
for ii in range(gk.shape[0]):
# Step 2: Generate perturbations one parameter at a time
inc_u = np.copy(u)
inc_u[ii] += eps
dec_u = np.copy(u)
dec_u -= eps

# Step 3: Function evaluation
cost_inc = self.cost(np.copy(state), inc_u)
cost_dec = self.cost(np.copy(state), dec_u)

gk[ii] = (cost_inc - cost_dec) / (2.0 * eps)

old_u = np.copy(u)
# Step 5: Update u estimate
u -= 1e-5 * gk

# Step 6: Check for convergence
if np.sum(abs(u - old_u)) < converge_thresh:
break


You’ll notice that in both the SPSA and FDSA code we’re no longer sampling plant_dynamics, we’re instead sampling cost, a loss function I defined. From just my experience playing around with these algorithms a bit, getting the loss function to be appropriate and give the desired behaviour is definitely a bit of an art. It feels like much more of an art than in other controllers I’ve coded, but that could just be me.

The cost function that I’m using is pretty much the first thing you’d think of. It penalizes distance to target and having non-zero velocity. Getting the weighting between distance to target and velocity set up so that the arm moves to the target but also doesn’t overshoot definitely took a bit of trial and error, er, I mean empirical analysis. Here’s the cost function that I found worked pretty well, note that I had to make special cases for the different arms:

 def cost(self, x, u):
dt = .1 if self.arm.DOF == 3 else .01
next_x = self.plant_dynamics(x, u, dt=dt)
vel_gain = 100 if self.arm.DOF == 3 else 10
return (np.sqrt(np.sum((self.arm.x - self.target)**2)) * 1000 \
+ np.sum((next_x[self.arm.DOF:])**2) * vel_gain)


So that’s all the code, let’s look at the results!

For these results, I used a max of 10 iterations for optimizing the control signal. I was definitely surprised by the quality of the results, especially for the 3-link arm, compared to the results generated by a standard LQR controller. Although I need to note, again, that it was a fair bit of me playing around with the exact cost function to get these results. Lots of empirical analysis.

The two controllers generate results that are identical to visual inspection. However, especially in the 3-link arm, the time required to run the FDSA was significantly longer than the SPSA controller. It took approximately 140ms for the SPSA controller to run a single loop, but took FDSA on average 700ms for a single loop of calculating the control signal. Almost 5 times as long! For the same results! In directly optimizing the control signal, SPSA gets a big win over standard FDSA. So, if you’re looking to directly optimize over a loss function, SPSA is probably the way you want to go.

Conclusions

First off, I thought it was really neat to directly apply gradient approximation methods to optimizing the control signal. It’s something I haven’t tried before, but definitely makes sense, and can generate some really nice results when tuned properly. Automating the tuning is definitely I’ll be discussing in future posts, because doing it by hand takes a long time and is annoying.

In the LQR, the gradient approximation was best done by the FDSA. I think the main reasons for this is that in solving for the control signal the LQR algorithm uses matrix inverses, and any errors in the linear approximations to the dynamics are going to be amplified quite a bit. If I did anything less than 10-15 iterations (20 for the 3-link arm) in the SPSA approximation then things exploded. Also, here the SPSA algorithm required a matrix inverse, where the FDSA didn’t. This is because we only varied one parameter at a time in FDSA, and the effects of changing each was isolated. In the SPSA case, we had to consider the changes across all the variables and the resulting effects all at once, essentially noting which variables changed by how much and the changes in each case, and averaging. Here, even with the more complex 3-link arm, FDSA was faster, so I’m going to stick with it in my LQR and iLQR implementations.

In the direct control signal optimization SPSA beat the pants off of FDSA. It was almost 5 times faster for control of the 3-link arm. This was, again, because in this case we could use noisy samples of the gradient of the loss function and relied on noise to cancel itself out as we iterated. So we only needed 2 samples of the loss function in SPSA, where in FDSA we needed 2*num_parameters. And although this generated pretty good results I would definitely be hesitant against using this for any more complicated systems, because tuning that cost function to get out a good trajectory was a pain. If you’re interested in playing around with this, you can check out the code for the gradient controllers up on my GitHub.

## The iterative Linear Quadratic Regulator algorithm

A few months ago I posted on Linear Quadratic Regulators (LQRs) for control of non-linear systems using finite-differences. The gist of it was at every time step linearize the dynamics, quadratize (it could be a word) the cost function around the current point in state space and compute your feedback gain off of that, as though the dynamics were both linear and consistent (i.e. didn’t change in different states). And that was pretty cool because you didn’t need all the equations of motion and inertia matrices etc to generate a control signal. You could just use the simulation you had, sample it a bunch to estimate the dynamics and value function, and go off of that.

The LQR, however, operates with maverick disregard for changes in the future. Careless of the consequences, it optimizes assuming the linear dynamics approximated at the current time step hold for all time. It would be really great to have an algorithm that was able to plan out and optimize a sequence, mindful of the changing dynamics of the system.

This is exactly the iterative Linear Quadratic Regulator method (iLQR) was designed for. iLQR is an extension of LQR control, and the idea here is basically to optimize a whole control sequence rather than just the control signal for the current point in time. The basic flow of the algorithm is:

1. Initialize with initial state $x_0$ and initial control sequence $\textbf{U} = [u_{t_0}, u_{t_1}, ..., u_{t_{N-1}}]$.
2. Do a forward pass, i.e. simulate the system using $(x_0, \textbf{U})$ to get the trajectory through state space, $\textbf{X}$, that results from applying the control sequence $\textbf{U}$ starting in $x_0$.
3. Do a backward pass, estimate the value function and dynamics for each $(\textbf{x}, \textbf{u})$ in the state-space and control signal trajectories.
4. Calculate an updated control signal $\hat{\textbf{U}}$ and evaluate cost of trajectory resulting from $(x_0, \hat{\textbf{U}})$.
1. If $|(\textrm{cost}(x_0, \hat{\textbf{U}}) - \textrm{cost}(x_0, \textbf{U})| < \textrm{threshold}$ then we've converged and exit.
2. If $\textrm{cost}(x_0, \hat{\textbf{U}}) < \textrm{cost}(x_0, \textbf{U})$, then set $\textbf{U} = \hat{\textbf{U}}$, and change the update size to be more aggressive. Go back to step 2.
3. If $\textrm{cost}(x_0, \hat{\textbf{U}}) \geq$ $\textrm{cost}(x_0, \textbf{U})$ change the update size to be more modest. Go back to step 3.

There are a bunch of descriptions of iLQR, and it also goes by names like ‘the sequential linear quadratic algorithm’. The paper that I’m going to be working off of is by Yuval Tassa out of Emo Todorov’s lab, called Control-limited differential dynamic programming. And the Python implementation of this can be found up on my github in my Control repo. Also, a big thank you to Dr. Emo Todorov who provided Matlab code for the iLQG algorithm, which was super helpful.

Defining things

So let’s dive in. Formally defining things, we have our system $\textbf{x}$, and dynamics described with the function $\textbf{f}$, such that

$\textbf{x}_{t+1} = \textbf{f}(\textbf{x}_t, \textbf{u}_t),$

where $\textbf{u}$ is the input control signal. The trajectory $\{\textbf{X}, \textbf{U}\}$ is the sequence of states $\textbf{X} = \{\textbf{x}_0, \textbf{x}_1, ..., \textbf{x}_N\}$ that result from applying the control sequence $\textbf{U} = \{\textbf{u}_0, \textbf{u}_1, ..., \textbf{u}_{N-1}\}$ starting in the initial state $\textbf{x}_0$.

Now we need to define all of our cost related equations, so we know exactly what we’re dealing with.

Define the total cost function $J$, which is the sum of the immediate cost, $\ell$, from each state in the trajectory plus the final cost, $\ell_f$:

$J(\textbf{x}_0, \textbf{U}) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{x}_t, \textbf{u}_t) + \ell_f(\textbf{x}_N).$

Letting $\textbf{U}_t = \{\textbf{u}_t, \textbf{u}_{t+1}, ..., \textbf{U}_{N-1}\}$, we define the cost-to-go as the sum of costs from time $t$ to $N$:

$J_t(\textbf{x}, \textbf{U}_t) = \sum\limits^{N-1}\limits_{i=t} \ell(\textbf{x}_i, \textbf{u}_i) + \ell_f(\textbf{x}_N).$

The value function $V$ at time $t$ is the optimal cost-to-go from a given state:

$V_t(\textbf{x}) = \min\limits_{\textbf{U}_t} J_t(\textbf{x}, \textbf{U}_t),$

where the above equation just says that the optimal cost-to-go is found by using the control sequence $\textbf{U}_t$ that minimizes $J_t$.

At the final time step, $N$, the value function is simply

$V(\textbf{x}_N) = \ell_f(\textbf{x}_N).$

For all preceding time steps, we can write the value function as a function of the immediate cost $\ell(\textbf{x}, \textbf{u})$ and the value function at the next time step:

$V(\textbf{x}) = \min\limits_{\textbf{u}} \left[ \ell(\textbf{x}, \textbf{u}) + V(\textbf{f}(\textbf{x}, \textbf{u})) \right].$

NOTE: In the paper, they use the notation $V'(\textbf{f}(\textbf{x}, \textbf{u}))$ to denote the value function at the next time step, which is redundant since $\textbf{x}_{t+1} = \textbf{f}(\textbf{x}_t, \textbf{u}_t)$, but it comes in handy later when they drop the dependencies to simplify notation. So, heads up: $V' = V(\textbf{f}(\textbf{x}, \textbf{u})$.

Forward rollout

The forward rollout consists of two parts. The first part is to simulating things to generate the $(\textbf{X}, \textbf{U}),$ from which we can calculate the overall cost of the trajectory, and find out the path that the arm will take. To improve things though we’ll need a lot of information about the partial derivatives of the system, calculating these is the second part of the forward rollout phase.

To calculate all these partial derivatives we’ll use $(\textbf{X}, \textbf{U})$. For each $(\textbf{x}_t, \textbf{u}_t)$ we’ll calculate the derivatives of $\textbf{f}(\textbf{x}_t, \textbf{u}_t)$ with respect to $\textbf{x}_t$ and $\textbf{u}_t$, which will give us what we need for our linear approximation of the system dynamics.

To get the information we need about the value function, we’ll need the first and second derivatives of $\ell(\textbf{x}_t, \textbf{u}_t)$ and $\ell_f(\textbf{x}_t, \textbf{x}_t)$ with respect to $\textbf{x}_t$ and $\textbf{u}_t$.

So all in all, we need to calculate $\textbf{f}_\textbf{x}$, $\textbf{f}_\textbf{u}$, $\ell_\textbf{x}$, $\ell_\textbf{u}$, $\ell_\textbf{xx}$, $\ell_\textbf{ux}$, $\ell_\textbf{uu}$, where the subscripts denote a partial derivative, so $\ell_\textbf{x}$ is the partial derivative of $\ell$ with respect to $\textbf{x}$, $\ell_\textbf{xx}$ is the second derivative of $\ell$ with respect to $\textbf{x}$, etc. And to calculate all of these partial derivatives, we’re going to use finite differences! Just like in the LQR with finite differences post. Long story short, load up the simulation for every time step, slightly vary one of the parameters, and measure the resulting change.

Once we have all of these, we’re ready to move on to the backward pass.

Backward pass

Now, we started out with an initial trajectory, but that was just a guess. We want our algorithm to take it and then converge to a local minimum. To do this, we’re going to add some perturbing values and use them to minimize the value function. Specifically, we’re going to compute a local solution to our value function using a quadratic Taylor expansion. So let’s define $Q(\delta \textbf{x}, \delta \textbf{u})$ to be the change in our value function at $(\textbf{x}, \textbf{u})$ as a result of small perturbations $(\delta \textbf{x}, \delta \textbf{u})$:

$Q(\delta \textbf{x}, \delta \textbf{u}) = \ell (\textbf{x} + \delta \textbf{x}, \textbf{u} + \delta \textbf{u}) + V(\textbf{f}(\textbf{x} + \delta\textbf{x}, \textbf{u} + \delta \textbf{u})).$

The second-order expansion of $Q$ is given by:

$Q_\textbf{x} = \ell_\textbf{x} + \textbf{f}_\textbf{x}^T V'_\textbf{x},$

$Q_\textbf{u} = \ell_\textbf{u} + \textbf{f}_\textbf{u}^T V'_\textbf{x},$

$Q_\textbf{xx} = \ell_\textbf{xx} + \textbf{f}_\textbf{x}^T V'_\textbf{xx} \textbf{f}_\textbf{x} + V'_\textbf{x} \cdot \textbf{f}_\textbf{xx},$

$Q_\textbf{ux} = \ell_\textbf{ux} + \textbf{f}_\textbf{u}^T V'_\textbf{xx} \textbf{f}_\textbf{x}+ V'_\textbf{x} \cdot \textbf{f}_\textbf{ux},$

$Q_\textbf{uu} = \ell_\textbf{uu} + \textbf{f}_\textbf{u}^T V'_\textbf{xx} \textbf{f}_\textbf{u}+ V'_\textbf{x} \cdot \textbf{f}_\textbf{uu}.$

Remember that $V' = V(\textbf{f}(\textbf{x}, \textbf{u}))$, which is the value function at the next time step. NOTE: All of the second derivatives of $\textbf{f}$ are zero in the systems we’re controlling here, so when we calculate the second derivatives we don’t need to worry about doing any tensor math, yay!

Given the second-order expansion of $Q$, we can to compute the optimal modification to the control signal, $\delta \textbf{u}^*$. This control signal update has two parts, a feedforward term, $\textbf{k}$, and a feedback term $\textbf{K} \delta\textbf{x}$. The optimal update is the $\delta\textbf{u}$ that minimizes the cost of $Q$:

$\delta\textbf{u}^*(\delta \textbf{x}) = \min\limits_{\delta\textbf{u}}Q(\delta\textbf{x}, \delta\textbf{u}) = \textbf{k} + \textbf{K}\delta\textbf{x},$

where $\textbf{k} = -Q^{-1}_\textbf{uu} Q_\textbf{u}$ and $\textbf{K} = -Q^{-1}_\textbf{uu} Q_\textbf{ux}.$

Derivation can be found in this earlier paper by Li and Todorov. By then substituting this policy into the expansion of $Q$ we get a quadratic model of $V$. They do some mathamagics and come out with:

$V_\textbf{x} = Q_\textbf{x} - \textbf{K}^T Q_\textbf{uu} \textbf{k},$

$V_\textbf{xx} = Q_\textbf{xx} - \textbf{K}^T Q_\textbf{uu} \textbf{K}.$

So now we have all of the terms that we need, and they’re defined in terms of the values at the next time step. We know the value of the value function at the final time step $V_N = \ell_f(\textbf{x}_N)$, and so we’ll simply plug this value in and work backwards in time recursively computing the partial derivatives of $Q$ and $V$.

Calculate control signal update

Once those are all calculated, we can calculate the gain matrices, $\textbf{k}$ and $\textbf{K}$, for our control signal update. Huzzah! Now all that’s left to do is evaluate this new trajectory. So we set up our system

$\hat{\textbf{x}}_0 = \textbf{x}_0,$

$\hat{\textbf{u}}_t = \textbf{u}_t + \textbf{k}_t + \textbf{K}_t (\hat{\textbf{x}}_t - \textbf{x}_t),$

$\hat{\textbf{x}}_{t+1} = \textbf{f}(\hat{\textbf{x}}_t, \hat{\textbf{u}}_t),$

and record the cost. Now if the cost of the new trajectory $(\hat{\textbf{X}}, \hat{\textbf{U}})$ is less than the cost of $(\textbf{X}, \textbf{U})$ then we set $\textbf{U} = \hat{\textbf{U}}$ and go do it all again! And when the cost from an update becomes less than a threshold value, call it done. In code this looks like:

if costnew < cost:
sim_new_trajectory = True

if (abs(costnew - cost)/cost) < self.converge_thresh:
break


Of course, another option we need to account for is when costnew > cost. What do we do in this case? Our control update hasn’t worked, do we just exit?

The Levenberg-Marquardt heuristic
No! Phew.

The control signal update in iLQR is calculated in such a way that it can behave like Gauss-Newton optimization (which uses second-order derivative information) or like gradient descent (which only uses first-order derivative information). The is that if the updates are going well, then lets include curvature information in our update to help optimize things faster. If the updates aren’t going well let’s dial back towards gradient descent, stick to first-order derivative information and use smaller steps. This wizardry is known as the Levenberg-Marquardt heuristic. So how does it work?

Something we skimmed over in the iLQR description was that we need to calculate $Q^{-1}_\textbf{uu}$ to get the $\textbf{k}$ and $\textbf{K}$ matrices. Instead of using np.linalg.pinv or somesuch, we’re going to calculate the inverse ourselves after finding the eigenvalues and eigenvectors, so that we can regularize it. This will let us do a couple of things. First, we’ll be able to make sure that our estimate of curvature ($Q_\textbf{uu}^{-1}$) stays positive definite, which is important to make sure that we always have a descent direction. Second, we’re going to add a regularization term to the eigenvalues to prevent them from exploding when we take their inverse. Here’s our regularization implemented in Python:


Q_uu_evals, Q_uu_evecs = np.linalg.eig(Q_uu)
Q_uu_evals[Q_uu_evals < 0] = 0.0
Q_uu_evals += lamb
Q_uu_inv = np.dot(Q_uu_evecs,
np.dot(np.diag(1.0/Q_uu_evals), Q_uu_evecs.T))


Now, what happens when we change lamb? The eigenvalues represent the magnitude of each of the eigenvectors, and by taking their reciprocal we flip the contributions of the vectors. So the ones that were contributing the least now have the largest singular values, and the ones that contributed the most now have the smallest eigenvalues. By adding a regularization term we ensure that the inverted eigenvalues can never be larger than 1/lamb. So essentially we throw out information.

In the case where we’ve got a really good approximation of the system dynamics and value function, we don’t want to do this. We want to use all of the information available because it’s accurate, so make lamb small and get a more accurate inverse. In the case where we have a bad approximation of the dynamics we want to be more conservative, which means not having those large singular values. Smaller singular values give a smaller $Q_\textbf{uu}^{-1}$ estimate, which then gives smaller gain matrices and control signal update, which is what we want to do when our control signal updates are going poorly.

How do you know if they’re going poorly or not, you now surely ask! Clever as always, we’re going to use the result of the previous iteration to update lamb. So adding to the code from just above, the end of our control update loop is going to look like:

lamb = 1.0 # initial value of lambda
...
if costnew < cost:
lamb /= self.lamb_factor
sim_new_trajectory = True

if (abs(costnew - cost)/cost) < self.converge_thresh:
break
else:
lamb *= self.lamb_factor
if lamb > self.max_lamb:
break


And that is pretty much everything! OK let’s see how this runs!

Simulation results

If you want to run this and see for yourself, you can go copy my Control repo, navigate to the main directory, and run

python run.py arm2 reach


or substitute in arm3. If you’re having trouble getting the arm2 simulation to run, try arm2_python, which is a straight Python implementation of the arm dynamics, and should work no sweat for Windows and Mac.

Below you can see results from the iLQR controller controlling the 2 and 3 link arms (click on the figures to see full sized versions, they got distorted a bit in the shrinking to fit on the page), using immediate and final state cost functions defined as:

l = np.sum(u**2)


and

pos_err = np.array([self.arm.x[0] - self.target[0],
self.arm.x[1] - self.target[1]])
l = (wp * np.sum(pos_err**2) + # pos error
wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error


where wp and wv are just gain values, x is the state of the system, and self.arm.x is the $(x,y)$ position of the hand. These read as “during movement, penalize large control signals, and at the final state, have a big penalty on not being at the target.”

So let’s give it up for iLQR, this is awesome! How much of a crazy improvement is that over LQR? And with all knowledge of the system through finite differences, and with the full movements in exactly 1 second! (Note: The simulation speeds look different because of my editing to keep the gif sizes small, they both take the same amount of time for each movement.)

Changing cost functions
Something that you may notice is that the control of the 3 link is actually straighter than the 2 link. I thought that this might be just an issue with the gain values, since the scale of movement is smaller for the 2 link arm than the 3 link there might have been less of a penalty for not moving in a straight line, BUT this was wrong. You can crank the gains and still get the same movement. The actual reason is that this is what the cost function specifies, if you look in the code, only $\ell_f(\textbf{x}_N)$ penalizes the distance from the target, and the cost function during movement is strictly to minimize the control signal, i.e. $\ell(\textbf{x}_t, \textbf{u}_t) = \textbf{u}_t^2$.

Well that’s a lot of talk, you say, like the incorrigible antagonist we both know you to be, prove it. Alright, fine! Here’s iLQR running with an updated cost function that includes the end-effector’s distance from the target in the immediate cost:

All that I had to do to get this was change the immediate cost from

l = np.sum(u**2)


to

l = np.sum(u**2)
pos_err = np.array([self.arm.x[0] - self.target[0],
self.arm.x[1] - self.target[1]])
l += (wp * np.sum(pos_err**2) + # pos error
wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error


where all I had to do was include the position penalty term from the final state cost into the immediate state cost.

Changing sequence length
In these simulations the system is simulating at .01 time step, and I gave it 100 time steps to reach the target. What if I give it only 50 time steps?

It looks pretty much the same! It’s just now twice as fast, which is of course achieved by using larger control signals, which we don’t see, but dang awesome.

What if we try to make it there in 10 time steps??

OK well that does not look good. So what’s going on in this case? Basically we’ve given the algorithm an impossible task. It can’t make it to the target location in 10 time steps. In the implementation I wrote here, if it hits the end of it’s control sequence and it hasn’t reached the target yet, the control sequence starts over back at t=0. Remember that part of the target state is also velocity, so basically it moves for 10 time steps to try to minimize $(x,y)$ distance, and then slows down to minimize final state cost in the velocity term.

In conclusion

This algorithm has been used in a ton of things, for controlling robots and simulations, and is an important part of guided policy search, which has been used to very successfully train deep networks in control problems. It’s getting really impressive results for controlling the arm models that I’ve built here, and using finite differences should easily generalize to other systems.

iLQR is very computationally expensive, though, so that’s definitely a downside. It’s definitely less expensive if you have the equations of your system, or at least a decent approximation of them, and you don’t need to use finite differences. But you pay for the efficiency with a loss in generality.

There are also a bunch of parameters to play around with that I haven’t explored at all here, like the weights in the cost function penalizing the magnitude of the cost function and the final state position error. I showed a basic example of changing the cost function, which hopefully gets across just how easy changing these things out can be when you’re using finite differences, and there’s a lot to play around with there too.

Implementation note
In the Yuval and Todorov paper, they talked about using backtracking line search when generating the control signal. So the algorithm they had when generating the new control signal was actually:

$\hat{\textbf{u}}_t = \hat{\textbf{u}}_t + \alpha\textbf{k}_t + \textbf{K}_t(\hat{\textbf{x}}_t - \textbf{x}_t)$

where $\alpha$ was the backtracking search parameter, which gets set to one initially and then reduced. It’s very possible I didn’t implement it as intended, but I found consistently that $\alpha = 1$ always generated the best results, so it was just adding computation time. So I left it out of my implementation. If anyone has insights on an implementation that improves results, please let me know!

And then finally, another thank you to Dr. Emo Todorov for providing Matlab code for the iLQG algorithm, which was very helpful, especially for getting the Levenberg-Marquardt heuristic implemented properly.

## Linear-Quadratic Regulation for non-linear systems using finite differences

One of the standard controllers in basic control theory is the linear-quadratic regulator (LQR). There is a finite-horizon case (where you have a limited amount of time), and an infinite-horizon case (where you don’t); in this post, for simplicity, we’re only going to be dealing with the infinite-horizon case.

The LQR is designed to handle a very specific kind of problem. First, it assumes you are controlling a system with linear dynamics, which means you can express them as

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u}$,

where $\textbf{x}$ and $\dot{\textbf{x}}$ are the state and its time derivative, $\textbf{u}$ is the input, and $\textbf{A}$ and $\textbf{B}$ capture the effects of the state and input on the derivative. And second, it assumes that the cost function, denoted $J$, is a quadratic of the form

$J = \int_0^{\infty} \left( (\textbf{x} - \textbf{x}^*)^T \textbf{Q} (\textbf{x} - \textbf{x}^*) + \textbf{u}^T \textbf{R} \textbf{u} \right) dt$

where $\textbf{x}^*$ is the target state, and $\textbf{Q} = \textbf{Q}^T \geq 0$ and $\textbf{R} = \textbf{R}^T \geq 0$ are weights on the cost of not being at the target state and applying a control signal. The higher $\textbf{Q}$ is, the more important it is to get to the target state asap, the higher $\textbf{R}$ is, the more important it is to keep the control signal small as you go to the target state.

The goal of the LQR is to calculate a feedback gain matrix $\textbf{K}$ such that

$\textbf{u} = -\textbf{K} \textbf{x},$

drives the system to the target. When the system is a linear system with a quadratic cost function, this can be done optimally. There is lots of discussion elsewhere about LQRs and their derivation, so I’m not going to go into that with this post. Instead, I’m going to talk about applying LQRs to non-linear systems, and using finite differences to do it, which works when you have a readily accessible simulation of the system on hand. The fun part is that by using finite differences you can get this to work without working out the dynamics equations yourself.

Using LQRs on non-linear systems

As you may have noticed, non-linear systems violate the first assumption of a linear quadratic regulator; that the system is linear. That doesn’t mean that we can’t apply it, it just means that it’s not going to be optimal. How poorly the LQR will perform depends on a few things, two important factors being how non-linear the system dynamics actually are, and how often you’re able to update the feedback gain matrix $\textbf{K}$. To apply LQR to non-linear systems we’re just going to close our eyes and pretend that the system dynamics are linear, i.e. they fit the form

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u}.$

We’ll do this by approximating the actual dynamics of the system linearly. We’ll then solve for our gain value $\textbf{K}$, generate our control signal for this timestep, and then re-approximate the dynamics again at the next time step and solve for $\textbf{K}$ from the new state. The more non-linear the system dynamics are, the less appropriate $\textbf{K}$ will be for generating our control signal $\textbf{u}$ as we move away from the state $\textbf{K}$ was calculated in; this is why update time of the LQR can become an important factor.

Using finite-differences to approximate system dynamics

An important question, then, is how do we find this system approximation? How can we calculate the $\textbf{A}$ and $\textbf{B}$ matrices that we then use to solve for $\textbf{K}$? If we know the dynamics of the system to be

$\dot{\textbf{x}} = f(\textbf{x}, \textbf{u})$,

then we can calculate

$\textbf{A} = \frac{\partial f(\textbf{x}, \textbf{u})}{\partial \textbf{x}}, \;\;\;\; \textbf{B} = \frac{\partial f(\textbf{x}, \textbf{u})}{\partial \textbf{u}}$.

If you’re going to try this for the 3-link arm, though, get out Mathematica. Do not try this by hand. If you disregard my warning and foolhardily attempt such a derivation you will regret, repent, and then appeal to Wolfram Alpha for salvation. These equations quickly become terrible and long even for seemingly not-so-complicated systems.

There are a few ways to skirt this. Here we’re going to assume that the system under control is a simulation, or that we at least have access to an accurate model, and use the finite differences method to compute these values. The idea behind finite differences is to approximate the rate of change of the function $f$ at the point $x$ by sampling $f$ near $x$ and using the difference to calculate $\dot{f}(x)$. Here’s a picture for a 1D system:

So here, our current state $x$ is the blue dot, and the red dots represent the sample points $x + \Delta x$ and $x - \Delta x$. We can then calculate

$\dot{f}(x) \approx \frac{f(x+\Delta x) - f(x-\Delta x)}{2\Delta x},$

and you can see the actual rate of change of $f$ at $x$ plotted in the blue dashed line, and the approximated rate of change calculated using finite differences plotted in the red dashed line. We can also see that the approximated derivative is only accurate near $x$ (the blue dot).

Back in our multi-dimensional system, to use finite differences to calculate the derivative with respect to the state and the input we’re going to vary each of the dimensions of the state and input by some small amount one at a time, calculating the effects of each one by one. Here’s a chunk of pseudo-code to hopefully clarify this idea:

eps = 1e-5
A = np.zeros((len(current_state), len(current_state))
for ii in range(len(current_state)):
x = current_state.copy()
x[ii] += eps
x_inc = simulate_system(state=x, input=control_signal)
x = current_state.copy()
x[ii] -= eps
x_dec = simulate_system(state=x, input=control_signal)
A[:,ii] = (x_inc - x_dec) / (2 * eps)

B = np.zeros((len(current_state), len(control_signal))
for ii in range(len(control_signal)):
u = control_signal.copy()
u[ii] += eps
x_inc = simulate_system(state=current_state, input=u)
u = control_signal.copy()
u[ii] -= eps
x_dec = simulate_system(state=current_state, input=u)
B[:,ii] = (x_inc - x_dec) / (2 * eps)


Now we’re able to generate our $\textbf{A}$ and $\textbf{B}$ matrices we have everything we need to solve for our feedback gain matrix $\textbf{K}$! Which is great.

Note on using finite differences in continuous vs discrete setup

Something that’s important to straighten out too is what exactly is returned by the simulate_system function in the code above. In the continuous case, your system is captured as

$\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u},$,

where in the discrete case your system is defined

$\textbf{x}(t+1) = \textbf{A}\textbf{x}(t) + \textbf{B}\textbf{u}(t).$

If you are calculating your feedback gain matrix $\textbf{K}$ using the continuous solution to the algebraic Riccati equation, then you need to be returning $\dot{\textbf{x}}(t)$. If you’re solving for $\textbf{K}$ using the discrete solution to the algebraic Riccati equation you need to return $\textbf{x}(t+1)$. This was just something that I came across as I was coding and so I wanted to mention it here in case anyone else stumbled across it!

Applying LQR to 2 and 3 link arm control

Alright! Let’s have a look at how the LQR does controlling non-linear systems. Below we have the control of a 2-link arm compared to a 3-link arm, and you can see the control of the 2-link arm is better. This is a direct result of the dynamics of a 3-link arm being significantly more complex.

Note on controlling at different timesteps

When I was first testing the LQR controller I expected the effects of different control update times to be a lot more significant than it was. As it turns out, for controlling a 3-link arm, there’s not really a visible difference in a controller that is updating every .01 seconds vs every .001 seconds vs every .0001 seconds. Let’s have a look:

Can’t even tell, eh? Fun fact, the simulation took 1 minute 30 seconds at .01 seconds time step and 45 minutes at .0001 seconds time step. The left-most animation is the .01 seconds and the right-most the .0001 seconds. But why is there seemingly so little difference? Well, this boils down to the dynamics of the 3-link arm changing actually pretty slowly. Below I’ve plotted just a few of the elements from the $\textbf{A}$, $\textbf{B}$, and $\textbf{K}$ matrices over .5 seconds of simulation time:

So, there are some obvious points where sampling the dynamics at a .01 time step is noticeably less accurate, but all in all there’s not a huuuggge difference between sampling at .01 and .0001 seconds. If you’re just watching the end-effector path it’s really not very noticeable. You can see how the elements of $\textbf{A}$ and $\textbf{B}$ are changing fairly slowly; this means that $\textbf{K}$ is going to be an effective feedback gain for a fair chunk of time. And the computational savings you get by sampling the dynamics and regenerating $\textbf{K}$ every .01 seconds instead of every .0001 seconds are pretty big. This was just another thing that I came across when playing around with the LQR, the take away being don’t just assume you need to update your system crazy often. You might get very comparable performance for much less computational cost.

Conclusions

All in all, the LQR controller is pretty neat! It’s really simple to set up, and generic. We don’t need any specific information about the system dynamics, like we do for effective operational space control (OSC). When we estimate the dynamics with finite differences, all need is a decent system model that we can sample. Again, the more non-linear the system, of course, the less effective a LQR will be. If you’re interested in playing around with one, or generating the figures that I show above, the code is all up and running on my Github for you to explore.

## Dynamic movement primitives part 1: The basics

Dynamic movement primitives (DMPs) are a method of trajectory control / planning from Stefan Schaal’s lab. They were presented way back in 2002 in this paper, and then updated in 2013 by Auke Ijspeert in this paper. This work was motivated by the desire to find a way to represent complex motor actions that can be flexibly adjusted without manual parameter tuning or having to worry about instability.

Complex movements have long been thought to be composed of sets of primitive action ‘building blocks’ executed in sequence and \ or in parallel, and DMPs are a proposed mathematical formalization of these primitives. The difference between DMPs and previously proposed building blocks is that each DMP is a nonlinear dynamical system. The basic idea is that you take a dynamical system with well specified, stable behaviour and add another term that makes it follow some interesting trajectory as it goes about its business. There are two kinds of DMPs: discrete and rhythmic. For discrete movements the base system is a point attractor, and for rhythmic movements a limit cycle is used. In this post we’re only going to worry about discrete DMPs, because by the time we get through all the basics this will already be a long post.

Imagine that you have two systems: An imaginary system where you plan trajectories, and a real system where you carry them out. When you use a DMP what you’re doing is planning a trajectory for your real system to follow. A DMP has its own set of dynamics, and by setting up your DMP properly you can get the control signal for your actual system to follow. If our DMP system is planing a path for the hand to follow, then what gets sent to the real system is the set of forces that need to be applied to the hand. It’s up to the real system to take these hand forces and apply them, by converting them down to joint torques or muscle activations (through something like the operation space control framework) or whatever. That’s pretty much all I’ll say here about the real system, what we’re going to focus on here is the DMP system. But keep in mind that the whole DMP framework is for generating a trajectory \ control signal to guide the real system.

I’ve got the code for the basic discrete DMP setup and examples I work through in this post up on my github, so if you want to jump straight to that, there’s the link! You can run test code for each class just by executing that file.

Discrete DMPs

Let’s start out with point attractor dynamics:

$\ddot{y} = \alpha_y ( \beta_y (g - y) - \dot{y})$,

where $y$ is our system state, $g$ is the goal, and $\alpha$ and $\beta$ are gain terms. This should look very familiar, it’s a PD control signal, all this is going to do is draw our system to the target. Now what we’ll do is add on a forcing term that will let us modify this trajectory:

$\ddot{y} = \alpha_y ( \beta_y (g - y) - \dot{y}) + f$.

How to define a nonlinear function $f$ such that you get the desire behaviour is a non-trivial question. The crux of the DMP framework is an additional nonlinear system used to define the forcing function $f$ over time, giving the problem a well defined structure that can be solved in a straight-forward way and easily generalizes. The introduced system is called the canonical dynamical system, is denoted $x$, and has very simple dynamics:

$\dot{x} = -\alpha_x x$.

The forcing function $f$ is defined as a function of the canonical system:

$f(x,g) = \frac{\Sigma_{i=1}^N \psi_i w_i}{\Sigma_{i=1}^N \psi_i} x(g - y_0)$,

where $y_0$ is the initial position of the system,

$\psi_i = \textrm{exp}\left( -h_i \left( x - c_i\right)^2 \right)$,

and $w_i$ is a weighting for a given basis function $\psi_i$. You may recognize that the $\psi_i$ equation above defines a Gaussian centered at $c_i$, where $h_i$ is the variance. So our forcing function is a set of Gaussians that are ‘activated’ as the canonical system $x$ converges to its target. Their weighted summation is normalized, and then multiplied by the $x (g - y_0)$ term, which is both a ‘diminishing’ and spatial scaling term.

Let’s break this down a bit. The canonical system starts at some arbitrary value, throughout this post $x_0 = 1$, and goes to 0 as time goes to infinity. For right now, let’s pretend that $x$ decays linearly to $0$. The first main point is that there are some basis functions which are activated as a function of $x$, this is displayed in the top figure below. As the value of $x$ decreases from 1 to 0, each of the Gaussians are activated (or centered) around different $x$ values. The second thing is that each of these basis functions are also assigned a weight, $w_i$. These weights are displayed in the lower figure in the bar plot. The output of the forcing function $f$ is then the summation of the activations of these basis functions multiplied by their weight, also displayed in the lower figure below.

The diminishing term
Incorporating the $x$ term into the forcing function guarantees that the contribution of the forcing term goes to zero over time, as the canonical system does. This means that we can sleep easy at night knowing that our system can trace out some crazy path, and regardless will eventually return to its simpler point attractor dynamics and converge to the target.

Spatial scaling
Spatial scaling means that once we’ve set up the system to follow a desired trajectory to a specific goal we would like to be able to move that goal farther away or closer in and get a scaled version of our trajectory. This is what the $(g - y_0)$ term of the forcing function handles, by scaling the activation of each of these basis functions to be relative to the distance to the target, causing the system to cover more or less distance. For example, let’s say that we have a set of discrete DMPs set up to follow a given trajectory:

The goals in this case are 1 and .5, which you can see is where the DMPs end up. Now, we’ve specified everything in this case for these particular goals (1 and .5), but let’s say we’d like to now generalize and get a scaled up version of this trajectory for moving by DMPs to a goal of 2. If we don’t appropriately scale our forcing function, with the $(g - y_0)$ term, then we end up with this:

Basically what’s happened is that for these new goals the same weightings of the basis functions were too weak to get the system to follow or desired trajectory. Once the $(g - y_0)$ term included in the forcing function, however, we get:

which is exactly what we want! Our movements now scale spatially. Awesome.

Alright, now, unfortunately for us, our canonical system does not converge linearly to the target, as we assumed above. Here’s a comparison of a linear decay vs the exponential decay of actual system:

This is an issue because our basis functions activate dependent on $x$. If the system was linear then we would be fine and the basis function activations would be well spread out as the system converged to the target. But, with the actual dynamics, $x$ is not a linear function of time. When we plot the basis function activations as a function of time, we see that the majority are activated immediately as $x$ moves quickly at the beginning, and then the activations stretch out as the $x$ slows down at the end:

In the interest of having the basis functions spaced out more evenly through time (so that our forcing function can still move the system along interesting paths as it nears the target, we need to choose our Gaussian center points more shrewdly. If we look at the values of $x$ over time, we can choose the times that we want the Gaussians to be activated, and then work backwards to find the corresponding $x$ value that will give us activation at that time. So, let’s look at a picture:

The red dots are the times we’d like the Gaussians to be centered around, and the blue line is our canonical system $x$. Following the dotted lines up to the corresponding $x$ values we see what values of $x$ the Gaussians need to be centered around. Additionally, we need to worry a bit about the width of each of the Gaussians, because those activated later will be activated for longer periods of time. To even it out the later basis function widths should be smaller. Through the very nonanalytical method of trial and error I’ve come to calculate the variance as

$h_i = \frac{\#BFs}{c_i}.$

Which reads the variance of basis function $i$ is equal to the number of basis functions divided by the center of that basis function. When we do this, we can now generate centers for our basis functions that are well spaced out:

Temporal scaling

Again, generalizability is one of the really important things that we want out of this system. There are two obvious kinds, temporal and spatial. Spatial scaling we discussed above, in the temporal case we’d like to be able to follow this same trajectory at different speeds. Sometimes quick, sometimes slow, but always tracing out the same path. To do that we’re going to add another term to our system dynamics, $\tau$, our temporal scaling term. Given that our system dynamics are:

$\ddot{y} = \alpha_y ( \beta_y (g - y) - \dot{y}) + f$,
$\dot{x} = -\alpha_x x$,

to give us temporal flexibility we can add the $\tau$ term:

$\ddot{y} = \tau^2 (\alpha_y ( \beta_y (g - y) - \dot{y}) + f)$,
$\dot{x} = \tau(-\alpha_x x)$,

where we use $\tau^2$ for $\ddot{y}$ because it’s the second derivative, and that’s all we have to do! Now to slow down the system you set $\tau$ between 0 and 1, and to speed it up you set $\tau$ greater than 1.

Imitating a desired path

Alright, great. We have a forcing term that can make the system take a weird path as it converges to a target point, and temporal and spatial scalability. How do we set up the system to follow a path that we specify? That would be ideal, to show the system the path to follow, and have it be able to work backwards and figure out the forces and then be able to generate that trajectory whenever we want. This ends up being a pretty straight forward process.

We have control over the forcing term, which affects the system acceleration. So we first need to take our desired trajectory, $\textbf{y}_d$ (where bold denotes a vector, in this case the time series of desired points in the trajectory), and differentiate it twice to get the accelerations:

$\ddot{\textbf{y}}_d = \frac{\partial}{\partial t} \dot{\textbf{y}}_d = \frac{\partial}{\partial t} \frac{\partial}{\partial t} \textbf{y}_d$.

Once we have the desired acceleration trajectory, we need to remove the effect of the base point attractor system. We have the equation above for exactly what the acceleration induced by the point attractor system at each point in time is:

$\ddot{y} = \alpha_y ( \beta_y (g - y) - \dot{y})$,

so then to calculate what the forcing term needs to be generate this trajectory we have:

$\textbf{f}_d = \ddot{\textbf{y}}_d - \alpha_y ( \beta_y (g - \textbf{y}) - \dot{\textbf{y}})$.

From here we know that the forcing term is comprised of a weighted summation of basis functions which are activated through time, so we can use an optimization technique like locally weighted regression to choose the weights over our basis functions such that the forcing function matches the desired trajectory $\textbf{f}_d$. In locally weighted regression sets up to minimize:

$\Sigma_t \psi_i(t)(f_d(t) - w_i (x(t) (g - y_0)))^2$

and the solution (which I won’t derive here, but is worked through in Schaal’s 1998 paper) is

$w_i = \frac{\textbf{s}^T \pmb{\psi}_i \textbf{f}_d}{\textbf{s}^T \pmb{\psi}_i \textbf{s}}$,

where

$\textbf{s} = \left( \begin{array}{c}x_{t_0}(g - y_0) \\ \vdots \\ x_{t_N}(g - y_0) \end{array} \right), \;\;\; \pmb{\psi}_i = \left( \begin{array}{ccc} \psi_i(t_0) & \dots & 0 \\ 0 & \ddots & 0 \\ 0 & \dots & \psi_i(t_n) \end{array} \right)$

Great! Now we have everything we need to start making some basic discrete DMPs!

Different numbers of basis functions

One of the things you’ll notice right off the bat when imitating paths, is that as the complexity of the trajectory increases, so does the required number of basis functions. For example, below, the system is trying to follow a sine wave and a highly nonlinear piecewise function:

We can see in the second case that although the DMP is never able to exactly reproduce the desired trajectory, the approximation continues to get better as the number of basis functions increases. This kind of slow improvement in certain nonlinear areas is to be expected from how the basis functions are being placed. An even spreading of the centers of the basis functions through time was used, but for imitation there is another method out of Dr. Schaal’s lab that places the basis functions more strategically. Need is determined by the function complexity is in that region, and basis function centers and widths are defined accordingly. In highly nonlinear areas we would expect there to be many narrow basis functions, and in linear areas we would expect fewer basis functions, but ones that are wider. The method is called locally weighted projection regression, which I plan on writing about and applying in a future post!

Conclusions \ thoughts

There’s really a lot of power in this framework, and there are a ton of expansions on this basic setup, including things like incorporating system feedback, spatio-temporal coupling of DMPs, using DMPs for gain control as well as trajectory control, incorporating a cost function and reinforcement learning, identifying action types, and other really exciting stuff.

I deviated from the terminology used in the papers here in a couple of places. First, I didn’t see a reason to reduce the second order systems to two first order systems. When working through it I found it more confusing than helpful, so I left the dynamics as a second order systems. Second, I also moved the $\tau$ term to the right hand side, and that’s just so that it matches the code, it doesn’t really matter. Neither of these were big changes, but in case you’re reading the papers and wondering.

Something that I kind of skirted above is planning along multiple dimensions. It’s actually very simple; the DMP framework simply assigns one DMP per degree of freedom being controlled. But, it’s definitely worth explicitly stating at some point.

I also mentioned this above, but this is a great trajectory control system to throw on top of the previously discussed operational space control framework. With the DMP framework on top to plan robust, generalizable movements, and the OSCs underneath to carry out those commands we can start to get some really neat applications. For use on real systems the incorporation of feedback and spatio-temporal coupling terms is going to be important, so the next post will likely be working through those and then we can start looking at some exciting implementations!

Speaking of implementations, there’s a DMP and canonical system code up on my github, please feel free to explore it, run it, send me questions about it. Whatever. I should also mention that there’s this and a lot more all up and implemented on Stefan Schaal’s lab website.

## Robot control part 7: OSC of a 3-link arm

So we’ve done control for the 2-link arm, and control of the one link arm is trivial (where we control joint angle, or x or y coordinate of the pendulum), so here I’ll just show an implementation of operation space control for a more interesting arm model, the 3-link arm model. The code can all be found up on my Github.

In theory there’s nothing different or more difficult about controlling a 3-link arm versus a 2-link arm. For the inertia matrix, what I ended up doing here is just jacking up all the values of the matrix to about 100, which causes the controller to way over control the arm, and you can see the torques are much larger than they would need to be if we had an accurate inertia matrix. But the result is the same super straight trajectories that we’ve come to expect from operational space control:

It’s a little choppy because I cut out a bunch of frames to keep the gif size down. But you get the point, it works. And quite well!

Because this is also a 3-link arm now and our task space force signal is 2D, we have redundant space of solutions, meaning that the task space control signal can be carried out in a number of ways. In other words, a null space exists for this controller. This means that we can throw another controller in our system to operate inside the null space of the first controller. We’ve already worked through all the math for this, so it’s straightforward to implement.

What kind of null space controller should we put in? Well, you may have noticed in the above animation the arm goes through itself, here’s another example:

Often it’s desirable to avoid this (because of physics or whatever), so what we can do is add a secondary controller that works to keep the arm’s elbow and wrist near some comfortable default angles. When we do this we get the following:

And there you have it! Operational space control of a three link arm with a secondary controller in the null space to try to keep the angles near their default / resting positions.

I also added mouse based control to the arm so it will try to follow your mouse when you move over the figure, which makes it pretty fun to explore the power of the controller. It’s interesting to see where the singularities become an issue, and how having a null space controller that’s operating in joint space can actually come to help the system move through those problem points more smoothly. Check it out! It’s all up on my Github.

## Robot control part 5: Controlling in the null space

In the last post, I went through how to build an operational space controller. It was surprisingly easy after we’ve worked through all the other posts. But maybe that was a little too easy for you. Maybe you want to do something more interesting like implement more than one controller at the same time. In this post we’ll go through how to work inside the null space of a controller to implement several seperate controllers simultaneously without interference.
Buckle up.

Null space forces

The last example comprises the basics of operational space control; describe the system, calculate the system dynamics, transform desired forces from an operational space to the generalized coordinates, and build the control signal to cancel out the undesired system dynamics. Basic operational space control works quite well, but it is not uncommon to have several control goals at once; such as move the end-effector to this position’ (primary goal), and keep the elbow raised high’ (secondary goal) in the control of a robot arm.

If the operational space can also serve as generalized coordinates, i.e. if the system state specified in operational space constrains all of the degrees of freedom of the robot, then there is nothing that can be done without affecting the performance of the primary controller. In the case of controlling a two-link robot arm this is the case. The end-effector Cartesian space (chosen as the operational space) could also be a generalized coordinates system, because a specific $(x,y)$ position fully constrains the position of the arm.

But often when using operational space control for more complex robots this is not the case. In these situations, the forces controlled in operational space have fewer dimensions than the robot has degrees of freedom, and so it is possible to accomplish the primary goal in a number of ways. The null space of this primary controller is the region of state space where there is a redundancy of solutions; the system can move in a number of ways and still not affect the completion of the goals of the primary controller. An example of this is all the different configurations the elbow can be in while a person moves their hand in a straight line. In these situations, a secondary controller can be created to operate in the null space of the primary controller, and the full control signal sent to the system is a sum of the primary control signal and a filtered version of the secondary control signal. In this section the derivation of the null-space filter will be worked through for a system with only a primary and secondary controller, but note that the process can be applied iteratively for systems with further controllers.

The filtering of the secondary control signal means that the secondary controller’s goals will only be accomplished if it is possible to do so without affecting the performance of the first controller. In other words, the secondary controller must operate in the null space of the first controller. Denote the primary operational space control signal, e.g. the control signal defined in the previous post, as $\textbf{u}_{\textbf{x}}$ and the control signal from the secondary controller $\textbf{u}_{\textrm{null}}$. Define the force to apply to the system

$\textbf{u} = \textbf{u}_{\textbf{x}} + (\textbf{I} - \textbf{J}_{ee}^T(\textbf{q}) \; \textbf{J}_{ee}^{T+}(\textbf{q})) \textbf{u}_{\textrm{null}},$

where $\textbf{J}_{ee}^{T+}(\textbf{q})$ is the pseudo-inverse of $\textbf{J}_{ee}^T(\textbf{q})$.

Examining the filtering term that was added,

$(\textbf{I} - \textbf{J}_{ee}^T(\textbf{q}) \textbf{J}_{ee}^{T+}(\textbf{q})) \textbf{u}_{\textrm{null}},$

it can be seen that the Jacobian transpose multiplied by its pseudo-inverse will be 1’s all along the diagonal, except in the null space. This means that $\textbf{u}_{\textrm{null}}$ is subtracted from itself everywhere that affects the operational space movement and is left to apply any arbitrary control signal in the null space of the primary controller.

Unfortunately, this initial set up does not adequately filter out the effects of forces that might be generated by the secondary controller. The Jacobian is defined as a relationship between the velocities of two spaces, and so operating in the null space defined by the Jacobian ensures that no velocities are applied in operational space, but the required filter must also prevent any accelerations from affecting movement in operational space. The standard Jacobian pseudo-inverse null space is a velocity null space, and so a filter built using it will allow forces affecting the system’s acceleration to still get through. What is required is a pseudo-inverse Jacobian defined to filter signals through an acceleration null space.

To acquire this acceleration filter, our control signal will be substituted into the equation for acceleration in the operational space, which, after cancelling out gravity effects with the control signal and removing the unmodeled dynamics, gives

$\ddot{\textbf{x}} = \textbf{J}_{ee}(\textbf{q}) \textbf{M}^{-1}(\textbf{q}) [\textbf{J}_{ee}^T(\textbf{q}) \; \textbf{M}_{\textbf{x}_{ee}}(\textbf{q}) \; \ddot{\textbf{x}}_\textrm{des} - (\textbf{I} - \textbf{J}_{ee}^T(\textbf{q})\;\textbf{J}_{ee}^{T+}(\textbf{q}))\;\textbf{u}_{\textrm{null}}].$

Rewriting this to separate the secondary controller into its own term

$\ddot{\textbf{x}} = \textbf{J}_{ee}(\textbf{q}) \textbf{M}^{-1}(\textbf{q}) \textbf{J}_{ee}^T(\textbf{q}) \; \textbf{M}_{\textbf{x}_{ee}}(\textbf{q}) \; \ddot{\textbf{x}}_\textrm{des} - \textbf{J}_{ee}(\textbf{q}) \textbf{M}^{-1}(\textbf{q})[\textbf{I} - \textbf{J}_{ee}^T(\textbf{q})\;\textbf{J}_{ee}^{T+}(\textbf{q})]\;\textbf{u}_{\textrm{null}},$

it becomes clear that to not cause any unwanted movement in operational space the second term must be zero.

There is only one free term left in the second term, and that is the pseudo-inverse. There are numerous different pseudo-inverses that can be chosen for a given situation, and here what is required is to engineer a pseudo-inverse such that the term multiplying $\textbf{u}_{\textrm{null}}$ in the above operational space acceleration equation is guaranteed to go to zero.

$\textbf{J}_{ee}(\textbf{q})\textbf{M}^{-1}(\textbf{q}) [\textbf{I} - \textbf{J}_{ee}^T (\textbf{q})\textbf{J}_{ee}^{T+}(\textbf{q})] \textbf{u}_{\textrm{null}} = \textbf{0},$

this needs to be true for all $\textbf{u}_{\textrm{null}}$, so it can be removed,

$\textbf{J}_{ee} (\textbf{q}) \; \textbf{M}^{-1}(\textbf{q}) [\textbf{1} - \textbf{J}_{ee}^T (\textbf{q}) \; \textbf{J}_{ee}^{T+} (\textbf{q})] = \textbf{0},$

$\textbf{J}_{ee}(\textbf{q}) \; \textbf{M}^{-1}(\textbf{q}) = \textbf{J}_{ee}(\textbf{q}) \; \textbf{M}^{-1}(\textbf{q}) \; \textbf{J}_{ee}^T(\textbf{q})\; \textbf{J}_{ee}^{T+}(\textbf{q}),$

substituting in our inertia matrix for operational space, which defines

$\textbf{J}_{ee} (\textbf{q}) \textbf{M}^{-1} (\textbf{q}) = \textbf{M}_{\textbf{x}_{ee}}^{-1} (\textbf{q}) \textbf{J}_{ee}^{T+} (\textbf{q}),$

$\textbf{J}_{ee}^{T+}(\textbf{q}) = \textbf{M}_{\textbf{x}_{ee}} (\textbf{q}) \; \textbf{J}_{ee}(\textbf{q}) \; \textbf{M}^{-1}(\textbf{q}).$

This specific Jacobian inverse was presented in this 1987 paper by Dr. Oussama Khatib and is called the `dynamically consistent generalized inverse’. Using this psuedo-inverse guarantees that any signal coming from the secondary controller will not affect movement in the primary controller’s operational space. Just as a side-note, the name ‘pseudo-inverse’ is a bit of misnomer here, since it doesn’t try to produce the identity when multiplied by the original Jacobian transpose, but hey. That’s what they’re calling it.

The null space filter cancels out the acceleration effects of forces in operational space from a signal that is being applied as part of the control system. But it can also be used to cancel out the effects of any unwanted signal that can be modeled. Given some undesirable force signal interfering with the system that can be effectively modeled, a null space filtering term can be implemented to cancel it out. The control signal in this case, with one primary operational space controller and a null space filter for the undesired force, looks like:

$\textbf{u} = \textbf{J}^T_{ee}(\textbf{q}) \; \textbf{M}_\textbf{x}(\textbf{q}) \; \ddot{\textbf{x}}_\textrm{des} - \textbf{g}(\textbf{q}) - \textbf{J}^T_{ee}(\textbf{q}) \;\textbf{J}^{T+}_{ee}(\textbf{q}) \; \textbf{u}_{\textrm{undesired force}}.$

We did it! This will now allow a high-priority operational space controller to execute without interference from a secondary controller operating in its null space to complete it’s own set of goals (when possible).

Example:

Given a three link arm (revolute-revolute-revolute) operating in the $(x,z)$ plane, shown below:

this example will construct the control system for a primary controller controlling the end-effector and a secondary controller working to keep the arm near its joint angles’ default resting positions.

Let the system state be $\textbf{q} = [q_0, q_1, q_2]^T$ with default positions $\textbf{q}^0 = \left[\frac{\pi}{3}, \frac{\pi}{4}, \frac{\pi}{4} \right]^T$. The control signal of the secondary controller is the difference between the target state and the current system state

$\textbf{u}_{\textrm{null}} = k_{p_{\textrm{null}}}(\textbf{q}^0 - \textbf{q}),$

where $k_{p_\textrm{null}}$ is a gain term.

Let the centres of mass be

$\textrm{com}_0 = \left[ \begin{array}{c} \frac{1}{2}cos(q_0) \\ 0 \\ \frac{1}{2}sin(q_0) \end{array} \right], \;\;\;\; \textrm{com}_1 = \left[ \begin{array}{c} \frac{1}{4}cos(q_1) \\ 0 \\ \frac{1}{4}sin(q_1) \end{array} \right] \;\;\;\; \textrm{com}_2 = \left[ \begin{array}{c} \frac{1}{2}cos(q_2) \\ 0 \\ \frac{1}{4} sin (q_2) \end{array} \right],$

the Jacobians for the COMs are

$\textbf{J}_0(\textbf{q}) = \left[ \begin{array}{ccc} -\frac{1}{2} sin(q_0) & 0 & 0 \\ 0 & 0 & 0 \\ \frac{1}{2} cos(q_0) & 0 & 0 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{array} \right],$

$\textbf{J}_1(\textbf{q}) = \left[ \begin{array}{ccc} -L_0sin(q_0) - \frac{1}{4}sin(q_{01}) & -\frac{1}{4}sin(q_{01}) & 0 \\ 0 & 0 & 0 \\ L_0 cos(q_0) + \frac{1}{4} cos(q_{01})& \frac{1}{4} cos(q_{01}) & 0 \\ 0 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{array} \right]$

$\textbf{J}_2(\textbf{q}) = \left[ \begin{array}{ccc} -L_0sin(q_0) - L_1sin(q_{01}) - \frac{1}{2}sin(q_{012}) & -L_1sin(q_{01}) - \frac{1}{2}sin(q_{012}) & - \frac{1}{2}sin(q_{012}) \\ 0 & 0 & 0 \\ L_0 cos(q_0) + L_1 cos(q_{01}) + \frac{1}{4}cos(q_{012}) & L_1 cos(q_{01}) + \frac{1}{4} cos(q_{012}) & \frac{1}{4}cos(q_{012}) \\ 0 & 0 & 0 \\ 1 & 1 & 1 \\ 0 & 0 & 0 \end{array} \right].$

The Jacobian for the end-effector of this three link arm is

$\textbf{J}_{ee} = \left[ \begin{array}{ccc} -L_0 sin(q_0) - L_1 sin(q_{01}) - L_2 sin(q_{012}) & - L_1 sin(q_{01}) - L_2 sin(q_{012}) & - L_2 sin(q_{012}) \\ L_0 cos(q_0) + L_1 cos(q_{01}) + L_2 cos(q_{012}) & L_1 cos(q_{01}) + L_2 cos(q_{012}) & L_2 cos(q_{012}), \end{array} \right]$

where $q_{01} = q_0 + q_1$ and $q_{012} = q_0 + q_1 + q_2$.

Taking the control signal developed in Section~\ref{sec:exampleOS}

$\textbf{u} = \textbf{J}^T_{ee}(\textbf{q}) \; \textbf{M}_{\textbf{x}_{ee}}(\textbf{q}) [k_p (\textbf{x}_{\textrm{des}} - \textbf{x}) + k_v (\dot{\textbf{x}}_{\textrm{des}} - \dot{\textbf{x}})] - \textbf{g}(\textbf{q}),$

where $\textbf{M}_{\textbf{x}_{ee}}(\textbf{q})$ was defined in the previous post, and $\textbf{g}(\textbf{q})$ is defined two posts ago, and $k_p$ and $k_v$ are gain terms, usually set such that $k_v = \sqrt{k_p}$, and adding in the null space control signal and filter gives

$\textbf{u} = \textbf{J}^T_{ee}(\textbf{q}) \; \textbf{M}_{\textbf{x}_{ee}}(\textbf{q}) [k_p (\textbf{x}_{\textrm{des}} - \textbf{x}) + k_v (\dot{\textbf{x}}_{\textrm{des}} - \dot{\textbf{x}})] - (\textbf{I} - \textbf{J}^T_{ee}(\textbf{q}) \textbf{J}^{T+}_{ee}(\textbf{q})) \textbf{u}_{\textrm{null}} - \textbf{g}(\textbf{q}),$

where $\textbf{J}^{T+}(\textbf{q})$ is the dynamically consistent generalized inverse defined above, and $\textbf{u}_{\textrm{null}}$ is our null space signal!

Conclusions

It’s a lot of math, but when you start to get a feel for it what’s really awesome is that this is it. We’re describing the whole system, and so by working with these equations we can get a super effective controller. Which is pretty cool. Especially in relation to other possible controllers.

Alright! We’ve now worked through all the basic theory for operational space control, it is time to get some implementations going.