Tag Archives: Python

Velocity limiting in operational space control

Recently, I was reading through an older paper on effective operational space control, talking specifically point to point control in operational space. The paper mentioned that even if you have a perfect model of the system, you’re going to run into trouble if you use just a basic PD formula to define your control signal in operational space:

u_x = k_p (\textbf{x}^* - \textbf{x}) - k_v \dot{\textbf{x}},

where \textbf{x} and \dot{\textbf{x}} are the system position and velocity in operational space, \textbf{x}^* is the target position, and k_p and k_v are gains.

If you define your operational space control signal like this, and then translate this signal into joint torques (using, for example, methods discussed in other posts), you’re going to see a very non-straight trajectory emerge in larger movements as a result of “actuator saturation, and bandwidth and velocity limitations”. In the example of a standard robot, you might run into issues with your motors not being able to actually generate the torques that have been specified, the frequency of control and feedback might not be sufficient, and you could hit hard constraints on system velocity. The solution to this problem was presented in this 1987 paper by Dr. Oussama Khatib, and is pretty slick and very useful, so I thought I’d write it up here for any other unfortunate souls wandering around in ignorance. First though, here’s what it looks like to move large point to point distances without velocity limiting:

As you can see, the system isn’t moving in a straight line, which can be very aggravating if you’ve worked and reworked out the equations and double checked all your parameters, etc etc. A few things, first, when working with simulations it’s easy to forget how ridiculously fast this actually would be in real-time. Even though it takes a minute to simulate the above movement, in real-time, is happening over the course of 200ms. Taking that into account, this is pretty good. Also, there’s an obvious solution here, slow down your movement. The source of this problem is largely that all of the motors are not able to apply the torques specified and move at the required speed. Some of the motors have a lot less mass to throw around and will be able to move at the specified torques, but not all. Hence the not straight trajectory.

You can of course drop the gains on your PD signal, but that’s not really a great methodical solution. So, what can we do?

Well, if we rearrange the PD control signal specified above into

u_x = k_v (\dot{\textbf{x}}^* - \dot{\textbf{x}}),

where \dot{\textbf{x}}^* is the desired velocity, we see that this signal can be interpreted as a pure velocity servo-control signal, with velocity gain k_v and a desired velocity

\dot{\textbf{x}}^* = \frac{k_p}{k_v}(\textbf{x}^* - \textbf{x}).

When things are in this form, it becomes a bit more clear what we have to do: limit the desired velocity to be at most some specified maximum velocity of the end-effector, V_\textrm{max}. This value should be low enough that the transformation into joint torques doesn’t result in anything larger than the actuators can generate.

Taking V_\textrm{max}, what we want is to clip the magnitude of the control signal to be V_\textrm{max} if it’s ever larger (in positive or negative directions), and to be equal to \frac{kp}{kv}(\textbf{x}^* - \textbf{x}) otherwise. The math for this works out such that we can accomplish this with a control signal of the form:

\textbf{u}_\textbf{x} = -k_v (\dot{\textbf{x}} + \textrm{sat}\left(\frac{V_\textrm{max}}{\lambda |\tilde{\textbf{x}}|} \right) \lambda \tilde{\textbf{x}}),

where \lambda = \frac{k_p}{k_v} , \tilde{\textbf{x}} = \textbf{x} - \textbf{x}^*, and \textrm{sat} is the saturation function, such that

\textrm{sat}(y) = \left\{ \begin{array}{cc} |y| \leq 1 & \Rightarrow y \\ |y| > 1 & \Rightarrow 1 \end{array} \right.

where |y| is the absolute value of y, and is applied element wise to the vector \tilde{\textbf{x}} in the control signal.

As a result of using this saturation function, the control signal behaves differently depending on whether or not \dot{\textbf{x}}^* > V_\textrm{max}:

\textbf{u}_\textbf{x} = \left\{ \begin{array}{cc} \dot{\textbf{x}}^* \geq V_\textrm{max} & \Rightarrow -k_v (\dot{\textbf{x}} + V_\textbf{max} \textrm{sgn}(\tilde{\textbf{x}})) \\ \dot{\textbf{x}}^* < V_\textrm{max} & \Rightarrow -k_v \dot{\textbf{x}} + k_p \tilde{\textbf{x}} \end{array} \right.

where \textrm{sgn}(y) is a function that returns -1 if y < 0 and 1 if y \geq 0, and is again applied element-wise to vectors. Note that the control signal in the second condition is equivalent to our original PD control signal k_p(\textbf{x}^* - \textbf{x}) - k_v \dot{\textbf{x}}. If you’re wondering about negative signs make sure you note that \tilde{\textbf{x}} = \textbf{x} - \textbf{x}^* and not \textbf{x}^* - \textbf{x}, as you might assume.

So now the control signal is behaving exactly as desired! Moves the system towards the target, but no faster than the specified maximum velocity. Now our trajectories look like this:

So those are starting to look a lot better! The first thing you’ll notice is that this is a fair bit slower of a movement. Well, actually, it’s waaaayyyy slower because the playback speed here is 4x faster than in that first animation, and this is a movement over 2s. Which has pros and cons, con: it’s slower, pro: it’s straighter, and you’re less likely to be murdered by it. When you move from simulations to real systems that latter point really moves way up the priority list.

Second thing to notice, the system seems to be minimising the error along one dimension, and then along the next, and then arrives at the target. What’s going on?  Because the error along each of the (x,y,z) dimensions isn’t the same, when speed gets clipped along one of the dimensions you’re no longer going to be moving in a straight line directly to the target. To address this, we’re going to add a scaling term whenever clipping happens, such that you reduce the speed you move along each dimension by the same ratio, so that you’re still moving in a straight line.

It’s a liiiiittle bit more complicated than that, but not much. First, we’ll calculate the values being passed in to the saturation function for each (x,y,z) dimension. We’ll then check to see if any of them are going to get clipped, and if there’s more than one that saturates we’ll find the one that is affected the most. After we’ve identified which dimension it is, we go through and calculate what the control signal would have been without velocity limiting, and what it will be now with velocity limiting. This scaling term tells us how much the control signal was reduced, and we can then use it to reduce the control signals of the other dimensions by the same amount. These other dimensions might still saturate, though, so we have to recalculate the saturation function for them once they’ve been scaled. Here’s what this all looks like in code:

# implement velocity limiting
lamb = kp / kv
x_tilde = xyz - target_xyz
sat = vmax / (lamb * np.abs(x_tilde))
scale = np.ones(3)
if np.any(sat < 1):
    index = np.argmin(sat)
    unclipped = kp * x_tilde[index]
    clipped = kv * vmax * np.sign(x_tilde[index])
    scale = np.ones(3) * clipped / unclipped
    scale[index] = 1
u_xyz = -kv * (dx + np.clip(sat / scale, 0, 1) *
               scale * lamb * x_tilde)

And now, finally, we start getting the trajectories that we’ve been wanting the whole time:

And finally we can rest easy, knowing that our robot is moving at a reasonable speed along a direct path to its goals. Wherever you’d like to use this neato ‘ish you should be able to just paste in the above code, define your vmax, kp, and kv values and be good to go!

Tagged , , , ,

Using SymPy’s lambdify for generating transform matrices and Jacobians

I’ve been working in VREP with some of their different robot models and testing out force control, and one of the things that becomes pretty important for efficient workflow is to have a streamlined method for setting up the transform matrices and calculating the Jacobians for different robots. You do not want to be working these all out by hand and then writing them in yourself.

A solution that’s been working well for me (and is fully implemented in Python) is to use SymPy to set up the basic transform matrices, from each joint to the next, and then use it’s derivative function to calculate the Jacobian. Once this is calculated you can then use SymPy’s lambdify function to parameterize this, and off you go! In this post I’m going to work through an example for controlling VREP’s UR5 arm using force control. And if you’re just looking for code examples, you can find it all up on my GitHub.

Edit: Updated the code to be much nicer, added saving of calculated functions, and a null space controller.

Setting up the transform matrices

This is the part of this process that is unique to each arm. The goal is to set up a system so that you can specify your transforms for each joint (and to each centre-of-mass (COM) too, of course) and then be on your way. So here’s the UR5, cute thing that it is:


For the UR5, there are 6 joints, and we’re going to be specifying 9 transform matrices: 6 joints and the COM for three arm segments (the remaining arm segment COMs are centered at their respective joint’s reference frame). The joints are all rotary, with 0 and 4 rotating around the z-axis, and the rest all rotating around x.

So first, we’re going to create a class called robot_config. Then, to create our transform matrices in SymPy first we need to set up the variables we’re going to be using:

# set up our joint angle symbols (6th angle doesn't affect any kinematics)
self.q = [sp.Symbol('q%i'%ii) for ii in range(6)]
# segment lengths associated with each joint
self.L = np.array([0.0935, .13453, .4251, .12, .3921, .0935, .0935, .0935])

where self.q is an array storing all our joint angle symbols, and self.L is an array of all of the offsets for each joint and arm segment lengths.

Using these to create the transform matrices for the joints, we get a set up that looks like this:

# transform matrix from origin to joint 0 reference frame
self.T0org = sp.Matrix([[sp.cos(self.q[0]), -sp.sin(self.q[0]), 0, 0],
                        [sp.sin(self.q[0]), sp.cos(self.q[0]), 0, 0],
                        [0, 0, 1, self.L[0]],
                        [0, 0, 0, 1]])

# transform matrix from joint 0 to joint 1 reference frame
self.T10 = sp.Matrix([[1, 0, 0, -L[1]],
                      [0, sp.cos(-self.q[1] + sp.pi/2),
                       -sp.sin(-self.q[1] + sp.pi/2), 0],
                      [0, sp.sin(-self.q[1] + sp.pi/2),
                       sp.cos(-self.q[1] + sp.pi/2), 0],
                      [0, 0, 0, 1]])

# transform matrix from joint 1 to joint 2 reference frame
self.T21 = sp.Matrix([[1, 0, 0, 0],
                      [0, sp.cos(-self.q[2]),
                       -sp.sin(-self.q[2]), self.L[2]],
                      [0, sp.sin(-self.q[2]),
                       sp.cos(-self.q[2]), 0],
                      [0, 0, 0, 1]])

# transform matrix from joint 2 to joint 3
self.T32 = sp.Matrix([[1, 0, 0, L[3]],
                      [0, sp.cos(-self.q[3] - sp.pi/2),
                       -sp.sin(-self.q[3] - sp.pi/2), self.L[4]],
                      [0, sp.sin(-self.q[3] - sp.pi/2),
                       sp.cos(-self.q[3] - sp.pi/2), 0],
                      [0, 0, 0, 1]])

# transform matrix from joint 3 to joint 4
self.T43 = sp.Matrix([[sp.sin(-self.q[4] - sp.pi/2),
                       sp.cos(-self.q[4] - sp.pi/2), 0, -self.L[5]],
                      [sp.cos(-self.q[4] - sp.pi/2),
                       -sp.sin(-self.q[4] - sp.pi/2), 0, 0],
                      [0, 0, 1, 0],
                      [0, 0, 0, 1]])

# transform matrix from joint 4 to joint 5
self.T54 = sp.Matrix([[1, 0, 0, 0],
                      [0, sp.cos(self.q[5]), -sp.sin(self.q[5]), 0],
                      [0, sp.sin(self.q[5]), sp.cos(self.q[5]), self.L[6]],
                      [0, 0, 0, 1]])

# transform matrix from joint 5 to end-effector
self.TEE5 = sp.Matrix([[1, 0, 0, self.L[7]],
                       [0, 1, 0, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 1]])

You can see a bunch of offsetting of the joint angles by -sp.pi/2 and this is to account for the expected 0 angle (straight along the reference frame’s x-axis) at those joints being different than the 0 angle defined in the VREP simulation (at a 90 degrees from the x-axis). You can find these by either looking at and finding the joints’ 0 position in VREP or by trial-and-error empirical analysis.

Once you have your transforms, then you have to specify how to move from the origin to each reference frame of interest (the joints and COMs). For that, I’ve just set up a simple function with a switch statement:

# point of interest in the reference frame (right at the origin)
self.x = sp.Matrix([0,0,0,1])

def _calc_T(self, name, lambdify=True):
    """ Uses Sympy to generate the transform for a joint or link

    name string: name of the joint or link, or end-effector
    lambdify boolean: if True returns a function to calculate
                      the transform. If False returns the Sympy

    # check to see if we have our transformation saved in file
    if os.path.isfile('%s/%s.T' % (self.config_folder, name)):
        Tx = cloudpickle.load(open('%s/%s.T' % (self.config_folder, name),
        if name == 'joint0' or name == 'link0':
            T = self.T0org
        elif name == 'joint1' or name == 'link1':
            T = self.T0org * self.T10
        elif name == 'joint2':
            T = self.T0org * self.T10 * self.T21
        elif name == 'link2':
            T = self.T0org * self.T10 * self.Tl21
        elif name == 'joint3':
            T = self.T0org * self.T10 * self.T21 * self.T32
        elif name == 'link3':
            T = self.T0org * self.T10 * self.T21 * self.Tl32
        elif name == 'joint4' or name == 'link4':
            T = self.T0org * self.T10 * self.T21 * self.T32 * self.T43
        elif name == 'joint5' or name == 'link5':
            T = self.T0org * self.T10 * self.T21 * self.T32 * self.T43 * \
        elif name == 'link6' or name == 'EE':
            T = self.T0org * self.T10 * self.T21 * self.T32 * self.T43 * \
                self.T54 * self.TEE5
        Tx = T * self.x  # to convert from transform matrix to (x,y,z)

        # save to file
        cloudpickle.dump(Tx, open('%s/%s.T' % (self.config_folder, name),

    if lambdify is False:
        return Tx
    return sp.lambdify(self.q, Tx)

So the first part is pretty straight forward, create the transform matrix, and then at the end to get the (x,y,z) position we just multiply by a vector we created that represents a point at the origin of the last reference frame. Some of the transform matrices (the ones to the arm segments) I didn’t specify above just to cut down on space.

The second part is where we use this awesome lambify function, which lets us turn the matrix we’ve defined into a function, so that we can pass in joint angles and get back out the resulting (x,y,z) position. There’s also the option to get the straight up SymPy matrix return, in case you need the symbolic form (which we will!).

NOTE: You can also see that there’s a check built in to look for saved files, and to just load those saved files instead of recalculating things if they’re available. This is because calculating some of these matrices and their derivatives takes a long, long time. I used the cloudpickle module to do this because it’s able to easily handle saving a whole bunch of weird things that makes normal pickle sour.

Calculating the Jacobian

So now that we’re able to quickly generate the transform matrix for each point of interest on the UR5, we simply take the derivative of the equation for each (x,y,z) coordinate with respect to each joint angle to generate our Jacobian.

def _calc_J(self, name, lambdify=True):
    """ Uses Sympy to generate the Jacobian for a joint or link

    name string: name of the joint or link, or end-effector
    lambdify boolean: if True returns a function to calculate
                      the Jacobian. If False returns the Sympy

    # check to see if we have our Jacobian saved in file
    if os.path.isfile('%s/%s.J' % (self.config_folder, name)):
        J = cloudpickle.load(open('%s/%s.J' %
                             (self.config_folder, name), 'rb'))
        Tx = self._calc_T(name, lambdify=False)
        J = []
        # calculate derivative of (x,y,z) wrt to each joint
        for ii in range(self.num_joints):
            J[ii].append(sp.simplify(Tx[0].diff(self.q[ii])))  # dx/dq[ii]
            J[ii].append(sp.simplify(Tx[1].diff(self.q[ii])))  # dy/dq[ii]
            J[ii].append(sp.simplify(Tx[2].diff(self.q[ii])))  # dz/dq[ii]

Here we retrieve the Tx vector from our _calc_T function, and then calculate the derivatives. When calculating the Jacobian for the end-effector, this is all we need! Huzzah!

But to calculate the Jacobian for transforming the inertia matrices of each arm segment into joint space we’re going to need the orientation information added to our Jacobian as well. This we know ahead of time, for each joint it’s a 3D vector with a 1 on the axis being rotated around. So we can predefine this:

# orientation part of the Jacobian (compensating for orientations)
self.J_orientation = [
    [0, 0, 1], # joint 0 rotates around z axis
    [1, 0, 0], # joint 1 rotates around x axis
    [1, 0, 0], # joint 2 rotates around x axis
    [1, 0, 0], # joint 3 rotates around x axis
    [0, 0, 1], # joint 4 rotates around z axis
    [1, 0, 0]] # joint 5 rotates around x axis

And then we just fill in the Jacobians for each reference frame with self.J_orientation up to the last joint, and fill in the rest of the Jacobian with zeros. So e.g. when we’re calculating the Jacobian for the arm segment just past the second joint we’ll use the first two rows of self.J_orientation and the rest of the rows will be 0.

So this leads us to the second half of the _calc_J function:

        end_point = name.strip('link').strip('joint')
        if end_point != 'EE':
            end_point = min(int(end_point) + 1, self.num_joints)
            # add on the orientation information up to the last joint
            for ii in range(end_point):
                J[ii] = J[ii] + self.J_orientation[ii]
            # fill in the rest of the joints orientation info with 0
            for ii in range(end_point, self.num_joints):
                J[ii] = J[ii] + [0, 0, 0]

        # save to file
        cloudpickle.dump(J, open('%s/%s.J' %
                                 (self.config_folder, name), 'wb'))

    J = sp.Matrix(J).T  # correct the orientation of J
    if lambdify is False:
        return J
    return sp.lambdify(self.q, J)

The orientation information is added in, we save the result to file, and a function that takes in the joint angles and outputs the Jacobian is created (unless lambdify == False in which case the SymPy symbolic form is returned.)

Then finally, two wrapper functions are added in to make creating / accessing these functions easier. First, define a couple of dictionaries

# create function dictionaries
self._T = {}  # for transform calculations
self._J = {}  # for Jacobian calculations

and then our wrapper functions look like this

def T(self, name, q):
    """ Calculates the transform for a joint or link
    name string: name of the joint or link, or end-effector
    q np.array: joint angles
    # check for function in dictionary
    if self._T.get(name, None) is None:
        print('Generating transform function for %s'%name)
        self._T[name] = self.calc_T(name)
    return self._T[name](*q)[:-1].flatten()

def J(self, name, q):
   """ Calculates the transform for a joint or link
   name string: name of the joint or link, or end-effector
   q np.array: joint angles
   # check for function in dictionary
   if self._J.get(name, None) is None:
        print('Generating Jacobian function for %s'%name)
        self._J[name] = self.calc_J(name)
   return np.array(self._J[name](*q)).T

So how you use this class (all of this is in a class) is to call these T and J functions with the current joint angles. They’ll check to see if the functions have already be created or stored in file, if they haven’t then the T and / or J functions will be created, then our wrappers do a bit of formatting to get them into the proper form (i.e. transposing or cropping), and return you your (x,y,z) or Jacobian!

NOTE: It’s a bit of a misnomer to have the function be called T and actually return to you Tx, but hey this is a blog. Lay off.

Calculating the inertia matrix in joint-space and force of gravity
Now, since we’re here we might as well also calculate the functions for our inertia matrix in joint space and the effect of gravity. So, define a couple more placeholders in our robot_config class to help us:

self._M = []  # placeholder for (x,y,z) inertia matrices
self._Mq = None  # placeholder for joint space inertia matrix function
self._Mq_g = None  # placeholder for joint space gravity term function

and then add in our inertia matrix information (defined in each link’s centre-of-mass (COM) reference frame)

# create the inertia matrices for each link of the ur5
self._M.append(np.diag([1.0, 1.0, 1.0,
                        0.02, 0.02, 0.02]))  # link0
self._M.append(np.diag([2.5, 2.5, 2.5,
                        0.04, 0.04, 0.04]))  # link1
self._M.append(np.diag([5.7, 5.7, 5.7,
                        0.06, 0.06, 0.04]))  # link2
self._M.append(np.diag([3.9, 3.9, 3.9,
                        0.055, 0.055, 0.04]))  # link3
self._M.append(np.copy(self._M[1]))  # link4
self._M.append(np.copy(self._M[1]))  # link5
self._M.append(np.diag([0.7, 0.7, 0.7,
                        0.01, 0.01, 0.01]))  # link6

and then using our equations for calculating the system’s inertia and gravity we create our _calc_Mq and _calc_Mq_g functions

def _calc_Mq(self, lambdify=True):
    """ Uses Sympy to generate the inertia matrix in
    joint space for the ur5

    lambdify boolean: if True returns a function to calculate
                      the Jacobian. If False returns the Sympy

    # check to see if we have our inertia matrix saved in file
    if os.path.isfile('%s/Mq' % self.config_folder):
        Mq = cloudpickle.load(open('%s/Mq' % self.config_folder, 'rb'))
        # get the Jacobians for each link's COM
        J = [self._calc_J('link%s' % ii, lambdify=False)
             for ii in range(self.num_links)]

        # transform each inertia matrix into joint space
        # sum together the effects of arm segments' inertia on each motor
        Mq = sp.zeros(self.num_joints)
        for ii in range(self.num_links):
            Mq += J[ii].T * self._M[ii] * J[ii]

        # save to file
        cloudpickle.dump(Mq, open('%s/Mq' % self.config_folder, 'wb'))

    if lambdify is False:
        return Mq
    return sp.lambdify(self.q, Mq)

def _calc_Mq_g(self, lambdify=True):
    """ Uses Sympy to generate the force of gravity in
    joint space for the ur5

    lambdify boolean: if True returns a function to calculate
                      the Jacobian. If False returns the Sympy

    # check to see if we have our gravity term saved in file
    if os.path.isfile('%s/Mq_g' % self.config_folder):
        Mq_g = cloudpickle.load(open('%s/Mq_g' % self.config_folder,
        # get the Jacobians for each link's COM
        J = [self._calc_J('link%s' % ii, lambdify=False)
             for ii in range(self.num_links)]

        # transform each inertia matrix into joint space and
        # sum together the effects of arm segments' inertia on each motor
        Mq_g = sp.zeros(self.num_joints, 1)
        for ii in range(self.num_joints):
            Mq_g += J[ii].T * self._M[ii] * self.gravity

        # save to file
        cloudpickle.dump(Mq_g, open('%s/Mq_g' % self.config_folder,

    if lambdify is False:
        return Mq_g
    return sp.lambdify(self.q, Mq_g)

and wrapper functions

def Mq(self, q):
    """ Calculates the joint space inertia matrix for the ur5

    q np.array: joint angles
    # check for function in dictionary
    if self._Mq is None:
        print('Generating inertia matrix function')
        self._Mq = self._calc_Mq()
    return np.array(self._Mq(*q))

def Mq_g(self, q):
    """ Calculates the force of gravity in joint space for the ur5

    q np.array: joint angles
    # check for function in dictionary
    if self._Mq_g is None:
        print('Generating gravity effects function')
        self._Mq_g = self._calc_Mq_g()
    return np.array(self._Mq_g(*q)).flatten()

and we’re all set!

Putting it all together

Now we have nice clean code to generate everything we need for our controller. Using the controller developed in this post as a base, we can replace those calculations with the following nice compact code (which also includes a secondary null-space controller to keep the arm near resting joint angles):

# calculate position of the end-effector
# derived in the ur5 calc_TnJ class
xyz = robot_config.T('EE', q)
# calculate the Jacobian for the end effector
JEE = robot_config.J('EE', q)
# calculate the inertia matrix in joint space
Mq = robot_config.Mq(q)
# calculate the effect of gravity in joint space
Mq_g = robot_config.Mq_g(q)

# convert the mass compensation into end effector space
Mx_inv = np.dot(JEE, np.dot(np.linalg.inv(Mq), JEE.T))
svd_u, svd_s, svd_v = np.linalg.svd(Mx_inv)
# cut off any singular values that could cause control problems
singularity_thresh = .00025
for i in range(len(svd_s)):
    svd_s[i] = 0 if svd_s[i] < singularity_thresh else \
# numpy returns U,S,V.T, so have to transpose both here
Mx = np.dot(svd_v.T, np.dot(np.diag(svd_s), svd_u.T))

kp = 100
kv = np.sqrt(kp)
# calculate desired force in (x,y,z) space
u_xyz = np.dot(Mx, target_xyz - xyz)
# transform into joint space, add vel and gravity compensation
u = (kp * np.dot(JEE.T, u_xyz) - np.dot(Mq, kv * dq) - Mq_g)

# calculate our secondary control signal
# calculated desired joint angle acceleration
q_des = (((robot_config.rest_angles - q) + np.pi) %
         (np.pi*2) - np.pi)
u_null = np.dot(Mq, (kp * q_des - kv * dq))

# calculate the null space filter
Jdyn_inv = np.dot(Mx, np.dot(JEE, np.linalg.inv(Mq)))
null_filter = (np.eye(robot_config.num_joints) -
               np.dot(JEE.T, Jdyn_inv))
u_null_filtered = np.dot(null_filter, u_null)

u += u_null_filtered

And there you go!

You can see all of this code up on my GitHub, along a full example controlling a UR5 VREP model though reaching to a series of targets. It looks something pretty much like this (slightly better because this gif was made before adding in the null space controller):


Overhead of using lambdify instead of hard-coding

This was a big question that I had, because when I’m running simulations the time step is on the order of a few milliseconds, with the controller code called at every time step. So I reaaaally can’t afford a crazy overhead for using lambdify.

To test this, I used the handy Python timeit, which requires a bit awkward setup, but quite nicely calls the function a whole bunch of times (1,000,000 by default) and accounts for various other things going on that could affect the execution time.

I tested two sample functions, one simpler than the other. Here’s the code for setting up and testing the first function:

import timeit
import seaborn

# Test 1 ----------------------------------------------------------------------
print('\nTest function 1: ')
time_sympy1 = timeit.timeit(
        stmt = 'f(np.random.random(), np.random.random())',
        setup = 'import numpy as np;\
                import sympy as sp;\
                q0 = sp.Symbol("q0");\
                l0 = sp.Symbol("l0");\
                a = sp.cos(q0) * l0;\
                f = sp.lambdify((q0, l0), a, "numpy")')
print('Sympy lambdify function 1 time: ', time_sympy1)

time_hardcoded1 = timeit.timeit(
        stmt = 'np.cos(np.random.random())*np.random.random()',
        setup = 'import numpy as np')
print('Hard coded function 1 time: ', time_hardcoded1)

Pretty simple, a bit of a pain in the sympy setup, but other than that not bad at all. The second function I tested was just a random collection of cos and sin calls that resemble what gets computed in a Jacobian:

l1*np.sin(q0 - l0*np.sin(q1)*np.cos(q2) - l2*np.sin(q2) - l0*np.sin(q1) + q0*l0)*np.cos(q0) + l2*np.sin(q0)

And here’s the results:


So it’s slower for sure, but again this is the difference in time after 1,000,000 function calls, so until some big optimization needs to be done using the SymPy lambdify function is definitely worth the slight gain in execution time for the insane convenience.

The full code for the timing tests here are also up on my GitHub.

Tagged , , , ,

Using VREP for simulation of force-controlled models

I’ve been playing around a bit with different simulators, and one that we’re a big fan of in the lab is VREP. It’s free for academics and you can talk to them about licences if you’re looking for commercial use. I haven’t actually had much experience with it before myself, so I decided to make a simple force controlled arm model to get experience using it. All in all, there were only a few weird API things that I had to get through, and once you have them down it’s pretty straight forward. This post is going to be about the steps that I needed to take to get things all set up. For a more general start-up on VREP check out All the code in this post and the model I use can be found up on my GitHub.

Getting the right files where you need them

As discussed in the remote API overview, you’ll need three files in whatever folder you want to run your Python script from to be able to hook into VREP remotely:

  • remoteApi.dll, remoteApi.dylib or remoteApi.so (depending on what OS you’re using)
  • vrep.py
  • vrepConstants.py

You can find these files inside your VREP_HOME/programming/remoteApiBindings/python/python and VREP_HOME/programming/remoteApiBindings/lib/lib folders. Make sure that these files are in whatever folder you’re running your Python scripts from.

The model

It’s easy to create a new model to mess around with in VREP, so that’s the route I went, rather than importing one of their pre-made models and having some sneaky parameter setting cause me a bunch of grief. You can just right click->add then go at it. There are a bunch of tutorials so I’m not going to go into detail here. The main things are:

  • Make sure joints are in ‘Torque/force’ mode.
  • Make sure that joint ‘Motor enabled’ property is checked. The motor enabled property is in the ‘Show dynamic properties dialogue’ menu, which you find when you double click on the joint in the Scene Hierarchy.
  • Know the names of the joints as shown in the Scene Hierarchy.

So here’s a picture:

where you can see the names of the objects in the model highlighted in red, the Torque/force selection highlighted in blue, and the Motor enabled option highlighted in green. And of course my beautiful arm model in the background.

Setting up the continuous server

The goal is to connect VREP to Python so that we can send torques to the arm from our Python script and get the feedback necessary to calculate those torques. There are a few ways to set up a remote connection in VREP.

The basic one is they have you add a child script in VREP and attach it to an object in the world that sets up a port and then you hit go on your simulation and can then run your Python script to connect in. This gets old real fast. Fortunately, it’s easy to automate everything from Python so that you can connect in, start the simulation, run it for however long, and then stop the simulation without having to click back and forth.

The first step is to make sure that your remoteApiConnections.txt file in you VREP home directory is set up properly. A continuous server should be set up by default, but doesn’t hurt to double check. And you can take the chance to turn on debugging, which can be pretty helpful. So open up that file and make sure it looks like this:

portIndex1_port                 = 19997
portIndex1_debug                = true
portIndex1_syncSimTrigger       = true

Once that’s set up, when VREP starts we can connect in from Python. In our Python script, first we’ll close any open connections that might be around, and then we’ll set up a new connection in:

import vrep 

# close any open connections
# Connect to the V-REP continuous server
clientID = vrep.simxStart('', 19997, True, True, 500, 5) 

if clientID != -1: # if we connected successfully
    print ('Connected to remote API server')

So once the connection is made, we check the clientID value to make sure that it didn’t fail, and then we carry on with our script.


By default, VREP will run its simulation in its own thread, and both the simulation and the controller using the remote API will be running simultaneously. This can lead to some weird behaviour as things fall out of synch etc etc, so what we want instead is for the VREP sim to only run one time step for each time step the controller runs. To do that, we need to set VREP to synchronous mode. So the next few lines of our Python script look like:

    # --------------------- Setup the simulation 


and then later, once we’ve calculated our control signal, sent it over, and want the simulation to run one time step forward, we do that by calling

    # move simulation ahead one time step

Get the handles to objects of interest

OK the next chunk of code in our script uses the names of our objects (as specified in the Scene Hierarchy in VREP) to get an ID for each which to identify which object we want to send a command to or receive feedback from:

    joint_names = ['shoulder', 'elbow']
    # joint target velocities discussed below
    joint_target_velocities = np.ones(len(joint_names)) * 10000.0

    # get the handles for each joint and set up streaming
    joint_handles = [vrep.simxGetObjectHandle(clientID,
        name, vrep.simx_opmode_blocking)[1] for name in joint_names]

    # get handle for target and set up streaming
    _, target_handle = vrep.simxGetObjectHandle(clientID,
                    'target', vrep.simx_opmode_blocking)

Set dt and start the simulation

And the final thing that we’re going to do in our simulation set up is specify the timestep that we want to use, and then start the simulation. I found this in a forum post, and I must say whatever VREP lacks in intuitive API their forum moderators are on the ball. NOTE: To use a custom time step you have to also set the dt option in the VREP simulator to ‘custom’. The drop down is to the left of the ‘play’ arrow, and if you don’t have it set to ‘custom’ you won’t be able to set the time step through Python.

    dt = .01
            dt, # specify a simulation time step

    # --------------------- Start the simulation

    # start our simulation in lockstep with our code

Main loop

For this next chunk I’m going to cut out everything that’s not VREP, since I have a bunch of posts explaining the control signal derivation and forward transformation matrices.

So, once we’ve started the simulation, I’ve set things up for the arm to be controlled for 1 second and then for the simulation to stop and everything shut down and disconnect.

    while count < 1: # run for 1 simulated second

        # get the (x,y,z) position of the target
        _, target_xyz = vrep.simxGetObjectPosition(clientID,
                -1, # retrieve absolute, not relative, position
        if _ !=0 : raise Exception()
        track_target.append(np.copy(target_xyz)) # store for plotting
        target_xyz = np.asarray(target_xyz)

        q = np.zeros(len(joint_handles))
        dq = np.zeros(len(joint_handles))
        for ii,joint_handle in enumerate(joint_handles):
            # get the joint angles
            _, q[ii] = vrep.simxGetJointPosition(clientID,
            if _ !=0 : raise Exception()
            # get the joint velocity
            _, dq[ii] = vrep.simxGetObjectFloatParameter(clientID,
                    2012, # ID for angular velocity of the joint
            if _ !=0 : raise Exception()

In the above chunk of code, I think the big thing to point out is that I’m using vrep.simx_opmode_blocking in each call, instead of vrep.simx_opmode_buffer. The difference is that you for sure get the current values of the simulation when you use blocking, and you can be behind a time step using buffer.

Aside from that the other notable things are I raise an exception if the first parameter (which is the return code) is ever not 0, and that I use simxGetObjectFloatParameter to get the joint velocity instead of simxGetObjectVelocity, which has a rotational velocity component. Zero is the return code for ‘everything worked’, and if you don’t check it and have some silly things going on you can be veeerrrrryyy mystified when things don’t work. And what simxGetObjectVelocity returns is the rotational velocity of the joint relative to the world frame, and not the angular velocity of the joint in its own coordinates. That was also a briefly confusing.

So the next thing I do is calculate u, which we’ll skip, and then we need to set the forces for the joint. This part of the API is real screwy. You can’t set the force applied to the joint directly. Instead, you have to set the target velocity of the joint to some really high value (hence that array we made before), and then modulate the maximum force that can be applied to that joint. Also important: When you want to apply a force in the other direction, you change the sign on the target velocity, but keep the force sign positive.

        # ... calculate u ... 

        for ii,joint_handle in enumerate(joint_handles):
            # get the current joint torque
            _, torque = \
            if _ !=0 : raise Exception()

            # if force has changed signs,
            # we need to change the target velocity sign
            if np.sign(torque) * np.sign(u[ii]) < 0:
                joint_target_velocities[ii] = \
                        joint_target_velocities[ii] * -1
                        joint_target_velocities[ii], # target velocity
            if _ !=0 : raise Exception()

            # and now modulate the force
                    abs(u[ii]), # force to apply
            if _ !=0 : raise Exception()

        # move simulation ahead one time step
        count += dt

So as you can see we check the current torque, see if we need to change the sign on the target velocity, modulate the maximum allowed force, and then finally step the VREP simulation forward.


And there you go! Here’s an animation of it in action (note this is a super low quality gif and it looks way better / smoother when actually running it yourself):


All in all, VREP has been enjoyable to work with so far. It didn’t take long to get things moving and off the ground, the visualization is great, and I haven’t even scratched the surface of what you can do with it. Best of all (so far) you can fully automate everything from Python. Hopefully this is enough to help some people get their own models going and save a few hours and headaches! Again, the full code and the model are up on my GitHub.


  • When you’re applying your control signal, make sure you test each joint in isolation, to make sure your torques push things in the direction you think they do. I had checked the rotation direction in VREP, but the control signal for both joints ended up needing to be multiplied by -1.
  • Another nit when you’re building your model, if you use the rotate button from the VREP toolbar on your model, wherever that joint rotates to is now 0 degrees. If you want to set the joint to start at 45 degrees, instead double click and change Pos[deg] option inside ‘Joint’ in Scene Object Properties.
Tagged , , , ,

Deep learning for control using augmented Hessian-free optimization

Traditionally, deep learning is applied to feed-forward tasks, like classification, where the output of the network doesn’t affect the input to the network. It is a decidedly harder problem when the output is recurrently connected such that network output affects the input. Because of this application of deep learning methods to control was largely unexplored until a few years ago. Recently, however, there’s been a lot of progress and research in this area. In this post I’m going to talk about an implementation of deep learning for control presented by Dr. Ilya Sutskever in his thesis Training Recurrent Neural Networks.

In his thesis, Dr. Sutskever uses augmented Hessian-free (AHF) optimization for learning. There are a bunch of papers and posts that go into details about AHF, here’s a good one by Andrew Gibiansky up on his blog, that I recommend you check out. I’m not going to really talk much here about what AHF is specifically, or how it differs from other methods, if you’re unfamiliar there are lots of places you can read up on it. Quickly, though, AHF is kind of a bag of tricks you can use with a fast method for estimating the curvature of the loss function with respect to the weights of a neural network, as well as the gradient, which allows it to make larger updates in directions where the loss function doesn’t change quickly. So rather than estimating the gradient and then doing a small update along each dimension, you can make the size of your update large in directions that change slowly and small along dimensions where things change quickly. And now that’s enough about that.

In this post I’m going to walk through using a Hessian-free optimization library (version 0.3.8) written by my compadre Dr. Daniel Rasmussen to train up a neural network to train up a 2-link arm, and talk about the various hellish gauntlets you need run to get something that works. Whooo! The first thing to do is install this Hessian-free library, linked above.

I’ll be working through code edited a bit for readability, to find the code in full you can check out the files up on my GitHub.

Build the network

Dr. Sutskever specified the structure of the network in his thesis to be 4 layers: 1) a linear input layer, 2) 100 Tanh nodes, 3) 100 Tanh nodes, 4) linear output layer. The network is connected up with the standard feedforward connections from 1 to 2 to 3 to 4, plus recurrent connections on 2 and 3 to themselves, plus a ‘skip’ connection from layer 1 to layer 3. Finally, the input to the network is the target state for the plant and the current state of the plant. So, lots of recursion! Here’s a picture:

network structure

The output layer connects in to the plant, and, for those unfamiliar with control theory terminology, ‘plant’ just means the system that you’re controlling. In this case an arm simulation.

Before we can go ahead and set up the network that we want to train, we also need to specify the loss function that we’re going to be using during training. The loss function in Ilya’s thesis is a standard one:

L(\theta) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{u}_t) + \ell_f(\textbf{x}_N),
\ell(\textbf{u}_t) = \alpha \frac{||\textbf{u}_t||^2}{2},
\ell_f(\textbf{x}_N) = \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2}

where L(\theta) is the total cost of the trajectory generated with \theta, the set of network parameters, \ell(\textbf{u}) is the immediate state cost, \ell_f(\textbf{x}) is the final state cost, \textbf{x} is the state of the arm, \textbf{x}^* is the target state of the arm, \textbf{u} is the control signal (torques) that drives the arm, and \alpha is a gain value.

To code this up using the hessianfree library we do:

from hessianfree import RNNet
from hessianfree.nonlinearities import (Tanh, Linear, Plant)
from hessianfree.loss_funcs import SquaredError, SparseL2

l2gain = 10e-3 * dt # gain on control signal loss
rnn = RNNet(
    # specify the number of nodes in each layer
    shape=[num_states * 2, 96, 96, num_states, num_states],
    # specify the function of the nodes in each layer
    layers=[Linear(), Tanh(), Tanh(), Linear(), plant],
    # specify the layers that have recurrent connections
    # specify the connections between layers
    conns={0:[1, 2], 1:[2], 2:[3], 3:[4]},
    # specify the loss function
        # squared error between plant output and targets
        # penalize magnitude of control signal (output of layer 3)
        SparseL2(l2gain, layers=[3])],

Note that if you want to run it on your GPU you’ll need PyCuda and sklearn installed. And a GPU.

An important thing to note as well is that in Dr. Sustkever’s thesis when we’re calculating the squared error of the distance from the arm state to the target, this is measured in joint angles. So it’s kind of a weird set up to be looking at the movement of the hand but have your cost function in joint-space instead of end-effector space, but it definitely simplifies training by making the cost more directly relatable to the control signal. So we need to calculate the joint angles of the arm that will have the hand at different targets around a circle. To do this we’ll take advantage of our inverse kinematics solver from way back when, and use the following code:

def gen_targets(arm, n_targets=8, sig_len=100):
    #Generate target angles corresponding to target
    #(x,y) coordinates around a circle
    import scipy.optimize

    x_bias = 0
    if arm.DOF == 2:
        y_bias = .35
        dist = .075
    elif arm.DOF == 3:
        y_bias = .5
        dist = .2

    # set up the reaching trajectories around circle
    targets_x = [dist * np.cos(theta) + x_bias \
                    for theta in np.linspace(0, np.pi*2, 65)][:-1]
    targets_y = [dist * np.sin(theta) + y_bias \
                    for theta in np.linspace(0, np.pi*2, 65)][:-1]

    joint_targets = []
    for ii in range(len(targets_x)):
    targs = np.asarray(joint_targets)

    # repeat the targets over time
    for ii in range(targs.shape[1]-1):
        targets = np.concatenate(
            (np.outer(targs[:, ii], np.ones(sig_len))[:, :, None],
             np.outer(targs[:, ii+1], np.ones(sig_len))[:, :, None]), axis=-1)
    targets = np.concatenate((targets, np.zeros(targets.shape)), axis=-1)
    # only want to penalize the system for not being at the
    # target at the final state, set everything before to np.nan
    targets[:, :-1] = np.nan 

    return targets

And you can see in the last couple lines that to implement the distance to target as a final state cost penalty only we just set all of the targets before the final time step equal to np.nan. If we wanted to penalize distance to target throughout the whole trajectory we would just comment that line out.

Create the plant

You’ll notice in the code that defines our RNN I set the last layer of the network to be plant, but that that’s not defined anywhere. Let’s talk. There are a couple of things that we’re going to need to incorporate our plant into this network and be able to use any deep learning method to train it. We need to be able to:

  1. Simulate the plant forward; i.e. pass in input and get back the resulting plant state at the next timestep.
  2. Calculate the derivative of the plant state with respect to the input; i.e. how do small changes in the input affect the state.
  3. Calculate the derivative of the plant state with respect to the previous state; i.e. how do small changes in the plant state affect the state at the next timestep.
  4. Calculate the derivative of the plant output with respect to its state; i.e. how do small changes in the current position of the state affect the output of the plant.

So 1 is easy, we have the arm simulations that we want already, they’re up on my GitHub. Number 4 is actually trivial too, because the output of our plant is going to be the state itself, so the derivative of the output with respect to the state is just the identity matrix.

For 2 and 3, we’re going to need to calculate some derivatives. If you’ve read the last few posts you’ll note that I’m on a finite differences kick. So let’s get that going! Because no one wants to calculate derivatives!

Important note, the notation in these next couple pieces of code is going to be a bit different from my normal notation because they’re matching with the hessianfree library notation, which is coming from a reinforcement learning literature background instead of a control theory background. So, s is the state of the plant, and x is the input to the plant. I know, I know. All the same, make sure to keep that in mind.

# calculate ds0/dx0 with finite differences
d_input_FD = np.zeros((x.shape[0], # number of trials
                       x.shape[1], # number of inputs
                       self.state.shape[1])) # number of states
for ii in range(x.shape[1]):
    # calculate state adding eps to x[ii]
    inc_x = x.copy()
    inc_x[:, ii] += self.eps
    state_inc = self.state.copy()
    # calculate state subtracting eps from x[ii]
    dec_x = x.copy()
    dec_x[:, ii] -= self.eps
    state_dec = self.state.copy()

    d_input_FD[:, :, ii] = \
        (state_inc - state_dec) / (2 * self.eps)
d_input_FD = d_input_FD[..., None]

Alrighty. First we create a tensor to store the results. Why is it a tensor? Because we’re going to be doing a bunch of runs at once. So our state dimensions are actually trials x size_input. When we then take the partial derivative, we end up with trials many size_input x size_state matrices. Then we increase each of the parameters of the input slightly one at a time and store the results, decrease them all one at a time and store the results, and compute our approximation of the gradient.

Next we’ll do the same for calculating the derivative of the state with respect to the previous state.

# calculate ds1/ds0
d_state_FD = np.zeros((x.shape[0], # number of trials
                       self.state.shape[1], # number of states
                       self.state.shape[1])) # number of states
for ii in range(self.state.shape[1]):
    # calculate state adding eps to self.state[ii]
    state = np.copy(self.prev_state)
    state[:, ii] += self.eps
    state_inc = self.state.copy()
    # calculate state subtracting eps from self.state[ii]
    state = np.copy(self.prev_state)
    state[:, ii] -= self.eps
    state_dec = self.state.copy()

    d_state_FD[:, :, ii] = \
        (state_inc - state_dec) / (2 * self.eps)
d_state_FD = d_state_FD[..., None]

Great! We’re getting closer to having everything we need. Another thing we need is a wrapper for running our arm simulation. It’s going to look like this:

def activation(self, x):
    state = []
    # iterate through and simulate the plant forward
    # for each trial
    for ii in range(x.shape[0]):
        self.arm.reset(q=self.state[ii, :self.arm.DOF],
                       dq=self.state[ii, self.arm.DOF:])
        state.append(np.hstack([self.arm.q, self.arm.dq]))
    state = np.asarray(state)

    self.state = self.squashing(state)

This is definitely not the fastest code to run. Much more ideally we would put the state and input into vectors and do a single set of computations for each call to activation rather than having that for loop in there. Unfortunately, though, we’re not assuming that we have access to the dynamics equations / will be able to pass in vector states and inputs.

Looking at the above code that seems pretty clear what’s going on, except you might notice that last line calling self.squashing. What’s going on there?

The squashing function looks like this:

def squashing(self, x):
    index_below = np.where(x < -2*np.pi)
    x[index_below] = np.tanh(x[index_below]+2*np.pi) - 2*np.pi
    index_above = np.where(x > 2*np.pi)
    x[index_above] = np.tanh(x[index_above]-2*np.pi) + 2*np.pi
    return x

All that’s happening here is that we’re taking our input, and doing nothing to it as long as it doesn’t start to get too positive or too negative. If it does then we just taper it off and prevent it from going off to infinity. So running a 1D vector through this function we get:

This ends up being a pretty important piece of the code here. Basically it prevents wild changes to the weights during learning to result in the system breaking down. So the state of the plant can’t go off to infinity and cause an error to be thrown, stopping our simulation. But because the target state is well within the bounds of where the squashing function does nothing, post-training we’ll still be able to use the resulting network to control a system that doesn’t have this fail safe built in. Think of this function as training wheels that catch you only if you start falling.

With that, we no have pretty much all of the parts necessary to begin training our network!

Training the network

We’re going to be training this network on the centre-out reaching task, where you start at a centre point and reach to a bunch of target locations around a circle. I’m just going to be re-implementing the task as it was done in Dr. Sutskever’s thesis, so we’ll have 64 targets around the circle, and train using a 2-link arm. Here’s the code that we’ll use to actually run the training:

for ii in range(last_trial+1, num_batches):
    # train a bunch of batches using the same input every time
    # to allow the network a chance to minimize things with
    # stable input (speeds up training)
    err = rnn.run_batches(plant, targets=None,
              optimizer=HessianFree(CG_iter=96, init_damping=100))

    # save the weights to file, track trial and error
    # err = rnn.error(inputs)
    err = rnn.best_error
    name = 'weights/rnn_weights-trial%04i-err%.3f'%(ii, err)
    np.savez_compressed(name, rnn.W)

Training your own network

A quick aside: if you want to run this code yourself, get a real good computer, have an arm simulation ready, the hessianfree Python library installed, and download and run this train_hf.py file. (Note: I used version 0.3.8 of the hessianfree library, which you can install using pip install hessianfree==0.3.8) This will start training and save the weights into a weights/ folder, so make sure that that exists in the same folder as train_hf.py. If you want to view the results of the training at any point run the plot_error.py file, which will load in the most recent version of the weights and plot the error so far. If you want to generate an animated plot like I have below run gen_animation_plots.py and then the last command from my post on generating animated gifs.

Another means of seeing the results of your trained up network is to use the controller I’ve implemented in my controls benchmarking suite, which looks for a set of saved weights in the controllers/weights folder, and will load it in and use it to generate command signals for the arm by running it with

python run.py arm2_python ahf reach --dt=1e-2

where you replace arm2_python with whatever arm model you trained your model on. Note the --dt=1e-2 flag, that is important because the model was trained with a .01 timestep and things get a bit weird if you suddenly change the dynamics on the controller.

OK let’s look at some results!


OK full discretion, these results are not optimizing the cost function we discussed above. They’re implementing a simpler cost function that only looks at the final state, i.e. it doesn’t penalize the magnitude of the control signal. I did this because Dr. Sutskever says in his thesis he was able to optimize with just the final state cost using much smaller networks. I originally looked at neurons with 96 neurons in each layer, and it just took forgoddamnedever to run. So after running for 4 weeks (not joking) and needing to make some more changes I dropped the number of neurons and simplified the task.

The results below are from running a network with 32 neurons in each layer controlling this 2-link arm, and took another 4-5 weeks to train up.

Hey that looks good! Not bad, augmented Hessian-free learning, not bad. It had pretty consistent (if slow) decline in the error rate, with a few crazy bumps from which it quickly recovered. Also take note that each training iteration is actually 32 runs, so it’s not 12,50-ish runs it’s closer to 400,000 training runs that it took to get here.

One biggish thing that was a pain was that it turns out that I only trained the neural network for reaching in the one direction, and when you only train it to reach one way it doesn’t generalize to reaching back to the starting point (which, fair enough). But, I didn’t realize this until I was took the trained network and ran it in the benchmarking code, at which point I was not keen to redo all of the training it took to get the neural network to the level of accuracy it was at under a more complicated training set. The downside of this is that even though I’ve implemented a controller that takes in the trained network and uses it to control the arm, to do the reaching task I have to just do a hard reset after the arm reaches the target, because it can’t reach back to the center, like all the other controllers. All the same, here’s an animation of the trained up AHF controller reaching to 8 targets (it was trained on all 64 above though):


Things don’t always go so smoothly, though. Here’s results from another training run that took around 2-3 weeks, and uses a different 2-link arm model (translated from Matlab code written by Dr. Emo Todorov):


What I found frustrating about this was that if you look at the error over time then this arm is doing as well or better than the previous arm at a lot of points. But the corresponding trajectories look terrible, like something you would see in a horror movie based around getting good machine learning results. This of course comes down to how I specified the cost function, and when I looked at the trajectories plotted over time the velocity of the arm is right at zero at the final time step, which it is not quiiiitte the case for the first controller. So this second network has found a workaround to minimize the cost function I specified in a way I did not intend. To prevent this, doing something like weighting the distance to target heavier than non-zero velocity would probably work. Or possibly just rerunning the training with a different random starting point you could get out a better controller, I don’t have a great feel for how important the random initialization is, but I’m hoping that it’s not all too important and its effects go to zero with enough training. Also, it should be noted I’ve run the first network for 12,500 iterations and the second for less than 6,000, so I’ll keep letting them run and maybe it will come around. The first one looked pretty messy too until about 4,000 iterations in.

Training regimes

Frustratingly, the way that you train deep networks is very important. So, very much like the naive deep learning network trainer that I am, I tried the first thing that pretty much anyone would try:

  • run the network,
  • update the weights,
  • repeat.

This is what I’ve done in the results above. And it worked well enough in that case.

If you remember back to the iLQR I made a little while ago, I was able to change the cost function to be

L(\theta) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{u}_t) + \ell_f(\textbf{x}_N),
\ell(\textbf{u}_t, \textbf{x}_t) = \alpha \frac{||\textbf{u}_t||^2}{2} + \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2},
\ell_f(\textbf{x}_N) = \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2}

(i.e. to include a penalty for distance to target throughout the trajectory and not just at the final time step) which resulted in straighter trajectories when controlling the 2-link arm. So I thought I would try this here as well. Sadly (incredibly sadly), this was fairly fruitless. The network didn’t really learn or improve much at all.

After much consideration and quandary on my part, I talked with Dr. Dan and he suggested that I try another method:

  • run the network,
  • record the input,
  • hold the input constant for a few batches of weight updating,
  • repeat.

This method gave much better results. BUT WHY? I hear you ask! Good question. Let me give giving explanation a go.

Essentially, it’s because the cost function is more complex now. In the first training method, the output from the plant is fed back into the network as input at every time step. When the cost function was simpler this was OK, but now we’re getting very different input to train on at every iteration. So the system is being pulled in different directions back and forth at every iteration. In the second training regime, the same input is given several times in a row, which let’s the system follow the same gradient for a few training iterations before things change again. In my head I picture this as giving the algorithm a couple seconds to catch its breath dunking it back underwater.

This is a method that’s been used in a bunch of places recently. One of the more high-profile instances is in the results published from DeepMind on deep RL for control and for playing Go. And indeed, it also works well here.

To implement this training regime, we set up the following code:

for ii in range(last_trial+1, num_batches):

    # run the plant forward once
    rnn.forward(input=plant, params=rnn.W)

    # get the input and targets from above rollout
    inputs = plant.get_vecs()[0].astype(np.float32)
    targets = np.asarray(plant.get_vecs()[1], dtype=np.float32)

    # train a bunch of batches using the same input every time
    # to allow the network a chance to minimize things with
    # stable input (speeds up training)
    err = rnn.run_batches(inputs, targets, max_epochs=batch_size,
              optimizer=HessianFree(CG_iter=96, init_damping=100))

    # save the weights to file, track trial and error
    # err = rnn.error(inputs)
    err = rnn.best_error
    name = 'weights/rnn_weights-trial%04i-err%.3f'%(ii, err)
    np.savez_compressed(name, rnn.W)

So you can see that we do one rollout with the weights, then go in and get the inputs and targets that were used in that rollout, and start training the network while holding those constant for batch_size epochs (training sessions). From a little bit of messing around I’ve found batch_size=32 to be a pretty good number. So then it runs 32 training iterations where it’s updating the weights, and then saves those weights (because we want a loooottttt of check-points) and then restarts the loop.

Embarrassingly, I’ve lost my simulation results from this trial, somehow…so I don’t have any nice plots to back up the above, unfortunately. But since this is just a blog post I figured I would at least talk about it a little bit, since people might still find it useful if they’re just getting into the field like me. and just update this post whenever I re-run them. If I rerun them.

What I do have, however, are results where this method doesn’t work! I tried this with the simpler cost function, that only looks at the final state distance from the target, and it did not go so well. Let’s look at that one!


My guess here is basically that the system has gotten to a point where it’s narrowed things down in the parameter space and now when you run 32 batches it’s overshooting. It needs feedback about its updates after every update at this point. That’s my guess, at least. So it could be the case that for more complex cost functions you’d want to train it while holding the input constant for a while, and then when the error starts to plateau switch to updating the input after every parameter update.


All in all, AHF for training neural networks in control is pretty awesome. There are of course still some major hold-backs, mostly related to how long it takes to train up a network, and having to guess at effective training regimes and network structures etc. But! It was able to train up a relatively small neural network to move an arm model from a center point to 64 targets around a circle, with no knowledge of the system under control at all. In Dr. Sutskever’s thesis he goes on to use the same set up under more complicated circumstances, such as when there’s a feedback delay, or a delay on the outgoing control signal, and unexpected noise etc, so it is able to learn under a number of different, fairly complex situations. Which is pretty slick.

Related to the insane training time required, I very easily could be missing some basic thing that would help speed things up. If you, reader, get ambitious and run the code on your own machine and find out useful methods for speeding up the training please let me know! Personally, my plan is to next investigate guided policy search, which seems like it’s found a way around this crazy training time.

Tagged , , , , , ,

The iterative Linear Quadratic Regulator algorithm

A few months ago I posted on Linear Quadratic Regulators (LQRs) for control of non-linear systems using finite-differences. The gist of it was at every time step linearize the dynamics, quadratize (it could be a word) the cost function around the current point in state space and compute your feedback gain off of that, as though the dynamics were both linear and consistent (i.e. didn’t change in different states). And that was pretty cool because you didn’t need all the equations of motion and inertia matrices etc to generate a control signal. You could just use the simulation you had, sample it a bunch to estimate the dynamics and value function, and go off of that.

The LQR, however, operates with maverick disregard for the future. Careless of the consequences, it optimizes for the current time step only. It would be really great to have an algorithm that was able to plan out a sequence, mindful of the overall cost of the trajectory and to optimize for that.

This is exactly the iterative Linear Quadratic Regulator method (iLQR) was designed for. iLQR is an extension of LQR control, and the idea here is basically to optimize a whole control sequence rather than just the control signal for the current point in time. The basic flow of the algorithm is:

  1. Initialize with initial state x_0 and initial control sequence \textbf{U} = [u_{t_0}, u_{t_1}, ..., u_{t_N}].
  2. Do a forward pass, i.e. simulate the system using (x_0, \textbf{U}) to get the trajectory through state space, \textbf{X}, that results from applying the control sequence \textbf{U} starting in x_0.
  3. Do a backward pass, estimate the value function and dynamics for each (\textbf{x}, \textbf{u}) in the state-space and control signal trajectories.
  4. Calculate an updated control signal \hat{\textbf{U}} and evaluate cost of trajectory resulting from (x_0, \hat{\textbf{U}}).
    1. If |(\textrm{cost}(x_0, \hat{\textbf{U}}) - \textrm{cost}(x_0, \textbf{U})| < \textrm{threshold} then we've converged and exit.
    2. If \textrm{cost}(x_0, \hat{\textbf{U}}) < \textrm{cost}(x_0, \textbf{U}), then set \textbf{U} = \hat{\textbf{U}}, and change the update size to be more aggressive. Go back to step 2.
    3. If \textrm{cost}(x_0, \hat{\textbf{U}}) \geq \textrm{cost}(x_0, \textbf{U}) change the update size to be more modest. Go back to step 3.

There are a bunch of descriptions of iLQR, and it also goes by names like ‘the sequential linear quadratic algorithm’. The paper that I’m going to be working off of is by Yuval Tassa out of Emo Todorov’s lab, called Control-limited differential dynamic programming. And the Python implementation of this can be found up on my github in my Control repo. Also, a big thank you to Dr. Emo Todorov who provided Matlab code for the iLQG algorithm, which was super helpful.

Defining things

So let’s dive in. Formally defining things, we have our system \textbf{x}, and dynamics described with the function \textbf{f}, such that

\textbf{x}_{t+1} = \textbf{f}(\textbf{x}_t, \textbf{u}_t),

where \textbf{u} is the input control signal. The trajectory \{\textbf{X}, \textbf{U}\} is the sequence of states \textbf{X} = \{\textbf{x}_0, \textbf{x}_1, ..., \textbf{x}_N\} that result from applying the control sequence \textbf{U} = \{\textbf{u}_0, \textbf{u}_1, ..., \textbf{u}_N\} starting in the initial state \textbf{x}_0.

Now we need to define all of our cost related equations, so we know exactly what we’re dealing with.

Define the total cost function J, which is the sum of the immediate cost, \ell, from each state in the trajectory plus the final cost, \ell_f:

J(\textbf{x}_0, \textbf{U}) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{x}_t, \textbf{u}_t) + \ell_f(\textbf{x}_N).

Letting \textbf{U}_t = \{\textbf{u}_t, \textbf{u}_{t+1}, ..., \textbf{U}_N\}, we define the cost-to-go as the sum of costs from time t to N:

J_t(\textbf{x}, \textbf{U}_t) = \sum\limits^{N-1}\limits_{i=t} \ell(\textbf{x}_i, \textbf{u}_i) + \ell_f(\textbf{x}_N).

The value function V at time t is the optimal cost-to-go from a given state:

V_t(\textbf{x}) = \min\limits_{\textbf{U}_t} J_t(\textbf{x}, \textbf{U}_t),

where the above equation just says that the optimal cost-to-go is found by using the control sequence \textbf{U}_t that minimizes J_t.

At the final time step, N, the value function is simply

V(\textbf{x}_N) = \ell_f(\textbf{x}_N).

For all preceding time steps, we can write the value function as a function of the immediate cost \ell(\textbf{x}, \textbf{u}) and the value function at the next time step:

V(\textbf{x}) = \min\limits_{\textbf{u}} \left[ \ell(\textbf{x}, \textbf{u}) + V(\textbf{f}(\textbf{x}, \textbf{u})) \right].

NOTE: In the paper, they use the notation V'(\textbf{f}(\textbf{x}, \textbf{u})) to denote the value function at the next time step, which is redundant since \textbf{x}_{t+1} = \textbf{f}(\textbf{x}_t, \textbf{u}_t), but it comes in handy later when they drop the dependencies to simplify notation. So, heads up: V' = V(\textbf{f}(\textbf{x}, \textbf{u}).

Forward rollout

The forward rollout consists of two parts. The first part is to simulating things to generate the (\textbf{X}, \textbf{U}), from which we can calculate the overall cost of the trajectory, and find out the path that the arm will take. To improve things though we’ll need a lot of information about the partial derivatives of the system, calculating these is the second part of the forward rollout phase.

To calculate all these partial derivatives we’ll use (\textbf{X}, \textbf{U}). For each (\textbf{x}_t, \textbf{u}_t) we’ll calculate the derivatives of \textbf{f}(\textbf{x}_t, \textbf{u}_t) with respect to \textbf{x}_t and \textbf{u}_t, which will give us what we need for our linear approximation of the system dynamics.

To get the information we need about the value function, we’ll need the first and second derivatives of \ell(\textbf{x}_t, \textbf{u}_t) and \ell_f(\textbf{x}_t, \textbf{x}_t) with respect to \textbf{x}_t and \textbf{u}_t.

So all in all, we need to calculate \textbf{f}_\textbf{x}, \textbf{f}_\textbf{u}, \ell_\textbf{x}, \ell_\textbf{u}, \ell_\textbf{xx}, \ell_\textbf{ux}, \ell_\textbf{uu}, where the subscripts denote a partial derivative, so \ell_\textbf{x} is the partial derivative of \ell with respect to \textbf{x}, \ell_\textbf{xx} is the second derivative of \ell with respect to \textbf{x}, etc. And to calculate all of these partial derivatives, we’re going to use finite differences! Just like in the LQR with finite differences post. Long story short, load up the simulation for every time step, slightly vary one of the parameters, and measure the resulting change.

Once we have all of these, we’re ready to move on to the backward pass.

Backward pass

Now, we started out with an initial trajectory, but that was just a guess. We want our algorithm to take it and then converge to a local minimum. To do this, we’re going to add some perturbing values and use them to minimize the value function. Specifically, we’re going to compute a local solution to our value function using a quadratic Taylor expansion. So let’s define Q(\delta \textbf{x}, \delta \textbf{u}) to be the change in our value function at (\textbf{x}, \textbf{u}) as a result of small perturbations (\delta \textbf{x}, \delta \textbf{u}):

Q(\delta \textbf{x}, \delta \textbf{u}) = \ell (\textbf{x} + \delta \textbf{x}, \textbf{u} + \delta \textbf{u}) + V(\textbf{f}(\textbf{x} + \delta\textbf{x}, \textbf{u} + \delta \textbf{u})).

The second-order expansion of Q is given by:

Q_\textbf{x} = \ell_\textbf{x} + \textbf{f}_\textbf{x}^T V'_\textbf{x},

Q_\textbf{u} = \ell_\textbf{u} + \textbf{f}_\textbf{u}^T V'_\textbf{x},

Q_\textbf{xx} = \ell_\textbf{xx} + \textbf{f}_\textbf{x}^T V'_\textbf{xx} \textbf{f}_\textbf{x} + V'_\textbf{x} \cdot \textbf{f}_\textbf{xx},

Q_\textbf{ux} = \ell_\textbf{ux} + \textbf{f}_\textbf{u}^T V'_\textbf{xx} \textbf{f}_\textbf{x}+ V'_\textbf{x} \cdot \textbf{f}_\textbf{ux},

Q_\textbf{uu} = \ell_\textbf{uu} + \textbf{f}_\textbf{u}^T V'_\textbf{xx} \textbf{f}_\textbf{u}+ V'_\textbf{x} \cdot \textbf{f}_\textbf{uu}.

Remember that V' = V(\textbf{f}(\textbf{x}, \textbf{u})), which is the value function at the next time step. NOTE: All of the second derivatives of \textbf{f} are zero in the systems we’re controlling here, so when we calculate the second derivatives we don’t need to worry about doing any tensor math, yay!

Given the second-order expansion of Q, we can to compute the optimal modification to the control signal, \delta \textbf{u}^*. This control signal update has two parts, a feedforward term, \textbf{k}, and a feedback term \textbf{K} \delta\textbf{x}. The optimal update is the \delta\textbf{u} that minimizes the cost of Q:

\delta\textbf{u}^*(\delta \textbf{x}) = \min\limits_{\delta\textbf{u}}Q(\delta\textbf{x}, \delta\textbf{u}) = \textbf{k} + \textbf{K}\delta\textbf{x},

where \textbf{k} = -Q^{-1}_\textbf{uu} Q_\textbf{u} and \textbf{K} = -Q^{-1}_\textbf{uu} Q_\textbf{ux}.

Derivation can be found in this earlier paper by Li and Todorov. By then substituting this policy into the expansion of Q we get a quadratic model of V. They do some mathamagics and come out with:

V_\textbf{x} = Q_\textbf{x} - \textbf{K}^T Q_\textbf{uu} \textbf{k},

V_\textbf{xx} = Q_\textbf{xx} - \textbf{K}^T Q_\textbf{uu} \textbf{K}.

So now we have all of the terms that we need, and they’re defined in terms of the values at the next time step. We know the value of the value function at the final time step V_N = \ell_f(\textbf{x}_N), and so we’ll simply plug this value in and work backwards in time recursively computing the partial derivatives of Q and V.

Calculate control signal update

Once those are all calculated, we can calculate the gain matrices, \textbf{k} and \textbf{K}, for our control signal update. Huzzah! Now all that’s left to do is evaluate this new trajectory. So we set up our system

\hat{\textbf{x}}_0 = \textbf{x}_0,

\hat{\textbf{u}}_t = \textbf{u}_t + \textbf{k}_t + \textbf{K}_t (\hat{\textbf{x}}_t - \textbf{x}_t),

\hat{\textbf{x}}_{t+1} = \textbf{f}(\hat{\textbf{x}}_t, \hat{\textbf{u}}_t),

and record the cost. Now if the cost of the new trajectory (\hat{\textbf{X}}, \hat{\textbf{U}}) is less than the cost of (\textbf{X}, \textbf{U}) then we set \textbf{U} = \hat{\textbf{U}} and go do it all again! And when the cost from an update becomes less than a threshold value, call it done. In code this looks like:

if costnew < cost:
  sim_new_trajectory = True

  if (abs(costnew - cost)/cost) < self.converge_thresh:

Of course, another option we need to account for is when costnew > cost. What do we do in this case? Our control update hasn’t worked, do we just exit?

The Levenberg-Marquardt heuristic
No! Phew.

The control signal update in iLQR is calculated in such a way that it can behave like Gauss-Newton optimization (which uses second-order derivative information) or like gradient descent (which only uses first-order derivative information). The is that if the updates are going well, then lets include curvature information in our update to help optimize things faster. If the updates aren’t going well let’s dial back towards gradient descent, stick to first-order derivative information and use smaller steps. This wizardry is known as the Levenberg-Marquardt heuristic. So how does it work?

Something we skimmed over in the iLQR description was that we need to calculate Q^{-1}_\textbf{uu} to get the \textbf{k} and \textbf{K} matrices. Instead of using np.linalg.pinv or somesuch, we’re going to calculate the inverse ourselves using singular value decomposition, so that we can regularize it. This will let us do a couple of things. First, we’ll be able to make sure that our estimate of curvature (Q_\textbf{uu}^{-1}) stays positive definite, which is important to make sure that we always have a descent direction. Second, we’re going to add a regularization term to the singular values to prevent them from exploding when we take their inverse. Here’s our regularization implemented in Python:

 U, S, V = np.linalg.svd(Quu)
S[S < 0] = 0.0 # no negative values
S += lamb # add regularization term
Quu_inv = np.dot(U, np.dot(np.diag(1.0/S), V.T))

Now, what happens when we change lamb? The singular values represent the magnitude of each of the left and right singular vectors, and by taking their reciprocal we flip the contributions of the vectors. So the ones that were contributing the least now have the largest singular values, and the ones that contributed the most now have the smallest singular values. By adding a regularization term we ensure that the inverted singular values can never be larger than 1/lamb. So essentially we throw out information.

In the case where we’ve got a really good approximation of the system dynamics and value function, we don’t want to do this. We want to use all of the information available because it’s accurate, so make lamb small and get a more accurate inverse. In the case where we have a bad approximation of the dynamics we want to be more conservative, which means not having those large singular values. Smaller singular values give a smaller Q_\textbf{uu}^{-1} estimate, which then gives smaller gain matrices and control signal update, which is what we want to do when our control signal updates are going poorly.

How do you know if they’re going poorly or not, you now surely ask! Clever as always, we’re going to use the result of the previous iteration to update lamb. So adding to the code from just above, the end of our control update loop is going to look like:

lamb = 1.0 # initial value of lambda
if costnew < cost:
  lamb /= self.lamb_factor
  sim_new_trajectory = True

  if (abs(costnew - cost)/cost) < self.converge_thresh:
  lamb *= self.lamb_factor
  if lamb > self.max_lamb:

And that is pretty much everything! OK let’s see how this runs!

Simulation results

If you want to run this and see for yourself, you can go copy my Control repo, navigate to the main directory, and run

python run.py arm2 reach

or substitute in arm3. If you’re having trouble getting the arm2 simulation to run, try arm2_python, which is a straight Python implementation of the arm dynamics, and should work no sweat for Windows and Mac.

Below you can see results from the iLQR controller controlling the 2 and 3 link arms (click on the figures to see full sized versions, they got distorted a bit in the shrinking to fit on the page), using immediate and final state cost functions defined as:

l = np.sum(u**2)


pos_err = np.array([self.arm.x[0] - self.target[0], 
                    self.arm.x[1] - self.target[1]])
l = (wp * np.sum(pos_err**2) + # pos error
    wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error

where wp and wv are just gain values, x is the state of the system, and self.arm.x is the (x,y) position of the hand. These read as “during movement, penalize large control signals, and at the final state, have a big penalty on not being at the target.”

So let’s give it up for iLQR, this is awesome! How much of a crazy improvement is that over LQR? And with all knowledge of the system through finite differences, and with the full movements in exactly 1 second! (Note: The simulation speeds look different because of my editing to keep the gif sizes small, they both take the same amount of time for each movement.)

Changing cost functions
Something that you may notice is that the control of the 3 link is actually straighter than the 2 link. I thought that this might be just an issue with the gain values, since the scale of movement is smaller for the 2 link arm than the 3 link there might have been less of a penalty for not moving in a straight line, BUT this was wrong. You can crank the gains and still get the same movement. The actual reason is that this is what the cost function specifies, if you look in the code, only \ell_f(\textbf{x}_N) penalizes the distance from the target, and the cost function during movement is strictly to minimize the control signal, i.e. \ell(\textbf{x}_t, \textbf{u}_t) = \textbf{u}_t^2.

Well that’s a lot of talk, you say, like the incorrigible antagonist we both know you to be, prove it. Alright, fine! Here’s iLQR running with an updated cost function that includes the end-effector’s distance from the target in the immediate cost:

All that I had to do to get this was change the immediate cost from

l = np.sum(u**2)


l = np.sum(u**2)
pos_err = np.array([self.arm.x[0] - self.target[0], 
                    self.arm.x[1] - self.target[1]])
l += (wp * np.sum(pos_err**2) + # pos error
    wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error

where all I had to do was include the position penalty term from the final state cost into the immediate state cost.

Changing sequence length
In these simulations the system is simulating at .01 time step, and I gave it 100 time steps to reach the target. What if I give it only 50 time steps?

It looks pretty much the same! It’s just now twice as fast, which is of course achieved by using larger control signals, which we don’t see, but dang awesome.

What if we try to make it there in 10 time steps??

OK well that does not look good. So what’s going on in this case? Basically we’ve given the algorithm an impossible task. It can’t make it to the target location in 10 time steps. In the implementation I wrote here, if it hits the end of it’s control sequence and it hasn’t reached the target yet, the control sequence starts over back at t=0. Remember that part of the target state is also velocity, so basically it moves for 10 time steps to try to minimize (x,y) distance, and then slows down to minimize final state cost in the velocity term.

In conclusion

This algorithm has been used in a ton of things, for controlling robots and simulations, and is an important part of guided policy search, which has been used to very successfully train deep networks in control problems. It’s getting really impressive results for controlling the arm models that I’ve built here, and using finite differences should easily generalize to other systems.

iLQR is very computationally expensive, though, so that’s definitely a downside. It’s definitely less expensive if you have the equations of your system, or at least a decent approximation of them, and you don’t need to use finite differences. But you pay for the efficiency with a loss in generality.

There are also a bunch of parameters to play around with that I haven’t explored at all here, like the weights in the cost function penalizing the magnitude of the cost function and the final state position error. I showed a basic example of changing the cost function, which hopefully gets across just how easy changing these things out can be when you’re using finite differences, and there’s a lot to play around with there too.

Implementation note
In the Yuval and Todorov paper, they talked about using backtracking line search when generating the control signal. So the algorithm they had when generating the new control signal was actually:

\hat{\textbf{u}}_t = \hat{\textbf{u}}_t + \alpha\textbf{k}_t + \textbf{K}_t(\hat{\textbf{x}}_t - \textbf{x}_t)

where \alpha was the backtracking search parameter, which gets set to one initially and then reduced. It’s very possible I didn’t implement it as intended, but I found consistently that \alpha = 1 always generated the best results, so it was just adding computation time. So I left it out of my implementation. If anyone has insights on an implementation that improves results, please let me know!

And then finally, another thank you to Dr. Emo Todorov for providing Matlab code for the iLQG algorithm, which was very helpful, especially for getting the Levenberg-Marquardt heuristic implemented properly.

Tagged , , , , , , , ,

Linear-Quadratic Regulation for non-linear systems using finite differences

One of the standard controllers in basic control theory is the linear-quadratic regulator (LQR). There is a finite-horizon case (where you have a limited amount of time), and an infinite-horizon case (where you don’t); in this post, for simplicity, we’re only going to be dealing with the infinite-horizon case.

The LQR is designed to handle a very specific kind of problem. First, it assumes you are controlling a system with linear dynamics, which means you can express them as

\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u},

where \textbf{x} and \dot{\textbf{x}} are the state and its time derivative, \textbf{u} is the input, and \textbf{A} and \textbf{B} capture the effects of the state and input on the derivative. And second, it assumes that the cost function, denoted J, is a quadratic of the form

J = \int_0^{\infty} \left( (\textbf{x} - \textbf{x}^*)^T \textbf{Q} (\textbf{x} - \textbf{x}^*) + \textbf{u}^T \textbf{R} \textbf{u} \right) dt

where \textbf{x}^* is the target state, and \textbf{Q} = \textbf{Q}^T \geq 0 and \textbf{R} = \textbf{R}^T \geq 0 are weights on the cost of not being at the target state and applying a control signal. The higher \textbf{Q} is, the more important it is to get to the target state asap, the higher \textbf{R} is, the more important it is to keep the control signal small as you go to the target state.

The goal of the LQR is to calculate a feedback gain matrix \textbf{K} such that

\textbf{u} = -\textbf{K} \textbf{x},

drives the system to the target. When the system is a linear system with a quadratic cost function, this can be done optimally. There is lots of discussion elsewhere about LQRs and their derivation, so I’m not going to go into that with this post. Instead, I’m going to talk about applying LQRs to non-linear systems, and using finite differences to do it, which works when you have a readily accessible simulation of the system on hand. The fun part is that by using finite differences you can get this to work without working out the dynamics equations yourself.

Using LQRs on non-linear systems

As you may have noticed, non-linear systems violate the first assumption of a linear quadratic regulator; that the system is linear. That doesn’t mean that we can’t apply it, it just means that it’s not going to be optimal. How poorly the LQR will perform depends on a few things, two important factors being how non-linear the system dynamics actually are, and how often you’re able to update the feedback gain matrix \textbf{K}. To apply LQR to non-linear systems we’re just going to close our eyes and pretend that the system dynamics are linear, i.e. they fit the form

\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u}.

We’ll do this by approximating the actual dynamics of the system linearly. We’ll then solve for our gain value \textbf{K}, generate our control signal for this timestep, and then re-approximate the dynamics again at the next time step and solve for \textbf{K} from the new state. The more non-linear the system dynamics are, the less appropriate \textbf{K} will be for generating our control signal \textbf{u} as we move away from the state \textbf{K} was calculated in; this is why update time of the LQR can become an important factor.

Using finite-differences to approximate system dynamics

An important question, then, is how do we find this system approximation? How can we calculate the \textbf{A} and \textbf{B} matrices that we then use to solve for \textbf{K}? If we know the dynamics of the system to be

\dot{\textbf{x}} = f(\textbf{x}, \textbf{u}),

then we can calculate

\textbf{A} = \frac{\partial f(\textbf{x}, \textbf{u})}{\partial \textbf{x}}, \;\;\;\; \textbf{B} = \frac{\partial f(\textbf{x}, \textbf{u})}{\partial \textbf{u}}.

If you’re going to try this for the 3-link arm, though, get out Mathematica. Do not try this by hand. If you disregard my warning and foolhardily attempt such a derivation you will regret, repent, and then appeal to Wolfram Alpha for salvation. These equations quickly become terrible and long even for seemingly not-so-complicated systems.

There are a few ways to skirt this. Here we’re going to assume that the system under control is a simulation, or that we at least have access to an accurate model, and use the finite differences method to compute these values. The idea behind finite differences is to approximate the rate of change of the function f at the point x by sampling f near x and using the difference to calculate \dot{f}(x). Here’s a picture for a 1D system:

So here, our current state x is the blue dot, and the red dots represent the sample points x + \Delta x and x - \Delta x. We can then calculate

\dot{f}(x) \approx \frac{f(x+\Delta x) - f(x-\Delta x)}{2\Delta x},

and you can see the actual rate of change of f at x plotted in the blue dashed line, and the approximated rate of change calculated using finite differences plotted in the red dashed line. We can also see that the approximated derivative is only accurate near x (the blue dot).

Back in our multi-dimensional system, to use finite differences to calculate the derivative with respect to the state and the input we’re going to vary each of the dimensions of the state and input by some small amount one at a time, calculating the effects of each one by one. Here’s a chunk of pseudo-code to hopefully clarify this idea:

eps = 1e-5
A = np.zeros((len(current_state), len(current_state))
for ii in range(len(current_state)):
    x = current_state.copy()
    x[ii] += eps
    x_inc = simulate_system(state=x, input=control_signal)
    x = current_state.copy()
    x[ii] -= eps
    x_dec = simulate_system(state=x, input=control_signal)
    A[:,ii] = (x_inc - x_dec) / (2 * eps)

B = np.zeros((len(current_state), len(control_signal))
for ii in range(len(control_signal)):
    u = control_signal.copy()
    u[ii] += eps
    x_inc = simulate_system(state=current_state, input=u)
    u = control_signal.copy()
    u[ii] -= eps
    x_dec = simulate_system(state=current_state, input=u)
    B[:,ii] = (x_inc - x_dec) / (2 * eps)

Now we’re able to generate our \textbf{A} and \textbf{B} matrices we have everything we need to solve for our feedback gain matrix \textbf{K}! Which is great.

Note on using finite differences in continuous vs discrete setup

Something that’s important to straighten out too is what exactly is returned by the simulate_system function in the code above. In the continuous case, your system is captured as

\dot{\textbf{x}} = \textbf{A}\textbf{x} + \textbf{B}\textbf{u},,

where in the discrete case your system is defined

\textbf{x}(t+1) = \textbf{A}\textbf{x}(t) + \textbf{B}\textbf{u}(t).

If you are calculating your feedback gain matrix \textbf{K} using the continuous solution to the algebraic Riccati equation, then you need to be returning \dot{\textbf{x}}(t). If you’re solving for \textbf{K} using the discrete solution to the algebraic Riccati equation you need to return \textbf{x}(t+1). This was just something that I came across as I was coding and so I wanted to mention it here in case anyone else stumbled across it!

Applying LQR to 2 and 3 link arm control

Alright! Let’s have a look at how the LQR does controlling non-linear systems. Below we have the control of a 2-link arm compared to a 3-link arm, and you can see the control of the 2-link arm is better. This is a direct result of the dynamics of a 3-link arm being significantly more complex.

2linkarm 3linkarm

Note on controlling at different timesteps

When I was first testing the LQR controller I expected the effects of different control update times to be a lot more significant than it was. As it turns out, for controlling a 3-link arm, there’s not really a visible difference in a controller that is updating every .01 seconds vs every .001 seconds vs every .0001 seconds. Let’s have a look:

3link.01 3link.001 3link.0001

Can’t even tell, eh? Fun fact, the simulation took 1 minute 30 seconds at .01 seconds time step and 45 minutes at .0001 seconds time step. The left-most animation is the .01 seconds and the right-most the .0001 seconds. But why is there seemingly so little difference? Well, this boils down to the dynamics of the 3-link arm changing actually pretty slowly. Below I’ve plotted just a few of the elements from the \textbf{A}, \textbf{B}, and \textbf{K} matrices over .5 seconds of simulation time:

A matrixB matrixK matrix

So, there are some obvious points where sampling the dynamics at a .01 time step is noticeably less accurate, but all in all there’s not a huuuggge difference between sampling at .01 and .0001 seconds. If you’re just watching the end-effector path it’s really not very noticeable. You can see how the elements of \textbf{A} and \textbf{B} are changing fairly slowly; this means that \textbf{K} is going to be an effective feedback gain for a fair chunk of time. And the computational savings you get by sampling the dynamics and regenerating \textbf{K} every .01 seconds instead of every .0001 seconds are pretty big. This was just another thing that I came across when playing around with the LQR, the take away being don’t just assume you need to update your system crazy often. You might get very comparable performance for much less computational cost.


All in all, the LQR controller is pretty neat! It’s really simple to set up, and generic. We don’t need any specific information about the system dynamics, like we do for effective operational space control (OSC). When we estimate the dynamics with finite differences, all need is a decent system model that we can sample. Again, the more non-linear the system, of course, the less effective a LQR will be. If you’re interested in playing around with one, or generating the figures that I show above, the code is all up and running on my Github for you to explore.

Tagged , , , , , , , ,

Setting up an arm simulation interface in Nengo 2

I got an email the other day asking about how to set up an arm controller in Nengo, where they had been working from the Spaun code to strip away things until they had just the motor portion left they could play with. I ended up putting together a quick script to get them started and thought I would share it here in case anyone else was interested. It’s kind of fun because it shows off some of the new GUI and node interfacing. Note that you’ll need nengo_gui version .15+ for this to work. In general I recommend getting the dev version installed, as it’s stable and updates are made all the time improving functionality.

Nengo 1.4 core was all written in Java, with Jython and Python scripting thrown in on top, and since then a lot of work has gone into the re-write of the entire code base for Nengo 2. Nengo 2 is now written in Python, all the scripting is in Python, and we have a kickass GUI and support for running neural simulations on CPUs, GPUs, and specialized neuromorphic hardware like SpiNNaKer. I super recommend checking it out if you’re at all interested in neural modelling, we’ve got a bunch of tutorials up and a very active support board to help with any questions or problems. You can find the simulator code for installation here: https://github.com/nengo/nengo and the GUI code here: https://github.com/nengo/nengo_gui, where you can also find installation instructions.

And once you have that up and running, to run an arm simulation you can download and run the following code I have up on my GitHub. When you pop it open at the top is a run_in_GUI boolean, which you can use to open the sim up in the GUI, if you set it to False then it will run in the Nengo simulator and once finished will pop up with some basic graphs. Shout out to Terry Stewart for putting together the arm-visualization. It’s a pretty slick little demo of the extensibility of the Nengo GUI, you can see the code for it all in the <code>arm_func</code> in the <code>nengo_arm.py</code> file.

As it’s set up right now, it uses a 2-link arm, but you can simply swap out the Arm.py file with whatever plant you want to control. And as for the neural model, there isn’t one implemented in here, it’s just a simple input node that runs through a neural population to apply torque to the two joints of the arm. But! It should be a good start for anyone looking to experiment with arm control in Nengo. Here’s what it looks like when you pull it up in the GUI (also note that the arm visualization only appears once you hit the play button!):


Tagged , , , , ,

Operational space control of 6DOF robot arm with spiking cameras part 2: Deriving the Jacobian

In the previous exciting post in this series I outlined the project, which is in the title, and we worked through getting access to the arm through Python. The next step was deriving the Jacobian, and that’s what we’re going to be talking about in this post!

This was a time I was very glad to have a previous post talking about generating transformation matrices, because deriving the Jacobian for a 6DOF arm in 3D space comes off as a little daunting when you’re used to 3DOF in 2D space, and I needed a reminder of the derivation process. The first step here was finding out which motors were what, so I went through and found out how each motor moved with something like the following code:

for ii in range(7):
    target_angles = np.zeros(7, dtype='float32')
    target_angles[ii] = np.pi / 4.0

and I found that the robot is setup in the figures below

this is me trying my hand at making things clearer using Inkscape, hopefully it’s worked. Displayed are the first 6 joints and their angles of rotation, q_0 through q_5. The 7th joint, q_6, opens and closes the gripper, so we’re safe to ignore it in deriving our Jacobian. The arm segment lengths l_1, l_3, and l_5 are named based on the nearest joint angles (makes easier reading in the Jacobian derivation).

Find the transformation matrix from end-effector to origin

So first thing’s first, let’s find the transformation matrices. Our first joint, q_0, rotates around the z axis, so the rotational part of our transformation matrix ^\textrm{orgin}\textbf{T}_0 is

^\textrm{orgin}\textbf{R}_0 = \left[ \begin{array}{ccc} \textrm{cos}(q_0) & -\textrm{sin}(q_0) & 0 \\ \textrm{sin}(q_0) & \textrm{cos}(q_0) & 0 \\ 0 & 0 & 1 \end{array} \right],

and q_0 and our origin frame of reference are on top of each other so we don’t need to account for translation, so our translation component of ^\textrm{origin}\textbf{T}_0 is

^\textrm{orgin}\textbf{D}_0 = \left[ \begin{array}{c} 0 \\ 0 \\ 0 \end{array} \right].

Stacking these together to form our first transformation matrix we have

^\textrm{orgin}\textbf{T}_0 = \left[ \begin{array}{cc} ^\textrm{origin}\textbf{R}_0 & ^\textrm{origin}\textbf{D}_0 \\ 0 & 1 \end{array} \right] = \left[ \begin{array}{cccc} \textrm{cos}(q_0) & -\textrm{sin}(q_0) & 0 & 0\\ \textrm{sin}(q_0) & \textrm{cos}(q_0) & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{array} \right] .

So now we are able to convert a position in 3D space from to the reference frame of joint q_0 back to our origin frame of reference. Let’s keep going.

Joint q_1 rotates around the x axis, and there is a translation along the arm segment l_1. Our transformation matrix looks like

^0\textbf{T}_1 = \left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & \textrm{cos}(q_1) & -\textrm{sin}(q_1) & l_1 \textrm{cos}(q_1) \\ 0 & \textrm{sin}(q_1) & \textrm{cos}(q_1) & l_1 \textrm{sin}(q_1) \\ 0 & 0 & 0 & 1 \end{array} \right] .

Joint q_2 also rotates around the x axis, but there is no translation from q_2 to q_3. So our transformation matrix looks like

^1\textbf{T}_2 = \left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & \textrm{cos}(q_2) & -\textrm{sin}(q_2) & 0 \\ 0 & \textrm{sin}(q_2) & \textrm{cos}(q_2) & 0 \\ 0 & 0 & 0 & 1 \end{array} \right] .

The next transformation matrix is a little tricky, because you might be tempted to say that it’s rotating around the z axis, but actually it’s rotating around the y axis. This is determined by where q_3 is mounted relative to q_2. If it was mounted at 90 degrees from q_2 then it would be rotating around the z axis, but it’s not. For translation, there’s a translation along the y axis up to the next joint, so all in all the transformation matrix looks like:

^2\textbf{T}_3 = \left[ \begin{array}{cccc} \textrm{cos}(q_3) & 0 & -\textrm{sin}(q_3) & 0\\ 0 & 1 & 0 & l_3 \\ \textrm{sin}(q_3) & 0 & \textrm{cos}(q_3) & 0\\ 0 & 0 & 0 & 1 \end{array} \right] .

And then the transformation matrices for coming from q_4 to q_3 and q_5 to q_4 are the same as the previous set, so we have

^3\textbf{T}_4 = \left[ \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 0 & \textrm{cos}(q_4) & -\textrm{sin}(q_4) & 0 \\ 0 & \textrm{sin}(q_4) & \textrm{cos}(q_4) & 0 \\ 0 & 0 & 0 & 1 \end{array} \right] .


^4\textbf{T}_5 = \left[ \begin{array}{cccc} \textrm{cos}(q_5) & 0 & -\textrm{sin}(q_5) & 0 \\ 0 & 1 & 0 & l_5 \\ \textrm{sin}(q_5) & 0 & \textrm{cos}(q_5) & 0\\ 0 & 0 & 0 & 1 \end{array} \right] .

Alright! Now that we have all of the transformation matrices, we can put them together to get the transformation from end-effector coordinates to our reference frame coordinates!

^\textrm{origin}\textbf{T}_\textrm{ee} = ^\textrm{origin}\textbf{T}_0 \; ^0\textbf{T}_1 \; ^1\textbf{T}_2 \; ^2\textbf{T}_3 \; ^3\textbf{T}_4 \; ^4\textbf{T}_5.

At this point I went and tested this with some sample points to make sure that everything seemed to be being transformed properly, but we won’t go through that here.

Calculate the derivative of the transform with respect to each joint

The next step in calculating the Jacobian is getting the derivative of ^\textrm{origin}T_\textrm{ee}. This could be a big ol’ headache to do it by hand, OR we could use SymPy, the symbolic computation package for Python. Which is exactly what we’ll do. So after a quick

sudo pip install sympy

I wrote up the following script to perform the derivation for us

import sympy as sp

def calc_transform():
    # set up our joint angle symbols (6th angle doesn't affect any kinematics)
    q = [sp.Symbol('q0'), sp.Symbol('q1'), sp.Symbol('q2'), sp.Symbol('q3'),
            sp.Symbol('q4'), sp.Symbol('q5')]
    # set up our arm segment length symbols
    l1 = sp.Symbol('l1')
    l3 = sp.Symbol('l3')
    l5 = sp.Symbol('l5')

    Torg0 = sp.Matrix([[sp.cos(q[0]), -sp.sin(q[0]), 0, 0,],
                       [sp.sin(q[0]), sp.cos(q[0]), 0, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 1]])

    T01 = sp.Matrix([[1, 0, 0, 0],
                     [0, sp.cos(q[1]), -sp.sin(q[1]), l1*sp.cos(q[1])],
                     [0, sp.sin(q[1]), sp.cos(q[1]), l1*sp.sin(q[1])],
                     [0, 0, 0, 1]])

    T12 = sp.Matrix([[1, 0, 0, 0],
                     [0, sp.cos(q[2]), -sp.sin(q[2]), 0],
                     [0, sp.sin(q[2]), sp.cos(q[2]), 0],
                     [0, 0, 0, 1]])

    T23 = sp.Matrix([[sp.cos(q[3]), 0, sp.sin(q[3]), 0],
                     [0, 1, 0, l3],
                     [-sp.sin(q[3]), 0, sp.cos(q[3]), 0],
                     [0, 0, 0, 1]])

    T34 = sp.Matrix([[1, 0, 0, 0],
                     [0, sp.cos(q[4]), -sp.sin(q[4]), 0],
                     [0, sp.sin(q[4]), sp.cos(q[4]), 0],
                     [0, 0, 0, 1]])

    T45 = sp.Matrix([[sp.cos(q[5]), 0, sp.sin(q[5]), 0],
                     [0, 1, 0, l5],
                     [-sp.sin(q[5]), 0, sp.cos(q[5]), 0],
                     [0, 0, 0, 1]])

    T = Torg0 * T01 * T12 * T23 * T34 * T45

    # position of the end-effector relative to joint axes 6 (right at the origin)
    x = sp.Matrix([0,0,0,1])

    Tx = T * x

    for ii in range(6):
        print q[ii]
        print sp.simplify(Tx[0].diff(q[ii]))
        print sp.simplify(Tx[1].diff(q[ii]))
        print sp.simplify(Tx[2].diff(q[ii]))

And then consolidated the output using some variable shorthand to write a function that accepts in joint angles and generates the Jacobian:

def calc_jacobian(q):
    J = np.zeros((3, 7))

    c0 = np.cos(q[0])
    s0 = np.sin(q[0])
    c1 = np.cos(q[1])
    s1 = np.sin(q[1])
    c3 = np.cos(q[3])
    s3 = np.sin(q[3])
    c4 = np.cos(q[4])
    s4 = np.sin(q[4])

    c12 = np.cos(q[1] + q[2])
    s12 = np.sin(q[1] + q[2])

    l1 = self.l1
    l3 = self.l3
    l5 = self.l5

    J[0,0] = -l1*c0*c1 - l3*c0*c12 - l5*((s0*s3 - s12*c0*c3)*s4 + c0*c4*c12)
    J[1,0] = -l1*s0*c1 - l3*s0*c12 + l5*((s0*s12*c3 + s3*c0)*s4 - s0*c4*c12)
    J[2,0] = 0

    J[0,1] = (l1*s1 + l3*s12 + l5*(s4*c3*c12 + s12*c4))*s0
    J[1,1] = -(l1*s1 + l3*s12 + l5*s4*c3*c12 + l5*s12*c4)*c0
    J[2,1] = l1*c1 + l3*c12 - l5*(s4*s12*c3 - c4*c12)

    J[0,2] = (l3*s12 + l5*(s4*c3*c12 + s12*c4))*s0
    J[1,2] = -(l3*s12 + l5*s4*c3*c12 + l5*s12*c4)*c0
    J[2,2] = l3*c12 - l5*(s4*s12*c3 - c4*c12)

    J[0,3] = -l5*(s0*s3*s12 - c0*c3)*s4
    J[1,3] = l5*(s0*c3 + s3*s12*c0)*s4
    J[2,3] = -l5*s3*s4*c12

    J[0,4] = l5*((s0*s12*c3 + s3*c0)*c4 + s0*s4*c12)
    J[1,4] = l5*((s0*s3 - s12*c0*c3)*c4 - s4*c0*c12)
    J[2,4] = -l5*(s4*s12 - c3*c4*c12)

    return J

Alright! Now we have our Jacobian! Really the only time consuming part here was calculating our end-effector to origin transformation matrix, generating the Jacobian was super easy using SymPy once we had that.

Hack position control using the Jacobian

Great! So now that we have our Jacobian we’ll be able to translate forces that we want to apply to the end-effector into joint torques that we want to apply to the arm motors. Since we can’t control applied force to the motors though, and have to pass in desired angle positions, we’re going to do a hack approximation. Let’s first transform our forces from end-effector space into a set of joint angle torques:

\textbf{u} = \textbf{J}^T \; \textbf{u}_\textbf{x}.

To approximate the control then we’re simply going to take the current set of joint angles (which we know because it’s whatever angles we last told the system to move to) and add a scaled down version of \textbf{u} to approximate applying torque that affects acceleration and then velocity.

\textbf{q}_\textrm{des} = \textbf{q} + \alpha \; \textbf{u},

where \alpha is the gain term, I used .001 here because it was nice and slow, so no crazy commands that could break the servos would be sent out before I could react and hit the cancel button.

What we want to do then to implement operational space control here then is find the current (x,y,z) position of the end-effector, calculate the difference between it and the target end-effector position, use that to generate the end-effector control signal u_x, get the Jacobian for the current state of the arm using the function above, find the set of joint torques to apply, approximate this control by generating a set of target joint angles to move to, and then repeat this whole loop until we’re within some threshold of the target position. Whew.

So, a lot of steps, but pretty straight forward to implement. The method I wrote to do it looks something like:

def move_to_xyz(self, xyz_d):
    np.array xyz_d: 3D target (x_d, y_d, z_d)
    count = 0
    while (1):
        count += 1
        # get control signal in 3D space
        xyz = self.calc_xyz()
        delta_xyz = xyz_d - xyz
        ux = self.kp * delta_xyz

        # transform to joint space
        J = self.calc_jacobian()
        u = np.dot(J.T, ux)

        # target joint angles are current + uq (scaled)
        self.q[...] += u * .001
        self.robot.move(np.asarray(self.q.copy(), 'float32'))

        if np.sqrt(np.sum(delta_xyz**2)) < .1 or count > 1e4:

And that is it! We have successfully hacked together a system that can perform operational space control of a 6DOF robot arm. Here is a very choppy video of it moving around to some target points in a grid on a cube.
So, granted I had to drop a lot of frames from the video to bring it’s size down to something close to reasonable, but still you can see that it moves to target locations super fast!

Alright this is sweet, but we’re not done yet. We don’t want to have to tell the arm where to move ourselves. Instead we’d like the robot to perform target tracking for some target LED we’re moving around, because that’s way more fun and interactive. To do this, we’re going to use spiking cameras! So stay tuned, we’ll talk about what the hell spiking cameras are and how to use them for a super quick-to-setup and foolproof target tracking system in the next exciting post!

Tagged , , , , , , ,

Operational space control of 6DOF robot arm with spiking cameras part 1: Getting access to the arm through Python

From June 9th to the 19th we ran a summer school (brain camp) in our lab with people from all over to come and learn how to use our neural modeling software, Nengo, and then work on a project with others in the school and people from the lab. We had a bunch of fun hardware available, from neuromorphic hardware from the SpiNNaker group over at Cambridge to lego robots on omni-wheels to spiking cameras (i’ll discuss what a spiking camera is exactly in part 3 of this series) and little robot arms. There were a bunch of awesome projects, people built neural models capable of performing a simplified version of the Wisconsin card sorting task that they then got running on the SpiNNaker boards, a neural controller built to move a robot leech, a spiking neurons reinforcement-learning system implemented on SpiNNaker with spiking cameras to control the lego robot that learned to move towards green LEDs and avoid red LEDs, and a bunch of others. If you’re interested in participating in these kinds of projects and learning more about this I highly suggest you apply to the summer school for next year!

I took the summer school as a chance to break from other projects and hack together a project. The idea was to take the robot arm, implement an operational space controller (i.e. find the Jacobian), and then used spiking cameras to track an LED and have the robot arm learn how to move to the target, no matter where the cameras were placed, by learning the relationship between where the target is in camera-centric coordinates and arm-centric coordinates. This broke down into several steps: 1) Get access to the arm through Python, 2) derive the Jacobian to implement operational space control, 3) sample state space to get an approximation of the camera-centric to arm-centric function, 4) implement the control system to track the target LED.

Here’s a picture of the set-up during step 3, training.
So in the center is the 6DOF robot arm with a little LED attached to the head, and highlighted in orange circles are the two spiking cameras, expertly taped to the wall with office-grade scotch tape. You can also see the SpiNNaker board in the top left as a bonus, but I didn’t have enough time to involve it in this project.

I was originally going to write this all up as one post, because the first two parts are re-implementing previous posts, but even skimming through those steps it was getting long and I’m sure no one minds having a few different examples to look through for generating Cython code or deriving a Jacobian. So I’m going to break this into a few parts. Here in this post we’ll work through the first step (Cython) of our journey.

Get access to the arm through Python

The arm we had was the VE026A Denso training robot, on loan from Dr. Bryan Tripp of the neuromorphic robotics lab at UW. Previously an interface had been built up by one of Dr. Tripp’s summer students, written in C. C is great and all but Python is much easier to work with, and the rest of the software I wanted to work you know what I’m done justifying it Python is just great so Python is what I wanted to use. The end.

So to get access to the arm in Python I used the great ol’ C++ wrapper hack described in a previous post. Looking at Murphy-the-summer-student’s C code there were basically three functions I needed access to:

// initialize threads, connect to robot
void start_robot(int *semid, int32_t *ctrlid, int32_t *robotid, float **angles)
// send a set of joint angles to the robot servos
void Robot_Execute_slvMove(int32_t robotid, float j1, float j2, float j3, float j4, float j5, float j6, float j7, float j8)
// kill threads, disconnect from robot
void stop_robot(int semid, int32_t ctrlid, int32_t robotid)

So I changed the extension of the file to ‘.cpp’ (I know, I know, I said this was a hack!), fixed some compiler errors that popped up, and then appended the following to the end of the file:

/* A class to contain all the information that needs to
be passed around between these functions, and can
encapsulate it and hide it from the Python interface.

Written by Travis DeWolf (June, 2015)

class Robot {
/* Very simple class, just stores the variables
* needed for talking to the robot, and a gives access
* to the functions for moving it around */

int semid;
int32_t ctrlid;
int32_t robotid;
float* angles;

void connect();
void move(float* angles_d);
void disconnect();

Robot::Robot() { }

Robot::~Robot() { free(angles); }

/* Connect to the robot, get the ids and current joint angles */
/* char* usb_port: the name of the port the robot is connected to */
void Robot::connect()
start_robot(&amp;semid, &amp;ctrlid, &amp;robotid, &amp;angles);
printf("%i %i %i", robotid, ctrlid, semid);

/* Move the robot to the specified angles */
/* float* angles: the target joint angles */
void Robot::move(float* angles)
// convert from radians to degrees
angles[0] * 180.0 / 3.14,
angles[1] * 180.0 / 3.14,
angles[2] * 180.0 / 3.14,
angles[3] * 180.0 / 3.14,
angles[4] * 180.0 / 3.14,
angles[5] * 180.0 / 3.14,
angles[6] * 180.0 / 3.14,
angles[7] * 180.0 / 3.14);

/* Disconnect from the robot */
void Robot::disconnect()
stop_robot(semid, ctrlid, robotid);

int main()
Robot robot = Robot();
// connect to robot

// move robot to some random target angles
float target_angles[7] = {0, np.pi / 2.0, 0.0, 0, 0, 0, 0};


// disconnect

return 0;

So very simple class. Basically just wanted to create a set of functions to hide some of the unnecessary parameters from Python, do the conversion from radians to degrees (who works in degrees? honestly), and have a short little main function to test the creation of the class, and connection, movement, and disconnection of the robot. Like I said, there were a few compiler errors when switching from C to C++, but really that was the only thing that took any time on this part. A few casts and everything was gravy.

The next part was writing the Cython pyRobot.pyx file (I describe the steps involved in this in more detail in this post):

import numpy as np
cimport numpy as np

cdef extern from "bcap.cpp":
cdef cppclass Robot:
void connect()
void move(float* angles)
void disconnect()

cdef class pyRobot:
cdef Robot* thisptr

def __cinit__(self):
self.thisptr = new Robot()

def __dealloc__(self):
del self.thisptr

def connect(self):
Set up the connection to the robot, pass in a vector,
get back the current joint angles of the arm.
param np.ndarray angles: a vector to store current joint angles

def move(self, np.ndarray[float, mode="c"] angles):
Step the simulation forward one timestep. Pass in target angles,
get back the current arm joint angles.
param np.ndarray angles: 7D target angle vector

def disconnect(self):
Disconnect from the robot.

the setup.py file:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

name = 'Demos',
cmdclass = {'build_ext': build_ext},


and then compiling!

run setup.py build_ext -i

With all of this working, a nice test.so file was created, and it was now possible to connect to the robot in Python with

import test
rob = test.pyRobot()
target_angles = np.array([0, np.pi/2, 0, np.pi/4, 0, 0, 0, 0], dtype='float32')
import time

In the above code we’re instantiating the pyRobot class, connecting to the robot, defining a set of target angles and telling the robot to move there, waiting for 1 second to give the robot time to actually move, and then disconnecting from the robot. Upon connection we have to pass in a set of joint angles for the servos, and so you see the robot arm jerk into position, and then move to the target set of joint angles, it looks something exactly like this:


Neat, phase 1 complete.

At the end of phase 1 we are able to connect to the robot arm through Python and send commands in terms of joint angles. But we don’t want to send commands in terms of joint angles, we want to just specify the end-effector position and have the robot work out the angles! I’ve implemented an inverse kinematics solver using constrained optimization before for a 3-link planar arm, but we’re not going to go that route. Find out what we’ll do by joining us next time! or by remembering what I said we’d do at the beginning of this post.

Tagged , , ,

Reinforcement Learning part 4: Combining egocentric and allocentric

Alright, in my last post on RL we looked at egocentric learning and what that is. We also saw that both ego and allocentric learning have their strong points, and they compliment each other. So, some combination of these two approaches seems like a good idea. In this post we’re going to look at combining the output from these two kinds of learners and see the strengths of different weightings. Per usual, all the code from this post is up on my Github.

Combining allocentric and egocentric approaches

There are a few different ways that come to mind that we could combine these two learning styles. In each of these cases the overall setup is that we’re going to run two learning systems, one that is allocentric and one that is egocentric, combine their action value mappings for the current state in some way, and then decide which action to take based on their combined output. So here are several possible ways to combine the outputs:

  • Only use allocentric
  • Only use egocentric
  • Average the mappings
  • Dynamically change weights

There are, of course, many more ways to weight these systems, but these are some basic ones and they’re what we’re going to look at here.

For testing each of these these different setups we are going to run the mouse on a map for a set amount of time, measure the wins, accidental deaths, and intentional deaths, and then gather statistics (mean value and confidence intervals) across 10 trials.

A side note: The rate of exploration (epsilon value) starts out at .1 and is decreased by half every 100,000 simulation time step, so there are a number of jumps where you see the number of wins go up every 100,000, and that’s an artifact of the rate of exploration dropping (i.e. increasing your exploitation of what you already know about the world).

OK, we’re going to do this across 3 different maps, where the size and complexity increases with each map.

Map 1: the basic cliff map (which should look familiar):

Map 2: a middlesized map with several areas that are identical for the egocentric state:

and Map 3: a pretty large map with even more ambiguous areas:



Now, what we expect from our learning systems is that the egocentric learner with have very few ‘suicide’ deaths, where it’s slow to learn that jumping off a cliff is a terrible idea. But on the flip side of that, since we’re going to set the gamma value of the egocentric learner to 0, (which means that it doesn’t incorporate anything other than the immediate reward into it’s learning, as discussed in the previous post), we expect that it will only rarely find the goal. Also, as the map gets larger, we expect that the goal will be found less and less, as it will have to randomly stumble around avoiding cliffs until it ends up in sight of the cheese.

From the allocentric learner we expect the death rate to be much higher than the egocentric learner, but we also expect the success rate to be significantly higher, as it will be able to identify how to move based on its current (x,y) position, rather than based on its immediate surroundings. As the map gets larger though, we expect the number of deaths to be larger, since there is no transfer of knowledge that jumping off a cliff in this position is as bad as jumping off a cliff in that position.

Alright! Let’s get to it!


Only use allocentric

Not much to explain about this, the mouse is going to be learning Q-values where each state is its (x,y) position on the map. The bigger the map, the more deaths we expect this learner to have. So, here are the results:

Alright, so one of the most salient things in these plots is the stepping up of the number of wins at each of the 100,000 marks. Again, this is due to a transfer from exploration to exploitation of what’s already been learned as the epsilon value is decreased. And, also, as the map gets bigger, the number of wins takes longer to max out, but it’s pretty interesting how consistent the learning is (as seen by the very small confidence intervals on the runs) across trials given the random exploration.
Another point of note is that the number of deaths goes way up with the size of the map, as expected.

Only use egocentric

Again, this one is another straightforward case, the mouse is going to be learning Q-values where each state is its set of immediate neighbors (e.g. cliff to the right, cheese in front, empty cells left and behind). Here we expect that it will have a minimal number of deaths (not much more than a death for each of the situations it can encounter a cliff or cliffs in), and that the number of successes should be pretty low and very inconsistent, since it will be entirely based on randomly stumbling into a spot near the cheese while avoiding cliffs. Additionally, we expect the number of results to drop with increases in the size of the map. The results are below:

The results are as expected, lots of variation in the wins number, low suicide numbers, and the number of wins decreases with map sizes. Something that’s kind of interesting here is that you don’t see any real effects of the epsilon value as it decreases as you do in the egocentric case. I believe that this is because it achieves a balance of exploration that throws you off a cliff and exploration that moves you into the vicinity of the cheese. So it’s dying less when the exploration rate drops, but it also has less chance of reaching the cheese now. That’s my guess, at least. Alright! Next.

Average the allo and ego Q-values

Here we’re going to do the most basic combination of the allo and ego-centric learning possible. Both learners will run simultaneously, and generate a set of Q-values over the set of possible actions at each state, and we’ll average the output of both then choose the max Q-value to choose which action is taken. The hope here is that the ego-centric learner will very quickly learn not to jump off cliffs, and the allocentric learner will more slowly learn how to move in the map to get cheese. So we expect few deaths and more wins. Let’s look at the results:

Damn! So now we have the same (or at least very similar) win rate as the allocentric learner (awesome), and the same death rate as the egocentric learner (also awesome)! We have achieved some sort of super-human…or, super-mouse mouse. A super mouse. This is really cool, it tells us that these two systems really do compliment each other, and that it can be incredibly straightforward to implement a combination of the two.

Dynamic weighting

Here we’ll look at the results from a variation of an algorithm I’ve implemented based on this really interesting paper by Dr. Sakya Dasgupta et al where they got a lot of really neat results implemented on a little robot critter. We can’t quite use the algorithm as presented here for a couple of reasons.

In the paper, the authors design the algorithm for weighting the output from classical conditioning and operant conditioning systems. I’m taking these to be pretty close to analogous to an ego-centric learner (with no lookahead) and an allo-centric learner, respectively. The idea being that classical conditioning is based on immediate associations developed from rewards, and operant conditioning allows the longer-term associations to be formed. Specifically, in operant you learn to associate the conditioned stimulus with the unconditioned stimulus, which leads to a a conditioned response. The classic example being the dog salivating when hearing the bell, because it knows food is coming. Here the allocentric learner is doing a similar thing, using a look-ahead to start associating stimuli (in this case different states) with a reward or punishment. So, this could be a wildly inappropriate application of this algorithm, and the comparison definitely warrants further discussion.

The more immediate reason for varying the algorithm is because the output of their system is a continuous ‘left/right’ decision, and the output from the mouse here is a discrete ‘left/right/up/down’ decision. So the systems in the paper output a number from -1 to 1 that indicates which way to turn, and this is used in the update. So the algorithm needs to be changed up a little.

First, here’s the original algorithm:

weight_allo += eta * reward * [output_allo - filtered_output_allo] * output_ego
weight_ego += eta * reward * [output_ego - filtered_output_ego] * output_allo

where eta is the learning rate, reward is the immediate reward of the state you move into, output_ is the -1 to 1 value from the learners, and filtered_output_ is a low pass filtered version of the output. So the first part is standard, learning rate multiplied by the reward, and then the part inside the brackets is calculating a derivative of the output of the learners. This is again standard, did your learner change? If it did change and a reward was received, then increase the weight of that learner. The last term, which multiplies it all by the output from the other learner is for scaling.

And then normalize the weights relative to each other, and it isn’t explicitly stated in the paper, but also lower bound the weights at 0. The reason you need to do this lower bounding is that if you don’t, then when you use the weight as a gain on the learner output all of the Q-values are inverted! Not easy to recover from that.

The dynamic weighting equations that I’m using are:

weight_allo += eta * reward * dq_allo * output_ego
weight_ego += eta * reward * dq_ego * output_allo

and then lower bound at 0 and normalize.

There is still a lot in these equations that I’m unsure about (for example scaling by the output of the other learner instead of by the other weight, and how the signs of the reward, derivative, and other learner output all interact), but I’m going to leave delving into that for a future post, since this one is getting long already and I don’t have a full understanding of it yet. But! I have enough of an understanding to implement the above and run simulations! Here are the results:

I should note that the learning rate here for the weight equations is 5e-4. When you decrease this weight you get improved performance, but the best I’ve gotten is just reducing it to a point that it’s pretty much constantly using .5 for each. And that’s boring.

Alright, comparing to basic averaging here we notice that we have about the same performance for the small map, and then on the bigger maps significantly worse performance and also a lot more variability between runs. Basically what seems to happen is that the system quickly identifies that the allocentric learner should not have control at the beginning, because it keeps killing the mouse, and then it takes a lot longer to start handing control back over from the egocentric learner to the allocentric learner.


Alright, so that’s all! Bottom line is that allocentric and egocentric learning really work well together, and basic averaging of these gives great results. Whether this is the best means of combining them is unclear, however. This was just a really basic test of different combinations, and there is some really neat work looking at different ways to do it. It’s super fun to mess around with and there’s a ton of room for experimentation. In the code up on my Github, the code for weight learning can be found by searching for weight_learning. If anyone plays around and figures out a kick-ass dynamic weighting algorithm please pass it along!

Tagged , , ,