A few months ago I posted on Linear Quadratic Regulators (LQRs) for control of non-linear systems using finite-differences. The gist of it was at every time step linearize the dynamics, quadratize (it could be a word) the cost function around the current point in state space and compute your feedback gain off of that, as though the dynamics were both linear and consistent (i.e. didn’t change in different states). And that was pretty cool because you didn’t need all the equations of motion and inertia matrices etc to generate a control signal. You could just use the simulation you had, sample it a bunch to estimate the dynamics and value function, and go off of that.

The LQR, however, operates with maverick disregard for changes in the future. Careless of the consequences, it optimizes assuming the linear dynamics approximated at the current time step hold for all time. It would be really great to have an algorithm that was able to plan out and optimize a sequence, mindful of the changing dynamics of the system.

This is exactly the iterative Linear Quadratic Regulator method (iLQR) was designed for. iLQR is an extension of LQR control, and the idea here is basically to optimize a whole control sequence rather than just the control signal for the current point in time. The basic flow of the algorithm is:

- Initialize with initial state and initial control sequence .
- Do a forward pass, i.e. simulate the system using to get the trajectory through state space, , that results from applying the control sequence starting in .
- Do a backward pass, estimate the value function and dynamics for each in the state-space and control signal trajectories.
- Calculate an updated control signal and evaluate cost of trajectory resulting from .
- If then we've converged and exit.
- If , then set , and change the update size to be more aggressive. Go back to step 2.
- If change the update size to be more modest. Go back to step 3.

There are a bunch of descriptions of iLQR, and it also goes by names like ‘the sequential linear quadratic algorithm’. The paper that I’m going to be working off of is by Yuval Tassa out of Emo Todorov’s lab, called Control-limited differential dynamic programming. And the Python implementation of this can be found up on my github in my Control repo. Also, a big thank you to Dr. Emo Todorov who provided Matlab code for the iLQG algorithm, which was super helpful.

**Defining things**

So let’s dive in. Formally defining things, we have our system , and dynamics described with the function , such that

where is the input control signal. The trajectory is the sequence of states that result from applying the control sequence starting in the initial state .

Now we need to define all of our cost related equations, so we know exactly what we’re dealing with.

Define the *total cost* function , which is the sum of the immediate cost, , from each state in the trajectory plus the final cost, :

Letting , we define the *cost-to-go* as the sum of costs from time to :

The *value* function at time is the optimal cost-to-go from a given state:

where the above equation just says that the optimal cost-to-go is found by using the control sequence that minimizes .

At the final time step, , the value function is simply

For all preceding time steps, we can write the value function as a function of the immediate cost and the value function at the next time step:

NOTE: In the paper, they use the notation to denote the value function at the next time step, which is redundant since , but it comes in handy later when they drop the dependencies to simplify notation. So, heads up: .

**Forward rollout**

The forward rollout consists of two parts. The first part is to simulating things to generate the from which we can calculate the overall cost of the trajectory, and find out the path that the arm will take. To improve things though we’ll need a lot of information about the partial derivatives of the system, calculating these is the second part of the forward rollout phase.

To calculate all these partial derivatives we’ll use . For each we’ll calculate the derivatives of with respect to and , which will give us what we need for our linear approximation of the system dynamics.

To get the information we need about the value function, we’ll need the first and second derivatives of and with respect to and .

So all in all, we need to calculate , , , , , , , where the subscripts denote a partial derivative, so is the partial derivative of with respect to , is the second derivative of with respect to , etc. And to calculate all of these partial derivatives, we’re going to use finite differences! Just like in the LQR with finite differences post. Long story short, load up the simulation for every time step, slightly vary one of the parameters, and measure the resulting change.

Once we have all of these, we’re ready to move on to the backward pass.

**Backward pass**

Now, we started out with an initial trajectory, but that was just a guess. We want our algorithm to take it and then converge to a local minimum. To do this, we’re going to add some perturbing values and use them to minimize the value function. Specifically, we’re going to compute a local solution to our value function using a quadratic Taylor expansion. So let’s define to be the change in our value function at as a result of small perturbations :

The second-order expansion of is given by:

Remember that , which is the value function at the next time step. NOTE: All of the second derivatives of are zero in the systems we’re controlling here, so when we calculate the second derivatives we don’t need to worry about doing any tensor math, yay!

Given the second-order expansion of , we can to compute the optimal modification to the control signal, . This control signal update has two parts, a feedforward term, , and a feedback term . The optimal update is the that minimizes the cost of :

where and

Derivation can be found in this earlier paper by Li and Todorov. By then substituting this policy into the expansion of we get a quadratic model of . They do some mathamagics and come out with:

So now we have all of the terms that we need, and they’re defined in terms of the values at the *next* time step. We know the value of the value function at the final time step , and so we’ll simply plug this value in and work backwards in time recursively computing the partial derivatives of and .

**Calculate control signal update**

Once those are all calculated, we can calculate the gain matrices, and , for our control signal update. Huzzah! Now all that’s left to do is evaluate this new trajectory. So we set up our system

and record the cost. Now if the cost of the new trajectory is less than the cost of then we set and go do it all again! And when the cost from an update becomes less than a threshold value, call it done. In code this looks like:

if costnew < cost: sim_new_trajectory = True if (abs(costnew - cost)/cost) < self.converge_thresh: break

Of course, another option we need to account for is when `costnew > cost`

. What do we do in this case? Our control update hasn’t worked, do we just exit?

The Levenberg-Marquardt heuristic

No! Phew.

The control signal update in iLQR is calculated in such a way that it can behave like Gauss-Newton optimization (which uses second-order derivative information) or like gradient descent (which only uses first-order derivative information). The is that if the updates are going well, then lets include curvature information in our update to help optimize things faster. If the updates aren’t going well let’s dial back towards gradient descent, stick to first-order derivative information and use smaller steps. This wizardry is known as the Levenberg-Marquardt heuristic. So how does it work?

Something we skimmed over in the iLQR description was that we need to calculate to get the and matrices. Instead of using `np.linalg.pinv`

or somesuch, we’re going to calculate the inverse ourselves using singular value decomposition, so that we can regularize it. This will let us do a couple of things. First, we’ll be able to make sure that our estimate of curvature () stays positive definite, which is important to make sure that we always have a descent direction. Second, we’re going to add a regularization term to the singular values to prevent them from exploding when we take their inverse. Here’s our regularization implemented in Python:

U, S, V = np.linalg.svd(Quu) S[S < 0] = 0.0 # no negative values S += lamb # add regularization term Quu_inv = np.dot(U, np.dot(np.diag(1.0/S), V.T))

Now, what happens when we change `lamb`

? The singular values represent the magnitude of each of the left and right singular vectors, and by taking their reciprocal we flip the contributions of the vectors. So the ones that were contributing the least now have the largest singular values, and the ones that contributed the most now have the smallest singular values. By adding a regularization term we ensure that the inverted singular values can never be larger than `1/lamb`

. So essentially we throw out information.

In the case where we’ve got a really good approximation of the system dynamics and value function, we don’t want to do this. We want to use all of the information available because it’s accurate, so make `lamb`

small and get a more accurate inverse. In the case where we have a bad approximation of the dynamics we want to be more conservative, which means not having those large singular values. Smaller singular values give a smaller estimate, which then gives smaller gain matrices and control signal update, which is what we want to do when our control signal updates are going poorly.

How do you know if they’re going poorly or not, you now surely ask! Clever as always, we’re going to use the result of the previous iteration to update `lamb`

. So adding to the code from just above, the end of our control update loop is going to look like:

lamb = 1.0 # initial value of lambda ... if costnew < cost: lamb /= self.lamb_factor sim_new_trajectory = True if (abs(costnew - cost)/cost) < self.converge_thresh: break else: lamb *= self.lamb_factor if lamb > self.max_lamb: break

And that is pretty much everything! OK let’s see how this runs!

**Simulation results**

If you want to run this and see for yourself, you can go copy my Control repo, navigate to the main directory, and run

python run.py arm2 reach

or substitute in `arm3`

. If you’re having trouble getting the `arm2`

simulation to run, try `arm2_python`

, which is a straight Python implementation of the arm dynamics, and should work no sweat for Windows and Mac.

Below you can see results from the iLQR controller controlling the 2 and 3 link arms (click on the figures to see full sized versions, they got distorted a bit in the shrinking to fit on the page), using immediate and final state cost functions defined as:

l = np.sum(u**2)

and

pos_err = np.array([self.arm.x[0] - self.target[0], self.arm.x[1] - self.target[1]]) l = (wp * np.sum(pos_err**2) + # pos error wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error

where `wp`

and `wv`

are just gain values, `x`

is the state of the system, and `self.arm.x`

is the position of the hand. These read as “during movement, penalize large control signals, and at the final state, have a big penalty on not being at the target.”

So let’s give it up for iLQR, this is awesome! How much of a crazy improvement is that over LQR? And with all knowledge of the system through finite differences, and with the full movements in exactly 1 second! (Note: The simulation speeds look different because of my editing to keep the gif sizes small, they both take the same amount of time for each movement.)

Changing cost functions

Something that you may notice is that the control of the 3 link is actually straighter than the 2 link. I thought that this might be just an issue with the gain values, since the scale of movement is smaller for the 2 link arm than the 3 link there might have been less of a penalty for not moving in a straight line, BUT this was wrong. You can crank the gains and still get the same movement. The actual reason is that this is what the cost function specifies, if you look in the code, only penalizes the distance from the target, and the cost function during movement is strictly to minimize the control signal, i.e. .

Well that’s a lot of talk, you say, like the incorrigible antagonist we both know you to be, prove it. Alright, fine! Here’s iLQR running with an updated cost function that includes the end-effector’s distance from the target in the immediate cost:

All that I had to do to get this was change the immediate cost from

l = np.sum(u**2)

to

l = np.sum(u**2) pos_err = np.array([self.arm.x[0] - self.target[0], self.arm.x[1] - self.target[1]]) l += (wp * np.sum(pos_err**2) + # pos error wv * np.sum(x[self.arm.DOF:self.arm.DOF*2]**2)) # vel error

where all I had to do was include the position penalty term from the final state cost into the immediate state cost.

Changing sequence length

In these simulations the system is simulating at .01 time step, and I gave it 100 time steps to reach the target. What if I give it only 50 time steps?

It looks pretty much the same! It’s just now twice as fast, which is of course achieved by using larger control signals, which we don’t see, but dang awesome.

What if we try to make it there in 10 time steps??

OK well that does not look good. So what’s going on in this case? Basically we’ve given the algorithm an impossible task. It can’t make it to the target location in 10 time steps. In the implementation I wrote here, if it hits the end of it’s control sequence and it hasn’t reached the target yet, the control sequence starts over back at

`t=0`

. Remember that part of the target state is also velocity, so basically it moves for 10 time steps to try to minimize distance, and then slows down to minimize final state cost in the velocity term.
**In conclusion**

This algorithm has been used in a ton of things, for controlling robots and simulations, and is an important part of guided policy search, which has been used to very successfully train deep networks in control problems. It’s getting really impressive results for controlling the arm models that I’ve built here, and using finite differences should easily generalize to other systems.

iLQR is very computationally expensive, though, so that’s definitely a downside. It’s definitely less expensive if you have the equations of your system, or at least a decent approximation of them, and you don’t need to use finite differences. But you pay for the efficiency with a loss in generality.

There are also a bunch of parameters to play around with that I haven’t explored at all here, like the weights in the cost function penalizing the magnitude of the cost function and the final state position error. I showed a basic example of changing the cost function, which hopefully gets across just how easy changing these things out can be when you’re using finite differences, and there’s a lot to play around with there too.

Implementation note

In the Yuval and Todorov paper, they talked about using backtracking line search when generating the control signal. So the algorithm they had when generating the new control signal was actually:

where was the backtracking search parameter, which gets set to one initially and then reduced. It’s very possible I didn’t implement it as intended, but I found consistently that always generated the best results, so it was just adding computation time. So I left it out of my implementation. If anyone has insights on an implementation that improves results, please let me know!

And then finally, another thank you to Dr. Emo Todorov for providing Matlab code for the iLQG algorithm, which was very helpful, especially for getting the Levenberg-Marquardt heuristic implemented properly.

[…] the more complex 3-link arm, FDSA was faster, so I’m going to stick with it in my LQR and iLQR […]

[…] you remember back to the iLQR I made a little while ago, I was able to change the cost function to […]

Nice tutorial. Thanks for posting it.

I do have a doubt. I tried to implement it and got a bug, so looked up your code, I didn’t understand why you multiplied dt to A and B and to l, lx, lxx, lu etc. Also you have added a identity matrix to A while linearising. Could explain those lines? It would really good if you could comment why you did that in your code.

Hello!

So the dt multiplication and + I is because we’re actually taking the derivative of the state with respect to the previous state. Which is not

dx = Ax + Bu,

but rather

x(t) = x(t-1) + (Ax(t-1) + Bu(t-1))*dt

And that’s where the identity matrix and dt come from! So that part’s in the comments of the code, but the *dt on all the cost function stuff is not commented on at all!

So here the dt lets you normalize behaviour for different time steps (assuming you also scale the number of steps in the sequence). So if you have a time step of .01 or .001 you’re not racking up 10 times as much cost function in the latter case. And if you run the code with 50 steps in the sequence and dt=.01 and 500 steps in the sequence and dt=.001 you’ll see that you get the same results, which is not the case at all when you don’t take dt into account in the cost function!

Hope that helps!

Hi Travis,

Thanks for the reply. I read few other articles and I would like to add something to this article. It would be good if you added this point as well. In the control signal u = -k + K(x*-x), K is actually the Riccatti gains for each step. This is one advantage of using DDP. It not only gives optimal feed forward open loop signal (-k) but also the feedback gain to stabilize it on the required trajectory.

Great article! Helps a lot.

Have you looked into constrained ILQR? I am looking at implementing ILQR for trajectory optimization for a manipulator with joint limits, I could add a cost to force the joints to within joint limits, is there any other alternative?

Hi! Glad that you found it useful, sorry about the crazy response delay, your comment got buried in my inbox. The paper that I reference at the beginning talks about adding in constraints for the control signal, but I haven’t played around with joint constraints. Adding a penalty to the cost function seems like a pretty solid way to handle it, if you didn’t want to do that one thing that you could do is add a force in that would act essentially as another part of the dynamics that the iLQR has to optimize over.

I have a post about avoiding obstacles with the body here: https://studywolf.wordpress.com/2016/11/24/full-body-obstacle-collision-avoidance/

In the paper that I link there they talk about how to extend that to create a control signal that pushes you away from the joint limits. So you could add that signal in over top of the iLQR signal, kind of lumping it in to the environment dynamics for the iLQR to optimize over. That might be another way to go about it!

This tutorial was incredibly helpful. I’ve been looking at iLQR for a while and could not find a good paper that provided good intuitions, until I came across your blog post and looked at the paper on which your post is based.

I also had a fair bit of difficulty working out the details enough to implement it, glad you found the post helpful! 😀

Thanks a lot for sharing the clear explanation and code. Much appreciated!

I was wondering if you ever attempted to implement Belief Space ILQG from “Motion Planning under Uncertainty using Iterative Local Optimization in Belief Space”, Van den Berg et al. IJRR 2012. I was working on a matlab implementation from scratch so your blog has been quite helpful 🙂

I have not! Although it looks interesting, added to the reading list.

If you finish your implementation and able to share that would be awesome!

(You might also consider publishing the implementation at ReScience.)

Glad you found it useful! 🙂

I have just gotten a basic implementation working in Matlab at https://github.com/sauravag/bsp-ilqg

Its still very much a work in progress. Used Yuval Tassa’s iLQG solver Matlab implementation, had to modify the params to suit this application. Working on better collision checking and possibly vectorization to speed things up.

Awesome! I’m looking forward to getting the chance to sit down and work through it! Thanks for the link! 😀

I think it would be likely to need to change U = {u0, u1, …, uN} to U = {u0, u1, …, uN-1}.

Thanks for your useful post.

Ah, thank you for the catch! Quite correct, I’ve updated it 🙂

Hi Travis,

Sorry if this is a stupid question, the algorithm you implemented in this post is actually iLQG(iterative linear quadratic gaussian) algorithm right ? But the paper you referenced in the later part of the post is the derivation of the iLQR(iterative linear quadratic regulator) algorithm right ? Based on the paper I read, iLQG is actually built upon iLQR, they are not exactly the same.

Hi! A good question!

I implemented this pretty much straight from ‘Control-limited differential dynamic programming’ by Yuval Tassa, Nicolas Mansard, and Emo Todorov (2014), where they’re implementing iLQR.

And you’re right, iLQR and iLQG are not the same! Todorov talks about the differences in the paper where they introduced iLQG: http://maeresearch.ucsd.edu/skelton/publications/weiwei_ilqg_CDC43.pdf

Possibly the confusion is that I mentioned I used iLQG code provided by Dr. Todorov to get this implementation going?

Does that help clear things up?

Hi, thanks for your post.

I have the problem about the code in ilqr.py controller in your github.

the detail is that:

The function dif_end(self, x) to compute the l_x and l_xx

and the why we define the l_u,l_uu,l_ux

l_u = 2 * u

l_uu = 2 * np.eye(dof)

l_ux = np.zeros((dof, num_states))

can your explation that.

thank you in advance

Hello!

this function computes the derivative of the cost function, l.

l_u is the derivative of l with respect to u, so d(u^2)/du = 2 * u

l_uu is the second derivative of l with respect to u, so d^2(u^2)/du^2 = d(2*u)/du = 2 * I

l_ux is the derivative of the derivative of l with respect to u with respect to x, so d^2(u^2)/dux = d(2*u)/dx = 0

does that help?

yes, Thank you for your reply，how about I_x and I_xx, because I really dont know the function of dif_end(self, x) . In cost_final(), you will update the l_x and l_xx using dif_end() .

so the subscript here in each case denotes ‘derivative with respect to’, so l_x is the derivative of l with respect to u, l_xx second derivative of l wrt to x. all of these functions are used to help see how changes in x and u will affect the cost function. this information is then used in updating the control signal to maximize reward. cost_final computes the cost of the state that the system ended up in at the end of the control sequence. l_x and l_xx returned from cost_final give you information on how changes in x will affect the final state cost, l.

sorry about the poor commenting in dif_end! That first part, with xe, all that’s being computed is the end-effector position. It’s a really convoluted (sorry) way of calculating (target – ee), where ee is the end effector position, with

`x = l0 * cos(q0) + l1 * cos(q0 + q1) + l2 * cos(q0 + q1 + q2)`

and`y = l0 * sin(q0) + l1 * sin(q0 + q1) + l2 * sin(q0 + q1 + q2)`

.and then the second part is computing the derivative of (target – ee)^2, which is the position part of the cost function, and the entire final state cost function.

Hope that clears it up!

In your second paragraph, you mention that vanilla LQR only optimizes for the current time step. I don’t think that’s true. Infinite-horizon LQR optimizes for all time in the future. The problem is not that LQR is myopic, it’s that it assumes a linear dynamics model.

Ah, you are correct! I meant to say that it’s assuming constant linear dynamics, which have been approximated at the current time step, but that was poorly communicated. I’ve reworked the paragraph hopefully it’s clear / correct now!

Shouldn’t the Quu_inv be Quu_inv = np.dot(V.T, np.dot(np.diag(1.0/S), U.T)) ?

svd returns U,S,V and the original matrix can be obtained by U.dot(np.diag(S)).dot(V)

First, thanks for the tutorial, it has been useful on a number of occasions. When using iLQR and TV-LQR I have often had an issue with the value function diverging. I recently saw an implementation of iLQG that did Vxx = 0.5*(Vxx + Vxx’) at the end of each backward pass update step. After implementing the same thing it seems to fix my convergence issues. Is this an okay thing to do?

Hello! Thanks glad you’ve found it useful! 🙂

Hmm, what situations are you finding it’s diverging? Running it with Vxx = .5*(Vxx+Vxx’) is adding a filter to the value function, so it would smooth out any quick dips or spikes, essentially averaging over an area of state space if I’m thinking about it correctly. Which seems fine to me to do in noisier systems or where the value function is some crazy shape with a manifold that makes it difficult to follow the gradient…