Recently in my posts I’ve been using finite differences to approximate the gradient of loss functions and dynamical systems, with the intention of creating generalizable controllers that can be run on any system without having to calculate out derivatives beforehand. Finite differences is pretty much the most straight-forward way of approximating a gradient that there is: vary each parameter up and down (assuming we’re doing central differencing), one at a time, run it through your function and estimate the parameters effect on the system by calculating the difference between resulting function output. To do this requires 2 samples of the function for each parameter.
But there’s always more than one way to peel an avocado, and another approach that’s been used with much success is the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm, which was developed by Dr. James Spall (link to overview paper). SPSA is a method of gradient approximation, like finite differences, but, critically, the difference is that it varies all of the parameters at once, rather than one at a time. As a result, you can get an approximation of the gradient with far fewer samples from the system, and also when you don’t have explicit control over your samples (i.e. the ability to vary each parameter one at a time).
Given some function dependent on some set of parameters , we’re used to finding the gradient of using FDSA (finite differences stochastic approximation) written in this form:
where is a perturbation to the parameter set, and the subscript on the left-hand side denotes the derivative of with respect to .
And that’s how we’ve calculated it before, estimating the gradient of a single parameter at a time. But, we can rewrite this for a set perturbations :
which works as long as is square. When it’s not square (i.e. we have don’t have the same number of samples as we have parameters), we run into problems, because we can’t calculate directly. To address this, let’s take a step back and then work forward again to get a more general form that works for non-square too.
By rewriting the above, and getting rid of the inverse by moving back to the other side, we have:
Now, the standard trick to move a matrix that’s not square is to just make it square by multiplying it by its transpose to get a square matrix, and then the whole thing by the inverse:
Alright! Now we’re comfortable with this characterization of gradient approximation using a form that works with non-square perturbation matrices.
Again, in FDSA, we only vary one parameter at a time. This means that there will only ever be one non-zero entry per row of . By contrast, in SPSA, we vary multiple parameters, and so rows of will be just chalk full of non-zero entries.
Gradient approximation to estimate and for LQR control
This paper, by Drs. Jan Peters and Stepfan Schaal, is where I first stumbled across the above formulation of gradient approximation and read about SPSA (side note: I really recommend reading instead the Scholarpedia article on policy gradients, because it has fixes to a couple notation mistakes from the paper). Reading about this I thought, oh interesting, an alternative to FDSA for gradient approximation, let’s see how it well it does replacing FDSA in a linear quadratic regulator (LQR).
Implementing this was pretty simple. Just had to modify the
calc_derivs function, which I use to estimate the derivative of the arm with respect to the state and control signal, in my LQR controller code by changing from standard finite differences to simultaneous perturbation:
def calc_derivs(self, x, u): """" calculate gradient of plant dynamics using Simultaneous Perturbation Stochastic Approximation (SPSA). Implemented based on (Peters & Schaal, 2008). x np.array: the state of the system u np.array: the control signal """ # Initialization and coefficient selection num_iters = 20 eps = 1e-4 delta_K = None delta_J = None for ii in range(num_iters): # Generation of simultaneous perturbation vector # choose each component from a Bernoulli +-1 distribution # with probability of .5 for each +-1 outcome. delta_k = np.random.choice([-1,1], size=len(x) + len(u), p=[.5, .5]) # Function evaluations inc_x = np.copy(x) + eps * delta_k[:len(x)] inc_u = np.copy(u) + eps * delta_k[len(x):] state_inc = self.plant_dynamics(inc_x, inc_u) dec_x = np.copy(x) - eps * delta_k[:len(x)] dec_u = np.copy(u) - eps * delta_k[len(x):] state_dec = self.plant_dynamics(dec_x, dec_u) delta_j = ((state_inc - state_dec) / (2.0 * eps)).reshape(-1) # Track delta_k and delta_j delta_K = delta_k if delta_K is None else \ np.vstack([delta_K, delta_k]) delta_J = delta_j if delta_J is None else \ np.vstack([delta_J, delta_j]) f_xu = np.dot(np.linalg.pinv(np.dot(delta_K.T, delta_K)), np.dot(delta_K.T, delta_J)) f_x = f_xu[:len(x)] f_u = f_xu[len(x):] return f_x.T , f_u.T
A couple notes about the above code. First, you’ll notice that the
f_b matrices are both calculated at the same time. That’s pretty slick! And that calculation for
f_xu is just a straight implementation of the matrix form of gradient approximation, where I’ve arranged things so that
f_x is in the top part and
f_u is in the lower part.
The second thing is that the perturbation vector
delta_k is generated from a Bernoulli distribution. The reason behind this is that we want to have a bunch of different samples that pretty reasonably spread the state space and move all the parameters independently. Making each perturbation some distance times -1 or 1 is an easy way to achieve this.
Thirdly, there’s the
num_iters variable. This is a very important variable, as it dictates how many random samples of our system we take before we estimate the gradient. I’ve found that to get this to work for both the 2-link arm and the more complex 3-link arm, it needs to be at least 20. Or else things explode and die horribly. Just…horribly.
OK let’s look at the results:
The first thing to notice is that I’ve finally discovered the Seaborn plotting package. The second is that SPSA does as well as FDSA.
You may ask: Is there any difference? Well, if we time these functions, on my lil’ laptop, for the 2-link arm it takes SPSA approximately 2.0ms, but it takes FDSA only 0.8ms. So for the same performance the SPSA is taking almost 3 times as long to run. Why? This boils down to how many times the system dynamics need to be sampled by each algorithm to get a good approximation of the gradient. For a 2-link arm, FDSA has 6 parameters ( and ) that it needs to sample twice (we’re doing central differencing), for a total of 12 samples. And as I mentioned above, the SPSA algorithm needs 20 samples to be stable.
For the 3-link arm, SPSA took about 3.1ms on average and FDSA (which must now perform 18 samples of the dynamics) still only 2.1ms. So number of samples isn’t the only cause of time difference between these two algorithms. SPSA needs to perform that a few more matrix operations, including a matrix inverse, which is expensive, while FDSA can calculate the gradient of each parameter individually, which is much less expensive.
OK so SPSA not really impressive here. BUT! As I discovered, there are other means of employing SPSA.
Gradient approximation to optimize the control signal directly
In the previous set up we were using SPSA to estimate the gradient of the system under control, and then we used that gradient to calculate a control signal that minimized the loss function (as specified inside the LQR). This is one way to use gradient approximation methods. Another way to use these methods is approximate the gradient of the loss function directly, and use that information to iteratively calculate a control signal that minimizes the loss function. This second application is the primary use of the SPSA algorithm, and is what’s described by Dr. Spall in his overview paper.
In this application, the algorithm works like this:
- start with initial input to system
- perturb input and simulate results
- observe loss function and calculate gradient
- update input to system
- repeat to convergence
Because in this approach we’re iteratively optimizing the input using our gradient estimation, having a noisy estimate is no longer a death sentence, as it was in the LQR. If we update our input to the system with several noisy gradient estimates the noise will essentially just cancel itself out. This means that SPSA now has a powerful advantage over FDSA: Since in SPSA we vary all parameters at once, only 2 samples of the loss function are used to estimate the gradient, regardless of the number of parameters. In contrast, FDSA needs to sample the loss function twice for every input parameter. Here’s a picture from (Spall, 1998) that shows the two running against each other to optimize a 2D problem:
This gets across that even though SPSA bounces around more, they both reach the solution in the same number of steps. And, in general, this is the case, as Dr. Spall talks about in the paper. There’s also a couple more details of the algorithm, so let’s look at it in detail. Here’s the code, which is just a straight translation into Python out of the description in Dr. Spall’s paper:
# Step 1: Initialization and coefficient selection max_iters = 5 converge_thresh = 1e-5 alpha = 0.602 # from (Spall, 1998) gamma = 0.101 a = .101 # found empirically using HyperOpt A = .193 c = .0277 delta_K = None delta_J = None u = np.copy(self.u) if self.u is not None \ else np.zeros(self.arm.DOF) for k in range(max_iters): ak = a / (A + k + 1)**alpha ck = c / (k + 1)**gamma # Step 2: Generation of simultaneous perturbation vector # choose each component from a bernoulli +-1 distribution with # probability of .5 for each +-1 outcome. delta_k = np.random.choice([-1,1], size=arm.DOF, p=[.5, .5]) # Step 3: Function evaluations inc_u = np.copy(u) + ck * delta_k cost_inc = self.cost(np.copy(state), inc_u) dec_u = np.copy(u) - ck * delta_k cost_dec = self.cost(np.copy(state), dec_u) # Step 4: Gradient approximation gk = np.dot((cost_inc - cost_dec) / (2.0*ck), delta_k) # Step 5: Update u estimate old_u = np.copy(u) u -= ak * gk # Step 6: Check for convergence if np.sum(abs(u - old_u)) < converge_thresh: break
The main as-of-yet-unexplained parts of this code are the
c variables. What’s their deal?
Looking inside the loop, we can see that
ck controls the magnitude of our perturbations. Looking a little further down,
ak is just the learning rate. And all of those other parameters are just involved in shaping the trajectories that
ck follow through iterations, which is a path towards zero. So the first steps and perturbations are the biggest, and each successively becomes smaller as the iteration count increases.
There are a few heuristics that Dr. Spall goes over, but there aren’t any hard and fast rules for setting
c. Here, I just used HyperOpt to find some values that worked pretty well for this particular problem.
The FDSA version of this is also very straight-forward:
# Step 1: Initialization and coefficient selection max_iters = 10 converge_thresh = 1e-5 eps = 1e-4 u = np.copy(self.u) if self.u is not None \ else np.zeros(self.arm.DOF) for k in range(max_iters): gk = np.zeros(u.shape) for ii in range(gk.shape): # Step 2: Generate perturbations one parameter at a time inc_u = np.copy(u) inc_u[ii] += eps dec_u = np.copy(u) dec_u -= eps # Step 3: Function evaluation cost_inc = self.cost(np.copy(state), inc_u) cost_dec = self.cost(np.copy(state), dec_u) # Step 4: Gradient approximation gk[ii] = (cost_inc - cost_dec) / (2.0 * eps) old_u = np.copy(u) # Step 5: Update u estimate u -= 1e-5 * gk # Step 6: Check for convergence if np.sum(abs(u - old_u)) < converge_thresh: break
You’ll notice that in both the SPSA and FDSA code we’re no longer sampling
plant_dynamics, we’re instead sampling
cost, a loss function I defined. From just my experience playing around with these algorithms a bit, getting the loss function to be appropriate and give the desired behaviour is definitely a bit of an art. It feels like much more of an art than in other controllers I’ve coded, but that could just be me.
The cost function that I’m using is pretty much the first thing you’d think of. It penalizes distance to target and having non-zero velocity. Getting the weighting between distance to target and velocity set up so that the arm moves to the target but also doesn’t overshoot definitely took a bit of trial and error, er, I mean empirical analysis. Here’s the cost function that I found worked pretty well, note that I had to make special cases for the different arms:
def cost(self, x, u): dt = .1 if self.arm.DOF == 3 else .01 next_x = self.plant_dynamics(x, u, dt=dt) vel_gain = 100 if self.arm.DOF == 3 else 10 return (np.sqrt(np.sum((self.arm.x - self.target)**2)) * 1000 \ + np.sum((next_x[self.arm.DOF:])**2) * vel_gain)
So that’s all the code, let’s look at the results!
For these results, I used a max of 10 iterations for optimizing the control signal. I was definitely surprised by the quality of the results, especially for the 3-link arm, compared to the results generated by a standard LQR controller. Although I need to note, again, that it was a fair bit of me playing around with the exact cost function to get these results. Lots of empirical analysis.
The two controllers generate results that are identical to visual inspection. However, especially in the 3-link arm, the time required to run the FDSA was significantly longer than the SPSA controller. It took approximately 140ms for the SPSA controller to run a single loop, but took FDSA on average 700ms for a single loop of calculating the control signal. Almost 5 times as long! For the same results! In directly optimizing the control signal, SPSA gets a big win over standard FDSA. So, if you’re looking to directly optimize over a loss function, SPSA is probably the way you want to go.
First off, I thought it was really neat to directly apply gradient approximation methods to optimizing the control signal. It’s something I haven’t tried before, but definitely makes sense, and can generate some really nice results when tuned properly. Automating the tuning is definitely I’ll be discussing in future posts, because doing it by hand takes a long time and is annoying.
In the LQR, the gradient approximation was best done by the FDSA. I think the main reasons for this is that in solving for the control signal the LQR algorithm uses matrix inverses, and any errors in the linear approximations to the dynamics are going to be amplified quite a bit. If I did anything less than 10-15 iterations (20 for the 3-link arm) in the SPSA approximation then things exploded. Also, here the SPSA algorithm required a matrix inverse, where the FDSA didn’t. This is because we only varied one parameter at a time in FDSA, and the effects of changing each was isolated. In the SPSA case, we had to consider the changes across all the variables and the resulting effects all at once, essentially noting which variables changed by how much and the changes in each case, and averaging. Here, even with the more complex 3-link arm, FDSA was faster, so I’m going to stick with it in my LQR and iLQR implementations.
In the direct control signal optimization SPSA beat the pants off of FDSA. It was almost 5 times faster for control of the 3-link arm. This was, again, because in this case we could use noisy samples of the gradient of the loss function and relied on noise to cancel itself out as we iterated. So we only needed 2 samples of the loss function in SPSA, where in FDSA we needed 2*
num_parameters. And although this generated pretty good results I would definitely be hesitant against using this for any more complicated systems, because tuning that cost function to get out a good trajectory was a pain. If you’re interested in playing around with this, you can check out the code for the gradient controllers up on my GitHub.