## Reinforcement Learning part 4: Combining egocentric and allocentric

Alright, in my last post on RL we looked at egocentric learning and what that is. We also saw that both ego and allocentric learning have their strong points, and they compliment each other. So, some combination of these two approaches seems like a good idea. In this post we’re going to look at combining the output from these two kinds of learners and see the strengths of different weightings. Per usual, all the code from this post is up on my Github.

Combining allocentric and egocentric approaches

There are a few different ways that come to mind that we could combine these two learning styles. In each of these cases the overall setup is that we’re going to run two learning systems, one that is allocentric and one that is egocentric, combine their action value mappings for the current state in some way, and then decide which action to take based on their combined output. So here are several possible ways to combine the outputs:

• Only use allocentric
• Only use egocentric
• Average the mappings
• Dynamically change weights

There are, of course, many more ways to weight these systems, but these are some basic ones and they’re what we’re going to look at here.

For testing each of these these different setups we are going to run the mouse on a map for a set amount of time, measure the wins, accidental deaths, and intentional deaths, and then gather statistics (mean value and confidence intervals) across 10 trials.

A side note: The rate of exploration (epsilon value) starts out at .1 and is decreased by half every 100,000 simulation time step, so there are a number of jumps where you see the number of wins go up every 100,000, and that’s an artifact of the rate of exploration dropping (i.e. increasing your exploitation of what you already know about the world).

OK, we’re going to do this across 3 different maps, where the size and complexity increases with each map.

Map 1: the basic cliff map (which should look familiar):

Map 2: a middlesized map with several areas that are identical for the egocentric state:

and Map 3: a pretty large map with even more ambiguous areas:

Expectations

Now, what we expect from our learning systems is that the egocentric learner with have very few ‘suicide’ deaths, where it’s slow to learn that jumping off a cliff is a terrible idea. But on the flip side of that, since we’re going to set the gamma value of the egocentric learner to 0, (which means that it doesn’t incorporate anything other than the immediate reward into it’s learning, as discussed in the previous post), we expect that it will only rarely find the goal. Also, as the map gets larger, we expect that the goal will be found less and less, as it will have to randomly stumble around avoiding cliffs until it ends up in sight of the cheese.

From the allocentric learner we expect the death rate to be much higher than the egocentric learner, but we also expect the success rate to be significantly higher, as it will be able to identify how to move based on its current (x,y) position, rather than based on its immediate surroundings. As the map gets larger though, we expect the number of deaths to be larger, since there is no transfer of knowledge that jumping off a cliff in this position is as bad as jumping off a cliff in that position.

Alright! Let’s get to it!

Results

Only use allocentric

Not much to explain about this, the mouse is going to be learning Q-values where each state is its (x,y) position on the map. The bigger the map, the more deaths we expect this learner to have. So, here are the results:

Alright, so one of the most salient things in these plots is the stepping up of the number of wins at each of the 100,000 marks. Again, this is due to a transfer from exploration to exploitation of what’s already been learned as the epsilon value is decreased. And, also, as the map gets bigger, the number of wins takes longer to max out, but it’s pretty interesting how consistent the learning is (as seen by the very small confidence intervals on the runs) across trials given the random exploration.
Another point of note is that the number of deaths goes way up with the size of the map, as expected.

Only use egocentric

Again, this one is another straightforward case, the mouse is going to be learning Q-values where each state is its set of immediate neighbors (e.g. cliff to the right, cheese in front, empty cells left and behind). Here we expect that it will have a minimal number of deaths (not much more than a death for each of the situations it can encounter a cliff or cliffs in), and that the number of successes should be pretty low and very inconsistent, since it will be entirely based on randomly stumbling into a spot near the cheese while avoiding cliffs. Additionally, we expect the number of results to drop with increases in the size of the map. The results are below:

The results are as expected, lots of variation in the wins number, low suicide numbers, and the number of wins decreases with map sizes. Something that’s kind of interesting here is that you don’t see any real effects of the epsilon value as it decreases as you do in the egocentric case. I believe that this is because it achieves a balance of exploration that throws you off a cliff and exploration that moves you into the vicinity of the cheese. So it’s dying less when the exploration rate drops, but it also has less chance of reaching the cheese now. That’s my guess, at least. Alright! Next.

Average the allo and ego Q-values

Here we’re going to do the most basic combination of the allo and ego-centric learning possible. Both learners will run simultaneously, and generate a set of Q-values over the set of possible actions at each state, and we’ll average the output of both then choose the max Q-value to choose which action is taken. The hope here is that the ego-centric learner will very quickly learn not to jump off cliffs, and the allocentric learner will more slowly learn how to move in the map to get cheese. So we expect few deaths and more wins. Let’s look at the results:

Damn! So now we have the same (or at least very similar) win rate as the allocentric learner (awesome), and the same death rate as the egocentric learner (also awesome)! We have achieved some sort of super-human…or, super-mouse mouse. A super mouse. This is really cool, it tells us that these two systems really do compliment each other, and that it can be incredibly straightforward to implement a combination of the two.

Dynamic weighting

Here we’ll look at the results from a variation of an algorithm I’ve implemented based on this really interesting paper by Dr. Sakya Dasgupta et al where they got a lot of really neat results implemented on a little robot critter. We can’t quite use the algorithm as presented here for a couple of reasons.

In the paper, the authors design the algorithm for weighting the output from classical conditioning and operant conditioning systems. I’m taking these to be pretty close to analogous to an ego-centric learner (with no lookahead) and an allo-centric learner, respectively. The idea being that classical conditioning is based on immediate associations developed from rewards, and operant conditioning allows the longer-term associations to be formed. Specifically, in operant you learn to associate the conditioned stimulus with the unconditioned stimulus, which leads to a a conditioned response. The classic example being the dog salivating when hearing the bell, because it knows food is coming. Here the allocentric learner is doing a similar thing, using a look-ahead to start associating stimuli (in this case different states) with a reward or punishment. So, this could be a wildly inappropriate application of this algorithm, and the comparison definitely warrants further discussion.

The more immediate reason for varying the algorithm is because the output of their system is a continuous ‘left/right’ decision, and the output from the mouse here is a discrete ‘left/right/up/down’ decision. So the systems in the paper output a number from -1 to 1 that indicates which way to turn, and this is used in the update. So the algorithm needs to be changed up a little.

First, here’s the original algorithm:

weight_allo += eta * reward * [output_allo - filtered_output_allo] * output_ego
weight_ego += eta * reward * [output_ego - filtered_output_ego] * output_allo

where eta is the learning rate, reward is the immediate reward of the state you move into, output_ is the -1 to 1 value from the learners, and filtered_output_ is a low pass filtered version of the output. So the first part is standard, learning rate multiplied by the reward, and then the part inside the brackets is calculating a derivative of the output of the learners. This is again standard, did your learner change? If it did change and a reward was received, then increase the weight of that learner. The last term, which multiplies it all by the output from the other learner is for scaling.

And then normalize the weights relative to each other, and it isn’t explicitly stated in the paper, but also lower bound the weights at 0. The reason you need to do this lower bounding is that if you don’t, then when you use the weight as a gain on the learner output all of the Q-values are inverted! Not easy to recover from that.

The dynamic weighting equations that I’m using are:

weight_allo += eta * reward * dq_allo * output_ego
weight_ego += eta * reward * dq_ego * output_allo

and then lower bound at 0 and normalize.

There is still a lot in these equations that I’m unsure about (for example scaling by the output of the other learner instead of by the other weight, and how the signs of the reward, derivative, and other learner output all interact), but I’m going to leave delving into that for a future post, since this one is getting long already and I don’t have a full understanding of it yet. But! I have enough of an understanding to implement the above and run simulations! Here are the results:

I should note that the learning rate here for the weight equations is 5e-4. When you decrease this weight you get improved performance, but the best I’ve gotten is just reducing it to a point that it’s pretty much constantly using .5 for each. And that’s boring.

Alright, comparing to basic averaging here we notice that we have about the same performance for the small map, and then on the bigger maps significantly worse performance and also a lot more variability between runs. Basically what seems to happen is that the system quickly identifies that the allocentric learner should not have control at the beginning, because it keeps killing the mouse, and then it takes a lot longer to start handing control back over from the egocentric learner to the allocentric learner.

Conclusions

Alright, so that’s all! Bottom line is that allocentric and egocentric learning really work well together, and basic averaging of these gives great results. Whether this is the best means of combining them is unclear, however. This was just a really basic test of different combinations, and there is some really neat work looking at different ways to do it. It’s super fun to mess around with and there’s a ton of room for experimentation. In the code up on my Github, the code for weight learning can be found by searching for weight_learning. If anyone plays around and figures out a kick-ass dynamic weighting algorithm please pass it along!

## Reinforcement Learning part 3: Egocentric learning

When an agent is learning how to move, it’s learning a set of values to associate with possible actions for each situation it encounters. We say that it’s learning the action values. When the agent learns to associate the values of taking different actions in different states, it is learning the state-action values, also referred to as the Q-values. How the agent identifies ‘a situation’ can be handled in a couple of different ways.

If the agent is performing egocentric learning, then it distinguishes situations based on it’s immediately surrounding environment. If the agent is perform allocentric learning, then it has an internal map of the environment, and each different position on the map is considered a different situation. Whether an agent is using allocentric or egocentric learning ends up influencing its behavior quite a bit, and both have their pros and cons. We’ve some of both ego and allo-centric in the previous RL posts; in this post we’re going to examine egocentric learning specifically a little closer. And the code for this post can be found up on my GitHub.

Difference between ego and allo-centric

Let’s look at when a new set of action values is learned in these two cases. Going back to the ever wonderful mouse and cheese scenario, we’ll create a map for our mouse to learn to navigate.

In egocentric learning, each situation is determined by what is near the mouse, and so new action values for each of the possible movements ({W,NW,N,NE,E,SE,S,SW}) have to be learned whenever the immediate environment is not something the agent has seen before. For example, the mouse keeps a different set of action values for each of the following environments:

On the other hand, in the case of allocentric learning, each situation is determined by the (x,y) coordinates of the mouse’s current position. That means that the mouse keeps a different set of action values for every possible (x,y) position:

so even if the environment in each position is the same:

with different (x,y) coordinates the mouse maintains different action value maps for all of them.

With egocentric learning, the agent tends to develop immediately beneficial behavior more quickly, to avoid running off cliffs and to move towards cheese when it’s in sight. The agent can even learn to move towards cheese that’s outside of the range of its sensors in simple maps, essentially by learning heuristics like ‘always moving right leads to reward.’ But this falls apart if there are several areas in the map where the immediate environment is the same, as in the case above with states 2 and 3.

With allocentric learning, the agent develops action value mappings that are much better suited to the overall environment and long-term goals (e.g. if the cheese is always in the center of the map, if you are in the bottom-right move up and left, and if you are in the top-left move down and right), but you can’t generalize across local situations (for example when there is a cliff one space to the right), the consequences of movement at each (x,y) position must be learned independently.

Clearly a combination of these two learning types is desirable, but that is for a future post! There is still a lot to examine in ego-centric learning, and so we’re going to dig deeper.

In allocentric learning, you don’t often perform an action, successfully move, and then still wind up in the same state. This makes transmitting information about rewards received in the future backwards to states that are still far away fairly stable, because a specific state is some constant distance away from getting reward. In egocentric learning, however, this isn’t the case. So when you start trying to propagate reward information backwards you can get some unstable behavior; let’s look at a very basic scenario:

There’s the mouse in blue, on the left, and the reward, in green, on the right, and red represents cliffs / certain death. So now, whenever the mouse isn’t beside one of the walls he will be in one of three states, cliffs above, nothing around him, or cliffs below:

OK. Now let’s look at the update formula for Q-learning again:

Q(s, a) += alpha * (reward(s') + gamma * max(Q(s')) - Q(s, a))

where s is the previous system state, a is the action just taken, s' is the current state, reward(s') is the reward for being in the current state, and max(Q(s')) is the maximum reward obtainable by performing an action in the current state.

Informally, it says that the change to make to the Q-value for the action just taken, Q(s, a), is equal to your learning rate, alpha, multiplied by the reward for the state that you moved into,reward(s'), plus the lookahead reward,max(Q(s')), which relays the expected reward now that you’re in state s', minus the previous Q-value,Q(s, a).

Let’s break this down a bit more, because this can be complicated to think about still. If we set gamma = 0, then our update is

Q(s, a) += alpha * (reward(s') - Q(s, a)).

Here it’s more apparent what the dynamics are doing, they’re driving the Q-value to the reward value of the state you end up in. So you take an action a in state s, you find the reward for where you ended up, and you learn the reward for performing action a in state s. Now if gamma is not equal to 0, then you can think of the Q-value dynamics as trying to drive the Q-values to

reward(s') + gamma * max(Q(s'))

The larger that gamma is, the farther into the future you look. This can lead to unstable / non-convergent systems when you repeatedly move into the same state again and again. For example, if you’re in State 2 from above, and you move right often enough you’ll end up in a state that leads to getting a reward. This boosts the state before the reward, and now the next time you happen to move to a state where the reward is next to you, the Q-values of State 2 are now updated. And so you get your reward and you reset. Now you move into State 2 again and hey! Moving right leads to a reward. So you move right, but now you’re back in State 2 again! And so the reward for moving right decreases a bit. But it’s still the best decision, so you move right. And you’re in State 2 again, and so the reward for moving right goes down again, etc etc. Eventually moving right brings you to the target though and the Q-value for moving right gets a boost. So it’s going to forever be fluctuating.

In this example, it doesn’t seem like a big deal, because the mouse still learns to move towards the cheese and the Q-values have entered a kind of rhythmic limit cycle. There are a few problems that arise, however, when counting the future to heavily in egocentric learning.

Problem: Death-wish

Basically, the mouse decides that certain death is a better option than wandering around in uncertainty forever and starts deciding to commit suicide. Let’s say that to encourage exploration, the reward for being in an unoccupied square is -1, and to encourage not dying the reward for jumping off a cliff is -10. The problem arises in that jumping off the cliff is a constant punishment, -10. It’s always -10 and it will never be any worse than -10. There is no further future past death to influence the reward. When the mouse is just exploring, however, it will explore, and get -1, and then explore more, and get -1, etc etc. This causes the expected reward of actions that lead you into the same state keep dropping. If the weight of gamma is too large, then without all that much exploring the reward for exploring has suddenly dropped to less than -10, and the mouse decides that continuing to explore is worse than death, and jumps.

By setting gamma = 5 (which is pretty high) we then get this behavior very quickly, here are some simulation results, where the first column is the timestep, the second column W denotes how many times the mouse found the cheese, the third column L denotes how many times the mouse chose to jump off a cliff (as dictated by its Q-values), and the last column A denotes how many times random exploration led the mouse to its accidental death:

10000, W: 0, L: 3138, A: 51
20000, W: 0, L: 3207, A: 40
30000, W: 0, L: 3204, A: 40
40000, W: 0, L: 3205, A: 45
50000, W: 0, L: 3208, A: 43
60000, W: 0, L: 3198, A: 49
70000, W: 0, L: 3200, A: 49
80000, W: 0, L: 3207, A: 46
90000, W: 0, L: 3204, A: 46
100000, W: 0, L: 3214, A: 38
110000, W: 0, L: 3211, A: 37
120000, W: 0, L: 3203, A: 44
130000, W: 0, L: 3199, A: 45
140000, W: 0, L: 3225, A: 28
150000, W: 0, L: 3211, A: 39
160000, W: 0, L: 3196, A: 45
170000, W: 0, L: 3208, A: 46
180000, W: 0, L: 3203, A: 47
190000, W: 0, L: 3209, A: 38
200000, W: 0, L: 3216, A: 38


So pretty immediately the mouse gets into a dark place, and starts choosing to jump, without ever having found the cheese even once.

Problem: Confusion

Another problem is that by heavily weighting the immediate future the mouse can get easily confused again about what was good or how it actually found the cheese and got the big reward in the first place. Once State 2 has rewarded moving right, and the mouse moves left, there is now a new max(Q(s')) value for State 2 (the reward for moving right), and so the reward for moving left is now brought up. At the same time, the reward for moving right is dropping every time it moves right and doesn’t get rewarded (as described above). Same with moving up and down, so very quickly the Q-values can be flooded with high values and the optimal direction of movement is lost.

We can see this case by setting gamma = .8. Now, the mouse will find the cheese and learns to move right, but there are also these blips of confusion where it forgets how to find the cheese,

10000, W: 0, L: 4, A: 78
20000, W: 0, L: 0, A: 78
30000, W: 87, L: 0, A: 67
40000, W: 810, L: 0, A: 16
50000, W: 841, L: 0, A: 1
60000, W: 843, L: 0, A: 2
70000, W: 847, L: 0, A: 3
80000, W: 816, L: 0, A: 10
90000, W: 667, L: 0, A: 17
100000, W: 0, L: 0, A: 68
110000, W: 0, L: 0, A: 68
120000, W: 34, L: 0, A: 68
130000, W: 748, L: 0, A: 43
140000, W: 853, L: 0, A: 0
150000, W: 846, L: 0, A: 2
160000, W: 847, L: 0, A: 3
170000, W: 856, L: 0, A: 2
180000, W: 853, L: 0, A: 2
190000, W: 120, L: 0, A: 62
200000, W: 0, L: 0, A: 70


So it’s getting there, but then it gets confused and forgets how to find it, then it remembers, then it forgets…etc.

Solution: Have a healthy fear of death

If instead now we set gamma = .05, the mouse has a much stronger repulsion to imminent death than to ennui for a much longer time, and will go around exploring things (note that this is the same as changing the reward for dying to be a much higher penalty). As a result the mouse will find the cheese faster. As a bonus, we also don’t have the fits of forgetfulness seem in the previous example:

10000, W: 313, L: 4, A: 50
20000, W: 845, L: 0, A: 6
30000, W: 848, L: 0, A: 1
40000, W: 850, L: 0, A: 1
50000, W: 850, L: 0, A: 1
60000, W: 846, L: 0, A: 2
70000, W: 844, L: 0, A: 6
80000, W: 843, L: 0, A: 2
90000, W: 846, L: 0, A: 2
100000, W: 847, L: 0, A: 5
110000, W: 844, L: 0, A: 4
120000, W: 849, L: 0, A: 3
130000, W: 846, L: 0, A: 5
140000, W: 842, L: 0, A: 4
150000, W: 845, L: 0, A: 4
160000, W: 847, L: 0, A: 5
170000, W: 848, L: 0, A: 10
180000, W: 844, L: 0, A: 7
190000, W: 840, L: 0, A: 8
200000, W: 848, L: 0, A: 1


And so in this case now we’ve developed a pretty good, stable, egocentric RL mouse, that can learn to avoid jumping off cliffs and find the cheese in this very simple map.

We can still, of course, drop the gamma value down to the extreme, 0, such that our egocentric learner would have no knowledge of any future further than the reward it gets for taking the next step. In this case it would know that going off a cliff is bad, but it won’t know that moving to the right will eventually lead to cheese / a reward.

However, that can be OK! Warding off any immediate danger is and capitalizing on any immediate reward is the perfect role for this kind of short-sighted system. Trying to make it learn how to run through a complicated maze without a lot of very specific landmarks is just not something that’s going to happen.

Conclusions

So, recapping, the egocentric system learns Q-values for states that are based on the environment relative to the agent. This has some important consequences, in that it allows it to learn very quickly not to move forward if there’s a cliff in front of you, and that if there’s a reward in the immediate vicinity to go get it. Setting the weight on the expected reward in the Q-values update can lead to some unstable behavior, and it’s important to have a good concept of how the behavior of the agent changes with different gamma values. These undesirable behaviors can be avoided by setting gamma = 0, but the agent will then only learn the consequences of its immediate action, and be unable to look into the future at all to see how to bring about a reward or avoid getting trapped.

The egocentric learner has strengths where the allocentric learner fails, and the allocentric learner has strengths where the egocentric learner fails. The natural thought is to combine these two learners, and that will be the subject of the next RL post!

## Dynamic movement primitives part 4: Avoiding obstacles

Alright. Previously I’d mentioned in one of these posts that DMPs are awesome for generalization and extension, and one of the ways that they can be extended is by incorporating obstacle avoidance dynamics. Recently I wanted to implement these dynamics, and after a bit of finagling I got it working, and so that’s going to be the subject of this post.

There are a few papers that talk about this, but the one we’re going to use is Biologically-inspired dynamical systems for movement generation: automatic real-time goal adaptation and obstacle avoidance by Hoffmann and others from Stefan Schaal’s lab. This is actually the second paper talking about obstacle avoidance and DMPs, and this is a good chance to stress one of the most important rules of implementing algorithms discussed in papers: collect at least 2-3 papers detailing the algorithm (if possible) before attempting to implement it. There are several reasons for this, the first and most important is that there are likely some typos in the equations of one paper, by comparing across a few papers it’s easier to identify trickier parts, after which thinking through what the correct form should be is usually straightforward. Secondly, often equations are updated with simplified notation or dynamics in later papers, and you can save yourself a lot of headaches in trying to understand them just by reading a later iteration. I recklessly disregarded this advice and started implementation using a single, earlier paper which had a few key typos in the equations and spent a lot of time tracking down the problem. This is just a peril inherent in any paper that doesn’t provide tested code, which is almost all, sadface.

OK, now on to the dynamics. Fortunately, I can just reference the previous posts on DMPs here and don’t have to spend any time discussing how we arrive at the DMP dynamics (for a 2D system):

$\ddot{\textbf{y}} = \alpha_y (\beta_y( \textbf{g} - \textbf{y}) - \dot{\textbf{y}}) + \textbf{f},$

where $\alpha_y$ and $\beta_y$ are gain terms, $\textbf{g}$ is the goal state, $\textbf{y}$ is the system state, $\dot{\textbf{y}}$ is the system velocity, and $\textbf{f}$ is the forcing function.
As mentioned, DMPs are awesome because now to add obstacle avoidance all we have to do is add another term

$\ddot{\textbf{y}} = \alpha_y (\beta_y( \textbf{g} - \textbf{y}) - \dot{\textbf{y}}) + \textbf{f} + \textbf{p}(\textbf{y}, \dot{\textbf{y}}),$

where $\textbf{p}(\textbf{y}, \dot{\textbf{y}})$ implements the obstacle avoidance dynamics, and is a function of the DMP state and velocity. Now then, the question is what are these dynamics exactly?

Obstacle avoidance dynamics

It turns out that there is a paper by Fajen and Warren that details an algorithm that mimics human obstacle avoidance. The idea is that you calculate the angle between your current velocity and the direction to the obstacle, and then turn away from the obstacle. The angle between current velocity and direction to the obstacle is referred to as the steering angle, denoted $\psi$, here’s a picture of it:

So, given some $\psi$ value, we want to specify how much to change our steering direction, $\dot{\psi}$, as in the figure below:

If we’re on track to hit the object (i.e. $\psi$ is near 0) then we steer away hard, and then make your change in direction less and less as the angle between your heading (velocity) and the object is larger and larger. Formally, define $\dot{\psi}$ as

$\dot{\psi} = \gamma \;\psi \;\textrm{exp}(-\beta | \psi |),$

where $\gamma$ and $\beta$ are constants, which are specified as $1000$ and $\frac{20}{\pi}$ in the paper, respectively.

This $\dot{\psi}$ term can be thought of as a weighting, telling us how much we need to rotate based on how close we are to running into the object. To calculate how we should rotate we’re going to calculate the angle orthonormal to our current velocity, then weight it by the distance between the object and our current state on each axis. Formally, this is written:

$\textbf{R} \; \dot{\textbf{y}},$

where $\textbf{R}$ is the axis $(\textbf{o} - \textbf{x}) \times \dot{\textbf{y}}$ rotated 90 degrees (the $\times$ denoting outer product here). The way I’ve been thinking about this is basically taking your velocity vector, $\dot{\textbf{y}}$, and rotating it 90 degrees. Then we use this rotated vector as a row vector, and weight the top row by the distance between the object and the system along the $x$ axis, and the bottom row by the difference along the $\textbf{y}$ axis. So in the end we’re calculating the angle that rotates our velocity vector 90 degrees, weighted by distance to the object along each axis.

So that whole thing takes into account absolute distance to object along each axis, but that’s not quite enough. We also need to throw in $\dot{\psi}$, which looks at the current angle. What this does is basically look at ‘hey are we going to hit this object?’, if you are on course then make a big turn and if you’re not then turn less or not at all. Phew.

OK so all in all this whole term is written out

$\textbf{p}(\textbf{y}, \dot{\textbf{y}}) = \textbf{R} \; \textbf{v} \; \dot{\psi},$

and that’s what we add in to the system acceleration. And now our DMP can avoid obstacles! How cool is that?

Super compact, straight-forward to add, here’s the code:

beta = 20.0 / np.pi
gamma = 1e4
R_halfpi = np.array([[np.cos(np.pi / 2.0), -np.sin(np.pi / 2.0)],
[np.sin(np.pi / 2.0), np.cos(np.pi / 2.0)]])

def avoid_obstacles(y, dy):
p = np.zeros(2)

for obstacle in obstacles:
# based on (Hoffmann, 2009)
R = np.dot(R_halfpi, np.cross(obstacle - y, dy))

cos_psi = np.dot((obstacle - y).T, dy) / \
(np.linalg.norm(dy) * np.linalg.norm(obstacle - y))
psi = np.arccos(cos_psi)
dpsi = gamma * psi * np.exp(-beta * np.linalg.norm(psi))

p += np.dot(R, dy) * dpsi

return p


And that’s it! Just add this method in to your DMP system and call avoid_obstacles at every timestep, and add it in to your DMP acceleration.

You hopefully noticed in the code that this is set up for multiple obstacles, and that all that that entailed was simply adding the p value generated by each individual obstacle. It’s super easy! Here’s a very basic graphic showing how the DMP system can avoid obstacles:

So here there’s just a basic attractor system (DMP without a forcing function) trying to move from the center position to 8 targets around the unit circle (which are highlighted in red), and there are 4 obstacles that I’ve thrown onto the field (black x’s). As you can see, the system successfully steers way clear of the obstacles while moving towards the target!

We must all use this power wisely.

## Arm visualization with PyGame

So, in my neverending quest for better arm visualizations so I can make prettier pictures / videos and improve my submission acceptance rates I’ve started looking at PyGame. I started out hoping to find something that was quick and easy in Python for animating using SVG images, and PyGame is as close as I got. All in all, I’m decently happy with the result. It runs slower than the Matplotlib animation, and you can’t zoom in like you can on a Matplotlib graph, but it’s prettier. So, tradeoffs!

There were a few things that I ran into that tripped me up a bit when I was doing this implementation, so I thought that I would share, and that’s what this post is going to be about. Below is an animation showing what the arm visualization looks like, and as always the code for everything can be found on my Github. That links you to the control folder, and you can find all of the code developed / used for this post in this folder up on my GitHub.

Alright, there were a few parts in doing this that were a bit tricky. Let’s go through them one at a time.

Rotations in PyGame

Turns out rotating images is a pain right off the bat, but once you get over the initial hurdles it looks pretty slick.

To start we’re just going to attempt to rotate an image in place, here’s a first pass:

import pygame
import pygame.locals

white = (255, 255, 255)

pygame.init()

display = pygame.display.set_mode((300, 300))
fpsClock = pygame.time.Clock()

while 1:

display.fill(white)

image = pygame.transform.rotate(image, 1)
rect = image.get_rect()

display.blit(image, rect)

# check for quit
for event in pygame.event.get():
if event.type == pygame.locals.QUIT:
pygame.quit()
sys.exit()

pygame.display.update()
fpsClock.tick(30)


First thing you notice upon running this is that the image flies off to the side very quickly, as shown below:

This is because we need to recenter the image after each rotation. To do that we can add in this after line 21:

    rect.center = (150, 150)


and so now our animation looks like:

At which point a very egregious problem becomes clear: the image is destroying itself as it rotates.

Transforming from a base image

Basically what’s happening is that every transformation the image gets distorted a little bit, and continues moving farther and farther away from the original. To prevent this we’re going save the original and also track the total degrees to rotate the image. Then we’ll perform one rotation (with the minimal distortion caused by one transformation) and plot that every time step. To do this we’ll introduce a degrees variable to track the total rotations. The changes to the code start on line 12:

radians = 0

while 1:

display.fill(white)

rect = rotated_image.get_rect()
rect.center = (150, 150)


And the resulting animation looks like:

Which is significantly better! Pretty good, in fact. Looking close, however, you can notice that it gets a little choppy. And this is because the pygame.transform.rotate method doesn’t use anti-aliasing to smooth out the image. pygame.transformation.rotozoom does, however! So we’ll use rotozoom and set the zoom level to 1, changing line 20:

    rotated = pygame.transform.rotozoom(image, degrees, 1)


And now we have a nice smooth animation!

At this point, we’re going to create a class to take care of this rotation business for us, storing the original image and providing a function that rotates smoothly and recenters the image.

import numpy as np
import pygame

class ArmPart:
"""
A class for storing relevant arm segment information.
"""
def __init__(self, pic, scale=1.0):
# some handy constants
self.length = self.base.get_rect()[2]
self.scale = self.length * scale
self.offset = self.scale / 2.0

self.rotation = 0.0 # in radians

def rotate(self, rotation):
"""
Rotates and re-centers the arm segment.
"""
self.rotation += rotation
# rotate our image
image = pygame.transform.rotozoom(self.base, np.degrees(self.rotation), 1)
# reset the center
rect = image.get_rect()
rect.center = (0, 0)

return image, rect


Everything is just what we’ve seen so far, except the offset, which is going to be very useful for the trig we’re about to get into.

Trig

With rotations working and going smoothly one of the major hurdles is now behind us. At this point we can get our arm images and use some trig to make sure that they rotate around the joint ends rather than in the center. Using the above class now we’ll write a script that loads in a picture of an upper arm, and then rotates it around the shoulder.

import numpy as np
import pygame
import pygame.locals

from armpart import ArmPart

black = (0, 0, 0)
white = (255, 255, 255)

pygame.init()

width = 500
height = 500
display = pygame.display.set_mode((width, height))
fpsClock = pygame.time.Clock()

upperarm = ArmPart('upperarm.png', scale=.7)

base = (width / 2, height / 2)

while 1:

display.fill(white)

ua_image, ua_rect = upperarm.rotate(.01)
ua_rect.center += np.asarray(base)
ua_rect.center -= np.array([np.cos(upperarm.rotation) * upperarm.offset,
-np.sin(upperarm.rotation) * upperarm.offset])

display.blit(ua_image, ua_rect)
pygame.draw.circle(display, black, base, 30)

# check for quit
for event in pygame.event.get():
if event.type == pygame.locals.QUIT:
pygame.quit()
sys.exit()

pygame.display.update()
fpsClock.tick(30)


So far the only trig we need is simply calculating the offset from the center point. Here it’s calculated as half of the length of the image multiplied by a scaling term. The scaling term is because we don’t want the rotation point to be the absolute edge of the image but rather we want it to be somewhere inside the arm. We calculate the x and y offset values using the cos and sin functions, respectively, with a negative sign in from of the sin function because the y axis is inverted, using the top of the screen as 0. The black circle is just for ease of viewing while we’re figuring all the trig out, to make it easier to verify it’s rotating around the point we want. The above code results in the following

import numpy as np
import pygame
import pygame.locals

from armpart import ArmPart

black = (0, 0, 0)
white = (255, 255, 255)

pygame.init()

width = 750
height = 750
display = pygame.display.set_mode((width, height))
fpsClock = pygame.time.Clock()

upperarm = ArmPart('upperarm.png', scale=.7)
forearm = ArmPart('forearm.png', scale=.8)
hand = ArmPart('hand.png', scale=1.0)

origin = (width / 2, height / 2)

while 1:

display.fill(white)

ua_image, ua_rect = upperarm.rotate(.03)
fa_image, fa_rect = forearm.rotate(-.02)
h_image, h_rect = hand.rotate(.03)

# generate (x,y) positions of each of the joints
joints_x = np.cumsum([0,
upperarm.scale * np.cos(upperarm.rotation),
forearm.scale * np.cos(forearm.rotation),
hand.length * np.cos(hand.rotation)]) + origin[0]
joints_y = np.cumsum([0,
upperarm.scale * np.sin(upperarm.rotation),
forearm.scale * np.sin(forearm.rotation),
hand.length * np.sin(hand.rotation)]) * -1 + origin[1]
joints = [(int(x), int(y)) for x,y in zip(joints_x, joints_y)]

def transform(rect, origin, arm_part):
rect.center += np.asarray(origin)
rect.center += np.array([np.cos(arm_part.rotation) * arm_part.offset,
-np.sin(arm_part.rotation) * arm_part.offset])

transform(ua_rect, joints[0], upperarm)
transform(fa_rect, joints[1], forearm)
transform(h_rect, joints[2], hand)
# transform the hand a bit more because it's weird
h_rect.center += np.array([np.cos(hand.rotation),
-np.sin(hand.rotation)]) * -10

display.blit(ua_image, ua_rect)
display.blit(fa_image, fa_rect)
display.blit(h_image, h_rect)

# check for quit
for event in pygame.event.get():
if event.type == pygame.locals.QUIT:
pygame.quit()
sys.exit()

pygame.display.update()
fpsClock.tick(30)


So basically the only thing we’ve done here that was a little more complicated was setting up the centers of the forearm and hand images to be at the end of the upper arm and forearm, respectively. We do that in the joints_x and joints_y variables, and the trig for that is straight from basic robotics.
The above code results in the following animation (which is a little choppy because I dropped some frames to keep the file size small):

Plotting transparent lines

The other kind of of tricky thing that was done in the visualization code was the plotting of transparent lines of the ‘skeleton’ of the arm. The reason that this was kind of tricky was because you can’t just use the pygame.draw.lines method, because it doesn’t allow for transparency. So what you end up having to do instead is generate a set of Surfaces of the shape that you want for each of the lines, and then transform and plot them appropriately, which is thankfully pretty straightforward after working things out for the images above.

To generate a transparent surface you use the pygame.SRCALPHA variable, so it looks like

surface = pygame.Surface((width, height), pygame.SRCALPHA, 32)
surface.fill((100, 100, 100, 200))


where the variable passed in to fill is a 4-tuple, with the first 3 parameters the standard RGB levels, and the fourth parameter being the transparency level.

Then there’s one more snag, in that when you blit a surface it goes by the top left position of the rectangle that contains it. So doing the same transformations for the images we then have to transform the surface further because with the images we were specifying the desired center point. This is easy enough; just get a handle to the rect for the surface and subtract out half of the width and height (post-rotation).

The code with the transparent lines (and some circles at the joints added in for prettiness) then is

import numpy as np
import pygame
import pygame.locals

from armpart import ArmPart

black = (0, 0, 0)
white = (255, 255, 255)
arm_color = (50, 50, 50, 200) # fourth value specifies transparency

pygame.init()

width = 750
height = 750
display = pygame.display.set_mode((width, height))
fpsClock = pygame.time.Clock()

upperarm = ArmPart('upperarm.png', scale=.7)
forearm = ArmPart('forearm.png', scale=.8)
hand = ArmPart('hand.png', scale=1.0)

line_width = 15
line_upperarm = pygame.Surface((upperarm.scale, line_width), pygame.SRCALPHA, 32)
line_forearm = pygame.Surface((forearm.scale, line_width), pygame.SRCALPHA, 32)
line_hand = pygame.Surface((hand.scale, line_width), pygame.SRCALPHA, 32)

line_upperarm.fill(arm_color)
line_forearm.fill(arm_color)
line_hand.fill(arm_color)

origin = (width / 2, height / 2)

while 1:

display.fill(white)

# rotate our joints
ua_image, ua_rect = upperarm.rotate(.03)
fa_image, fa_rect = forearm.rotate(-.02)
h_image, h_rect = hand.rotate(.03)

# generate (x,y) positions of each of the joints
joints_x = np.cumsum([0,
upperarm.scale * np.cos(upperarm.rotation),
forearm.scale * np.cos(forearm.rotation),
hand.length * np.cos(hand.rotation)]) + origin[0]
joints_y = np.cumsum([0,
upperarm.scale * np.sin(upperarm.rotation),
forearm.scale * np.sin(forearm.rotation),
hand.length * np.sin(hand.rotation)]) * -1 + origin[1]
joints = [(int(x), int(y)) for x,y in zip(joints_x, joints_y)]

def transform(rect, base, arm_part):
rect.center += np.asarray(base)
rect.center += np.array([np.cos(arm_part.rotation) * arm_part.offset,
-np.sin(arm_part.rotation) * arm_part.offset])

transform(ua_rect, joints[0], upperarm)
transform(fa_rect, joints[1], forearm)
transform(h_rect, joints[2], hand)
# transform the hand a bit more because it's weird
h_rect.center += np.array([np.cos(hand.rotation),
-np.sin(hand.rotation)]) * -10

display.blit(ua_image, ua_rect)
display.blit(fa_image, fa_rect)
display.blit(h_image, h_rect)

# rotate arm lines
line_ua = pygame.transform.rotozoom(line_upperarm,
np.degrees(upperarm.rotation), 1)
line_fa = pygame.transform.rotozoom(line_forearm,
np.degrees(forearm.rotation), 1)
line_h = pygame.transform.rotozoom(line_hand,
np.degrees(hand.rotation), 1)
# translate arm lines
lua_rect = line_ua.get_rect()
transform(lua_rect, joints[0], upperarm)
lua_rect.center += np.array([-lua_rect.width / 2.0, -lua_rect.height / 2.0])

lfa_rect = line_fa.get_rect()
transform(lfa_rect, joints[1], forearm)
lfa_rect.center += np.array([-lfa_rect.width / 2.0, -lfa_rect.height / 2.0])

lh_rect = line_h.get_rect()
transform(lh_rect, joints[2], hand)
lh_rect.center += np.array([-lh_rect.width / 2.0, -lh_rect.height / 2.0])

display.blit(line_ua, lua_rect)
display.blit(line_fa, lfa_rect)
display.blit(line_h, lh_rect)

# draw circles at joints for pretty
pygame.draw.circle(display, black, joints[0], 30)
pygame.draw.circle(display, arm_color, joints[0], 12)
pygame.draw.circle(display, black, joints[1], 20)
pygame.draw.circle(display, arm_color, joints[1], 7)
pygame.draw.circle(display, black, joints[2], 15)
pygame.draw.circle(display, arm_color, joints[2], 5)

# check for quit
for event in pygame.event.get():
if event.type == pygame.locals.QUIT:
pygame.quit()
sys.exit()

pygame.display.update()
fpsClock.tick(30)


And the resulting arm visualization now looks like this!

From here you can then blit on an image in the background, and the little zoom in I have in the top left is just another surface, it’s all pretty straight forward. And that’s pretty much everything! Hopefully this helps someone trying to get some fancier PyGame things going, and if you have any means of generating fancier PyGame animations yourself please drop any tips below in the comments!

Again, the code from this post is all up on my GitHub.

## Creating animated gifs in Ubuntu

Something that I do for a lot of my posts is create animated gifs, and usually I forget the whole process and have to re-look up everything. So here I will consolidate the process. There are four steps: 1) record video; 2) convert to images; 3) trim down image set; 4) convert to gif.

First, I’ve had luck recording the original video using the SimpleScreenRecorder program. Installation instructions are on their homepage, it’s very simple. I had been using RecordMyDesktop and tried Kazam, but prefer SimpleScreenRecorder.
Once you have your video, create a folder to store all your images in.
I called mine ‘images’.

To convert the video to a set of images we’ll need mplayer. In linux to install this it’s just

sudo apt-get install mplayer2


Once you have mplayer installed, change directories into your images folder and run

mplayer -ao _ ../movie_name.mp4 -vo png:z=9


This will fill up the folder with a bunch of screenshots from the video.

At this point you can go ahead and convert these images into an animated gif, but I always trim down the set to reduce the gif size. This just entails me going through and deleting every other image until I get a file that’s less than a megabyte. Once you’ve got the image set that you want to convert, you’re going to need ImageMagick installed. If you don’t have it, just run

sudo apt-get install imagemagick


Once you have that you’re going to use the convert function:

convert -delay 10 -loop 0 -deconstruct -quantize transparent -layers optimize -resize 400x400 *.png animation.gif


with a bunch of extra options attached to control the play speed (that’s the delay parameter), the looping, some optimization parameters, and then what size it comes out (I’ve chosen 400×400 pixels here).

And there you go! You can now create your own fancy gifs.

## Workshop talk – Methods for scaling neural computation

A couple of months ago I gave a talk at the Neuro-Inspired Computational Elements (NICE) workshop, about the use of cortical microcircuits for adaptation in the prediction and control problems. The talk was recorded, and here is a link: http://1.usa.gov/1tg2n2I Ignore the fire alarm that gets pulled in the first minute, the actual talk starts around 1 minute.

The were a lot of talks in general that were very interesting, which are also available online here: http://nice.sandia.gov/videos.html

I’ll be writing up some posts on the subject matter of my talk in the next couple months, explaining the methods in more detail and providing some solid examples with code. Hope you find the video interesting!

## Dynamic movement primitives part 3: Rhythmic movements

So far we’ve looked at using DMPs for discrete movements, but as I briefly mentioned it’s also possible to use them to perform rhythmic movements. In this post we’re going to look at the implementation of rhythmic DMPs and see exactly how it’s done. It’s actually pretty straightforward, which is always nich. We’ll go through everything that has to change and then see a couple of different applications of rhythmic DMPs. Again, all the code for the DMPs and arm control implementations can be found up on my Github.

Obtaining consistent and repeatable basis function activation

We’ll start with the canonical system. In the discrete case our canonical system (which drives the activation of the basis functions) decayed from $1$ to $0$ and then we were all done. In the rhythmic case we want our system to repeat indefinitely, so we need a reliable way of continuously activating the basis functions in the same order. One function that may come to mind that reliably repeats is the cosine function. But how to use the cosine function, exactly?

The idea is that we’ll lay out our basis functions in the range from $0$ to $2\pi$, and we’ll set the canonical system to be ever increasing linearly. Then we’ll use the difference between the canonical system state and each of the center points as the value we pass in to our cosine function. Because the cosine function repeats at $2\pi$ we’ll get a nice repeating spread of basis function activations. We’ll then throw these values through our Gaussian equation (with the gain and bias terms) and we’ll be done! Here’s the equation:

$\psi = \textrm{exp}(h * \textrm{cos}(x - c) - 1),$
where $x$ is the state of the canonical system, $c$ is the basis function center point, and $h$ is a gain term controlling the variance of the Gaussian. And here’s a picture of the activations of three basis functions with centers spread out evenly between $0$ and $2\pi$:

Here’s a picture of the basis function activations on a longer time scale:

As long as our canonical system increases at a steady pace the basis functions will continue to activate in a reliable and repeatable way. As mentioned, to get the canonical system to increase at a steady pace we simply give it linear dynamics:

$\dot{x} = 1$
The placement of the center points of the rhythmic system is a lot easier than in the discrete case because the rhythmic canonical system dynamics are linear.

Other differences between discrete and rhythmic

In the actual implementation of the rhythmic system there are a couple of other differences that you need to be aware of. The first is that there is no diminishing term in the rhythmic system, whereas there is in the discrete case.

The second is how to go about establishing goal states when imitating an path. We’re going to assume that whatever path is passed in to us is going to be a rhythmic pattern, and then goal state is going to be set as the center point of the desired trajectory. So for rhythmic systems the ‘goal’ is actually more like a center point for the trajectory.

And that’s pretty much it! If you look through the code that I have up for the rhythmic DMP subclass you’ll see that only a handful of functions needed to be redefined, the rest is the same! Well that’s great, let’s look at some applications.

Example

I was thinking of different examples that would be interesting to see of the rhythmic DMPs, and after a little consideration the most obvious application of rhythmic DMPs is really to a system that does something like walking. In the spinal cord of animals there are circuits which are known as central pattern generators (CPGs), and when stimulated these have been shown to generate rhythmic, repeated movements (as described in Principles of rhythmic motor pattern generation and other papers). Let’s go ahead and build our own CPG here using DMPs. I don’t have a simulation of a set of legs, however, so instead we’ll look at getting a single 3-link arm to reproduce a pattern of behaviour that you’d expect in a leg that is walking.

To do this, I’m going to need to specify the trajectories for the three joints of the arm, and I’ll be controlling the system in joint space, using the generalized coordinates controller, rather than the operational space controller used in the discrete DMP examples for controlling the end-effector position. So that’s one thing. Another thing is that I’m going to need the kinematics of leg joints during motion. After a quick search of the web I found this presentation from Simon Fraser University which has the following images:

Which is some very useful information! I just need to transfer this into a Numpy vector that I can feed into the system and convert degrees to radians and we should be good to go! To get the information out of the graph, use your favorite data stripper, I used http://arohatgi.info/WebPlotDigitizer/app/.

I also had to do a bit of a shuffle with the trajectories provided to make it look right on the arm. The hip to shoulder joint maps fine, but for the knee to elbow the flexion and extension are reversed, and for the wrist to foot the $0$ angle for the foot is straight out. Once those have been accounted for, though, everything works great. Here is an animation of a single leg walking along, if you use your imagination you can just picture a happy-go-lucky torso attached to this leg whistling while he gets his very human-like gait on:

And there you go! It’s neat how straightforward that was, once you have the data of the system you want to model. It shows the power of DMPs for coordinating multiple degrees of freedom quickly and easily. Also something that’s worth pointing out once again is that this is a force based controller. We were able to go from a kinematic description of the movement we wanted to a reliable system generating torques. It’s another good example of the dexterity of dynamic movement primitives.

You can see the code for everything here up online on my github. To run the code you should run:

python run.py arm3 gc walk


## Dynamic movement primitives part 2: Controlling end-effector trajectories

The dynamic movement primitive (DMP) framework was designed for trajectory control. It so happens that in previous posts we’ve built up to having several arm simulations that are ripe for throwing a trajectory controller on top, and that’s what we’ll do in this post. The system that we will be controlling here is the 3 link arm model with an operational space controller (OSC) that translates end-effector forces into joint torques. The DMPs here will be controlling the $(x,y)$ trajectory of the hand, and the OSC will take care of turning the desired hand forces into torques that can be applied to the arm. All of the code used to generate the animations throughout this post can of course be found up on my github.

Controlling a 3 link arm with DMPs
We have our 3 link arm and its OSC controller; this whole setup we’ll collectively refer to as ‘the plant’ throughout this post. We are going to pass in some $(x,y)$ force signal to the plant, and the plant will carry it out. Additionally we’ll get a feedback signal with the $(x,y)$ position of the hand. At the same time, we also have a DMP system that’s doing its own thing, tracing out a desired trajectory in $(x,y)$ space. We have to tie these two systems together.

To do this is easy, we’ll generate the control signal for the plant from our DMP system simply by measuring the difference between the state of our DMP system and the plant state, use that to drive the plant to the state of the DMP system. Formally,

$u = k_p(y_{\textrm{DMP}} - y)$
where $y_{\textrm{DMP}}$ is the state of the DMP system, $y$ is the state of the plant, and $k_p$ and is the position error gain term.

Once we have this, we just go ahead and step our DMP system forward and make sure the gain values on the control signal are high enough that the plant follows the DMP’s trajectory. And that’s pretty much it, just run the DMP system to the end of the trajectory and then stop your simulation.

To give a demonstration of DMP control I’ve set up the DMP system to follow the same number trajectories that the SPAUN arm followed. As you can see the combination of DMPs and operational space control is much more effective than my previous implementation.

Incorporating system feedback

One of the issues in implementing the control above is that we have to be careful about how quickly the DMP trajectory moves, because while the DMP system isn’t constrained by any physical dynamics, the plant is. Depending on the size of the movement the DMP trajectory may be moving a foot a second or an inch a second. You can see above that the arm doesn’t fully draw out the desired trajectories in places where the DMP system moved too quickly in and out and sharp corners. The only way to remedy this without feedback is to have the DMP system move more slowly throughout the entire trajectory. What would be nice, instead, would be to just say ‘go as fast as you can, as long as the plant state is within some threshold distance of you’, and this is where system feedback comes in.

It’s actually very straightforward to implement this using system feedback: If the plant state drifts away from the state of the DMPs, slow down the execution speed of the DMP to allow the plant time to catch up. The do this we just have to multiply the DMP timestep $dt$ by a new term:

$1 / (1 + \alpha_{\textrm{err}}(y_{\textrm{plant}} - y_{\textrm{DMP}}))$.
All this new term does is slow down the canonical system when there’s an error, you can think of it as a scaling on the time step. Additionally, the sensitivity of this term can be modulated the scaling term $\alpha_{\textrm{err}}$ on the difference between the plant and DMP states.

We can get an idea of how this affects the system by looking at the dynamics of the canonical system when an error term is introduced mid-run:

When the error is introduced the dynamics of the system slow down, great! Lets look at an example comparing execution with and without this feedback term. Here’s the system drawing the number 3 without any feedback incorporation:

and here’s the system drawing the number 3 with the feedback term included:

These two examples are a pretty good case for including the feedback term into your DMP system. You can still see in the second case that the specified trajectory isn’t traced out exactly, but if that’s what you’re shooting for you can just crank up the $\alpha_{\textrm{err}}$ to make the DMP timestep really slow down whenever the DMP gets ahead of the plant at all.

Interpolating trajectories for imitation

When imitating trajectories there can be some issues with having enough sample points and how to fit them to the canonical system’s timeframe, they’re not difficult to get around but I thought I would go over what I did here. The approach I took was to always run the canonical system for 1 second, and whenever a trajectory is passed in that should be imitated to scale the x-axis of the trajectory such that it’s between 0 and 1. Then I shove it into an interpolator and use the resulting function to generate an abundance of nicely spaced sample points for the DMP imitator to match. Here’s the code for that:

# generate function to interpolate the desired trajectory
import scipy.interpolate

path = np.zeros((self.dmps, self.timesteps))
x = np.linspace(0, self.cs.run_time, y_des.shape[1])

for d in range(self.dmps):

# create interpolation function
path_gen = scipy.interpolate.interp1d(x, y_des[d])

# create evenly spaced sample points of desired trajectory
for t in range(self.timesteps):
path[d, t] = path_gen(t * self.dt)

y_des = path


Direct trajectory control vs DMP based control

Now, using the above described interpolation function we can just directly use its output to guide our system. And, in fact, when we do this we get very precise control of the end-effector, more precise than the DMP control, as it happens. The reason for this is because our DMP system is approximating the desired trajectory and with a set of basis functions, and some accuracy is being lost in this approximation.

So if we instead use the interpolation function to drive the plant we can get exactly the points that we specified. The times when this comes up especially are when the trajectories that you’re trying to imitate are especially complicated. There are ways to address this with DMPs by placing your basis functions more appropriately, but if you’re just looking for the exact replication of an input trajectory (as often people are) this is a simpler way to go. You can see the execution of this in the control_trajectory.py code up on my github. Here’s a comparison of a single word drawn using the interpolation function:

and here’s the same word drawn using a DMP system with 1,000 basis function per DOF:

We can see that just using the interpolation function here gives us the exact path that we specified, where using DMPs we have some error, and this error increases with the size of the desired trajectory. But! That’s for exactly following a given trajectory, which is often not the case. The strength of the DMP framework is that the trajectory is a dynamical system. This lets us do simple things to get really neat performance, like scale the trajectory spatially on the fly simply by changing the goal, rather than rescaling the entire trajectory:

Conclusions

Some basic examples of using DMPs to control the end-effector trajectory of an arm with operational space control were gone over here, and you can see that they work really nicely together. I like when things build like this. We also saw that power of DMPs in this situation is in their generalizability, and not in exact reproduction of a given path. If I have a single complex trajectory that I only want the end-effector to follow once then I’m going to be better off just interpolating that trajectory and feeding the coordinates into the arm controller rather than go through the whole process of setting up the DMPs.

Drawing words, though, is just one basic example of using the DMP framework. It’s a very simple application and really doesn’t do justice to the flexibility and power of DMPs. Other example applications include things like playing ping pong. This is done by creating a desired trajectory showing the robot how to swing a ping pong paddle, and then using a vision system to track the current location of the incoming ping pong ball and changing the target of the movement to compensate dynamically. There’s also some really awesome stuff with object avoidance, that is implemented by adding another term with some simple dynamics to the DMP. Discussed here, basically you just have another system that moves you away from the object with a strength relative to your distance from the object. You can also use DMPs to control gain terms on your PD control signal, which is useful for things like object manipulation.

And of course I haven’t touched on rhythmic DMPs or learning with DMPs at all, and those are both also really interesting topics! But this serves as a decent introduction to the whole area, which has been developed in the Schaal lab over the last decade or so. I recommend further reading with some of these papers if you’re interested, there are a ton of neat ways to apply the DMP framework! And, again, the code for everything here is up on my github.

## Dynamic movement primitives part 1: The basics

Dynamic movement primitives (DMPs) are a method of trajectory control / planning from Stefan Schaal’s lab. They were presented way back in 2002 in this paper, and then updated in 2013 by Auke Ijspeert in this paper. This work was motivated by the desire to find a way to represent complex motor actions that can be flexibly adjusted without manual parameter tuning or having to worry about instability.

Complex movements have long been thought to be composed of sets of primitive action ‘building blocks’ executed in sequence and \ or in parallel, and DMPs are a proposed mathematical formalization of these primitives. The difference between DMPs and previously proposed building blocks is that each DMP is a nonlinear dynamical system. The basic idea is that you take a dynamical system with well specified, stable behaviour and add another term that makes it follow some interesting trajectory as it goes about its business. There are two kinds of DMPs: discrete and rhythmic. For discrete movements the base system is a point attractor, and for rhythmic movements a limit cycle is used. In this post we’re only going to worry about discrete DMPs, because by the time we get through all the basics this will already be a long post.

Imagine that you have two systems: An imaginary system where you plan trajectories, and a real system where you carry them out. When you use a DMP what you’re doing is planning a trajectory for your real system to follow. A DMP has its own set of dynamics, and by setting up your DMP properly you can get the control signal for your actual system to follow. If our DMP system is planing a path for the hand to follow, then what gets sent to the real system is the set of forces that need to be applied to the hand. It’s up to the real system to take these hand forces and apply them, by converting them down to joint torques or muscle activations (through something like the operation space control framework) or whatever. That’s pretty much all I’ll say here about the real system, what we’re going to focus on here is the DMP system. But keep in mind that the whole DMP framework is for generating a trajectory \ control signal to guide the real system.

I’ve got the code for the basic discrete DMP setup and examples I work through in this post up on my github, so if you want to jump straight to that, there’s the link! You can run test code for each class just by executing that file.

Discrete DMPs

Let’s start out with point attractor dynamics:

$\ddot{y} = \alpha_y ( \beta_y (g - y) - \dot{y})$,
where $y$ is our system state, $g$ is the goal, and $\alpha$ and $\beta$ are gain terms. This should look very familiar, it’s a PD control signal, all this is going to do is draw our system to the target. Now what we’ll do is add on a forcing term that will let us modify this trajectory:

$\ddot{y} = \alpha_y ( \beta_y (g - y) - \dot{y}) + f$.
How to define a nonlinear function $f$ such that you get the desire behaviour is a non-trivial question. The crux of the DMP framework is an additional nonlinear system used to define the forcing function $f$ over time, giving the problem a well defined structure that can be solved in a straight-forward way and easily generalizes. The introduced system is called the canonical dynamical system, is denoted $x$, and has very simple dynamics:

$\dot{x} = -\alpha_x x$.

The forcing function $f$ is defined as a function of the canonical system:

$f(x,g) = \frac{\Sigma_{i=1}^N \psi_i w_i}{\Sigma_{i=1}^N \psi_i} x(g - y_0)$,
where $y_0$ is the initial position of the system,

$\psi_i = \textrm{exp}\left( -h_i \left( x - c_i\right)^2 \right)$,
and $w_i$ is a weighting for a given basis function $\psi_i$. You may recognize that the $\psi_i$ equation above defines a Gaussian centered at $c_i$, where $h_i$ is the variance. So our forcing function is a set of Gaussians that are ‘activated’ as the canonical system $x$ converges to its target. Their weighted summation is normalized, and then multiplied by the $x (g - y_0)$ term, which is both a ‘diminishing’ and spatial scaling term.

Let’s break this down a bit. The canonical system starts at some arbitrary value, throughout this post $x_0 = 1$, and goes to 0 as time goes to infinity. For right now, let’s pretend that $x$ decays linearly to $0$. The first main point is that there are some basis functions which are activated as a function of $x$, this is displayed in the top figure below. As the value of $x$ decreases from 1 to 0, each of the Gaussians are activated (or centered) around different $x$ values. The second thing is that each of these basis functions are also assigned a weight, $w_i$. These weights are displayed in the lower figure in the bar plot. The output of the forcing function $f$ is then the summation of the activations of these basis functions multiplied by their weight, also displayed in the lower figure below.

The diminishing term
Incorporating the $x$ term into the forcing function guarantees that the contribution of the forcing term goes to zero over time, as the canonical system does. This means that we can sleep easy at night knowing that our system can trace out some crazy path, and regardless will eventually return to its simpler point attractor dynamics and converge to the target.

Spatial scaling
Spatial scaling means that once we’ve set up the system to follow a desired trajectory to a specific goal we would like to be able to move that goal farther away or closer in and get a scaled version of our trajectory. This is what the $(g - y_0)$ term of the forcing function handles, by scaling the activation of each of these basis functions to be relative to the distance to the target, causing the system to cover more or less distance. For example, let’s say that we have a set of discrete DMPs set up to follow a given trajectory:

The goals in this case are 1 and .5, which you can see is where the DMPs end up. Now, we’ve specified everything in this case for these particular goals (1 and .5), but let’s say we’d like to now generalize and get a scaled up version of this trajectory for moving by DMPs to a goal of 2. If we don’t appropriately scale our forcing function, with the $(g - y_0)$ term, then we end up with this:

Basically what’s happened is that for these new goals the same weightings of the basis functions were too weak to get the system to follow or desired trajectory. Once the $(g - y_0)$ term included in the forcing function, however, we get:

which is exactly what we want! Our movements now scale spatially. Awesome.

Alright, now, unfortunately for us, our canonical system does not converge linearly to the target, as we assumed above. Here’s a comparison of a linear decay vs the exponential decay of actual system:

This is an issue because our basis functions activate dependent on $x$. If the system was linear then we would be fine and the basis function activations would be well spread out as the system converged to the target. But, with the actual dynamics, $x$ is not a linear function of time. When we plot the basis function activations as a function of time, we see that the majority are activated immediately as $x$ moves quickly at the beginning, and then the activations stretch out as the $x$ slows down at the end:

In the interest of having the basis functions spaced out more evenly through time (so that our forcing function can still move the system along interesting paths as it nears the target, we need to choose our Gaussian center points more shrewdly. If we look at the values of $x$ over time, we can choose the times that we want the Gaussians to be activated, and then work backwards to find the corresponding $x$ value that will give us activation at that time. So, let’s look at a picture:

The red dots are the times we’d like the Gaussians to be centered around, and the blue line is our canonical system $x$. Following the dotted lines up to the corresponding $x$ values we see what values of $x$ the Gaussians need to be centered around. Additionally, we need to worry a bit about the width of each of the Gaussians, because those activated later will be activated for longer periods of time. To even it out the later basis function widths should be smaller. Through the very nonanalytical method of trial and error I’ve come to calculate the variance as

$h_i = \frac{\#BFs}{c_i}$
Which reads the variance of basis function $i$ is equal to the number of basis functions divided by the center of that basis function. When we do this, we can now generate centers for our basis functions that are well spaced out:

Temporal scaling

Again, generalizability is one of the really important things that we want out of this system. There are two obvious kinds, temporal and spatial. Spatial scaling we discussed above, in the temporal case we’d like to be able to follow this same trajectory at different speeds. Sometimes quick, sometimes slow, but always tracing out the same path. To do that we’re going to add another term to our system dynamics, $\tau$, our temporal scaling term. Given that our system dynamics are:

$\ddot{y} = \alpha_y ( \beta_y (g - y) - \dot{y}) + f$,
$\dot{x} = -\alpha_x x$,
to give us temporal flexibility we can add the $\tau$ term:

$\ddot{y} = \tau^2 (\alpha_y ( \beta_y (g - y) - \dot{y}) + f)$,
$\dot{x} = \tau(-\alpha_x x)$,
where we use $\tau^2$ for $\ddot{y}$ because it’s the second derivative, and that’s all we have to do! Now to slow down the system you set $\tau$ between 0 and 1, and to speed it up you set $\tau$ greater than 1.

Imitating a desired path

Alright, great. We have a forcing term that can make the system take a weird path as it converges to a target point, and temporal and spatial scalability. How do we set up the system to follow a path that we specify? That would be ideal, to show the system the path to follow, and have it be able to work backwards and figure out the forces and then be able to generate that trajectory whenever we want. This ends up being a pretty straight forward process.

We have control over the forcing term, which affects the system acceleration. So we first need to take our desired trajectory, $\textbf{y}_d$ (where bold denotes a vector, in this case the time series of desired points in the trajectory), and differentiate it twice to get the accelerations:

$\ddot{\textbf{y}}_d = \frac{\partial}{\partial t} \dot{\textbf{y}}_d = \frac{\partial}{\partial t} \frac{\partial}{\partial t} \textbf{y}_d$.
Once we have the desired acceleration trajectory, we need to remove the effect of the base point attractor system. We have the equation above for exactly what the acceleration induced by the point attractor system at each point in time is:

$\ddot{y} = \alpha_y ( \beta_y (g - y) - \dot{y})$,
so then to calculate what the forcing term needs to be generate this trajectory we have:

$\textbf{f}_d = \ddot{\textbf{y}}_d - \alpha_y ( \beta_y (g - \textbf{y}) - \dot{\textbf{y}})$.
From here we know that the forcing term is comprised of a weighted summation of basis functions which are activated through time, so we can use an optimization technique like locally weighted regression to choose the weights over our basis functions such that the forcing function matches the desired trajectory $\textbf{f}_d$. In locally weighted regression sets up to minimize:

$\Sigma_t \psi_i(t)(f_d(t) - w_i (x(t) (g - y_0)))^2$
and the solution (which I won’t derive here, but is worked through in Schaal’s 1998 paper) is

$w_i = \frac{\textbf{s}^T \pmb{\psi}_i \textbf{f}_d}{\textbf{s}^T \pmb{\psi}_i \textbf{s}}$,
where

$\textbf{s} = \left( \begin{array}{c}x_{t_0}(g - y_0) \\ \vdots \\ x_{t_N}(g - y_0) \end{array} \right), \;\;\; \pmb{\psi}_i = \left( \begin{array}{ccc} \psi_i(t_0) & \dots & 0 \\ 0 & \ddots & 0 \\ 0 & \dots & \psi_i(t_n) \end{array} \right)$
Great! Now we have everything we need to start making some basic discrete DMPs!

Different numbers of basis functions

One of the things you’ll notice right off the bat when imitating paths, is that as the complexity of the trajectory increases, so does the required number of basis functions. For example, below, the system is trying to follow a sine wave and a highly nonlinear piecewise function:

We can see in the second case that although the DMP is never able to exactly reproduce the desired trajectory, the approximation continues to get better as the number of basis functions increases. This kind of slow improvement in certain nonlinear areas is to be expected from how the basis functions are being placed. An even spreading of the centers of the basis functions through time was used, but for imitation there is another method out of Dr. Schaal’s lab that places the basis functions more strategically. Need is determined by the function complexity is in that region, and basis function centers and widths are defined accordingly. In highly nonlinear areas we would expect there to be many narrow basis functions, and in linear areas we would expect fewer basis functions, but ones that are wider. The method is called locally weighted projection regression, which I plan on writing about and applying in a future post!

Conclusions \ thoughts

There’s really a lot of power in this framework, and there are a ton of expansions on this basic setup, including things like incorporating system feedback, spatio-temporal coupling of DMPs, using DMPs for gain control as well as trajectory control, incorporating a cost function and reinforcement learning, identifying action types, and other really exciting stuff.

I deviated from the terminology used in the papers here in a couple of places. First, I didn’t see a reason to reduce the second order systems to two first order systems. When working through it I found it more confusing than helpful, so I left the dynamics as a second order systems. Second, I also moved the $\tau$ term to the right hand side, and that’s just so that it matches the code, it doesn’t really matter. Neither of these were big changes, but in case you’re reading the papers and wondering.

Something that I kind of skirted above is planning along multiple dimensions. It’s actually very simple; the DMP framework simply assigns one DMP per degree of freedom being controlled. But, it’s definitely worth explicitly stating at some point.

I also mentioned this above, but this is a great trajectory control system to throw on top of the previously discussed operational space control framework. With the DMP framework on top to plan robust, generalizable movements, and the OSCs underneath to carry out those commands we can start to get some really neat applications. For use on real systems the incorporation of feedback and spatio-temporal coupling terms is going to be important, so the next post will likely be working through those and then we can start looking at some exciting implementations!

Speaking of implementations, there’s a DMP and canonical system code up on my github, please feel free to explore it, run it, send me questions about it. Whatever. I should also mention that there’s this and a lot more all up and implemented on Stefan Schaal’s lab website.

## Robot control part 7: OSC of a 3-link arm

So we’ve done control for the 2-link arm, and control of the one link arm is trivial (where we control joint angle, or x or y coordinate of the pendulum), so here I’ll just show an implementation of operation space control for a more interesting arm model, the 3-link arm model. The code can all be found up on my Github.

In theory there’s nothing different or more difficult about controlling a 3-link arm versus a 2-link arm. For the inertia matrix, what I ended up doing here is just jacking up all the values of the matrix to about 100, which causes the controller to way over control the arm, and you can see the torques are much larger than they would need to be if we had an accurate inertia matrix. But the result is the same super straight trajectories that we’ve come to expect from operational space control:

It’s a little choppy because I cut out a bunch of frames to keep the gif size down. But you get the point, it works. And quite well!

Because this is also a 3-link arm now and our task space force signal is 2D, we have redundant space of solutions, meaning that the task space control signal can be carried out in a number of ways. In other words, a null space exists for this controller. This means that we can throw another controller in our system to operate inside the null space of the first controller. We’ve already worked through all the math for this, so it’s straightforward to implement.

What kind of null space controller should we put in? Well, you may have noticed in the above animation the arm goes through itself, here’s another example:

Often it’s desirable to avoid this (because of physics or whatever), so what we can do is add a secondary controller that works to keep the arm’s elbow and wrist near some comfortable default angles. When we do this we get the following:

And there you have it! Operational space control of a three link arm with a secondary controller in the null space to try to keep the angles near their default / resting positions.

I also added mouse based control to the arm so it will try to follow your mouse when you move over the figure, which makes it pretty fun to explore the power of the controller. It’s interesting to see where the singularities become an issue, and how having a null space controller that’s operating in joint space can actually come to help the system move through those problem points more smoothly. Check it out! It’s all up on my Github.