## Online Goal Babbling – motor learning paper review

Recently I was linked to an article about learning how to control a highly complex arm from scratch: How infants tell us how to control the Bionic Handling Assistant. The work seemed very interesting so I went and pulled one of the papers linked in the article, Online Goal Babbling for rapid bootstrapping of inverse models in high dimensions.

Diving into that title, online means that we’re using information from every movement as it’s gathered to improve our control, as opposed to ‘batch’ where learning only occurs every so-many trials. Bootstrapping is the process of bringing a system up to a functionally useful level. High dimensions then refers to the complexity of the system being controlled, where every component that requires a control signal is another dimension. Humans, for example, require extremely high dimensional control signals. Inverse models refer to a type of internal model, which ‘describe relations between motor commands and their consequences’. Forward models predict the results of a movement, and inverse models allow suggest a motor command that can be used to achieve a desired consequence, such as ‘the ball falls on the floor’ or ‘my hand touches the red balloon’.

Alright, so that’s the title, let’s dive into the paper proper.

Online goal babbling

The idea behind this research is to let the system learn how to control itself by exploring the environment. The reason why you would want to do this is so that you don’t have to analytically specify how the system should move. Analytic solutions require accurate models of the system dynamics, and calculating these quickly become horrendously complex. To the point that the equations of motion for a relatively simple 3-link arm moving are pages upon pages upon pages. On top of this, because your solution is only is as good as your model and the model you just took all that time to calculate isn’t adaptive, if your system dynamics change at all through wear and tear or an environment change, you’re in big trouble. You can use feedback control to help compensate for the errors introduced, but you can only respond as fast as you can receive and process sensory signals, which is often too long in practical applications.

So this moves us to adaptive feedforward control, based on learning an inverse model of the system dynamics. Importantly, what this paper describes is a kinematic inverse controller, not a kinetic inverse controller; meaning that given a desired target position for the end effector (hand) of an arm it provides a sequence of joint angle sets that will lead to the end effector reaching the target, as opposed to providing a sequence of joint angle torques to drive the system to the target. At some point, a control system will have to generate force commands, rather than just specify what position it wants the system to be in. But, knowing what joint angles and trajectory the joint angles should follow is an important problem, because systems that are of interest to control tend to exhibit a fair amount of configuration redundancy between ‘task-space’ and the actual system state-space. Task-space being something like the 3-dimensional $(x,y,z)$ position of the end-effector, which we are interested in controlling, and the actual system state-space being something like the joint-angles or muscle-lengths. Configuration redundancy is the problem of more than one possible set of joint-angles putting the end-effector in the desired location. Often the number of potential solutions is incredibly large, think about the number of possible ways you can reach to an object. So how do you learn an appropriate trajectory for your joint angles during a reaching movement? That is the problem being addressed here.

To learn an inverse model from scratch, the system needs to explore. How should it explore? Moving randomly eventually leads to an exhaustive search, but this is a poor choice in complex systems because it takes a large amount of time, increasing exponentially with the degrees of freedom (i.e. number of joint angles) of the system. So let’s look to babies, they’re responsible for learning how to control a very complex system, how the heck do they learn to move?

‘Motor babbling’ is a term that was coined to describe the seemingly random way babies moved about. It was suggested that they just flail about without purpose until they gain some understanding of how their bodies work, at which point they can start moving with purpose. But! Baby movement was shown way back in the 80’s to in fact not be just random, but instead to be highly goal directed. And when they find a solution, they stick with it, they don’t worry about it being the best one as long as it gets the job done. Only later are movements tweaked to be more efficient. As mentioned above, in systems as complicated as the human body the task-space (i.e. position of the hand) is much smaller than the motor space (i.e. length of all muscles), and there are a bunch of different solutions to a given task. With all these different potential solutions to a given problem, an exhaustive search isn’t even close to being necessary. Also, if babies aren’t just randomly exploring space to figure things out, they don’t have to flip a switch somewhere in their brain that says “ok stop messing around and let’s move with purpose”.

This paper provides a technique for stable online inverse model learning that can be used from initial bootstrapping, and shows how online learning can even be highly beneficial for bootstrapping. So let’s dive into the Online Goal Babbling learning method.

The inverse model function

Let’s denote the inverse model function we’re learning as $g()$, joint angles as $q$, and end effector positions as $x$. Then to denote giving a desired end effector position and getting out a set of joint angles we write $g(x^*) = q$, where $x^*$ represents a target in end effector space.

We’re going to initialize the inverse model function by having every end effector position return some default resting state of the arm, or home position, that we’ve defined, $q_{home}$. Additionally, we’re going to throw on some exploratory noise to the answer provided by the inverse model, so that the joint angles to move to at time $t$ are defined as $q_t = g(x^*_t, \theta_t) + E_t(x_t^*)$, where $E_t(x^*_t)$ is the exploratory noise. When the system then moves to $q_t$ the end effector position, $x_t$, is observed and the parameters of the inverse model, $\theta$, are updated immediately.

To generate a smooth path from the starting position to the target, the system divides the distance up into a number of sub-targets (25 in the paper) for the system to hit along the way. This is an important point, as it’s how the inverse model function is used to create a path that represents the system moving; a series of targets are provided and the inverse model is queried “what should the joint configuration be for this position?”

As mentioned before, it is possible to have he end effector in the same position, but the joints in a different configuration. Learning across these examples is dangerous and can lead to instability in the system as neighbouring targets could be learned with very dissimilar joint configurations, preventing smooth movement through joint-space. To prevent this, observed information is weighted by a term $w_t$ as it is taken in, based on deviation from the resting position of the arm ($q_{home}$) and efficiency of end effector movement. What this leads to is a consistent solution to be converged upon throughout movement space, causing the inverse model to generate smooth, comfortable (i.e. not near joint limits, but near the resting state of the arm) movements.

To recenter movement exploration every so often, and prevent the system from exploring heavily in some obscure joint-space, every time a new target to move to is selected there is some probability (.1 in the paper) that the system will return to $q_{home}$. To return home the system traces out a straight trajectory in joint-space, not worrying about how it is moving through end effector space. This return probability reinforces learning how to move well near $q_{home}$, and acts as a ‘developmentally plausible stabilizer that helps to stay on known areas of the sensorimotor space.’

Also, we mentioned adding exploratory noise. How is that noise generated? The noise is calculated through a small, randomly chosen linear function that varies slowly over time: $E_t(x^*_t) = A_t \cdot x^* + c_t$, where the entries to the matrix $A_0$ and vector $b_0$ are chosen independently from a normal distribution with zero mean and variance $\sigma^2$. To move, a set of small values is chosen from a normal distribution with variance significantly smaller than $\sigma^2$, and added to elements of $A_t$ and $c_t$. A normalization term is also used to keep the overall deviation stable around the standard deviation $\sigma$. And that’s how we get our slowly changing linear random noise function.

Online learning

To learn the inverse model function, we’re going to use the technique of creating a bunch of linear models that are accurate in a small area of state space, and weighting their contribution to the output based on how close we are to their ‘region of validity’. The specific method used in the paper is a local-linear map, from (H.Ritter, Learning with the self-organizing map 1991). We define our linear models as $g^{(k)}(x) = M^{(k)} \cdot x + b^{(k)}$, intuited as following the standard $mx + b$ definition of a line, but moving into multiple dimensions. $M^{(k)}$ is our linear transformation of the input for model $k$, and $b^{(k)}$ is the offset.

If an input is received that is outside of the region of validity for all of the local linear models, then another one is added to improve the approximation of the function at that point. New local models are initiated with the Jacobian matrix $J(x) = \frac{\partial g(x)}{\partial x}$ of the inverse estimate. In other words, we look at how the approximation of the function is changing as we move in this $x$ direction in state space, or estimate the derivative of the function, and use that to set the initial value of our new linear model.

To update our inverse model function, we fit it to the current example $(x_t, q_t)$ by reducing the weighted square error: $err = w_t \cdot ||q_t - g(x_t)||^2$. With this error, a standard gradient descent method is used to update the slopes and offsets of the dimensions of the linear models:

$M^{(k)}_{t+1} = M^{(k)}_t - \eta \frac{\partial err}{\partial M^{(k)}_t}, \;\;\;\;\; b^{(k)}_{t+1} = b^{(k)}_{t} - \eta \frac{\partial err}{/partial b^{(k)}_{t}}$,

where $\eta$ is the learning rate.

Results

So here are some figures of the results of applying the algorithm, and they’re pretty impressive. First we see learning the inverse model for a 5-link arm:

And here are some results looking at varying the learning rate (x-axis) and the effects on (a) time until 10% error reached, (b) final performance error, and (c) final average distance from $q_{home}$, the resting state configuration.

I enjoyed reading this paper, although I thought the ‘Learning operational space control’ papers from Jan Peters and Stephen Schaal deserved a bit more of a shout-out, as they’ve been around since 2006/8 and it’s a very similar method. Also I pulled one of the older papers from the article, so I’ll need to review the other ones too soon, as there is clearly more going on for the successful control of the elephant trunk arm shown in the article.

Other than that, I’m a big fan of breaking up complicated, non-linear functions into a bunch of small linear models, and stitching them all back together to learn an accurate approximation. I think it’s a very likely candidate for the kind of way that the brain goes about learning these incredibly complicated models, especially in the cerebellum I hold these types of methods as a favorite for how biological systems accomplish this. Another important note is that the learning never ‘turns off’ here, which is another big plus to this method. It allows the system to converge upon a solution, but to keep exploring the space as it uses this solution, so more efficient solutions can be found. Additionally, this makes this system much more robust to changing dynamics, resultant of wear and tear or, as in the case of infants, growing.

Again, the model here is learning an kinematic control of the system, specifying what joint angles the system should move to, and not the joint torques that will be required to actually move the system there. But, I am partial to the idea that this kinematic inverse model does then make it a much more smooth function for the kinetic controller to learn from this point.

The way I picture it is specifying a target in end-effector space, learning an kinematic inverse model controller that can specify the appropriate joint trajectory to follow to reach this target. Then an kinetic inverse model controller that takes in the joint space path, looks at the current joint angles / velocities / and accelerations, and learns the appropriate torques or muscle activation commands to move the arm as desired. This is a much, much smoother function to learn than directly learning the appropriate joint torques for a desired target in end effector space, because there are no hidden configuration redundancies. It’s still complex, because it still has to compensate for the inertia and multi-link interaction dynamics, but should still hopefully be much more straight-forward to learn.

Rolf, Matthias. (2011). Online Goal Babbling for rapid bootstrapping of inverse models in high dimensions. Development and Learning (ICDL), 2011 IEEE DOI: 10.1109/DEVLRN.2011.6037368

## 2 thoughts on “Online Goal Babbling – motor learning paper review”

1. […] a force (muscle activation) signal that will move your system there. But! as mentioned in the motor babbling paper review post I wrote a while back, having the target specified in the most basic space (i.e. […]

2. […] mentioned operational space (or task space) control of robotic systems before, in the context of learning the inverse kinematic transformation, but down here the approach is to analytically derive the dynamics of the system (as opposed to […]