## Gram-Schmidt orthogonalization

The context here is that we have some desired vector $v^*$ that we want to build out of a set of basis vectors $v_i$ through weighted summation. The case where this is easiest is when all of our vectors $v_i$ are orthogonal with respect to each other. Recalling that a dot product of two vectors gives us a measure of their similarity, two vectors are orthogonal if their dot product is 0. A basic example of this is the set $[1,0],[0,1]$, or the rotation of these vectors 45 degrees, $[.7071, .7071],[-.7071, .7071]$.

If we have an orthogonal basis set of vectors, then to generate the weights for each of the basis vectors we simply take the dot product between each $v_i$ and our desired vector $v^*$. For example, with our first basis set from above, the weights to generate the vector $[.45, -.8]$ can be found as

$w_1 = \langle [.45, -.8] , [1, 0] \rangle = .45 \\ w_2 = \langle [.45, -.8] , [0, 1] \rangle = -.8,$

where $\langle \cdot, \cdot \rangle$ denotes the dot (or inner) product. For the rotated basis set we instead have

$w_1 = \langle [.45, -.8] , [.7071, .7071] \rangle = -0.2475 \\ w_2 = \langle [.45, -.8] , [-.7071, .7071] \rangle = -0.8839.$

And now we have weights $w_i$ such that for each basis set $\sum_i w_i v_i = v^*$. Written generally, to find the weights we have $w_i = \frac{\langle v_i, v^* \rangle}{||v_i||^2}$. The denominator here is the squared norm of $v_i$, introduced for generality. In our examples the basis sets were composed of unit vectors (vectors with magnitude 1), so the denominator was simply 1, but in general the normalization is required.
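This weight calculation is only a few lines of NumPy; here's a quick sketch using both orthogonal basis sets from above:

```python
import numpy as np

v_star = np.array([0.45, -0.8])

# two orthogonal basis sets: the standard basis, and its 45 degree rotation
bases = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.7071, 0.7071], [-0.7071, 0.7071]]),
]

for basis in bases:
    # w_i = <v_i, v*> / ||v_i||^2; the denominator matters for non-unit vectors
    weights = np.array([v.dot(v_star) / v.dot(v) for v in basis])
    reconstruction = sum(w * v for w, v in zip(weights, basis))
    print(weights, reconstruction)  # reconstruction matches v_star both times
```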

Now, what if we don’t have an orthogonal basis set? Trouble, that’s what. With a non-orthogonal basis set, such as $[1, .4], [-.1, 1]$, when we try our dot product business to find our coefficients, look what happens:

$w_1 = \frac{\langle [.45, -.8] , [1, .4] \rangle}{||[1,.4]||^2} = .1121 \\ w_2 = \frac{\langle [.45, -.8] , [-.1, 1] \rangle}{||[-.1, 1]||^2} = -.8366,$

and

$.1121 \cdot [1,.4] + -.8366 \cdot [-.1, 1] = [0.1958, -0.7918]$,

which is not a good reconstruction of our desired vector, $[.45, -.8]$. And the more overlap there is between the contributions of different basis vectors to the same dimensions, the worse this gets. Of course, we could use a least squares method to find the coefficients, but that involves matrix multiplications and inverses, and generally becomes more complex than we want.

So, let’s say we have a basis set of two different, but non-orthogonal vectors, $v_1$ and $v_2$. We instead want two vectors $u_1$ and $u_2$ which describe the same space, but are orthogonal. By describing the same space, I mean that their span is the same. And by span I mean the set of values that can be generated through weighted summation of the two vectors. So we set $u_1 = v_1$, and the task is now to find the appropriate $u_2$. As a conceptual description, we want $u_2$ to be equal to $v_2$, but only covering the area of space that $u_1$ isn’t covering already. To do this, we can calculate the overlap between $u_1$ and $v_2$, then subtract out that overlap from $v_2$. The result should then give us the same area of state space covered by $v_1$ and $v_2$, but in a set of orthogonal vectors $u_1$ and $u_2$.

Mathematically, we calculate the overlap between $u_1$ and $v_2$ with a dot product, $\langle u_1, v_2 \rangle$, normalized by the squared magnitude of $u_1$, and then subtract that component along $u_1$ from $v_2$. All together we have

$u_2 = v_2 - \frac{\langle u_1, v_2 \rangle}{||u_1||^2} u_1$.

Using our non-orthogonal example above,

$u_1 = [1, .4]$

$u_2 = [-.1, 1] - \frac{\langle [1, .4] , [-.1, 1] \rangle}{||[1, .4]||^2} [1, .4] = [-0.3586, 0.8966].$

Due to roundoff in the displayed values of $u_2$, the dot product between $u_1$ and $u_2$ isn’t exactly zero, but it’s close enough for our purposes.

OK, great. But what about if we have 3 vectors in our basis set and want 3 orthogonal vectors (assuming we’ve moved to a 3-dimensional space) that span the same space? In this case, how do we calculate $u_3$? Well, carrying on with our intuitive description, you might assume that the calculation would be the same as for $u_2$, but now you must subtract out from $v_3$ the parts that are already covered by $u_1$ and $u_2$. And you would be correct:

$u_3 = v_3 - \frac{\langle u_1, v_3 \rangle}{||u_1||^2} u_1 - \frac{\langle u_2, v_3 \rangle}{||u_2||^2} u_2$.

In general, we have

$u_i = v_i - \sum_{j < i} \frac{\langle u_j, v_i \rangle}{||u_j||^2} u_j$.

And now we know how to take a given set of basis vectors and generate a set of orthogonal vectors that span the same space, which we can then use to easily calculate the weights over our new basis set of $u_i$ vectors to generate the desired vector, $v^*$.
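Putting the whole procedure together, here's a short NumPy sketch of Gram-Schmidt over an arbitrary number of basis vectors (function and variable names are mine):

```python
import numpy as np

def gram_schmidt(V):
    """Orthogonalize the rows of V; the returned rows span the same space."""
    U = []
    for v in V:
        u = v.astype(float).copy()
        for w in U:
            # subtract out the part already covered by each earlier u_j
            # (projecting the running u rather than v is the numerically
            # stabler "modified" Gram-Schmidt variant)
            u -= (w.dot(u) / w.dot(w)) * w
        U.append(u)
    return np.array(U)

# the non-orthogonal basis set from the example above
V = np.array([[1.0, 0.4], [-0.1, 1.0]])
U = gram_schmidt(V)
print(U)               # rows are now orthogonal
print(U[0].dot(U[1]))  # effectively zero
```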

## Likelihood calculus paper series review part 2 – Neuro-mechanical control using differential stochastic operators

The second paper put out by Dr. Terence Sanger in the likelihood calculus paper series is Neuro-mechanical control using differential stochastic operators. Building on the probabilistic representation of systems through differential stochastic operators presented in the last paper (Controlling variability, which I review here), Dr. Sanger starts exploring how one could effect control over a system whose dynamics are described in terms of these operators. Here, he specifically looks at driving a population of neurons described by differential stochastic operators to generate the desired system dynamics. Neural control of a system requires that several phenomena outside the realm of classical control theory be addressed, including the effects of variability in control due to stochastic firing, large partially unlabeled cooperative controllers, bandlimited control due to finite neural resources, and variation in the number of available neurons.

The function of a neuron in a control system can be completely described in terms of 1) the probability of it spiking due to the current system state, $p(s=1|x)$, and 2) the effect of its response on the change in the state. Due to the inherent uncertainty in these systems, each individual neuron’s effect on the change in state is captured by a distribution, $p(\dot{x}|s)$. And because the effect of each neuron is only a small part of a large dynamical system that includes the dynamics of the system being controlled and the effects of all the other neurons, these distributions tend to be very broad.

Rephrasing the above description, neurons are mapping a given state $x$ to a change in state $\dot{x}$. Instead of using two conditional densities to describe this mapping, $p(s|x)$ and $p(\dot{x}|s)$, we can rewrite this more compactly as

$p(\dot{x}|x) = p(\dot{x}|s=1)\;p(s=1|x) + p(\dot{x}|s=0)\;p(s=0|x)$,

which can be read as the probability of a change in state $\dot{x}$ given the current state $x$ is equal to the probability of that change in state occurring if the neuron spikes, $p(\dot{x}|s=1)$, multiplied by the probability of that neuron spiking given the current state, $p(s=1|x)$, plus the probability of that state occurring if the neuron doesn’t spike, $p(\dot{x}|s=0)$, multiplied by the probability of that neuron not spiking, $p(s=0|x)$.
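Numerically this is just a two-component mixture of densities. A toy sketch (the particular distributions and numbers here are invented for illustration):

```python
import numpy as np

dx = np.linspace(-3, 3, 601)  # grid of possible state changes
ddx = dx[1] - dx[0]

def gaussian(x, mu, sig):
    return np.exp(-(x - mu)**2 / (2 * sig**2)) / (sig * np.sqrt(2 * np.pi))

p_spike = 0.3                          # p(s=1|x) at some state x
p_dx_spike = gaussian(dx, 1.0, 0.8)    # p(dx|s=1): neuron pushes the state up
p_dx_nospike = gaussian(dx, 0.0, 0.5)  # p(dx|s=0): passive drift

# p(dx|x) = p(dx|s=1) p(s=1|x) + p(dx|s=0) p(s=0|x)
p_dx = p_dx_spike * p_spike + p_dx_nospike * (1 - p_spike)
print(np.sum(p_dx) * ddx)  # still integrates to ~1: a valid density
```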

### Differential stochastic operators

We want to capture the mapping $p(\dot{x}|x)$ in such a way that if we have a description of a bunch of controllers (such as neurons) and the change in system state they effect individually, we can combine them in a straightforward way to get the overall change in state resulting from all the controllers operating in parallel. To do this we can use the linear operators developed in the previous paper, which allow us to combine the effects of multiple components through simple summation to determine the overall change in system state. We’ll go over it again here, as I find reading through several different versions of an idea very helpful for solidifying understanding.

Let $\mathcal{L}$ denote a class of linear operators that act on time-varying probability densities, $p(x,t)$, such that $\frac{\partial}{\partial t}p(x,t) = \mathcal{L}p(x,t)$. Because these operators need to preserve the properties of a valid probability density (specifically that $\int p(x)dx = 1$ and $p(x) \geq 0$), for a given operator $L$ where $\dot{p}(x) = \int L(x,y) p(y) dy$ we require that:

1) $\int L(x,y)dx = 0$ for all $y$,
2) $L(x,y) \geq 0$ whenever $x \neq y$,

which respectively enforce the aforementioned constraints.

So, first things first. Let’s read out $\dot{p}(x) = \int L(x,y) p(y) dy$. This says that our change in the probability density, $\dot{p}(x)$, is found by taking the function that tells us the change in density at state $x$ given a current state $y$, which is $L(x,y)$, weighting that by the probability of currently being in state $y$, which is $p(y)$, and then summing that all up, which is the integral.

Now the constraints. The first constraint reads out as the integral of the changes of the probability density at each point $x$ for a given state $y$ must be equal to 0. This means that the area of the probability density over the states after updating them is the same. So, assuming we start out with a valid density whose sum equals 1, we always have a density whose sum equals 1.

The second constraint reads out as the change in probability density for state $x$ given a current state $y$ must be non-negative whenever $x \neq y$. This means the density at a state can only decrease if there is already some probability of being in that state; it enforces that all $p(x) \geq 0$, because $\dot{p}(x)$ can’t be negative when $p(x)$ is zero.

Dr. Sanger defines the linear operators that satisfy these two conditions to be “differential stochastic operators”. The discrete time versions are matrices, dubbed “difference stochastic operators”.
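In the discrete-time, discrete-state case both conditions are easy to check directly on the matrix. A small sketch with an invented 3-state operator:

```python
import numpy as np

# a difference stochastic operator on a 3-state system (values invented):
# columns sum to zero (condition 1), off-diagonals non-negative (condition 2)
L = np.array([[-0.2,  0.1,  0.0],
              [ 0.2, -0.3,  0.1],
              [ 0.0,  0.2, -0.1]])

assert np.allclose(L.sum(axis=0), 0)           # each column "integrates" to 0
assert np.all(L[~np.eye(3, dtype=bool)] >= 0)  # L(x,y) >= 0 for x != y

p = np.array([0.5, 0.3, 0.2])  # current density over the states
p_dot = L @ p                  # change in density
print(p_dot, p_dot.sum())      # total change is 0: normalization is preserved
```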

### Superposition of differential stochastic operators

Differential stochastic operators can be derived in different ways; here we’ll go through the derivation from the ‘master’ equation defining $p(\dot{x}|x)$, and from a stochastic differential equation. Each has its own insights, so it’s helpful to work through both.

#### Derivation from master equation

The equation for $p(\dot{x}|x)$ written out above,

$p(\dot{x}|x) = p(\dot{x}|s=1)\;p(s=1|x) + p(\dot{x}|s=0)\;p(s=0|x)$,

determines the probability flow on the state. By employing the Kramers-Moyal expansion we can capture this probability flow through a partial differential equation describing change in probability density. In other words, instead of capturing the time evolution of the system state with a probability density over possible changes in state, $\dot{x}$, we capture it through the change in probability of each state, $\dot{p}(x)$. The Kramers-Moyal expansion looks like:

$\frac{\partial}{\partial t}p(x,t) = -\frac{\partial}{\partial x}(D_1(x)p(x,t)) + \frac{1}{2}\frac{\partial^2}{\partial x^2}(D_2(x)p(x,t)) + ...$,

where $D_k(x) = E[(\dot{x})^k] = \int \dot{x}^k p(\dot{x}|x) d\dot{x}$. Truncating this expansion at the first two terms we get the Fokker-Planck equation, where the first term describes the drift of the density, and the second term the diffusion. These two terms are sufficient for describing Gaussian conditional densities, which capture many physical phenomena. In the case where $p(\dot{x}|x)$ does not have a Gaussian distribution, higher-order terms from the Kramers-Moyal expansion will need to be included.
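Since the coefficients $D_k$ are just moments of the conditional density $p(\dot{x}|x)$, they can be estimated directly from samples of $\dot{x}$ at a given state. A toy sketch (the drift and noise values are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
# samples of dx at some fixed state x, here drawn from a Gaussian with
# drift 0.5 and noise scale 0.2 (invented numbers for illustration)
dx_samples = rng.normal(loc=0.5, scale=0.2, size=100_000)

D1 = np.mean(dx_samples)     # drift term: E[dx]
D2 = np.mean(dx_samples**2)  # diffusion term: E[dx^2]
print(D1, D2)  # ~0.5 and ~0.29 (= 0.5^2 + 0.2^2)
```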

Now, imagine we have a simple system of two neurons, where the change in state is defined as $\dot{x} = \dot{x}_1 + \dot{x}_2$. If these neurons have conditionally independent variability, i.e. $p(\dot{x}_1, \dot{x}_2 | x) = p(\dot{x}_1|x)\;p(\dot{x}_2|x)$, then we can sum the Kramers-Moyal expansions of each of these terms to describe the evolution of the overall system state:

$\frac{\partial}{\partial t}p(x,t) = - \sum_i \frac{\partial}{\partial x}(D_{1i}(x)p(x,t)) + \frac{1}{2} \sum_i \frac{\partial^2}{\partial x^2}(D_{2i}(x)p(x,t)) + ...$,

as long as the neurons have conditionally independent variability. This means that they can’t be connected (directly or indirectly) such that a spike in one neuron causes a spike in the other. While this might seem a poor assumption for modeling networks of spiking neurons, in large populations with many inputs the effect of any single input neuron tends to be small enough that the assumption holds approximately.

We can rewrite the previous equation now, taking advantage of the linearity of the Kramers-Moyal expansion in $p(x,t)$, to get

$\dot{p}(x,t) = \mathcal{L}p = \sum_i \mathcal{L}_i p$,

which means that by describing neurons with the differential stochastic operators, $\mathcal{L}_i$, we can determine the cumulative effect on the dynamics of the system through simple summation. Which we all remember from the last paper, but hey, good to review.
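With the discrete-time matrix versions this superposition is literal: sum the matrices, then step the density forward. A sketch with two invented operators:

```python
import numpy as np

# difference stochastic operators for two conditionally independent
# controllers (values invented for illustration)
L1 = np.array([[-0.3,  0.1],
               [ 0.3, -0.1]])
L2 = np.array([[-0.1,  0.2],
               [ 0.1, -0.2]])
L = L1 + L2  # the combined effect on the dynamics is just the sum

p = np.array([0.9, 0.1])  # initial density over the two states
dt = 0.01
for _ in range(2000):
    p = p + dt * (L @ p)  # Euler step of p_dot = L p

print(p, p.sum())  # p relaxes toward the stationary density; sum stays 1
```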

Now, in the event that we want to model the effect of a group of strongly interconnected neurons, we can instead consider the effect of the group as a function of the $2^n$ possible firing patterns (spike or no spike from each of the neurons). So where before $p(\dot{x}| x)$ was written in terms of the two cases $s = 0$ and $s = 1$, it would now be written:

$p(\dot{x}|x) = \sum_{i=1}^{2^n} p(\dot{x}|x,i)\;p(i|x)$,

where each $i$ is a different spike pattern. This group of neurons and their effect on the system dynamics is then considered as a single unit, and the differential stochastic operator describing them can then be combined with the operator from another group of neurons, provided that the two groups exhibit conditionally independent variability.

If there are no independent groups in the network, then it’s fully connected, and this approach is not for you. Go home.

#### Derivation from a stochastic differential equation

For this derivation, imagine a set of controllers operating in parallel. Each controller has a stochastic differential equation that defines how a control signal input affects the system dynamics,

$dx = f_i(x)dt + g_i(x)dB_i$,

where the $f_i$ and $g_i$ are arbitrary functions and the $dB_i$ are random functions of time, or noise terms. Let $f_i(x)dt$ for $i > 0$ be the controller equations, and $f_0(x)dt$ be the unforced, or passive, dynamics of the system, which define how the system behaves without controller input. We can write the equation for the whole system as

$dx = f_0(x)dt + \sum_{i>0}u_i(f_i(x)dt + g_i(x)dB_i)$,

where $u_i$ are constant or slowly varying control inputs. The reason we would choose to denote the unforced (passive) dynamics of the system as $f_0$ is because we can now define $u_0 = 1$, and rewrite the above equation as

$dx = \sum_{i}u_i(f_i(x)dt + g_i(x)dB_i)$.

The corresponding Fokker-Planck equation for the evolution of the state probability density is

$\frac{\partial}{\partial t}p(x,t) = - \sum_i u_i \frac{\partial}{\partial x}(f_i(x)p(x,t)) + \frac{1}{2} \sum_i u_i \frac{\partial^2}{\partial x^2}(g_i(x)p(x,t))$.

Look familiar? We can rewrite this as a superposition of linear operators

$\dot{p}(x,t) = \mathcal{L}p = \sum_i u_i \mathcal{L}_i p$,

and there you go.

### Population model

So, now we can apply this superposition of differential stochastic operators to describe the effect of a population of neurons on a given system. Dr. Sanger lists several ways that this model can be controlled: 1) modifying the tuning curves of the neurons, which specify how they respond to stimulus; 2) modifying the output functions that determine the effect a neuron has on the dynamics of the system; and 3) modifying the firing threshold of the neurons.

I found that the difference between 1 and 3 can be a little confusing, so let’s look at an example tuning curve in 2D space to try to make this clear. Imagine a neuron sensitive to 2-dimensional input signals, and that its tuning curve looks like this:

If we’re changing the tuning curve, then how this neuron responds to its input stimulus will change. For example, in this diagram we show changing the tuning curve of a neuron:

Here, the neuron no longer responds to the same type of input that it responded to previously. We have made a qualitative change to the type of stimulus this neuron responds to.

If we change the firing threshold, however, then what we’re changing is when the neuron starts responding to stimulus that it is sensitive to.

Here we show the neuron becoming more and more sensitive to its stimulus, responding more strongly sooner and sooner. So the type of signal that the neuron responds to isn’t changing, but rather when the neuron starts responding.

Alright, now that we’ve got that sorted out, let’s move on.

Tuning curves (1) and output functions (2) are both modifiable through learning, by changing the incoming and outgoing connection weights, respectively, but for controlling systems on the fly this is going to be too slow, i.e. slower than the speed at which the system moves. So what’s left is (3), modifying the firing threshold of the neurons. The model then looks like:

where $p(x)$ is projected into a population of neurons, each with a differential stochastic operator; these sum together to generate $\dot{p}(x)$. In this diagram, $\lambda_i$ is the firing threshold of neuron $i$, and $\lambda_i(x)$ denotes the modulation of the firing rate of neuron $i$ as a function of the current system state. When the modulation is dependent on the system state we have a feedback, or closed-loop, control system. Dr. Sanger notes that in the case that $\lambda_i$ is heavily dependent on $x$, modulating the firing threshold is indistinguishable from modifying the tuning curve, meaning that we can get some pretty powerful control out of this.

### Conclusions

‘The theory of differential stochastic operators links the dynamics of individual neurons to the dynamics of a full neuro-mechanical system. The control system is a set of reflex elements whose gain is modulated in order to produce a desired dynamics of the overall system.’

This paper presents a very different formulation of control than classical control theory. Here, the goal is to modulate the dynamics of the system to closely match a desired set of dynamics that achieve some task, rather than to minimize the deviation from some prespecified trajectory through state space. Dr. Sanger notes that this description of control matches well to behavioral descriptions of neural control systems, where there are numerous subsystems and circuits that have reflexive responses to external stimuli which must be modulated to achieve desired behavior. The goal of control is to fully define the dynamics of the system’s reaction to the environment.

What comes to mind first for me, in terms of using the modulation of reflex elements to effect a desired set of dynamics, is modeling the spinocerebellum. With a ton of projections directly to the spinal cord, and strong implications in the locomotor and balance systems, it seems like it would be a very good candidate for being modeled with this type of control. The idea being that the cerebellum is projecting modulatory values to the different spinal circuits (reflex networks and central pattern generators, for example) that specify how to respond to changes in the environment to maintain our balance or the rhythm of our walk. How we go about specifying exactly what those modulatory terms need to be is something that Dr. Sanger tackles in the last paper of this series, which I’ll be reviewing in the next couple of months. I’m looking forward to it.

On another note, in my lab we all work with the Neural Engineering Framework, in which populations of neurons are taken to represent vectors, and to perform transformations on these vectors to relay information and perform various functions. To this end, something that interests me about likelihood calculus is its application to populations of neurons representing vectors. Instead of finding $p(\dot{x}|x)$ by summing the effects of all of the neurons in a population, or defining it in terms of the population spiking patterns, we’re looking at it in terms of the different vectors this population can represent, and the effect of that vector on the system. So we can still have spiking neuron based models, but we can do all the likelihood calculus business one level removed, simplifying the calculation and reducing the number of differential stochastic operators needed.

There are a ton of things going on in this paper, and lots to think about. At several points I deviated from the notation used in this paper because I found it unclear, but aside from that I’ve really enjoyed reading through it and writing it up. Very interesting work.

Sanger TD (2010). Neuro-mechanical control using differential stochastic operators. Conference Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2010, 4494-7. PMID: 21095779

## Online Goal Babbling – motor learning paper review

Recently I was linked to an article about learning how to control a highly complex arm from scratch: How infants tell us how to control the Bionic Handling Assistant. The work seemed very interesting so I went and pulled one of the papers linked in the article, Online Goal Babbling for rapid bootstrapping of inverse models in high dimensions.

Diving into that title, online means that we’re using information from every movement as it’s gathered to improve our control, as opposed to ‘batch’ where learning only occurs every so-many trials. Bootstrapping is the process of bringing a system up to a functionally useful level. High dimensions then refers to the complexity of the system being controlled, where every component that requires a control signal is another dimension. Humans, for example, require extremely high dimensional control signals. Inverse models refer to a type of internal model, which ‘describe relations between motor commands and their consequences’. Forward models predict the results of a movement, and inverse models suggest a motor command that can be used to achieve a desired consequence, such as ‘the ball falls on the floor’ or ‘my hand touches the red balloon’.

Alright, so that’s the title, let’s dive into the paper proper.

### Online goal babbling

The idea behind this research is to let the system learn how to control itself by exploring the environment. The reason why you would want to do this is so that you don’t have to analytically specify how the system should move. Analytic solutions require accurate models of the system dynamics, and calculating these quickly becomes horrendously complex, to the point that the equations of motion for a relatively simple 3-link arm run pages upon pages upon pages. On top of this, because your solution is only as good as your model, and the model you just took all that time to calculate isn’t adaptive, if your system dynamics change at all through wear and tear or an environment change, you’re in big trouble. You can use feedback control to help compensate for the errors introduced, but you can only respond as fast as you can receive and process sensory signals, which is often too long in practical applications.

So this moves us to adaptive feedforward control, based on learning an inverse model of the system dynamics. Importantly, what this paper describes is a kinematic inverse controller, not a kinetic inverse controller; meaning that given a desired target position for the end effector (hand) of an arm it provides a sequence of joint angle sets that will lead to the end effector reaching the target, as opposed to providing a sequence of joint angle torques to drive the system to the target. At some point, a control system will have to generate force commands, rather than just specify what position it wants the system to be in. But, knowing what joint angles and trajectory the joint angles should follow is an important problem, because systems that are of interest to control tend to exhibit a fair amount of configuration redundancy between ‘task-space’ and the actual system state-space. Task-space being something like the 3-dimensional $(x,y,z)$ position of the end-effector, which we are interested in controlling, and the actual system state-space being something like the joint-angles or muscle-lengths. Configuration redundancy is the problem of more than one possible set of joint-angles putting the end-effector in the desired location. Often the number of potential solutions is incredibly large; think about the number of different ways you can reach for an object. So how do you learn an appropriate trajectory for your joint angles during a reaching movement? That is the problem being addressed here.

To learn an inverse model from scratch, the system needs to explore. How should it explore? Moving randomly eventually leads to an exhaustive search, but this is a poor choice in complex systems because it takes a large amount of time, increasing exponentially with the degrees of freedom (i.e. number of joint angles) of the system. So let’s look to babies: they’re responsible for learning how to control a very complex system, so how the heck do they learn to move?

‘Motor babbling’ is a term that was coined to describe the seemingly random way babies moved about. It was suggested that they just flail about without purpose until they gain some understanding of how their bodies work, at which point they can start moving with purpose. But! Baby movement was shown way back in the 80’s to in fact not be just random, but instead to be highly goal directed. And when they find a solution, they stick with it, they don’t worry about it being the best one as long as it gets the job done. Only later are movements tweaked to be more efficient. As mentioned above, in systems as complicated as the human body the task-space (i.e. position of the hand) is much smaller than the motor space (i.e. length of all muscles), and there are a bunch of different solutions to a given task. With all these different potential solutions to a given problem, an exhaustive search isn’t even close to being necessary. Also, if babies aren’t just randomly exploring space to figure things out, they don’t have to flip a switch somewhere in their brain that says “ok stop messing around and let’s move with purpose”.

This paper provides a technique for stable online inverse model learning that can be used from initial bootstrapping, and shows how online learning can even be highly beneficial for bootstrapping. So let’s dive into the Online Goal Babbling learning method.

### The inverse model function

Let’s denote the inverse model function we’re learning as $g()$, joint angles as $q$, and end effector positions as $x$. Then to denote giving a desired end effector position and getting out a set of joint angles we write $g(x^*) = q$, where $x^*$ represents a target in end effector space.

We’re going to initialize the inverse model function by having every end effector position return some default resting state of the arm, or home position, that we’ve defined, $q_{home}$. Additionally, we’re going to throw on some exploratory noise to the answer provided by the inverse model, so that the joint angles to move to at time $t$ are defined as $q_t = g(x^*_t, \theta_t) + E_t(x_t^*)$, where $E_t(x^*_t)$ is the exploratory noise. When the system then moves to $q_t$ the end effector position, $x_t$, is observed and the parameters of the inverse model, $\theta$, are updated immediately.

To generate a smooth path from the starting position to the target, the system divides the distance up into a number of sub-targets (25 in the paper) for the system to hit along the way. This is an important point, as it’s how the inverse model function is used to create a path that represents the system moving; a series of targets is provided, and at each one the inverse model is queried: “what should the joint configuration be for this position?”
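Generating the sub-targets is a simple linear interpolation in end effector space; a sketch (the paper's exact scheme may differ):

```python
import numpy as np

def sub_targets(x_start, x_target, n=25):
    """n evenly spaced waypoints from x_start to x_target, endpoints included."""
    return np.linspace(x_start, x_target, n)

# each row would be passed to the inverse model g() in turn
waypoints = sub_targets(np.array([0.0, 0.0]), np.array([0.5, 1.0]))
print(waypoints.shape)  # (25, 2)
```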

As mentioned before, it is possible to have the end effector in the same position, but the joints in a different configuration. Learning across these examples is dangerous and can lead to instability in the system as neighbouring targets could be learned with very dissimilar joint configurations, preventing smooth movement through joint-space. To prevent this, observed information is weighted by a term $w_t$ as it is taken in, based on deviation from the resting position of the arm ($q_{home}$) and efficiency of end effector movement. What this leads to is a consistent solution being converged upon throughout movement space, causing the inverse model to generate smooth, comfortable (i.e. not near joint limits, but near the resting state of the arm) movements.

To recenter movement exploration every so often, and prevent the system from exploring heavily in some obscure joint-space, every time a new target to move to is selected there is some probability (.1 in the paper) that the system will return to $q_{home}$. To return home the system traces out a straight trajectory in joint-space, not worrying about how it is moving through end effector space. This return probability reinforces learning how to move well near $q_{home}$, and acts as a ‘developmentally plausible stabilizer that helps to stay on known areas of the sensorimotor space.’

Also, we mentioned adding exploratory noise. How is that noise generated? The noise is calculated through a small, randomly chosen linear function that varies slowly over time: $E_t(x^*_t) = A_t \cdot x^*_t + c_t$, where the entries of the matrix $A_0$ and the vector $c_0$ are chosen independently from a normal distribution with zero mean and variance $\sigma^2$. To vary the noise over time, a set of small values is chosen from a normal distribution with variance significantly smaller than $\sigma^2$ and added to the elements of $A_t$ and $c_t$. A normalization term is also used to keep the overall deviation stable around the standard deviation $\sigma$. And that’s how we get our slowly changing linear random noise function.
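Here's a sketch of that noise process; $\sigma$, the drift scale, and the dimensions are all invented values, and the rescaling shown is just one simple way to keep the parameters' overall deviation near $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1          # overall noise scale (invented value)
dim_x, dim_q = 2, 3  # end effector and joint space dimensions (invented)

# slowly varying linear noise function E_t(x) = A_t x + c_t
A = rng.normal(0.0, sigma, (dim_q, dim_x))
c = rng.normal(0.0, sigma, dim_q)

def step_noise(A, c, drift=0.001):
    """Random-walk A and c, then rescale so their deviation stays near sigma."""
    A = A + rng.normal(0.0, drift, A.shape)
    c = c + rng.normal(0.0, drift, c.shape)
    params = np.concatenate([A.ravel(), c])
    scale = sigma * np.sqrt(params.size) / np.linalg.norm(params)
    return A * scale, c * scale

for _ in range(100):
    A, c = step_noise(A, c)

x_target = np.array([0.4, 0.2])
E = A @ x_target + c  # exploratory noise to add to g(x_target)
print(E)
```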

### Online learning

To learn the inverse model function, we’re going to use the technique of creating a bunch of linear models that are accurate in a small area of state space, and weighting their contribution to the output based on how close we are to their ‘region of validity’. The specific method used in the paper is a local-linear map, from (H. Ritter, Learning with the self-organizing map, 1991). We define our linear models as $g^{(k)}(x) = M^{(k)} \cdot x + b^{(k)}$, intuited as following the standard $mx + b$ definition of a line, but moved into multiple dimensions. $M^{(k)}$ is our linear transformation of the input for model $k$, and $b^{(k)}$ is the offset.
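To make the local-linear blending concrete, here's a minimal sketch; note that the Gaussian responsibility function and its radius are placeholder choices of mine for the 'region of validity' weighting, not necessarily what Ritter's method uses:

```python
import numpy as np

class LocalLinearModel:
    """One local model g_k(x) = M x + b, valid near its center."""
    def __init__(self, center, M, b, radius=0.3):
        self.center, self.M, self.b, self.radius = center, M, b, radius

    def weight(self, x):
        # responsibility falls off with distance from the model's center
        d = np.linalg.norm(x - self.center)
        return np.exp(-d**2 / (2 * self.radius**2))

    def predict(self, x):
        return self.M @ x + self.b

def g(x, models):
    """Blend the local models' outputs, weighted by their responsibilities."""
    weights = np.array([m.weight(x) for m in models])
    preds = np.array([m.predict(x) for m in models])
    return weights @ preds / weights.sum()

# two invented local models over a 2D space
models = [LocalLinearModel(np.array([0.0, 0.0]), np.eye(2), np.zeros(2)),
          LocalLinearModel(np.array([1.0, 1.0]), 2 * np.eye(2), np.zeros(2))]
print(g(np.array([0.1, 0.1]), models))
```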

If an input is received that is outside of the region of validity for all of the local linear models, then another one is added to improve the approximation of the function at that point. New local models are initiated with the Jacobian matrix $J(x) = \frac{\partial g(x)}{\partial x}$ of the inverse estimate. In other words, we look at how the approximation of the function is changing as we move in this $x$ direction in state space, or estimate the derivative of the function, and use that to set the initial value of our new linear model.

To update our inverse model function, we fit it to the current example $(x_t, q_t)$ by reducing the weighted square error: $err = w_t \cdot ||q_t - g(x_t)||^2$. With this error, a standard gradient descent method is used to update the slopes and offsets of the dimensions of the linear models:

$M^{(k)}_{t+1} = M^{(k)}_t - \eta \frac{\partial err}{\partial M^{(k)}_t}, \;\;\;\;\; b^{(k)}_{t+1} = b^{(k)}_{t} - \eta \frac{\partial err}{\partial b^{(k)}_{t}}$,

where $\eta$ is the learning rate.
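Writing those gradients out explicitly for one local model (the derivation here is mine, from the weighted error above: with $e = q - (Mx + b)$ we get $\frac{\partial err}{\partial M} = -2w\,e\,x^T$ and $\frac{\partial err}{\partial b} = -2w\,e$):

```python
import numpy as np

def update_model(M, b, x, q, w, eta=0.1):
    """One gradient descent step on err = w * ||q - (M x + b)||^2."""
    e = q - (M @ x + b)                     # prediction error on this example
    M = M + eta * 2.0 * w * np.outer(e, x)  # M -= eta * d err / d M
    b = b + eta * 2.0 * w * e               # b -= eta * d err / d b
    return M, b

# toy 2D -> 2D fit to a single example (numbers invented)
M, b = np.zeros((2, 2)), np.zeros(2)
x, q = np.array([0.5, 1.0]), np.array([0.3, -0.2])
for _ in range(200):
    M, b = update_model(M, b, x, q, w=1.0)
print(M @ x + b)  # has converged to q on this single example
```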

### Results

So here are some figures of the results of applying the algorithm, and they’re pretty impressive. First we see learning the inverse model for a 5-link arm:

And here are some results looking at varying the learning rate (x-axis) and the effects on (a) time until 10% error reached, (b) final performance error, and (c) final average distance from $q_{home}$, the resting state configuration.