## Natural policy gradient in TensorFlow

In working towards reproducing some results from deep learning control papers, one of the learning algorithms that came up was natural policy gradient. The basic idea of natural policy gradient is to use the curvature information of the of the policy’s distribution over actions in the weight update. There are good resources that go into details about the natural policy gradient method (such as here and here), so I’m not going to go into details in this post. Rather, possibly just because I’m new to TensorFlow, I found implementing this method less than straightforward due to a number of unexpected gotchyas. So I’m going to quickly review the algorithm and then mostly focus on the implementation issues that I ran into.

If you’re not familiar with policy gradient or various approaches to the cart-pole problem I also suggest that you check out this blog post from KV Frans, which provides the basis for the code I use here. All of the code that I use in this post is available up on my GitHub, and it’s worth noting this code is for TensorFlow 1.7.0.

Natural policy gradient review

Let’s informally define our notation

• $s_t$ and $a_t$ are the system state and chosen action at time $t$
• $\theta$ are the network parameters that we’re learning
• $\pi_\theta$ is the control policy network parameterized $\theta$
• $A^\pi(s_t, a_t, t)$ is the advantage function, which returns the estimated total reward for taking action $a_t$ in state $s_t$ relative to default behaviour (i.e. was this action better or worse than what we normally do)

The regular policy gradient is calculated as:

$\textbf{g} = \frac{1}{T}\sum_{t=0}^T \nabla_\theta \; \textrm{log} \; \pi_\theta(a_t | s_t) \; \hat{A}^\pi(s_t, a_t, t),$

where $\nabla_\theta$ denotes calculating the partial derivative with respect to $\theta$, the hat above $A^\pi$ denotes that we’re using an estimation of the advantage function, and $\pi_\theta(a_t | s_t)$ returns a scalar value which is the probability of the chosen action $a_t$ in state $s_t$.

For the natural policy gradient, we’re going to calculate the Fisher information matrix, $F$, which estimates the curvature of the control policy:

$\hat{F}_{\theta_k} = \frac{1}{T} \sum_{t=0}^T \nabla_\theta \; \textrm{log} \; \pi_\theta(a_t | s_t) \; \nabla_\theta \; \textrm{log} \; \pi_\theta (a_t | s_t)^T.$

The goal of including curvature information is to be able move along the steepest ascent direction, which we estimate with $\hat{\textbf{F}}^{-1}_{\theta_k}$. Including this then the weight update becomes

$\theta_{k+1} = \theta_k + \alpha \hat{\textbf{F}}^{-1}_{\theta_k} \textbf{g},$

where $\alpha$ is a learning rate parameter. In his paper Towards Generalization and Simplicity in Continuous Control, Aravind Rajeswaran notes that empirically it’s difficult to choose an appropriate $\alpha$ value or set an appropriate schedule. To address this, they normalize the update under the Fisher metric:

$\theta_{k+1} = \theta_k + \sqrt{\frac{\delta}{\textbf{g}^T\hat{\textbf{F}}^{-1}_{\theta_k}\textbf{f}}} \; \hat{\textbf{F}}^{-1}_{\theta_k} \textbf{g}.$

Using the above for the learning rate can be thought of as choosing a step size $\delta$ that is normalized with respect to the change in the control policy. This is beneficial because it means that we don’t do any parameter updates that will drastically change the output of the policy network.

Here’s what the pseudocode for the algorithm looks like

• Initialize policy parameters $\theta_0$
• for k = 1 to K do:
• collect N trajectories by rolling out the stochastic policy $\pi_{\theta_k}$
• compute $\nabla_\theta \; \textrm{log} \; \pi_\theta(a_t | s_t)$ for each $(s, a)$ pair along the trajectories sampled
• compute advantages $\hat{A}^\pi_k$ based on the sampled trajectories and the estimated value function $V^\pi_{k-1}$
• compute the policy gradient $g$ as specified above
• compute the Fisher matrix and perform the weight update as specified above
• update the value network to get $V^\pi_k$

OK great! On to implementation.

Implementation in TensorFlow 1.7.0

Calculating the gradient at every time step

In TensorFlow, the tf.gradient(ys, xs) function, which calculates the partial derivative of  ys with respect to xs:

$\nabla_{X} Y$

will automatically sum over over all elements in ys. There is no function for getting the gradient of each individual entry (i.e. the Jacobian) in ys, which is less than great. People have complained, but until some update happens we can address this with basic list comprehension to iterate through each entry in the history and calculate the gradient at that point in time. If you have a faster means that you’ve tested please comment below!

The code for this looks like

        g_log_prob = tf.stack(
for i in range(action_log_prob_flat.get_shape()[0])])


Where the result is called g_log_prob_flat because it’s the gradient of the log probability of the chosen action at each time step.

Inverse of a positive definite matrix with rounding errors

One of the problems I ran into was in calculating the normalized learning rate $\sqrt{\frac{\delta}{\textbf{g}^T\hat{\textbf{F}}^{-1}_{\theta_k}\textbf{f}}}$ was that the square root calculation was dying because negative values were being set in. This was very confusing because $\hat{\textbf{F}}$ is a positive definite matrix, which means that $\hat{\textbf{F}}^{-1}$ is also positive definite, which means that $\textbf{x}^T \hat{\textbf{F}}^{-1} \textbf{x} \geq 0$ for any $\textbf{x}$.

So it was clear that it was a calculation error somewhere, but I couldn’t find the source of the error for a while, until I started doing the inverse calculation myself using SVD to explicitly look at the eigenvalues and make sure they were all > 0. Turns out that there were some very, very small eigenvalues with a negative sign, like -1e-7 kind of range. My guess (and hope) is that these are just rounding errors, but they’re enough to mess up the calculations of the normalized learning rate. So, to handle this I explicitly set those values to zero, and that took care of that.

        S, U, V = tf.svd(F)
atol = tf.reduce_max(S) * 1e-6
S_inv = tf.divide(1.0, S)
S_inv = tf.where(S < atol, tf.zeros_like(S), S_inv)
S_inv = tf.diag(S_inv)
F_inv = tf.matmul(S_inv, tf.transpose(U))
F_inv = tf.matmul(V, F_inv)


Manually adding an update to the learning parameters

Once the parameter update is explicitly calculated, we then need to update them. To implement an explicit parameter update in TensorFlow we do

        # update trainable parameters
my_variables[0] = tf.assign_add(my_variables[0], update)


So to update the parameters then we call sess.run(my_variables, feed_dict={...}). It’s worth noting here too that any time you run the session with my_variables the parameter update will be calculated and applied, so you can’t run the session with my_variables to only fetch the current values.

Results on the CartPole problem

Now running the original policy gradient algorithm against the natural policy gradient algorithm (with everything else the same) we can examine the results of using the Fisher information matrix in the update provides some strong benefits. The plot below shows the maximum reward received in a batch of 200 time steps, where the system receives a reward of 1 for every time step that the pole stays upright, and 200 is the maximum reward achievable.

To generate this plot I ran 10 sessions of 300 batches, where each batch runs as many episodes as it takes to get 200 time steps of data. The solid lines are the mean value at each epoch across all sessions, and the shaded areas are the 95% confidence intervals. So we can see that the natural policy gradient starts to hit a reward of 200 inside 100 batches, and that the mean stays higher than normal policy gradient even after 300 batches.

It’s also worth noting that for both the policy gradient and natural policy gradient the average time to run a batch and weight update was about the same on my machine, between 150-180 milliseconds.

The modified code for policy gradient, the natural policy gradient, and plotting code are all up on my GitHub.

Other notes

• Had to reduce the $\delta$ to 0.001 (from 0.05 which was used in the Rajeswaran paper) to get the learning to converge, suspect this is because of the greatly reduced complexity of the space and control problem.
• The Rajeswaran paper algorithm does gradient ascent instead of descent, which is why the signs are how they are.

## Building a spiking neural model of adaptive arm control

About a year ago I published the work from my thesis in a paper called ‘A spiking neural model of adaptive arm control’. In this paper I presented the Recurrent Error-driven Adaptive Control Hierarchy (REACH) model. The goal of the model is to begin working towards reproducing behavioural level phenomena of human movement with biologically plausible spiking neural networks.

To do this, I start by using three methods from control literature (operational space control, dynamic movement primitives, and non-linear adaptive control) to create an algorithms level model of the motor control system that captures behavioural level phenomena of human movement. Then I explore how this functionality could be mapped to the primate brain and implemented in spiking neurons. Finally, I look at the data generated by this model on the behavioural level (e.g. kinematics of movement), the systems level (i.e. analysis of populations of neurons), and the single-cell level (e.g. correlating neural activity with movement parameters) and compare/contrast with experimental data.

By having a full model framework (from observable behaviour to neural spikes) is to have a more constrained computational model of the motor control system; adding lower-level biological constraints to behavioural models and higher-level behavioural constraints to neural models.

In general, once you have a model, the critical next step is to generating testable predictions that can be used to discriminate between other models with different implementations or underlying algorithms. Discriminative predictions allow us to design experiments that can gather evidence in favour or against different hypotheses of brain function, and provide clues to useful directions for further research. Which is the real contribution of computational modeling.

So that’s a quick overview of the paper; there are quite a few pages of supplementary information that describe the details of the model implementation, and I provided the code and data used to generate the data analysis figures. However, code to explicitly run the model on your own has been missing. As one of the major points of this blog is to provide code for furthering research, this is pretty embarrassing. So, to begin to remedy this, in this post I’m going to work through a REACH framework for building models to control a two-link arm through reaching in a line, tracing a circle, performing the centre-out reaching task, and adapting online to unexpected perturbations during reaching imposed by a joint-velocity based forcefield.

This post is directed towards those who have already read the paper (although not necessarily the supplementary material). To run the scripts you’ll need Nengo, Nengo GUI, and NengoLib all installed. There’s a description of the theory behind the Neural Engineering Framework, which I use extensively in my Nengo modeling, in the paper. I’m hoping that between that and code readability / my explanations below that most will be comfortable starting to play around with the code. But if you’re not, and would like more resources, you can check out the Getting Started page on the Nengo website, the tutorials from How To Build a Brain, and the examples in the Nengo GUI.

You can find all of the code up on my GitHub.

NOTE: I’m using the two-link arm (which is fully implemented in Python) instead of the three-link arm (which has compile issues for Macs) both so that everyone should be able to run the model arm code and to reduce the number of neurons that are required for control, so that hopefully you can run this on you laptop in the event that you don’t have a super power Ubuntu work station. Scaling this model up to the three-link arm is straight-forward though, and I will work on getting code up (for the three-link arm for non-Mac users) as a next project.

Implementing PMC – the trajectory generation system

I’ve talked at length about dynamic movement primitives (DMPs) in previous posts, so I won’t describe those again here. Instead I will focus on their implementation in neurons.

def generate(y_des, speed=1, alpha=1000.0):
beta = alpha / 4.0

# generate the forcing function
forces, _, goals = forcing_functions.generate(
y_des=y_des, rhythmic=False, alpha=alpha, beta=beta)

# create alpha synapse, which has point attractor dynamics
tau = np.sqrt(1.0 / (alpha*beta))
alpha_synapse = nengolib.Alpha(tau)

net = nengo.Network('PMC')
with net:
net.output = nengo.Node(size_in=2)

# create a start / stop movement signal
time_func = lambda t: min(max(
(t * speed) % 4.5 - 2.5, -1), 1)

def goal_func(t):
t = time_func(t)
if t <= -1:
return goals[0]
return goals[1]
net.goal = nengo.Node(output=goal_func, label='goal')

# -------------------- Ramp ---------------------------
ramp_node = nengo.Node(output=time_func, label='ramp')
net.ramp = nengo.Ensemble(
n_neurons=500, dimensions=1, label='ramp ens')
nengo.Connection(ramp_node, net.ramp)

# ------------------- Forcing Functions ---------------
def relay_func(t, x):
t = time_func(t)
if t <= -1:
return [0, 0]
return x
# the relay prevents forces from being sent on reset
relay = nengo.Node(output=relay_func, size_in=2)

domain = np.linspace(-1, 1, len(forces[0]))
x_func = interpolate.interp1d(domain, forces[0])
y_func = interpolate.interp1d(domain, forces[1])
transform=1.0/alpha/beta
nengo.Connection(net.ramp, relay[0], transform=transform,
function=x_func, synapse=alpha_synapse)
nengo.Connection(net.ramp, relay[1], transform=transform,
function=y_func, synapse=alpha_synapse)
nengo.Connection(relay, net.output)

nengo.Connection(net.goal[0], net.output[0],
synapse=alpha_synapse)
nengo.Connection(net.goal[1], net.output[1],
synapse=alpha_synapse)
return net


The generate method for the PMC takes in a desired trajectory, y_des, as a parameter. The first thing we do (on lines 5-6) is calculate the forcing function that will push the DMP point attractor along the desired trajectory.

The next thing (on lines 9-10) is creating an Alpha (second-order low-pass filter) synapse. By writing out the dynamics of a point attractor in Laplace space, one of the lab members, Aaron Voelker, noticed that the dynamics could be fully implemented by creating an Alpha synapse with an appropriate choice of tau. I walk through all of the math behind this in this post. Here we’ll use that more compact method and project directly to the output node, which improves performance and reduces the number of neurons.

Inside the PMC network we create a time_func node, which is the pace-setter during simulation. It will output a linear ramp from -1 to 1 every few seconds, with the pace set by the speed parameter, and then pause before restarting.

We also have a goal node, which will provide a target starting and ending point for the trajectory. Both the time_func and goal nodes are created and used as a model simulation convenience, and proposed to be generated elsewhere in the brain (the basal ganglia, why not? #igotreasons #provemewrong).

The ramp ensemble is the most important component of the trajectory generation system. It takes the output from the time_func node as input, and generates the forcing function which will guide our little system through the trajectory that was passed in. The ensemble itself is nothing special, but rather the function that it approximates on its outgoing connection. We set up this function approximation with the following code:

        domain = np.linspace(-1, 1, len(forces[0]))
x_func = interpolate.interp1d(domain, forces[0])
y_func = interpolate.interp1d(domain, forces[1])
transform=1.0/alpha/beta
nengo.Connection(net.ramp, relay[0], transform=transform,
function=x_func, synapse=alpha_synapse)
nengo.Connection(net.ramp, relay[1], transform=transform,
function=y_func, synapse=alpha_synapse)
nengo.Connection(relay, net.output)


We want the forcing function be generated as the signals represented in the ramp ensemble moves from -1 to 1. To achieve this, we create interpolation functions, x_func and y_func, which are set up to generate the forcing function values mapped to input values between -1 and 1. We pass these functions into the outgoing connections from the ramp population (one for x and one for y). So now when the ramp ensemble is representing -1, 0, and 1 the output along the two connections will be the starting, middle, and ending x and y points of the forcing function trajectory. The transform and synapse are set on each connection with the appropriate gain values and Alpha synapse, respectively, to implement point attractor dynamics.

NOTE: The above DMP implementation can generate a trajectory signal with as many dimensions as you would like, and all that’s required is adding another outgoing Connection projecting from the ramp ensemble.

The last thing in the code is hooking up the goal node to the output, which completes the point attractor implementation.

Implementing M1 – the kinematics of operational space control

In REACH, we’ve modelled the primary motor cortex (M1) as responsible for taking in a desired hand movement (i.e. target_position - current_hand_position) and calculating a set of joint torques to carry that out. Explicitly, M1 generates the kinematics component of an OSC signal:

$\textbf{u}_\textrm{M1} = \textbf{J}^T \textbf{M}_\textbf{x} (k_p (\textbf{x}^* - \textbf{x}))$

In the paper I did this using several populations of neurons, one to calculate the Jacobian, and then an EnsembleArray to perform the multiplication for the dot product of each dimension separately. Since then I’ve had the chance to rework things and it’s now done entirely in one ensemble.

Now, the set up for the M1 model that computes the above function is to have a single ensemble of neurons that takes in the joint angles, $\textbf{q}$, and control signal $\textbf{u}_\textbf{x} = k_p (\textbf{x}^* - \textbf{x})$, and outputs the function above. Sounds pretty simple, right? Simple is good.

Let’s look at the code (where I’ve stripped out the comments and some debugging code):

def generate(arm, kp=1, operational_space=True,
inertia_compensation=True, means=None, scales=None):

dim = arm.DOF + 2

means = np.zeros(dim) if means is None else means
scales = np.ones(dim) if scales is None else scales
scale_down, scale_up = generate_scaling_functions(
np.asarray(means), np.asarray(scales))

net = nengo.Network('M1')
with net:
# create / connect up M1 ------------------------------
net.M1 = nengo.Ensemble(
n_neurons=1000, dimensions=dim,
intercepts=AreaIntercepts(
dimensions=dim,
base=nengo.dists.Uniform(-1, .1)))

# expecting input in form [q, x_des]
net.input = nengo.Node(output=scale_down, size_in=dim)
net.output = nengo.Node(size_in=arm.DOF)

def M1_func(x, operational_space, inertia_compensation):
""" calculate the kinematics of the OSC signal """
x = scale_up(x)
q = x[:arm.DOF]
x_des = kp * x[arm.DOF:]

# calculate hand (dx, dy)
if operational_space:
J = arm.J(q=q)

if inertia_compensation:
# account for inertia
Mx = arm.Mx(q=q)
x_des = np.dot(Mx, x_des)
# transform to joint torques
u = np.dot(J.T, x_des)
else:
u = x_des

if inertia_compensation:
# account for mass
M = arm.M(q=q)
u = np.dot(M, x_des)

return u

# send in system feedback and target information
# don't account for synapses twice
nengo.Connection(net.input, net.M1, synapse=None)
nengo.Connection(
net.M1, net.output,
function=lambda x: M1_func(
x, operational_space, inertia_compensation),
synapse=None)

return net


The ensemble of neurons itself is created with a few interesting parameters:

        net.M1 = nengo.Ensemble(
n_neurons=1000, dimensions=dim,
intercepts=AreaIntercepts(
dimensions=dim, base=nengo.dists.Uniform(-1, .1)))


Specifically, the radius and intercepts parameters.

Setting the intercepts

First we’ll discuss setting the intercepts using the AreaIntercepts distribution. The intercepts of a neuron determine how much of state space a neuron is active in, which we’ll refer to as the ‘active proportion’. If you don’t know what kind of functions you want to approximate with your neurons, then you having the active proportions for your ensemble chosen from a uniform distribution is a good starting point. This means that you’ll have roughly the same number of neurons active across all of state space as you do neurons that are active over half of state space as you do neurons that are active over very small portions of state space.

By default, Nengo sets the intercepts such that the distribution of active proportions is uniform for lower dimensional spaces. But when you start moving into higher dimensional spaces (into a hypersphere) the default method breaks down and you get mostly neurons that are either active for all of state space or almost none of state space. The AreaIntercepts class calculates how the intercepts should be set to achieve the desire active proportion inside a hypersphere. There are a lot more details here that you can read up on in this IPython notebook by Dr. Terrence C. Stewart.

What you need to know right now is that we’re setting the intercepts of the neurons such that the percentage of state space for which any given neuron is active is chosen from a uniform distribution between 0% and 55%. In other words, a neuron will maximally be active in 55% of state space, no more. This will let us model more nonlinear functions (such as the kinematics of the OSC signal) with fewer neurons. If this description is clear as mud, I really recommend checking out the IPython notebook linked above to get an intuition for what I’m talking about.

Scaling the input signal

The other parameter we set on the M1 ensemble is the radius. The radius scales the range of input values that the ensemble can represent, which is by default anything inside the unit hypersphere (i.e. hypersphere with radius=1). If the radius is left at this default value, the neural activity will saturate for vectors with magnitude greater than 1, leading to inaccurate vector representation and function approximation for input vectors with magnitude > 1. For lower dimensions this isn’t a terrible problem, but as the dimensions of the state space you’re representing grow it becomes more common for input vectors to have a norm greater than 1. Typically, we’d like to be able to, at a minimum, represent vectors with any number of dimensions where any element can be anywhere between -1 and 1. To do this, all we have to do is calculate the norm of the unit vector size dim, which is np.sqrt(dim) (the magnitude of a vector size dim with all elements set to one).

Now that we’re able to represent vectors where the input values are all between -1 and 1, the last part of this sub-network is scaling the input to be between -1 and 1. We use two scaling functions, scale_down and scale_up. The scale_down function subtracts a mean value and scales the input signal to be between -1 and 1. The scale_up function reverts the vector back to it’s original values so that calculations can be carried out normally on the decoding. In choosing mean and scaling values, there are two ways we can set these functions up:

1. Set them generally, based on the upper and lower bounds of the input signal. For M1, the input is $[\textbf{q}, \textbf{u}_\textbf{x}]$ where $\textbf{u}_\textbf{x}$ is the control signal in end-effector space, we know that the joint angles are always in the range 0 to $\pi$ (because that’s how the arm simulation is programmed), so we can set the means and scales to be $\frac{\pi}{2}$ for $\textbf{q}$. For $\textbf{u}$ a mean of zero is reasonable, and we can choose (arbitrarily, empirically, or analytically) the largest task space control signal we want to represent accurately.
2. Or, if we know the model will be performing a specific task, we can look at the range of input values encountered during that task and set the means and scales terms appropriately. For the task of reaching in a straight line, the arm moves in a very limited state space and we can set the mean and we can tune these parameter to be very specific:
                             means=[0.6, 2.2, 0, 0],
scales=[.25, .25, .25, .25]


The benefit of the second method, although one can argue it’s parameter tuning and makes things less biologically plausible, is that it lets us run simulations with fewer neurons. The first method works for all of state space, given enough neurons, but seeing as we don’t always want to be simulating 100k+ neurons we’re using the second method here. By tuning the scaling functions more specifically we’re able to run our model using 1k neurons (and could probably get away with fewer). It’s important to keep in mind though that if the arm moves outside the expected range the control will become unstable.

Implementing CB – the dynamics of operational space control

The cerebellum (CB) sub-network has two components to it: dynamics compensation and dynamics adaptation. First we’ll discuss the dynamics compensation. By which I mean the $k_v \textbf{M} (\textbf{q}^* - \textbf{q})$ term from the OSC signal.

Much like the calculating the kinematics term of the OSC signal in M1, we can calculate the entire dynamics compensation term using a single ensemble with an appropriate radius, scaled inputs, and well chosen intercepts.

def generate(arm, kv=1, learning_rate=None, learned_weights=None,
means=None, scales=None):
dim = arm.DOF * 2

means = np.zeros(dim) if means is None else means
scales = np.ones(dim) if scales is None else scales
scale_down, scale_up = generate_scaling_functions(
np.asarray(means), np.asarray(scales))

net = nengo.Network('CB')
with net:
# create / connect up CB --------------------------------
net.CB = nengo.Ensemble(
n_neurons=1000, dimensions=dim,
intercepts=AreaIntercepts(
dimensions=dim,
base=nengo.dists.Uniform(-1, .1)))
# expecting input in form [q, dq, u]
net.input = nengo.Node(output=scale_down,
size_in=dim+arm.DOF+2)
cb_input = nengo.Node(size_in=dim, label='CB input')
nengo.Connection(net.input[:dim], cb_input)
# output is [-Mdq, u_adapt]
net.output = nengo.Node(size_in=arm.DOF*2)

def CB_func(x):
""" calculate the dynamic component of OSC signal """
x = scale_up(x)
q = x[:arm.DOF]
dq = x[arm.DOF:arm.DOF*2]

# calculate inertia matrix
M = arm.M(q=q)
return -np.dot(M, kv * dq)
# connect up the input and output
nengo.Connection(net.input[:dim], net.CB)
nengo.Connection(net.CB, net.output[:arm.DOF],
function=CB_func, synapse=None)


I don’t think there’s anything noteworthy going on here, most of the relevant details have already been discussed…so we’ll move on to the adaptation!

Implementing CB – non-linear dynamics adaptation

The final part of the model is the non-linear dynamics adaptation, modelled as a separate ensemble in the cerebellar sub-network (a separate ensemble so that it’s more modular, the learning connection could also come off of the other CB population). I work through the details and proof of the learning rule in the paper, so I’m not going to discuss that here. But I will restate the learning rule here:

$\dot{\textbf{d}} = \textbf{L}_d \textbf{A} \otimes \textbf{u},$

where $\textbf{d}$ are the neural decoders, $\textbf{L}_d$ is the learning rate, $\textbf{A}$ is the neural activity of the ensemble, and $\textbf{u}$ is the joint space control signal sent to the arm. This is a basic delta learning rule, where the decoders of the active neurons are modified to push the decoded function in a direction that reduces the error.

The adaptive ensemble can be initialized either using saved weights (passed in with the learned_weights paramater) or as all zeros. It is important to note that setting decoders to all zeros means that does not mean having zero neural activity, so learning will not be affected by this initialization.

        # dynamics adaptation------------------------------------
if learning_rate is not None:
n_neurons=1000, dimensions=arm.DOF*2,
# enforce spiking neurons
neuron_type=nengo.LIF(),
intercepts=AreaIntercepts(
dimensions=arm.DOF,
base=nengo.dists.Uniform(-.5, .2)))

net.learn_encoders = nengo.Connection(

# if no saved weights were passed in start from zero
weights = (
learned_weights if learned_weights is not None
net.learn_conn = nengo.Connection(
# connect directly to arm so that adaptive signal
# is not included in the training signal
transform=weights,
learning_rule_type=nengo.PES(
learning_rate=learning_rate),
synapse=None)

nengo.Connection(net.input[dim:dim+2],
net.learn_conn.learning_rule,
transform=-1, synapse=.01)
return net


We’re able to implement that learning rule using Nengo’s prescribed error-sensitivity (PES) learning on our connection from the adaptive ensemble to the output. With this set up the system will be able to learn to adapt to perturbations that are functions of the input (set here to be $[\textbf{q}, \dot{\textbf{q}}]$).

The intercepts in this population are set to values I found worked well for adapting to a few different forces, but it’s definitely a parameter to play with in your own scripts if you’re finding that there’s too much or not enough generalization of the decoded function output across the state space.

One other thing to mention is that we need to have a relay node to amalgamate the control signals output from M1 and the dynamics compensation ensemble in the CB. This signal is used to train the adaptive ensemble, and it’s important that the adaptive ensemble’s output is not included in the training signal, or else the system quickly goes off to positive or negative infinity.

Implementing S1 – a placeholder

The last sub-network in the REACH model is a placeholder for a primary sensory cortex (S1) model. This is just a set of ensembles that represents the feedback from the arm and relay it on to the rest of the model.

def generate(arm, direct_mode=False, means=None, scales=None):
dim = arm.DOF*2 + 2  # represents [q, dq, hand_xy]

means = np.zeros(dim) if means is None else means
scales = np.ones(dim) if scales is None else scales
scale_down, scale_up = generate_scaling_functions(
np.asarray(means), np.asarray(scales))

net = nengo.Network('S1')
with net:
# create / connect up S1 --------------------------------
net.S1 = nengo.networks.EnsembleArray(
n_neurons=50, n_ensembles=dim)

# expecting input in form [q, x_des]
net.input = nengo.Node(output=scale_down, size_in=dim)
net.output = nengo.Node(
lambda t, x: scale_up(x), size_in=dim)

# send in system feedback and target information
# don't account for synapses twice
nengo.Connection(net.input, net.S1.input, synapse=None)
nengo.Connection(net.S1.output, net.output, synapse=None)

return net


Since there’s no function that we’re decoding off of the represented variables we can use separate ensembles to represent each dimension with an EnsembleArray. If we were going to decode some function of, for example, q0 and dq0, then we would need an ensemble that represents both variables. But since we’re just decoding out f(x) = x, using an EnsembleArray is a convenient way to decrease the number of neurons needed to accurately represent the input.

Creating a model using the framework

The REACH model has been set up to be as much of a plug and play system as possible. To generate a model you first create the M1, PMC, CB, and S1 networks, and then they’re all hooked up to each other using the framework.py file. Here’s an example script that controls the arm to trace a circle:

def generate():
kp = 200
kv = np.sqrt(kp) * 1.5

center = np.array([0, 1.25])
# set the initial position of the arm
arm_sim.init_q = arm_sim.inv_kinematics(center)
arm_sim.reset()

net = nengo.Network(seed=0)
with net:
net.dim = arm_sim.DOF
net.arm_node = arm_sim.create_nengo_node()
net.error = nengo.Ensemble(1000, 2)
net.xy = nengo.Node(size_in=2)

# create an M1 model -------------------------------------
net.M1 = M1.generate(arm_sim, kp=kp,
operational_space=True,
inertia_compensation=True,
means=[0.6, 2.2, 0, 0],
scales=[.5, .5, .25, .25])

# create an S1 model -------------------------------------
net.S1 = S1.generate(arm_sim,
means=[.6, 2.2, -.5, 0, 0, 1.25],
scales=[.5, .5, 1.7, 1.5, .75, .75])

# subtract current position to get task space direction
nengo.Connection(net.S1.output[net.dim*2:], net.error,
transform=-1)

# create a trajectory for the hand to follow -------------
x = np.linspace(0.0, 2.0*np.pi, 100)
PMC_trajectory = np.vstack([np.cos(x) * .5,
np.sin(x) * .5])
PMC_trajectory += center[:, None]
# create / connect up PMC --------------------------------
net.PMC = PMC.generate(PMC_trajectory, speed=1)
# send target for calculating control signal
nengo.Connection(net.PMC.output, net.error)
# send target (x,y) for plotting
nengo.Connection(net.PMC.output, net.xy)

net.CB = CB.generate(arm_sim, kv=kv,
means=[0.6, 2.2, -.5, 0],
scales=[.5, .5, 1.6, 1.5])

model = framework.generate(net=net, probes_on=True)
return model


In line 50 you can see the call to the framework code, which will hook up the most common connections that don’t vary between the different scripts.

The REACH model has assigned functionality to each area / sub-network, and you can see the expected input / output in the comments at the top of each sub-network file, but the implementations are open. You can create your own M1, PMC, CB, or S1 sub-networks and try them out in the context of a full model that generates high-level movement behaviour.

Running the model

To run the model you’ll need Nengo, Nengo GUI, and NengoLib all installed. You can then pull open Nengo GUI and load any of the a# scripts. In all of these scripts the S1 model is just an ensemble that represents the output from the arm_node. Here’s what each of the scripts does:

1. a01 has a spiking M1 and CB, dynamics adaptation turned off. The model guides the arm in reaching in a straight line to a single target and back.
2. a02 has a spiking M1, PMC, and CB, dynamics adaptation turned off. The PMC generates a path for the hand to follow that traces out a circle.
3. a03 has a spiking M1, PMC, and CB, dynamics adaptation turned off. The PMC generates a path for the joints to follow, which moves the hand in a straight line to a target and back.
4. a04 has a spiking M1 and CB, dynamics adaptation turned off. The model performs the centre-out reaching task, starting at a central point and reaching to 8 points around a circle.
5. a05 has a spiking M1 and CB, and dynamics adaptation turned on. The model performs the centre-out reaching task, starting at a central point and reaching to 8 points around a circle. As the model reaches, a forcefield is applied based on the joint velocities that pushes the arm as it tries to reach the target. After 100-150 seconds of simulation the arm has adapted and learned to reach in a straight line again.

Here’s what it looks like when you pull open a02 in Nengo GUI:

I’m not going to win any awards for arm animation, but! It’s still a useful visualization, and if anyone is proficient in javascript and want’s to improve it, please do! You can see the network architecture in the top left, the spikes generated by M1 and CB to the right of that, the arm in the bottom left, and the path traced out by the hand just to the right of that. On the top right you can see the a02 script code, and below that the Nengo console.

Conclusions

One of the most immediate extensions (aside from any sort of model of S1) that comes to mind here is implementing a more detailed cerebellar model, as there are several anatomically detailed models which have the same supervised learner functionality (for example (Yamazaki and Nagao, 2012)).

Hopefully people find this post and the code useful for their own work, or at least interesting! In the ideal scenario this would be a community project, where researchers add models of different brain areas and we end up with a large library of implementations to build larger models with in a Mr. Potato Head kind of fashion.

You can find all of the code up on my GitHub. And again, this code all should have been publicly available along with the publication. Hopefully the code proves useful! If you have any questions about it please don’t hesitate to make a comment here or contact me through email, and I’ll get back to you as soon as I can.

## Deep learning for control using augmented Hessian-free optimization

Traditionally, deep learning is applied to feed-forward tasks, like classification, where the output of the network doesn’t affect the input to the network. It is a decidedly harder problem when the output is recurrently connected such that network output affects the input. Because of this application of deep learning methods to control was largely unexplored until a few years ago. Recently, however, there’s been a lot of progress and research in this area. In this post I’m going to talk about an implementation of deep learning for control presented by Dr. Ilya Sutskever in his thesis Training Recurrent Neural Networks.

In his thesis, Dr. Sutskever uses augmented Hessian-free (AHF) optimization for learning. There are a bunch of papers and posts that go into details about AHF, here’s a good one by Andrew Gibiansky up on his blog, that I recommend you check out. I’m not going to really talk much here about what AHF is specifically, or how it differs from other methods, if you’re unfamiliar there are lots of places you can read up on it. Quickly, though, AHF is kind of a bag of tricks you can use with a fast method for estimating the curvature of the loss function with respect to the weights of a neural network, as well as the gradient, which allows it to make larger updates in directions where the loss function doesn’t change quickly. So rather than estimating the gradient and then doing a small update along each dimension, you can make the size of your update large in directions that change slowly and small along dimensions where things change quickly. And now that’s enough about that.

In this post I’m going to walk through using a Hessian-free optimization library (version 0.3.8) written by my compadre Dr. Daniel Rasmussen to train up a neural network to train up a 2-link arm, and talk about the various hellish gauntlets you need run to get something that works. Whooo! The first thing to do is install this Hessian-free library, linked above.

I’ll be working through code edited a bit for readability, to find the code in full you can check out the files up on my GitHub.

Build the network

Dr. Sutskever specified the structure of the network in his thesis to be 4 layers: 1) a linear input layer, 2) 100 Tanh nodes, 3) 100 Tanh nodes, 4) linear output layer. The network is connected up with the standard feedforward connections from 1 to 2 to 3 to 4, plus recurrent connections on 2 and 3 to themselves, plus a ‘skip’ connection from layer 1 to layer 3. Finally, the input to the network is the target state for the plant and the current state of the plant. So, lots of recursion! Here’s a picture:

The output layer connects in to the plant, and, for those unfamiliar with control theory terminology, ‘plant’ just means the system that you’re controlling. In this case an arm simulation.

Before we can go ahead and set up the network that we want to train, we also need to specify the loss function that we’re going to be using during training. The loss function in Ilya’s thesis is a standard one:

$L(\theta) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{u}_t) + \ell_f(\textbf{x}_N),$
$\ell(\textbf{u}_t) = \alpha \frac{||\textbf{u}_t||^2}{2},$
$\ell_f(\textbf{x}_N) = \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2}$

where $L(\theta)$ is the total cost of the trajectory generated with $\theta$, the set of network parameters, $\ell(\textbf{u})$ is the immediate state cost, $\ell_f(\textbf{x})$ is the final state cost, $\textbf{x}$ is the state of the arm, $\textbf{x}^*$ is the target state of the arm, $\textbf{u}$ is the control signal (torques) that drives the arm, and $\alpha$ is a gain value.

To code this up using the hessianfree library we do:

from hessianfree import RNNet
from hessianfree.nonlinearities import (Tanh, Linear, Plant)
from hessianfree.loss_funcs import SquaredError, SparseL2

l2gain = 10e-3 * dt # gain on control signal loss
rnn = RNNet(
# specify the number of nodes in each layer
shape=[num_states * 2, 96, 96, num_states, num_states],
# specify the function of the nodes in each layer
layers=[Linear(), Tanh(), Tanh(), Linear(), plant],
# specify the layers that have recurrent connections
rec_layers=[1,2],
# specify the connections between layers
conns={0:[1, 2], 1:[2], 2:[3], 3:[4]},
# specify the loss function
loss_type=[
# squared error between plant output and targets
SquaredError(),
# penalize magnitude of control signal (output of layer 3)
SparseL2(l2gain, layers=[3])],
use_GPU=True)


Note that if you want to run it on your GPU you’ll need PyCuda and sklearn installed. And a GPU.

An important thing to note as well is that in Dr. Sustkever’s thesis when we’re calculating the squared error of the distance from the arm state to the target, this is measured in joint angles. So it’s kind of a weird set up to be looking at the movement of the hand but have your cost function in joint-space instead of end-effector space, but it definitely simplifies training by making the cost more directly relatable to the control signal. So we need to calculate the joint angles of the arm that will have the hand at different targets around a circle. To do this we’ll take advantage of our inverse kinematics solver from way back when, and use the following code:

def gen_targets(arm, n_targets=8, sig_len=100):
#Generate target angles corresponding to target
#(x,y) coordinates around a circle
import scipy.optimize

x_bias = 0
if arm.DOF == 2:
y_bias = .35
dist = .075
elif arm.DOF == 3:
y_bias = .5
dist = .2

# set up the reaching trajectories around circle
targets_x = [dist * np.cos(theta) + x_bias \
for theta in np.linspace(0, np.pi*2, 65)][:-1]
targets_y = [dist * np.sin(theta) + y_bias \
for theta in np.linspace(0, np.pi*2, 65)][:-1]

joint_targets = []
for ii in range(len(targets_x)):
joint_targets.append(arm.inv_kinematics(xy=(targets_x[ii],
targets_y[ii])))
targs = np.asarray(joint_targets)

# repeat the targets over time
for ii in range(targs.shape[1]-1):
targets = np.concatenate(
(np.outer(targs[:, ii], np.ones(sig_len))[:, :, None],
np.outer(targs[:, ii+1], np.ones(sig_len))[:, :, None]), axis=-1)
targets = np.concatenate((targets, np.zeros(targets.shape)), axis=-1)
# only want to penalize the system for not being at the
# target at the final state, set everything before to np.nan
targets[:, :-1] = np.nan

return targets


And you can see in the last couple lines that to implement the distance to target as a final state cost penalty only we just set all of the targets before the final time step equal to np.nan. If we wanted to penalize distance to target throughout the whole trajectory we would just comment that line out.

Create the plant

You’ll notice in the code that defines our RNN I set the last layer of the network to be plant, but that that’s not defined anywhere. Let’s talk. There are a couple of things that we’re going to need to incorporate our plant into this network and be able to use any deep learning method to train it. We need to be able to:

1. Simulate the plant forward; i.e. pass in input and get back the resulting plant state at the next timestep.
2. Calculate the derivative of the plant state with respect to the input; i.e. how do small changes in the input affect the state.
3. Calculate the derivative of the plant state with respect to the previous state; i.e. how do small changes in the plant state affect the state at the next timestep.
4. Calculate the derivative of the plant output with respect to its state; i.e. how do small changes in the current position of the state affect the output of the plant.

So 1 is easy, we have the arm simulations that we want already, they’re up on my GitHub. Number 4 is actually trivial too, because the output of our plant is going to be the state itself, so the derivative of the output with respect to the state is just the identity matrix.

For 2 and 3, we’re going to need to calculate some derivatives. If you’ve read the last few posts you’ll note that I’m on a finite differences kick. So let’s get that going! Because no one wants to calculate derivatives!

Important note, the notation in these next couple pieces of code is going to be a bit different from my normal notation because they’re matching with the hessianfree library notation, which is coming from a reinforcement learning literature background instead of a control theory background. So, s is the state of the plant, and x is the input to the plant. I know, I know. All the same, make sure to keep that in mind.

# calculate ds0/dx0 with finite differences
d_input_FD = np.zeros((x.shape[0], # number of trials
x.shape[1], # number of inputs
self.state.shape[1])) # number of states
for ii in range(x.shape[1]):
# calculate state adding eps to x[ii]
self.reset_plant(self.prev_state)
inc_x = x.copy()
inc_x[:, ii] += self.eps
self.activation(inc_x)
state_inc = self.state.copy()
# calculate state subtracting eps from x[ii]
self.reset_plant(self.prev_state)
dec_x = x.copy()
dec_x[:, ii] -= self.eps
self.activation(dec_x)
state_dec = self.state.copy()

d_input_FD[:, :, ii] = \
(state_inc - state_dec) / (2 * self.eps)
d_input_FD = d_input_FD[..., None]


Alrighty. First we create a tensor to store the results. Why is it a tensor? Because we’re going to be doing a bunch of runs at once. So our state dimensions are actually trials x size_input. When we then take the partial derivative, we end up with trials many size_input x size_state matrices. Then we increase each of the parameters of the input slightly one at a time and store the results, decrease them all one at a time and store the results, and compute our approximation of the gradient.

Next we’ll do the same for calculating the derivative of the state with respect to the previous state.

# calculate ds1/ds0
d_state_FD = np.zeros((x.shape[0], # number of trials
self.state.shape[1], # number of states
self.state.shape[1])) # number of states
for ii in range(self.state.shape[1]):
# calculate state adding eps to self.state[ii]
state = np.copy(self.prev_state)
state[:, ii] += self.eps
self.reset_plant(state)
self.activation(x)
state_inc = self.state.copy()
# calculate state subtracting eps from self.state[ii]
state = np.copy(self.prev_state)
state[:, ii] -= self.eps
self.reset_plant(state)
self.activation(x)
state_dec = self.state.copy()

d_state_FD[:, :, ii] = \
(state_inc - state_dec) / (2 * self.eps)
d_state_FD = d_state_FD[..., None]


Great! We’re getting closer to having everything we need. Another thing we need is a wrapper for running our arm simulation. It’s going to look like this:

def activation(self, x):
state = []
# iterate through and simulate the plant forward
# for each trial
for ii in range(x.shape[0]):
self.arm.reset(q=self.state[ii, :self.arm.DOF],
dq=self.state[ii, self.arm.DOF:])
self.arm.apply_torque(u[ii])
state.append(np.hstack([self.arm.q, self.arm.dq]))
state = np.asarray(state)

self.state = self.squashing(state)


This is definitely not the fastest code to run. Much more ideally we would put the state and input into vectors and do a single set of computations for each call to activation rather than having that for loop in there. Unfortunately, though, we’re not assuming that we have access to the dynamics equations / will be able to pass in vector states and inputs.

Squashing
Looking at the above code that seems pretty clear what’s going on, except you might notice that last line calling self.squashing. What’s going on there?

The squashing function looks like this:

def squashing(self, x):
index_below = np.where(x < -2*np.pi)
x[index_below] = np.tanh(x[index_below]+2*np.pi) - 2*np.pi
index_above = np.where(x > 2*np.pi)
x[index_above] = np.tanh(x[index_above]-2*np.pi) + 2*np.pi
return x


All that’s happening here is that we’re taking our input, and doing nothing to it as long as it doesn’t start to get too positive or too negative. If it does then we just taper it off and prevent it from going off to infinity. So running a 1D vector through this function we get:

This ends up being a pretty important piece of the code here. Basically it prevents wild changes to the weights during learning to result in the system breaking down. So the state of the plant can’t go off to infinity and cause an error to be thrown, stopping our simulation. But because the target state is well within the bounds of where the squashing function does nothing, post-training we’ll still be able to use the resulting network to control a system that doesn’t have this fail safe built in. Think of this function as training wheels that catch you only if you start falling.

With that, we no have pretty much all of the parts necessary to begin training our network!

Training the network

We’re going to be training this network on the centre-out reaching task, where you start at a centre point and reach to a bunch of target locations around a circle. I’m just going to be re-implementing the task as it was done in Dr. Sutskever’s thesis, so we’ll have 64 targets around the circle, and train using a 2-link arm. Here’s the code that we’ll use to actually run the training:

for ii in range(last_trial+1, num_batches):
# train a bunch of batches using the same input every time
# to allow the network a chance to minimize things with
# stable input (speeds up training)
err = rnn.run_batches(plant, targets=None,
max_epochs=batch_size,
optimizer=HessianFree(CG_iter=96, init_damping=100))

# save the weights to file, track trial and error
# err = rnn.error(inputs)
err = rnn.best_error
name = 'weights/rnn_weights-trial%04i-err%.3f'%(ii, err)
np.savez_compressed(name, rnn.W)


Training your own network

A quick aside: if you want to run this code yourself, get a real good computer, have an arm simulation ready, the hessianfree Python library installed, and download and run this train_hf.py file. (Note: I used version 0.3.8 of the hessianfree library, which you can install using pip install hessianfree==0.3.8) This will start training and save the weights into a weights/ folder, so make sure that that exists in the same folder as train_hf.py. If you want to view the results of the training at any point run the plot_error.py file, which will load in the most recent version of the weights and plot the error so far. If you want to generate an animated plot like I have below run gen_animation_plots.py and then the last command from my post on generating animated gifs.

Another means of seeing the results of your trained up network is to use the controller I’ve implemented in my controls benchmarking suite, which looks for a set of saved weights in the controllers/weights folder, and will load it in and use it to generate command signals for the arm by running it with

python run.py arm2_python ahf reach --dt=1e-2


where you replace arm2_python with whatever arm model you trained your model on. Note the --dt=1e-2 flag, that is important because the model was trained with a .01 timestep and things get a bit weird if you suddenly change the dynamics on the controller.

OK let’s look at some results!

Results

OK full discretion, these results are not optimizing the cost function we discussed above. They’re implementing a simpler cost function that only looks at the final state, i.e. it doesn’t penalize the magnitude of the control signal. I did this because Dr. Sutskever says in his thesis he was able to optimize with just the final state cost using much smaller networks. I originally looked at neurons with 96 neurons in each layer, and it just took forgoddamnedever to run. So after running for 4 weeks (not joking) and needing to make some more changes I dropped the number of neurons and simplified the task.

The results below are from running a network with 32 neurons in each layer controlling this 2-link arm, and took another 4-5 weeks to train up.

Hey that looks good! Not bad, augmented Hessian-free learning, not bad. It had pretty consistent (if slow) decline in the error rate, with a few crazy bumps from which it quickly recovered. Also take note that each training iteration is actually 32 runs, so it’s not 12,50-ish runs it’s closer to 400,000 training runs that it took to get here.

One biggish thing that was a pain was that it turns out that I only trained the neural network for reaching in the one direction, and when you only train it to reach one way it doesn’t generalize to reaching back to the starting point (which, fair enough). But, I didn’t realize this until I was took the trained network and ran it in the benchmarking code, at which point I was not keen to redo all of the training it took to get the neural network to the level of accuracy it was at under a more complicated training set. The downside of this is that even though I’ve implemented a controller that takes in the trained network and uses it to control the arm, to do the reaching task I have to just do a hard reset after the arm reaches the target, because it can’t reach back to the center, like all the other controllers. All the same, here’s an animation of the trained up AHF controller reaching to 8 targets (it was trained on all 64 above though):

Things don’t always go so smoothly, though. Here’s results from another training run that took around 2-3 weeks, and uses a different 2-link arm model (translated from Matlab code written by Dr. Emo Todorov):

What I found frustrating about this was that if you look at the error over time then this arm is doing as well or better than the previous arm at a lot of points. But the corresponding trajectories look terrible, like something you would see in a horror movie based around getting good machine learning results. This of course comes down to how I specified the cost function, and when I looked at the trajectories plotted over time the velocity of the arm is right at zero at the final time step, which it is not quiiiitte the case for the first controller. So this second network has found a workaround to minimize the cost function I specified in a way I did not intend. To prevent this, doing something like weighting the distance to target heavier than non-zero velocity would probably work. Or possibly just rerunning the training with a different random starting point you could get out a better controller, I don’t have a great feel for how important the random initialization is, but I’m hoping that it’s not all too important and its effects go to zero with enough training. Also, it should be noted I’ve run the first network for 12,500 iterations and the second for less than 6,000, so I’ll keep letting them run and maybe it will come around. The first one looked pretty messy too until about 4,000 iterations in.

Training regimes

Frustratingly, the way that you train deep networks is very important. So, very much like the naive deep learning network trainer that I am, I tried the first thing that pretty much anyone would try:

• run the network,
• update the weights,
• repeat.

This is what I’ve done in the results above. And it worked well enough in that case.

If you remember back to the iLQR I made a little while ago, I was able to change the cost function to be

$L(\theta) = \sum\limits^{N-1}\limits_{t=0} \ell(\textbf{u}_t) + \ell_f(\textbf{x}_N),$
$\ell(\textbf{u}_t, \textbf{x}_t) = \alpha \frac{||\textbf{u}_t||^2}{2} + \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2},$
$\ell_f(\textbf{x}_N) = \frac{||\textbf{x}^* - \textbf{x}_t||^2}{2}$

(i.e. to include a penalty for distance to target throughout the trajectory and not just at the final time step) which resulted in straighter trajectories when controlling the 2-link arm. So I thought I would try this here as well. Sadly (incredibly sadly), this was fairly fruitless. The network didn’t really learn or improve much at all.

After much consideration and quandary on my part, I talked with Dr. Dan and he suggested that I try another method:

• run the network,
• record the input,
• hold the input constant for a few batches of weight updating,
• repeat.

This method gave much better results. BUT WHY? I hear you ask! Good question. Let me give giving explanation a go.

Essentially, it’s because the cost function is more complex now. In the first training method, the output from the plant is fed back into the network as input at every time step. When the cost function was simpler this was OK, but now we’re getting very different input to train on at every iteration. So the system is being pulled in different directions back and forth at every iteration. In the second training regime, the same input is given several times in a row, which let’s the system follow the same gradient for a few training iterations before things change again. In my head I picture this as giving the algorithm a couple seconds to catch its breath dunking it back underwater.

This is a method that’s been used in a bunch of places recently. One of the more high-profile instances is in the results published from DeepMind on deep RL for control and for playing Go. And indeed, it also works well here.

To implement this training regime, we set up the following code:

for ii in range(last_trial+1, num_batches):

# run the plant forward once
rnn.forward(input=plant, params=rnn.W)

# get the input and targets from above rollout
inputs = plant.get_vecs()[0].astype(np.float32)
targets = np.asarray(plant.get_vecs()[1], dtype=np.float32)

# train a bunch of batches using the same input every time
# to allow the network a chance to minimize things with
# stable input (speeds up training)
err = rnn.run_batches(inputs, targets, max_epochs=batch_size,
optimizer=HessianFree(CG_iter=96, init_damping=100))

# save the weights to file, track trial and error
# err = rnn.error(inputs)
err = rnn.best_error
name = 'weights/rnn_weights-trial%04i-err%.3f'%(ii, err)
np.savez_compressed(name, rnn.W)


So you can see that we do one rollout with the weights, then go in and get the inputs and targets that were used in that rollout, and start training the network while holding those constant for batch_size epochs (training sessions). From a little bit of messing around I’ve found batch_size=32 to be a pretty good number. So then it runs 32 training iterations where it’s updating the weights, and then saves those weights (because we want a loooottttt of check-points) and then restarts the loop.

Embarrassingly, I’ve lost my simulation results from this trial, somehow…so I don’t have any nice plots to back up the above, unfortunately. But since this is just a blog post I figured I would at least talk about it a little bit, since people might still find it useful if they’re just getting into the field like me. and just update this post whenever I re-run them. If I rerun them.

What I do have, however, are results where this method doesn’t work! I tried this with the simpler cost function, that only looks at the final state distance from the target, and it did not go so well. Let’s look at that one!

My guess here is basically that the system has gotten to a point where it’s narrowed things down in the parameter space and now when you run 32 batches it’s overshooting. It needs feedback about its updates after every update at this point. That’s my guess, at least. So it could be the case that for more complex cost functions you’d want to train it while holding the input constant for a while, and then when the error starts to plateau switch to updating the input after every parameter update.

Conclusions

All in all, AHF for training neural networks in control is pretty awesome. There are of course still some major hold-backs, mostly related to how long it takes to train up a network, and having to guess at effective training regimes and network structures etc. But! It was able to train up a relatively small neural network to move an arm model from a center point to 64 targets around a circle, with no knowledge of the system under control at all. In Dr. Sutskever’s thesis he goes on to use the same set up under more complicated circumstances, such as when there’s a feedback delay, or a delay on the outgoing control signal, and unexpected noise etc, so it is able to learn under a number of different, fairly complex situations. Which is pretty slick.

Related to the insane training time required, I very easily could be missing some basic thing that would help speed things up. If you, reader, get ambitious and run the code on your own machine and find out useful methods for speeding up the training please let me know! Personally, my plan is to next investigate guided policy search, which seems like it’s found a way around this crazy training time.

## Reinforcement Learning part 4: Combining egocentric and allocentric

Alright, in my last post on RL we looked at egocentric learning and what that is. We also saw that both ego and allocentric learning have their strong points, and they compliment each other. So, some combination of these two approaches seems like a good idea. In this post we’re going to look at combining the output from these two kinds of learners and see the strengths of different weightings. Per usual, all the code from this post is up on my Github.

Combining allocentric and egocentric approaches

There are a few different ways that come to mind that we could combine these two learning styles. In each of these cases the overall setup is that we’re going to run two learning systems, one that is allocentric and one that is egocentric, combine their action value mappings for the current state in some way, and then decide which action to take based on their combined output. So here are several possible ways to combine the outputs:

• Only use allocentric
• Only use egocentric
• Average the mappings
• Dynamically change weights

There are, of course, many more ways to weight these systems, but these are some basic ones and they’re what we’re going to look at here.

For testing each of these these different setups we are going to run the mouse on a map for a set amount of time, measure the wins, accidental deaths, and intentional deaths, and then gather statistics (mean value and confidence intervals) across 10 trials.

A side note: The rate of exploration (epsilon value) starts out at .1 and is decreased by half every 100,000 simulation time step, so there are a number of jumps where you see the number of wins go up every 100,000, and that’s an artifact of the rate of exploration dropping (i.e. increasing your exploitation of what you already know about the world).

OK, we’re going to do this across 3 different maps, where the size and complexity increases with each map.

Map 1: the basic cliff map (which should look familiar):

Map 2: a middlesized map with several areas that are identical for the egocentric state:

and Map 3: a pretty large map with even more ambiguous areas:

Expectations

Now, what we expect from our learning systems is that the egocentric learner with have very few ‘suicide’ deaths, where it’s slow to learn that jumping off a cliff is a terrible idea. But on the flip side of that, since we’re going to set the gamma value of the egocentric learner to 0, (which means that it doesn’t incorporate anything other than the immediate reward into it’s learning, as discussed in the previous post), we expect that it will only rarely find the goal. Also, as the map gets larger, we expect that the goal will be found less and less, as it will have to randomly stumble around avoiding cliffs until it ends up in sight of the cheese.

From the allocentric learner we expect the death rate to be much higher than the egocentric learner, but we also expect the success rate to be significantly higher, as it will be able to identify how to move based on its current (x,y) position, rather than based on its immediate surroundings. As the map gets larger though, we expect the number of deaths to be larger, since there is no transfer of knowledge that jumping off a cliff in this position is as bad as jumping off a cliff in that position.

Alright! Let’s get to it!

Results

Only use allocentric

Not much to explain about this, the mouse is going to be learning Q-values where each state is its (x,y) position on the map. The bigger the map, the more deaths we expect this learner to have. So, here are the results:

Alright, so one of the most salient things in these plots is the stepping up of the number of wins at each of the 100,000 marks. Again, this is due to a transfer from exploration to exploitation of what’s already been learned as the epsilon value is decreased. And, also, as the map gets bigger, the number of wins takes longer to max out, but it’s pretty interesting how consistent the learning is (as seen by the very small confidence intervals on the runs) across trials given the random exploration.
Another point of note is that the number of deaths goes way up with the size of the map, as expected.

Only use egocentric

Again, this one is another straightforward case, the mouse is going to be learning Q-values where each state is its set of immediate neighbors (e.g. cliff to the right, cheese in front, empty cells left and behind). Here we expect that it will have a minimal number of deaths (not much more than a death for each of the situations it can encounter a cliff or cliffs in), and that the number of successes should be pretty low and very inconsistent, since it will be entirely based on randomly stumbling into a spot near the cheese while avoiding cliffs. Additionally, we expect the number of results to drop with increases in the size of the map. The results are below:

The results are as expected, lots of variation in the wins number, low suicide numbers, and the number of wins decreases with map sizes. Something that’s kind of interesting here is that you don’t see any real effects of the epsilon value as it decreases as you do in the egocentric case. I believe that this is because it achieves a balance of exploration that throws you off a cliff and exploration that moves you into the vicinity of the cheese. So it’s dying less when the exploration rate drops, but it also has less chance of reaching the cheese now. That’s my guess, at least. Alright! Next.

Average the allo and ego Q-values

Here we’re going to do the most basic combination of the allo and ego-centric learning possible. Both learners will run simultaneously, and generate a set of Q-values over the set of possible actions at each state, and we’ll average the output of both then choose the max Q-value to choose which action is taken. The hope here is that the ego-centric learner will very quickly learn not to jump off cliffs, and the allocentric learner will more slowly learn how to move in the map to get cheese. So we expect few deaths and more wins. Let’s look at the results:

Damn! So now we have the same (or at least very similar) win rate as the allocentric learner (awesome), and the same death rate as the egocentric learner (also awesome)! We have achieved some sort of super-human…or, super-mouse mouse. A super mouse. This is really cool, it tells us that these two systems really do compliment each other, and that it can be incredibly straightforward to implement a combination of the two.

Dynamic weighting

Here we’ll look at the results from a variation of an algorithm I’ve implemented based on this really interesting paper by Dr. Sakya Dasgupta et al where they got a lot of really neat results implemented on a little robot critter. We can’t quite use the algorithm as presented here for a couple of reasons.

In the paper, the authors design the algorithm for weighting the output from classical conditioning and operant conditioning systems. I’m taking these to be pretty close to analogous to an ego-centric learner (with no lookahead) and an allo-centric learner, respectively. The idea being that classical conditioning is based on immediate associations developed from rewards, and operant conditioning allows the longer-term associations to be formed. Specifically, in operant you learn to associate the conditioned stimulus with the unconditioned stimulus, which leads to a a conditioned response. The classic example being the dog salivating when hearing the bell, because it knows food is coming. Here the allocentric learner is doing a similar thing, using a look-ahead to start associating stimuli (in this case different states) with a reward or punishment. So, this could be a wildly inappropriate application of this algorithm, and the comparison definitely warrants further discussion.

The more immediate reason for varying the algorithm is because the output of their system is a continuous ‘left/right’ decision, and the output from the mouse here is a discrete ‘left/right/up/down’ decision. So the systems in the paper output a number from -1 to 1 that indicates which way to turn, and this is used in the update. So the algorithm needs to be changed up a little.

First, here’s the original algorithm:

weight_allo += eta * reward * [output_allo - filtered_output_allo] * output_ego
weight_ego += eta * reward * [output_ego - filtered_output_ego] * output_allo

where eta is the learning rate, reward is the immediate reward of the state you move into, output_ is the -1 to 1 value from the learners, and filtered_output_ is a low pass filtered version of the output. So the first part is standard, learning rate multiplied by the reward, and then the part inside the brackets is calculating a derivative of the output of the learners. This is again standard, did your learner change? If it did change and a reward was received, then increase the weight of that learner. The last term, which multiplies it all by the output from the other learner is for scaling.

And then normalize the weights relative to each other, and it isn’t explicitly stated in the paper, but also lower bound the weights at 0. The reason you need to do this lower bounding is that if you don’t, then when you use the weight as a gain on the learner output all of the Q-values are inverted! Not easy to recover from that.

The dynamic weighting equations that I’m using are:

weight_allo += eta * reward * dq_allo * output_ego
weight_ego += eta * reward * dq_ego * output_allo

and then lower bound at 0 and normalize.

There is still a lot in these equations that I’m unsure about (for example scaling by the output of the other learner instead of by the other weight, and how the signs of the reward, derivative, and other learner output all interact), but I’m going to leave delving into that for a future post, since this one is getting long already and I don’t have a full understanding of it yet. But! I have enough of an understanding to implement the above and run simulations! Here are the results:

I should note that the learning rate here for the weight equations is 5e-4. When you decrease this weight you get improved performance, but the best I’ve gotten is just reducing it to a point that it’s pretty much constantly using .5 for each. And that’s boring.

Alright, comparing to basic averaging here we notice that we have about the same performance for the small map, and then on the bigger maps significantly worse performance and also a lot more variability between runs. Basically what seems to happen is that the system quickly identifies that the allocentric learner should not have control at the beginning, because it keeps killing the mouse, and then it takes a lot longer to start handing control back over from the egocentric learner to the allocentric learner.

Conclusions

Alright, so that’s all! Bottom line is that allocentric and egocentric learning really work well together, and basic averaging of these gives great results. Whether this is the best means of combining them is unclear, however. This was just a really basic test of different combinations, and there is some really neat work looking at different ways to do it. It’s super fun to mess around with and there’s a ton of room for experimentation. In the code up on my Github, the code for weight learning can be found by searching for weight_learning. If anyone plays around and figures out a kick-ass dynamic weighting algorithm please pass it along!

## Reinforcement Learning part 3: Egocentric learning

When an agent is learning how to move, it’s learning a set of values to associate with possible actions for each situation it encounters. We say that it’s learning the action values. When the agent learns to associate the values of taking different actions in different states, it is learning the state-action values, also referred to as the Q-values. How the agent identifies ‘a situation’ can be handled in a couple of different ways.

If the agent is performing egocentric learning, then it distinguishes situations based on it’s immediately surrounding environment. If the agent is perform allocentric learning, then it has an internal map of the environment, and each different position on the map is considered a different situation. Whether an agent is using allocentric or egocentric learning ends up influencing its behavior quite a bit, and both have their pros and cons. We’ve some of both ego and allo-centric in the previous RL posts; in this post we’re going to examine egocentric learning specifically a little closer. And the code for this post can be found up on my GitHub.

Difference between ego and allo-centric

Let’s look at when a new set of action values is learned in these two cases. Going back to the ever wonderful mouse and cheese scenario, we’ll create a map for our mouse to learn to navigate.

In egocentric learning, each situation is determined by what is near the mouse, and so new action values for each of the possible movements ({W,NW,N,NE,E,SE,S,SW}) have to be learned whenever the immediate environment is not something the agent has seen before. For example, the mouse keeps a different set of action values for each of the following environments:

On the other hand, in the case of allocentric learning, each situation is determined by the (x,y) coordinates of the mouse’s current position. That means that the mouse keeps a different set of action values for every possible (x,y) position:

so even if the environment in each position is the same:

with different (x,y) coordinates the mouse maintains different action value maps for all of them.

With egocentric learning, the agent tends to develop immediately beneficial behavior more quickly, to avoid running off cliffs and to move towards cheese when it’s in sight. The agent can even learn to move towards cheese that’s outside of the range of its sensors in simple maps, essentially by learning heuristics like ‘always moving right leads to reward.’ But this falls apart if there are several areas in the map where the immediate environment is the same, as in the case above with states 2 and 3.

With allocentric learning, the agent develops action value mappings that are much better suited to the overall environment and long-term goals (e.g. if the cheese is always in the center of the map, if you are in the bottom-right move up and left, and if you are in the top-left move down and right), but you can’t generalize across local situations (for example when there is a cliff one space to the right), the consequences of movement at each (x,y) position must be learned independently.

Clearly a combination of these two learning types is desirable, but that is for a future post! There is still a lot to examine in ego-centric learning, and so we’re going to dig deeper.

Thinking about the future

In allocentric learning, you don’t often perform an action, successfully move, and then still wind up in the same state. This makes transmitting information about rewards received in the future backwards to states that are still far away fairly stable, because a specific state is some constant distance away from getting reward. In egocentric learning, however, this isn’t the case. So when you start trying to propagate reward information backwards you can get some unstable behavior; let’s look at a very basic scenario:

There’s the mouse in blue, on the left, and the reward, in green, on the right, and red represents cliffs / certain death. So now, whenever the mouse isn’t beside one of the walls he will be in one of three states, cliffs above, nothing around him, or cliffs below:

OK. Now let’s look at the update formula for Q-learning again:

Q(s, a) += alpha * (reward(s') + gamma * max(Q(s')) - Q(s, a))

where s is the previous system state, a is the action just taken, s' is the current state, reward(s') is the reward for being in the current state, and max(Q(s')) is the maximum reward obtainable by performing an action in the current state.

Informally, it says that the change to make to the Q-value for the action just taken, Q(s, a), is equal to your learning rate, alpha, multiplied by the reward for the state that you moved into,reward(s'), plus the lookahead reward,max(Q(s')), which relays the expected reward now that you’re in state s', minus the previous Q-value,Q(s, a).

Let’s break this down a bit more, because this can be complicated to think about still. If we set gamma = 0, then our update is

Q(s, a) += alpha * (reward(s') - Q(s, a)).

Here it’s more apparent what the dynamics are doing, they’re driving the Q-value to the reward value of the state you end up in. So you take an action a in state s, you find the reward for where you ended up, and you learn the reward for performing action a in state s. Now if gamma is not equal to 0, then you can think of the Q-value dynamics as trying to drive the Q-values to

reward(s') + gamma * max(Q(s'))

The larger that gamma is, the farther into the future you look. This can lead to unstable / non-convergent systems when you repeatedly move into the same state again and again. For example, if you’re in State 2 from above, and you move right often enough you’ll end up in a state that leads to getting a reward. This boosts the state before the reward, and now the next time you happen to move to a state where the reward is next to you, the Q-values of State 2 are now updated. And so you get your reward and you reset. Now you move into State 2 again and hey! Moving right leads to a reward. So you move right, but now you’re back in State 2 again! And so the reward for moving right decreases a bit. But it’s still the best decision, so you move right. And you’re in State 2 again, and so the reward for moving right goes down again, etc etc. Eventually moving right brings you to the target though and the Q-value for moving right gets a boost. So it’s going to forever be fluctuating.

In this example, it doesn’t seem like a big deal, because the mouse still learns to move towards the cheese and the Q-values have entered a kind of rhythmic limit cycle. There are a few problems that arise, however, when counting the future to heavily in egocentric learning.

Problem: Death-wish

Basically, the mouse decides that certain death is a better option than wandering around in uncertainty forever and starts deciding to commit suicide. Let’s say that to encourage exploration, the reward for being in an unoccupied square is -1, and to encourage not dying the reward for jumping off a cliff is -10. The problem arises in that jumping off the cliff is a constant punishment, -10. It’s always -10 and it will never be any worse than -10. There is no further future past death to influence the reward. When the mouse is just exploring, however, it will explore, and get -1, and then explore more, and get -1, etc etc. This causes the expected reward of actions that lead you into the same state keep dropping. If the weight of gamma is too large, then without all that much exploring the reward for exploring has suddenly dropped to less than -10, and the mouse decides that continuing to explore is worse than death, and jumps.

By setting gamma = 5 (which is pretty high) we then get this behavior very quickly, here are some simulation results, where the first column is the timestep, the second column W denotes how many times the mouse found the cheese, the third column L denotes how many times the mouse chose to jump off a cliff (as dictated by its Q-values), and the last column A denotes how many times random exploration led the mouse to its accidental death:

10000, W: 0, L: 3138, A: 51
20000, W: 0, L: 3207, A: 40
30000, W: 0, L: 3204, A: 40
40000, W: 0, L: 3205, A: 45
50000, W: 0, L: 3208, A: 43
60000, W: 0, L: 3198, A: 49
70000, W: 0, L: 3200, A: 49
80000, W: 0, L: 3207, A: 46
90000, W: 0, L: 3204, A: 46
100000, W: 0, L: 3214, A: 38
110000, W: 0, L: 3211, A: 37
120000, W: 0, L: 3203, A: 44
130000, W: 0, L: 3199, A: 45
140000, W: 0, L: 3225, A: 28
150000, W: 0, L: 3211, A: 39
160000, W: 0, L: 3196, A: 45
170000, W: 0, L: 3208, A: 46
180000, W: 0, L: 3203, A: 47
190000, W: 0, L: 3209, A: 38
200000, W: 0, L: 3216, A: 38


So pretty immediately the mouse gets into a dark place, and starts choosing to jump, without ever having found the cheese even once.

Problem: Confusion

Another problem is that by heavily weighting the immediate future the mouse can get easily confused again about what was good or how it actually found the cheese and got the big reward in the first place. Once State 2 has rewarded moving right, and the mouse moves left, there is now a new max(Q(s')) value for State 2 (the reward for moving right), and so the reward for moving left is now brought up. At the same time, the reward for moving right is dropping every time it moves right and doesn’t get rewarded (as described above). Same with moving up and down, so very quickly the Q-values can be flooded with high values and the optimal direction of movement is lost.

We can see this case by setting gamma = .8. Now, the mouse will find the cheese and learns to move right, but there are also these blips of confusion where it forgets how to find the cheese,

10000, W: 0, L: 4, A: 78
20000, W: 0, L: 0, A: 78
30000, W: 87, L: 0, A: 67
40000, W: 810, L: 0, A: 16
50000, W: 841, L: 0, A: 1
60000, W: 843, L: 0, A: 2
70000, W: 847, L: 0, A: 3
80000, W: 816, L: 0, A: 10
90000, W: 667, L: 0, A: 17
100000, W: 0, L: 0, A: 68
110000, W: 0, L: 0, A: 68
120000, W: 34, L: 0, A: 68
130000, W: 748, L: 0, A: 43
140000, W: 853, L: 0, A: 0
150000, W: 846, L: 0, A: 2
160000, W: 847, L: 0, A: 3
170000, W: 856, L: 0, A: 2
180000, W: 853, L: 0, A: 2
190000, W: 120, L: 0, A: 62
200000, W: 0, L: 0, A: 70


So it’s getting there, but then it gets confused and forgets how to find it, then it remembers, then it forgets…etc.

Solution: Have a healthy fear of death

If instead now we set gamma = .05, the mouse has a much stronger repulsion to imminent death than to ennui for a much longer time, and will go around exploring things (note that this is the same as changing the reward for dying to be a much higher penalty). As a result the mouse will find the cheese faster. As a bonus, we also don’t have the fits of forgetfulness seem in the previous example:

10000, W: 313, L: 4, A: 50
20000, W: 845, L: 0, A: 6
30000, W: 848, L: 0, A: 1
40000, W: 850, L: 0, A: 1
50000, W: 850, L: 0, A: 1
60000, W: 846, L: 0, A: 2
70000, W: 844, L: 0, A: 6
80000, W: 843, L: 0, A: 2
90000, W: 846, L: 0, A: 2
100000, W: 847, L: 0, A: 5
110000, W: 844, L: 0, A: 4
120000, W: 849, L: 0, A: 3
130000, W: 846, L: 0, A: 5
140000, W: 842, L: 0, A: 4
150000, W: 845, L: 0, A: 4
160000, W: 847, L: 0, A: 5
170000, W: 848, L: 0, A: 10
180000, W: 844, L: 0, A: 7
190000, W: 840, L: 0, A: 8
200000, W: 848, L: 0, A: 1


And so in this case now we’ve developed a pretty good, stable, egocentric RL mouse, that can learn to avoid jumping off cliffs and find the cheese in this very simple map.

We can still, of course, drop the gamma value down to the extreme, 0, such that our egocentric learner would have no knowledge of any future further than the reward it gets for taking the next step. In this case it would know that going off a cliff is bad, but it won’t know that moving to the right will eventually lead to cheese / a reward.

However, that can be OK! Warding off any immediate danger is and capitalizing on any immediate reward is the perfect role for this kind of short-sighted system. Trying to make it learn how to run through a complicated maze without a lot of very specific landmarks is just not something that’s going to happen.

Conclusions

So, recapping, the egocentric system learns Q-values for states that are based on the environment relative to the agent. This has some important consequences, in that it allows it to learn very quickly not to move forward if there’s a cliff in front of you, and that if there’s a reward in the immediate vicinity to go get it. Setting the weight on the expected reward in the Q-values update can lead to some unstable behavior, and it’s important to have a good concept of how the behavior of the agent changes with different gamma values. These undesirable behaviors can be avoided by setting gamma = 0, but the agent will then only learn the consequences of its immediate action, and be unable to look into the future at all to see how to bring about a reward or avoid getting trapped.

The egocentric learner has strengths where the allocentric learner fails, and the allocentric learner has strengths where the egocentric learner fails. The natural thought is to combine these two learners, and that will be the subject of the next RL post!

## Workshop talk – Methods for scaling neural computation

A couple of months ago I gave a talk at the Neuro-Inspired Computational Elements (NICE) workshop, about the use of cortical microcircuits for adaptation in the prediction and control problems. The talk was recorded, and here is a link: http://1.usa.gov/1tg2n2I Ignore the fire alarm that gets pulled in the first minute, the actual talk starts around 1 minute.

The were a lot of talks in general that were very interesting, which are also available online here: http://nice.sandia.gov/videos.html

I’ll be writing up some posts on the subject matter of my talk in the next couple months, explaining the methods in more detail and providing some solid examples with code. Hope you find the video interesting!

## Reinforcement Learning part 2: SARSA vs Q-learning

In my previous post about reinforcement learning I talked about Q-learning, and how that works in the context of a cat vs mouse game. I mentioned in this post that there are a number of other methods of reinforcement learning aside from Q-learning, and today I’ll talk about another one of them: SARSA. All the code used is from Terry Stewart’s RL code repository, and can be found both there and in a minimalist version on my own github: SARSA vs Qlearn cliff. To run the code, simply execute the cliff_Q or the cliff_S files.

SARSA stands for State-Action-Reward-State-Action. In SARSA, the agent starts in state 1, performs action 1, and gets a reward (reward 1). Now, it’s in state 2 and performs another action (action 2) and gets the reward from this state (reward 2) before it goes back and updates the value of action 1 performed in state 1. In contrast, in Q-learning the agent starts in state 1, performs action 1 and gets a reward (reward 1), and then looks and sees what the maximum possible reward for an action is in state 2, and uses that to update the action value of performing action 1 in state 1. So the difference is in the way the future reward is found. In Q-learning it’s simply the highest possible action that can be taken from state 2, and in SARSA it’s the value of the actual action that was taken.

This means that SARSA takes into account the control policy by which the agent is moving, and incorporates that into its update of action values, where Q-learning simply assumes that an optimal policy is being followed. This difference can be a little difficult conceptually to tease out at first but with an example will hopefully become clear.

Mouse vs cliff

Let’s look at a simple scenario, a mouse is trying to get to a piece of cheese. Additionally, there is a cliff in the map that must be avoided, or the mouse falls, gets a negative reward, and has to start back at the beginning. The simulation looks something like exactly like this:

where the black is the edge of the map (walls), the red is the cliff area, the blue is the mouse and the green is the cheese. As mentioned and linked to above, the code for all of these examples can be found on my github (as a side note: when using the github code remember that you can press the page-up and page-down buttons to speed up and slow down the rate of simulation!)

Now, as we all remember, in the basic Q-learning control policy the action to take is chosen by having the highest action value. However, there is also a chance that some random action will be chosen; this is the built-in exploration mechanism of the agent. This means that even if we see this scenario:

There is a chance that that mouse is going to say ‘yes I see the best move, but…the hell with it’ and jump over the edge! All in the name of exploration. This becomes a problem, because if the mouse was following an optimal control strategy, it would simply run right along the edge of the cliff all the way over to the cheese and grab it. Q-learning assumes that the mouse is following the optimal control strategy, so the action values will converge such that the best path is along the cliff. Here’s an animation of the result of running the Q-learning code for a long time:

The solution that the mouse ends up with is running along the edge of the cliff and occasionally jumping off and plummeting to its death.

However, if the agent’s actual control strategy is taken into account when learning, something very different happens. Here is the result of the mouse learning to find its way to the cheese using SARSA:

Why, that’s much better! The mouse has learned that from time to time it does really foolish things, so the best path is not to run along the edge of the cliff straight to the cheese but to get far away from the cliff and then work its way over safely. As you can see, even if a random action is chosen there is little chance of it resulting in death.

Learning action values with SARSA

So now we know how SARSA determines it’s updates to the action values. It’s a very minor difference between the SARSA and Q-learning implementations, but it causes a profound effect.

Here is the Q-learning learn method:

def learn(self, state1, action1, reward, state2):
maxqnew = max([self.getQ(state2, a) for a in self.actions])
self.learnQ(state1, action1,
reward, reward + self.gamma*maxqnew)


And here is the SARSA learn method

def learn(self, state1, action1, reward, state2, action2):
qnext = self.getQ(state2, action2)
self.learnQ(state1, action1,
reward, reward + self.gamma * qnext)


As we can see, the SARSA method takes another parameter, action2, which is the action that was taken by the agent from the second state. This allows the agent to explicitly find the future reward value, qnext, that followed, rather than assuming that the optimal action will be taken and that the largest reward, maxqnew, resulted.

Written out, the Q-learning update policy is Q(s, a) = reward(s) + alpha * max(Q(s')), and the SARSA update policy is Q(s, a) = reward(s) + alpha * Q(s', a'). This is how SARSA is able to take into account the control policy of the agent during learning. It means that information needs to be stored longer before the action values can be updated, but also means that our mouse is going to jump off a cliff much less frequently, which we can probably all agree is a good thing.

## Online Goal Babbling – motor learning paper review

Recently I was linked to an article about learning how to control a highly complex arm from scratch: How infants tell us how to control the Bionic Handling Assistant. The work seemed very interesting so I went and pulled one of the papers linked in the article, Online Goal Babbling for rapid bootstrapping of inverse models in high dimensions.

Diving into that title, online means that we’re using information from every movement as it’s gathered to improve our control, as opposed to ‘batch’ where learning only occurs every so-many trials. Bootstrapping is the process of bringing a system up to a functionally useful level. High dimensions then refers to the complexity of the system being controlled, where every component that requires a control signal is another dimension. Humans, for example, require extremely high dimensional control signals. Inverse models refer to a type of internal model, which ‘describe relations between motor commands and their consequences’. Forward models predict the results of a movement, and inverse models allow suggest a motor command that can be used to achieve a desired consequence, such as ‘the ball falls on the floor’ or ‘my hand touches the red balloon’.

Alright, so that’s the title, let’s dive into the paper proper.

Online goal babbling

The idea behind this research is to let the system learn how to control itself by exploring the environment. The reason why you would want to do this is so that you don’t have to analytically specify how the system should move. Analytic solutions require accurate models of the system dynamics, and calculating these quickly become horrendously complex. To the point that the equations of motion for a relatively simple 3-link arm moving are pages upon pages upon pages. On top of this, because your solution is only is as good as your model and the model you just took all that time to calculate isn’t adaptive, if your system dynamics change at all through wear and tear or an environment change, you’re in big trouble. You can use feedback control to help compensate for the errors introduced, but you can only respond as fast as you can receive and process sensory signals, which is often too long in practical applications.

So this moves us to adaptive feedforward control, based on learning an inverse model of the system dynamics. Importantly, what this paper describes is a kinematic inverse controller, not a kinetic inverse controller; meaning that given a desired target position for the end effector (hand) of an arm it provides a sequence of joint angle sets that will lead to the end effector reaching the target, as opposed to providing a sequence of joint angle torques to drive the system to the target. At some point, a control system will have to generate force commands, rather than just specify what position it wants the system to be in. But, knowing what joint angles and trajectory the joint angles should follow is an important problem, because systems that are of interest to control tend to exhibit a fair amount of configuration redundancy between ‘task-space’ and the actual system state-space. Task-space being something like the 3-dimensional $(x,y,z)$ position of the end-effector, which we are interested in controlling, and the actual system state-space being something like the joint-angles or muscle-lengths. Configuration redundancy is the problem of more than one possible set of joint-angles putting the end-effector in the desired location. Often the number of potential solutions is incredibly large, think about the number of possible ways you can reach to an object. So how do you learn an appropriate trajectory for your joint angles during a reaching movement? That is the problem being addressed here.

To learn an inverse model from scratch, the system needs to explore. How should it explore? Moving randomly eventually leads to an exhaustive search, but this is a poor choice in complex systems because it takes a large amount of time, increasing exponentially with the degrees of freedom (i.e. number of joint angles) of the system. So let’s look to babies, they’re responsible for learning how to control a very complex system, how the heck do they learn to move?

‘Motor babbling’ is a term that was coined to describe the seemingly random way babies moved about. It was suggested that they just flail about without purpose until they gain some understanding of how their bodies work, at which point they can start moving with purpose. But! Baby movement was shown way back in the 80’s to in fact not be just random, but instead to be highly goal directed. And when they find a solution, they stick with it, they don’t worry about it being the best one as long as it gets the job done. Only later are movements tweaked to be more efficient. As mentioned above, in systems as complicated as the human body the task-space (i.e. position of the hand) is much smaller than the motor space (i.e. length of all muscles), and there are a bunch of different solutions to a given task. With all these different potential solutions to a given problem, an exhaustive search isn’t even close to being necessary. Also, if babies aren’t just randomly exploring space to figure things out, they don’t have to flip a switch somewhere in their brain that says “ok stop messing around and let’s move with purpose”.

This paper provides a technique for stable online inverse model learning that can be used from initial bootstrapping, and shows how online learning can even be highly beneficial for bootstrapping. So let’s dive into the Online Goal Babbling learning method.

The inverse model function

Let’s denote the inverse model function we’re learning as $g()$, joint angles as $q$, and end effector positions as $x$. Then to denote giving a desired end effector position and getting out a set of joint angles we write $g(x^*) = q$, where $x^*$ represents a target in end effector space.

We’re going to initialize the inverse model function by having every end effector position return some default resting state of the arm, or home position, that we’ve defined, $q_{home}$. Additionally, we’re going to throw on some exploratory noise to the answer provided by the inverse model, so that the joint angles to move to at time $t$ are defined as $q_t = g(x^*_t, \theta_t) + E_t(x_t^*)$, where $E_t(x^*_t)$ is the exploratory noise. When the system then moves to $q_t$ the end effector position, $x_t$, is observed and the parameters of the inverse model, $\theta$, are updated immediately.

To generate a smooth path from the starting position to the target, the system divides the distance up into a number of sub-targets (25 in the paper) for the system to hit along the way. This is an important point, as it’s how the inverse model function is used to create a path that represents the system moving; a series of targets are provided and the inverse model is queried “what should the joint configuration be for this position?”

As mentioned before, it is possible to have he end effector in the same position, but the joints in a different configuration. Learning across these examples is dangerous and can lead to instability in the system as neighbouring targets could be learned with very dissimilar joint configurations, preventing smooth movement through joint-space. To prevent this, observed information is weighted by a term $w_t$ as it is taken in, based on deviation from the resting position of the arm ($q_{home}$) and efficiency of end effector movement. What this leads to is a consistent solution to be converged upon throughout movement space, causing the inverse model to generate smooth, comfortable (i.e. not near joint limits, but near the resting state of the arm) movements.

Additions to movement commands

To recenter movement exploration every so often, and prevent the system from exploring heavily in some obscure joint-space, every time a new target to move to is selected there is some probability (.1 in the paper) that the system will return to $q_{home}$. To return home the system traces out a straight trajectory in joint-space, not worrying about how it is moving through end effector space. This return probability reinforces learning how to move well near $q_{home}$, and acts as a ‘developmentally plausible stabilizer that helps to stay on known areas of the sensorimotor space.’

Also, we mentioned adding exploratory noise. How is that noise generated? The noise is calculated through a small, randomly chosen linear function that varies slowly over time: $E_t(x^*_t) = A_t \cdot x^* + c_t$, where the entries to the matrix $A_0$ and vector $b_0$ are chosen independently from a normal distribution with zero mean and variance $\sigma^2$. To move, a set of small values is chosen from a normal distribution with variance significantly smaller than $\sigma^2$, and added to elements of $A_t$ and $c_t$. A normalization term is also used to keep the overall deviation stable around the standard deviation $\sigma$. And that’s how we get our slowly changing linear random noise function.

Online learning

To learn the inverse model function, we’re going to use the technique of creating a bunch of linear models that are accurate in a small area of state space, and weighting their contribution to the output based on how close we are to their ‘region of validity’. The specific method used in the paper is a local-linear map, from (H.Ritter, Learning with the self-organizing map 1991). We define our linear models as $g^{(k)}(x) = M^{(k)} \cdot x + b^{(k)}$, intuited as following the standard $mx + b$ definition of a line, but moving into multiple dimensions. $M^{(k)}$ is our linear transformation of the input for model $k$, and $b^{(k)}$ is the offset.

If an input is received that is outside of the region of validity for all of the local linear models, then another one is added to improve the approximation of the function at that point. New local models are initiated with the Jacobian matrix $J(x) = \frac{\partial g(x)}{\partial x}$ of the inverse estimate. In other words, we look at how the approximation of the function is changing as we move in this $x$ direction in state space, or estimate the derivative of the function, and use that to set the initial value of our new linear model.

To update our inverse model function, we fit it to the current example $(x_t, q_t)$ by reducing the weighted square error: $err = w_t \cdot ||q_t - g(x_t)||^2$. With this error, a standard gradient descent method is used to update the slopes and offsets of the dimensions of the linear models:

$M^{(k)}_{t+1} = M^{(k)}_t - \eta \frac{\partial err}{\partial M^{(k)}_t}, \;\;\;\;\; b^{(k)}_{t+1} = b^{(k)}_{t} - \eta \frac{\partial err}{/partial b^{(k)}_{t}}$,

where $\eta$ is the learning rate.

Results

So here are some figures of the results of applying the algorithm, and they’re pretty impressive. First we see learning the inverse model for a 5-link arm:

And one of the really neat things about this is the scalability of this algorithm. Here’s applying OGB to a 50-link arm:

And here are some results looking at varying the learning rate (x-axis) and the effects on (a) time until 10% error reached, (b) final performance error, and (c) final average distance from $q_{home}$, the resting state configuration.

I enjoyed reading this paper, although I thought the ‘Learning operational space control’ papers from Jan Peters and Stephen Schaal deserved a bit more of a shout-out, as they’ve been around since 2006/8 and it’s a very similar method. Also I pulled one of the older papers from the article, so I’ll need to review the other ones too soon, as there is clearly more going on for the successful control of the elephant trunk arm shown in the article.

Other than that, I’m a big fan of breaking up complicated, non-linear functions into a bunch of small linear models, and stitching them all back together to learn an accurate approximation. I think it’s a very likely candidate for the kind of way that the brain goes about learning these incredibly complicated models, especially in the cerebellum I hold these types of methods as a favorite for how biological systems accomplish this. Another important note is that the learning never ‘turns off’ here, which is another big plus to this method. It allows the system to converge upon a solution, but to keep exploring the space as it uses this solution, so more efficient solutions can be found. Additionally, this makes this system much more robust to changing dynamics, resultant of wear and tear or, as in the case of infants, growing.

Again, the model here is learning an kinematic control of the system, specifying what joint angles the system should move to, and not the joint torques that will be required to actually move the system there. But, I am partial to the idea that this kinematic inverse model does then make it a much more smooth function for the kinetic controller to learn from this point.

The way I picture it is specifying a target in end-effector space, learning an kinematic inverse model controller that can specify the appropriate joint trajectory to follow to reach this target. Then an kinetic inverse model controller that takes in the joint space path, looks at the current joint angles / velocities / and accelerations, and learns the appropriate torques or muscle activation commands to move the arm as desired. This is a much, much smoother function to learn than directly learning the appropriate joint torques for a desired target in end effector space, because there are no hidden configuration redundancies. It’s still complex, because it still has to compensate for the inertia and multi-link interaction dynamics, but should still hopefully be much more straight-forward to learn.

Rolf, Matthias. (2011). Online Goal Babbling for rapid bootstrapping of inverse models in high dimensions. Development and Learning (ICDL), 2011 IEEE DOI: 10.1109/DEVLRN.2011.6037368

## Reinforcement learning part 1: Q-learning and exploration

We’ve been running a reading group on Reinforcement Learning (RL) in my lab the last couple of months, and recently we’ve been looking at a very entertaining simulation for testing RL strategies, ye’ old cat vs mouse paradigm. There are a number of different RL methods you can use / play with in that tutorial, but for this I’m only going to talk about Q-learning. The code implementation I’ll be using is all in Python, and the original code comes from one of our resident post-docs, Terry Stewart, and can be garnered from his online RL tutorial. The code I modify here is based off of Terry’s code and modified by Eric Hunsberger, another PhD student in my lab. I’ll talk more about how it’s been modified in another post.

Q-learning review
For those unfamiliar, the basic gist of Q-learning is that you have a representation of the environmental states s, and possible actions in those states a, and you learn the value of each of those actions in each of those states. Intuitively, this value, Q, is referred to as the state-action value. So in Q-learning you start by setting all your state-action values to 0 (this isn’t always the case, but in this simple implementation it will be), and you go around and explore the state-action space. After you try an action in a state, you evaluate the state that it has lead to. If it has lead to an undesirable outcome you reduce the Q value (or weight) of that action from that state so that other actions will have a greater value and be chosen instead the next time you’re in that state. Similarly, if you’re rewarded for taking a particular action, the weight of that action for that state is increased, so you’re more likely to choose it again the next time you’re in that state. Importantly, when you update Q, you’re updating it for the previous state-action combination. You can only update Q after you’ve seen what results.

Let’s look at an example in the cat vs mouse case, where you are the mouse. You were in the state where the cat was in front of you, and you chose to go forward into the cat. This caused you to get eaten, so now reduce the weight of that action for that state, so that the next time the cat is in front of you you won’t choose to go forward you might choose to go to the side or away from the cat instead (you are a mouse with respawning powers). Note that this doesn’t reduce the value of moving forward when there is no cat in front of you, the value for ‘move forward’ is only reduced in the situation that there’s a cat in front of you. In the opposite case, if there’s cheese in front of you and you choose ‘move forward’ you get rewarded. So now the next time you’re in the situation (state) that there’s cheese in front of you, the action ‘move forward’ is more likely to be chosen, because last time you chose it you were rewarded.

Now this system, as is, gives you no foresight further than one time step. Not incredibly useful and clearly too limited to be a real strategy employed in biological systems. Something that we can do to make this more useful is include a look-ahead value. The look-ahead works as follows. When we’re updating a given Q value for the state-action combination we just experienced, we do a search over all the Q values for the state the we ended up in. We find the maximum state-action value in this state, and incorporate that into our update of the Q value representing the state-action combination we just experienced.

For example, we’re a mouse again. Our state is that cheese is one step ahead, and we haven’t learned anything yet (blank value in the action box represents 0). So we randomly choose an action and we happen to choose ‘move forward’.
Now, in this state (we are on top of cheese) we are rewarded, and so we go back and update the Q value for the state ‘cheese is one step ahead’ and the action ‘move forward’ and increase the value of ‘move forward’ for that state.
Now let’s say the cheese is moved and we’re moving around again, now we’re in the state ‘cheese is two steps ahead’, and we make a move and end up in the state ‘cheese is one step ahead’.
Now when we are updating the Q value for the previous state-action combination we look at all the Q values for the state ‘cheese is one step ahead’. We see that one of these values is high (this being for the action ‘move forward’) and this value is incorporated in the update of the previous state-action combination.
Specifically we’re updating the previous state-action value using the equation: Q(s, a) += alpha * (reward(s,a) + max(Q(s') - Q(s,a)) where s is the previous state, a is the previous action, s' is the current state, and alpha is the discount factor (set to .5 here).

Intuitively, the change in the Q-value for performing action a in state s is the difference between the actual reward (reward(s,a) + max(Q(s'))) and the expected reward (Q(s,a)) multiplied by a learning rate, alpha. You can think of this as a kind of PD control, driving your system to the target, which is in this case the correct Q-value.

Here, we evaluate the reward of moving ahead when the cheese is two steps ahead as the reward for moving into that state (0), plus the reward of the best action from that state (moving into the cheese +50), minus the expected value of that state (0), multiplied by our learning rate (.5) = +25.

Exploration
In the most straightforward implementation of Q-learning, state-action values are stored in a look-up table. So we have a giant table, which is size N x M, where N is the number of different possible states, and M is the number of different possible actions. So then at decision time we simply go to that table, look up the corresponding action values for that state, and choose the maximum, in equation form:

def choose_action(self, state):
q = [self.getQ(state, a) for a in self.actions]
maxQ = max(q)
action = self.actions[maxQ]
return action


Almost. There are a couple of additional things we need to add. First, we need to cover the case where there are several actions that all have the same value. So to do that, if there are several with the same value, randomly choose one of them.

def choose_action(self, state):
q = [self.getQ(state, a) for a in self.actions]
maxQ = max(q)
count = q.count(maxQ)
if count > 1:
best = [i for i in range(len(self.actions)) if q[i] == maxQ]
i = random.choice(best)
else:
i = q.index(maxQ)

action = self.actions[i]
return action


This helps us out of that situation, but now if we ever happen upon a decent option, we’ll always choose that one in the future, even if there is a way better option available. To overcome this, we’re going to need to introduce exploration. The standard way to get exploration is to introduce an additional term, epsilon. We then randomly generate a value, and if that value is less than epsilon, a random action is chosen, instead of following our normal tactic of choosing the max Q value.

def choose_action(self, state):
if random.random() < self.epsilon: # exploration action = random.choice(self.actions) else: q = [self.getQ(state, a) for a in self.actions] maxQ = max(q) count = q.count(maxQ) if count > 1:
best = [i for i in range(len(self.actions)) if q[i] == maxQ]
i = random.choice(best)
else:
i = q.index(maxQ)

action = self.actions[i]
return action


The problem now is that we even after we’ve explored all the options and we know for sure the best option, we still sometimes choose a random action; the exploration doesn’t turn off. There are a number of ways to overcome this, most involving manually adjusting epsilon as the mouse learns, so that it explores less and less as time passes and it has presumably learned the best actions for the majority of situations. I’m not a big fan of this, so instead I’ve implemented exploration the following way: If the randomly generated value is less than epsilon, then randomly add values to the Q values for this state, scaled by the maximum Q value of this state. In this way, exploration is added, but you’re still using your learned Q values as a basis for choosing your action, rather than just randomly choosing an action completely at random.

    def choose_action(self, state):
q = [self.getQ(state, a) for a in self.actions]
maxQ = max(q)

if random.random()  1:
best = [i for i in range(len(self.actions)) if q[i] == maxQ]
i = random.choice(best)
else:
i = q.index(maxQ)

action = self.actions[i]

return action


Very pleasingly, you get results comparable to the case where you perform lots of learning and then set epsilon = 0 to turn off random movements. Let’s look at them!

Simulation results
So here, the simulation is set up such that there is a mouse, cheese, and a cat all inside a room. The mouse then learns over a number of trials to avoid the cat and get the cheese. Printing out the results after every 1,000 time steps, for the standard learning case where you reduce epsilon as you learn the results are:

10000, e: 0.09, W: 44, L: 778
20000, e: 0.08, W: 39, L: 617
30000, e: 0.07, W: 47, L: 437
40000, e: 0.06, W: 33, L: 376
50000, e: 0.05, W: 35, L: 316
60000, e: 0.04, W: 36, L: 285
70000, e: 0.03, W: 33, L: 255
80000, e: 0.02, W: 31, L: 179
90000, e: 0.01, W: 35, L: 152
100000, e: 0.00, W: 44, L: 130
110000, e: 0.00, W: 28, L: 90
120000, e: 0.00, W: 17, L: 65
130000, e: 0.00, W: 40, L: 117
140000, e: 0.00, W: 56, L: 86
150000, e: 0.00, W: 37, L: 77


For comparison now, here are the results when epsilon is not reduced in the standard exploration case:

10000, e: 0.10, W: 53, L: 836
20000, e: 0.10, W: 69, L: 623
30000, e: 0.10, W: 33, L: 452
40000, e: 0.10, W: 32, L: 408
50000, e: 0.10, W: 57, L: 397
60000, e: 0.10, W: 41, L: 326
70000, e: 0.10, W: 40, L: 317
80000, e: 0.10, W: 47, L: 341
90000, e: 0.10, W: 47, L: 309
100000, e: 0.10, W: 54, L: 251
110000, e: 0.10, W: 35, L: 278
120000, e: 0.10, W: 61, L: 278
130000, e: 0.10, W: 45, L: 295
140000, e: 0.10, W: 67, L: 283
150000, e: 0.10, W: 50, L: 304


As you can see, the performance now converges around 300, instead of 100. Not great.
Now here are the results from the alternative implementation of exploration, where epsilon is held constant:

10000, e: 0.10, W: 65, L: 805
20000, e: 0.10, W: 36, L: 516
30000, e: 0.10, W: 50, L: 417
40000, e: 0.10, W: 38, L: 310
50000, e: 0.10, W: 39, L: 247
60000, e: 0.10, W: 31, L: 225
70000, e: 0.10, W: 23, L: 181
80000, e: 0.10, W: 34, L: 159
90000, e: 0.10, W: 36, L: 137
100000, e: 0.10, W: 35, L: 138
110000, e: 0.10, W: 42, L: 136
120000, e: 0.10, W: 24, L: 99
130000, e: 0.10, W: 52, L: 106
140000, e: 0.10, W: 22, L: 88
150000, e: 0.10, W: 29, L: 101


And again we get that nice convergence to 100, but this time without having to manually modify epsilon. Which is pretty baller.

Code
And that’s it! There is of course lots and lots of other facets of exploration to discuss, but this is just a brief discussion. If you’d like the code for the cat vs mouse and the alternative exploration you can access them at my github: Cat vs mouse – exploration.
To alternate between the different types of exploration, change the import statement at the top of the egoMouseLook.py to be either import qlearn for the standard exploration method, or import qlearn_mod_random as qlearn to test the alternative method. To have epsilon reduce in value as time goes, you can uncomment the lines 142-145.