As I mentioned in my last post, I’m reading a series of papers that presents a model of the basal ganglia, written mainly by Peter Redgrave, Kevin Gurney, and John Reynolds. Of particular interest throughout these articles is a re-examination of the role of the short-term phasic dopamine (DA) signal from the substantia nigra pars compacta (SNc). A well-propagated view is that the phasic DA signal is a reward prediction-error signal, but Redgrave et al present a strong argument against this and suggest instead a role of an agency determination / novel movement identification mechanism. In this post I’m going to be presenting their argument for this, and how reinforcement and reward based learning in the basal ganglia could work at large. Again, throughout there will be comments and questions I put forth, I will make an effort for it to be clear when something is from me and when it’s from the papers.
Phasic DA signal as a reward-prediction error
The idea that the phasic DA signal serves as a reward-prediction error is born out of a series of experiments presented in [Schultz 1998]. The idea of reward-prediction error comes from instrumental (aka operant) conditioning, where rewards ‘reinforce’ behavior by strengthening associations between stimuli and behavioral responses. Formally, a reward-prediction error is defined as the difference between the reward predicted at a given point in time and the actual reward received. This goes way back to [Thorndike 1911] where Thorndike formally states the idea as the Law of Effect:
Any act which in a given situation produces satisfaction becomes associated with that situation so that when the situation recurs the act is more likely than before to recur also.
In Schultz’s experiments, the DA neurons of a monkey are recorded from as the monkey performs a number of various tasks, including “reaction time tasks, direct and delayed GO-NO GO tasks, spatial delayed response and alternation tasks, air puff and saline active avoidance tasks, operant and classically conditioned visual discrimination tasks, self-initiated movements, and unpredicted delivery of a reward in absence of any formal task.” This following image has been lifted from the results of [Schultz 1998]:
The explanation of these results is presented as follows. In the top figure, a reward (R) is unexpectedly delivered and the DA neurons activate. This is because there is a positive error in the predicted reward; no reward was expected, but there was one, BOOM, phasic DA signal. In the middle figure a conditioned stimulus (CS) has been associated with the reward, now the CS occurs unexpectedly, which means that a reward is on the way, thus once again there is a positive error in the predicted reward. Now, however, at the time of reward delivery the reward was predicted and the reward was received. There was no error in reward-prediction, so there is no phasic DA signal. And finally, in the bottom panel we see a CS cause activation of the phasic DA signal, but this time no reward is delivered. Now, exactly when the reward should be delivered and is not there is a negative reward-prediction error, and a corresponding decrease in tonic DA levels is observed.
Another interesting result from this experiment is that the phasic DA signal will push backwards along the chain of predictive events to the earliest predictive sensory stimulus signalling that a reward is coming. Taken all together, a pretty strong case for the reward-prediction error hypothesis is presented.
Problems with phasic DA as a reward-prediction error
There are, however, a number of problems have arisen under close examination of this hypothesis and through further experimental work, laid out by Redgrave et al. These are the main contentions:
- DA neurons respond not only to rewarding stimuli, but also to non-rewarding sensory events salient only by virtue of their novelty or intensity [Schultz 1998], as well as conditioned stimuli not associated with a reward [Bromberg-Martin 2010].
- The phasic DA response is remarkably stereotyped (occurring with ~100ms latency, and a duration of ~100ms), across species, sensory modalities, numerous experimental paradigms, and largely independent of perceptual complexity of eliciting event [Redgrave 2011]. This highly stereotyped DA response time is incongruent with the reward-prediction error hypothesis when considering that there can be a marked difference in the time taken to establish the reward value of different stimuli.
- The latency of gaze-shifts is in the range of 150-250ms [Jay 1987], and the phasic DA response very reliably occurs around ~100ms [Schultz 1998], this means that the reward-prediction error must be calculated before the animal has foveated on the stimulus. Additionally, the source of visual information driving the DA neurons is largely, if not exclusively, the superior colliculus. Neurons in the superior colliculus are highly sensitive to the location of luminance changes, but largely nonresponsive to color and geometric configurations, meaning that the superior colliculus is not in a position to provide object identity (and reward) information to DA neurons.
To contend with these last two points, a number of experiments have shown that DA neurons have shown responses of differing magnitudes and probabilities to unpredicted complex visual stimuli. However, throughout all of the experiments conducted, the different visual stimuli were presented consistently at the same location, which is exactly the visual feature that superior colliculus is capable of detecting [Redgrave 2006]. Rather than discriminating between complex visual stimuli features, the location of the stimuli is instead being used to determine the reward value of the stimulus. Outside of the experimental paradigm, however, temporally unpredictable events are also spatially unpredictable, which makes it unlikely that determining reward value by spatial location in natural environments would be a useful mechanism.
Taken all together, a pretty strong case against the reward-prediction error hypothesis is presented.
An alternative implementation of instrumental conditioning
In the paper series, Redgrave et al propose that instrumental (again, also known as operant) conditioning arises as a function of two mechanisms in the brain: 1) A mechanism to determine whether or not an unpredicted sensory stimuli was caused by the system (agency), establishing a cause-effect relationship if one exists, and 2) a mechanism for reward to modulate the afferent input to the striatum. As mentioned in the last post, the basal ganglia is a proposed central selection device, choosing actions based on the saliency of their input.
Phasic DA for agency determination / novel movement identification
Instead of the reward-prediction error being determined by the phasic DA signal, it is proposed to function for a much more basic purpose: Identifying the cause of unpredicted sensory stimuli. This is a prerequisite to instrumental conditioning / any adaptive behavior. This proposal is based on identifying another function that would generate behavior very similar to a reward-prediction error, while also considering the precise and highly stereotyped natures of the response (~100ms latency, ~100ms duration), and the other information that is likely to be in the striatum at the point when the phasic DA signal arrives. According to [Redgrave 2006] and [Redgrave 2008], there are at least three additional signals in the striatum at the time of phasic DA release:
- Sensory: from branching projections of the superior colliculus, providing information on the stimulus that elicited the phasic DA response
- Contextual information (i.e. general sensory, metabolic, cognitive state, and physical location): from any number of cortical, limbic, and subcortical projections into the striatum, and
- Motor-copy: signals sent from cortical and subcortical sensorimotor structures to the brainstem and spinal cord provide efference copies of the outgoing motor command through branching projections that are relayed both directly and indirectly (through the thalamus) to the dorsal striatum.
Note that these signals would also be in the basal ganglia and likely used in the same way in the proposal that phasic DA is for reward-prediction error as well. However, the list of problems presented above suggests strongly that stimulus and reward-value identification do not operate through the short-latency phasic DA signal. Agency detection is the alternative proposal for a learning based function that requires highly precise timing information, and does not rely on unavailable information such as object identity and actual stimulus reward-value.
One of the main problems with identifying the cause of unexpected stimuli is sorting through the irrelevant information to arrive at the specific trigger. The idea for the phasic DA signal to overcome this computational problem is that it ‘tags’ the signals in the dorsal striatum, including the motor-copy, when an unexpected stimulus occurs, making those actions to be more likely to be chosen again in a similar contextual situation. The authors also note that this process would be aided by the short-latency nature of the phasic DA signal, such that behavior evoked by the stimuli doesn’t get included in the signal tagging, confounding the event-outcome identification.
Through noisy exploration trying to make the event recur, signals that are consistently present become reinforced further, weeding through those that aren’t required to elicit the unexpected stimulus. Eventually the signal which accurately predicts the stimulus is identified. If it is a movement, then it gets added to the ‘library’ of motor actions, increasing the animals repertoire of predictable action / outcomes; in this way the phasic DA signal acts to determine agency and identify new movements. If the signal is not a movement, the association with a reward is noted, stored, and life carries on.
Prediction of sensory stimuli
As mentioned above, DA neurons respond to novel sensory stimuli. Interestingly, the novelty response of DA neurons habituates rapidly when a sensory stimulus fails to associate with a reward. Although much is known about the variables that influence habituation in primitive or reduced preparations, relatively little is known about the mechanisms behind the habituation of un-reinforced sensory stimuli [Redgrave 2011]. It could be a default property of the early sensory networks when a stimulus is repeatedly applied in the absence of any reinforcement, or the result of an outside network modulating afferent projections to sensory systems.
However, when a stimulus is associated with reward, early sensory systems sensitize to its presentation. Additionally, the phasic DA response shifts back to occur at the time of the conditioned stimulus, rather than at the time of the reward. This response continues to push backwards to the first predictable event in a chain of events leading to a reward, seemingly in conflict with the sensitization of stimuli associated with a reward. Additionally, if the conditioned stimulus occurs and the reward does not follow, at the time of the expected reward there is a dip in the tonic DA level.
The mechanism responsible is a precisely timed inhibitory signal that acts to cancel out the phasic DA response evoked by predicted rewards [Schultz 1998]. As stimuli are recognized as predictors of future sensory events, this timed inhibitory signal prevents the activation of the DA neurons. In this way only the first, unpredicted, appearance of a CS in a chain of events evokes a phasic DA response. The goal of this response is to try to learn the cause of this stimulus, in the event that no predictor is learned, the predictor first in the chain will continue to evoke a phasic DA response.
The source of this precisely timed inhibitory signal has not been identified experimentally, but there are several candidates identified in [Redgrave 2011]: direct inhibitory inputs from within the basal ganglia (striatum or globus pallidus); indirect inhibitory inputs from the habenulu-rostro-medial tegmental system (hRMTg); or phasic afferent excitation of local inhibitory neurons with connections to nearby DA neurons.
A side note from me. The cerebellum is widely regarded as a supervised learning center for the brain (so widely I won’t even provide supporting references!). With its highly stereotyped repeated neural structure, and the insane amount of neurons it houses (accounting for 10% of the volume of the brain but holding over 50% of its neurons!), it is thought to provide this supervised learning functionality for a number of different neural systems. The prediction of sensory events given a conditioned stimulus or efference copy of a motor command is a very basic supervised learning problem. The hRMTg system has, in its wide list of afferent projections, connections with deep cerebellar neurons [Jhou 2011]. Although the cerebellum wouldn’t necessarily be required, it also has access to all the contextual, sensory, and motor copy information sent to the basal ganglia, and the connections to the hRMTg system suggest it to me as a favorite among the possibilities listed.
Response to noxious events
Another highly valuable feature for a system is to flag any actions which led to a noxious, such as a painful response, and prevent those actions from being executed again. It would be expected, then, that DA activity is suppressed whenever noxious stimiuli are encountered. Indeed this is the case [Redgrave 2006], where phasic suppression of DA activity lasts for the duration of the noxious event. The mechanism believed responsible for this effect are specialized, high-threshold nociceptors, which are sensory receptors that responds to potentially damaging stimuli with direct projections to the spinal cord and brain. In the same way that phasic DA release potentiates connection strengths, phasic DA suppression depresses the weighting of these connections, making them less likely to be chosen again in the future when a similar situation arises.
To be clear, this response is only expected from stimuli that are directly perceived by the nociceptors to be noxious, such a phasic DA suppression is not expected in the case that a stimuli is noxious but higher level processing is required to determine its reward-value.
Reward maximization;
So we’ve established a likely function for the phasic DA signal, the identification of agency. There’s more to instrumental conditioning, however. There also needs to be a means of reward maximization. The details are light on this part of the model, but are based on the observation of computational models that afferent sensory structures projecting into the basal ganglia could also demonstrate reward-based modulation. This is would give rise to the reward-based action selection bias that is the crux of formal reinforcement learning. This figure is lifted from [Redgrave 2011]:
In this figure the proposed system is shown operating in response to intrinsic (to the basal ganglia) reinforcement on the left, in (A). This case arises when unexpected stimuli bias the action selection process of the basal ganglia to attempt to discover the cause of this stimulus, causing a ‘repetition bias’. On the right of the figure, in (B), the system is shown responding to extrinsic reinforcement, where higher level cortical processing centers have determined that a stimulus was rewarding, and the strength of projections into the basal ganglia are weighted to make the responsible action more likely to be repeated.
As the authors admit, how the reward maximization on the afferent projections to the basal ganglia could occur is still very much unknown. Additionally, as previously mentioned, the mechanisms through which non-reward associated novel stimuli habituate and reward associated novel stimuli sensitize remain to be determined. But the reward maximization proposal is definitely of secondary concern in these papers, the main issue being the reconsideration of the function of the phasic DA signal, for which a case was very strongly presented.
Overview of proposed model;
To put this all together, the system model works as follows. An event occurs that causes activity in an early sensory processing system, which activates the DA neurons. The DA neurons cause a biasing of action selection towards the actions in the dorsal striatum at that moment (which are the actions just taken), which potentially caused the novel, or unpredicted, sensory stimulus. Some other system now says ‘hey that was a rewarding stimuli, don’t habituate to it early sensory systems!’, preventing habituation in the early sensory system. The stimulus then continues to drive the DA neurons, tagging the signals that are in the striatum at that time. As this happens, the signals in the striatum will vary through noise on during action selection, which helps exploration to try to pin down what causes this unexpected (and now defined to be) rewarding sensory stimulus. So far all the biasing of action choice is taking place inside the striatum. When the phasic DA release has pinned down the signals that elicit this sensory stimulus, there’s a transference of this signal to the cortex. In the cortex the reward-maximization system can now bias this action that was figured out in the basal ganglia such that it’s weighted more heavily outside the striatum. Once this is done, the inhibitory predictive system can now learn the association between this signal occurring and a reward following, and a precisely timed inhibitory spike can be generated and sent to the DA neurons to prevent a dopaminergic release.
The last part about the inhibitory predictive system kicking in after transference to the cortex wasn’t explicitly stated in any of the papers, but that’s my understanding of this model.
Questions / Comments
Here are some questions that have come up as I’ve been reading through these papers.
– As mentioned in the previous post, the basal ganglia is proposed to be the central selection device for the brain. This means that the different command systems vying for control are constantly projecting in saliency signals, which makes me wonder how does BG make decisions for the upcoming moment if bombarded by efferent copies of motor commands? I remember reading previously in articles with other models of the basal ganglia a functional actor/critic separation of the dorsal/ventral striatum. Would some separation of saliency and information signals help? Or could it have something to do with dual population coding, which the authors previously mentioned as a means of conveying saliency. Perhaps the information is transmitted and the saliency is chosen from the norm of the vector of firing rates inside the striatum? This second option seems likely to introduce some timing issues.
– There are several promising models which operate based on reward maximization in happening first inside the basal ganglia, then being transferred out to the cortex [Ashby 2007], would the above separation into actor/critic dorsal/ventral striatum help realize this? With novelty detection in the dorsal side, receiving projections from the DA neurons, and reward maximization on the ventral side? Then upon consolidation of a “good” set of movements or action plans transference to the cortex? I am interested to investigate this.
Lots to think about!
————————————————————————————-
[Ashby 2007] – A neurobiological theory of automaticity in perceptual categorization
[Bromberg-Martin 2010] – Dopamine in Motivational Control: Rewarding, Aversive, and Alerting
[Jay 1987] – Sensorimotor integration in the primate superior colliculus. I. Motor convergence
[Jhou 2011] – The mesopontine rostromedial tegmental nucleus: a structure targeted by the lateral habenula that projects to the ventral tegmental area of Tsai and substantia nigra compacta
[Matsumoto 2009] – Two types of dopamine neuron distinctly convey positive and negative motivational signals
[Redgrave 2006] – The short-latency dopamine signal: a role in discovering novel actions?
[Redgrave 2008] – What is reinforced by phasic dopamine signals?
[Redgrave 2011] – Functional properties of the basal ganglia’s re-entrant loop architecture: selection and reinforcement
[Schultz 1998] – Predictive Reward Signal of Dopamine Neurons
[Thorndike 1911] – Animal intelligence; experimental studies
Redgrave P, Vautrelle N, & Reynolds JN (2011). Functional properties of the basal ganglia’s re-entrant loop architecture: selection and reinforcement. Neuroscience, 198, 138-51 PMID: 21821101