A network model of basal ganglia for understanding the roles of dopamine and serotonin in reward-punishment-risk based decision making

Balasubramani, Pragathi P.; Chakravarthy, V. Srinivasa; Ravindran, Balaraman; Moustafa, Ahmed A.

doi:10.3389/fncom.2015.00076

HYPOTHESIS AND THEORY article

Front. Comput. Neurosci., 17 June 2015
Volume 9 - 2015 | https://doi.org/10.3389/fncom.2015.00076

Pragathi P. Balasubramani¹

V. Srinivasa Chakravarthy¹*

Balaraman Ravindran²

Ahmed A. Moustafa³,⁴

¹Department of Biotechnology, Indian Institute of Technology Madras, Chennai, India
²Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India
³School of Social Sciences and Technology, Marcs Institute for Brain and Behavior, University of Western Sydney, Penrith, NSW, Australia
⁴Department of Veterans Affairs, New Jersey Health Care System, East Orange, NJ, USA

There is significant evidence that in addition to reward-punishment based decision making, the Basal Ganglia (BG) contributes to risk-based decision making (Balasubramani et al., 2014). Despite this evidence, little is known about the computational principles and neural correlates of risk computation in this subcortical system. We have previously proposed a reinforcement learning (RL)-based model of the BG that simulates the interactions between dopamine (DA) and serotonin (5HT) in a diverse set of experimental studies including reward, punishment and risk based decision making (Balasubramani et al., 2014). Starting with the classical idea that the activity of mesencephalic DA represents reward prediction error, the model posits that serotoninergic activity in the striatum controls risk-prediction error. Our prior model of the BG was an abstract model that did not incorporate anatomical and cellular-level data. In this work, we expand the earlier model into a detailed network model of the BG and demonstrate the joint contributions of DA-5HT in risk and reward-punishment sensitivity. At the core of the proposed network model is the following insight regarding cellular correlates of value and risk computation. Just as DA D1 receptor (D1R) expressing medium spiny neurons (MSNs) of the striatum were thought to be the neural substrates for value computation, we propose that DA D1R and D2R co-expressing MSNs are capable of computing risk. Though the existence of MSNs that co-express D1R and D2R is reported by various experimental studies, existing computational models have not included them. Ours is the first model that accounts for the computational possibilities of these co-expressing D1R-D2R MSNs, and describes how DA and 5HT mediate activity in these classes of neurons (D1R-, D2R-, D1R-D2R- MSNs).
Starting from the assumption that 5HT modulates all MSNs, our study predicts significant modulatory effects of 5HT on D2R and co-expressing D1R-D2R MSNs, which in turn explain the multifarious functions of 5HT in the BG. The experiments simulated in the present study relate 5HT to risk sensitivity and reward-punishment learning. Furthermore, our model is shown to capture the impairment of reward-punishment and risk based decision making in Parkinson's Disease (PD). The model predicts that optimizing 5HT levels along with DA medications might be essential for improving the patients' reward-punishment learning deficits.

Introduction

Decision making is the process of choosing an action from a set of potential alternatives. The resulting rewarding or punitive outcomes can shape future decisions. In psychological terms, rewards and punishments represent opposite ends on the affective scale. Despite efforts to find dissociable brain systems that code for processing reward and punishment outcomes (Liu et al., 2011), a stringent division of brain systems in reward vs. punishment terms does not seem to be possible, since the same neural regions respond to both reward and punishment (Rogers, 2011). The science of learning about the environment through outcomes (rewards and punishments) is called reinforcement learning (RL) (Sutton and Barto, 1998). We focus on a key area of the brain thought to implement reinforcement learning: the basal ganglia (Chakravarthy et al., 2010).

The Basal Ganglia (BG) are a set of nuclei situated in the forebrain known to be involved in a variety of functions, including action selection, action timing, working memory, and motor sequencing (Chakravarthy et al., 2010). A prominent approach that has been gaining consensus over the past decade seeks to model functions of the BG using the theory of RL (Joel et al., 2002). RL theory describes how an artificial agent or an animal learns stimulus-response relationships that maximize rewards obtained from the environment. According to this theory, stimulus-response associations with rewarding outcomes are reinforced, while those that result in punishments are attenuated. Experimental studies show that the activity of the dopamine (DA) releasing mesencephalic nucleus, the substantia nigra pars compacta (SNc), resembles an RL-related quantity called Temporal Difference (TD) error. TD error represents the difference between the total reward that an animal actually obtains and its expectation of the same, and is a key variable that controls learning in the RL framework. This insight has inspired extensive modeling work to apply concepts from RL for describing functions of the BG (Joel et al., 2002). RL theory has been able to account for many crucial functions of DA in BG-mediated learning and behavior (Houk et al., 2007; Schultz, 2010a). Classical models of the BG cast their dynamics in a value function based decision making framework, where the value function is the expectation of observed rewards (Joel et al., 2002; Frank et al., 2004; Krishnan et al., 2011). We showed in a recent study (Balasubramani et al., 2014) that BG dynamics can be better modeled using a utility based decision making framework mediated by the neuromodulators DA and serotonin (5HT). In that abstract model (Balasubramani et al., 2014), the activity of 5HT controlled the combination of the value and risk functions for the computation of utility, where risk is the variance observed in the outcomes.
The model was shown to reconcile three diverse and representative theories that seek to associate 5HT with (1) punishment sensitivity; (2) time scale of reward prediction; and (3) risk-sensitivity. According to the first theory, central 5HT modulates punishment prediction differentially from reward prediction (Cools et al., 2008). Artificial reduction of 5HT by reducing the levels of tryptophan in the body decreased the tendency to avoid punishment (Cools et al., 2011). A second theory of 5HT function associates its activity with the time scale of reward prediction. This theory is based on experiments which showed that under conditions of low 5HT, subjects exhibited impulsivity, a tendency to choose short-term rewards over long-term ones (Tanaka et al., 2007). The third theory relates 5HT to risk-sensitivity. Low levels of 5HT promote risk seeking behavior when subjects are given choices whose outcomes have equal means but different variances (risk) (Long et al., 2009; Murphy et al., 2009).

The current study presents a neural network model of the BG that includes nuclei such as the striatum, subthalamic nucleus (STN) and globus pallidus (externa and interna, GPe/GPi), and is controlled by neuromodulators such as DA and 5HT. The model builds on a novel proposal that the medium spiny neurons (MSNs) of the striatum can compute either value or risk depending on the types of DA receptors they express. While the MSNs that express DA D1-receptor (D1R) compute value as earlier suggested in modeling studies (O'Doherty et al., 2004), those that co-express D1R and D2R are now shown to be capable of computing risk. No earlier computational models of the BG (Frank et al., 2004; Ashby et al., 2010; Humphries and Prescott, 2010; Krishnan et al., 2011) have taken these D1R-D2R co-expressing neurons into consideration, though their existence in the BG was shown by many experiments (Nadjar et al., 2006; Bertran-Gonzalez et al., 2010; Hasbi et al., 2010, 2011; Perreault et al., 2010, 2011; Calabresi et al., 2014). The neuromodulator DA is represented as the TD error mediating either the update of the cortico-striatal weights or the action selection dynamics occurring downstream of the striatum. This is in agreement with various contemporary models of DA in the BG (Frank et al., 2004; Magdoom et al., 2011; Kalva et al., 2012; Chakravarthy and Balasubramani, 2014). The specific modulation site of 5HT in the striatum is elusive (Ward and Dorsa, 1996; Eberle-Wang et al., 1997; Barnes and Sharp, 1999; Nicholson and Brotchie, 2002; Parent et al., 2011). This study makes a prediction on the types of striatal MSNs that significantly receive 5HT modulation. It describes the computational roles of the three pools of striatal MSNs viz., D1R-expressing, D2R-expressing and D1R-D2R co-expressing MSNs. It also expands the earlier BG architectures significantly by ascribing a crucial role to the D1R-D2R MSNs that project to the direct and indirect pathways of the BG.
The presented DA-5HT mediated network model is then shown to explain the seminal behavioral effects of these neuromodulators by simulating experiments analyzing reward, punishment, and risk learning (Daw et al., 2002; Cools et al., 2008; Long et al., 2009). The study also extends to a principal model of BG dysfunction, i.e., Parkinson's Disease (PD), to explain the associated impairment in action selection (Bodi et al., 2009).

The paper is organized as follows: Section A Model of Utility-based Decision Making outlines the lumped model of value and risk computation in the striatum as described in our earlier study (Balasubramani et al., 2014). Section Cellular Correlates for the Value and the Risk Computation describes the neural correlates for both the value and risk computation in the striatum. Specifically, this section shows that D1R expressing MSNs are involved in value computation, while the MSNs that co-express D1R and D2R support risk computation. The network model, introduced in Section Modeling the BG Network in Healthy Control Subjects, uses the neural correlate model of Section Cellular Correlates for the Value and the Risk Computation for the BG action selection dynamics. The D1R MSNs project to GPi via the Direct Pathway (DP) while the D1R-D2R and the D2R MSNs project to GPi via the Indirect Pathway (IP) consisting of the GPe and STN. The SNc model component receives input from both D1R MSNs and D1R-D2R MSNs, and releases DA. The experimental sections deal with testing the model on risk sensitivity (Section Modeling the Risk Sensitivity), and on punishment sensitivity and behavioral inhibition (Section Modeling Punishment Mediated Behavioral Inhibition). The model is further extended to simulate the PD condition. Section Modeling the Reward-punishment Sensitivity in PD thereby studies the model behavior on a probabilistic reward-punishment learning paradigm in control and PD conditions. The model equations that are adapted to represent the PD condition are given in Section Simulating Parkinson's Disease (PD). The study results, limitations and testable predictions are finally discussed in Section Discussion.

Model

A Model of Utility-Based Decision Making

This section quickly summarizes our extended reinforcement learning model of the BG (Balasubramani et al., 2014), where the agent (subject) tends to maximize utility. We start with the value function “Q,” associated with a state, “s,” and an action, “a,” pair, at time, “t.” This is the expected discounted sum of rewards obtained starting from time t in state s:

\begin{matrix} Q^{π} (s, a) = E_{π} (r_{t + 1} + γ r_{t + 2} + γ^{2} r_{t + 3} + … | s_{t} = s, a_{t} = a) & (2.1.1) \end{matrix}

where, γ, is a discount factor controlling the myopicity of the rewards. These value functions are updated using the temporal difference learning rule as follows:

\begin{matrix} Q_{t + 1} (s_{t}, a_{t}) = Q_{t} (s_{t}, a_{t}) + η_{Q} δ_{t} & (2.1.2) \end{matrix}

where, “δ_t” is the temporal difference (TD) error, given by Equation (2.1.3) if the experiment runs for multiple time steps, and by Equation (2.1.4) in the case of single-step experiments.

\begin{matrix} δ_{t} = r_{t} + γ Q_{t} (s_{t + 1}, a_{t + 1}) - Q_{t} (s_{t}, a_{t}) & (2.1.3) \end{matrix}

\begin{matrix} δ_{t} = r_{t} - Q_{t} (s_{t}, a_{t}) & (2.1.4) \end{matrix}

We introduced the notion of a risk function, “h,” that tracks the variance (δ²) (Bell, 1995; D'Acremont et al., 2009) in instantaneous rewards or the reward prediction error with zero mean, and is updated as follows:

\begin{matrix} h_{t + 1} (s_{t}, a_{t}) = h_{t} (s_{t}, a_{t}) + η_{h} ξ_{t} & (2.1.5) \end{matrix}

where, ξ_t is the risk prediction error given by:

\begin{matrix} ξ_{t} = δ_{t}^{2} - h_{t} (s_{t}, a_{t}) & (2.1.6) \end{matrix}
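As a minimal sketch of Equations (2.1.2) and (2.1.4) to (2.1.6), the single-step updates can be simulated for one state-action pair whose reward is 1 with probability p and 0 otherwise (the reward scheme described for Figure 1C). The function name, learning rates, trial count, and random seed below are illustrative assumptions, not values taken from the model.

```python
import random


def track_value_and_risk(p, n_trials=20000, eta_q=0.01, eta_h=0.01, seed=0):
    """Run single-step TD updates for one state-action pair.

    Reward is 1 with probability p, else 0. All numeric defaults are
    illustrative assumptions, not parameters from the paper.
    """
    rng = random.Random(seed)
    q = 0.0  # value estimate Q, updated by Eq. (2.1.2)
    h = 0.0  # risk (variance) estimate h, updated by Eq. (2.1.5)
    for _ in range(n_trials):
        r = 1.0 if rng.random() < p else 0.0
        delta = r - q        # TD error for single-step tasks, Eq. (2.1.4)
        xi = delta ** 2 - h  # risk prediction error, Eq. (2.1.6)
        q += eta_q * delta   # Eq. (2.1.2)
        h += eta_h * xi      # Eq. (2.1.5)
    return q, h


if __name__ == "__main__":
    q, h = track_value_and_risk(p=0.5)
    print(f"Q = {q:.3f}, h = {h:.3f}")
```

With small learning rates, Q should settle near the mean reward p and h near the reward variance p(1 − p), up to sampling noise.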

Finally, we define the utility “U,” at time, “t,” as a combination of the value function and the risk function as follows:

\begin{matrix} U_{t} (s_{t}, a_{t}) = Q_{t} (s_{t}, a_{t}) - α \, \mathrm{sign} (Q_{t} (s_{t}, a_{t})) \sqrt{h_{t} (s_{t}, a_{t})} & (2.1.7) \end{matrix}

where, α controls the risk sensitivity and is proposed to represent the functioning of 5HT in the BG. The sign() term in Equation (2.1.7) represents the non-linear risk sensitivity. Studies show that the subjects are risk averse in the case of gains and risk seeking during losses (Kahneman and Tversky, 1979). The subjective gains (losses) are represented by a positive (negative) value of Q; and therefore the risk component with the sign(Q) would negatively (positively) affect the Utility, in order to show risk averse (seeking) behavior. The policy used for utility maximization is soft-max, with the probability, “P,” of choosing an action from a state at time, “t,” given by the following Equation (2.1.8):

\begin{matrix} P_{t} (a | s) = \exp (β U_{t} (s, a)) / \sum_{i = 1}^{n} \exp (β U_{t} (s, i)) & (2.1.8) \end{matrix}

“n” is the total number of actions available at state, “s,” and “β” is the inverse temperature parameter. Values of β tending toward 0 make the actions almost equiprobable whilst values tending toward ∞ make the soft-max action selection identical to greedy action selection.
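The utility of Equation (2.1.7) and the soft-max policy of Equation (2.1.8) can be sketched as follows. The function names and the example numbers are assumptions made for illustration; subtracting the maximum before exponentiation is a standard numerical safeguard and does not change the probabilities.

```python
import math


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def utility(q, h, alpha):
    """Eq. (2.1.7): U = Q - alpha * sign(Q) * sqrt(h), with h >= 0."""
    return q - alpha * sign(q) * math.sqrt(h)


def softmax_policy(utilities, beta):
    """Eq. (2.1.8): action probabilities from a list of utilities."""
    scaled = [beta * u for u in utilities]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Two hypothetical actions with equal value but different risk.
    safe = utility(q=0.5, h=0.0, alpha=1.0)
    risky = utility(q=0.5, h=0.25, alpha=1.0)
    print(softmax_policy([safe, risky], beta=5.0))
```

In this example both actions have the same positive Q, so a positive α lowers the utility of the higher-variance action and the policy favors the safe one; with β = 0 the two actions are equiprobable.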

This utility-based model of the BG described by Balasubramani et al. (2014) is an abstract, lumped model in which it is proposed that the utility function is computed in the striatum. However, in order to expand the lumped model to a network version, we first identify cellular correlates of value and risk computations in the next section.

Cellular Correlates for the Value and the Risk Computation

Most approaches to modeling cellular level mechanisms for value computation in the striatum consist of three conditions:

(1) Occurrence of TD error information in the form of DA signals at the striatum (Schultz et al., 1997),

(2) Availability of information related to the cortical sensory state in the striatum (Divac et al., 1977; Mcgeorge and Faull, 1989), and

(3) DA-dependent plasticity in cortico-striatal connections (Reynolds and Wickens, 2002).

A typical formulation of DA-dependent learning (Reynolds and Wickens, 2002) may be expressed as the change in cortico-striatal connection strength, w (Δw),

\begin{matrix} Δ w = η δ x & (2.2.1) \end{matrix}

where “x” in Equation (2.2.1) represents the cortical sensory input and is used in this section as a logical variable for neural encoding of the underlying state “x,” x = 1 (if x = s_t), else x = 0; “δ” is the TD error [Equations (2.1.3, 2.1.4): representing DA activity]; and “η” is the learning rate. Similar formulations have been proposed from purely RL-theory considerations (see Chapter 9 of Abbott, 2001). A slight variation of the above equation would be as follows.

\begin{matrix} Δ w = η λ^{Str} (δ) x & (2.2.2) \end{matrix}

where “λ^Str” is a function of δ, that represents the effect of DA on the striatal neural firing rate (Reynolds and Wickens, 2002). Thus, the learning rule of Equation (2.2.2) has a Hebb-like form, where the neuro-modulation is modeled in terms of the effect of the neuromodulator on the firing rate of the post-synaptic neuron. The form of the function λ^Str varies depending on the type of DA family receptors (R) expressed in Medium Spiny Neurons (MSNs) as explained below. In neurons with D1R expression, higher DA level increases the probability of MSN excitation by a given cortical input (Moyer et al., 2007; Surmeier et al., 2007). Hence, in models that represent MSNs, λ^Str is described as an increasing sigmoid function of DA for neurons that express D1R. In cells with D2R, the activation is higher under conditions of low DA levels (Hernandez-Echeagaray et al., 2004) and therefore the λ^Str function is modeled as a decreasing function of DA (Frank, 2005; Frank et al., 2007a). These sigmoid λ^Str functions are expressed as:

\begin{matrix} \begin{array}{l} λ_{D1}^{Str} (δ) = \frac{2 c_{1}}{1 + \exp (c_{2} (δ + c_{3}))} - c_{1} \\ λ_{D2}^{Str} (δ) = \frac{2 c_{1}}{1 + \exp (c_{2} (δ + c_{3}))} - c_{1} \\ λ_{h-D1}^{Str} (δ) = \frac{c_{1}}{1 + \exp (c_{2} (δ + c_{3}))} \\ λ_{h-D2}^{Str} (δ) = \frac{c_{1}}{1 + \exp (c_{2} (δ + c_{3}))} \end{array} & (2.2.3) \end{matrix}

where c₁, c₂, c₃ are constants that differ by receptor type and represent the nature of the receptors. The gain functions of D1R MSNs and D2R MSNs are given by λ^Str_D1 and λ^Str_D2, and those of the D1R and the D2R components of co-expressing MSNs are given by λ^Str_h−D1 and λ^Str_h−D2, respectively.

Examples of such sigmoid λ functions with parameters (Table 1) for the D1R, D2R, and the D1R-D2R MSNs are shown in Figure 1B.

Table 1. Parameters used in Equation (2.2.3) for Figure 1.

Figure 1. (A) Schematic of the cellular correlate model for the value and the risk computation in the striatum. (B) The D1, D2, and D1D2 gain functions. (C) The output activity of the D1R MSN (yD1) and the D1R-D2R co-expressing MSN (yD1D2), the variance tracked through Equation (2.1.5) containing δ², and the normalized variance computed analytically, var = p(1 − p). Here p is the probability associated with rewards, i.e., with probability p, reward = 1, else reward = 0. The resemblance of var to yD1D2 shows the ability of the D1R-D2R co-expressing MSN to perform risk computation.

The activity of MSNs with D1R expression (y_D1) is appropriately suited for value computation (Krishnan et al., 2011; Kalva et al., 2012). They express λ_D1(δ) as an increasing function of δ. The D1R MSN's activity can be thought of as a network equivalent of Equation (2.1.2) in the abstract model.

The D1R MSNs receive cortico-striatal connections whose weight is denoted by “w_D1.” The value “Q” computed from such an MSN's activity (y_D1) is given by Equation (2.2.4).

\begin{matrix} y_{D1} = w_{D1} x \quad \text{and} \quad Q = y_{D1} & (2.2.4) \end{matrix}

The change in weight for such a neuron is given by Equation (2.2.5).

\begin{matrix} Δ w_{D1} = η_{D1} λ_{D1}^{Str} (δ) x & (2.2.5) \end{matrix}

where η_D1 is the learning rate.

A similar neuron model in which D1R and D2R are co-expressed can simulate risk computation. For a neuron that computes risk, the λ^Str function is denoted “λ^Str_D1D2.” It was reported that the behavior of D1R-D2R co-expressing neurons may be described as the sum of the antagonistic actions of D1R and D2R expressing neurons (refer to the Discussion section for more details). Therefore, activation of D1R-D2R MSNs (y_D1D2) could be modeled simply as an addition of the effects of independent activations of D1R and D2R MSNs (Surmeier et al., 2007; Allen et al., 2011; Hasbi et al., 2011). When their activation function is computed as a simple summation (superposition) of D1R and D2R MSNs, they capture the variance associated with the rewards and thereby form the risk function (Figure 1). The function “λ^Str_D1D2” of D1R-D2R MSNs is an even function of “δ”: λ^Str_D1D2(δ) increases with the magnitude of δ, and therefore with δ². As given in Equation (2.2.6), λ^Str_D1D2 can be expressed as the summation of functions corresponding to a D1R component (λ^Str_h−D1) and a D2R component (λ^Str_h−D2) as follows:

\begin{matrix} λ_{D1D2}^{Str} = λ_{h-D1}^{Str} + λ_{h-D2}^{Str} & (2.2.6) \end{matrix}

Note that the characteristics of λ^Str_h−D1 and λ^Str_h−D2 as a function of δ depend on the constants c₁, c₂, c₃ of Equation (2.2.3). Response (y_D1D2) of such a neuron is given as,

\begin{matrix} y_{D1D2} = w_{D1D2} x \quad \text{and} \quad h = y_{D1D2} & (2.2.7) \end{matrix}

and the change in the corresponding weight, Δw_D1D2, is given as,

\begin{matrix} Δ w_{D1D2} = η_{D1D2} λ_{D1D2}^{Str} (δ) x & (2.2.8) \end{matrix}

where η_D1D2 is the learning rate. The (D1R-expressing) striatal MSNs with δ-dependent λ^Str functions of increasing sigmoidal shape are capable of computing value. Similarly, (D1R-D2R co-expressing) striatal neurons with δ-dependent λ^Str functions that are “U” shaped can compute risk (Figure 1). The gain expression for risk-coding MSNs (λ^Str_h−D1, λ^Str_h−D2) uses a logarithmic-sigmoid function that is unipolar, while the gain expression of the other D1R- and D2R-MSNs (λ^Str_D1, λ^Str_D2) uses a tangent-sigmoid function that is bipolar [Equation (2.2.3)].
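The shapes of these gain functions can be checked with a short sketch of Equations (2.2.3) and (2.2.6). The constants used here are hypothetical and are not the Table 1 values; they are chosen only so that the D1R-like gains increase with δ (negative c₂), the D2R-like gains decrease with δ (positive c₂), and the two unipolar components mirror each other about δ = 0.

```python
import math


def bipolar_gain(delta, c1, c2, c3):
    """Tangent-sigmoid form of Eq. (2.2.3); values lie in (-c1, c1)."""
    return 2.0 * c1 / (1.0 + math.exp(c2 * (delta + c3))) - c1


def unipolar_gain(delta, c1, c2, c3):
    """Logarithmic-sigmoid form of Eq. (2.2.3); values lie in (0, c1)."""
    return c1 / (1.0 + math.exp(c2 * (delta + c3)))


# Hypothetical constants (NOT the Table 1 values).
D1 = dict(c1=1.0, c2=-4.0, c3=0.0)     # increasing in delta
D2 = dict(c1=1.0, c2=4.0, c3=0.0)      # decreasing in delta
H_D1 = dict(c1=1.0, c2=-4.0, c3=-1.0)  # D1R component of co-expressing MSN
H_D2 = dict(c1=1.0, c2=4.0, c3=1.0)    # D2R component of co-expressing MSN


def gain_d1d2(delta):
    """Eq. (2.2.6): sum of the D1R and D2R components."""
    return unipolar_gain(delta, **H_D1) + unipolar_gain(delta, **H_D2)


if __name__ == "__main__":
    for d in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(d, round(bipolar_gain(d, **D1), 3), round(gain_d1d2(d), 3))
```

With these mirrored constants the summed gain is even in δ and grows with |δ|, which is the “U” shape that the text associates with risk computation. Other constants would give a differently shaped sum.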

Just as D1R expressing MSNs can be regarded as cellular level substrates for value computation in the striatum, D1R-D2R co-expressing MSNs are suitable to be cellular level substrates for risk computation [Figures 1, 2 (inset)]. The D1R-D2R co-expressing MSN's activity can be thought of as a network equivalent of Equation (2.1.5) in the abstract model. In particular, the even property of their activation as a function of δ is essential to capture the variance associated with rewards (Figure 1C).

Figure 2. Schematic of the signal flow in the network model. Here x denotes the presence of a state and a denotes the action, with the subscript denoting the index i. Since most of the experiments in the study simulate two possible actions for any state, the figure depicts two actions for a state s_i. D1, D2, and D1D2 represent the D1R-, D2R-, and D1R-D2R MSNs, respectively, and w with the corresponding subscript denotes the cortico-striatal weights. The schematic also shows the forms of DA: (1) the δ affecting the cortico-striatal connection weights (Schultz et al., 1997; Houk et al., 2007), (2) the δ_U affecting the action selection at the GPi (Chakravarthy and Balasubramani, 2014), (3) the Q affecting the D1/D2 MSNs (Schultz, 2010b); and the forms of 5HT, represented by α_D1, α_D2, and α_D1D2, modulating the D1R, D2R, and the D1R-D2R co-expressing neurons, respectively. The inset details the notation used in the model section for the cortico-striatal weights (w) and responses (y) of the various kinds of MSNs (D1R expressing, D2R expressing, and D1R-D2R co-expressing) in the striatum, with a sample cortical state size of 4 and a maximum of 2 action choices available for selection in every state.

We now introduce the above cellular substrates for value and risk computation in a network model of the BG and show that the network is capable of reward-punishment-risk based decision making.

Modeling the BG Network in Healthy Control Subjects

The cellular level substrates for value and risk computation in the BG, described above, are now incorporated into a network model of the BG. This model captures the anatomical details of the BG and represents the following nuclei (described in the Section Cellular Correlates for the Value and the Risk Computation)—the striatum, STN, GPe and GPi. The training of the cortico-striatal connections by nigro-striatal DA correlate (δ) also occurs as described in the earlier Section Cellular Correlates for the Value and the Risk Computation. It models, in an elementary form, the action of DA in switching between DP and IP, via the differential action of DA on the D1, D2, and D1-D2 co-expressing receptors (R) of striatal MSNs. The model also claims different DA signals for the updating of cortico-striatal weights and the switching in GPi (Chakravarthy and Balasubramani, 2014). Some of the key properties of the STN-GPe system such as their bi-directional connectivity facilitating oscillations and “Exploratory” behavior are also captured.

The equations for the individual modules of the proposed network model of the BG (Figure 2) are as follows:

Striatum

The Striatum is proposed to have three types of MSNs: D1R expressing, D2R expressing, and D1R-D2R co-expressing MSNs, all of which follow the model described in Section Cellular Correlates for the Value and the Risk Computation. The cortico-striatal weight update equations for different types of neurons (with subscripts—D1, D2, and D1D2: for the D1R expressing, D2R expressing, and D1R-D2R co-expressing MSNs, respectively) with the gain function (λ^Str_D1, λ^Str_D2, λ^Str_D1D2, respectively) as given by Equation (2.2.3), would then be:

\begin{matrix} \begin{array}{l} Δ w_{D1} (s_{t}, a_{t}) = η_{D1} λ_{D1}^{Str} (δ (t)) x \\ Δ w_{D2} (s_{t}, a_{t}) = η_{D2} λ_{D2}^{Str} (δ (t)) x \\ Δ w_{D1D2} (s_{t}, a_{t}) = η_{D1D2} λ_{D1D2}^{Str} (δ (t)) x \end{array} & (2.3.1) \end{matrix}

Each state-action (s-a) pair is associated with a cortico-striatal weight for each MSN type. The weight corresponding to the encountered s and a at a time t is then updated using Equation (2.3.1). The λ^Str gain functions for the D1R, D2R, and D1R-D2R MSNs are the same as in Equation (2.2.3). The δ in the weight update equations is given by Equation (2.3.2) to capture the immediate reward conditions:

\begin{matrix} δ (t) = r - Q_{t} (s_{t}, a_{t}) & (2.3.2) \end{matrix}

η_D1, η_D2, η_D1D2 are the learning rates for the D1R, D2R and the D1R-D2R MSN cortico-striatal weights, respectively. The “Q” function as calculated in the previous section would be computed by the output of D1R MSNs as in Equation (2.3.3).

\begin{matrix} \begin{array}{l} Q_{t} (s_{t}, a_{t}) = y_{D1} (s_{t}, a_{t}) \\ \text{where } y_{D1} (s_{t}, a_{t}) = w_{D1} (s_{t}, a_{t}) x \end{array} & (2.3.3) \end{matrix}

The risk function (h_t) associated with choosing each action, a_t, is then calculated by Equation (2.3.4).

\begin{matrix} \begin{array}{l} h_{t} (s_{t}, a_{t}) = y_{D1D2} (s_{t}, a_{t}) \\ \text{where } y_{D1D2} (s_{t}, a_{t}) = w_{D1D2} (s_{t}, a_{t}) x \end{array} & (2.3.4) \end{matrix}

For a conservative development of a network model from the earlier mentioned abstract level model of Section A Model of Utility-based Decision Making, the utility function for a state-action pair can be written as Equation (2.3.5).

\begin{matrix} U_{t} (s_{t}, a_{t}) = Q_{t} (s_{t}, a_{t}) - α_{D1D2} \, \mathrm{sign} (Q_{t} (s_{t}, a_{t})) \sqrt{h_{t} (s_{t}, a_{t})} & (2.3.5) \end{matrix}

The change in utility is calculated using Equation (2.3.6).

\begin{matrix} δ_{U} (t) = U_{t} (s_{t}, a_{t}) - U_{t - 1} (s_{t}, a_{t - 1}) & (2.3.6) \end{matrix}

Here α_D1D2 in Equation (2.3.5) denotes the modulation by 5HT specifically of the D1R-D2R co-expressing MSNs, which compute the risk value “h.” More details on modeling 5HT modulation are described later in this section.

STN-GPe System

In the STN-GPe model, the STN and GPe layers have an equal number of neurons, with each neuron in STN uniquely connected bi-directionally to a neuron in GPe. Both STN and GPe layers are further assumed to have weak lateral connections within the layer. A more detailed description of this model can be obtained from Chakravarthy and Balasubramani (2014). The number of neurons in the STN (or GPe) (Figure 2) is taken to be equal to the number of possible actions for any given state (Amemori et al., 2011; Sarvestani et al., 2011). The dynamics of the STN-GPe network are given below:

\begin{matrix} \begin{array}{l} τ_{s} \frac{d x_{i}^{STN}}{d t} = - x_{i}^{STN} + \sum_{j = 1}^{n} W_{i j}^{STN} y_{j}^{STN} - x_{i}^{GPe} \\ y_{i}^{STN} = \tanh (λ^{STN} x_{i}^{STN}) \\ τ_{g} \frac{d x_{i}^{GPe}}{d t} = - x_{i}^{GPe} + \sum_{j = 1}^{n} W_{i j}^{GPe} x_{j}^{GPe} + y_{i}^{STN} - x_{i}^{IP} \end{array} & (2.3.7) \end{matrix}

x^GPe_i - internal state (same as the output) representation of ith neuron in GPe;

x^STN_i - internal state representation of ith neuron in STN, with the output represented by y^STN_i;

W^GPe - lateral connections within GPe, equated to a small negative number ϵ_g for both the self (i = j) and non-self (i ≠ j) connections for every GPe neuron.

W^STN - lateral connections within STN, equated to a small positive number ϵ_s for all non-self (i ≠ j) lateral connections, while the weight of self-connection (i = j) is equal to 1 + ϵ_s, for each STN neuron i.

We assume that both STN and GPe have complete internal connectivity, where every neuron in the layer is connected to every other neuron in the same layer with the same connection strength. That common lateral connection strength is ϵ_s for STN, and ϵ_g for GPe. Likewise, STN and GPe neurons are connected in a one-to-one fashion: the i'th neuron in STN is connected to the i'th neuron in GPe and vice-versa. For all simulations presented below, the parameters are ϵ_g = −ϵ_s = 0.1; the step-sizes are 1/τ_s = 0.1 and 1/τ_g = 0.033; and the slope is λ^STN = 3.
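The STN-GPe dynamics of Equation (2.3.7) can be sketched with a forward-Euler update in which the stated step-sizes 1/τ_s and 1/τ_g multiply the right-hand sides directly. Several choices below are assumptions made for illustration: the lateral sums run over the presynaptic index j, the signs of the lateral weights follow the definitions of W^STN (small positive) and W^GPe (small negative) with magnitude 0.1, and the initial state, the indirect-pathway input x_ip, and the function names are hypothetical.

```python
import math


def stn_gpe_step(x_stn, x_gpe, x_ip, eps_s=0.1, eps_g=-0.1,
                 inv_tau_s=0.1, inv_tau_g=0.033, lam_stn=3.0):
    """One forward-Euler step of Eq. (2.3.7) for n paired STN-GPe neurons."""
    n = len(x_stn)
    y_stn = [math.tanh(lam_stn * x) for x in x_stn]
    new_stn, new_gpe = [], []
    for i in range(n):
        # STN lateral input: self-weight 1 + eps_s, non-self weight eps_s.
        lat_stn = sum((1.0 + eps_s if i == j else eps_s) * y_stn[j]
                      for j in range(n))
        # GPe lateral input: weight eps_g for self and non-self connections.
        lat_gpe = sum(eps_g * x_gpe[j] for j in range(n))
        d_stn = -x_stn[i] + lat_stn - x_gpe[i]
        d_gpe = -x_gpe[i] + lat_gpe + y_stn[i] - x_ip[i]
        new_stn.append(x_stn[i] + inv_tau_s * d_stn)
        new_gpe.append(x_gpe[i] + inv_tau_g * d_gpe)
    return new_stn, new_gpe


def simulate(x_ip, n_steps=500, x0=0.01):
    """Iterate the step from a small, slightly asymmetric initial state."""
    n = len(x_ip)
    x_stn = [x0 * (i + 1) for i in range(n)]
    x_gpe = [0.0] * n
    for _ in range(n_steps):
        x_stn, x_gpe = stn_gpe_step(x_stn, x_gpe, x_ip)
    return x_stn, x_gpe


if __name__ == "__main__":
    print(simulate([0.2, 0.5]))
```

Because the STN output passes through tanh and the GPe leak dominates its weak lateral coupling under these assumed signs, the state stays bounded; whether it oscillates depends on the parameters and on x_ip.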

Striatal Output Toward the Direct (DP) and the Indirect Pathway (IP)

Assuming that the striatal D1R MSNs project via the DP to GPi (Albin et al., 1989; Frank, 2005; Chakravarthy et al., 2010), the contribution of the DP to GPi is given by:

\begin{matrix} x_{i}^{DP} = α_{D1} λ_{D1}^{GPi} (δ_{U} (t)) y_{D1} (s_{t}, a_{t}) & (2.3.8) \end{matrix}

The GPe is modeled to receive inputs from both the D2R and D1R-D2R MSNs of the striatum (Hasbi et al., 2011; Perreault et al., 2011; Wallman et al., 2011; Balasubramani et al., 2014) in the indirect pathway. The input to the GPe is therefore given by:

\begin{matrix} x_{i}^{IP} = α_{D2} λ_{D2}^{GPi} (δ_{U} (t)) y_{D2} (s_{t}, a_{t}) + α_{D1D2} \, \mathrm{sign} (y_{D1} (s_{t}, a_{t})) λ_{D1D2}^{GPi} (δ_{U} (t)) \sqrt{y_{D1D2} (s_{t}, a_{t})} & (2.3.9) \end{matrix}

where the response functions of various kinds of MSNs are denoted by variable “y”:

\begin{array}{l} y_{D1} (s_{t}, a_{t}) = w_{D1} (s_{t}, a_{t}) x \\ y_{D2} (s_{t}, a_{t}) = w_{D2} (s_{t}, a_{t}) x \\ y_{D1D2} (s_{t}, a_{t}) = w_{D1D2} (s_{t}, a_{t}) x \end{array}

and

\begin{array}{l} λ_{D1}^{GPi} (δ_{U}) = \frac{2 c_{1}}{1 + \exp (c_{2} (δ_{U} + c_{3}))} - c_{1} \\ λ_{D2}^{GPi} (δ_{U}) = \frac{2 c_{1}}{1 + \exp (c_{2} (δ_{U} + c_{3}))} - c_{1} \\ λ_{h-D1}^{GPi} (δ_{U}) = \frac{c_{1}}{1 + \exp (c_{2} (δ_{U} + c_{3}))} \\ λ_{h-D2}^{GPi} (δ_{U}) = \frac{c_{1}}{1 + \exp (c_{2} (δ_{U} + c_{3}))} \end{array}
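A minimal sketch of the pathway contributions in Equations (2.3.8) and (2.3.9) is given below. It assumes, by analogy with Equation (2.2.6), that λ^GPi_D1D2 is the sum of λ^GPi_h−D1 and λ^GPi_h−D2; the constants, the α values, and the function names are hypothetical, and y_D1D2 is taken to be non-negative so that its square root is defined.

```python
import math


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def bipolar_gain(delta_u, c1, c2, c3):
    """Bipolar lambda^GPi form used for the D1R and D2R MSNs."""
    return 2.0 * c1 / (1.0 + math.exp(c2 * (delta_u + c3))) - c1


def unipolar_gain(delta_u, c1, c2, c3):
    """Unipolar lambda^GPi form used for the co-expressing components."""
    return c1 / (1.0 + math.exp(c2 * (delta_u + c3)))


# Hypothetical constants (c1, c2, c3); not values from the paper.
C_D1 = (1.0, -4.0, 0.0)
C_D2 = (1.0, 4.0, 0.0)
C_H_D1 = (1.0, -4.0, -1.0)
C_H_D2 = (1.0, 4.0, 1.0)


def direct_pathway(delta_u, y_d1, alpha_d1):
    """Eq. (2.3.8): contribution of D1R MSNs to GPi via the DP."""
    return alpha_d1 * bipolar_gain(delta_u, *C_D1) * y_d1


def indirect_pathway(delta_u, y_d1, y_d2, y_d1d2, alpha_d2, alpha_d1d2):
    """Eq. (2.3.9): input to GPe from D2R and D1R-D2R MSNs (y_d1d2 >= 0)."""
    lam_d2 = bipolar_gain(delta_u, *C_D2)
    # Assumption: the D1D2 gain is the sum of its two unipolar components.
    lam_d1d2 = (unipolar_gain(delta_u, *C_H_D1)
                + unipolar_gain(delta_u, *C_H_D2))
    return (alpha_d2 * lam_d2 * y_d2
            + alpha_d1d2 * sign(y_d1) * lam_d1d2 * math.sqrt(y_d1d2))


if __name__ == "__main__":
    print(direct_pathway(0.5, y_d1=1.0, alpha_d1=1.0))
    print(indirect_pathway(0.5, y_d1=1.0, y_d2=0.3, y_d1d2=0.25,
                           alpha_d2=1.0, alpha_d1d2=1.0))
```

Setting both α_D2 and α_D1D2 to zero silences the indirect-pathway input, and the sign of the risk term follows the sign of y_D1, mirroring the sign(Q) factor of Equation (2.1.7).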

It should also be noted that the λ^GPi functions used as gain factors for the striatal neural outputs in Equations (2.3.8, 2.3.9) are different from the λ^Str functions used in Equation (2.3.1). The λ^Str functions used in the weight dynamics of Equation (2.3.1) depend on the TD error of Equation (2.3.2) in the immediate reward condition, whereas the DA signal used in the λ^GPi of Equations (2.3.8, 2.3.9) is the temporal gradient of U [δ_U: Equation (2.3.6)], which has a direct role in switching between DP and IP (Kliem et al., 2007). The temporal difference in the utility function between times t and t−1 is modeled to control the exploitation and exploration dynamics of action selection (Balasubramani et al., 2015) in the BG as follows. If δ_U is high, then according to Equation (2.3.6) the action at time t has a higher utility than that at time t−1. This case facilitates the DP [Equation (2.3.8)], popularly dubbed the Go pathway, which exploits by selecting the same action a_t. In contrast, if δ_U is low, then the NoGo pathway (IP) is selected [Equation (2.3.9)], facilitating the action taken at time t−1. This is because the action at time t−1 has a higher utility than that at time t [Equation (2.3.6)]. In the third case, with δ_U between the high and low levels, a random choice from the action repertoire is made by the Explore pathway (IP) (Chakravarthy and Balasubramani, 2014). Further, DAergic neural activity in monkeys was recently found to correlate well with the computed utility difference at time t during a decision making task (Stauffer et al., 2014).
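The three regimes described above can be caricatured with explicit thresholds on δ_U. This is only an illustrative simplification: the network model expresses the switching through the λ^GPi gain functions, and the thresholds `da_lo` and `da_hi`, the function names, and the action labels below are hypothetical.

```python
import random


def select_regime(delta_u, da_lo, da_hi):
    """Map the utility difference delta_U to a Go / Explore / NoGo regime."""
    if delta_u > da_hi:
        return "Go"       # current action has the higher utility
    if delta_u < da_lo:
        return "NoGo"     # previous action had the higher utility
    return "Explore"      # intermediate delta_U: choose at random


def next_action(delta_u, current, previous, actions, da_lo, da_hi, rng):
    """Pick the next action according to the selected regime."""
    regime = select_regime(delta_u, da_lo, da_hi)
    if regime == "Go":
        return current
    if regime == "NoGo":
        return previous
    return rng.choice(actions)


if __name__ == "__main__":
    rng = random.Random(0)
    for du in (0.5, 0.0, -0.5):
        print(du, select_regime(du, da_lo=-0.1, da_hi=0.1),
              next_action(du, "a1", "a2", ["a1", "a2"], -0.1, 0.1, rng))
```

Widening the gap between the two thresholds enlarges the range of δ_U treated as exploratory, which is one simple way to picture the exploitation-exploration trade-off described in the text.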

In the lumped model of Section A Model of Utility-based Decision Making (Balasubramani et al., 2014), the parameter α represents 5HT activity [Equation (2.1.7)]. Carrying this concept over to a network version leads to the following. Since α controls only the risk term in Equation (2.1.7), and Section Cellular Correlates for the Value and the Risk Computation shows that D1R-D2R co-expressing MSNs compute risk, it is natural to formulate the network model such that α modulates only the D1R-D2R MSNs in the striatum. However, experimental evidence supporting such specificity in 5HT modulation of striatal neurons is unavailable (refer to the Discussion section for details). Given the non-specific nature of 5HT action in the striatum, we introduce three α's in this section to differentially modulate the D1R, D2R and D1R-D2R MSNs, respectively. Precisely, the 5HT parameter α of Equation (2.1.7) is modeled as the parameters α_D1 [Equation (2.3.8)], α_D2, and α_D1D2 [Equation (2.3.9)], representing its differential modulation of the D1R, D2R and D1R-D2R MSNs, respectively (Figure 2, Table 2). The α's are optimized for each experimental condition separately.

TABLE 2

Table 2. The model correlates for DA and 5HT.

The outputs of D1R and D2R MSNs to GPi flow via the DP and IP, respectively (O'Doherty et al., 2004; Amemori et al., 2011; Chakravarthy and Balasubramani, 2014). We propose that D1R-D2R MSNs also project to GPi via the IP (Perreault et al., 2010, 2011). The first term on the RHS of Equation (2.3.9) denotes projections from D2R expressing MSNs to GPe, whereas the second term represents projections from D1R-D2R co-expressing MSNs to the same target. The second term is analogous to the risk term in the utility function of Equation (2.1.7) (Balasubramani et al., 2014). This term contributes to the non-linear risk sensitivity, i.e., being risk-aversive in the case of gains as outcomes, and being risk-seeking during losses (Kahneman and Tversky, 1979).

The different forms of DA signals used in this study along with references to their biological plausibility are summarized as follows (Figure 2, Table 2):

(1) Representing the TD error used in updating the cortico-striatal weights of the MSNs Equation (2.3.2), as reported by many experimental studies (Schultz et al., 1997; Reynolds and Wickens, 2002; Houk et al., 2007).

(2) Representing the temporal gradient of the utility function [δ_U, Equation (2.3.6)], used for switching between DP and IP (Chakravarthy and Balasubramani, 2014). To produce such a DA signal (δ_U), SNc neurons might use the value component received through the D1R MSN projections from the striatum to SNc (Schultz et al., 1997; Doya, 2002; Houk et al., 2007), and the risk component received through the projections of D1R-D2R MSNs to SNc (Surmeier et al., 1996; Perreault et al., 2010, 2011). Further, there is evidence that D1R MSNs and the co-expressing D1R-D2R MSNs form the striosomal component that could assist in computing the utility prediction error in SNc (Jakab et al., 1996; Surmeier et al., 1996; Nadjar et al., 2006; Amemori et al., 2011; Calabresi et al., 2014). This form of DA signal is reported by a recent study on utility based decision making in monkeys by Schultz and colleagues (Stauffer et al., 2014).

(3) The neurobiological interpretation of the sign(Q) used in the second term of Equation (2.3.9) could also be linked to SNc functioning. The “value function” coding DA neurons (represented by the projections marked “Q” in Figure 2), as reported in studies by Schultz and colleagues (Schultz, 2010b), might preferentially target the D1R-D2R co-expressing neurons in the striatum. This modulation is roughly captured in our model through the sign(Q) term in Equations (2.3.5, 2.3.9).

Combining DP and IP in GPi

Each action neuron in GPi is modeled to combine the contributions of DP and IP (Kliem et al., 2007) as given in Equation (2.3.10),

\begin{matrix} x_{i}^{GPi} = - x_{i}^{DP} + w_{i}^{STN-GPi} y_{i}^{STN} & (2.3.10) \end{matrix}

where x^DP is from Equation (2.3.8), and y^STN that denotes output of STN, is from Equation (2.3.7). The relative weightage of STN projections to GPi, compared to that of the DP projections, is represented by w^STN−GPi. For the simulations in this study, w^STN−GPi is set to 1 for all the GPi neurons.

Action Selection at Thalamus

The direct and indirect pathways are combined downstream either in GPi or further along in the thalamic nuclei, which receive afferents from GPi (Humphries and Gurney, 2002; Chakravarthy et al., 2010). GPi neurons project to the thalamus over inhibitory connections. Hence the thalamic afferents for a neuron i may be expressed simply as,

\begin{matrix} x_{i}^{Thalamus} = x_{i}^{DP} - w_{i}^{STN-GPi} y_{i}^{STN} & (2.3.11) \end{matrix}

These afferents activate thalamic neurons as follows,

\begin{matrix} \frac{d y_{i}^{T h a l a m u s}}{d t} = - y_{i}^{T h a l a m u s} + x_{i}^{T h a l a m u s} & (2.3.12) \end{matrix}

where y^Thalamus_i is the state of the ith thalamic neuron. The action selected is simply the “i” (i = 1, 2, ..., n) whose y^Thalamus_i is maximum on integration. In our simulations, the integration is carried out for 25 time steps.
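The selection step of Equations (2.3.11, 2.3.12) can be sketched as below. This is an illustrative simplification under stated assumptions: the inputs x^DP and y^STN are held constant during the integration, the thalamic states start at zero, and a forward-Euler scheme with a hypothetical step size `dt` is used. The function name `select_action` and the numerical inputs are introduced here for illustration only.

```python
def select_action(x_dp, y_stn, w_stn_gpi=1.0, n_steps=25, dt=0.1):
    """Integrate the thalamic dynamics and return (winning index, final states).

    x_dp      : direct-pathway input per action, x_i^DP
    y_stn     : STN output per action, y_i^STN (held constant here: an assumption)
    w_stn_gpi : relative weight of STN projections (set to 1 in the text)
    n_steps   : number of integration steps (25 in the text)
    dt        : Euler step size (hypothetical, not given in the text)
    """
    # Eq. (2.3.11): thalamic afferent is the sign-inverted GPi input.
    x_thal = [xd - w_stn_gpi * ys for xd, ys in zip(x_dp, y_stn)]

    # Eq. (2.3.12): dy/dt = -y + x, integrated with forward Euler from y = 0.
    y = [0.0] * len(x_thal)
    for _ in range(n_steps):
        y = [yi + dt * (-yi + xi) for yi, xi in zip(y, x_thal)]

    # The selected action is the index with the largest thalamic state.
    winner = max(range(len(y)), key=lambda i: y[i])
    return winner, y

# Hypothetical inputs for three candidate actions.
action, states = select_action([0.2, 0.9, 0.5], [0.1, 0.3, 0.1])
print(action, [round(s, 3) for s in states])
```

Under these simplifying assumptions each state relaxes toward its own afferent, so the winner coincides with the largest x^Thalamus_i; in the full model, where the STN output evolves over time, the outcome of the integration need not reduce to this.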

Simulating Parkinson's Disease (PD)

A model of PD may incorporate the following features in terms of DA and 5HT levels:

(1) DA levels are lower in PD than in controls: This feature is simulated by clamping δ, i.e., upper-bounding δ at δ_Lim. Since the number of DA cells is reduced, the Substantia Nigra pars compacta (SNc) is thought to remain capable of producing a weak signal reliably, but the highest firing levels in PD are lower than those in controls (Kish et al., 1988).

(2) PD medication (L-dopa, DA agonists) facilitates DA activity. This is simulated by simply adding a fixed constant to the preexisting clamped δ (Dauer and Przedborski, 2003; Foley et al., 2004).

Hence, to represent the PD condition, the Equation (2.3.2) describing DA activity is first clamped to δ_Lim, as in Equation (2.4.1):

\begin{matrix} \text{if } δ > δ_{Lim}: \quad δ = δ_{Lim} & (2.4.1) \end{matrix}

Equation (2.4.1) represents the never-medicated case (PD-OFF). In the recently-medicated case (PD-ON), in addition to the clamping step (to δ_Lim) just described, a transient increase in DA, modeled by the medication factor δ_Med, is added to the clamped δ:

\begin{matrix} δ := δ + δ_{Med} & (2.4.2) \end{matrix}

This altered δ, which represents the given medication condition, is then used for the corresponding simulations in the Section Modeling the BG Network in Healthy Control Subjects. The ON and OFF medication conditions are summarized by Equation (2.4.3).

\begin{matrix} δ(t) \in \begin{cases} [a, b] & \text{for controls} \\ [a, δ_{Lim}] & \text{for PD-OFF} \\ [a, δ_{Lim} + δ_{Med}] & \text{for PD-ON} \end{cases} & (2.4.3) \end{matrix}

where δ_Lim and δ_Lim + δ_Med are both less than b.
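The clamping and medication steps of Equations (2.4.1, 2.4.2) can be written as a small function. This is a minimal sketch: the function name `dopamine_signal`, the condition labels, and the numerical values in the usage lines are assumptions made for illustration, since δ_Lim and δ_Med are optimized per experiment in the study.

```python
def dopamine_signal(delta, condition, delta_lim, delta_med=0.0):
    """Return the DA signal delta for a given subject condition.

    condition : "control", "PD-OFF" or "PD-ON" (labels chosen here)
    delta_lim : upper bound on delta in PD, Eq. (2.4.1)
    delta_med : medication increment added in PD-ON, Eq. (2.4.2)
    """
    if condition == "control":
        return delta                    # controls: delta is left unchanged
    clamped = min(delta, delta_lim)     # Eq. (2.4.1): upper-bound delta
    if condition == "PD-OFF":
        return clamped
    if condition == "PD-ON":
        return clamped + delta_med      # Eq. (2.4.2): add medication term
    raise ValueError(f"unknown condition: {condition!r}")

# Hypothetical values: a large positive prediction error of 0.8.
for cond in ("control", "PD-OFF", "PD-ON"):
    print(cond, dopamine_signal(0.8, cond, delta_lim=0.3, delta_med=0.2))
```

Note that values of δ below δ_Lim, including negative prediction errors, pass through the clamp unchanged, which matches the description that a weak signal can still be produced reliably in PD.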

Serotonin levels are also found to be lower in the PD patients (Fahn et al., 1971; Halliday et al., 1990; Bedard et al., 2011). The same is verified by the model parameters α_D1, α_D2, and α_D1D2 in various medication cases of PD (Section Modeling the Reward-punishment Sensitivity in PD).

Experiments and Results

In this section, we apply the model of 5HT and DA in the BG (Section Modeling the BG Network in Healthy Control Subjects) to explain several reward/punishment/risk-based decision making phenomena pertaining to the BG function.

(1) Simulating risk sensitivity (Long et al., 2009).

(2) Simulating reward-punishment sensitivity (Cools et al., 2008).

(3) Simulating reward-punishment sensitivity in Parkinson's Disease (Bodi et al., 2009).

In the simulation studies described in Sections Modeling the Risk Sensitivity to Modeling the Reward-punishment Sensitivity in PD, the BG model parameters [λ^GPi: Equations (2.3.8, 2.3.9)] are set as shown in Table 3. The other parameters are optimized for each experiment: the gain functions (λ^Str) of the D1R, D2R and D1R-D2R MSNs in the striatum [Equations (2.3.1, 2.2.3, 2.2.6)]; the model neuromodulator correlates for 5HT, viz., α_D1, α_D2 and α_D1D2, which affect the D1R, D2R and D1R-D2R MSNs, respectively; and the DA parameters that condition PD (δ_Lim, δ_Med). The parameter values are initially selected using grid search and are eventually optimized using a genetic algorithm (GA) (Goldberg, 1989) (details of the GA option set are given in Supplementary Material A).

TABLE 3

Table 3. Parameters used in simulation studies of Sections Modeling the Risk Sensitivity to Modeling the Reward-punishment Sensitivity in PD Equations [2.3.8, 2.3.9].

On studying the significance of 5HT modulation on different pools of MSNs, 5HT is found to significantly affect the D2R and the D1R-D2R co-expressing MSNs for explaining the experiments that deal with risk and punishment-based decision making (Cools et al., 2008; Bodi et al., 2009; Long et al., 2009) (Supplementary Material B). α_D1 did not show much sensitivity to these experimental results. The results presented in the next section therefore equate α_D1 = 1, and optimize α_D1D2 and α_D2 for every experimental condition (Refer to discussion section also).

Modeling the Risk Sensitivity

Overview

In the study of Long et al. (2009), monkeys were presented with two choices of juice rewards, differing in the variances associated with the availability of the rewards. One choice was associated with a risky reward and the other with a deterministic (safe) one; these choices were of equal expected value (EEV) or unequal expected value (UEV) types. In the EEV case both the safe and the risky choices possess the same mean reward, while in the UEV case the mean rewards are unequal (Table 4). The monkeys' risk sensitivity under the variable tryptophan conditions, viz., baseline (balanced) and rapid tryptophan depletion (RTD), was recorded by analyzing their safe vs. risky reward selection ratio in the EEV and UEV cases.

TABLE 4

Table 4. The sample reward schedule adapted from Long et al. (2009).

A non-linear risk sensitivity toward juice rewards was displayed by the monkeys: they exhibited risk-seeking behavior for small juice rewards and risk-aversive behavior for larger ones (Long et al., 2009). Furthermore, the experiment showed that when 5HT levels were reduced, the monkeys made more risky choices over the safer alternatives (Long et al., 2009), linking 5HT functioning to risk-based decision making. Therefore, this section analyses the property of risk sensitivity of the network model.

Simulation

The D1R, D2R and D1R-D2R neuron weights are computed using Equation (2.3.1) and are updated using δ [Equation (2.3.2)]. Learning rates are chosen as: η_D1 = 0.3; η_D2 = 0.1; η_D1D2 = 0.1. The corticostriatal weights of the D1R (w_D1), D2R (w_D2) and D1R-D2R (w_D1D2) MSNs are initialized randomly between 0 and 1; the value, risk and utility functions are calculated using Equations (2.3.3–2.3.5). The parameters for the λ^Str in Equation (2.3.1) are provided in Table 5.

TABLE 5

Table 5. Section Modeling the Risk Sensitivity: the parameters for Equations (2.3.1, 2.2.3, 2.2.6).

This is done for all states “s” (tabulated in Table 4), and for action sets consisting of “a” reaching the safe target and the risky target. The non-linearity in risk attitudes exhibited by the agent is accounted for by considering a reward base (r^b) that is subtracted from the juice reward (r^j) obtained. The resultant subjective reward (r) is treated as the actual immediate reward received by the agent [Equation (3.1.1)]. Subtracting r^b from r^j associates any r^j < r^b with an effect similar to losses, and any r^j > r^b with gains.

\begin{matrix} r = r^{j} - r^{b} & (3.1.1) \end{matrix}

The reward base (r^b) optimized for the experiment is 159.83.
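Equation (3.1.1) can be illustrated with a short sketch. The juice amounts in the usage lines are hypothetical values chosen only to fall on either side of the reward base; they are not taken from Table 4, and the helper names `subjective_reward` and `outcome_type` are introduced here.

```python
R_BASE = 159.83  # reward base r^b reported in the text

def subjective_reward(r_juice, r_base=R_BASE):
    """Eq. (3.1.1): subjective reward r = r^j - r^b."""
    return r_juice - r_base

def outcome_type(r_juice, r_base=R_BASE):
    """Classify a juice reward as gain-like or loss-like relative to r^b."""
    r = subjective_reward(r_juice, r_base)
    if r > 0:
        return "gain"
    if r < 0:
        return "loss"
    return "neutral"

# Hypothetical juice rewards (same units as r^b).
for r_j in (100.0, 200.0):
    print(r_j, round(subjective_reward(r_j), 2), outcome_type(r_j))
```

Rewards below the base are thus treated like losses and rewards above it like gains, which is how the model obtains different risk attitudes for small and large juice rewards.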

Results

When the RTD condition is simulated by setting [α_D1, α_D2, α_D1D2] = [1, 1, 0.0012], and the baseline by [α_D1, α_D2, α_D1D2] = [1, 1, 1.32], a decrease in the selection of the safe choices is observed in the simulation, as demonstrated in the experiment. The model shows increased risk-seeking behavior in the low α condition, particularly when α is lowered in the D1R-D2R co-expressing MSNs. Hence, modulating α_D1D2 best captures the baseline (high α_D1D2) and RTD (low α_D1D2) conditions for explaining risk sensitivity. The performance of the network model shown in this section is consistent with that of the lumped model described earlier (Balasubramani et al., 2014) in depicting the role of 5HT in risk-based action selection (Figure 3). Further analysis of the effect of α_D1, α_D2 and α_D1D2 on risk sensitivity is provided in Supplementary Material B.

FIGURE 3

Figure 3. Comparison between the experimental and simulated results for the (A) overall choice (B) Unequal EV (C) Equal EV, under Rapid Tryptophan Depletion (RTD) and Baseline (balanced) condition. Error bars represent the Standard Error (SE) with size “N” = 100 (N = number of simulation instances). The experiment (Expt) and the simulation (Sims) results of any condition are not found to be significantly different. Here the experimental results are adapted from Long et al. (2009).