Edited by: Per E. Roland, University of Copenhagen, Denmark
Reviewed by: Cyril Monier, Centre National de la Recherche Scientifique, France; Wioletta Joanna Waleszczyk, Nencki Institute of Experimental Biology, Poland
*Correspondence: Jeroen Joukes, Center for Molecular and Behavioral Neuroscience, Rutgers University, 197 University Avenue, Newark, NJ 07102, USA e-mail:
This article was submitted to the journal Frontiers in Systems Neuroscience.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution and reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
The detection of visual motion requires temporal delays to compare current with earlier visual input. Models of motion detection assume that these delays reside in separate classes of slow and fast thalamic cells, or slow and fast synaptic transmission. We used a data-driven modeling approach to generate a model that instead uses recurrent network dynamics with a single, fixed temporal integration window to implement the velocity computation. This model successfully reproduced the temporal response dynamics of a population of motion sensitive neurons in macaque middle temporal area (MT) and its constituent parts matched many of the properties found in the motion processing pathway (e.g., Gabor-like receptive fields (RFs), simple and complex cells, spatially asymmetric excitation and inhibition). Reverse correlation analysis revealed that a simplified network based on first and second order space-time correlations of the recurrent model behaved much like a feedforward motion energy (ME) model. The feedforward model, however, failed to capture the full speed tuning and direction selectivity properties based on higher than second order space-time correlations typically found in MT. These findings support the idea that recurrent network connectivity can create temporal delays to compute velocity. Moreover, the model explains why the motion detection system often behaves like a feedforward ME network, even though the anatomical evidence strongly suggests that this network should be dominated by recurrent feedback.
Successful interaction with a dynamic environment requires a neural mechanism for the detection of motion. In the dominant model of motion perception in the primate—the motion energy (ME) model (Adelson and Bergen,
The primate visual system does contain a class of slower neurons (the parvocellular stream), but the evidence that they are a critical component in motion detection (Malpeli et al.,
A biophysically realistic model of motion detection (Maex and Orban,
Even though cortical networks are anatomically dominated by recurrent connections, this connectivity plays at best a subordinate role in most models of motion detection. For instance, the ME model was originally envisaged as entirely feedforward, although it has been extended with recurrent connectivity to amplify direction selectivity (Douglas et al.,
Our work starts from the data—a set of recordings from MT neurons—and shows that an artificial recurrent neural network can faithfully reproduce the speed and direction tuned responses to visual motion. New insights into motion mechanisms resulted from a detailed, quantitative investigation of this model network. First, no separate classes of fast and slow neurons, or carefully tuned delay lines, were needed to generate a wide range of speed preferences; instead, a range of temporal delays and concomitant speed preferences emerged from the weight patterns of the network. Second, while the recurrent network could be approximated by a ME model, such a feedforward approximation failed to capture the sequential recruitment typically found in MT neurons (Mikami,
We measured the speed tuning properties in area MT of two adult male rhesus monkeys (
The visual stimuli were generated with in-house OpenGL software (Quadro Pro Graphics card, 1024 × 768 pixels, 8 bits/pixel) and displayed on a 21 inch monitor (75 Hz, non-interlaced, 1024 × 768 pixels; model GDM-2000TC; Sony). Monkeys viewed the stimuli from a distance of 57 cm in a dark room (<0.5 cd/m²) while seated in a standard primate chair (Crist Instruments, Germantown, MD, USA) with the head post supported by the chair frame. We sampled eye position at 60 Hz using an infrared system (IScan, Burlington, MA, USA), and monitored and recorded the eye position data with the NIMH Cortex program, which also controlled stimulus presentation.
We mapped velocity tuning with a random dot pattern that consisted of 100 dots within a 10° diameter circular aperture. The dots had infinite lifetime and were randomly repositioned after leaving the aperture. The dots were 0.15° in diameter and had a luminance of 30 cd/m². Compared with the 5 cd/m² background, this resulted in a Michelson point contrast of 70%.
The activity of single units in area MT was recorded with tungsten microelectrodes (3–5 MOhm; Frederick Haer Company, Bowdoinham, ME, USA), which we inserted using a hydraulic micropositioner (model 650; David Kopf Instruments, Tujunga, CA, USA). We filtered, sorted, and stored the signals using the Plexon (Dallas, TX, USA) system. Area MT was identified by its high proportion of cells with directional selective responses, small RFs relative to those of neighboring medial superior temporal area, and its location on the posterior bank of the superior temporal sulcus. The typical recording depth was in agreement with the expected anatomical location of MT determined by structural magnetic resonance scans.
We determined the directional selectivity and RFs of the cells using automated methods (for details, see Krekelberg and Albright,
The MT response to the moving stimuli was binned in 13 ms time windows, matching the frame period of the 75 Hz monitor used during the experiments. This allowed us to investigate the emergence of the speed tuning and direction selectivity properties at a temporal resolution that matched the (apparent) motion on the monitor.
We modeled the neuronal data with an Elman recurrent neural network (Elman,
The network had an input, hidden, and output layer. The input layer consisted of 750 units that simulated an RF of 10° (0.013° per unit), the diameter of the stimulus used during the experiments. The input layer was fully connected to the hidden layer in a feedforward manner. The hidden layer had 300 units that were fully connected to the output layer in a feedforward manner. In addition, all hidden units were laterally/recurrently connected to all hidden units. The output layer consisted of 26 units, each simulating one MT cell. The output for each unit of all layers (indexed by
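The architecture described above corresponds to a standard Elman update: at each 13 ms step, the hidden state depends on the current feedforward input and the previous hidden state, and the output layer reads out the hidden state. The sketch below is an illustrative reconstruction, not the authors' code; the random weight values and the tanh nonlinearity are placeholder assumptions (the article specifies its own activation function and, of course, trained weights).

```python
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HID, N_OUT = 750, 300, 26  # layer sizes from the text

# Hypothetical (untrained) weights; the study estimates these by training.
W_ih = rng.normal(scale=0.01, size=(N_HID, N_IN))   # input -> hidden (feedforward)
W_hh = rng.normal(scale=0.01, size=(N_HID, N_HID))  # hidden -> hidden (recurrent)
W_ho = rng.normal(scale=0.01, size=(N_OUT, N_HID))  # hidden -> output (feedforward)
b_h = np.zeros(N_HID)
b_o = np.zeros(N_OUT)

def step(x, h_prev, f=np.tanh):
    """One 13 ms simulation step of the Elman network: the hidden state
    combines the current input with recurrent feedback from the last step."""
    h = f(W_ih @ x + W_hh @ h_prev + b_h)
    y = f(W_ho @ h + b_o)
    return h, y

# Run a short input pattern sequence through the network.
h = np.zeros(N_HID)
for x in rng.normal(size=(5, N_IN)):
    h, y = step(x, h)
```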
We used the model to capture the responses of a representative subset of MT neurons from our sample of 129. To reduce computational complexity, we focused our analyses and modeling on MT cells with robust, direction selective (DS) responses and band-pass speed tuning. The specific criteria for inclusion were the robustness of the response (firing rate > 7 spikes/s averaged over all speeds in the preferred direction), modest to strong direction selectivity (DSI > 0.1, for definition see below), and a preferred speed in the range of 8–32°/s. This selection resulted in a population of 26 MT cells.
The population response revealed an initial response latency of approximately 30 ms followed by the rapid onset of speed and direction tuning that lasted around 100 ms, and finally a sustained phase with relatively constant responses and tuning (Figure
We recorded responses to preferred and anti-preferred directions of motion only and, therefore, did not attempt to model the entire two-dimensional random dot patterns. Instead, we represented the input as white noise patterns and trained the network to respond in a tuned manner to each of these patterns (Figure
A moving input pattern sequence was modeled by shifting the input pattern in the preferred or anti-preferred direction with one of seven speeds. In the physiological experiments, the visual pattern moved between 0.013° and 0.85° per monitor frame (1°/s–64°/s, respectively). In the model this was implemented by shifting the input pattern by 1–64 input units per 13 ms, respectively.
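The stimulus construction above can be sketched as a frame-by-frame shift of a 1-D pattern. In this sketch, np.roll (which wraps at the edges) stands in for the actual repositioning of dots that leave the aperture, so the wrap-around is a simplifying assumption; a shift of 1 input unit per 13 ms frame corresponds to 1°/s, and 64 units per frame to 64°/s.

```python
import numpy as np

def moving_sequence(pattern, shift_per_frame, n_frames, direction=+1):
    """Build an input sequence by shifting a 1-D pattern by
    `shift_per_frame` input units on every 13 ms frame.
    np.roll wraps at the edges -- a simplification of the experiment,
    where dots were randomly repositioned after leaving the aperture."""
    return np.stack([np.roll(pattern, direction * shift_per_frame * t)
                     for t in range(n_frames)])

rng = np.random.default_rng(1)
pattern = rng.standard_normal(750)  # white-noise pattern on the 750 input units
seq = moving_sequence(pattern, shift_per_frame=4, n_frames=10)  # 4°/s analog
```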
Before training the network, we initialized the weights and bias values of all layers with the Nguyen-Widrow algorithm. We trained the recurrent neural network on the input and output pattern sequences we described above in the following way. First, we randomly chose one of seven speeds and a direction of motion. Second, frame-by-frame, a new input pattern sequence for that speed and direction was presented on the input units. Third, for each frame, we calculated the response of the hidden units based on the current feedforward input and the recurrent feedback, and then calculated the response of the output units. Fourth, the error of the network was defined as the difference between the response of the output units and the response of all 26 MT cells (for that speed and direction, and in the corresponding time bin after stimulus onset). This error was used to modify all connection weights in the network using error back-propagation-through-time. We repeated these steps (epochs) five million times until the network converged to reproduce the response of all 26 MT cells. Network parameters were then frozen and we investigated the trained network.
We probed the neurons of the recurrent motion model (RMM) using reverse correlation analysis. The reverse correlation analysis assumes that the system under study can be described by a set of linear space-time filters followed by a static nonlinearity (linear-nonlinear, or LN model). Even though the LN model is a considerable oversimplification of area MT (and the RMM), we have previously shown that this method can successfully generate quantitative descriptions of receptive field properties in area MT (Hartmann et al.,
The spike counts needed in this reverse correlation analysis were derived from the RMM activity by scaling the peak response of each unit to 30 spikes per time bin, and then rounding the activity in each bin to the nearest integer. The noise inputs for the reverse correlation analysis were identical to the individual frames of the moving spatial patterns described previously. To reduce computational complexity we used stimuli consisting of 0.027° wide bars for the output units and 0.04° wide bars for the hidden units. This reduced the spatial dimension by a factor of two and three, respectively. The reverse correlation history—the number of time bins leading up to the output activity—was set to be 67 ms. This corresponds to the time needed for the MT population to create a stable speed tuned and DS output. Two million noise stimuli were presented to the model network for reverse correlation analysis of the output units and one million for the hidden units.
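The conversion from unit activity to spike counts described above amounts to a peak normalization followed by rounding; a minimal sketch (the 30 spikes-per-bin peak is the value stated in the text):

```python
import numpy as np

def to_spike_counts(activity, peak_rate=30):
    """Scale a unit's peak response to `peak_rate` spikes per 13 ms bin,
    then round the activity in each bin to the nearest integer count."""
    scaled = activity * (peak_rate / np.max(activity))
    return np.rint(scaled).astype(int)

counts = to_spike_counts(np.array([0.1, 0.5, 1.0, 0.25]))
```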
We followed standard procedures to estimate the parameters of the LN model. First, we estimated the spike-triggered average (STA) and spike-triggered covariance (STC) as described in detail in Chichilnisky (
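In outline, the STA is the spike-weighted mean of the noise frames and the STC is the spike-weighted covariance of the frames around that mean. The following is a schematic numpy version, not the authors' implementation, and it omits the bias corrections discussed in the cited work:

```python
import numpy as np

def sta_stc(stimuli, counts):
    """Spike-triggered average and covariance.
    stimuli: (n_frames, n_dims) noise frames; counts: (n_frames,) spike counts."""
    n_spikes = counts.sum()
    sta = (counts @ stimuli) / n_spikes           # spike-weighted mean stimulus
    centered = stimuli - sta                      # deviations from the STA
    stc = (centered.T * counts) @ centered / (n_spikes - 1)
    return sta, stc

# Toy example: two spike-triggered frames, one frame with no spikes.
stimuli = np.array([[1., 0.], [0., 1.], [1., 1.]])
counts = np.array([1, 1, 0])
sta, stc = sta_stc(stimuli, counts)
```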
We estimated the speed tuning and direction selectivity properties of the LN-model with 1000 new moving input pattern sequences. For each pattern and for each time bin, the inner product between the motion inputs and the filters gave us the projection value. Ideally, one would calculate the firing rate of the full filter model by passing the projections through a high-dimensional nonlinearity. Due to computer memory constraints, however, we had to restrict this to 1-dimensional nonlinearities; i.e., we assumed that the filter dimensions were separable. The firing rate for the combined filter output was determined as the sum of the individual filter outputs minus the mean for
For the RMM output and hidden units, the direction selectivity and speed tuning were based on the mean response over time to 1000 new motion inputs for all speeds and directions. For the MT cells we used the mean response over time in all experimental trials. We calculated the direction selectivity index (DSI) from the maximum (average) response over the seven speeds in the preferred direction and the (average) response at that speed in the anti-preferred direction: (preferred − anti-preferred)/(preferred + anti-preferred). For the speed tuning index (SI) we used the maximum and the minimum response across the seven speeds in the preferred direction: (maximum − minimum)/(maximum + minimum).
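The two indices follow directly from a pair of tuning curves; a sketch with hypothetical responses (seven speeds per direction, as in the experiments):

```python
import numpy as np

def dsi(pref_resp, anti_resp):
    """Direction selectivity index. Inputs are mean responses at the seven
    speeds in the preferred and anti-preferred direction."""
    i = int(np.argmax(pref_resp))       # speed with the maximum preferred response
    p, a = pref_resp[i], anti_resp[i]
    return (p - a) / (p + a)

def si(pref_resp):
    """Speed tuning index from the preferred-direction tuning curve."""
    mx, mn = np.max(pref_resp), np.min(pref_resp)
    return (mx - mn) / (mx + mn)

# Hypothetical tuning curves (spikes/s at seven speeds).
pref = np.array([5., 10., 30., 20., 10., 8., 6.])
anti = np.array([4., 6., 10., 8., 6., 5., 4.])
```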
We classified the hidden units of the RMM as simple or complex based on their response to sinusoidal gratings with the preferred spatial frequency (0.5 cycles/°) and speed (16°/s) moving in the preferred direction of the unit. After presenting the gratings for 10 s, we removed the response to the first 67 ms (initial transient), and then determined the ratio of the response at the grating temporal frequency (
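The F1/F0 ratio can be read off the Fourier transform of the response: F0 is the mean rate and F1 is the amplitude of the modulation at the stimulus temporal frequency. A sketch, assuming the 13 ms bin width of the simulation; the example response is synthetic:

```python
import numpy as np

def f1_f0(resp, stim_tf, dt=0.013):
    """F1/F0 ratio: modulation amplitude at the grating's temporal
    frequency (F1) divided by the mean response (F0)."""
    n = len(resp)
    freqs = np.fft.rfftfreq(n, d=dt)
    spectrum = np.fft.rfft(resp)
    f0 = np.abs(spectrum[0]) / n             # mean response
    k = np.argmin(np.abs(freqs - stim_tf))   # bin nearest the stimulus frequency
    f1 = 2 * np.abs(spectrum[k]) / n         # modulation amplitude at that frequency
    return f1 / f0

# Synthetic response modulated at 8 Hz around a mean of 2: F1/F0 = 0.5.
t = np.arange(250) * 0.013
resp = 2 + np.cos(2 * np.pi * 8 * t)
ratio = f1_f0(resp, stim_tf=8)
```

By the conventional criterion, units with a ratio above 1 are simple-like and units below 1 (like this synthetic example) are complex-like.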
We defined a hidden unit’s direct input as weights between the unit and the input units:
For each hidden unit, we determined the spatial shift between direct and indirect input as follows. First, we low-pass filtered both the direct and indirect inputs using the same filter used for the motion inputs (see above). Then we calculated the cross-correlation between these two signals and defined the spatial shift (d
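The shift estimate is the lag at which the cross-correlation of the two weight profiles peaks. A minimal sketch on synthetic Gaussian profiles (real weight profiles would first be low-pass filtered and may need their baseline removed):

```python
import numpy as np

def spatial_shift(direct, indirect):
    """Lag (in input units) at which the cross-correlation of the direct
    and indirect input profiles peaks; positive means `direct` sits to
    the right of `indirect`."""
    xc = np.correlate(direct, indirect, mode="full")
    lags = np.arange(-len(indirect) + 1, len(direct))
    return lags[np.argmax(xc)]

# Synthetic example: the same Gaussian profile, displaced by 5 input units.
x = np.arange(100.0)
indirect = np.exp(-(x - 40) ** 2 / 20)
direct = np.exp(-(x - 45) ** 2 / 20)
```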
We used the velocity tuning curves of 26 MT neurons recorded in two awake macaque monkeys. These recordings were subsets of the data in previously published studies focusing on the influence of contrast (Krekelberg et al., 2006), adaptation (Krekelberg et al.,
The population average response to seven speeds in both the preferred and anti-preferred motion direction had an onset delay of 30 ms and an initial transient lasting 70 ms during which speed tuning and direction selectivity started to emerge (Figure
We first investigated whether a network consisting of (artificial) neurons, all with identical intrinsic properties but modifiable synaptic strengths, could reproduce the temporal dynamics and polarity insensitive velocity tuning of the 26 MT cells. Second, we probed this network to determine how its constituent units and connections solved the complex task of motion processing and how its properties related to neurons in the motion processing pathway.
We created a recurrent neural network with 750 input units, 300 recurrently connected hidden units, and 26 output units (Figure
We tested the performance of the RMM with a simulation of 1000 new input patterns moving at seven speeds in the preferred and anti-preferred motion direction. Based on the average response over time, we compared the speed tuning properties of the MT and RMM population responses (Figure
While the generalization to a new set of random dot patterns demonstrates a degree of robustness and pattern invariance of motion detection in the RMM, a more stringent test is to consider directional selectivity for patterns that were qualitatively different from the (random dot) patterns used in the training procedure. We therefore determined the velocity curves in response to drifting sine wave gratings (SF = 0.5 cycles/°), and found that these were highly correlated with the velocity curves measured with random dot patterns (
A common problem in neural network modeling is that one can rarely point at individual elements of the network model as being responsible for a specific component of the input-output transformation. The reason for this is that information and computation are inherently distributed across many elements. This is the same problem experimentalists face when they investigate the motion processing pathway in the real brain. One approach that provides a lower-dimensional description of a complex system uses noise stimuli together with reverse correlation analysis (Chichilnisky,
We presented visual noise to the RMM and estimated the STA as well as the most informative iSTAC filters and their nonlinearities (See Section Materials and Methods; Pillow and Simoncelli,
If MT neurons were perfect ME detectors, one would predict two pairs of oriented, phase shifted space-time filters, followed by quadratic nonlinearities that evoke excitation from the preferred direction of motion and inhibition from the anti-preferred direction of motion (i.e., opponency). The properties of the first pair of excitatory and the first pair of suppressive filters (first quadruple) qualitatively matched this prediction. However, reverse correlation of this output unit revealed two additional quadruples of filters sensitive to higher spatial frequencies, but overlapping in space-time (Figure
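For reference, the ideal opponent ME computation alluded to here squares and sums the outputs of a quadrature pair of space-time slanted filters and subtracts the energy of the mirror-image pair. The toy 1-D version below illustrates the scheme; the filter sizes, frequencies, and Gaussian envelopes are arbitrary choices for illustration, not the parameters of the study.

```python
import numpy as np

def st_gabor(nx, nt, sf, tf, phase):
    """Space-time Gabor slanted for rightward motion (preferred speed
    tf/sf units per frame); sf in cycles/unit, tf in cycles/frame."""
    x = np.arange(nx) - nx // 2
    X, T = np.meshgrid(x, np.arange(nt), indexing="ij")
    env = np.exp(-X**2 / (2 * (nx / 6) ** 2) - T**2 / (2 * (nt / 3) ** 2))
    return env * np.cos(2 * np.pi * (sf * X - tf * T) + phase)

def motion_energy(stim, sf=0.1, tf=0.2):
    """Opponent motion energy: quadrature-pair energy for rightward motion
    minus the energy of the spatially mirrored (leftward) pair."""
    nx, nt = stim.shape
    even = st_gabor(nx, nt, sf, tf, 0.0)
    odd = st_gabor(nx, nt, sf, tf, np.pi / 2)
    e_pref = (even * stim).sum() ** 2 + (odd * stim).sum() ** 2
    e_anti = (even[::-1] * stim).sum() ** 2 + (odd[::-1] * stim).sum() ** 2
    return e_pref - e_anti

# Drifting gratings matched to the filters: opponent energy changes sign
# with the direction of motion.
nx, nt = 31, 21
xg = np.arange(nx) - nx // 2
X, T = np.meshgrid(xg, np.arange(nt), indexing="ij")
rightward = np.cos(2 * np.pi * (0.1 * X - 0.2 * T))
leftward = np.cos(2 * np.pi * (0.1 * X + 0.2 * T))
```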
The asymmetric spectra from the 26 output units grouped by speed preferences of 8, 16, and 32°/s are shown in Figure
Reverse correlation analysis allowed us to reduce the high-dimensional description of the RMM to 13 linear filters and static nonlinearities per output unit. In other words, we determined LN-models that closely matched the input-output relationship of each of the RMM output units. We can now investigate the extent to which these LN-models capture the response to motion. We simulated 1000 motion inputs with seven speeds in both directions and presented them to both the LN-models and the RMM.
Figure
This finding strongly suggests that an LN-model based only on first and second order space-time correlations was not sufficient to explain the response of single MT cells. Before making this claim, however, we need to address two issues. First, the LN-models were based only on the 13 most-informative dimensions; the full LN model contains thousands more dimensions that could collectively describe a considerable amount of information. To address this issue, we note that an LN-model based on filters beyond the 13th filter (up to 100 tested) did not improve speed tuning and direction selectivity compared to the 13-filter LN-model (data not shown). This strongly suggests that the mismatch between the LN model and the data is not due to second-order filters that we excluded from the LN model.
Second, for all LN-models, we summed the output of the individual filters to determine the combined filter output (see Section Materials and Methods) and it is possible that the individual filters should instead be combined nonlinearly to accurately describe the velocity response for the LN-models (Chichilnisky,
Taken together, these findings strongly suggest that the MT neurons' velocity tuning is sensitive to third and higher order space-time interactions. The RMM captures this sensitivity, but the LN and ME models do not.
The RMM generated output units with a range of preferred speeds matching our sample of MT neurons. The model allowed us to investigate how this range was constructed from a single population of hidden units. Figure
This connectivity analysis shows that the motion tuning of the output units arose from the weighted combination of the motion tuning of the hidden layer. They received excitation from hidden units with matching preferred speeds in the preferred direction as well as from hidden units with non-matching preferred speeds in the anti-preferred direction. And, they were inhibited by hidden units with non-matching preferred speeds in the preferred direction as well as by hidden units with matching preferred speeds in the anti-preferred direction. While this provides an intuitive explanation of the motion tuning of the output units, it obviously raises the question how the hidden units generated their motion tuning. We turn to this question next.
We determined direction and speed preference for all hidden units by presenting moving patterns (see Section Materials and Methods) and measuring the hidden units’ tuning curves. The preferred speeds of the hidden units ranged from 1°/s to 64°/s in both the preferred (143 units, mean 30°/s) and anti-preferred (157 units, mean 30°/s) direction of the RMM population response (Figure
Second, to classify hidden units as simple or complex, we presented sinusoidal gratings, recorded the responses, and determined the ratio of the response modulation at the temporal frequency of the stimulus and the mean response. This is the F1/F0 ratio—a measure often used to categorize simple and complex cells (Movshon et al.,
Third, we investigated how the hidden units were connected to the input and to other hidden units. Many hidden units had Gabor-like input weights, but others had no clear spatial structure. Principal component analysis on the input weights of all the hidden units revealed six components that explained 98% of the variance (data not shown). These six components could be grouped into three pairs whose input weights were roughly in quadrature, but with increasing spatial frequencies. These weight patterns provide the building blocks that lead to the filters that were extracted with reverse correlation of the output units (Figure
We used the same white noise analysis previously applied to the output units to gain more insight into the functional properties of the hidden units. First, both simple-like and complex-like units had multiple slanted filters with excitatory and/or suppressive symmetric nonlinearities. Excitatory filters typically corresponded to the preferred speed and direction while suppressive filters corresponded to the anti-preferred speed and direction of the unit, albeit with broader tuning. In other words, their properties were qualitatively similar to those of the output units (Figure
Finally, we investigated the feedforward connectivity from the input to the hidden units and the lateral connectivity among hidden units. We defined a hidden unit’s direct weight as the pattern of weights connecting it to all input units. A hidden unit’s indirect weight was defined as the average weight that connected the unit to the input units via the lateral connections of the hidden layer (see Section Materials and Methods). Figure
An alternative hypothesis of the mechanism underlying direction selectivity in the RMM starts from the observation that spatially shifted (d
The wide scatter of the data points clearly shows that the population as a whole did not follow the prediction based on the ME model. While some simple-like units were relatively close to the slope-1 line, most were not, and the linear summation scheme either overestimated or underestimated the real preferred speed. This mismatch was even more pronounced for most of the complex-like units where, surprisingly, the linear summation scheme often predicted the opposite direction of motion (orange data points in the second and fourth quadrants).
One way to phrase this result is that the recurrent network connectivity changed the effective d
In the RMM all hidden units project to the output units. As a consequence, both simple-like and complex-like hidden units contributed equally to the velocity tuning of the output units. As this may appear to conflict with the evidence that MT cells receive mainly V1 complex input (Movshon and Newsome,
We showed that a recurrent network can generate the velocity-tuned response dynamics measured in area MT. This network used only a single intrinsic delay, but nevertheless generated output units with a wide range of speed preferences. When the output units were tested with noise stimuli, they had slanted space-time filters with symmetric nonlinearities for the preferred and anti-preferred direction of motion, much like a feedforward ME network. The RMM, however, captured the full time course of velocity tuning, while the feedforward approximation could not. This strongly suggests that higher than second order spatiotemporal interactions play an important role in motion detection and that they may result from recurrent interactions within the motion network. The hidden units of the RMM showed a continuum of simple- to complex-like properties consistent with those found along the motion pathway of the primate brain. The velocity tuning of these units did not arise from the linear summation of spatially shifted and temporally delayed inputs (as in the ME model), but instead relied on asymmetric spatial connectivity and the nonlinear operations embedded in the recurrent interactions to become sensitive to a wide range of velocities.
After addressing some practical limitations of our modeling effort, we turn to the origin of delays in the RMM and the importance of considering the full time course of motion selective responses, and compare the RMM to the ME model.
Our experiments used a stimulus with a diameter of 10° and a monitor refresh of 75 Hz (13 ms). This naturally determines the spatial and temporal bounds on the motion tuning we could find (and then model). For instance, due to aliasing, stimuli moving at 64°/s on a 75 Hz monitor generate limited directional motion signals, hence we did not attempt to model MT neurons with very fast speed preferences. Similarly, very slow movements are affected more by the discretization of space (the limited number of input neurons) and our representation of the random dot patterns removed high spatial frequencies and therefore some low-speed information. In other words, the particular choices we made to approximate the spatiotemporal properties of the stimuli used in the experiment (e.g., RF size, low-pass filters, simulation time step) limited the range of neurons that the RMM could feasibly model. Our selection of 26 neurons (e.g., with preferred speeds in the middle of the range) was partly based on these constraints. Hence, we do not claim that the specific RMM used here can model the response of any MT neuron; neurons with very high preferred speed, for instance, would likely require an input layer spanning a larger part of space. Interestingly, this is consistent with the finding that preferred speeds increase with RF size (Orban et al.,
Our fixed 13 ms simulation time step is a crude abstraction of the dynamics of the visual system. This window was mainly chosen for practical reasons. First, the random dot patterns were displaced every 13 ms; by choosing a simulation time step of (at most) 13 ms we could simulate the response to each pattern that was shown to the neuron. Shorter simulation time steps would have come at rapidly increasing computational cost, but would also require us to use spike count estimates from shorter windows, which would have made these estimates less reliable. Finally, we note that the 13 ms time step is within the approximate temporal integration range of 10–30 ms for pyramidal cells in cortex (Spruston and Johnston,
The RMM provides a proof of principle that a network, in which all neurons have the same intrinsic delays, can nevertheless generate motion sensitivity with a wide range of preferred speeds. This speed tuning is the result of spatially asymmetric connections (input weights) and nonlinear recurrent dynamics (lateral and recurrent weights) that generate a range of effective delays (Mineiro and Zipser,
A simple linear combination of the spatially asymmetric (d
In other recurrent network based motion models (Suarez et al.,
Of course our model cannot exclude the possibility that factors other than recurrent network dynamics also contribute to the computation of velocity. For instance, variations in the temporal properties of the thalamic relay neurons that provide input to DS cells in primary visual cortex can contribute to direction selectivity (Saul,
The time course of velocity tuned responses has received little attention in models, but can be quite revealing of the underlying mechanisms. Our analysis in Figures
When analyzed with white noise methods, the RMM revealed filters and nonlinearities that were at least superficially consistent with the ME model (i.e., slanted in space time, a quadratic nonlinearity, and a form of motion opponency). While this provides support for the model (because such filters have been found empirically), it also makes an important conceptual point about the interpretation of the empirical data. Notably, finding such ME-like filters in real neurons does not prove that the underlying architecture is at all similar to the feedforward ME model.
We also found a number of deviations between the RMM and the idealized ME model (Adelson and Bergen,
A recurrent network can compute a representation of velocity in much the same way as the ME model, but without the need for separate classes of fast and slow neurons or synapses. In contrast to the ME model, the recurrent network also matches the temporal dynamics of a population of single MT cells, and makes use of higher-order spatiotemporal correlations in the input. Because it relies on the pervasive recurrent connections of visual cortex, and given that it contains hidden units that are similar to other neurons in the motion processing pathway of the primate brain, we believe it is a biologically plausible model of motion detection.
Even though we focused on motion detection here, the training of artificial recurrent networks on recorded neuronal responses may also be a generally useful approach to investigate other domains of sensory processing and higher cognitive function that require the representation of sequences and time, which is thought to depend critically on recurrent network dynamics (Elman,
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This work was supported by the Prins Bernhard Cultuurfonds, the National Eye Institute (R01EY017605) and the Pew Charitable Trusts.