Modeling Interactions between Speech Production and Perception: Speech Error Detection at Semantic and Phonological Levels and the Inner Speech Loop

Kröger, Bernd J.; Crawford, Eric; Bekolay, Trevor; Eliasmith, Chris

doi:10.3389/fncom.2016.00051

ORIGINAL RESEARCH article

Front. Comput. Neurosci., 31 May 2016
Volume 10 - 2016 | https://doi.org/10.3389/fncom.2016.00051

Bernd J. Kröger¹*

Eric Crawford²

Trevor Bekolay³

Chris Eliasmith³

¹Neurophonetics Group, Department of Phoniatrics, Pedaudiology, and Communication Disorders, Medical School, RWTH Aachen University, Aachen, Germany
²Reasoning and Learning Lab, School of Computer Science, McGill University, Montreal, QC, Canada
³Centre for Theoretical Neuroscience, University of Waterloo, Waterloo, ON, Canada

Production and comprehension of speech are closely interwoven. For example, the ability to detect an error in one's own speech, halt speech production, and finally correct the error can be explained by assuming an inner speech loop which continuously compares the word representations induced by production to those induced by perception at various cognitive levels (e.g., conceptual, word, or phonological levels). Because spontaneous speech errors are relatively rare, a picture naming and halt paradigm can be used to evoke them. In this paradigm, picture presentation (target word initiation) is followed by an auditory stop signal (distractor word) for halting speech production. The current study seeks to understand the neural mechanisms governing self-detection of speech errors by developing a biologically inspired neural model of the inner speech loop. The neural model is based on the Neural Engineering Framework (NEF) and consists of a network of about 500,000 spiking neurons. In the first experiment we induce simulated speech errors semantically and phonologically. In the second experiment, we simulate a picture naming and halt task. Target-distractor word pairs were balanced with respect to variation of phonological and semantic similarity. The results of the first experiment show that speech errors are successfully detected by a monitoring component in the inner speech loop. The results of the second experiment show that the model correctly reproduces human behavioral data on the picture naming and halt task. In particular, the halting rate in the production of target words was lower for phonologically similar words than for semantically similar or fully dissimilar distractor words. We thus conclude that the neural architecture proposed here to model the inner speech loop reflects important interactions in production and perception at phonological and semantic levels.

Introduction

Speech production is a hierarchical process that starts with the activation of an idea to be communicated, proceeds with the activation of words and then with the modification and sequencing of words according to grammatical and syntactic rules, and ends with the activation of a sequence of motor actions that realize the intended utterance (Dell and Reich, 1981; Dell et al., 1997; Levelt et al., 1999; Levelt and Indefrey, 2004; Riecker et al., 2005). Despite the complexity and depth of the speech production hierarchy, the production process runs nearly error free. Speech errors are relatively rare and typically need to be evoked experimentally in order to be studied (Levelt, 1983; Nooteboom and Quené, 2015). This robustness supports the assumption that speech production benefits from a robust neural mechanism for activating and processing already learned and stored cognitive and sensorimotor speech units (e.g., syllables, words, short phrases). Additionally, this robustness supports the assumption that speech production may be monitored at different levels in order to detect and repair occurring errors (Postma et al., 1990; Postma, 2000; Hartsuiker and Kolk, 2001; Schwartz et al., 2016).

Restricting our attention to single word production (such as in a picture naming task), speech production starts with the activation of semantic concepts (e.g., “has wheels,” “can move,” “can transport persons,”), retrieves an associated word (e.g., “car”) and its phonological form (/kar/) from the mental lexicon (see e.g., Dell and Reich, 1981; Levelt et al., 1999), activates the relevant motor plan, which can be thought of as a collection of intended speech movements (such as: form tongue, lower jaw and lips for /k/, then for /ar/; in parallel open glottis for production of the unvoiced speech sound /k/ and then for the voiced sound /ar/) and then executes these speech movements or actions in order to articulate the intended word and to generate the appropriate acoustic signal (see e.g., Kröger and Cao, 2015). This production process mainly consists of two stages, one cognitive and one sensorimotor. The cognitive stage consists of concept activation, word selection and the subsequent activation of the related phonological representation (Dell et al., 1997; Levelt et al., 1999), while the sensorimotor stage consists of motor plan activation (also called motor planning) and execution (Riecker et al., 2005; Kröger and Cao, 2015).

Both the cognitive and sensorimotor stages of speech production mainly involve retrieving and activating units or chunks already stored in repositories of cognitive knowledge and sensorimotor skill, respectively. This knowledge and these skills were learned during speech and language acquisition. The cognitive knowledge repository that plays a central role in word production is called the mental lexicon. Here, a neural word node (also called lemma node) is associated both with a semantic or conceptual representation of the word and a lexical or phonological representation of the word (Levelt et al., 1999). The sensorimotor skill repository is called the mental syllabary. Within this repository, phonological forms of syllables or (short) words are associated with motor plans, as well as with auditory and somatosensory mental images of the already acquired syllable or word (Kröger and Cao, 2015). Monitoring at the sensorimotor level is mainly a matter of comparing learned auditory and somatosensory images with the sounds generated during speech articulation, which are fed back through the auditory system (auditory self-perception). This monitoring process is slow, because it includes both motor execution and auditory perception (Postma, 2000). A faster monitoring loop called the inner speech loop compares word representations activated at the cognitive level of the production hierarchy to those activated by a level of the perception hierarchy, here better labeled as comprehension. This monitoring mainly consists of comparing the intended conceptual and phonological representations with the instantaneously-activated conceptual and phonological representations evoked during speech production, and leads to inner self-perception (Hartsuiker and Kolk, 2001). 
Inner self-perception assumes the existence of inner speech (also called covert speech), while auditory self-perception or outer self-perception requires the production of audible speech, also called overt speech (Oppenheim and Dell, 2008).

It can be assumed that speech monitoring, i.e., the comparison of intended and produced speech, can be realized by linking production and perception outcomes at different levels or stages (e.g., concept, phonological form, or motor plan levels). While the slower outer speech loop includes all stages (from conceptualization to articulation and back), the inner speech loop covers only the stages from conceptualization to retrieval of the phonological form and back. This inner loop theory of speech monitoring has been successful in explaining the fact that speech errors are often repaired so quickly that the involvement of the (slow) auditory feedback loop (outer speech loop) can be ruled out (Postma, 2000; Hartsuiker and Kolk, 2001).

Because it is not trivial to evoke speech errors, in this study a picture naming and halt paradigm is used (Slevc and Ferreira, 2006). In this paradigm, utterance of a target word is elicited by normal picture naming, where picture-to-word associations are pre-learned in an initial familiarization procedure (Slevc and Ferreira, 2006, p. 520). About 400 ms later, a halt signal (distractor word) is presented acoustically and subjects are required to stop production of the target word if the distractor word is different from the target word. One significant finding of the picture naming and halt study performed by Slevc and Ferreira (2006) was that the mean stopping rate depends on the phonological similarity between the target and distractor word, but not on the semantic similarity. Semantically similar distractor words were found to have the same stopping accuracy as distractor words dissimilar from the target word (Slevc and Ferreira, 2006, p. 521). It should be noted that this paradigm does not directly relate to speech error detection, but is a method for investigating the mechanisms governing speech monitoring. A second main result of the experiment described by Slevc and Ferreira (2006) is that the speech monitor detects differences more easily for distractor words that are completely dissimilar to the target word, as well as for distractor words that are semantically similar, whereas phonologically similar distractor words are not detected as easily. The halting rate for phonologically similar words was found to be the lowest.

The main goal of this study is to develop a neural architecture for speech production and perception (and comprehension), which, on the one hand, enables fast, effortless and error-free realization of word production, and, on the other hand, allows for the simulation of speech errors and realistic and effective speech monitoring. Thus, the neural architecture of our model consists of speech production, perception and monitoring components and, furthermore, should be capable of detecting and correcting speech errors. A further major goal of this study is to underline the tight connection between speech production and perception (e.g., Pickering and Garrod, 2013).

The neural model of speech processing developed here uses the principles of the Neural Engineering Framework (NEF, Eliasmith and Anderson, 2003; Eliasmith, 2013). We used this framework because it allows for the development of neurobiologically plausible large-scale models of both cognitive and sensorimotor components, and because it has already been shown to be capable of producing models that match human performance on a number of non-speech behavioral tasks (Eliasmith et al., 2012). Three basic principles characterize the NEF: representation, transformation, and dynamics (Eliasmith and Anderson, 2003). (i) Representation means that the NEF allows external (sensory or motor) signals to be encoded as neural states and neural states to be decoded as non-neural (external physical) signals. These neural states are represented within the NEF as neural activation patterns or spike patterns of specific neuron ensembles. Thus, a neuron ensemble is the basic unit within the NEF for representing neural states. Each neuron ensemble consists of a specific number of individual neurons; in this study leaky integrate-and-fire (LIF) neurons are used. (ii) Transformation means that neuron ensemble A can be connected to a downstream neuron ensemble B by establishing a neural connection from each neuron within ensemble A to each neuron within ensemble B. These neural connections not only allow communication of a neural state from A to B, but can also be constructed to transform the neural state represented by A into a neural state in B that is a function of the neural state represented by A. (iii) Dynamics means that a neural state of a neuron ensemble changes with respect to input over time. Input can be provided by other neuron ensembles, or by the same neuron ensemble through recurrent connections. Here, each neuron within the neuron ensemble is connected to the other neurons in the neuron ensemble. 
This allows the neural state to be maintained in the absence of input, implementing a type of neural memory.
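The memory effect of a recurrent connection can be illustrated at the level of the represented state, without spiking neurons. The following is a minimal sketch, assuming a first-order low-pass synapse with time constant `tau`, a recurrent feedback weight of 1, and an input scaled by `tau`; the function name, the step size, and the parameter values are illustrative assumptions and are not taken from the model described here.

```python
def simulate_memory(inputs, dt=0.001, tau=0.1):
    """Euler-integrate a recurrently connected population at the state level.

    The state x is fed back through a low-pass synapse (time constant tau)
    and the input u is scaled by tau, so that
        tau * dx/dt = -x + (x + tau * u), i.e. dx/dt = u.
    """
    x = 0.0
    trace = []
    for u in inputs:
        drive = x + tau * u          # recurrent feedback plus scaled input
        x += dt * (drive - x) / tau  # low-pass synapse dynamics
        trace.append(x)
    return trace


# 100 ms of constant input followed by 400 ms without input (1 ms steps).
inputs = [5.0] * 100 + [0.0] * 400
trace = simulate_memory(inputs)
state_at_input_end = trace[99]
state_at_end = trace[-1]
```

In this idealized sketch the state reached at the end of the input is held unchanged once the input is removed. In a spiking implementation the held value would be expected to drift because of noise in the neural representation.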

A model of speech production, speech perception, speech monitoring, and speech error detection and repair is complex and must include cortical as well as subcortical components. For building complex models using the NEF it is advantageous to use the Semantic Pointer Architecture (SPA, Eliasmith, 2013; Stewart and Eliasmith, 2014; Gosmann and Eliasmith, 2016). The SPA is based on the NEF and allows for more complex neural representations and transformations than would otherwise be possible. In the SPA, complex neural states are called semantic pointers (SPs). Semantic pointers (e.g., “A,” “B,” and “C”) are capable of representing different cognitive states, for example semantic concepts (e.g., A = ‘Is_Blue’, B = ‘Has_Four_Legs’, C = ‘Apple’, Eliasmith, 2013; Blouw et al., 2015), or orthographic and phonological forms of words, or high level visual or auditory representations of concepts or words. Semantic pointers are defined as N-dimensional vectors (typically N = 512 for the case of coding an entire lexicon of a particular language, Crawford et al., 2015). The neural state representing a semantic pointer is a specific neural activation pattern occurring within a cortical SPA buffer. SPA buffers consist of several neuron ensembles that each represent a subset of the N dimensions in the semantic pointer. Like ensembles, the input to buffers can change over time, allowing a SPA buffer to represent different semantic pointers depending on input. While neuron ensembles can be decoded into a real-valued vector, SPA buffers require an additional operation to decode. In order to determine the semantic pointer currently represented by the neural state of a cortical SPA buffer, the similarity of the neural state is calculated for all semantic pointers defined for the current neural model. This similarity is calculated using the dot-product operation (Stewart and Eliasmith, 2014). 
The dot product should be near 1 if the state matches the target semantic pointer, and near 0 for other semantic pointers. Thus, a useful way of characterizing the activity pattern within a cortical SPA buffer is by way of its similarity values with all currently defined semantic pointers. Throughout this work we will make extensive use of this kind of characterization for visualization purposes (e.g., see Figures 2–11 in Sections Methods and Results of this paper).
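The dot-product decoding described above can be sketched in a few lines. This is a minimal vector-level illustration, assuming randomly generated unit vectors as semantic pointers and additive noise as a stand-in for the imprecision of a neural representation; the helper names, the noise level of 0.2, and the random seed are assumptions made for the example.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction and normalize it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def similarities(state, vocab):
    """Similarity of a buffer state to every defined semantic pointer."""
    return {label: dot(state, vec) for label, vec in vocab.items()}


rng = random.Random(0)
DIM = 512  # dimensionality named in the text
vocab = {label: random_unit_vector(DIM, rng)
         for label in ("Is_Blue", "Has_Four_Legs", "Apple")}

# A noisy buffer state that is meant to represent 'Apple'.
noise = random_unit_vector(DIM, rng)
state = [s + 0.2 * n for s, n in zip(vocab["Apple"], noise)]

sims = similarities(state, vocab)
best = max(sims, key=sims.get)  # label with the highest similarity
```

Because random high-dimensional unit vectors are nearly orthogonal, the similarity to the matching pointer stays close to 1 while the similarities to the other pointers stay close to 0.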

Semantic pointers are also the vehicles for representing actions, e.g., whether the production of a word should be started (e.g., buffer A1 = ‘SPEAK’) or halted (e.g., buffer A2 = ‘HALT’). The neural states for these semantic pointers are activated at the level of a cortical task control buffer, which is closely connected with an action selection module. This buffer, together with the basal ganglia and thalamus complex, forms the cortico-cortical action selection loop (Stewart et al., 2010a,b).

Because it is non-trivial to generate a set of semantic pointers for representing the concepts, words (i.e., lemmas and orthographic forms), and phonological forms of words for a specific natural language vocabulary, including a representation of the similarities of words at the concept (or semantic) and at the phonological-phonetic level, a semantic pointer network module constitutes a further part of the NEF. Here N words WORD_i, where i = 1, …, N, are stored within three subnetworks of semantic pointers, e.g., WORD_1_CONCEPT = ‘Apple_Apfel’ is part of subnet “concepts,” WORD_1_LEMMA = ‘W_Apple’ is part of subnet “words,” and WORD_1_PHONOL = ‘St_E_pel’ is part of the subnet “phonological representations.” The nomenclature for the semantic pointers is given in Section Methods. Thus, this semantic pointer network module allows the specification of semantic pointers for words at different levels of representations (e.g., conceptual, orthographic, and phonological levels), as well as the specification of relations and similarities between semantic pointers at different representation levels (see Section Methods in this paper and see Section 5.2 in Crawford et al., 2015).

In the following sections we introduce our model for speech production and perception including speech monitoring and error-processing, and we present experimental results (i) modeling halting in production when distortions are evoked at semantic and phonological levels within the model and (ii) for simulating a picture naming and halt task (Slevc and Ferreira, 2006).

Methods

Architecture of the Neural Model

The architecture of the neural model comprises an input module, a production and a comprehension pathway forming the core of the inner speech loop, and an action selection module (Figure 1). The speech production module, shown in Figure 1, has been described in previous studies (Kröger et al., 2014, 2016; Senft et al., 2016) and is not included in the current study in order to keep simulation time low. The input module consists of three cortical SPA buffers: the perceptual, conceptual, and phonological input buffers (see Figure 1). The neural activations occurring at the level of the perceptual input buffer indicate the points in time when a visual or audio input signal appears. Semantic pointers are defined for representing these perceptual events, i.e., the points in time for the beginning and end of input in the visual and auditory domains. The neural signals generated in the perceptual input buffer are directly forwarded to the action selection module (Figure 1). In parallel, visual input information evokes a neural state within the conceptual input buffer and auditory input information evokes a neural state within the phonological input buffer. Visual input is represented here in the form of conceptual semantic pointers, which encode the meaning of the picture presented as the input signal at that point in time. Thus, specific visual processing is not currently included in our model. This omission of visual processing is justified, because subjects participating in the picture naming and halt experiment (Slevc and Ferreira, 2006; Experiment 1) undergo an initial familiarization procedure wherein they learn to associate 18 concrete English words (18 target words) with 18 concrete line drawings which are later presented in the task. Similarly, auditory processing was simplified by directly activating the phonological representation (i.e., a sequence of speech sounds or phones) associated with an acoustically presented distractor word (halt signal). 
This simplification too is justified, because the goal of this study is to test the inner speech loop (see Slevc and Ferreira, 2006, p. 518, Figure 2). The three buffers that make up the input module (like all other buffers in our model) are implemented as cortical SPA buffers (Stewart and Eliasmith, 2014), capable of representing 512 dimensional vectors (semantic pointers) using 50 neurons per dimension (25,600 neurons per buffer).
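The buffer size stated above follows directly from the dimensionality and the number of neurons per dimension. As a small worked check (the variable names are illustrative):

```python
DIMENSIONS = 512            # dimensionality of each semantic pointer
NEURONS_PER_DIMENSION = 50  # neurons used to represent one dimension

neurons_per_buffer = DIMENSIONS * NEURONS_PER_DIMENSION  # 25,600

# The input module consists of three such buffers
# (perceptual, conceptual, and phonological input).
INPUT_BUFFERS = 3
input_module_neurons = INPUT_BUFFERS * neurons_per_buffer  # 76,800
```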

Figure 1. Architecture of neural model of speech production and speech perception for modeling self-detection of speech errors and for modeling a picture naming and halt task.

Figure 2. Simulation of picture naming task without error stimulation; visual input (Vin): “duck,” no additional input; phonological form: ‘St_dak’. Rows indicate neural activation levels of different cortical SPA buffers over time. Row 1: perceptual input buffer, row 2: conceptual input buffer (visual input ‘Vin_…’ is directly converted in a concept representation), row 3: error input buffer (not indicated in Figure 1), row 4: task control buffer, rows 5–10: cortical buffers for concepts, words, and phonological forms within production and perception/comprehension pathway of inner speech loop. In row 2, row 3, and rows 5–10, the activation levels of all 90 semantic pointers are displayed. Only the semantic pointers with the highest activation levels are labeled by text.

The inner speech loop consists of six cortical SPA buffers, representing the conceptual, word, and phonological state of a currently activated word. Within these six cortical SPA buffers (three for the production pathway and three for the perception pathway, see Figure 1), only neural states that represent semantic pointers of concept, word, or phonological forms of already learned words can be activated. These semantic pointers are stored as vectors within a portion of our neural model called the mental lexicon module. The concepts, words, and phonological forms stored in that module are listed in Appendix A. During picture naming, the neural state activated in the conceptual input buffer (i.e., the concept corresponding to the target word presented visually) directly co-activates concept-level, word-(lemma)-level, and phonological-level neural states for the target word in the production pathway (Figure 1). Subsequently, the phonological neural state of the production pathway co-activates a phonological, word, and conceptual neural state within the comprehension pathway in order to allow self-perception and self-monitoring. If, in addition, external speech (produced not by the model itself but by an interlocutor) is presented acoustically, then the phonological input representation, i.e., phonological representation of an external acoustically presented word (activating the phonological input buffer within the input module of our model, see Figure 1), also co-activates the phonological, word, and conceptual SPA buffers of the perception pathway of the inner speech loop (the arrow from input module to inner speech loop in Figure 1). This externally-elicited activation interferes with the activation in the comprehension pathway that stems from the current state of the phonological component of the production pathway (see left-to-right arrow, also called a “shortcut” between both phonological buffers in Figure 1). 
The direct co-activation of related conceptual, word, and phonological states within both the production and perception pathways of the model is implemented using four (hetero-)associative memories (Voelker et al., 2014), labeled as AM in Figure 1. The associations stored in these four associative memories are considered to be part of the mental lexicon. For example, for the concept coded by the semantic pointer ‘Apple_Apfel’, the semantic pointer ‘W_Apple’ is the associated representation at the word (lemma) level and the semantic pointer ‘St_E_pel’ is the associated representation at the phonological level (see also Appendix A). Concept pointers like ‘Apple_Apfel’ are written in two languages, because a concept is not necessarily language specific. Phonological forms are given as phonetic-phonological transcriptions (e.g., ‘St_E_pel’ for “apple”). Within these transcriptions, syllables are separated by an underscore and the most stressed syllable within a word is marked by the prefix ‘St_’; the transcriptions in part follow SAMPA notation (SAMPA, 2005).
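The chain of associations from concept to word to phonological form can be sketched at the vector level. The following is a simplified, non-spiking illustration and not the spiking associative memory of Voelker et al. (2014): it assumes that each memory returns the value paired with the most similar stored key, with a similarity threshold of 0.3 and a reduced dimensionality of 64 chosen only for the example. The labels ‘Duck_Ente’ and ‘W_Duck’ are hypothetical placeholders.

```python
import math
import random


def unit_vector(dim, rng):
    """Draw a random direction and normalize it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class AssociativeMemory:
    """Map an input state to the value paired with its most similar key."""

    def __init__(self, pairs, threshold=0.3):
        self.pairs = pairs          # list of (key_vector, value_vector)
        self.threshold = threshold  # assumed cutoff, not from the paper

    def recall(self, state):
        key, value = max(self.pairs, key=lambda kv: dot(state, kv[0]))
        if dot(state, key) < self.threshold:
            return [0.0] * len(value)  # no stored key is similar enough
        return value


rng = random.Random(1)
DIM = 64  # reduced from 512 to keep the example small
labels = ["Apple_Apfel", "W_Apple", "St_E_pel",
          "Duck_Ente", "W_Duck", "St_dak"]  # 'Duck_Ente', 'W_Duck': hypothetical
vocab = {label: unit_vector(DIM, rng) for label in labels}

concept_to_word = AssociativeMemory([
    (vocab["Apple_Apfel"], vocab["W_Apple"]),
    (vocab["Duck_Ente"], vocab["W_Duck"]),
])
word_to_phonol = AssociativeMemory([
    (vocab["W_Apple"], vocab["St_E_pel"]),
    (vocab["W_Duck"], vocab["St_dak"]),
])


def decode(state):
    """Return the label of the most similar semantic pointer."""
    return max(vocab, key=lambda label: dot(state, vocab[label]))


# Production pathway: concept -> word (lemma) -> phonological form.
word_state = concept_to_word.recall(vocab["Apple_Apfel"])
phonol_state = word_to_phonol.recall(word_state)
produced = decode(phonol_state)
silent = concept_to_word.recall([0.0] * DIM)
```

The perception pathway would use two further memories of the same kind in the reverse direction (phonological form to word, word to concept), giving the four associative memories named in the text.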

From a functional viewpoint, the neural model presented here is designed for (i) self-detection of speech errors occurring during word production by self-monitoring, and for (ii) realizing a picture naming and halt task, which requires the self-monitoring component in order to compare self-produced target words to externally produced distractor words. Consequently, the action selection module used in our model is primarily designed for self-monitoring and, in particular, for evaluating the degree of similarity between the neural states active in the production buffers and those active in the comprehension buffers at concept, word, and phonological levels (see arrow from inner speech loop to action selection module in Figure 1). In addition, the neural states that are currently active in the perceptual input buffer are fed to the action selection module in order to identify the points in time at which the comparison of production and perception neural states needs to be carried out in order to activate a ‘HALT’ action.

The similarity values representing concept, word, and phonological levels can be calculated as dot products (see Section Introduction). Thus, dot products are used here for calculating utility values Ui for actions Ai (i = 1, …, M) at the level of the basal ganglia. The action Ai exhibiting the highest utility value Ui is selected by the thalamus component of the action selection module (Stewart et al., 2010a; Eliasmith, 2013; Stewart and Eliasmith, 2014). The neural implementation of action selection relies on the interacting dynamics of excitatory AMPA connections and inhibitory GABA connections between different parts of the basal ganglia (i.e., striatum, substantia nigra, and globus pallidus externus/internus). Moreover, a detailed realization of the cortico-cortical loop including basal ganglia and thalamus has been implemented (Stewart et al., 2010b). It should be noted that the detailed modeling of post-synaptic time constants at cortical levels as well as at the level of the basal ganglia thalamus action selection module leads to a typical time interval of around 50 ms for action selection, also called the “cognitive cycle time” (Anderson et al., 2004; Stewart et al., 2010b).

The basic actions which can be selected in our model are A1 = ‘NEUTRAL’ (do nothing), A2 = ‘SPEAK’, A3 = ‘CONSIDER_HALT’, and A4 = ‘HALT’. If one of these actions is chosen, its semantic pointer similarity value approaches 1 within the time course of the neural activation patterns of the task control cortical SPA buffer within the action selection module (Figure 1; see row 4 in Figures 2–11 below). Action selection works as follows: all dot products are continuously evaluated in order to estimate the utility values U2(t), U3(t), and U4(t) for the if-statements (ii) to (iv). Specifically,

U2(t) = DOT_PROD(perceptual_input_buffer, ‘NEW_VISUAL’)

U3(t) = DOT_PROD(perceptual_input_buffer, ‘NEW_AUDIO’)

U4(t) = DOT_PROD(perceptual_input_buffer, ‘WORD’) − [DOT_PROD(concept_buff_perc, WORD_i_CONCEPT) + DOT_PROD(word_buff_perc, WORD_i_LEMMA) + DOT_PROD(phonol_buff_perc, WORD_i_PHONOL)]

(i) if all utility values Ui(t) < 0.25 (where Ui(t) ranges between 0 and 1)

then: select action ‘NEUTRAL’ (i.e., do nothing);

(ii) if U2(t) is highest utility value currently

then: select action A(t) = ‘SPEAK’;

(iii) if U3(t) is highest utility value currently

then: select action A(t) = ‘CONSIDER_HALT’;

(iv) if U4(t) is highest utility value currently

then: select action A(t) = ‘HALT’;

Thus, action selection leads to ‘NEUTRAL’ if no utility value is above 0.25 (if-statement i). If a new word is activated by a visual signal at the conceptual input buffer, the ‘SPEAK’ action will always be chosen (if-statement ii). The semantic pointer ‘NEW_VISUAL’ indicates the beginning of a visual presentation of a new word during the time interval ‘WORD’. If an external audio signal is presented quickly after the activation of ‘SPEAK’ (as happens in the picture naming and halt task) then ‘CONSIDER_HALT’ will be activated (if-statement iii). In addition, if a word is clearly activated within all buffers of the inner speech loop (which is always the case after the ‘NEW_VISUAL’ input appears) the ‘HALT’ action can be chosen if the difference between the semantic pointers activated in the production pathway (i.e., WORD_i_CONCEPT, WORD_i_LEMMA, and WORD_i_PHONOL) and the perception/comprehension pathway (i.e., current neural activation pattern within cortical SPA buffers concept_buff_perc, word_buff_perc, and phonol_buff_perc) is large at the concept, word, or phonological level (if-statement iv). If this difference is small, then utility value U4 will be low, leading to no activation of the ‘HALT’ action.
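The selection rule can be summarized in a short sketch that operates directly on dot-product values. It assumes that the three perception-side dot products in U4 are subtracted as a group from the ‘WORD’ term, so that U4 rises when production and perception disagree; this reading, the function name, and the argument names are assumptions made for illustration, and the basal ganglia and thalamus dynamics of the model are not simulated.

```python
def select_action(new_visual, new_audio, word,
                  sim_concept, sim_word, sim_phonol, threshold=0.25):
    """Pick an action from dot-product similarities (all passed as floats).

    new_visual, new_audio, word: similarity of the perceptual input buffer
        to 'NEW_VISUAL', 'NEW_AUDIO', and 'WORD'.
    sim_concept, sim_word, sim_phonol: similarity of each perception buffer
        to the pointer active in the matching production buffer.
    """
    utilities = {
        "SPEAK": new_visual,                                  # U2
        "CONSIDER_HALT": new_audio,                           # U3
        # U4: assumed grouping, high when perception mismatches production
        "HALT": word - (sim_concept + sim_word + sim_phonol),
    }
    best = max(utilities, key=utilities.get)
    if utilities[best] < threshold:
        return "NEUTRAL"  # if-statement (i): no utility reaches 0.25
    return best


# Picture onset: only 'NEW_VISUAL' is active.
onset = select_action(1.0, 0.0, 0.0, 0.0, 0.0, 0.0)
# Word interval, perception matches production at all levels.
matching = select_action(0.0, 0.0, 1.0, 1.0, 1.0, 1.0)
# Word interval, perception differs from production at all levels.
mismatch = select_action(0.0, 0.0, 1.0, 0.1, 0.1, 0.1)
# Distractor onset signalled by 'NEW_AUDIO'.
audio = select_action(0.0, 0.9, 1.0, 1.0, 1.0, 1.0)
```

Under these assumptions the four calls select ‘SPEAK’, ‘NEUTRAL’, ‘HALT’, and ‘CONSIDER_HALT’, respectively.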

In our model, action selection indirectly leads to an activation of a go-signal for the speech production module (see the arrow from action selection to speech production in Figure 1). The go-signal will be activated if the ‘SPEAK’ action is selected and if no ‘HALT’ action becomes activated during the next 100 ms. The speech production module consists of a premotor buffer for representing the motor plan of the word that is currently activated in the phonological component of the inner speech loop (see the arrow between the inner speech loop and speech production module in Figure 1), a motor buffer for representing the muscle activation patterns of the speech articulators, and the (external) vocal tract model for representing the articulator movements and for generating the acoustic speech signal (assumed to represent the M1 cortical area; for the separation of motor planning and execution see Kröger and Cao, 2015 and Kröger et al., 2016). The production model allows primary motor activation only in the case of an active go-signal and if a motor plan is activated in the premotor buffer of the speech production module, which can only be the case if a clear and strong activation of a phonological form occurs in the phonological buffer within the production pathway of the inner speech loop. In order to keep simulation times low, the production module is not included in the simulations described in this paper. The neural model developed here for the simulations described below was programmed in Nengo (Bekolay et al., 2014).
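The go-signal condition can be written as a small check over a timeline of selected actions. This is a sketch under the assumption that the go-signal becomes active at the end of the 100 ms window following ‘SPEAK’ if no ‘HALT’ was selected within that window; the event format and the function name are hypothetical.

```python
def go_signal_time(events, window_ms=100):
    """Return the time (ms) at which the go-signal becomes active, or None.

    events: list of (time_ms, action) tuples, e.g. [(170, "SPEAK")].
    """
    for t_speak, action in events:
        if action != "SPEAK":
            continue
        # Look for a 'HALT' selected within the window after 'SPEAK'.
        halted = any(a == "HALT" and t_speak < t <= t_speak + window_ms
                     for t, a in events)
        if not halted:
            return t_speak + window_ms
    return None


undisturbed = go_signal_time([(170, "SPEAK")])
halted_trial = go_signal_time([(170, "SPEAK"), (230, "HALT")])
late_halt = go_signal_time([(170, "SPEAK"), (400, "HALT")])
```

In the third call the ‘HALT’ action falls outside the window, so in this sketch the go-signal is still issued.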

Network Implementation of the Mental Lexicon

It has been mentioned above that the neural states activated in the concept, word, and phonological buffers within the inner speech loop are neural activation patterns equivalent to or represented by semantic pointers. These semantic pointers are stored as vectors within the mental lexicon module of our model (not shown in Figure 1). The vector representations and the neural states associated with these pointers are assumed to have been learned during speech and language acquisition. Because learning is beyond the scope of this paper, the collection of semantic pointers making up the mental lexicon is predefined in our neural model. This is realized by using semantic pointer networks, which define not only the number of semantic pointers but also the relations between them (e.g., relation “is a” in “apple is a fruit”; Eliasmith, 2013; Blouw et al., 2015; Crawford et al., 2015). Semantic pointer networks should not be confused with neural networks; rather, semantic pointer networks can be thought of as a way of representing a knowledge base, which can be implemented or realized by a spiking neural network using NEF methods. Before running a simulation, the semantic pointer network is generated for a pre-defined natural language vocabulary (see Appendix A).

In the case of the mental lexicon, the semantic pointer network needs to be subdivided into subnetworks for concepts, deep concepts, words, phonological forms, deep phonological forms, visual input, and auditory input (see Appendix A). All semantic pointers defined in each subnetwork and all relations between semantic pointers needed in each subnetwork and between different subnetworks are predefined and then used in the simulation in order to (i) generate a vector representation for each semantic pointer and to (ii) generate an associated neural state (neural activation pattern) for each semantic pointer. The subnetworks including all labels for semantic pointers and their relations are listed in Appendices A1–A5. While the subnetworks for concepts, words and phonological forms consist of 90 items each in the case of our vocabulary, the deep_concept and deep_phonological networks contain the semantic pointers needed to specify the relations between specific concepts and specific phonological forms.

Our neural model also requires subnetworks for visual representations of concepts and for auditory representations of words. In our experimental scenario, visual images are closely related to concepts, and aural signals are closely related to phonological forms. Each of these subnetworks contains 90 items, each of which corresponds directly to one of the words defined in the subnetworks for concepts, words, and phonological forms. The semantic pointers for visuals are labeled with the prefix ‘V_’ (e.g., ‘V_Apple_Apfel’) and the semantic pointers for aural signals are labeled with the prefix ‘A_’ (e.g., ‘A_apple’). In addition, semantic pointers are defined for visual and auditory input representations. The pointers within these subnetworks are labeled with an initial ‘Vin_’ or ‘Ain_’ respectively.
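The labeling scheme for the subnetworks can be illustrated with a small data structure. The prefixes ‘W_’, ‘V_’, ‘A_’, ‘Vin_’, and ‘Ain_’ and the example labels for “apple” come from the text; the class, its field names, and the choice of what follows the ‘Vin_’ and ‘Ain_’ prefixes are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexiconEntry:
    concept: str       # concept label, e.g. 'Apple_Apfel'
    word: str          # orthographic word, e.g. 'apple'
    phonological: str  # transcription, e.g. 'St_E_pel'

    def labels(self):
        """Semantic pointer labels for this word in each subnetwork."""
        return {
            "concept": self.concept,
            "word": "W_" + self.word.capitalize(),
            "phonological": self.phonological,
            "visual": "V_" + self.concept,
            "auditory": "A_" + self.word,
            # Suffixes after 'Vin_' and 'Ain_' are assumed, not from the paper.
            "visual_input": "Vin_" + self.concept,
            "auditory_input": "Ain_" + self.word,
        }


apple = LexiconEntry("Apple_Apfel", "apple", "St_E_pel")
apple_labels = apple.labels()
```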

Word Corpus

Eighteen different input or target words are used in both simulation experiments (listed in the first column of Table 1). In Experiment 1, semantically similar distractor word activations are added at the word level and phonologically similar distractor word activations are added at the phonological level. These distractor words are listed in columns 2 and 3 of Table 1. In Experiment 2, the 18 visually presented target words are combined with four different auditory input words (i.e., stop signal or distractor words, cf. Slevc and Ferreira, 2006). These distractor words are (i) semantically similar, (ii) phonologically similar, (iii) semantically and phonologically similar, or (iv) semantically and phonologically dissimilar words in relation to their corresponding target word (see columns 2–5 in Table 1). The resulting 90 words for picture naming and/or for distortion (distractor or stop signal words) are collected in a semantic pointer network (see Appendix A).

Table 1. Words used as target words (column 1) or as distractor/stop signal words (columns 2–5) in the simulation experiments.

Experiment 1

Experiment 1a

We conducted 35 trials in which productions of each of the 18 target words were simulated (630 simulations in total). No distractors or stop signals were activated. Neural activation levels for different semantic pointers in different cortical SPA buffers are displayed in Figure 2 for a typical simulation trial. Here, the similarity between the neural activity and the most similar semantic pointers is given for 10 different cortical SPA buffers. It can be seen that all cortical SPA buffers for concept, word and phonological form reflect full activation for the target word semantic pointers during production as well as during perception (Figure 2, rows 5–10 for the concept/word/phonological form of “duck”). The perceptual input buffer (Figure 2, row 1) signals that a full activation of visual input ‘NEW_VISUAL’ occurs at approximately t = 100 ms. This input is held for a time interval of 150 ms until t = 250 ms (see activation of semantic pointer ‘NEW_VISUAL’ in Figure 2, row 1). The concept input buffer (Figure 2, row 2) indicates that the concept activation from visual input starts at about 50 ms, while full activation occurs at around 100 ms and holds for 150 ms until t = 250 ms. No error signal is generated (see Figure 2, row 3: activation of error input buffer is ‘NEUTRAL’). The semantic pointer activation within the task control buffer (Figure 2, row 4) indicates that the action ‘SPEAK’ is activated at around t = 170 ms. This action results from the evoked perceptual input (i.e., we evoke the ‘NEW_VISUAL’ semantic pointer). The action selection module always activates the ‘SPEAK’ action if the neural activity pattern within the cortical task control buffer previously represented the ‘NEW_VISUAL’ semantic pointer (see Section Architecture of the Neural Model).

In the lower part of Figure 2 (rows 5–10) the levels for the semantic pointers activated at the concept, word, and phonological form levels are displayed for the production and perception pathways; these will be interpreted in the Results Section.

Experiment 1b

We simulated ten trials in which the model produced each of the 18 target words (column 1 in Table 1) with some kind of distortion (180 simulations in total). Two different types of distortion were introduced (5 trials each). (i) A “concept distortion” was introduced by activating a distractor word that was semantically similar to the target word (column 2 in Table 1) and by adding this activation to the word buffer of the production pathway (Figure 1). (ii) A “phonological distortion” was introduced by activating a distractor word that was phonologically similar to the target word (column 3 in Table 1) and by adding this activation to the phonological buffer in the inner speech loop (Figure 1).

The induction of a conceptual (or semantic) speech error in a picture naming task was simulated by adding a second concept buffer to the production pathway and by connecting the output of that second buffer (“side branch buffer”) together with the output of the original concept buffer (given in Figure 1) to the word buffer within the production pathway. The temporal activation pattern of this second concept buffer is displayed in the third row in Figure 3 (see error input buffer; in this example the concept activation for the distortion word “raven” is displayed). Thus, this “side branch production concept buffer” (not shown in Figure 1) is activated by a word which is semantically similar but not identical to the target word (“duck”).

Figure 3. Simulation of picture naming task with stimulation of a semantic (conceptual) speech error; visual input (Vin): “duck,” error input (conceptual): “raven”; phonological forms: ‘St_dak’, ‘St_rEI_wen’. Rows indicate neural activation levels of different cortical SPA buffers over time (see Figure 2).

A similar process was used to induce phonological speech errors. Specifically, a second word-level buffer was added to the production pathway of the inner speech loop, and was connected to the phonological buffer of the production pathway. This new word-level buffer is activated by a word which is phonologically similar but not identical to the target word. This leads to strong activation of the distortion word in the phonological buffer of the production pathway, which is then propagated to all levels of the perception pathway. The temporal activation pattern of the second word buffer is displayed in the third row in Figure 4 (error input buffer; displays the word activation for the distortion word “dub”).

Figure 4. Simulation of picture naming task with stimulation of a phonological speech error; visual input (Vin): “duck,” error (word) input: “dub”; phonological forms: ‘St_dak’, ‘St_dab’. Rows indicate neural activation levels of different cortical SPA buffers over time (see Figure 2).
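The side-branch mechanism used for both error types amounts to adding a second activation to the input of a downstream buffer. The following sketch shows this superposition at the vector level only, without spiking neurons. It assumes random unit vectors as word pointers, so the semantic and phonological similarities encoded in the model's pointer network are ignored; `DIM`, `side_gain`, and the function names are hypothetical.

```python
import math
import random

DIM = 256  # assumed dimensionality for this illustration


def unit_vector(rng, dim=DIM):
    """Draw a random Gaussian vector and normalise it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(1)
words = {name: unit_vector(rng) for name in ("duck", "raven", "dub", "brass")}


def downstream_input(main, side=None, side_gain=1.0):
    """Sum the main buffer output with an optional side-branch output."""
    if side is None:
        return list(main)
    return [m + side_gain * s for m, s in zip(main, side)]


def readout(state):
    """Similarity of a buffer state to every word pointer."""
    return {name: round(dot(state, vec), 2) for name, vec in words.items()}


clean = readout(downstream_input(words["duck"]))
distorted = readout(downstream_input(words["duck"], words["raven"]))
print(clean)
print(distorted)
```

In the undistorted case only the target pointer is strongly represented. With the side branch active, target and distractor are represented together, which is the kind of mismatch a monitoring component can compare against the production side.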

Experiment 2

Five trials of 72 word combinations (i.e., 18 target words and 4 × 18 = 72 distractor words) were used in the picture naming and halt task. Target words were activated by picture naming (column 1 of Table 1). Distractor words were activated after a short delay. Four different categories of distractor words were used: (i) words semantically similar to the target word (column 2 of Table 1), (ii) words phonologically similar to the target word (column 3 of Table 1), (iii) words semantically and phonologically similar to the target word (column 4 of Table 1), and (iv) words dissimilar to the target word along both dimensions (column 5 of Table 1). The results of a simulation experiment for each of these four cases are displayed in Figures 5–8.
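The size of the design can be checked with a short enumeration. The sketch below reproduces the counts stated in the text and derives the total number of Experiment 2 runs, which the text does not state explicitly. Only the row for "duck" is filled in, using the example words from Figures 5–8. The assignment of "dove" to the doubly similar category is inferred from its meaning and phonological form, and the condition labels are hypothetical.

```python
from itertools import product

N_TARGETS = 18
CONDITIONS = ("semantic", "phonological", "semantic+phonological", "dissimilar")

# Example row only; the remaining 17 rows are given in Table 1.
example_row = {
    "target": "duck",
    "semantic": "raven",
    "phonological": "dub",
    "semantic+phonological": "dove",
    "dissimilar": "brass",
}


def trials_for(row, n_repeats):
    """List (target, condition, distractor, repetition) for one target word."""
    return [
        (row["target"], cond, row[cond], rep)
        for cond, rep in product(CONDITIONS, range(n_repeats))
    ]


vocabulary_size = N_TARGETS * (1 + len(CONDITIONS))  # targets plus 4 distractor columns
exp1a_runs = 35 * N_TARGETS                          # 35 trials per target
exp1b_runs = 2 * 5 * N_TARGETS                       # 2 distortion types, 5 trials each
exp2_combinations = N_TARGETS * len(CONDITIONS)      # target-distractor pairs
exp2_runs = 5 * exp2_combinations                    # derived, not stated in the text
duck_trials = trials_for(example_row, n_repeats=5)

print(vocabulary_size, exp1a_runs, exp1b_runs, exp2_combinations, exp2_runs)
```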

Figure 5. Simulation of picture naming and halt task for dissimilar inputs; visual input (Vin): “duck,” auditory input (Ain): “brass”; phonological forms: ‘St_dak’, ‘St_bras’. Rows indicate neural activation levels of different cortical SPA buffers over time (see Figure 2 with exception of row 3). Row 3: phone input buffer (audio input ‘Ain_…’ is directly converted into a phonological representation).

Figure 6. Simulation of picture naming and halt task for semantically similar inputs; visual input (Vin): “duck,” auditory input (Ain): “raven”; phonological forms: ‘St_dak’, ‘St_rEI_wen’. Rows indicate neural activation levels of different cortical SPA buffers over time (see Figure 5).

Figure 7. Simulation of picture naming and halt task for phonologically similar inputs; visual input (Vin): “duck,” auditory input (Ain): “dub”; phonological forms: ‘St_dak’, ‘St_dab’. Rows indicate neural activation levels of different cortical SPA buffers over time (see Figure 5).

Figure 8. Simulation of picture naming and halt task for semantically and phonologically similar inputs; visual input (Vin): “duck,” auditory input (Ain): “dove”; phonological forms: ‘St_dak’, ‘St_daw’. Rows indicate neural activation levels of different cortical SPA buffers over time (see Figure 5).

In all four simulations of the picture naming and halt task, visual input starts at about 100 ms and is fully activated at 150 ms. The audio input starts at about 500 ms and is fully activated at 550 ms. The perceptual input buffer signals the presentation of new visual or audio input (row 1 in Figures 5–8). The visual input signal is represented in the visual input buffer directly by its concept representation while the audio input is represented in the audio input buffer directly by its phonological representation (rows 2 and 3 in Figures 5–8).
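The stimulus timing described above can be written as two input functions of time. This is a minimal sketch that uses the onset and full-activation times given in the text. The linear shape of the rise is an assumption, and stimulus offsets are not modeled because they are not specified here. The function names are hypothetical.

```python
def ramp(t_ms, onset_ms, full_ms):
    """Rise linearly from 0 at onset to 1 at full activation (assumed shape)."""
    if t_ms <= onset_ms:
        return 0.0
    if t_ms >= full_ms:
        return 1.0
    return (t_ms - onset_ms) / (full_ms - onset_ms)


def visual_input(t_ms):
    """Picture (target word) input: starts at 100 ms, full at 150 ms."""
    return ramp(t_ms, 100.0, 150.0)


def audio_input(t_ms):
    """Stop signal (distractor word) input: starts at 500 ms, full at 550 ms."""
    return ramp(t_ms, 500.0, 550.0)


# Sample both inputs every 25 ms over the first 600 ms of a trial.
timeline = [(t, visual_input(t), audio_input(t)) for t in range(0, 601, 25)]
for t, v, a in timeline:
    print(t, v, a)
```

With these values the distractor onset follows the picture onset by 400 ms, so the target word is already being processed in the production pathway when the stop signal arrives.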

Source Code of the Model and the Experiments

Nengo source code for this model can be downloaded at http://www.phonetik.phoniatrie.rwth-aachen.de/bkroeger/documents/ipynb_InnerSpeechLoop.zip. This zip file includes 4 scripts in IPython notebook format, representing Experiment 1a, Experiment 1b with semantic distortions, Experiment 1b with phonological distortions, and Experiment 2 (picture naming with halts). This source code requires Nengo (version 2.0; Bekolay et al., 2014), which can be downloaded at http://www.nengo.ca/download.