Phonological Planning during Sentence Production: Beyond the Verb

Schnur, Tatiana T.

doi:10.3389/fpsyg.2011.00319

ORIGINAL RESEARCH article

Front. Psychol., 04 November 2011

Sec. Psychology of Language

volume 2 - 2011 | https://doi.org/10.3389/fpsyg.2011.00319

This article is part of the Research Topic The Dynamics of Lexical Selection in Speech Production

Tatiana T. Schnur*

Department of Psychology, Rice University, Houston, TX, USA

The current study addresses the extent of phonological planning during spontaneous sentence production. Previous work shows that at articulation, phonological encoding occurs for entire phrases, but encoding beyond the initial phrase may be due to the syntactic relevance of the verb in planning the utterance. I conducted three experiments to investigate whether phonological planning crosses multiple grammatical phrase boundaries (as defined by the number of lexical heads of phrase) within a single phonological phrase. Using the picture–word interference paradigm, I found in two separate experiments a significant phonological facilitation effect to both the verb and noun of sentences like “He opens the gate.” I also altered the frequency of the direct object and found longer utterance initiation times for sentences ending with a low-frequency vs. high-frequency object, offering further support that the direct object was phonologically encoded at the time of utterance initiation. That phonological information for post-verbal elements was activated suggests that the grammatical importance of the verb does not restrict the extent of phonological planning. These results suggest that the phonological phrase is a unit of planning, where all elements within a phonological phrase are encoded before articulation. Thus, consistent with other action sequencing behavior, there is significant phonological planning ahead in sentence production.

Phonological Planning during Sentence Production: Beyond the Verb

When we speak, we perform a series of actions. We generate an idea. We select the words to convey that idea and order them according to the grammatical rules of the language. Finally, we retrieve the sounds in an order corresponding to those words and perform the motor movements to begin speaking. Because speech is produced sequentially over time, the idea we want to convey has to be translated into components that can be produced in linear order. Earlier stages of speech production involve larger representations (e.g., the idea) but at later stages the representation becomes smaller (i.e., it corresponds to the word being produced at that moment in time).

In advance of articulation, to what extent must the utterance be phonologically planned? Previous evidence suggests that planning is either fully incremental (one phonological word at a time; Meyer, 1996; Wheeldon and Lahiri, 1997; Levelt et al., 1999), or alternatively, encompasses larger units, such as a phrase (Smith and Wheeldon, 2004; Schnur et al., 2006). However, results are also consistent with the extent of planning being driven by the syntactic importance of the verb (Ferreira, 2000; Ferreira and Swets, 2002). In the following, I present evidence in spontaneous sentence production demonstrating that phonological planning is not incremental, is not restricted by the verb, but encompasses a full phonological phrase. Following evidence for other behaviors that require the sequencing of actions like motor planning (Rosenbaum, 2010) and problem solving (Catrambone, 1998), these results suggest that planning in sentence production is not incremental, but encompasses larger chunks of information. Planning larger phonological units may facilitate comprehension for the listener while helping the speaker avoid hesitations (Fox Tree, 1995).

Theories concerning the extent of phonological planning can be generally divided into one of two types, incremental or non-incremental. In one of the most fully articulated models of speech production, Levelt (1989) proposed that grammatical and phonological encoding is highly incremental in order to facilitate fluent speech: “Each processing component will be triggered into activity by a minimal amount of its characteristic input” (Levelt, 1989, p. 26). For planning at the phonological level, for articulation to begin, the minimal (and thus most incremental) unit required is a grouping of phonological segments known as a phonological word (PW), a unit of prosodic structure (Wheeldon and Lahiri, 1997; Levelt et al., 1999).

The creation of prosodic structure is part of phonological encoding and occurs when multiple words are produced in a sequence. A PW is a content word (which automatically receives stress) and any unstressed function word such as auxiliaries, determiners, conjunctions, and prepositions¹. For example, because [gAte] receives stress (stress indicated by capitalized vowels), it is a PW when produced alone and when combined with a non-stressed word like “the” or “a,” e.g., [the gAte]_PW. PWs are grouped into larger structures, such as phonological phrases. Phonological phrases are created from groups of PWs and derived from syntactic structure. Specifically, all PWs that fall within a major grammatical phrase up to the phrase’s right boundary are grouped together to form a phonological phrase (PP; the X-max algorithm, Selkirk, 1986; also see Levelt, 1989; Ferreira, 1991, 1993). For example, the sentence, “He opens the gate,” is comprised of two PWs which form one phonological phrase [[He Opens]_PW [the gAte]_PW]_PP. This is because the right boundary of the syntactic phrase is the end of the verb phrase (VP). Although “he” has special syntactic status because it is the lexical head of the subject noun phrase (NP), because it is a monosyllabic function word (produced with reduced stress), it is not identified as a head of phrase when phonological phrases are constructed (Selkirk, 1984, 1986, 1996)². As a result, when the PWs are combined, only one phonological phrase is created in [he Opens the gAte]_PP.
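
The grouping described above can be sketched in a few lines of code. This is only an illustration under simplifying assumptions, not the X-max algorithm of Selkirk (1986): each word is hand-tagged as a stressed content word or an unstressed function word, function words attach to the following content word, and the whole utterance is assumed to end at a single major phrase boundary (as in “He opens the gate”). The function names are hypothetical.

```python
def group_phonological_words(tokens):
    """Group (word, is_content) pairs into phonological words (PWs).

    Assumption: unstressed function words attach to the next content word.
    """
    pws, pending = [], []
    for word, is_content in tokens:
        pending.append(word)
        if is_content:          # a stressed content word closes the current PW
            pws.append(pending)
            pending = []
    if pending:                 # trailing function words join the last PW
        if pws:
            pws[-1].extend(pending)
        else:
            pws.append(pending)
    return pws


def bracket(pws):
    """Render the PWs inside one phonological phrase (PP)."""
    inner = " ".join("[" + " ".join(pw) + "]_PW" for pw in pws)
    return "[" + inner + "]_PP"


# "he" is treated as an unstressed function word, as in the text.
sentence = [("he", False), ("opens", True), ("the", False), ("gate", True)]
pws = group_phonological_words(sentence)
print(bracket(pws))
```

For the example sentence this yields two PWs inside one phonological phrase, mirroring the bracketing given in the text.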

Because of the type of utterances investigated, evidence of incremental phonological planning (a PW) also can be interpreted as non-incremental, extending to either a phonological phrase, or alternatively to the verb of the sentence. Meyer (1996) using picture naming in the picture–word interference paradigm, found faster naming times when the distractor word was phonologically related to the first object (e.g., the arrow) of a sentence, which is both a PW and phonological phrase (PP). Phonological planning was not found for the second object and phonological phrase (e.g., is next to the bag) which occurred after the verb (“[The Arrow]_PW/PP [is nExt to the bAg]_PP”). Similarly, Oppermann et al. (2010a) found that participants were faster to produce sentences when phonologically related distractors were related to the first PW and PP before the verb, but no effects were found for the object after the verb (e.g., “Die Maus frisst den Käse”: [The mOUse]_PW/PP [[EAts]_PW [the chEEse]_PW/PP; the verb was not tested). Thus, phonological planning encompassed both a PW and PP, and extended up to the verb of the sentence.

The non-incremental view of phonological planning suggests that more is planned in advance of articulation, where the unit of planning may be a phonological phrase (Miozzo and Caramazza, 1999; Alario and Caramazza, 2002; Alario et al., 2002; Costa and Caramazza, 2002; Smith and Wheeldon, 2004; Schnur et al., 2006; Damian and Dumay, 2007). In Schnur et al. (2006), participants were faster to produce sentences when distractors were phonologically related to the verb (e.g., [The Orange gIrl]_PP [wAlks]_PW/PP) showing that planning extended two phonological phrases. Smith and Wheeldon (2004) found that phonological planning encompassed a phonological phrase using a different paradigm to study sentence production. Participants were faster to begin speaking sentences when words were phonologically related to each other in a single phonological phrase (e.g., [The flAg and the bAg]_PP [move up]_PW/PP) compared to when objects were named across phrase boundaries ([The flAg]_PW/PP [mOves]_PW/PP [abOve the bAg]_PP). In both these cases, phonological planning extended multiple PWs, encompassing entire phonological phrases.

Evidence from speech errors is supportive of planning of a phonological phrase as sound errors largely occur within phrase boundaries. For example, based on an analysis of sound-exchange errors Garrett (1975) found that sound exchanges occur within a phrase, as opposed to across clauses, approximately 87% of the time (where a phrase is defined as a simple NP or VP). For the remaining 13% of errors, Garrett (1975) found that 10% occurred between a verb and its direct object NP (14 of 19 errors; e.g., “he was slowing shides”; Garrett, 1975) and the remaining 3% were between a verb and its subject NP (5 of 19 errors; e.g., “you should have your brÂkes ch^aked”; Garrett, 1975).
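
The counts in parentheses can be related to the percentages with simple arithmetic. The sketch below assumes that the 19 cross-phrase errors are exactly the 13% remainder, so the corpus total is inferred here rather than taken from Garrett (1975); the variable names are illustrative.

```python
# Counts of cross-phrase sound exchanges as given in the text.
cross_phrase = {"verb_object": 14, "verb_subject": 5}
n_cross = sum(cross_phrase.values())

# Assumption: these errors make up the 13% that were not within-phrase.
inferred_total = n_cross / 0.13

# Share of each type among all errors, and among cross-phrase errors only.
share_of_all = {k: round(100 * v / inferred_total) for k, v in cross_phrase.items()}
share_of_cross = {k: round(100 * v / n_cross) for k, v in cross_phrase.items()}

print(share_of_all)
print(share_of_cross)
```

Under that assumption the verb–object and verb–subject exchanges come to roughly 10% and 3% of all errors, or roughly three quarters and one quarter of the cross-phrase errors.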

An alternative non-incremental view is that lexical heads of phrase, in particular verbs, drive the extent of planning for sentence production. In a model of syntactic parsing, Ferreira (2000) proposed that verbs are encoded first in order to establish syntactic structure for the sentence (or, the verb is encoded early in sentence planning; cf. Bock, 1987; Griffin, 2000). Syntactic encoding of the verb is required to assign grammatical roles such as subject and object. Under this proposal, only when subject roles are assigned as a result of the syntactic encoding of the verb can phonological encoding begin (Ferreira and Swets, 2002). If we assume that grammatically encoded representations automatically access their phonological information (e.g., Levelt et al., 1999), then phonological planning may depend not on phonological phrase boundaries, but instead, on the syntactic encoding of the verb. Evidence of phonological planning in sentence production is consistent with this account as planning extended to the verb but not beyond, in all cases (e.g., Meyer, 1996; Smith and Wheeldon, 2004; Schnur et al., 2006; Oppermann et al., 2010a).

Thus, it is unclear what drives the extent of phonological planning in spontaneous sentence production. Sentence production evidence that planning is minimal (a PW) is also consistent with non-incremental planning. Under a non-incremental view, the unit of phonological planning is larger, either a phonological phrase, or driven by the syntactic importance of the verb. If the phonological phrase is a minimal unit of phonological encoding, then following Levelt’s (1989) notion of incrementality, all elements of that unit should be phonologically encoded before articulation. Alternatively, if the verb drives encoding, then phonological encoding may not extend beyond the verb, as previous results suggest.

Current Experiments

The goal of the experiments presented here was to determine whether for sentence production, phonological planning extends only up to the verb, or whether the unit of planning is a phonological phrase, where all components of the phrase are encoded before articulation. In the experiments presented here, participants described pictures in the format “He (She) opens the gate,” prosodically defined as a single phonological phrase. This utterance format allows the examination of whether all components of a single phonological phrase, especially components following the verb, are phonologically encoded at sentence onset.

Experiments 1 and 3 used the picture–word interference paradigm which elicits spontaneous speech while controlling the nature of what is produced. In picture–word interference, participants describe pictures, while ignoring visually presented distractor words. Picture naming is accelerated when the word sounds like the picture name in comparison to an unrelated word. This is referred to as the phonological facilitation effect (Lupker, 1982; Rayner and Springer, 1986; Meyer and Schriefers, 1991). This acceleration of speech may reflect the distractor’s influence on both the distractor and the to-be-produced target word-form/morphemes (Starreveld and La Heij, 1995, 1996; Starreveld, 2000). Alternatively, it may reflect the influence on the target segments that are inserted into the metrical frame (Meyer and Schriefers, 1991). Like many others (e.g., Meyer, 1991; Meyer and Schriefers, 1991; Costa and Caramazza, 2002; Roelofs, 2002; Damian and Dumay, 2007), I assume that the bulk of the phonological facilitation effect on a produced word is a result of the word’s phonological encoding before articulation, i.e., the retrieval of its phonological representation. Recently, Oppermann et al. (2010a) suggested that the phonological facilitation effect is a by-product of picture-viewing processes (further discussed following Experiment 1). Given the debate concerning the locus of the phonological facilitation effect, I provide converging evidence of advance planning of phonological representations using a simpler paradigm (picture description without word distractors) where I manipulate a lexical and phonological property of to-be-produced words, lexical frequency (see Experiment 2).

To my knowledge, this is the first set of experiments that addresses whether all components of a phonological phrase (including post-verb elements) are phonologically encoded upon sentence initiation. These experiments also offer an improvement over previous sentence production studies in approximating real-world speech by using more verbs (16–28) and objects (32) than in previous work³. Lastly, using two different chronometric methods, the experiments provide converging evidence of phonological encoding of post-verb representations.

Experiment 1

In Experiment 1, phonologically related and unrelated distractors to the verb were presented during production of utterances similar to “He opens the gate.” During sentence production, if phonological planning extends through the first PW (Meyer, 1996; Smith and Wheeldon, 2004; Oppermann et al., 2010a) and/or to the verb (Schnur et al., 2006), then participants should be faster to produce utterances in the presence of a phonologically related distractor to the verb in comparison to an unrelated distractor.

Method

Participants

Sixteen Harvard University undergraduate students were paid or received credit for an introductory psychology course. All were native English speakers and none participated in other experiments.

Materials

Twenty-eight line drawings depicting actions were used as target stimuli (modified from the materials used in Masterson and Druks, 1998; see Table A1 in Appendix). All pictures depicted an actor performing transitive actions. Although named as “he” or “she,” an actor was depicted as either a boy, girl, man, or woman so that 7 of the 28 actions fell into each category. No agent, action, or object shared initial phonemes for any picture. Each picture was presented with four distractor words: (a) phonologically related to the verb (e.g., client for climb); (b) phonologically unrelated to the verb (e.g., peer for climb); (c) a baseline condition (a string of 6 X’s printed inside each picture); and (d) a filler condition. The filler condition was not analyzed. The pictures and the distractors were paired so that each distractor appeared once in the phonologically related and once in the unrelated conditions. This design controlled for unintentional pairing effects between different sets of distractors beyond the phonological relatedness with the verbs. Distractors were chosen so that they did not sound similar to the agent or object of the sentence. Phonologically related distractors shared the first two segments with the verb of the picture. At the beginning of each block, four pictures were included as warm-up trials.
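
One way to obtain a pairing in which every distractor serves once as related and once as unrelated is to reassign the related distractors to other pictures. The sketch below is an assumption about how such a pairing could be built, not the procedure actually used: it rotates the distractor list and uses the first two letters as a rough stand-in for the first two segments. Only the climb/client/peer example comes from the text; the other verbs and distractors are hypothetical.

```python
def make_unrelated_pairs(related):
    """Map each verb to a distractor that is related to a different verb.

    related: dict of verb -> phonologically related distractor.
    Rotates the distractors until no verb shares its first two letters
    (an orthographic proxy for segments) with its new distractor.
    """
    verbs = list(related)
    words = [related[v] for v in verbs]
    for shift in range(1, len(verbs)):
        rotated = words[shift:] + words[:shift]
        if all(v[:2] != w[:2] for v, w in zip(verbs, rotated)):
            return dict(zip(verbs, rotated))
    raise ValueError("no rotation yields fully unrelated pairs")


# Hypothetical mini-set; only climb/client and peer appear in the text.
related = {"climb": "client", "peel": "peer", "open": "opal", "kick": "kitten"}
unrelated = make_unrelated_pairs(related)
print(unrelated)
```

Because the same words appear in both conditions, any difference between related and unrelated trials cannot be attributed to properties of the distractor words themselves.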

The experimental stimuli were presented in 4 different blocks of 32 trials each for a total of 128 trials. The trials were randomized so that (a) the same picture did not occur twice in the same block; (b) the same distractor condition occurred no more than three times in a row; and (c) no agent occurred more than twice in a row. Care was taken so that no item from one trial was semantically or phonologically similar to an item in the following trial. Four different block orders were designed and presented to participants according to a Latin-square design.
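
The ordering constraints above lend themselves to a simple shuffle-and-check procedure. The following is a minimal sketch under stated assumptions, not the software used in the study: conditions are rotated over blocks with a hypothetical scheme, warm-up trials are left out, and the check for semantic or phonological similarity between consecutive trials is omitted. All names are illustrative.

```python
import random

CONDITIONS = ["related", "unrelated", "baseline", "filler"]
AGENTS = ["boy", "girl", "man", "woman"]


def build_blocks(n_pictures=28, n_blocks=4):
    """Each picture appears once per block and once per condition overall."""
    blocks = []
    for b in range(n_blocks):
        block = []
        for pic in range(n_pictures):
            block.append({
                "picture": pic,
                "agent": AGENTS[pic % 4],  # 7 pictures per agent type
                "condition": CONDITIONS[(pic + pic // 4 + b) % 4],
            })
        blocks.append(block)
    return blocks


def max_run(values):
    """Length of the longest run of identical consecutive values."""
    longest = run = 1
    for prev, cur in zip(values, values[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def order_block(block, rng, max_tries=10_000):
    """Reshuffle until no condition repeats >3 times and no agent >2 times."""
    for _ in range(max_tries):
        trial_order = block[:]
        rng.shuffle(trial_order)
        if (max_run([t["condition"] for t in trial_order]) <= 3
                and max_run([t["agent"] for t in trial_order]) <= 2):
            return trial_order
    raise RuntimeError("no valid order found")


rng = random.Random(2011)  # fixed seed so the orders are reproducible
ordered_blocks = [order_block(block, rng) for block in build_blocks()]
```

Rejection sampling is used here because the constraints are loose enough that a valid order turns up after a handful of shuffles; tighter constraints would call for a constructive method.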

The distractors were shown in 28-point boldface capital letters in Geneva font, superimposed on the pictures. Pictures were centered at fixation. Word position varied randomly in the region around fixation to prevent participants from systematically fixating the portion of the picture not containing the distractor. However, for an individual picture, the position of all its distractors was the same.

Before the experiment proper, participants had two practice series. In the first series participants were presented with all the pictures with a series of X’s printed inside each picture, to train the subject to use the correct name for each picture. In the second practice series, they were presented with all the pictures with practice distractors printed inside every picture. These practice distractors were not used during the experiment.

Apparatus

The pictures were presented on a Macintosh using the PsychLab program (Bub and Gym, University of Victoria, British Columbia, Canada). Response times (RTs) were measured to the nearest millisecond by means of a voice key (KOSS headset/CMU voice box) from appearance of the picture until the voice key was triggered.

Procedure

Participants were asked to produce complete sentences, naming the subjects (using the pronouns “he” or “she”), the action (using the third person singular verb form) and the object (using the object’s name; e.g., “He opens the gate”). Participants were tested individually in a darkened testing room. They were instructed to name pictures as quickly and as accurately as possible. When participants made mistakes during the practice session, they were asked to name the picture correctly. Each trial proceeded as follows: A fixation point (+) was shown for 700 ms, followed by presentation of the stimulus 300 ms later. There was a 2000 ms pause between trials. The experimenter remained in the testing room to record incorrect responses and voice key malfunctions. A session lasted approximately 25 min.

Analyses

Three types of responses were classified as errors: (a) production of the wrong word; (b) verbal disfluencies (stuttering, utterance repairs, etc.); and (c) voice key malfunctions. Responses faster than 300 ms or more than 3 SDs from a participant’s condition mean were also eliminated. Separate analyses were carried out on the RTs using either the means per subject or means per item as dependent variables yielding F1 and F2 statistics, respectively. One variable was analyzed: Type of distractor (phonologically related, unrelated, baseline). Type of distractor was considered a within-subject and within-item variable. I report three different ANOVAs: an error analysis, a baseline analysis, and the principal analysis. For the error analysis, all three conditions were included. The baseline analysis compared RTs for the unrelated condition vs. baseline. The baseline condition was not included in further analyses. The principal analysis compared RTs for the phonologically related vs. unrelated conditions.
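
The trimming and aggregation steps can be sketched as follows. This is a minimal illustration, not the analysis script used in the study. It assumes that the 300 ms floor is applied first and that the mean and SD are then computed on the remaining RTs within one participant-by-condition cell; the source does not specify the order. The dictionary keys are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def trim_rts(rts, floor_ms=300, sd_cutoff=3.0):
    """Trim the RTs of one participant in one condition.

    Drops RTs faster than floor_ms, then RTs more than sd_cutoff SDs
    from the mean of the remaining values (assumed order of steps).
    """
    kept = [rt for rt in rts if rt >= floor_ms]
    if len(kept) < 2:
        return kept
    m, sd = mean(kept), stdev(kept)
    return [rt for rt in kept if abs(rt - m) <= sd_cutoff * sd]


def cell_means(trials, unit):
    """Mean RT per (unit, condition); unit is 'subject' (F1) or 'item' (F2)."""
    cells = defaultdict(list)
    for t in trials:
        cells[(t[unit], t["condition"])].append(t["rt"])
    return {key: mean(rts) for key, rts in cells.items()}


# Made-up trials for illustration only.
trials = [
    {"subject": 1, "item": "climb", "condition": "related", "rt": 780},
    {"subject": 1, "item": "open", "condition": "related", "rt": 800},
    {"subject": 2, "item": "climb", "condition": "related", "rt": 820},
]
by_subject = cell_means(trials, "subject")  # input to the F1 analysis
by_item = cell_means(trials, "item")        # input to the F2 analysis
```

Aggregating the same trials once over subjects and once over items is what yields the separate F1 and F2 statistics.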

Results and Discussion

Table 1 reports a summary of the data. Figure 1 shows the magnitude of the phonological facilitation effect and the 95% confidence intervals. The naming latencies from two items were removed because they elicited a high percentage of errors (more than 30%). Error rates consisted of 12.2% of the data before outliers were removed and 12.5% of the data after outliers were eliminated. There was no significant difference in the number of errors produced across the three conditions (Fs < 1).

Table 1. Experiment 1. Mean RTs (ms), SD, and percentage of errors (Error %), for phonologically related, unrelated, and baseline conditions.

Figure 1. Experiments 1–3 sentence onset difference scores and 95% confidence intervals. Experiment 1 distractors phonologically related (rel) and unrelated (ur) to the verb were displayed. Experiment 2 sentences produced with either low (lo) or high (hi) frequency objects. Experiment 3 distractors phonologically related and unrelated to the noun were displayed.

Response times in the phonologically related condition (782 ms) were significantly faster in comparison to the unrelated condition [806 ms; F1(1, 15) = 5.84, MSE = 93862, p = 0.02; F2(1, 25) = 4.29, MSE = 109404, p = 0.04]. The baseline (XXX) condition (792 ms) did not produce statistically faster naming latencies in comparison to the unrelated condition [806 ms; F1(1, 15) = 3.02, MSE = 27752, p = 0.10; F2(1, 25) = 1.74, MSE = 37721, p = 0.19], although the effect (14 ms) was in the predicted direction.
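
Because each of these comparisons involves only two levels of the distractor variable, the within-subject (or within-item) F ratio equals the square of a paired t statistic, with 1 and n − 1 degrees of freedom. This is consistent with the degrees of freedom reported: 16 participants give F1(1, 15), and the 26 items left after two were removed give F2(1, 25). A minimal sketch with made-up condition means (hypothetical values, not the study’s data):

```python
from math import sqrt
from statistics import mean, stdev


def paired_f(cond_a, cond_b):
    """F ratio and degrees of freedom for a two-level within-unit factor.

    cond_a and cond_b hold one mean RT per subject (or per item),
    listed in the same order. F is the squared paired t statistic.
    """
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t * t, (1, n - 1)


# Hypothetical per-subject means in ms, for illustration only.
unrelated = [810.0, 795.0, 830.0, 790.0]
related = [780.0, 790.0, 800.0, 775.0]
f_value, df = paired_f(unrelated, related)
print(round(f_value, 2), df)
```

The same function applies to the F2 analysis when the two lists hold per-item means instead of per-subject means.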

The experiment showed that production of a sentence was facilitated with a distractor phonologically related to the verb compared to an unrelated distractor. This suggests the verb, part of the first PW, was phonologically encoded before articulation. Phonological encoding of the first PW before articulation of sentences is consistent with previous evidence that phonological planning extends at least one PW in advance (Levelt, 1989; Wheeldon and Lahiri, 1997) and/or planning extends to the verb in sentence production (Smith and Wheeldon, 2004; Schnur et al., 2006; Oppermann et al., 2010a).

Having replicated previous results showing phonological planning through the first PW and verb in a different set of sentences, the critical question to be addressed is whether in sentence production phonological planning is non-incremental, extending an entire phonological phrase. If phonological planning in sentence production is defined by phonological phrase boundaries then I expect phonological effects to the direct object NP at articulation of the sentence for sentences like, “He opens the gate.” If the PW and/or verb limit the extent of phonological planning, I expect no phonological effects for representations following the verb.

Recently, Oppermann et al. (2010a) suggested that the phonological facilitation effect in the picture–word interference paradigm is not a measure of the retrieval of phonological representations and thus is an invalid method for measuring advance phonological planning in sentence production. Specifically, they suggest that the phonological facilitation effect is a by-product of picture-viewing processes. The evidence for this position is mixed, and is derived from measuring effects of unnamed pictures on the naming latencies of pictures or words. In some cases, unnamed pictures phonologically related to the to-be-named picture or word facilitate naming which suggests that participants may automatically access the phonological representations of unnamed pictures (Morsella and Miozzo, 2002; Navarrete and Costa, 2005; Humphreys et al., 2010). However, this effect may only occur when pictures are semantically related (Oppermann et al., 2010b), thematically related (Oppermann et al., 2008), as a result of attentional effects (Roelofs, 2008; Malpass and Meyer, 2010), or strategic effects (Bloem et al., 2004). In some cases, the effect is not found at all (Jescheniak et al., 2009).

Given the unique situations required to create the picture–picture phonological effect, I am in agreement with others (e.g., Jescheniak et al., 2009; Oppermann et al., 2010b) that simply viewing pictures does not activate their phonological representations. However, to provide further evidence against this possibility I used an additional measure of advance phonological planning during sentence production. In Experiment 2, participants simply described pictures (no word distractors presented) and the frequency of the object name was varied. Lexical frequency effects are a “litmus test” for evidence of phonological encoding (Kittredge et al., 2008), but may reflect retrieval at both lexical and phonological levels (Knobel et al., 2008; Strijkers et al., 2009). Thus, in Experiment 3, I returned to the phonological facilitation effect to provide converging evidence of the phonological encoding of the direct object NP.

Experiment 2

In Experiment 2, participants produced sentences similar to Experiment 1, “He opens the gate,” consisting of a phonological phrase with two PWs. Phonological encoding of the end of the phonological phrase, the direct object NP, was investigated using the frequency effect. I tested whether varying the frequency of the direct object in the production of transitive sentences [e.g., He opens the door (HF) vs. He opens the gate (LF)] affected RTs in initiating sentence production.

It is well established that the frequency of a picture’s name affects the speed of naming the picture (Oldfield and Wingfield, 1965; Jescheniak and Levelt, 1994; Griffin and Bock, 1998; Levelt et al., 1998). Naming of low-frequency items is slower than the naming of high-frequency items. There is some controversy as to the production level on which the frequency effect in picture naming occurs. Three possible loci for the frequency effect are proposed: at the recognition (input) level, the lexical level, and at the phonological level.

Low-frequency objects may be named more slowly than high-frequency objects because a low-frequency object takes longer to recognize, as the object itself does not appear frequently in our environment (Kroll and Potter, 1984). Using an object-decision task where no vocal response was required (participants pressed a button as to whether a displayed object represented a real or a non-existent object) Kroll and Potter (1984) found responses were slower to low-frequency objects compared to high-frequency objects. Assuming the task was sensitive to recognition processes, these results suggest that the frequency effect is based on the speed at which an object is recognized. However, other researchers using similar paradigms where a vocal response was not required did not replicate this effect (Wingfield, 1968; Jescheniak and Levelt, 1994; Griffin and Bock, 1998; Levelt et al., 1998; Meyer et al., 1998). Using a different paradigm (delayed picture naming) Almeida et al. (2007) further confirmed that lexical frequency effects in naming do not arise from input level processes.

In general, the frequency effect is thought to be located either at both lexical and phonological levels (Knobel et al., 2008; Strijkers et al., 2009) or primarily at the phonological level (Jescheniak and Levelt, 1994; Jescheniak et al., 2003; Kittredge et al., 2008; although see Caramazza et al., 2001; Caramazza et al., 2004). Convincing evidence that the bulk of the frequency effect arises during selection of the phonological representation comes from speech error data. In a large-scale study of 50 speakers with acquired language deficits, Kittredge et al. (2008) found that phonological errors during picture naming (e.g., pillow for pineapple) occurred more often when pictures had low-frequency vs. high-frequency names (also seen in a smaller-scale study of speech errors by Stemberger and MacWhinney, 1986). Importantly, Kittredge et al. found that this effect of picture name frequency was significantly larger for phonological than semantic errors (e.g., apricot for pineapple) which localizes the bulk of the effect at the level of phonological retrieval.

To test whether the direct object NP of a transitive sentence is phonologically encoded before articulation I manipulated the lexical frequency of the object (high vs. low). If the direct object is phonologically encoded before articulation begins, I expected RT differences between sentences with high- and low-frequency direct objects, where high-frequency completions should be produced more quickly than low-frequency ones.

To rule out a conceptual/input interpretation of the frequency effect I included several control experiments (described in Experiment 2 Materials). First, the degree to which an object is consistently named by one name (name agreement) has an effect on the time it takes to name an object independent of lexical frequency (e.g., Snodgrass and Yuditsky, 1996; Barry et al., 1997; Ellis and Morrison, 1998). To address this concern, name-agreement probabilities were collected for both the action and the direct object with which it was depicted. Second, RTs may be slower for actions with low-frequency objects in comparison to actions with high-frequency objects because of some inherent unnaturalness of the action and object occurring together. To address this concern, participants judged the “naturalness” of action–object pairings. I also measured object-recognition latencies (following Jescheniak and Levelt, 1994) to control for a pre-lexical locus of the frequency effect. Lastly, to ensure that the objects could reliably produce a frequency effect independent of sentence context, participants viewed the same pictures, but named the direct object in isolation (Experiment 2b).