Constraints on the Transfer of Perceptual Learning in Accented Speech

Eisner, Frank; Melinger, Alissa; Weber, Andrea

doi:10.3389/fpsyg.2013.00148

ORIGINAL RESEARCH article

Front. Psychol., 01 April 2013

Sec. Cognition

Volume 4 - 2013 | https://doi.org/10.3389/fpsyg.2013.00148

This article is part of the Research Topic Ecological aspects of speech perception.

Frank Eisner¹*

Alissa Melinger²

Andrea Weber¹,³

¹Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
²School of Psychology, University of Dundee, Dundee, UK
³Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, Nijmegen, Netherlands

The perception of speech sounds can be re-tuned through a mechanism of lexically driven perceptual learning after exposure to instances of atypical speech production. This study asked whether this re-tuning is sensitive to the position of the atypical sound within the word. We investigated perceptual learning using English voiced stop consonants, which are commonly devoiced in word-final position by Dutch learners of English. After exposure to a Dutch learner’s productions of devoiced stops in word-final position (but not in any other positions), British English (BE) listeners showed evidence of perceptual learning in a subsequent cross-modal priming task, where auditory primes with devoiced final stops (e.g., “seed”, pronounced [siːtʰ]) facilitated recognition of visual targets with voiced final stops (e.g., SEED). In Experiment 1, this learning effect generalized to test pairs where the critical contrast was in word-initial position, e.g., auditory primes such as “town” facilitated recognition of visual targets like DOWN. Control listeners, who had not heard any stops by the speaker during exposure, showed no learning effects. The generalization to word-initial position did not occur when participants had also heard correctly voiced, word-initial stops during exposure (Experiment 2), or when the speaker was a native BE speaker who mimicked the word-final devoicing (Experiment 3). The readiness of the perceptual system to generalize a previously learned adjustment to other positions within the word thus appears to be modulated by distributional properties of the speech input, as well as by the perceived sociophonetic characteristics of the speaker. The results suggest that the transfer of pre-lexical perceptual adjustments that occur through lexically driven learning can be affected by a combination of acoustic, phonological, and sociophonetic factors.

Introduction

Laboratory experiments in spoken-language research often use highly stylized samples of speech – recorded without background noise or interference from other talkers, spoken in canonical form and free of mispronunciations, accents, changes in speaking rate, or emotional tone. Researchers aim to keep those factors as constant and controlled as possible in order to avoid sources of variance in the data they gather. There is no doubt that fluctuations in the acoustic environment as well as inter- and intra-speaker variation have an effect on perception, which is generally detrimental (e.g., Dupoux and Green, 1997; Mullennix et al., 2002; Peelle and Wingfield, 2005; Adank et al., 2009; Bent et al., 2009). However, listeners can usually learn to cope with such sources of variance, and a fairly recent body of research has sketched out the perceptual mechanisms that underlie this adaptability. This research has shown that, although sources of variability in speech cause listeners processing problems initially, these problems can often be overcome, sometimes quite rapidly, through perceptual learning. Perceptual learning allows listeners to adjust to variation in speech in a variety of difficult listening situations, including spectral and temporal degradation of the signal, accents, talker variability, and talker-idiosyncratic mispronunciations (e.g., Nygaard et al., 1994; Rosen et al., 1999; Norris et al., 2003; Bradlow and Bent, 2008; Adank and Janse, 2009). This learning may be guided by a variety of cues that are present in speech, such as visual information from the face of the talker (Bertelson et al., 2003), lexical and phonotactic knowledge (Norris et al., 2003; Cutler et al., 2008), and contingencies in acoustic-phonetic cues (Idemaru and Holt, 2011).

This study investigated lexically driven perceptual learning – one particular mechanism by which pre-lexical representations of speech sounds can be rapidly adjusted (Norris et al., 2003) and which has been studied quite extensively (see Samuel and Kraljic, 2009 for a recent review). The learning takes place when listeners repeatedly encounter a speaker’s consistently atypical productions of a speech sound, and when those atypical productions are produced in a context that allows the listener to infer the sound’s identity. As the outcome of learning, the perceptual category boundary for that speech sound is adjusted. For example, in the study by Norris et al. (2003), after listeners had heard a fricative that was midway in between /s/ and /f/ in the context of words that biased the interpretation of that sound toward /f/, they shifted their /s/-/f/ category boundary toward /f/. When a different group heard that same ambiguous fricative embedded in words that biased its interpretation toward /s/, they then showed a category boundary shift toward /s/.

Perceptual learning is an essential mechanism that allows the listener to adjust to unusual or unexpected characteristics of the speech input. However, for perceptual learning to produce optimal outcomes, it must find a balance between, on the one hand, being robust and stable in the face of constant variability between tokens, and on the other hand, being flexible enough to adapt to systematic and predictable differences. How it achieves this balance is at the heart of our investigation. To maximize stability, the learning system may be very conservative, never generalizing beyond the specific situations that the listener has encountered. Several studies have found that learning can indeed be specific, for example, for a particular talker (Eisner and McQueen, 2005; Kraljic and Samuel, 2005). However, it seems that maximum stability is not a principal property of perceptual learning under all conditions, as learned adjustments can generalize beyond the characteristics of the exposure items, for example to a different place of articulation or to other words containing that sound (Kraljic and Samuel, 2006; McQueen et al., 2006; Maye et al., 2008). It is as yet unclear under what circumstances a learned adjustment will generalize or along which dimensions it will be generalized. Here, we further investigated this basic property of lexically driven perceptual learning by testing (1) whether the change in pre-lexical representation can encode information about the position of the critical sound within a word, and (2) whether learning is affected by sociophonetic characteristics of the talker.

With respect to the first question, linking phonological information with the change in category boundary might protect the perceptual system from overgeneralizing learning in cases where the category change only occurs in a specific position. Two previous studies have reported conflicting results regarding this question. Investigating cross-positional perceptual learning of ambiguous fricatives, Jesse and McQueen (2011) found full transfer of learning from word-final position to the initial position of nonsense syllables. However, Dahan and Mead (2010), using spectrally degraded speech, reported position-sensitive generalization of learning. In their study, consonants were more readily identified when they occurred in the same word position during learning and test than when they occurred in different positions. In the current study, the /d/-/t/ stop contrast was chosen to investigate the question of position-specific learning, because /d/ is commonly devoiced word-finally, but not in other positions, in Dutch-accented English (Warner et al., 2004). Learning in such a case would be beneficial for recognizing words in which the affected phoneme category occurs in that position, but might in fact hinder word recognition when applied in positions where it is not warranted. The generalizability of learning about a position-specific accent feature was therefore tested in the context of word recognition.

With respect to the second question, the readiness of listeners to adjust a category boundary, and the generalizability of that potential learning, may be affected by the listener’s expectation regarding the speaker, specifically the likelihood that the speaker produces idiosyncratic pronunciations. The perceived identity of a speaker, or their membership of an accent community, is known to affect speech perception on a pre-lexical level (Hay et al., 2006a,b; Hay and Drager, 2010). Listeners have also been shown to process syntactic errors differently when they are produced by a speaker who has a global foreign accent, compared to a speaker with a native accent (Hanulíková et al., 2012). This study tested whether such a sociophonetic effect may exist in perceptual learning at the acoustic-phonetic level, by comparing listener adjustment to word-final devoicing in foreign-accented and native-accented speech.

In three experiments using an exposure-test paradigm (McQueen et al., 2006), British English (BE) listeners first learned to adjust to word-final stop consonant devoicing in the context of performing a lexical-decision task and were then tested in a cross-modal priming task to establish whether there was a benefit of exposure in recognizing a new set of word-finally devoiced words. The experiments also included a condition in which word recognition was tested with word-initially unvoiced sounds in order to test for potential generalization of learning across word positions.

Experiment 1

Materials and Methods

Participants

Twenty-four undergraduate students who were enrolled at the University of Dundee participated in exchange for course credit. All participants were native speakers of English, did not speak Dutch, and reported no hearing-related disorders. Participants gave informed consent before taking part in the study.

Speech materials

Stimuli were made from recordings of a female native Dutch speaker who had studied English at high-school level and who had not spent time living in an English-speaking country. The speaker had a good command of English with a noticeable Dutch accent, characterized not only by word-final stop devoicing but also by other typical deviations such as substitution of alveolar stops for dental fricatives, velar fricatives for velar stops, and variation in vowel quality (Flege, 1997). Word-final devoicing was produced naturally without specific instruction, and no other substitutions occurred in the critical experimental items. Word lists were read out in a sound-damped booth, recorded with 48 kHz/16-bit sampling and stored digitally for further editing using Praat (Boersma and Weenink, n.d.). The lists consisted of 252 items in total to be used in the exposure phase: 32 training words consisting of three to four syllables ending in /d/ (e.g., “overload”); 32 replacement words, matched to the training items in syllabic length and average frequency in CELEX (Baayen et al., 1995) (e.g., “surgery”); 32 /d/-initial words with three to four syllables (e.g., “delivery”); 64 filler words; and 92 pseudowords. Except for the word-final and word-initial /d/ in the two critical conditions, there were no other voiced stops and no other alveolar stops in the words recorded for the exposure phase. However, there were some instances of the voiced and unvoiced affricates /dʒ/ and /tʃ/ distributed across the exposure conditions. The list for the test phase included 240 monosyllabic items: 30 minimal pairs of /d/-final items (e.g., “seed”) and /t/-final items (e.g., “seat”); 30 minimal pairs of /d/-initial items (e.g., “down”) and /t/-initial items (e.g., “town”); and 120 monosyllabic filler items which did not contain stop consonants or voiced fricatives. An analysis of some of the acoustic cues that are affected by the word-final devoicing is presented in the “Acoustical analysis” section below.
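The item counts reported above can be checked with a few lines of arithmetic. The following is only a bookkeeping sketch: the counts are taken from the text, whereas the dictionary keys are our own shorthand labels and not terms used in the study.

```python
# Item counts as reported in the text; the labels are hypothetical shorthand.
exposure_items = {
    "d_final_training": 32,   # e.g., "overload"
    "replacement": 32,        # e.g., "surgery"
    "d_initial": 32,          # e.g., "delivery"
    "filler": 64,
    "pseudoword": 92,
}

test_items = {
    "final_minimal_pairs": 30 * 2,    # 30 pairs such as "seed"/"seat"
    "initial_minimal_pairs": 30 * 2,  # 30 pairs such as "down"/"town"
    "filler": 120,
}

exposure_total = sum(exposure_items.values())
test_total = sum(test_items.values())

print("exposure items recorded:", exposure_total)
print("test items recorded:", test_total)
```

Both sums agree with the totals stated in the text (252 exposure items and 240 test items).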

Design and procedure

The study employed a between-subjects design in which the experimental group heard devoiced alveolar stops during an initial exposure phase, whereas the control group heard matched control items without stops. Both groups were then tested immediately afterward with a cross-modal priming task (following McQueen et al., 2006; Sjerps and McQueen, 2010) in which the critical conditions crossed prime type (related vs. unrelated), position of the alveolar stop (word-initial vs. word-final), and voicing of the stop (voiced vs. voiceless).

During the exposure phase, participants were presented with spoken words and pseudowords and instructed to indicate after each item with a yes/no button press whether the item they had heard was an English word. The experimental and control groups both heard the 92 pseudowords and the 64 filler words, none of which contained stop consonants. In addition, the experimental group heard the 32 /d/-final items, which were replaced by the 32 matched replacement items for the control group. Three equivalent pseudo-randomized orders were made for each of the two groups, and rotated across subjects. During the test phase, participants heard auditory primes paired with visual target words and pseudowords presented in succession; the task was to indicate with a yes/no button press whether the visual target was an English word. The test phase was identical for both groups.

Of the 60 words in each set of minimal pair items (/d/- and /t/-final, /d/- and /t/-initial conditions) 40 were assigned in equal proportions as prime-target pairs in a related condition (e.g., devoiced “seed” – SEED, “seat” – SEAT) and an unrelated condition (e.g., “smile” – SEED, “smile” – SEAT; visual targets are henceforth represented in capitals). The remaining 20 were assigned to a pseudoword condition (e.g., “seed” – DRAGE, “seat” – DRAGE). In addition, 80 of the recorded filler words were paired with pseudoword targets (e.g., “gin” – DORSE), and the remaining 40 were paired with unrelated word targets (e.g., “ring” – MYTH). Six lists were constructed in which the assignment of all critical words to the related, unrelated, and pseudoword conditions was counterbalanced and which were otherwise identical. All lists thus consisted of 240 trials in which half the targets were pseudowords, and across lists the items from the four conditions of interest were equally likely to occur in a related, unrelated, or pseudoword pair.
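The composition of a single test list can be tallied as a quick consistency check. This is a sketch under one assumption: we read "assigned in equal proportions" as 20 related and 20 unrelated pairs per minimal-pair set. The function and label names are hypothetical.

```python
from collections import Counter


def tally_test_list():
    """Count target types in one test list, following the description in the text."""
    counts = Counter()
    # Two sets of minimal-pair items (stop in final vs. initial position),
    # 60 words each: 40 with word targets, 20 with pseudoword targets.
    for _position in ("final", "initial"):
        counts["related_word"] += 20    # assumed split of the 40 word targets
        counts["unrelated_word"] += 20  # assumed split of the 40 word targets
        counts["pseudoword"] += 20
    # Filler primes: 80 paired with pseudoword targets, 40 with unrelated words.
    counts["pseudoword"] += 80
    counts["unrelated_word"] += 40
    return counts


counts = tally_test_list()
total_trials = sum(counts.values())
print(dict(counts), total_trials)
```

Under this reading, each list has 240 trials, of which 120 have pseudoword targets, matching the statement that half the targets were pseudowords.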

Stimuli were presented using Presentation software (NeuroBehavioral Systems Inc.) running on a laptop computer. Audio stimuli were delivered via Sennheiser HD280 headphones; visual prompts were shown for 1.4 s in white Helvetica font on a black background in the center of the computer screen. Responses were made on a custom response box with two buttons labeled “yes” and “no.” Half of the participants made “yes” responses with their dominant hand. The inter-onset interval in the lexical-decision task was 2.8 s. In the priming task, targets were presented immediately at the offset of the prime and the inter-trial interval was 1.4 s. Reaction times (RTs) in both tasks were measured from the offset of the auditory stimulus. Trials either ended with a button press or timed out 1.5 s after target onset.

Results and Discussion

Analysis of the responses in the lexical-decision task showed that on average, on 78% of the trials in which a finally devoiced item occurred, listeners in the experimental group responded by pressing the “yes” button, indicating that those items were largely judged to be real words. For the priming task, RTs from trials of interest with correct responses (always “yes”; mean error rate: 3%) were analyzed in a mixed ANOVA with the between-subjects factor group (experimental vs. control) and within-subjects factors prime type (related vs. unrelated) and word type (/d/-final, /d/-initial, /t/-final, and /t/-initial). RTs were analyzed separately in a subject analysis (F1) and an item analysis (F2). Priming effects (unrelated − related RTs) are shown in Figure 1 (see Table A1 in Appendix for mean RTs). The three-way interaction between group, prime type, and word type was significant (F1(3,165) = 2.67, p < 0.05; F2(3,277) = 2.64, p < 0.05) and was followed up by four ANOVAs with the factors group and prime type for each word type, as potential effects of exposure would be revealed as an interaction of group and prime type. This interaction was significant for the /d/-final (F1(1,22) = 6.48, p < 0.05; F2(1,38) = 4.43, p < 0.05) and /d/-initial (F1(1,22) = 9.62, p < 0.01; F2(1,38) = 5.86, p < 0.05) word types, reflecting larger priming effects in the experimental group than in the control group on pairs such as [siːtʰ] – SEED and [taʊn] – DOWN, but not significant for the /t/-final and /t/-initial word types, that is, pairs such as [siːtʰ] – SEAT and [taʊn] – TOWN (Fs < 1). Post hoc one-tailed t-tests for the significant interactions showed that priming effects were significant in the experimental group for /d/-final (t1(1,11) = −3.99, p < 0.005; t2(1,19) = −3.45, p < 0.005) and /d/-initial items (t1(1,11) = −4.80, p < 0.001; t2(1,19) = −4.93, p < 0.001), but not in the control group (ps > 0.05). An analogous three-way ANOVA carried out on error rates did not reveal a significant three-way interaction of group, prime type, and word type, and was thus not followed up further (Fs < 1).
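The priming effect used as the dependent measure (mean RT after unrelated primes minus mean RT after related primes, on correct trials only) can be sketched as follows. The trial records, field names, and RT values below are invented purely for illustration and are not data from the study; the sketch covers only the difference score, not the ANOVAs.

```python
from statistics import mean


def priming_effect(trials, group, word_type):
    """Mean unrelated RT minus mean related RT (ms), using correct trials only."""
    def mean_rt(prime_type):
        rts = [
            t["rt"]
            for t in trials
            if t["group"] == group
            and t["word_type"] == word_type
            and t["prime_type"] == prime_type
            and t["correct"]
        ]
        return mean(rts)

    return mean_rt("unrelated") - mean_rt("related")


# Hypothetical trials (RTs in ms); NOT data from the study.
base = {"group": "experimental", "word_type": "d_final"}
trials = [
    {**base, "prime_type": "related", "rt": 500, "correct": True},
    {**base, "prime_type": "related", "rt": 520, "correct": True},
    {**base, "prime_type": "unrelated", "rt": 560, "correct": True},
    {**base, "prime_type": "unrelated", "rt": 580, "correct": True},
    {**base, "prime_type": "unrelated", "rt": 900, "correct": False},  # error trial, excluded
]

effect = priming_effect(trials, "experimental", "d_final")
print(effect)
```

A positive value indicates facilitation by the related prime. An effect of exposure would then appear as a larger difference score in the experimental group than in the control group for a given word type.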

Figure 1. Priming effects (reaction times in the unrelated minus the related prime type) are shown for each of the four word types, for experimental and control groups, in Experiment 1. The group differences for the /d/-final and /d/-initial word types illustrate learning for word-final position and generalization to word-initial position, respectively. Starred differences denote a significant interaction of group and prime type (related vs. unrelated).

The results of Experiment 1 indicate that native English speakers could adjust to word-final devoicing in Dutch-accented English speech after having been exposed to 32 devoiced /d/-final words. Participants in the experimental group were faster to decide that a visual target such as SEED is a word when it was preceded by a devoiced production of “seed” than they were when it was preceded by an unrelated item. This priming effect suggests that those participants had learned that the devoiced word-final /d/ in “seed” is an acceptable production for that phoneme category for that speaker, and that the devoiced prime was thus sufficient for activating the intended lexical item and facilitating recognition of the target. In contrast, for the control group, which did not have an opportunity to learn about this aspect of the speaker’s accent, the devoiced /d/ was not a good match for the category; hence no such priming effect occurred for those participants. A likely source of information to drive this learning effect is lexical knowledge (Norris et al., 2003): because the items in which the devoiced stop occurred in the exposure phase were three to four syllables in length, and because those items did not form minimal pairs with voiceless word-final stops, there was overwhelming lexical evidence for the experimental group that the devoicing was an unusual pronunciation, which could override the mismatch at the acoustic-phonetic level. The learning in Experiment 1 was not restricted to word-final items, as the experimental group showed a priming effect for word pairs in which a voiceless alveolar stop consonant in initial position (such as in “town”) produced priming in targets with a voiced initial stop consonant (such as DOWN).

However, neither group of listeners heard the speaker produce canonically voiced word-initial stops in the exposure phase. (Such stops pose no problem for Dutch learners of English, because Dutch also distinguishes voicing in onset position.) Previous research on perceptual learning has shown that learning to adjust to an unusual sound may be blocked when there is evidence that the speaker can produce the sound in question correctly (Kraljic et al., 2008; Kraljic and Samuel, 2011). The finding in Experiment 1 raises the question of whether a similar blocking process affects the transfer of learning, that is, whether generalization of learning to word-initial position would also occur if canonically voiced, word-initial stops were included in the exposure phase. This question was addressed in Experiment 2.