Edited by: Micah M. Murray, University Hospital Center and University of Lausanne, Switzerland
Reviewed by: Pascal Barone, Université Paul Sabatier, France; Pascale Sandmann, Hannover Medical School, Germany
*Correspondence: Matthew B. Winn, Waisman Center & Department of Surgery, University of Wisconsin-Madison, Waisman Center Room 565, 1500 Highland Avenue, Madison, WI 53705, USA
This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
There is a wide range of acoustic and visual variability across different talkers and different speaking contexts. Listeners with normal hearing (NH) accommodate that variability in ways that facilitate efficient perception, but it is not known whether listeners with cochlear implants (CIs) can do the same. In this study, listeners with NH and listeners with CIs were tested for accommodation to auditory and visual phonetic contexts created by gender-driven speech differences as well as vowel coarticulation and lip rounding in both consonants and vowels. Accommodation was measured as the shifting of perceptual boundaries between /s/ and /∫/ sounds in various contexts, as modeled by mixed-effects logistic regression. Owing to the spectral contrasts thought to underlie these context effects, CI listeners were predicted to perform poorly, but showed considerable success. Listeners with CIs not only showed sensitivity to auditory cues to gender, they were also able to use visual cues to gender (i.e., faces) as a supplement or proxy for information in the acoustic domain, in a pattern that was not observed for listeners with NH. Spectrally-degraded stimuli heard by listeners with NH generally did not elicit strong context effects, underscoring the limitations of noise vocoders and/or the importance of experience with electric hearing. Visual cues for consonant lip rounding and vowel lip rounding were perceived in a manner consistent with coarticulation and were generally used more heavily by listeners with CIs. Results suggest that listeners with CIs are able to accommodate various sources of acoustic variability either by attending to appropriate acoustic cues or by inferring them via the visual signal.
Variability in the acoustic realization of speech segments is a well-known phenomenon that can arise from several sources, including coarticulation with neighboring segments and inter-talker differences related to gender and vocal tract size. Despite this variability in the physical properties of the signal, normal-hearing (NH) listeners are remarkably successful at perceiving and understanding speech. Listeners are thought to accommodate this variability by compensating for the context in which sounds are heard, and thus recognize that two different sounds are really the "same." In this paper, we explore this phenomenon for listeners with cochlear implants (CIs) and for listeners with NH, in normal (unprocessed) or CI-simulated conditions.
The sounds heard by CI listeners are spectro-temporally degraded and altered by the device because of the limited number of independent spectral processing channels (Fishman et al., 1997).
For NH listeners, accommodation to phonetic context is observed across many different types of target sounds and in various contexts. A heavily explored example is the perception of stop consonants /g/ and /d/ in the context of liquid consonants /l/ or /r/. Given an ambiguous syllable that sounds like either /da/ or /ga/, listeners are biased to hear /da/ if a preceding syllable is /ar/, but biased to hear /ga/ if the preceding syllable is /al/ (Mann, 1980).
The importance of sound frequency contrast should be problematic for CI listeners on the basis of their impaired auditory systems and the poor frequency coding of CIs. Various studies have shown, however, that listeners can use non-auditory information, such as visual cues and other indirect knowledge about the speaker and the speech signal, when accommodating phonetic variability. For example, listeners can incorporate lexical knowledge into phonetic perception (Elman and McClelland, 1988).
While the exact mechanisms of accommodation to phonetic context across modalities have yet to be completely elucidated, it is increasingly clear that listeners exploit any information that is available in the signal, under the right circumstances. This behavior appears suited to accommodate coarticulation and other sources of variability in speech production. The extent to which listeners with CIs overcome auditory limitations to accommodate speech variability is largely unknown. In this study, we examine context effects in the auditory and visual domains for listeners with CIs; we compare this with performance of listeners with NH using unprocessed as well as spectrally-degraded speech tokens.
The consonants /∫/ (as in "she") and /s/ (as in "see") are the primary focus of our investigation, not for their importance in word identification, but for their well-known patterns of variability across talkers and across different phonetic environments. These sounds are voiceless fricatives that contrast primarily in the amplitude spectrum; the spectral peak frequency is higher for /s/ than for /∫/, and is considered to be a dominant cue in the perceptual distinction (Evers et al., 1998).
Other factors that affect the production and perception of /∫/ and /s/ include the vowel context in which they are spoken and the formant transitions connecting the fricative to the vowel segment. In the context of a following vowel produced with lip rounding, both fricatives show a global frequency-lowering effect, and listeners compensate by lowering the frequency boundary between them accordingly (Kunisaki and Fujisaki, 1977).
Few studies have examined the use of context in phonetic perception by CI listeners; results generally suggest limited success in this area. Hedrick and Carney (1998), for example, reported that listeners with CIs weighted relative amplitude and formant transition cues for place of articulation differently than listeners with NH.
Visual cues play a role in accommodating inter-talker differences, and this has implications for listeners with impaired auditory systems. When listening to a gender-atypical talker, listeners with NH shift their perceptual phonetic boundaries based on visual cues to talker gender (Strand and Johnson, 1996).
In this study, we investigate whether visual cues can aid the accommodation of phonetic context and gender information by listeners who wear CIs, in view of their specific limitations in the auditory domain. We follow a design similar to that used by Strand and Johnson (1996).
In the first experiment, we tested accommodation to phonetic context in the acoustic domain only. Specifically, we sought to clarify the separate effects of formant transitions, vowel context, and auditory gender cues. To examine the effects of spectral degradation separately from the additional factors involved in using a cochlear implant, normal-hearing listeners were also tested with noise-vocoded speech (which is regarded by some as a "CI simulation").
Participants included 10 adult (mean age 21.9 years; 8 female) listeners with NH, defined as having pure-tone thresholds ≤20 dB HL (hearing level) from 250 to 8000 Hz in both ears [American National Standards Institute (ANSI)]. Listeners with CIs also participated; their individual details are given in the table below (columns: listener ID, sex, onset or duration of hearing loss, age in years, years of CI experience, device, and three speech perception scores in percent; DNT, did not test).
C12 | F | Unknown | 66 | 3 | Freedom | 66 | 51 | 87 |
C18 | F | 10 years | 66 | 3 | Freedom | 68 | 84 | 99 |
C20 | M | 22 years | 64 | 7 | N 24 | 68 | 46 | 93 |
C25 | M | 11 years | 50 | 10 | N 24 | 72 | 69 | DNT |
C30 | M | Unknown | 56 | 2 | Med-El | 62 | 29 | DNT |
C36 | F | 59 years | 71 | 5 | Freedom | 83 | 62 | 99 |
C42 | F | 4 years | 73 | 4 | Freedom | 70 | 58 | 79 |
Stimuli varied along four parameters: fricative spectrum (9 levels), talker (4 levels; 2 female and 2 male), vowel (2 levels), and formant transitions at vowel onset (2 levels), for a total of 144 (9 × 4 × 2 × 2) stimuli.
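For concreteness, the full stimulus set is simply the factorial crossing of these four parameters. A minimal R sketch (the level labels are our own placeholders, not the original stimulus names):

```r
# Enumerate the 9 x 4 x 2 x 2 = 144 stimulus combinations.
stimuli <- expand.grid(
  fricative_step = 1:9,                       # /s/-/sh/ spectral continuum
  talker         = c("F1", "F2", "M1", "M2"), # 2 female, 2 male talkers
  vowel          = c("i", "u"),               # see/she vs. sue/shoe contexts
  transitions    = c("s_like", "sh_like")     # formant transitions at vowel onset
)
nrow(stimuli)  # 144
```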
Frequencies (Hz) of the spectral peaks (SP1–SP3) at each of the nine continuum steps, followed by the relative amplitudes (dB) of SP1 and SP3:

Continuum step | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
SP1 (Hz) | 2932 | 3226 | 3550 | 3906 | 4298 | 4729 | 5203 | 5726 | 6300
SP2 (Hz) | 6130 | 6357 | 6592 | 6837 | 7090 | 7352 | 7625 | 7907 | 8200
SP3 (Hz) | 8100 | 8283 | 8472 | 8666 | 8863 | 9065 | 9272 | 9484 | 9700
SP1 (Hz) | 1500 | 1556 | 1612 | 1671 | 1732 | 1796 | 1861 | 1929 | 2000
SP2 (Hz) | 3500 | 3500 | 3500 | 3500 | 3500 | 3500 | 3500 | 3500 | 3500
SP3 (Hz) | 2520 | 2670 | 2828 | 2997 | 3175 | 3364 | 3564 | 3775 | 4000
SP1 (dB) | 1.67 | 0.83 | 0.00 | −0.83 | −1.67 | −2.50 | −3.33 | −4.17 | −5
SP3 (dB) | −1.7 | −0.8 | 0.0 | 0.8 | 1.7 | 2.5 | 3.3 | 4.2 | 5
Channel | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
High-pass (Hz) | 150 | 314 | 570 | 967 | 1586 | 2549 | 4046 | 6376
Low-pass (Hz) | 314 | 570 | 967 | 1586 | 2549 | 4046 | 6376 | 10000
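For readers unfamiliar with noise-band vocoding, the sketch below shows one way to implement the 8-channel vocoder using the band edges from the table. Only those band edges come from the text; the filter orders, the 160-Hz envelope cutoff, the peak normalization, and the file names are assumptions for illustration (R, with the tuneR and signal packages):

```r
library(tuneR)    # readWave, Wave, writeWave
library(signal)   # butter, filtfilt

edges <- c(150, 314, 570, 967, 1586, 2549, 4046, 6376, 10000)  # band edges (Hz)

# Noise-band vocoder: extract the temporal envelope in each analysis band
# and use it to modulate a noise carrier filtered into the same band.
vocode <- function(x, fs, edges, env_cutoff = 160) {
  stopifnot(fs / 2 > max(edges))              # 10-kHz edge must sit below Nyquist
  nyq     <- fs / 2
  carrier <- runif(length(x), -1, 1)          # broadband noise carrier
  env_lp  <- butter(2, env_cutoff / nyq, type = "low")
  out     <- numeric(length(x))
  for (k in seq_len(length(edges) - 1)) {
    bp   <- butter(3, c(edges[k], edges[k + 1]) / nyq, type = "pass")
    band <- filtfilt(bp, x)                       # analysis band
    env  <- pmax(filtfilt(env_lp, abs(band)), 0)  # rectified, smoothed envelope
    out  <- out + env * filtfilt(bp, carrier)     # envelope-modulated noise band
  }
  out / max(abs(out))                             # peak-normalize the sum
}

w <- readWave("token.wav")                        # hypothetical stimulus file
y <- vocode(w@left / 2^(w@bit - 1), w@samp.rate, edges)
writeWave(Wave(round(y * 32000), samp.rate = w@samp.rate, bit = 16),
          "token_vocoded.wav")
```

Real CI processing differs from this idealization in many respects (e.g., amplitude compression, current spread), consistent with the caution about "CI simulations" raised later in the paper.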
After hearing each stimulus, listeners used a computer mouse to select the word that they perceived in a four-alternative forced-choice task (the choices were “see,” “sue,” “she,” “shoe”). For NH listeners, stimuli were presented in ten alternating blocks of spectral resolution (unprocessed or 8-channel NBV), and presentation of tokens within each block was randomized. Each of the 144 stimuli was heard a total of 5 times in each condition of spectral resolution. CI listeners only heard the unprocessed stimuli. All testing was conducted in a double-walled sound-treated booth. Stimuli were presented at 65 dBA in the free field through a single loudspeaker.
Listeners' responses were analyzed using a generalized linear (logistic) mixed-effects model (GLMM) in the R software environment (R Development Core Team).
Starting with an intercept-only model, factor selection (e.g., the inclusion of talker gender as a response predictor) proceeded by forward stepwise selection: candidate factors were added one at a time, and at each step the factor yielding the lowest Akaike information criterion (AIC; Akaike, 1974) was entered first. Subsequent factors (or factor interactions) were retained only if they significantly improved the model without over-fitting.
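A minimal sketch of this procedure, assuming the lme4 package (the text specifies R and the AIC, but not a fitting function) and hypothetical column names:

```r
library(lme4)

# d: one row per trial; resp = 1 for "s" responses, 0 for "sh";
# step, gender, vowel, transitions are stimulus factors; subject is the listener ID.
m0 <- glmer(resp ~ 1 + (1 | subject), data = d, family = binomial)
m1 <- update(m0, . ~ . + step)      # candidate factor: fricative spectrum
m2 <- update(m1, . ~ . + gender)    # candidate factor: talker gender

AIC(m0, m1, m2)  # a factor is entered/retained only if it lowers the AIC
anova(m1, m2)    # likelihood-ratio test guards against over-fitting
```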
Identification functions for fricatives in the various vocalic contexts are shown in the figure. Model estimates for each fixed effect and interaction are summarized in the table below (n.s., not significant; N/A, not applicable because a component effect was not in the model).
Effect | NH (unprocessed) | NH (vocoded) | CI
Fricative | 22.86 | 30.56 | 24.05
Gender | 17.63 | n.s. | 15.48
Vowel | 7.00 | n.s. | 3.33
Formant | 8.18 | 2.07 | n.s.
Fricative: gender | 2.29 | N/A | n.s.
Fricative: vowel | n.s. | 2.24 | 3.29
Fricative: formant | 2.631 | n.s. | N/A
Gender: vowel | 4.681 | N/A | n.s.
Gender: formant | n.s. | N/A | N/A
Vowel: formant | n.s. | n.s. | N/A
For the NH listeners in the unprocessed sound condition, there were significant main effects of fricative spectrum, talker gender, vowel context, and formant transitions (see table above).
For NH listeners in the vocoded condition, there were significant main effects of fricative spectrum and formant transitions (see table above).
For the cochlear implant listeners, there were significant main effects of fricative spectrum, talker gender, and vowel context (see table above).
Given the heterogeneity of the CI and NH groups in terms of age and hearing history, as well as the relatively small sample size of the CI group, direct statistical comparisons between groups should be treated with caution. A rough qualitative assessment suggests that CI listeners' vocalic context effects were smaller than those of NH listeners with unprocessed speech, but greater than those of NH listeners in the degraded condition, for whom there were virtually no context effects (see figure).
Context effects observed in this experiment cannot be inferred from performance on conventional word/phoneme recognition tests. Stimuli at the extreme endpoints of the fricative continuum are comparable to those that would be heard in such tests and can therefore be evaluated for correctness. Identification of such endpoint stimuli in this experiment was excellent for all listener groups (at least 95% for both phonemes; see figure).
A second experiment was designed to test the influence of visual context cues on the perception of fricatives. This was a modified replication of Strand and Johnson (1996).
Participants included 10 adult (mean age 22.2 years; 6 female) listeners with NH [American National Standards Institute (ANSI)].
The procedure for Experiment 2 was nearly the same as for Experiment 1, with the exception of the software used to deliver the stimuli. Video stimuli were centered on the display screen, with the four word choices equally spaced around the periphery. Response choices remained visible during the videos, which all began and ended with a neutral, closed-lip posture.
Listeners' responses were fit using the same GLMM procedure as in Experiment 1, with fixed effects of fricative spectrum peak, auditory gender (voice), visual gender (face), vowel, and consonant lip rounding (the last coded by whether the original video recording was of an /s/-onset or /∫/-onset word).
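Under the same lme4 assumption as the Experiment 1 sketch (column names again hypothetical), the Experiment 2 model would take roughly this form:

```r
# Logistic GLMM for Experiment 2, with auditory and visual context predictors.
m_exp2 <- glmer(resp ~ step + voice_gender + face_gender + vowel + rounding +
                  (1 | subject),
                data = d2, family = binomial)
```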
Identification functions in the various auditory and visual contexts are shown in the figures. Model estimates are summarized in the table below (abbreviations as in the Experiment 1 table).
Effect | NH (unprocessed) | NH (vocoded) | CI
Fricative | 31.08 | 17.47 | 20.32
Gender (voice) | 13.43 | 5.85 | 9.70
Gender (face) | n.s. | 7.52 | 10.14
Vowel | 3.31 | 8.28 | 7.40
Lip rounding | 13.52 | 20.85 | 18.84
Fricative: gender (voice) | n.s. | 3.56 | 2.38
Fricative: gender (face) | n.s. | 3.83 | 2.15
Fricative: vowel | 7.05 | 8.77 | 4.28
Fricative: lip rounding | n.s. | n.s. | n.s.
Gender (voice): gender (face) | N/A | n.s. | n.s.
Gender (voice): vowel | 10.55 | n.s. | n.s.
Gender (voice): lip rounding | n.s. | n.s. | n.s.
Gender (face): vowel | N/A | 3.28 | 4.20
Gender (face): lip rounding | N/A | 5.55 | 3.07
Vowel: lip rounding | 8.14 | 9.68 | 11.95
For the listeners with NH in the unprocessed (non-vocoded) condition, there were significant main effects of fricative spectrum, lip rounding, vowel environment, and auditory gender (voice) cues (see table above).
For NH listeners in the vocoded condition, each main effect reached significance (see table above).
For listeners with CIs, each main effect reached significance (see table above).
As in Experiment 1, statistical comparisons between the NH and CI listener groups in the second experiment should be interpreted with caution, given the sizes and nature of these groups. There was a significant interaction between hearing status and each of the five main effects. For CI listeners, there were significantly smaller effects of the fricative cue.
In Experiment 2, listeners were presented with a subset of the sounds from Experiment 1, with accompanying visual cues. The auditory gender context effects from Experiment 1 were replicated, with additional effects of visual cues to gender (face) and lip rounding. Results suggest that although CI listeners receive degraded auditory cues to gender (Fu et al., 2004), they can recruit visual cues to gender as a supplement to, or proxy for, that acoustic information.
Listeners (especially those with CIs) demonstrated nuanced sensitivity to lip posture, reflecting tacit understanding of visual cues for the consonant and vowel as a coherent unit rather than as individual segments. Given an interpretation of the vowel as /i/, lip rounding had to be attributed to the consonant. Thus, in the context of /i/, consonant lip rounding gave rise to considerable bias toward /∫/ even at high-frequency steps of the fricative continuum; the fricative itself was a weaker cue in this case. Conversely, in the context of the rounded vowel /u/, lip rounding during the consonant is acceptable for both /s/ and /∫/. Accordingly, in the context of /u/, the CI listeners were relatively less affected by lip rounding and relatively more affected by the spectrum of the fricative. This pattern of second-order context effects (use of the lip-rounding context according to the level of the vowel context) is illustrated by the psychometric functions in the figure.
The presence of auditory phonetic context effects in this study supported findings of earlier literature and generalized the phenomenon to CI listeners. Specifically, listeners with NH were more likely to identify fricatives as /s/ when they were (1) perceived to be spoken by male voices, (2) in the context of a rounded vowel, or (3) followed by formant transitions appropriate for /s/. CI listeners showed auditory context effects of a similar type but arguably to a lesser degree. The differences between alternate contexts in this experiment (male/female voices, /i/-/u/ vowels, /s/-/∫/ formants) are all cued primarily by spectral properties such as formant spacing, spectral tilt, voice pitch, vowel formants, and dynamic formant transitions; the presence of context effects for CI listeners was somewhat surprising, given the limitations of spectral resolution in CIs. To the extent that these effects are representative of the many inter-talker and cross-context variations present in natural speech, successful accommodation appears to emerge without fine spectral resolution.
The 50% crossover point in /s/-/∫/ identification in each context was calculated from the group-aggregate GLMMs in both experiments. The table below lists the resulting boundary shifts (in Hz) between contexts for each listener group; a computational sketch follows the table.
Experiment 1 context | NH (unprocessed) | NH (vocoded) | CI
Gender (voice) | 702 | 0 | 636
Vowel | 282 | 62 | 168
Formant | 242 | 51 | 0

Experiment 2 context | NH (unprocessed) | NH (vocoded) | CI
Voice | 675 | 183 | 490
Face | 0 | 176 | 290
Voice and face | 675 | 358 | 780
Lips | 274 | 1016 | 993
Vowel | 131 | 247 | 98
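The crossover computation follows directly from the logistic form of the model: the 50% point is the continuum location where the linear predictor crosses zero. A sketch using the hypothetical model m2 from the Experiment 1 code (treatment-coded gender with levels F and M):

```r
# 50% crossover per context, and the boundary shift between contexts,
# from the fixed effects of the logistic GLMM (hypothetical coefficient names).
b <- fixef(m2)                                       # (Intercept), step, genderM
xover_female <- -b["(Intercept)"] / b["step"]                    # continuum steps
xover_male   <- -(b["(Intercept)"] + b["genderM"]) / b["step"]
shift_steps  <- xover_male - xover_female            # boundary shift, in steps
# Convert steps to Hz using the continuum's peak-frequency spacing, e.g.:
# shift_hz <- shift_steps * mean(diff(sp1_freqs))    # sp1_freqs from the continuum table
```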
Time-varying spectral contrast encapsulates some of the acoustic variables that underlie the contextual accommodation observed in this study and in previous studies (Lotto and Kluender, 1998).
It is commonplace in CI research to use vocoders to produce "simulated CI" speech that is then played for NH listeners. Results presented in this study expose some limitations of the conventional vocoder method in modeling perception of speech by CI listeners. Although performance for clearly-pronounced words was comparable across all groups of listeners (see figure), NH listeners presented with vocoded speech showed much weaker context effects than CI listeners did, particularly in Experiment 1.
In Experiment 2, people with CIs utilized visual talker information to accommodate gender-driven acoustic differences in speech production, perhaps as a supplement to or proxy for degraded auditory information. There is precedent for increased cross-modal influence on speech perception, especially in the presence of acoustic signal degradation (e.g., Sumby and Pollack, 1954).
It is noteworthy that the effect in this experiment went in the direction predicted by the acoustics that typically correspond to the gender of the visual stimulus; listeners had no reason to prefer /s/ when seeing male faces (especially when presented with concurrent female voices) other than having been exposed to the natural association between visual cues and the auditory spectral properties of voices that correspond to those faces. The influence of this cue cannot be completely understood on the basis of this experiment alone, but it likely arises from a learned association between phonetic segments and typical gender-driven differences in speech production.
There were two distinct effects of visual speech information in this study. First, lip rounding during the consonant segment increased the proportion of /∫/ responses in a predictable fashion. Second, the effect of lip rounding was further modulated by vowel context, suggesting that listeners are sensitive to the compatibility of auditory and visual phonetic cues at the syllabic level in addition to the segmental level. This was especially apparent for the NH listeners in spectrally-degraded conditions and for CI listeners (see figure).
The use of context and visual cues has potential impact on the rehabilitation of listeners with CIs. Consonant and vowel recognition tested in predictable syllable contexts (e.g., /apa/, /ata/, /aka/) does not reflect the linguistic and acoustic variability of real speech, which is an important part of speech recognition in everyday life. It is not enough for listeners to recognize cues for a consonant in just one environment; they must be able to accommodate variability arising from different phonetic contexts and from different talkers, and basic word identification is not sensitive enough to capture this ability (see figure).
Finally, visual cues clearly provide information that can be of use to listeners with hearing impairment. To the extent that audiologists aim to equip clients with all the tools necessary to succeed in everyday listening, it may be beneficial to exploit the relationships between visual and auditory cues to facilitate not only consonant recognition (Bernstein et al.), but also the accommodation of talker and phonetic-context variability examined here.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This work was supported by NIH grant R01 DC 004786 to Monita Chatterjee and NIH grant 7R01DC005660-07 to David Poeppel and William J. Idsardi. Matthew B. Winn was supported by the University of Maryland Center for Comparative and Evolutionary Biology of Hearing Training Grant (NIH T32 DC000046-17, PI: Arthur N. Popper). We are grateful to Tristan Mahr for his assistance with the preparation of the figures.
1. A Praat script that can be used to re-create this continuum is available upon request from the first author.