# THE COGNITIVE AND NEURAL ORGANISATION OF SPEECH PROCESSING

EDITED BY: Patti Adank, Carolyn McGettigan and Sonja A. E. Kotz PUBLISHED IN: Frontiers in Human Neuroscience

#### *Frontiers Copyright Statement*

*© Copyright 2007-2015 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.*

*The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.*

*Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.*

*Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.*

*As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.*

*All copyright, and all rights therein, are protected by national and international copyright laws.*

*The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use.*

ISSN 1664-8714 ISBN 978-2-88919-775-0 DOI 10.3389/978-2-88919-775-0


## **THE COGNITIVE AND NEURAL ORGANISATION OF SPEECH PROCESSING**

Topic Editors:

**Patti Adank,** University College London, UK

**Carolyn McGettigan,** Royal Holloway University of London, UK

**Sonja A. E. Kotz,** Max Planck Institute Leipzig, Germany

Sequential sagittal gradient-echo MRI images of the human head and vocal tract (ordered left to right) were collected at a sample rate of 8 frames per second while a native speaker of British English (Ulster) produced the monosyllables /bead/, /bird/ and /booed/. Image by Carolyn McGettigan.

Speech production and perception are two of the most complex actions humans perform. The processing of speech is studied across various fields and using a wide variety of research approaches. These fields include, but are not limited to, (socio)linguistics, phonetics, cognitive psychology, neurophysiology, and cognitive neuroscience. Research approaches range from behavioural studies to neuroimaging techniques such as magnetoencephalography and electroencephalography (MEG/EEG) and functional Magnetic Resonance Imaging (fMRI), as well as neurophysiological approaches, such as the recording of Motor Evoked Potentials (MEPs) and Transcranial Magnetic Stimulation (TMS). Each of these approaches provides valuable information about specific aspects of speech processing. Behavioural testing can inform about the nature of the cognitive processes involved in speech processing, neuroimaging methods show where in the brain these processes take place (fMRI and MEG) and/or elucidate the time course of activation of these brain areas (EEG and MEG), while neurophysiological methods (MEPs and TMS) can assess critical involvement of brain regions in the cognitive process. Yet what is currently unclear is how speech researchers can combine methods such that a convergent approach adds to theory/model formulation, above and beyond the contribution of individual component methods. We expect that such combinations of approaches will significantly advance theoretical development in the field.

The present Research Topic comprises a collection of manuscripts discussing the cognitive and neural organisation of speech processing, including speech production and perception at the level of individual speech sounds, syllables, words, and sentences. Our goal was to use findings from a variety of disciplines, perspectives, and approaches to gain a more complete picture of the organisation of speech processing. The contributions are grouped around the following four main themes: 1) Spoken language comprehension under difficult listening conditions; 2) Sub-lexical processing; 3) Sensorimotor processing of speech; 4) Speech production. The contributions used a variety of research approaches, including behavioural experiments, fMRI, EEG, MEG, and TMS. Twelve of the 14 contributions addressed speech perception, and the remaining two examined speech production. This Research Topic thus displays a wide variety of topics and research methods, and this comprehensive approach allows an integrative understanding of currently available evidence as well as the identification of concrete avenues for future research.

**Citation:** Adank, P., McGettigan, C., Kotz, S. A. E., eds. (2016). The Cognitive and Neural Organisation of Speech Processing. Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-775-0

# Table of Contents


*Multi-talker background and semantic priming effect* Marie Dekerle, Véronique Boulenger, Michel Hoen and Fanny Meunier

## Editorial: Current research and emerging directions on the cognitive and neural organization of speech processing

Patti Adank<sup>1</sup>\*, Carolyn McGettigan<sup>2</sup> and Sonja A. E. Kotz<sup>3,4</sup>

<sup>1</sup> Division of Psychology and Language Sciences, Speech, Hearing and Phonetic Sciences, University College London, London, UK, <sup>2</sup> Department of Psychology, Royal Holloway University of London, Egham, UK, <sup>3</sup> Max Planck Institute Leipzig, Leipzig, Germany, <sup>4</sup> School of Psychological Sciences, University of Manchester, Manchester, UK

Keywords: speech perception, speech production, functional magnetic resonance imaging (fMRI), magnetoencephalography (MEG), electroencephalography, transcranial magnetic stimulation (TMS)

This Research Topic consists of 14 manuscripts discussing the cognitive and neural organization of speech processing. The contributions are grouped around four themes: (1) Spoken language comprehension under difficult listening conditions; (2) Sub-lexical processing; (3) Sensorimotor processing of speech; (4) Speech production.

Seven papers addressed speech perception under challenging listening conditions. Van Engen and Peelle (2014) discuss the effects of processing speech in an unfamiliar regional or foreign accent. They argue that, as perceiving accented speech incurs a processing cost, just like other types of distortions such as background noise, it should also be regarded as representing a challenging listening condition. Neger et al. (2014) focused on plasticity of speech processing in statistical and perceptual learning tasks in aging. They conclude that perceptual and statistical learning share mechanisms of implicit regularity detection, but that the ability to detect statistical regularities is impaired in older adults for fast visual sequences. Dekerle et al. (2014) examined whether speech perception in a multi-speaker background relies on semantic interference between the background and target speaker using a semantic priming paradigm in three experiments. Their results indicate that higher-level linguistic processes such as semantic priming may not be as automatic as commonly thought but are subject to the limits of cognitive resources such as working memory and attention. Yi et al. (2014) evaluate how processing of foreign-accented speech relates to social cognition. It was concluded that foreign-accented speech perception engages greater activation of neural systems underlying speech perception, and that implicit Asian-foreign association is related to decreased neural efficiency in early spectrotemporal processing. Vitello et al. (2014) used fMRI to address the question of how semantic ambiguities are resolved during speech comprehension.

Strauß et al. (2014) examined through literature review whether neural oscillations in the alpha frequency range (∼10 Hz) act as a neural mechanism to selectively inhibit the processing of noise to improve auditory selective attention to task-relevant speech signals. Ding and Simon (2014) discuss whether cortical entrained activity is related more closely to speech perception or to auditory encoding that is not specific to speech, by reviewing evidence regarding various hypotheses about the functional roles of cortical entrainment to speech.

Three papers focused on perception of speech at sub-lexical levels. Deschamps and Tremblay (2014) studied perception of sub-lexical information by examining the neural bases of processing of simple syllables and more complex syllabic structures using fMRI, while Yu et al. (2014) used MEG to study the neural processing of disgust in anterior insula by presenting listeners with syllables with different intended emotional meanings. Finally, Chen et al. (2014) investigated processing of acoustic and phonological information in lexical tones in Mandarin Chinese using EEG.

#### Edited and reviewed by:

Hauke R. Heekeren, Freie Universität Berlin, Germany

#### \*Correspondence:

Patti Adank, p.adank@ucl.ac.uk

Received: 16 April 2015 Accepted: 12 May 2015 Published: 27 May 2015

#### Citation:

Adank P, McGettigan C and Kotz SAE (2015) Editorial: Current research and emerging directions on the cognitive and neural organization of speech processing. Front. Hum. Neurosci. 9:305. doi: 10.3389/fnhum.2015.00305

Two papers addressed sensorimotor processing of speech. Komeilipoor et al. (2014) report higher motor excitability as measured using Transcranial Magnetic Stimulation (TMS) in the tongue area during the presentation of meaningful gestures (noun-associated). Sowman et al. (2014) demonstrate that appropriately timed TMS to the hand area, paired with auditorily mediated excitation of the motor cortex, induces an enhancement of motor cortex excitability that lasts beyond the time of stimulation.

Two papers focused on speech production. Etchell et al. (2014) provide a review of the stuttering literature and Hernandez-Pavon et al. (2014) present a neuronavigated TMS study exploring the neural locus of aspects of picture naming in healthy participants.

This Frontiers Research Topic allows new insights into the neurobiology of speech perception and production, and demonstrates how the field of speech science is now addressing issues at its very core. We believe that the future of the research in the field lies in the effective combination of research methods, e.g., EEG and TMS, or fMRI and EEG, as research will benefit from the strengths of each method. In conclusion, this Research Topic consists of 14 excellent contributions, and we are convinced the Topic will provide readers with novel ideas for future studies that will elucidate the cognitive and neural architecture of speech processing.

## References

an auditory stimulus with TMS. Front. Hum. Neurosci. 8:398. doi: 10.3389/fnhum.2014.00398


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Adank, McGettigan and Kotz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## Cortical entrainment to continuous speech: functional roles and interpretations

*Nai Ding<sup>1</sup>\* and Jonathan Z. Simon<sup>2,3,4</sup>\**

<sup>1</sup> Department of Psychology, New York University, New York, NY, USA

<sup>2</sup> Department of Electrical and Computer Engineering, University of Maryland College Park, College Park, MD, USA

<sup>3</sup> Department of Biology, University of Maryland College Park, College Park, MD, USA

<sup>4</sup> Institute for Systems Research, University of Maryland College Park, College Park, MD, USA

#### *Edited by:*

Sonja A. E. Kotz, Max Planck Institute for Human Cognitive and Brain Sciences, Germany

#### *Reviewed by:*

István Winkler, University of Szeged, Hungary

Jonas Obleser, Max Planck Institute for Human Cognitive and Brain Sciences, Germany

#### *\*Correspondence:*

Nai Ding, Department of Psychology, New York University, New York, NY 10012, USA e-mail: gahding@gmail.com; Jonathan Z. Simon, Department of Electrical and Computer Engineering, University of Maryland College Park, College Park, MD 20742, USA e-mail: jzsimon@umd.edu

Auditory cortical activity is entrained to the temporal envelope of speech, which corresponds to the syllabic rhythm of speech. Such entrained cortical activity can be measured from subjects naturally listening to sentences or spoken passages, providing a reliable neural marker of online speech processing. A central question still remains to be answered about whether cortical entrained activity is more closely related to speech perception or non-speech-specific auditory encoding. Here, we review a few hypotheses about the functional roles of cortical entrainment to speech, e.g., encoding acoustic features, parsing syllabic boundaries, and selecting sensory information in complex listening environments. It is likely that speech entrainment is not a homogeneous response and these hypotheses apply separately for speech entrainment generated from different neural sources. The relationship between entrained activity and speech intelligibility is also discussed. A tentative conclusion is that theta-band entrainment (4–8 Hz) encodes speech features critical for intelligibility while delta-band entrainment (1–4 Hz) is related to the perceived, non-speech-specific acoustic rhythm. To further understand the functional properties of speech entrainment, a splitter's approach will be needed to investigate (1) not just the temporal envelope but what specific acoustic features are encoded and (2) not just speech intelligibility but what specific psycholinguistic processes are encoded by entrained cortical activity. Similarly, the anatomical and spectro-temporal details of entrained activity need to be taken into account when investigating its functional properties.

**Keywords: auditory cortex, entrainment of rhythms, speech intelligibility, speech perception in noise, speech envelope, cocktail party problem**

## **INTRODUCTION**

Speech recognition is a process that maps an acoustic signal onto the underlying linguistic meaning. The acoustic properties of speech are complex and contain temporal dynamics on several time scales (Rosen, 1992; Chi et al., 2005). The time scale most critical for speech recognition is on the order of hundreds of milliseconds (1–10 Hz), and the temporal fluctuations on this time scale are usually called the *temporal envelope* (**Figure 1A**). Single neuron neurophysiology from animal models has shown that neurons in primary auditory cortex encode the analogous temporal envelope of other non-speech sounds by phase locked neural firing (Wang et al., 2003). In contrast, the finer scale acoustic properties that decide the pitch and timbre of speech at each time moment (acoustic fragments lasting a few tens of ms) are likely to be encoded using a spatial code, by either individual neurons (Bendor and Wang, 2005) or spatial patterns of cortical activity (Walker et al., 2011).
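To make the notion of a temporal envelope concrete, the following is a minimal sketch of one common way to approximate it: rectify the waveform and smooth it so that only fluctuations slower than roughly 10 Hz remain. The function name `temporal_envelope`, the moving-average smoother, and the synthetic test signal are illustrative assumptions; this is not the cochlear model used for Figure 1A.

```python
import math

def temporal_envelope(signal, fs, cutoff_hz=10.0):
    """Rectify the waveform, then smooth it with a causal moving average.

    The window spans one period of cutoff_hz, so fluctuations much faster
    than cutoff_hz are averaged out (a crude low-pass filter, assumed here
    only for illustration).
    """
    window = max(1, int(round(fs / cutoff_hz)))
    rectified = [abs(x) for x in signal]
    envelope = []
    running = 0.0
    for i, value in enumerate(rectified):
        running += value
        if i >= window:
            running -= rectified[i - window]  # drop the sample leaving the window
        envelope.append(running / min(i + 1, window))
    return envelope

# Hypothetical test signal: a 200 Hz carrier whose amplitude is modulated at 4 Hz.
fs = 2000
signal = [
    (1.0 + math.cos(2 * math.pi * 4 * n / fs)) * math.sin(2 * math.pi * 200 * n / fs)
    for n in range(fs)
]
envelope = temporal_envelope(signal, fs)
```

The smoothed output rises and falls at the 4 Hz modulation rate while the 200 Hz carrier is averaged away, which is the sense in which the envelope captures slow intensity fluctuations rather than fine structure.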

In the last decade or so, cortical entrainment to the temporal envelope of speech has been demonstrated in humans using magnetoencephalography (MEG; Ahissar et al., 2001; Luo and Poeppel, 2007), electroencephalography (EEG; Aiken and Picton, 2008), and electrocorticography (ECoG; Nourski et al., 2009). This envelope following response can be recorded from subjects listening to sentences or spoken passages and therefore provides an online marker of neural processing of continuous speech. Envelope entrainment has mainly been seen in the waveform of low-frequency neural activity (<8 Hz) and in the power envelope of high-gamma activity (Pasley et al., 2012; Zion Golumbic et al., 2013). Although the phenomenon of envelope entrainment has been well established, its underlying neural mechanisms and functional roles remain controversial. It is still under debate whether entrained cortical activity is more closely tied to the physical properties of the acoustic stimulus or to higher level language related processing that is directly related to speech perception. A number of studies have shown that cortical entrainment to speech is strongly modulated by top–down cognitive functions such as attention (Kerlin et al., 2010; Ding and Simon, 2012a; Mesgarani and Chang, 2012; Zion Golumbic et al., 2013) and therefore is not purely a bottom-up response. On the other hand, cortical entrainment to the sound envelope is seen for non-speech sounds (Lalor et al., 2009; Hämäläinen et al., 2012; Millman et al., 2012; Wang et al., 2012; Steinschneider et al., 2013) and therefore does not rely on speech-specific neural processing. In this article, we first summarize a number of hypotheses about the functional roles of envelope entrainment, and then review the literature about how envelope entrainment is affected by speech intelligibility.

**FIGURE 1 | A schematic illustration of hypotheses proposed to explain the generation of cortical entrainment to the speech envelope. (A)** The spectro-temporal representation of speech, obtained from a cochlear model (Yang et al., 1992). The broad-band temporal envelope of speech, the sum of the spectro-temporal representation over frequency, is superimposed in white. **(B)** An illustration of the collective feature tracking hypothesis and the onset tracking hypothesis. The colored images show time courses of the dendritic activity of two example groups of neurons, hypothetically in primary and associative auditory areas. One group encodes the slow temporal modulations and coarse spectral modulations of sound intensity, i.e., the spectro-temporal envelope of speech, which contain major phonetic cues. The other group encodes the slow temporal changes of cues computed from the spectro-temporal fine structure, e.g., the pitch contour and the trajectory of the sound source location. According to the collective feature tracking hypothesis, magnetoencephalography (MEG)/electroencephalography (EEG) measurements are the direct sum of dendritic activity across all such neural populations in primary and associative auditory areas. The onset tracking hypothesis is similar, but instead neurons encoding the temporal edges of speech dominate cortical activity and thus drive MEG/EEG measurable responses. **(C)** An illustration of the syllabic parsing hypothesis and the sensory selection hypotheses. These hypotheses assume certain computations that integrate over distributively-represented auditory features. The syllabic parsing hypothesis posits neural operations integrating features belonging to the same syllable. The sensory selection hypotheses propose either a temporal coherence analysis or a temporal predictive analysis.

#### **FUNCTIONAL ROLES OF CORTICAL ENTRAINMENT**

A number of hypotheses have been proposed about what aspects of speech, ranging from its acoustic features to its linguistic meaning, are encoded by entrained cortical activity. A few dominant hypotheses are summarized and compared (**Table 1**). Other unresolved questions about cortical neural entrainment, e.g., what the biophysical mechanisms generating cortical entrainment are, and whether entrained neural activity is related to spontaneous neural oscillations, are not covered here (see discussions in e.g., Schroeder and Lakatos, 2009; Howard and Poeppel, 2012; Ding and Simon, 2013b).

#### **ONSET TRACKING HYPOTHESIS**

Speech is dynamic and is full of acoustic "edges," e.g., onsets and offsets. These edges usually occur at syllable boundaries and are well characterized by the speech envelope. It is well known that a reliable macroscopic brain response can be evoked by an acoustic edge. Therefore, it has been proposed that neural entrainment to the speech envelope is a superposition of discrete, edge/onset related brain responses (Howard and Poeppel, 2010). Consistent with this hypothesis, it has been shown that the sharpness of acoustic edges, i.e., how quickly sound intensity increases, strongly influences cortical tracking of the sound envelope (Prendergast et al., 2010; Doelling et al., 2014). A challenge of this hypothesis, however, is that speech is continuously changing and it remains a problem as to which acoustic transients can be counted as edges.

If this hypothesis is true, a question naturally follows about whether envelope entrainment can provide insights that cannot be learned using the traditional event-related response approach. The answer is yes. Cortical responses, including edge/onset related auditory evoked responses, are stimulus-dependent, and quickly adapt to the spectro-temporal structure of the stimulus (Zacharias et al., 2012; Herrmann et al., 2014). Therefore, even if envelope entrainment is just a superposition of event-related responses, it can still provide insights about the properties of cortical activity when it is adapted to the acoustic properties of speech.
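The onset tracking idea can be sketched as a two-step toy model: find moments where the envelope rises sharply, then add one copy of an evoked-response waveform at each of them. Everything below (the function names, the threshold, the three-sample kernel and the example envelope) is a hypothetical illustration; in particular, the threshold makes explicit the open problem noted above of deciding which transients count as edges.

```python
def onset_strength(envelope):
    """Half-wave rectified first difference: positive only where intensity rises."""
    return [0.0] + [max(0.0, b - a) for a, b in zip(envelope, envelope[1:])]

def detect_onsets(envelope, threshold):
    """Return sample indices where the envelope rise first exceeds the threshold."""
    strength = onset_strength(envelope)
    onsets = []
    for i, s in enumerate(strength):
        previous = strength[i - 1] if i > 0 else 0.0
        if s > threshold and previous <= threshold:
            onsets.append(i)
    return onsets

def superpose_responses(onsets, kernel, length):
    """Add one copy of the evoked-response kernel at every detected onset."""
    response = [0.0] * length
    for onset in onsets:
        for k, value in enumerate(kernel):
            if onset + k < length:
                response[onset + k] += value
    return response

# Hypothetical envelope with two abrupt "syllable" onsets at samples 3 and 12.
envelope = [0, 0, 0, 1, 1, 1, 0.5, 0, 0, 0, 0, 0, 1, 1, 0.5, 0, 0, 0, 0, 0]
kernel = [1.0, -0.5, 0.25]  # assumed shape of an onset-evoked response
onsets = detect_onsets(envelope, threshold=0.5)
response = superpose_responses(onsets, kernel, len(envelope))
```

Because the kernel is fixed, this sketch ignores the stimulus-dependent adaptation described in the preceding paragraph; an adapted response would require the kernel itself to depend on the recent stimulus history.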

#### **COLLECTIVE FEATURE TRACKING HYPOTHESIS**

When sound enters the ear, it is decomposed into narrow frequency bands in the auditory periphery and is further decomposed into multi-scale acoustic features in the central auditory system, such as pitch, sound source location information, and coarse spectrotemporal modulations (Shamma, 2001; Ghitza et al., 2012). In speech, most acoustic features coherently fluctuate in time and these coherent fluctuations are captured by the speech envelope. If a neuron or a neural population encodes an acoustic feature, its activity is synchronized to the strength of that acoustic feature. As a result, neurons or neural networks that are tuned to coherently fluctuating speech features are activated coherently (Shamma et al., 2011).

Analogously to the speech envelope being the summation of the power of all speech features at each time moment, the large-scale neural entrainment to speech measured by MEG/EEG can be the summation of neural activity tracking different acoustic features of speech (**Figure 1B**). It is therefore plausible to hypothesize that macroscopic speech entrainment is a passive summation of microscopic neural tracking of acoustic features across neurons/networks (Ding and Simon, 2012b). Based on this hypothesis, the MEG/EEG speech entrainment is a marker of a collective cortical representation of speech but does not play any additional roles in regulating neuronal activity.

**Table 1 | A summary of major hypotheses about the functional roles of cortical entrainment to speech.**

The onset tracking hypothesis can be viewed as a special case of the collective feature tracking hypothesis, when the acoustic features driving cortical responses are restricted to a set of discrete edges. The collective feature tracking hypothesis, however, is more general since it allows features to be continuously changing and also incorporates features that are not associated with sharp intensity changes, such as changes in the pitch contour (Obleser et al., 2012), and sound source location. Under the onset tracking hypothesis, entrained neural activity is a superposition of onset/edge-related auditory evoked responses. Under the more general collective feature tracking hypothesis, at a first-order approximation, entrained activity is a convolution between speech features, e.g., the temporal envelopes in different narrow frequency bands, and the corresponding response functions, e.g., the response evoked by a very brief tone pip in the corresponding frequency band (Lalor et al., 2009; Ding and Simon, 2012b).
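The first-order approximation described above, a convolution between each speech feature and its response function summed across features, can be written out as a small sketch. The band envelopes and response functions below are made-up numbers chosen for readability, and `convolve` and `predict_response` are hypothetical helper names rather than functions from any toolbox.

```python
def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of the signal."""
    output = [0.0] * len(signal)
    for n in range(len(signal)):
        for k, weight in enumerate(kernel):
            if n - k >= 0:
                output[n] += weight * signal[n - k]
    return output

def predict_response(band_envelopes, response_functions):
    """Sum, over bands, the convolution of each envelope with its response function."""
    length = len(band_envelopes[0])
    total = [0.0] * length
    for envelope, kernel in zip(band_envelopes, response_functions):
        for n, value in enumerate(convolve(envelope, kernel)):
            total[n] += value
    return total

# Hypothetical two-band example with made-up response functions.
band_envelopes = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 0.0],
]
response_functions = [
    [1.0, 0.5],  # assumed response to a brief tone pip in band 1
    [-1.0],      # assumed response to a brief tone pip in band 2
]
predicted = predict_response(band_envelopes, response_functions)
```

Restricting the inputs to a train of discrete edges instead of continuous envelopes turns the same computation into the onset tracking model, which is the sense in which that hypothesis is a special case.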

#### **SYLLABIC PARSING HYPOTHESIS**

During speech recognition, the listener must segment a continuous acoustic signal into a sequence of discrete linguistic symbols, in units such as phonemes, syllables or words. The boundaries between phonemes, and especially syllables, are relatively well encoded by the speech envelope (Stevens, 2002; Ghitza, 2013, see also Cummins, 2012). Furthermore, the average syllabic rate ranges between 5 and 8 Hz across languages (Pellegrino et al., 2011) and the rate for stressed syllables is below 4 Hz for English (Greenberg et al., 2003). Therefore it has been hypothesized that neural entrainment to the speech envelope plays a role in creating a syllabic level, discrete, representation of speech (Giraud and Poeppel, 2012). In particular, it has been hypothesized that each cycle of the cortical theta oscillation (4–8 Hz) is aligned to the portion of the speech signal between two vowels, corresponding to two adjacent peaks in the speech envelope. Auditory features within a cycle of theta oscillation are then used to decode the phonetic information of speech (Ghitza, 2011, 2013). Therefore, according to this hypothesis, speech entrainment does not only passively track acoustic features but also reflects the language-based packaging of speech into syllable size chunks. Since syllables play different roles in segmenting syllable-timed languages and stress-timed languages (Cutler et al., 1986), cross-language research may further elucidate which of these neural processes are represented in envelope tracking activity.
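As a rough illustration of the parsing step, the sketch below cuts an envelope into chunks that each span two adjacent envelope peaks and reports the implied chunk rate. Treating strict local maxima as vowel nuclei is a simplifying assumption, and the function names and the example envelope are hypothetical.

```python
def envelope_peaks(envelope):
    """Indices of strict local maxima of the envelope (putative vowel nuclei)."""
    return [
        i for i in range(1, len(envelope) - 1)
        if envelope[i - 1] < envelope[i] > envelope[i + 1]
    ]

def parse_chunks(envelope):
    """Return (start, end) index pairs, each spanning two adjacent envelope peaks."""
    peaks = envelope_peaks(envelope)
    return list(zip(peaks, peaks[1:]))

def chunk_rate_hz(chunks, fs):
    """Mean chunk rate implied by the chunk durations, in Hz."""
    durations = [(end - start) / fs for start, end in chunks]
    return len(durations) / sum(durations)

# Hypothetical envelope sampled at 20 Hz with a peak every 4 samples.
fs = 20
envelope = [0, 1, 2, 1, 0, 1, 2, 1, 0, 1, 2, 1, 0]
chunks = parse_chunks(envelope)
rate = chunk_rate_hz(chunks, fs)
```

With these toy numbers the chunks recur five times per second, which falls inside the 4–8 Hz theta range discussed above.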

#### **SENSORY SELECTION HYPOTHESIS**

In everyday listening environments, speech is often embedded in a complex acoustic background. Therefore, to understand speech, a listener must segregate speech from the listening background and process it selectively. A useful strategy for the brain would be to find and selectively process moments in time (or spectro-temporal instances in a more general framework) that are dominated by speech and ignore the moments dominated by the background (Wang, 2005; Cooke, 2006). In other words, the brain might robustly encode speech by taking glimpses at the temporal (or spectro-temporal) features that contain critical speech information. The rhythmicity of speech (Schroeder and Lakatos, 2009; Giraud and Poeppel, 2012), and the temporal coherence between acoustic features (Shamma et al., 2011), are both reflected by the speech envelope and so become critical cues for the brain to decide where the useful speech information lies. Therefore, envelope entrainment may play a critical role in the neural segregation of speech and the listening background.

In a complex listening environment, cortical entrainment to speech has been found to be largely invariant to the listening background (Ding and Simon, 2012a; Ding and Simon, 2013a). Two possible functional roles have been hypothesized for the observed background-invariant envelope entrainment. One is that the brain uses temporal coherence to bind together acoustic features belonging to the same speech stream and envelope entrainment may reflect computations related to this coherence analysis (Shamma et al., 2011; Ding and Simon, 2012a). The other is that envelope entrainment is used by the brain to predict which moments contain more information about speech than the acoustic background and then guide the brain to selectively process those moments (Schroeder et al., 2008; Schroeder and Lakatos, 2009; Zion Golumbic et al., 2012).
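The temporal coherence idea, binding features whose strengths fluctuate together, can be sketched with a plain correlation between feature envelopes. This is a toy stand-in under stated assumptions: the Pearson correlation, the 0.8 threshold, and the three example channels are illustrative choices, not the analysis used in the cited studies.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def coherent_channels(channels, reference_index, threshold=0.8):
    """Indices of channels whose envelope co-varies with the reference channel."""
    reference = channels[reference_index]
    return [
        i for i, channel in enumerate(channels)
        if pearson(reference, channel) >= threshold
    ]

# Hypothetical feature envelopes: channels 0 and 1 share one rhythm (the target
# talker); channel 2 follows a different rhythm (the background).
channels = [
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 2, 0, 2, 0, 2, 0, 2],
    [1, 1, 0, 0, 1, 1, 0, 0],
]
stream = coherent_channels(channels, reference_index=0)
```

Channels 0 and 1 are grouped into one stream while channel 2 is left out, mirroring the idea that features belonging to the same talker are bound by their shared temporal modulation.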

#### **WHICH HYPOTHESIS IS TRUE? AN ANALYSIS-BY-SYNTHESIS ACCOUNT OF SPEECH PROCESSING**

Speech processing is a complicated process that can be roughly divided into an analysis stage and a synthesis stage. In the analysis stage, speech sounds are decomposed into primitive auditory features, a process that starts from the cochlea and applies mostly equally to the auditory encoding of both speech and non-speech sounds. A later synthesis stage, in contrast, combines multiple auditory features to create speech perception, including, e.g., binding spectro-temporal cues to determine phonemic categories, or integrating multiple acoustic cues to segregate a target speech stream from an acoustic background. The onset tracking hypothesis and the collective feature tracking hypothesis both view speech entrainment as a passive auditory encoding mechanism belonging to the analysis stage. Note, however, that the analysis stage does include some integration over separately represented features also. For example, neural processing of pitch and spectral modulations requires integrating information across frequency. Functionally, however, the purpose of integrating features in the analysis stage is to extract higher level auditory features rather than to construct linguistic/perceptual entities.

The syllabic parsing hypothesis and the sensory selection hypothesis propose functional roles of cortical entrainment in the synthesis stage. They hypothesize that cortical entrainment is involved in combining features into linguistic units, e.g., syllables, or perceptual units, e.g., speech streams (**Figure 1C**). These additional functional roles may be implemented in two ways: an active mechanism would be one in which entrained cortical activity, as a large-scale voltage fluctuation, directly regulates syllabic parsing or sensory selection (Schroeder et al., 2008; Schroeder and Lakatos, 2009). A passive mechanism would be one in which neural computations related to syllabic parsing or sensory selection generate spatially coherent neural signals that are measurable by macroscopic recording tools.

Although clearly distinct from one another, the four hypotheses may all be true for different functional areas of the brain and describe different neural generators for speech entrainment. Onset detection, feature tracking, syllabic parsing, and sensory selection are all neural computations necessary for speech recognition and all of them are likely to be synchronized to the speech rhythm carried by the envelope. Therefore, these neural computations may all be reflected by cortical entrainment to speech, and may only differ in their fine-scale neural generators. It remains unclear, however, whether these fine-scale neural generators can be resolved by macroscopic recording tools such as MEG and EEG.

Future studies are needed to explicitly test these hypotheses, or explicitly modify them, to determine which specific acoustic features and which specific psycholinguistic processes are relevant to cortical entrainment. For example, to dissociate the onset tracking hypothesis and the collective feature tracking hypothesis, one approach is to create explicit computational models for them and test which model would fit the data better. To test the syllabic parsing hypothesis, it will be important to calculate the correlation between cortical entrainment and relevant behavioral measures, e.g., misallocation of syllable boundaries (Woodfield and Akeroyd, 2010). To test the sensory selection hypothesis, stimuli that vary in their temporal probability or coherence among spectro-temporal features are likely to be revealing.

## **ENVELOPE ENTRAINMENT AND SPEECH INTELLIGIBILITY**

As indicated by its name, envelope entrainment is correlated with the speech envelope, an acoustic property of speech. Nevertheless, neural encoding of speech must underlie the ultimate goal of decoding its meaning. Therefore, it is critical to identify if cortical entrainment to speech is related to any behavioral measure during speech recognition, such as speech intelligibility.

#### **ENTRAINMENT AND ACOUSTIC MANIPULATION OF SPEECH**

A number of studies have compared cortical activity entrained to intelligible speech and unintelligible speech. One approach is to vary the acoustic stimulus and analyze how cortical entrainment changes within individual subjects. Some studies have found that cortical entrainment to normal sentences is similar to cortical entrainment to sentences that are played backward in time (Howard and Poeppel, 2010; Peña and Melloni, 2012; though see Gross et al., 2013).

A second way to reduce intelligibility is to introduce different types of acoustic interference. When speech is presented together with stationary noise, delta-band (1–4 Hz) cortical entrainment to the speech is found to be robust to noise until the listeners can barely hear speech, while theta-band (4–8 Hz) entrainment decreases gradually as the noise level increases (Ding and Simon, 2013a). In this way, theta-band entrainment is correlated with noise level and also speech intelligibility, but delta-band entrainment is not. When speech is presented together with a competing speech stream, cortical entrainment is found to be robust against the level of the competing speech stream even though intelligibility drops (Ding and Simon, 2012a; theta- and delta-band activity was not analyzed separately there).
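As an illustration of how band-specific entrainment might be quantified, the following minimal sketch correlates a stimulus envelope with a neural response separately in the delta (1–4 Hz) and theta (4–8 Hz) bands. It is not the analysis used in the cited studies: the FFT-masking filter, the function names, and the synthetic signals are all assumptions made for this example, and the latency of the neural response is ignored.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Zero-phase band-pass by zeroing FFT bins outside [lo, hi) Hz.
    Illustrative only; a real analysis would use a properly designed filter."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def band_entrainment(envelope, response, fs, band):
    """Pearson correlation between band-limited envelope and response."""
    lo, hi = band
    e = bandpass_fft(envelope, fs, lo, hi)
    r = bandpass_fft(response, fs, lo, hi)
    return float(np.corrcoef(e, r)[0, 1])

# Synthetic data (assumption): the envelope has a 2 Hz and a 6 Hz component,
# and the hypothetical response tracks only the 2 Hz component plus noise.
fs = 100.0
t = np.arange(2000) / fs
rng = np.random.default_rng(0)
envelope = np.sin(2 * np.pi * 2 * t) + np.sin(2 * np.pi * 6 * t)
response = np.sin(2 * np.pi * 2 * t) + rng.standard_normal(t.size)

delta_r = band_entrainment(envelope, response, fs, (1.0, 4.0))
theta_r = band_entrainment(envelope, response, fs, (4.0, 8.0))
```

In this toy case the simulated response follows only the slow component, so the delta-band correlation is high while the theta-band correlation stays near zero.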

A third way to reduce speech intelligibility is to degrade the spectral resolution through noise-vocoding, which destroys spectro-temporal fine structure but preserves the temporal envelope (Shannon et al., 1995). When the spectral resolution of speech decreases, it has been shown that theta-band cortical entrainment is reduced (Peelle et al., 2013; Ding et al., 2014) but delta-band entrainment is enhanced (Ding et al., 2014). In contrast, when background noise is added to speech and the speech-noise mixture is noise-vocoded, it is found that both delta- and theta-band entrainment are reduced by vocoding (Ding et al., 2014).
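The logic of noise-vocoding can be sketched as follows: the signal is split into frequency bands, the temporal envelope of each band is extracted, and each envelope then modulates band-limited noise. This is a simplified sketch under stated assumptions (FFT-masking filters, Hilbert envelopes without additional smoothing, arbitrary band edges, and a synthetic amplitude-modulated tone in place of speech); it does not reproduce the procedure of Shannon et al. (1995) or of the cited entrainment studies.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Zero-phase band-pass by zeroing FFT bins outside [lo, hi) Hz."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spectrum[(freqs < lo) | (freqs >= hi)] = 0.0
    return np.fft.irfft(spectrum, n=len(x))

def hilbert_envelope(x):
    """Magnitude of the analytic signal, computed via the FFT."""
    n = len(x)
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(np.fft.fft(x) * weights))

def noise_vocode(signal, fs, band_edges, seed=0):
    """Replace the fine structure in each band with noise, keeping the envelope."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(signal))
    out = np.zeros(len(signal))
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        envelope = hilbert_envelope(bandpass_fft(signal, fs, lo, hi))
        out += envelope * bandpass_fft(noise, fs, lo, hi)
    return out

# Toy stand-in for speech (assumption): a 1 kHz tone modulated at 4 Hz.
fs = 8000
t = np.arange(2 * fs) / fs
modulator = 1.0 + 0.8 * np.sin(2 * np.pi * 4 * t)
toy_signal = modulator * np.sin(2 * np.pi * 1000 * t)

# A single band removes all spectral detail but keeps the slow envelope.
vocoded = noise_vocode(toy_signal, fs, band_edges=[500, 2000])
env_in = bandpass_fft(hilbert_envelope(toy_signal), fs, 1, 8)
env_out = bandpass_fft(hilbert_envelope(vocoded), fs, 1, 8)
envelope_match = float(np.corrcoef(env_in, env_out)[0, 1])
```

Fewer, wider bands correspond to lower spectral resolution; the slow envelope of the vocoded output still closely follows that of the input, which is the property that makes vocoding useful for dissociating envelope tracking from spectral detail.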

A fourth way to vary speech intelligibility is to directly manipulate the temporal envelope (Doelling et al., 2014). When the temporal envelope in the delta-theta frequency range is corrupted, cortical entrainment in the corresponding frequency bands degrades and so does speech intelligibility. When a processed speech envelope is used to modulate a broadband noise carrier, the stimulus is not intelligible but reliable cortical entrainment is nevertheless seen.

In many of these studies investigating the correlation between cortical entrainment and intelligibility, a common issue is that stimuli which differ in intelligibility also differ in acoustic properties. This makes it difficult to determine whether changes in cortical entrainment arise from changes in speech intelligibility or from changes in acoustic properties. For example, speech syllables generally have a sharper onset than offset, so reversing speech in time changes those temporal characteristics. Similarly, when the spectral resolution is reduced, neurons tuned to fine spectral features are likely to be deactivated. Therefore, based on the studies reviewed here, it can only be tentatively concluded that, when critical speech features are manipulated, speech intelligibility and theta-band entrainment are affected in similar ways while delta-band entrainment is not. It remains unclear whether speech intelligibility causally modulates cortical entrainment or whether auditory encoding, reflected by cortical entrainment, influences downstream language processing and therefore becomes indirectly related to intelligibility.

#### **VARIABILITY BETWEEN LISTENERS**

A second approach to address the correlation between neural entrainment and speech intelligibility is to investigate the variability across listeners. Peña and Melloni (2012) compared neural responses in listeners who speak the tested language and listeners who do not speak the tested language. It was found that language understanding does not significantly change the low-frequency neural responses, but it does change high-gamma band neural activity. Within the group of native speakers, the intelligibility score still varied broadly in the challenging listening conditions. Delta-band, but not theta-band, cortical entrainment has been shown to correlate with intelligibility scores for individual listeners in a number of studies (Ding and Simon, 2013a; Ding et al., 2014; Doelling et al., 2014). The advantage of investigating inter-subject variability is that it avoids modifications of the sound stimuli. Nevertheless, it still cannot identify whether the individual differences in speech recognition arise from the individual differences in auditory processing (Ruggles et al., 2011), language related processing, or cognitive control.

The speech intelligibility approach, in general, suffers from the drawback that intelligibility is the end point of the entire speech recognition chain and is not targeted at specific linguistic computations, e.g., allocating the boundaries between syllables. Furthermore, when the acoustic properties of speech are degraded, speech recognition requires additional cognitive control and the involved neural processing networks adapt (Du et al., 2011; Wild et al., 2012; Erb et al., 2013; Lee et al., 2014). Therefore, just from a change in speech intelligibility, it is difficult to trace what kinds of neural processing are affected.

#### **DISTINCTIONS BETWEEN DELTA- AND THETA-BAND ENTRAINMENT**

To summarize these different approaches, when the acoustic properties of speech are manipulated, theta-band entrainment often shows changes that correlate with speech intelligibility. For the same stimulus, however, the speech intelligibility measured from individual listeners is often correlated with delta-band entrainment. To explain this dichotomy, here we hypothesize that theta-band entrainment encodes syllabic-level acoustic features critical for speech recognition, while delta-band entrainment is more closely related to the perceived acoustic rhythm rather than the phonemic information of speech. This hypothesis is also consistent with the fact that speech modulations between 4 and 8 Hz are critical for intelligibility (Drullman et al., 1994a,b; Elliott and Theunissen, 2009) while temporal modulations below 4 Hz include prosodic information of speech (Goswami and Leong, 2013) and fall in the frequency range important for music rhythm perception (Patel, 2008; Farbood et al., 2013).

#### **ENVELOPE ENTRAINMENT TO NON-SPEECH SOUNDS**

Although speech envelope entrainment may show correlated changes with speech intelligibility when the acoustic properties of speech are manipulated, speech intelligibility is probably not a major driving force for envelope entrainment. A critical piece of evidence is that envelope entrainment can be observed for non-speech sounds in humans and for both speech and non-speech sounds in animals. Here, we briefly review human studies on envelope entrainment for non-speech sounds (see e.g., Steinschneider et al., 2013 for a comparison between envelope entrainment in human and animal models).

Traditionally, envelope entrainment has been studied using the auditory steady-state response (aSSR), a periodic neural response tracking the stimulus repetition rate or modulation rate. An aSSR at a given frequency can be elicited by, e.g., a click or tone-pip train repeating at the same frequency (Nourski et al., 2009; Xiang et al., 2010), and by amplitude or frequency modulation at that frequency (Picton et al., 1987; Ross et al., 2000; Wang et al., 2012). Although the cortical aSSR can be elicited in a broad frequency range (up to ∼100 Hz), speech envelope entrainment is likely to be related to the slow aSSR in the corresponding frequency range, i.e., below 10 Hz (see Picton, 2007 for a review of the robust aSSR of 40 Hz and above). More recently, cortical entrainment has also been demonstrated for sounds modulated by an irregular envelope (Lalor et al., 2009). Low-frequency (<10 Hz) cortical entrainment to non-speech sound shares many properties with cortical entrainment to speech. For example, when envelope entrainment is modeled using a linear system-theoretic model, the neural response is qualitatively similar for speech (Power et al., 2012) and amplitude-modulated tones (Lalor et al., 2009). Furthermore, low-frequency (<10 Hz) cortical entrainment to non-speech sound is also strongly modulated by attention (Elhilali et al., 2009; Power et al., 2010; Xiang et al., 2010), and the phase of entrained activity is predictive of listeners' performance in some sound-feature detection tasks (Henry and Obleser, 2012; Ng et al., 2012).
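The linear system-theoretic approach mentioned above treats the neural response as the stimulus envelope convolved with an unknown kernel, often called a temporal response function (TRF), plus noise. The following minimal sketch estimates such a kernel by ridge regression on synthetic data; the kernel shape, the regularization value, and the function names are illustrative assumptions and do not reproduce the methods of Lalor et al. (2009) or Power et al. (2012).

```python
import numpy as np

def lagged_design(stimulus, n_lags):
    """Design matrix whose column k holds the stimulus delayed by k samples."""
    n = len(stimulus)
    X = np.zeros((n, n_lags))
    for lag in range(n_lags):
        X[lag:, lag] = stimulus[:n - lag]
    return X

def estimate_trf(stimulus, response, n_lags, ridge=1.0):
    """Ridge-regression estimate of the kernel mapping stimulus to response."""
    X = lagged_design(stimulus, n_lags)
    gram = X.T @ X + ridge * np.eye(n_lags)
    return np.linalg.solve(gram, X.T @ response)

# Synthetic data (assumption): an irregular envelope drives the response
# through a hypothetical 7-sample kernel, with additive noise.
rng = np.random.default_rng(1)
envelope = rng.standard_normal(3000)
true_trf = np.array([0.0, 0.5, 1.0, 0.5, 0.0, -0.3, -0.1])
clean = np.convolve(envelope, true_trf)[:len(envelope)]
response = clean + 0.5 * rng.standard_normal(len(envelope))

trf_hat = estimate_trf(envelope, response, n_lags=len(true_trf), ridge=1.0)
```

Comparing kernels estimated separately for speech and for non-speech stimuli is one way in which the qualitative similarity described above could be examined.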

## **SUMMARY**

Cortical entrainment to the speech envelope provides a powerful tool to investigate online neural processing of continuous speech. It greatly extends the traditional event-related approach that can only be applied to analyze the response to isolated syllables or words. Although envelope entrainment has attracted researchers' attention in the last decade, it is still a less well-characterized cortical response than event-related responses. The basic phenomenon of envelope entrainment has been reliably seen in EEG, MEG, and ECoG, even at the single-trial level (Ding and Simon, 2012a; O'Sullivan et al., 2014). Hypotheses have been proposed about the neural mechanisms generating cortical entrainment and its functional roles, but these hypotheses remain to be explicitly tested. To test these hypotheses, a computational modeling approach is likely to be effective. For example, rather than just calculating the correlation between neural activity and the speech envelope, more explicit computational models can be proposed and used to fit the data (e.g., Ding and Simon, 2013a). Furthermore, to understand what linguistic computations are achieved by entrained cortical activity, more fine-scaled behavioral measures are likely to be required, e.g., measures related to syllable boundary allocation rather than the general measure of intelligibility. Finally, the anatomical, temporal, and spectral specifics of cortical entrainment should be taken into account when discussing its functional roles (Peña and Melloni, 2012; Zion Golumbic et al., 2013; Ding et al., 2014).

#### **AUTHOR CONTRIBUTIONS**

Nai Ding and Jonathan Z. Simon wrote and approved the paper.

#### **ACKNOWLEDGMENT**

The work is supported by NIH grant R01 DC 008342.

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 10 March 2014; accepted: 27 April 2014; published online: 28 May 2014.*

*Citation: Ding N and Simon JZ (2014) Cortical entrainment to continuous speech: functional roles and interpretations. Front. Hum. Neurosci. 8:311. doi: 10.3389/fnhum.2014.00311*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Ding and Simon. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Cortical alpha oscillations as a tool for auditory selective inhibition

#### *Antje Strauß<sup>1</sup>\*<sup>†</sup>, Malte Wöstmann<sup>1,2†</sup> and Jonas Obleser<sup>1</sup>*

*<sup>1</sup> Max Planck Research Group "Auditory Cognition", Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany <sup>2</sup> International Max Planck Research School on Neuroscience of Communication, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany*

#### *Edited by:*

*Carolyn McGettigan, Royal Holloway University of London, UK*

#### *Reviewed by:*

*Johanna M. Zumer, University of Birmingham, UK Rebecca E. Millman, York NeuroImaging Centre, UK*

#### *\*Correspondence:*

*Antje Strauß, Max Planck Research Group "Auditory Cognition", Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstraße 1A, 04103 Leipzig, Germany e-mail: strauss@cbs.mpg.de*

*†These authors have contributed equally to this work.*

Listening to speech is often demanding because of signal degradations and the presence of distracting sounds (i.e., "noise"). The question of how the brain achieves the task of extracting only relevant information from the mixture of sounds reaching the ear (i.e., "cocktail party problem") is still open. In analogy to recent findings in vision, we propose cortical alpha (∼10 Hz) oscillations measurable using M/EEG as a pivotal mechanism to selectively inhibit the processing of noise to improve auditory selective attention to task-relevant signals. We review initial evidence of enhanced alpha activity in selective listening tasks, suggesting a significant role of alpha-modulated noise suppression in speech. We discuss the importance of dissociating between noise interference in the auditory periphery (i.e., energetic masking) and noise interference with more central cognitive aspects of speech processing (i.e., informational masking). Finally, we point out the adverse effects of age-related hearing loss and/or cognitive decline on auditory selective inhibition. With this perspective article, we set the stage for future studies on the inhibitory role of alpha oscillations for speech processing in challenging listening situations.

**Keywords: alpha, neural oscillations, effortful listening, inhibition, masking, speech, aging, hearing loss**

## **1. INTRODUCTION**

In ecological listening situations, auditory signals are rarely perceived in quiet due to the presence of different auditory maskers such as distracting background speech or environmental noise. Thus, sounds from different sources greatly overlap spectro-temporally at the level of the listener's ear. What are the neural correlates that facilitate selective listening to relevant target signals despite irrelevant auditory input (i.e., the "cocktail party problem"; Cherry, 1953)? At the central neural level, two complementary mechanisms of top–down control (i.e., regulation of subsidiary cognitive processes) should be considered: First, top–down selective attention to relevant information (Fritz et al., 2007) could facilitate target processing by enhancing the neural response to the attended stream (i.e., gain control; Lee et al., 2013). Second, top–down selective inhibition of maskers (Melara et al., 2002) could help to direct limited processing capacities away from irrelevant information (Desimone and Duncan, 1995), thereby avoiding full processing of distractors (Foxe and Snyder, 2011).

In this regard, interference of auditory maskers might be the result of both insufficient attention to the target and poor inhibition of noise and distractors. In this perspective article we focus on the latter, that is, neural mechanisms of auditory selective inhibition. We propose that cortical alpha (∼10 Hz) oscillations are an important tool for top–down control as they regulate the inhibition of masker information during speech processing in challenging listening situations.

#### **2. THE FUNCTIONAL SIGNIFICANCE OF ALPHA OSCILLATIONS**

Neural oscillations in the alpha frequency range (∼10 Hz) are the most dominant signal measurable in the human magneto- and electroencephalogram (M/EEG), going back to their first description by Berger (1931). The earliest observations of the alpha rhythm revealed that its amplitude is enhanced in humans who are awake but not actively engaged in any task. This finding led initially to the view that high alpha power might simply reflect the default state of brain inactivity or "cortical idling" (for a review, see Pfurtscheller et al., 1996).

Only within the last two decades has the functional significance of alpha oscillations been recognized, along with their ubiquitous role across sensory modalities (visual: for review see Mathewson et al., 2011; sensorimotor: e.g., Haegens et al., 2012; auditory: e.g., Hartmann et al., 2012) and cognitive tasks (working memory: e.g., Jensen et al., 2002; attention: for a review see Klimesch, 2012; decision making: e.g., Cohen et al., 2009). One unifying mechanism suggested for alpha rhythms across modalities and brain areas is that they provide a neural means to functionally inhibit the processing of currently task-irrelevant or task-detrimental information (Jensen and Mazaheri, 2010; Foxe and Snyder, 2011). Please note that the opposite mechanism has also been proposed, in which higher inter-areal alpha phase synchronization does not index cortical inhibition but increased information processing, such as for internal (working memory related) information processes (Palva and Palva, 2011). The functional inhibition hypothesis, though, has received neurophysiological support. For example, both alpha power (i.e., squared amplitude) and alpha phase modulate neuronal spike rate (Haegens et al., 2011) and thus can directly affect the efficiency of neural information flow. In future, the alpha network needs to be further characterized by its phase–amplitude coupling to gamma oscillations (Jensen et al., 2012) and its role in top–down control as implemented in different cortical layers (Buffalo et al., 2011; Spaak et al., 2012) or in thalamo-cortical communication (Strauss et al., 2010; Roux et al., 2013).

Despite the abundance of studies on the role of alpha activity for visual selective inhibition, there are currently few studies that directly examine the role of alpha activity in the auditory modality. Recently, a series of studies found modulations in alpha power in a variety of auditory tasks prompted by degraded spectral detail (Obleser and Weisz, 2012), missing temporal expectations (Wilsch et al., 2014), working memory load (Leiberg et al., 2006; Obleser et al., 2012), or syntactic complexity (Meyer et al., 2013). Together, these findings provide good evidence that alpha oscillatory power can be a reliable indicator of auditory cognitive load (see also Luo et al., 2005; Kaiser et al., 2007). In the following section, we argue that part of this cognitive load occurs due to auditory selective inhibition as a compensatory mechanism for demanding listening situations and manifests in enhanced alpha power.

#### **3. ALPHA OSCILLATIONS AS A TOOL FOR AUDITORY SELECTIVE INHIBITION**

A common observation from our laboratory is a prominent increase in alpha power when participants listen to auditory materials presented against background noise (e.g., Wilsch et al., 2014). **Figure 1A**, for example, shows the grand average alpha power of 11 participants during a lexical decision task on isolated words presented in quiet (published in Strauß et al., 2014) and in white noise. For words in quiet, alpha power at around 10 Hz did not considerably increase after word onset. However, when words were presented in noise, alpha power was increased during the first 500 ms after word onset, corresponding to the first two thirds of the average word duration. This effect was strongest over temporal and occipital sites (topography in **Figure 1A**), suggesting the inhibition of the task-irrelevant visual modality but also compensatory mechanisms within speech-related areas. Critically, the alpha power difference did not depend on ITPC (inter-trial phase coherence) differences, as indicated by the absence of a stronger ITPC in noise compared to quiet (**Figure 1B**). In fact, no significant ITPC differences were observed between 0.2 and 0.5 s. We therefore presume that induced (i.e., not strictly stimulus-locked; Freunberger et al., 2009) alpha power is crucial for speech processing in challenging listening conditions as it suppresses irrelevant information.
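The distinction drawn above between power and phase-locking can be made concrete with a small sketch. Given complex Morlet wavelet coefficients for each trial, power is the squared magnitude of each coefficient, whereas ITPC is the magnitude of the trial-averaged unit-length phase vectors. The wavelet width (`n_cycles`), the sampling rate, and the synthetic trials below are assumptions for illustration and are not the parameters of the reported analysis.

```python
import numpy as np

def morlet_coefficients(trials, fs, freq, n_cycles=7):
    """Convolve each trial (one per row) with a complex Morlet wavelet at `freq` Hz."""
    sigma_t = n_cycles / (2 * np.pi * freq)
    half = int(4 * sigma_t * fs)
    t = np.arange(-half, half + 1) / fs
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma_t**2))
    wavelet /= np.sum(np.abs(wavelet))
    return np.array([np.convolve(trial, wavelet, mode="same") for trial in trials])

def power_and_itpc(coefficients):
    """Trial-averaged power and inter-trial phase coherence per time point."""
    power = np.mean(np.abs(coefficients) ** 2, axis=0)
    itpc = np.abs(np.mean(coefficients / np.abs(coefficients), axis=0))
    return power, itpc

# Synthetic trials (assumption): a 10 Hz oscillation that is either
# phase-locked across trials or has a random phase on every trial.
fs, n_trials = 250, 50
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
locked = np.cos(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal((n_trials, t.size))
phases = rng.uniform(0, 2 * np.pi, size=(n_trials, 1))
induced = np.cos(2 * np.pi * 10 * t + phases) + 0.5 * rng.standard_normal((n_trials, t.size))

power_locked, itpc_locked = power_and_itpc(morlet_coefficients(locked, fs, 10.0))
power_induced, itpc_induced = power_and_itpc(morlet_coefficients(induced, fs, 10.0))
mid = t.size // 2  # a time point away from the edges of the trial
```

In this toy example both trial sets carry comparable alpha power, but only the phase-locked set yields a high ITPC. A power increase without an accompanying ITPC increase is therefore compatible with induced rather than stimulus-locked activity.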

**FIGURE 1 | The proposed role of alpha activity for speech processing in noise. (A)** Average absolute alpha power of 11 participants performing a lexical decision task on words in quiet (top) and in white noise (bottom). SNRs were titrated individually using a two-down-one-up staircase adaptive tracking procedure. Average SNR was −10.22 dB ±1.95 (*SD*) such that participants performed about 71% correct. Speech onset is indicated by the black vertical line at 0 s; average word length = 750 ms; EEG recorded from 61 scalp electrodes; time-frequency analysis using Morlet wavelets. Plots show measures of absolute power averaged over all scalp electrodes. Topography depicts the alpha power difference for speech in noise–quiet. Data were SCD (source current density)-transformed before power estimation to improve spatial resolution. **(B)** Inter-trial phase coherence (ITPC) as a measure of phase-locking of oscillations over trials. ITPC is bound between 0 and 1; higher ITPC values indicate stronger phase alignment across trials. **(C)** A simple framework of alpha oscillations for speech processing in noise. Acoustic signals overlap energetically as they enter the ear. At the brain level, features of speech and noise are processed as far as possible in distinct processing channels (depicted here with arrows; for details see text). High alpha power inhibits channels processing noise features to allow for an optimal task performance with minimized noise interference.
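The performance level of about 71% mentioned in the caption follows from the staircase rule itself. Under the usual reading of a two-down-one-up rule (the SNR is lowered after two consecutive correct responses and raised after each error), the track is in equilibrium when the probability of two consecutive correct responses equals 0.5, i.e., p² = 0.5 and p ≈ 0.707. The sketch below encodes this rule; the 2 dB step size, the starting SNR, and the response sequence are hypothetical, since the caption does not specify them.

```python
def staircase_target(n_down):
    """Proportion correct at which an n-down-one-up track is in equilibrium:
    p ** n_down = 0.5."""
    return 0.5 ** (1.0 / n_down)

def next_snr(snr_db, correct, run_length, step_db=2.0, n_down=2):
    """Apply one staircase update.

    Returns the new SNR and the updated count of consecutive correct responses.
    """
    if not correct:
        return snr_db + step_db, 0       # one error: make the task easier
    run_length += 1
    if run_length == n_down:
        return snr_db - step_db, 0       # n_down correct in a row: make it harder
    return snr_db, run_length

# Hypothetical response sequence, starting at 0 dB SNR.
responses = [True, True, False, True, True, True, True]
snr, run = 0.0, 0
track = []
for correct in responses:
    snr, run = next_snr(snr, correct, run)
    track.append(snr)

target = staircase_target(2)  # about 0.707
```

With the responses above, the SNR drops after each pair of consecutive correct answers and rises after the single error, which is the behavior that drives the track towards the equilibrium performance level.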

**Figure 1C** illustrates a tentative framework for how alpha oscillations could support auditory selective inhibition. Sounds arriving at the listener's ear must be further processed in the brain to extract task-relevant information. One way to think about the proposed mechanism is in terms of auditory object selection, which requires object formation in the first place (Shinn-Cunningham, 2008). An auditory object might be formed on the basis of common spectro-temporal features, harmonicity, simultaneous onsets, or spatial grouping (Griffiths and Warren, 2004; Bizley and Cohen, 2013). We refer to all these different features used to form auditory objects as "channels" of auditory information, represented by the arrows in **Figure 1C**. The concept of channels has a long tradition (Broadbent, 1958) and is inspired by the clearest distinction of target and distractor, used in many dichotic listening paradigms, where left- and right-ear channels need to be separated. Nevertheless, channels in our framework should be conceived as functional auditory processing units rather than anatomical pathways. As soon as these channels are defined, attention or inhibition can be selectively applied, given attentionally flexible fields in the auditory cortices (Petkov et al., 2004). Note that even though in the visual modality claims about alpha oscillations in feature-based (Romei et al., 2012) and object-based (Kinsey et al., 2011) attention have been made, we do not make any assumption about this distinction in our framework and use the term "channels" for both features and objects, or early and late selection.

If speech is presented in quiet (**Figure 1C**, top panel), alpha power is low in channels processing features of the speech signal to support processing of task-relevant information. Accordingly, the net resulting alpha power in the M/EEG would remain at baseline level (**Figure 1A**) and decrease during word integration (>400 ms). If, however, speech is presented in the presence of maskers (e.g., environmental noise, distracting talkers; **Figure 1C**, bottom panel), alpha power needs to be up-regulated first in those channels processing noise features before it is suppressed during word integration (**Figure 1A**). Enhanced alpha activity inhibits processing of noise and thereby "protects" (Klimesch, 1999; Roux and Uhlhaas, 2014) the task- or performance-relevant information in the speech signal from noise interference.

Importantly, the up-regulation of alpha power in channels that process noise is not an automatic ("bottom–up") process but critically depends on "top–down" attentional control. For instance, in a multi-talker situation, target and distracting talker continually switch roles as the listener decides to change the conversational partner. In such a situation, M/EEG alpha power would be constantly at a high level; however, the deployment of alpha power onto the different processing channels would be changing continuously.

What is the functional role of high alpha activity for word processing in noise? To answer this question, it is essential to distinguish interpretations in which alpha activity is related to target processing from those in which it is related to noise processing. It is possible that the reduced intelligibility of words in noise leads to sub-optimal word processing and thus to less alpha suppression in brain areas relevant for speech processing (Strauß et al., 2014). The inverse mechanism, which we put forward in the current framework, is equally likely: alpha power is enhanced for temporarily irrelevant information and thereby compensates for perceived cognitive effort (increased when listening to speech in noise: Larsby et al., 2005; Helfer et al., 2010; Zekveld et al., 2011). In this regard, alpha would "protect" the lexical processes from noise interference. The challenge will be to experimentally dissect these (not mutually exclusive) mechanisms. We now review initial evidence for alpha's inhibitory role in audition.

Currently, there are only a few studies that show alpha power modulations when participants simultaneously listen to two auditory streams, that is, one signal and one masker. In one study by Kerlin et al. (2010), participants were simultaneously listening to two spatially separated speech streams. On each trial, an initial visual cue indicated whether they were supposed to attend the left or right stream. During speech presentation, EEG alpha power was enhanced over the cerebral hemisphere contralateral to the masker, while alpha power was reduced contralateral to the to-be-attended stream. The authors concluded that this alpha lateralization indexes the direction of auditory attention to speech in space. Importantly, this finding corroborates our view that enhanced alpha power in brain areas engaged in distractor processing decreases further processing of the distractor and hence facilitates processing of the target signal. However, two questions arise from this study: First, as the direction of auditory attention was cued visually in this study, it might be that the alpha lateralization indicates the allocation of supramodal rather than auditory selective attention (Farah et al., 1989). Second, spatial attention may play a special role, not least because of auditory processing models suggesting separate what- and where-pathways (Rauschecker and Scott, 2009).

In three other recent studies, alpha power modulations were consistently found during the anticipation of auditory target signals from the left or right (Banerjee et al., 2011; Müller and Weisz, 2012; Ahveninen et al., 2013). In these studies, participants were cued to attend either the auditory event on the left or right, and to ignore the distractor on the other side. Alpha power was enhanced during the anticipation of auditory stimulation contralateral to the distractor. These results demonstrate alpha lateralization effects already during the preparation for an auditory selective listening task. This is in line with studies reporting high pre-stimulus alpha power when participants are about to miss a (visual) target (van Dijk et al., 2008; Busch et al., 2009; Romei et al., 2010). In terms of our framework (**Figure 1C**), anticipatory high alpha power successfully blocks in-depth processing of sensory information that might lead to missing the target.

However, interpretations of these studies are limited for our model, since alpha power modulations were found only during the anticipation but not during the actual processing of competing auditory streams. More data are clearly needed on the peri-stimulus alpha dynamics. As the spatial resolution of M/EEG is limited, prospective experiments could induce alpha oscillations over specific brain areas using transcranial alternating current stimulation (tACS) to assess the influence of alpha modulations on listening success under adverse acoustic conditions. Moreover, future studies could record the electrocorticogram (ECoG) directly from the cortical surface to track alpha sources and reveal the interplay between frequency bands. Such higher spatial resolution would make it possible to differentiate between alpha activity in brain regions associated with processing the masker or the signal. As of now, we are left to speculate how spatially specific alpha oscillations might operate, for example along a cochleotopic gradient in primary auditory cortex. The best data from which to draw inferences stem from visual cortex, where, for example, Buffalo and colleagues recorded with two electrode tips in attended vs. non-attended receptive fields less than a millimeter apart and reported attention-dependent, opposing, and deep-layer-specific alpha changes (expressed as alpha spike-field coherence; Buffalo et al., 2011). Comparable data are, to our knowledge, still missing for auditory areas.

In the next two sections, we elaborate, first, on the levels of auditory processing at which alpha power might be deployed for the inhibition of different kinds of auditory maskers and, second, on how age and hearing loss might affect auditory selective inhibition.

### **4. MASKING RELEASE VIA ALPHA ENHANCEMENT ALONG THE AUDITORY PATHWAY**

So far, we have shown that alpha oscillations are an attractive neural candidate mechanism of selective auditory inhibition. There are different aspects which need to be systematically investigated in order to determine the role of alpha: Which neural circuits "deploy" or trigger high-alpha states? And in terms of the current framework: What kind of channels can be attenuated by enhanced alpha power?

Currently, there are few studies mapping the sources of alpha power during masked auditory processing. Some evidence has accumulated showing noise-invariant representations of the signal in auditory cortices (Chang et al., 2010; Ding and Simon, 2012) with the degree of invariance increasing from peripheral to cortical processing stages (Rabinowitz et al., 2013). If we assume that alpha is an important central mechanism to inhibit various types of maskers, these studies suggest that masking release via alpha enhancement might occur as early as in primary auditory cortex. A first direct hint to this idea might be the case of an illusory sound percept like tinnitus, which can be centrally suppressed by means of increasing alpha power in primary auditory cortex (Leske et al., 2013; Weisz et al., 2014). This is in line with research showing that attention modulates activity in sensory cortices corresponding to the modality of the stimulus (e.g., Heinrich et al., 2011; Wild et al., 2012). Thus, alpha activity in primary auditory cortex might be crucially contributing to inhibiting the formation of auditory objects.

In future studies investigating underlying alpha sources, a distinction between energetic and informational masking might be crucial (Brungart et al., 2001; Mattys et al., 2009; Scott and McGettigan, 2013; for a more comprehensive overview of potential adverse listening conditions see Mattys et al., 2012). Energetic masking describes the competition of auditory target and masker in the auditory periphery due to spectro-temporal overlay of the two signals, causing an overlap of excitation patterns in the cochlea and auditory nerve (Durlach et al., 2003). One type of background signal often assumed to cause primarily energetic masking is white noise (e.g., Arbogast et al., 2005) which is quasi-stationary and has high energy in a broad frequency range (for discussion see Stone et al., 2012). Although informational masking is sometimes defined only negatively as all masking effects not accounted for by energetic masking (cf. Gutschalk et al., 2008), a more refined definition is required, especially when it comes to speech processing. When target speech is masked by a competing talker, it is not just the energetic overlap of the two signals that causes masker interference. Rather, the speech masker initiates phonetic and semantic processing that interferes with the linguistic processing of the target (Schneider et al., 2007). Thus, informational masking describes the interference of target and masker at a more central, cognitive level, whereas energetic masking refers to energetic overlap in the auditory periphery.

According to the framework described above, alpha oscillations might be important for inhibition of both types of maskers, albeit in different brain areas. We presume that energetic maskers are inhibited by enhanced alpha activity in auditory cortex (Müller and Weisz, 2012). In contrast, processing of informational maskers like competing speech should rather be inhibited by alpha activity in higher auditory areas relevant for linguistic processing, such as posterior superior temporal gyrus (pSTG) and beyond (Scott et al., 2004, 2009). In addition to the proposed inhibition of auditory input, alpha oscillations are involved in supramodal or crossmodal inhibition of the currently task-irrelevant modality (Banerjee et al., 2011).

#### **5. EFFECTS OF AGE AND HEARING LOSS ON AUDITORY DISTRACTOR INHIBITION**

In acoustically demanding multi-talker situations, older listeners typically experience more difficulties compared with younger adults. It is, however, unclear to what extent these difficulties are caused by age-related decline in perceptual auditory acuity (hearing loss or loss of temporal and spectral resolution; Fostick and Babkoff, 2013), decline of cognitive functioning with age, or both (Wingfield et al., 2005). Crucially for the present framework, both auditory perceptual and cognitive decline could lead to insufficient masker inhibition. First, compared with normal-hearing controls, listeners with hearing loss are less successful in utilizing spectral (Lorenzi et al., 2006), temporal (Tremblay et al., 2003), and spatial auditory cues (Neher et al., 2009) important for the perceptual segregation of different sound sources. Thus, attending to relevant and inhibiting irrelevant sound sources is impaired, as auditory features are lacking to distinguish the different sound sources in the first place (Shinn-Cunningham and Best, 2008). Second, age negatively affects many aspects of cognitive functioning (Park et al., 2003), among them the ability to suppress irrelevant but salient auditory distractors (Chao and Knight, 1997; Tun et al., 2002; Passow et al., 2014). Thus, even if the perceptual segregation of sound sources is accomplished successfully, the insufficient inhibition of maskers may cause interference.

In line with prior studies that found age effects on brain oscillatory activity in the alpha frequency range (Yordanova et al., 1998; Klimesch, 1999; Böttger et al., 2002), we consider it valuable to investigate alpha oscillations in demanding listening tasks as an indicator of age-dependent auditory cognitive effort of masker inhibition. We presume that auditory selective inhibition, realized by alpha activity in channels relevant for masker processing (**Figure 1C**), serves as a compensatory mechanism as multi-talker listening conditions become more demanding, for instance due to a decreasing signal-to-noise ratio (SNR). The study of alpha oscillations could help to reveal how listeners of different ages exert top–down attentional control to facilitate processing of task-relevant signals and inhibit processing of interfering maskers. In particular, this line of research might foster the understanding of why older listeners find it more exhausting to participate in cocktail party-like listening situations compared with younger listeners (Pichora-Fuller, 2003).

#### **6. CONCLUSIONS**

In this perspective article, we have presented a framework for studying alpha oscillations as a tool for auditory selective inhibition in challenging listening situations. We have presented initial evidence qualifying alpha oscillations as a pivotal mechanism affecting listening in multi-talker situations. Future studies could expand these findings and study the role of alpha oscillations (1) during speech perception in ecologically valid listening situations, (2) in the presence of energetic and informational maskers, and (3) for aging and hearing-impaired listeners.

#### **ACKNOWLEDGMENTS**

Antje Strauß, Malte Wöstmann, and Jonas Obleser are supported by a Max Planck Research Grant to Jonas Obleser. The authors are grateful for in-depth discussions with the members of the Max Planck Research Group "Auditory Cognition" during manuscript preparation.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 27 February 2014; accepted: 08 May 2014; published online: 28 May 2014. Citation: Strauß A, Wöstmann M and Obleser J (2014) Cortical alpha oscillations as a tool for auditory selective inhibition. Front. Hum. Neurosci. 8:350. doi: 10.3389/fnhum.2014.00350*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Strauß, Wöstmann and Obleser. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Induction of plasticity in the human motor cortex by pairing an auditory stimulus with TMS

**Paul F. Sowman<sup>1,2,3</sup>\*, Søren S. Dueholm<sup>1,4</sup>, Jesper H. Rasmussen<sup>1,4</sup> and Natalie Mrachacz-Kersting<sup>4</sup>**

<sup>1</sup> Department of Cognitive Science, Macquarie University, Sydney, NSW, Australia

<sup>2</sup> Perception and Action Research Centre (PARC), Faculty of Human Sciences, Macquarie University, Sydney, NSW, Australia

<sup>3</sup> Australian Research Council Centre of Excellence in Cognition and its Disorders (CCD), Macquarie University, Sydney, NSW, Australia

<sup>4</sup> Department of Health Science and Technology, Center for Sensory-Motor Interaction (SMI), Aalborg University, Aalborg, Denmark

#### **Edited by:**

Carolyn McGettigan, Royal Holloway University of London, UK

#### **Reviewed by:**

Alessandro D'Ausilio, Italian Institute of Technology, Italy

Dan Kennedy-Higgins, University College London–Speech, Hearing and Phonetic Sciences, UK

#### **\*Correspondence:**

Paul F. Sowman, Department of Cognitive Science, Level 3 Australian Hearing Hub, 16 University Drive, Macquarie University, Sydney, NSW 2109, Australia e-mail: paul.sowman@mq.edu.au

Acoustic stimuli can cause a transient increase in the excitability of the motor cortex. The current study leverages this phenomenon to develop a method for testing the integrity of auditorimotor integration and the capacity for auditorimotor plasticity. We demonstrate that appropriately timed transcranial magnetic stimulation (TMS) of the hand area, paired with auditorily mediated excitation of the motor cortex, induces an enhancement of motor cortex excitability that lasts beyond the time of stimulation. This result demonstrates for the first time that paired associative stimulation (PAS)-induced plasticity within the motor cortex is applicable with auditory stimuli. We propose that the method developed here might provide a useful tool for future studies that measure auditory-motor connectivity in communication disorders.

**Keywords: paired associative stimulation, transcranial magnetic stimulation, auditory motor integration, speech sounds, plasticity, motor cortex, auditory cortex**

#### **INTRODUCTION**

Paired associative stimulation (PAS) is a technique used to experimentally induce long-lasting changes in cortical excitability (Stefan et al., 2002; Ridding and Flavel, 2006; Mrachacz-Kersting et al., 2007; Murakami et al., 2008; Kumpulainen et al., 2012). Most commonly in PAS studies, electrical stimulation of the median nerve is paired with transcranial magnetic stimulation (TMS) of the contralateral motor cortex. The nerve impulse resulting from the somatosensory stimulus can be timed to arrive at the cortical level milliseconds prior to the TMS pulse in order to induce a long-lasting increase in excitability, a process that is thought to be mediated by a Hebbian long-term potentiation (LTP)-like process (Stefan et al., 2002).

In recent years, modified PAS protocols have been designed that apply more ecologically valid stimuli in place of either the TMS or the electrical somatosensory stimulation, e.g., TMS paired with movement (Thabit et al., 2010) or electrical somatosensory stimulation paired with motor imagery (Mrachacz-Kersting et al., 2012). PAS protocols have also moved beyond ubiquitous sensorimotor associations to demonstrate that pairing a TMS-induced cortical activation outside the motor cortex with a homotopic sensory activation can induce enhanced responses to sensory inputs. For example, Schecklmann et al. (2011) showed that pairing a TMS pulse to the auditory cortex with a simple tone could induce a prolonged decrement of the auditory evoked potential. Cortical stimulation has also been paired with visual stimuli to demonstrate the capacity for visuomotor integration to mediate plastic changes in motor cortex (Suppa et al., 2013). To date, however, the connections known to exist between the auditory and motor domains have not been tested for their capacity to induce motor cortex plasticity.

A number of well-described functional links between audition and the motor system exist. These range from protective reflexive motor activations in response to signals of potential danger (Forbes and Sherrington, 1914) to the complex feedback and feedforward communication necessary for fluent speech to occur (Tourville et al., 2008; Perkell, 2012). These connections allow us to, for example, modulate the volume of our speech to appropriately match the ambient environmental noise (Lane and Tranel, 1971) or modulate the sensitivity of our sensory system to compensate for speech-induced reafference (Curio et al., 2000).

Motoric activation via auditory inputs has been demonstrated in a number of experiments that have used TMS to probe the link between speech perception and motor representations. Modulation of motor cortical excitability during speech perception has been demonstrated to occur in the cortical representations of the hand (Flöel et al., 2003), lips (Watkins et al., 2003) and tongue (Fadiga et al., 2002; Roy et al., 2008).

Despite evidence suggesting a strong connection between auditory and motor centers, auditory stimuli have not yet been used in a modified PAS study to induce plasticity in the motor area. The aim of the current study was to investigate whether it is possible to induce plasticity in the motor system by pairing auditory stimuli and TMS. The development of such a protocol would in future allow for the direct investigation of auditorimotor linkages in a number of disorders where these are thought to be abnormal. Auditorimotor disconnection or dysfunction has, for example, been proposed to underpin the speech dysfluencies that characterize stuttering (Neef et al., 2011) and the misattribution of self-produced speech that may produce auditory hallucinations in schizophrenia (Ford et al., 2005).

## **METHODS**

Two separate experiments (A and B) were conducted. Given that the timing of stimuli in a PAS protocol is critical for facilitating plastic change (Stefan et al., 2002; Wolters et al., 2005; Mrachacz-Kersting et al., 2007; Murakami et al., 2008; Kumpulainen et al., 2012), the aim of Experiment A was to find, at a group level, the optimal offset timing of the motor cortical excitation from the onset of the auditory stimulus. This offset was determined by applying TMS pulses at different latencies relative to the onset of the auditory stimulus and measuring the conditioned motor evoked potential (MEP) in the right first dorsal interosseus (FDI) muscle.

The PAS protocol implemented in Experiment B was informed by the results of Experiment A. First, a baseline session was conducted in which MEPs (TMS with no auditory stimuli) were collected and saved as pre-PAS measurements. This was followed by an intervention block which consisted of the auditorimotor PAS protocol. During the intervention block, subjects received an auditory stimulus paired with TMS using the optimal latency between stimulations that was found in Experiment A. Post-PAS MEPs were recorded immediately after the intervention session and again 15 min after the session ended (post and post15, respectively). By comparing pre-MEPs with post-MEPs and post15-MEPs it was possible to evaluate whether motor cortex excitability changes had occurred, how fast they evolved and whether they were long-lasting.

## **EXPERIMENT A—TIMING OF STIMULI**

#### **SUBJECTS**

Experiment A was performed on 12 healthy right-handed volunteers (9 males), aged 18–36 years (mean 24.2 ± 5.0 years). Prior to commencement of the experiment, subjects completed a standard TMS screening questionnaire and provided written informed consent. None of the subjects reported any history of hearing impairment, neurological disease or mental illness, was taking regular medication or had a history of brain injury. This study was reviewed and approved by the Human Research Ethics Committee of Macquarie University.

#### **EXPERIMENTAL PROCEDURE**

Subjects were seated in a chair with their right arm and hand resting in a comfortable position on an armrest. An armrest was used in order to eliminate hand movements during recordings. During the experiment the subject was told to relax, avoid any movement of the right arm and hand and to have their eyes open. Surface EMG (sEMG) was recorded (1000 × gain, bandpass filtered from 20–500 Hz) from a bipolar electrode (Medi-Trace 100, Kendall/Tyco Healthcare, USA) montage. One electrode was placed over the muscle belly of the right FDI muscle and the other electrode was placed over the proximal metacarpal of the index finger.

A monophasic transcranial magnetic stimulator (Magstim model 200, Magstim, Whitland, UK), with a focal figure-of-eight stimulating coil (90-mm outer diameter), was used to elicit MEPs from the right FDI muscle. The stimulating coil was held tangentially to the skull with the coil oriented 45° to the parasagittal plane and the handle pointing laterally and posteriorly. The center of the coil junction was placed over the primary motor cortex (M1) hand area of the left hemisphere and the "motor hot spot" was determined as the site where TMS consistently elicited the largest MEPs.

Resting motor threshold (MT) was determined by finding the lowest stimulation intensity of the motor hotspot for the right FDI needed in order to obtain an MEP with a peak-to-peak amplitude of 50 µV in 5 out of 10 consecutive stimulations. The TMS test intensity was then set at 120% of resting MT. Eight different TMS conditions were tested. These consisted of seven auditory-stimulation/TMS pairs and one TMS condition without associated auditory stimulation (baseline). The auditory-stimulation/TMS pairs consisted of a test TMS pulse applied at one of seven different intervals (25, 50, 100, 150, 200, 250 and 300 ms) after the onset of the auditory stimulus. The auditory stimulus consisted of a male voice pronouncing the word "Hey!" played back at 80 dB SPL via Etymotic ER-1 insert tube-phones. We chose to use a speech sound because previous research suggests that speech sounds strongly activate the motor cortex (e.g., Flöel et al., 2003). However, other evidence suggests that the motor cortex might also be activated by non-speech sounds (Watkins et al., 2003; Alibiglou and Mackinnon, 2012), so we also included a condition in which the auditory stimulus matched the amplitude envelope of the speech stimulus but consisted entirely of white noise (Pulvermüller et al., 2006). This signal-correlated noise (SCN) stimulus was created using Praat (Boersma and Weenink, 2013). Time and frequency domain comparisons of the two signals are displayed in **Figure 1**.
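The principle behind a signal-correlated noise stimulus can be illustrated with a short sketch. It is not the Praat procedure used here, which is not detailed in the text. It assumes the classic sign-flipping approach, in which the polarity of each sample is inverted at random so that the amplitude envelope is preserved while the fine structure becomes noise-like. The function name `signal_correlated_noise`, the seed, the sampling rate and the toy waveform are all hypothetical.

```python
import math
import random


def signal_correlated_noise(samples, seed=0):
    """Randomly invert the polarity of each sample (illustrative sketch).

    The absolute value of every sample is unchanged, so the amplitude
    envelope of the input is preserved while its fine structure is destroyed.
    """
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    return [s if rng.random() < 0.5 else -s for s in samples]


# Toy stand-in for a speech waveform: a 200 Hz tone with a decaying envelope,
# 100 ms long at an assumed sampling rate of 8 kHz.
sample_rate = 8000
speech = [
    math.exp(-30.0 * n / sample_rate) * math.sin(2 * math.pi * 200 * n / sample_rate)
    for n in range(800)
]
scn = signal_correlated_noise(speech)
```

Because only the sign of each sample changes, the two signals have identical sample-wise magnitudes, which is the sense in which the noise is "correlated" with the original signal.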

The order of all seven auditory-stimulation/TMS pairs and stimulus types (speech or SCN) was randomly intermingled and presented with an intertrial interval (ITI) that randomly varied between 4000 and 5000 ms in two blocks such that the total number of stimuli per condition was 16. The total number of trials was hence 128 (16 baseline trials + 7 × 16 conditioned trials). The duration of the experiment was approximately 25 min.

#### **DATA PROCESSING**

Offline MEP analysis was conducted using a custom MATLAB (The Mathworks, USA) script. The average MEP amplitude calculated for each sound type and auditory-stimulation/TMS pair was expressed as a percentage of the average baseline (TMS-only) MEP.
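The normalization step can be sketched as follows. This is a minimal illustration in Python rather than the custom MATLAB script, which is not reproduced in the text. The function name `percent_of_baseline` and all amplitude values are hypothetical.

```python
from statistics import mean


def percent_of_baseline(conditioned_uv, baseline_uv):
    """Mean conditioned MEP amplitude as a percentage of the mean baseline MEP."""
    return 100.0 * mean(conditioned_uv) / mean(baseline_uv)


# Hypothetical peak-to-peak amplitudes (in microvolts) for one subject.
baseline_trials = [900.0, 1100.0, 1000.0, 1000.0]      # TMS alone
conditioned_trials = [1200.0, 1100.0, 1150.0, 1150.0]  # one sound type, one interval

normalized = percent_of_baseline(conditioned_trials, baseline_trials)
```

With these made-up values the conditioned MEP is 115% of baseline. A value of 100% would indicate no change relative to TMS alone.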

#### **STATISTICAL ANALYSIS**

A repeated measures ANOVA with the factors delay (auditory-stimulus/TMS interval) and condition (speech or SCN) was performed on the averaged MEPs. A two-tailed, one-sample *t*-test was then used to determine the time points at which the conditioned MEPs differed significantly from baseline using an α-value of 0.05.

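[Figure 1: time and frequency domain comparison of **(A)** the speech stimulus and the signal-correlated noise (white noise) version of **(A)**.]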

## **EXPERIMENT B—AUDITORIMOTOR PAS**

#### **SUBJECTS**

Experiment B was performed on 10 healthy right-handed volunteers (8 males), aged 18–31 years (mean 24.5 ± 3.3 years) without any prior neurological medical history. Written informed consent was obtained from each subject before participation in the study.

#### **EXPERIMENTAL PROCEDURE**

The procedure used in Experiment B was similar to the one used in Experiment A. The main difference was that a single auditory stimulus/TMS interval (100 ms) was used during the PAS induction period in Experiment B. As no difference in MEP facilitation between the speech and SCN stimulus conditions was found in Experiment A, we arbitrarily chose to use only the speech stimulus in Experiment B. PAS induction following baseline MEP recording consisted of a total of 200 auditory stimulus/TMS pairs applied with a 4000–5000 ms random interval between each pair. A 2 min pause in stimulation was included after the first 100 pairs. The total duration of the experiment was approximately 27 min (introduction: 10 min, part one: 7.5 min, pause: 2 min, part two: 7.5 min).
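The stated durations follow directly from the number of pairs and the inter-pair interval, as the short check below illustrates. It assumes that the random interval is drawn uniformly between 4000 and 5000 ms, so that its expected value is the midpoint. The variable names are illustrative only.

```python
PAIRS_TOTAL = 200        # auditory stimulus/TMS pairs (from the text)
PAIRS_PER_PART = 100     # pairs delivered before and after the pause
ITI_MIN_MS, ITI_MAX_MS = 4000, 5000
PAUSE_MIN = 2.0
INTRODUCTION_MIN = 10.0

# Assumption: intervals are uniformly distributed, so the mean is the midpoint.
mean_iti_s = (ITI_MIN_MS + ITI_MAX_MS) / 2 / 1000

part_min = PAIRS_PER_PART * mean_iti_s / 60    # duration of one stimulation part
parts = PAIRS_TOTAL // PAIRS_PER_PART
total_min = INTRODUCTION_MIN + parts * part_min + PAUSE_MIN
```

Under this assumption each part lasts 7.5 min and the session lasts 27 min, in agreement with the figures given above.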

#### **STATISTICAL ANALYSIS**

A two-tailed, one-sample *t*-test was used to determine significant differences between pre-MEPs (baseline), post-MEPs and post15-MEPs using an α-value of 0.05.

## **RESULTS**

Mean (± SEM) MEP threshold in Experiment A was 45.5 ± 2.1% of stimulator output and 46.6 ± 2.4% in Experiment B.

Results from Experiment A are shown in **Figure 2**. A repeated measures ANOVA showed that there was a significant effect of delay on the size of the MEP, *F*(6,66) = 2.3, *p* = 0.045. There was no significant effect of condition and no significant interaction between delay and condition. Within-condition comparison of mean normalized MEPs to baseline by means of a two-tailed one-sample *t*-test revealed that in the noise condition, MEPs were significantly increased above baseline for one interstimulus interval (ISI): 100 ms (115.5 ± 5.2% of baseline, *t*(11) = 3.0, *p* = 0.012). For the speech sound condition two ISIs had MEPs that were significantly increased above baseline: ISI = 100 ms (117.0 ± 6.5% of baseline, *t*(11) = 2.6, *p* = 0.023) and ISI = 150 ms (111.4 ± 4.7% of baseline, *t*(11) = 2.4, *p* = 0.035).
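The reported *t* values can be related to the reported means with a small worked check. For a one-sample test against a baseline of 100%, the statistic is the distance of the mean from 100 divided by the standard error of the mean, with 12 − 1 = 11 degrees of freedom. The sketch assumes that the ± values are SEMs, which the text states only for the threshold data. The function name `one_sample_t` is hypothetical.

```python
def one_sample_t(mean_pct, sem_pct, reference_pct=100.0):
    """t statistic for a one-sample test of a mean against a reference value."""
    return (mean_pct - reference_pct) / sem_pct


n_subjects = 12
degrees_of_freedom = n_subjects - 1

# (mean, assumed SEM) in % of baseline, taken from the text.
reported = {
    "noise, 100 ms": (115.5, 5.2),
    "speech, 100 ms": (117.0, 6.5),
    "speech, 150 ms": (111.4, 4.7),
}
t_values = {label: round(one_sample_t(m, s), 1) for label, (m, s) in reported.items()}
```

Rounded to one decimal place, the three statistics come out at 3.0, 2.6 and 2.4, matching the values given in the text and supporting the reading of the ± values as SEMs.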

Results from Experiment B show that across all subjects the averaged MEP peak-to-peak amplitude increased to 148% (post) and 165% (post15) of baseline as shown in **Figure 3**. Two-tailed one-sample *t*-tests showed a significant increase in normalized MEP peak-to-peak amplitude for post (*t*(9) = 3.8, *p* = 0.004) and post15 (*t*(9) = 2.9, *p* = 0.018). Comparison between post and post15 by means of a paired *t*-test revealed no significant difference (*t*(9) = 1.06, *p* = 0.32).

#### **DISCUSSION**

The current study demonstrates for the first time that long-lasting motor cortical plasticity can be induced by an auditorimotor PAS paradigm. This result is significant because it not only provides a new method for investigating auditorimotor integration, but importantly, also a method to directly probe the brain's capacity for auditorimotor plasticity.

We utilized a two-stage approach in developing this PAS paradigm. First, we identified the optimal ISI for eliciting an enhanced MEP response compared to baseline. This paradigm follows the empirical approach developed by Mrachacz-Kersting et al. (2007) to investigate PAS-induced plasticity in the cortical representation of tibialis anterior. The optimal interval we found fits well with the temporal structure of the auditory N1 to speech sounds, which peaks 100 ms after stimulus onset (e.g., Liotti et al., 2010), and agrees with the TMS findings of Fadiga et al. (2002) and those of Roy et al. (2008), who found that "phonological motor resonance" was present at 100 ms after their target speech sound stimulus onsets. In both studies the authors applied TMS to the tongue motor representation following the presentation of pseudo-words containing double consonants. The MEP response that they recorded in the tongue peaked in amplitude when the auditory stimulus to TMS interval was 100 ms.

While we used a speech stimulus in these experiments, the lack of difference between the response to the speech stimulus and SCN found in Experiment A suggests that under the experimental conditions we have imposed, i.e., a repetitive presentation of a speech sound without the requirement for engagement on the part of the subject, the stimulus may not be processed as speech *per se* and should rather be considered a non-specific acoustic stimulus. This may explain why our results differ in part from those of Watkins et al. (2003) and Murakami et al. (2011), whose findings suggest that auditory-induced motor modulations related to speech listening are confined to the cortical representations of those muscles involved in articulation. Indeed, there is now a significant body of evidence to support the somatotopic arrangement of speech gesture perception (Fadiga et al., 2002; Roy et al., 2008; D'Ausilio et al., 2009, 2011; Möttönen and Watkins, 2009; Sato et al., 2010), but such findings do not necessarily rule out the non-specific motor activations in response to both speech and non-speech acoustic stimuli that have been documented using both TMS and other methods (Flöel et al., 2003; Alibiglou and Mackinnon, 2012; Fujioka et al., 2012).

The current study shows that repeated pairing of an acoustic stimulus with a TMS pulse to the motor cortex representation of the hand leads to a rapidly evolving, long-lasting increase in cortical excitability. This effect was induced with an ISI of 100 ms, a time interval that corresponded to the point of peak enhancement in the acoustic stimulus-conditioned MEP. Given that this ISI was identified using discrete intervals spaced 50 ms apart around the optimum, it is expected that this PAS technique could be refined further by reexamining the optimal sound-to-TMS interval using smaller time steps (i.e., less than 25 ms) centered around 100 ms. Moreover, using auditory evoked potentials to discover individualized N1 latencies, and then using these as the basis for the PAS ISI, would likely refine the technique further. Since we were able to find a significant PAS effect in this proof-of-concept study, we posit that auditorimotor PAS is a robust effect that will provide a powerful tool for studying auditorimotor plasticity in the future.

Auditorimotor plasticity, i.e., the capacity for strengthening of auditorimotor connections within the brain, is essential for the acquisition of speech and the learning of musical competence. For this reason, techniques that can probe the brain's capacity for auditorimotor plasticity provide the opportunity to investigate some of the hypothesized mechanisms of conditions such as stuttering and specific language impairment (SLI) in which disordered motor learning has been documented (Namasivayam and van Lieshout, 2008; Mayor-Dubois et al., 2014). Both of those conditions have been associated with disordered sensorimotor integration (Hill, 2001; Neef et al., 2011; Cai et al., 2012, 2014) and, in the case of SLI, with disordered auditorimotor plasticity (Kurt et al., 2012). Additionally, disorders such as schizophrenia and tinnitus have been associated with disrupted auditorimotor connections (Cacace, 2003; Ford et al., 2005; Langguth et al., 2005) and synaptic plasticity (Møller, 2003; Stephan et al., 2009); the technique described herein is therefore a novel means to assess these associations. Beyond mechanistic investigation of disorders, associative stimulation using TMS has also been proposed as a therapeutic modality (Uy et al., 2003; Jayaram and Stinear, 2008; Michou et al., 2013). If it is established that disorders such as those described above involve a form of auditorimotor disconnection, then auditorimotor PAS could be used as a novel adjuvant therapy to assist in the re-establishment of appropriate sensorimotor mappings.

#### **ACKNOWLEDGMENTS**

Paul F. Sowman is supported by the National Health and Medical Research Council, Australia (#543438, #1003760) and the Australian Research Council (DE130100868). The authors would like to thank AC Etchell for his comments on the manuscript.

**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 21 March 2014; accepted: 18 May 2014; published online: 03 June 2014*. *Citation: Sowman PF, Dueholm SS, Rasmussen JH and Mrachacz-Kersting N (2014) Induction of plasticity in the human motor cortex by pairing an auditory stimulus with TMS. Front. Hum. Neurosci. 8:398. doi: 10.3389/fnhum.2014.00398*

*This article was submitted to the journal Frontiers in Human Neuroscience*. *Copyright © 2014 Sowman, Dueholm, Rasmussen and Mrachacz-Kersting. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Behavioral and multimodal neuroimaging evidence for a deficit in brain timing networks in stuttering: a hypothesis and theory

## *Andrew C. Etchell<sup>1,2</sup>\*, Blake W. Johnson<sup>1</sup> and Paul F. Sowman<sup>1,2</sup>*

*<sup>1</sup> Department of Cognitive Science, ARC Centre of Excellence in Cognition and its Disorders, Macquarie University, Sydney, NSW, Australia*

*<sup>2</sup> Department of Cognitive Science, Perception in Action Research Centre, Macquarie University, Sydney, NSW, Australia*

#### *Edited by:*

*Sonja A. E. Kotz, Max Planck Institute Leipzig, Germany*

#### *Reviewed by:*

*Pierpaolo Busan, University of Trieste, Italy*

*Christian A. Kell, Goethe University, Germany*

*Michael Schwartze, Max Planck Society, Germany*

#### *\*Correspondence:*

*Andrew C. Etchell, Department of Cognitive Science, ARC Centre of Excellence in Cognition and its Disorders, Macquarie University, 16 University Avenue, Sydney, NSW 2109, Australia e-mail: andrew.etchell@mq.edu.au*

The fluent production of speech requires accurately timed movements. In this article, we propose that a deficit in brain timing networks is one of the core neurophysiological deficits in stuttering. We first discuss the experimental evidence supporting the involvement of the basal ganglia and supplementary motor area (SMA) in stuttering and the involvement of the cerebellum as a possible mechanism for compensating for the neural deficits that underlie stuttering. Next, we outline the involvement of the right inferior frontal gyrus (IFG) as another putative compensatory locus in stuttering and suggest a role for this structure in an expanded core timing-network. Subsequently, we review behavioral studies of timing in people who stutter and examine their behavioral performance as compared to people who do not stutter. Finally, we highlight challenges to existing research and provide avenues for future research with specific hypotheses.

**Keywords: stuttering, rhythm, tapping, speech, basal ganglia, cerebellum, timing**

#### **THEORIES OF STUTTERING**

According to the World Health Organisation (2010, para. F98.5), stuttering is "speech that is characterized by the frequent repetitions or prolongation of sounds or syllables or words, or by frequent hesitations or pauses that disrupt the rhythmic flow of speech." Repetitions typically consist of a repetition of part of a word, a whole word or a phrase (e.g., re*...* re*...* re*...* repetitions). Prolongations consist of a lengthening of the sounds within a word (e.g., prrrrrrrolongations). Complete interruption to the flow of speech, known as "blocking," is also a common symptom of stuttering. Blocks are periods during which no speech is produced, either within words [e.g., block-(pause)-ing] or between words. In most cases, stuttering emerges between 2 and 5 years of age, around the time children start preschool. Stuttering has a prevalence of around 5% in early childhood but, because many children recover spontaneously, the prevalence across the general population is closer to 1% (Yairi and Ambrose, 2013). Those who do not recover generally experience poorer social, emotional and mental health (Craig et al., 2009; Iverach et al., 2009) and elicit negative reactions from others (Langevin et al., 2010). Stuttering is also accompanied by secondary (or associated) signs that include facial grimaces, forced effort and eye-blinks (Conture and Kelly, 1991; Riva-Posse et al., 2008). These secondary signs further impair the ability to communicate effectively and exacerbate the problems that result from the primary symptoms. Importantly, such secondary signs imply that stuttering is not solely confined to the domain of speech but is rather a disorder of motor control that manifests primarily in the domain of speech because of the extreme timing and sequencing demands required for that function. Moreover, while difficult, it is not impossible to detect differences related to stuttering in the manual domain (e.g., Max et al., 2003; Ambrose, 2004).

**Abbreviations:** BG, Basal ganglia; CB, Cerebellum; CTC, Cerebellar-thalamocortical; CWDS, Children who do not stutter; CWS, Children who stutter; ETN, External timing network; fMRI, Functional magnetic resonance imaging; IFG, Inferior frontal gyrus; ITN, Internal timing network; MEG, Magnetoencephalography; PD, Parkinson's disease; PET, Positron emission tomography; PMC, Premotor cortex; PWDS, People who do not stutter; PWS, People who stutter; SMA, Supplementary motor area; STC, Striato-thalamo-cortical; STG, Superior temporal gyrus; TMS, Transcranial magnetic stimulation; VBM, Voxel based morphometry.

Packman (2012) argues that the necessary condition for stuttering, i.e., the one thing each person who stutters must possess, is a neural anomaly that weakens the integrity of the speech motor system. In this weakened state, the speech motor system is rendered more susceptible to breakdown when various features of the spoken language place increasing demand on the system (Packman, 2012). The point at which stuttering is triggered is modulated according to individual and environmental factors such as levels of physiological arousal. Here we take the view that the necessary condition for stuttering (which unless otherwise specified is used to refer specifically to developmental stuttering) is the presence of a neural anomaly in timing.

The following account proposes the hypothesis that the core disorder of stuttering is a deficit in brain timing-networks. This article is not an exhaustive review of the literature on stuttering or the arguments surrounding the cause of the disorder, but rather a hypothesis as to one of the possible causes of stuttering. The proposal that timing is important for speech (see Lashley, 1951; Martin, 1972; Strait et al., 2011) and even for speech disorders like specific language impairment (Tallal et al., 1993), dyslexia (Goswami, 2011) or indeed stuttering (Alm, 2004, 2010) is not new. In the latter case, the idea that stuttering relates to a deficit of timing follows from the observation that regular external stimulation temporarily alleviates stuttering (for a review, see Alm, 2004; Snyder et al., 2009). The novel aspect of this article is that it expands on previous research suggesting that dysfunction within a brain network that supports internal timing [comprising the basal ganglia (BG) and the supplementary motor area (SMA)] is causing stuttering and that a secondary system which utilizes external timing cues to sequence movements [comprising the cerebellum (CB), the premotor cortex (PMC) and the right inferior frontal gyrus (IFG)] is compensating for stuttering. Specifically, we propose that an internal timing network (ITN), largely equivalent to the "medial system" proposed by Goldberg (1985), is involved in internally timed movement (movement performed in the absence of external timing cues) and is causally related to stuttering. We further propose that an external timing network (ETN), largely equivalent to the "lateral system" proposed by Goldberg (1985), with the addition of the right IFG, is involved in externally timed movement (movement performed in the presence of external timing cues) and provides a substrate for timing compensation in stuttering. Importantly, we are not suggesting that neural deficits in structures underlying timing are the sole cause of stuttering, but rather one of many possible deficits that could lead to stuttering. 
In this section, we first present multimodal neuroimaging evidence for the possible causal involvement of ITN in stuttering before moving on to discuss putative compensatory roles of the ETN.

There is ongoing debate as to whether some brain regions are specifically dedicated to processing time or whether the capacity to process time is intrinsic to each region of the brain directly through the activation of sensory processes (for review see Ivry and Schlerf, 2008). There already exist reviews outlining the cognitive and neural architecture proposed for how we represent a sense of time (e.g., Buhusi and Meck, 2005), how different sensory networks interact with core timing networks across different tasks (e.g., Merchant et al., 2013) as well as evidence for common timing mechanisms across manual and oral movements (e.g., Franz et al., 1992). While the questions of how and where time is processed in the brain are of considerable practical and theoretical interest, such a discussion is outside the scope of this article. Here we argue that the ETN is primarily active when an individual is timing their movement to an external rhythm and that it is particularly active during early exposure to rhythm or when the rhythm is difficult and is not easily internalized. In contrast to this, the ITN is primarily active when an individual is making rhythmic motor movements that are not specifically timed to an external stimulus. Importantly, the two systems can be active simultaneously such as when an individual is pacing their movements to an external stimulus and is internalizing that rhythm. Practically, this means that results of functional magnetic resonance imaging (fMRI) studies may show no difference in brain activation between conditions that supposedly bias internally or externally-timed movements; however, disruption of these systems via inhibitory transcranial magnetic stimulation (TMS) should yield selective interference in behavioral performance. What follows is a brief overview of studies supporting a dissociation between the ITN and the ETN in timing tasks.

There is strong support for the involvement of the ITN during timing tasks from a number of fMRI, magnetoencephalography (MEG), lesion and TMS studies. For example, a recent fMRI study has found that the BG and the SMA tend to be active when movements are internally, as opposed to externally, timed (Coull et al., 2013). Similarly, it has been shown using finger-tapping tasks that the BG and the SMA are active during the continuation phase (no external pacing stimulus, hence an internally-timed process) but not the synchronization phase (with external pacing, hence externally-timed) of the task (Rao et al., 1997). In particular, the BG are more active during the performance or tracking of simple rhythms, i.e., those that are easier to internalize, compared to complex rhythms (Grahn and Rowe, 2009, 2013; Geiser et al., 2012). The fact that fMRI studies show an overlap of neural activity during synchronization and continuation tapping (e.g., Jäncke et al., 2000; Jantzen et al., 2004) provides little support for a functional distinction between brain networks supporting internal and external timing; however, evidence from lesion and TMS studies does support such a dissociation between the ITN and the ETN and their respective functions. Studies show that individuals with bilateral lesions to the BG perform poorly on the continuation phase of the finger-tapping task (Coslett et al., 2010) and are also poor at adjusting to accelerations and decelerations in tempo (Schwartze et al., 2011). Disruption of the SMA by inhibitory TMS impairs accuracy of continuation tapping whilst leaving the accuracy of synchronization tapping intact (Halsband et al., 1993).

There is also evidence for the involvement of the CB and the PMC in the ETN. Inhibitory TMS of the CB has been shown to disrupt synchronization to auditory (Del Olmo et al., 2007) and visual pacing (Theoret et al., 2001; Koch et al., 2007). This disruption appears to be selective because lesions to the CB do not affect performance during the continuation phase of the finger-tapping task (Spencer et al., 2003). Likewise, a number of studies show that inhibitory TMS of the left PMC disrupts synchronization tapping (Pollok et al., 2008; Bijsterbosch et al., 2011) and that this effect is specific to external pacing, as no effect of TMS is observed on continuation tapping (Del Olmo et al., 2007) or when tapping in the presence of, but not in time with, a scrambled beat (Kornysheva and Schubotz, 2011). Taken together, there indeed appears to be a functional dissociation of the ITN and the ETN in healthy adults. We now turn to neuroimaging studies to demonstrate how these systems are impaired in people who stutter.

#### **NEUROIMAGING STUDIES OF THE INTERNAL TIMING NETWORK IN PWS**

A number of neuroimaging studies implicate the BG or components thereof in the etiology of stuttering. For example, when comparing the fluent and dysfluent speech of people who stutter (PWS) to the speech of people who do not stutter (PWDS), Wu et al. (1995) found that PWS exhibited less activity in the caudate during both dysfluent speech and fluent speech. This lowered activity was suggested to be a trait marker for stuttering. The BG have also been related to the most typical symptoms of stuttering at an individual level (Jiang et al., 2012). These authors elicited stuttering during a sentence completion task and classified repetitions, pauses and prolongations as being either least typical or most typical of stuttering based on patterns of haemodynamic responses. Jiang et al. (2012) found that one of the activation patterns contributing to this separation of most and least typical symptoms was a reduction in BG activation. Although the aforementioned studies provide a correlative link between the putative ITN and stuttering, they do not unambiguously support the notion that the ITN *causes* stuttering. Because those studies were conducted mainly in adults, and stuttering is a disorder that appears in childhood, it is hard to determine whether anomalous BG activations observed in PWS are related to the cause of stuttering or are compensations for it.

In contrast, structural and functional abnormalities in children who stutter (CWS) are likely to be more indicative of the causative agents in stuttering because children have not had as much time to adapt to stuttering as adults. Chang and Zhu (2013) examined functional connectivity in CWS and children who do not stutter (CWDS) aged 3–9 and found reduced levels of connectivity between the putamen and the SMA, superior temporal gyrus (STG) and CB and similarly between the SMA and the putamen, STG and CB. Chang and Zhu (2013) concluded that CWS exhibited reduced activity in areas responsible for self-paced movement as compared to CWDS. Similarly, a recent voxel based morphometry (VBM) study conducted in CWS found less gray matter volume in the bilateral inferior frontal gyri and the left putamen but more gray matter volume in the right rolandic operculum and the right STG relative to CWDS (Beal et al., 2013). In another study, Foundas et al. (2013) measured the volume of the caudate in right-handed boys who stutter and compared them to right-handed boys who did not stutter. They found that male CWS exhibited significantly less volume in the right caudate as compared to male CWDS. These studies suggest that even at a very young age, CWS exhibit abnormalities in structure and connectivity in the ITN. A recent MEG study examined lateralization of brain functions in preschool CWS and CWDS during a picture-naming task (Sowman et al., 2014). These authors found that speech was strongly left lateralized in both groups. Although not explicitly focusing on the ITN, this study demonstrates that much of the abnormal activation observed in the cortical right hemisphere in adults is the result of years of compensation for stuttering rather than being causally related to it. Moreover, the fact that there were no differences between CWS and CWDS in cortical activations further hints at the possibility that stuttering is caused by deficiencies in subcortical regions. 
Overall, these studies provide strong support for viewing stuttering as a disorder of the BG. Since the BG seem responsible for internal timing of movement, they provide indirect support that stuttering is a disorder of internally timed movement.

To implicate the ITN in stuttering, structural or functional abnormalities should be evident in these structures in both children and adults who stutter and the neural deficit necessary to cause stuttering should be present irrespective of whether or not a subject is performing a task. Ingham et al. (2012) examined speech during oral reading and monologs as well as during a rest condition and found that PWS were different to PWDS in both the medial (ITN) and lateral (ETN) systems proposed by Alm (2004). PWS had significantly more activity in the BG (including the left putamen) during an eyes closed rest condition but significantly less activity during speaking conditions. This was thought to result in difficulties in performing fine-grained movement that may extend to speech and explain the fact that other studies observed increased activation of these regions in speech conditions like oral reading and monolog. More specifically though, if it is the case that the BG are overactive during rest and not just underactive during speech, it would indicate abnormalities in stuttering are not solely confined to speech. That is to say, the problem spans a number of domains because there are functional differences in neural activation occurring in the absence of speech.

If abnormalities of the ITN are causally related to stuttering, then it could be expected that effective speech therapy should produce measurable changes in the neural activity of these structures rather than in the areas compensating for stuttering. To this end, Giraud et al. (2008) examined neural activity using fMRI before and after speech therapy in a group of PWS. Therapy consisted of a 3-week inpatient program focusing on biofeedback for syllable prolongation, soft voice onset and smooth sound transition. The researchers found that activity in the caudate positively correlated with stuttering severity before speech therapy but not after. Since caudate activity was positively correlated with severity rather than negatively correlated with it, the speech therapy appeared to target causal rather than compensatory regions.

Similarly, if the ITN is related to stuttering this will not only be reflected in measures of neural activity but also in terms of the connections within the ITN. Lu et al. (2010) used structural equation modeling to compare causal relationships and function in the ITN in PWS and PWDS during a picture-naming task. Although there were no significant differences between stuttering and nonstuttering speakers in the output of the SMA to the BG, there were significant differences between the groups in the output of the BG to the SMA. More specifically, whereas PWDS showed a strong negative projection from the BG to the pre-SMA, PWS showed a positive projection from the BG to the pre-SMA. Lu et al. (2010) interpreted their finding of abnormal output of the BG to the SMA as reflecting the difficulties PWS have in updating the timing and sequencing of movement. Interestingly, like Lu et al. (2010), a number of other studies have also shown altered patterns of activity in the SMA in relation to the perception and planning of speech in stuttering (Chang et al., 2009, 2011). Taken together, these findings are consistent with the notion that stuttering is the result of dysfunctional processes that engage core structures within the proposed ITN: the BG and the SMA.

#### **LESION STUDIES OF THE ITN IN PWS**

If dysfunction in the ITN is thought to cause stuttering, then it follows that damage to these regions may result in stuttering. When stuttering develops following a lesion to the brain it is known as acquired or neurogenic stuttering (for review see Lundgren et al., 2010). There is evidence that damage to the ITN results in stuttering. For example, a recent study by Tani and Sakai (2011) examining five patients with BG lesions (two with bilateral putamen lesions, two with bilateral BG lesions and one with a left putamenal lesion) but without aphasia found that they exhibited dysfluencies such as syllable repetitions, part-word repetitions and frequent blocks. Importantly, these patients' symptoms mimicked the characteristics of developmental stuttering in that almost all stuttering occurred on the initial syllable of a word. In a number of case studies, Ciabarra et al. (2000) describe a right-handed woman with a left BG lesion, and a woman with a left corona radiata, putamenal and subinsular infarct, both of whom stuttered. Similarly, a number of different case studies have reported the onset of stuttering following damage to the SMA (Alexander et al., 1987; Ackermann et al., 1996; Chung et al., 2004). Furthermore, direct electrical stimulation of the SMA has also been shown to induce stuttering (Penfield and Welch, 1951). These findings are consistent with the notion that damage to the SMA can cause speech disorders and that the SMA is linked with the rhythmic control of speech (Jonas, 1981). This and other works have prompted investigation into the role of the SMA in rhythmic movements of the mouth (MacNeilage and Davis, 2001) as well as dissociations between the pre-SMA and the SMA-proper in rhythmic timing (Schwartze et al., 2012).

#### **NEUROIMAGING STUDIES OF THE ETN SYSTEM IN PWS**

There are studies hinting that deficits to the ITN are causing stuttering, but what evidence is there that the ETN is recruited to compensate for this? To answer this question, we turn to fMRI studies of PWS. Braun et al. (1997) found the CB to be overactive in PWS during stuttered and fluent speech and it has been suggested that this is a compensatory mechanism for stuttering (see also Alm, 2004). In a meta-analysis of PWS, Brown et al. (2005) identified three neural signatures of stuttering. These neural signatures were the absence of auditory activation bilaterally, the over-activation of the right IFG and the over-activation of the CB. These findings have since been partially replicated by Lu et al. (2010) who found over-activation of the right IFG and the CB (but not the absence of bilateral auditory activation) and interpreted them as compensating for stuttering. Ingham et al. (2012) examined speech during oral reading and monologs as well as rest, finding that PWS exhibited increased cerebellar activity which was negatively associated with stuttering, indicating that the ETN may indeed be compensating for the ITN. A similar study examined resting-state functional connectivity before and after speech therapy in stuttering and non-stuttering adults (Lu et al., 2012). These authors found increased resting-state functional connectivity between the midline CB and a network of regions (comprising the medial frontal gyrus, the SMA and the left IFG) for PWS relative to PWDS. For the PWS who received intervention, as compared to the PWS who did not receive intervention (and PWDS), the resting-state functional connectivity in the midline CB returned to normal levels and was correlated with an increase in fluency. As such, Lu et al. (2012) suggested the CB was likely compensating in stuttering. In addition to these, other studies have associated the CB with compensatory activation in PWS (e.g., De Nil et al., 2008; Watkins et al., 2008).

While there is overlap in the neural structures responsible for external timing and compensation for stuttering, it does not automatically follow that the ETN is compensating for deficits in internal timing in PWS. However, there is fMRI evidence showing that the CB and the right IFG specifically compensate for deficits in the BG with respect to timing tasks in those who have Parkinson's disease (PD). For example, Jahanshahi et al. (2010) investigated the differences in neural activation between PD patients and controls in the synchronization-continuation task. They also examined the effect of administering apomorphine (a non-selective dopamine agonist) on neural activation in the PD patients. Results showed that for healthy controls synchronization and continuation tapping (relative to a control reaction time task) was associated with significantly greater activation in the nucleus accumbens and caudate, a pattern not found in PD patients. In contrast, individuals with PD showed greater activation in the bilateral cerebellar hemispheres, right thalamus and left midbrain during both phases of finger tapping. Administration of apomorphine to the PD patients appeared to normalize activity, both increasing the connectivity between the caudate and putamen and frontal regions as well as decreasing activity in the CB. Thus, the authors suggested that increased cerebellar activation was likely compensating for the impaired functioning of the BG. Sen et al. (2010) found increased cerebellar-thalamo-cortical (CTC) activation as PD progressed, perhaps indicating an increasing need to compensate for loss of function in the striato-thalamo-cortical (STC) networks. This increase was only observed during continuation tapping and was not evident during synchronization tapping, suggesting that the CTC (i.e., the ETN) was compensating for the STC (i.e., the ITN). 
The dissociation between the ITN and the ETN may seem problematic given that both the CB (part of the ETN) and the SMA (part of the ITN) are thought to compensate for deficits in the BG during self-initiated hand movements in the early stages of PD (Eckert et al., 2006). Nevertheless, this could suggest that part of the ITN (the SMA) may still be able to compensate for deficits in other parts of the ITN (the BG) when degeneration is not particularly severe.

#### **COMPENSATION BY THE RIGHT IFG IN STUTTERING**

An increasing number of studies have reported anomalous activation of the right IFG in PWS in a variety of speech tasks (e.g., Fox et al., 1996; Brown et al., 2005; Sowman et al., 2012). Several studies found increases in right IFG activation during overt reading (Preibisch et al., 2003; Lu et al., 2010) that were positively correlated with speech fluency in PWS and were thought to be a nonspecific compensatory mechanism because the activation was not specifically related to speech production. Examining the effect of external auditory pacing on the speech of PWS, Toyomura et al. (2011) found that, relative to PWDS, the PWS showed more activation in the right IFG (along with bilateral auditory cortices) during both choral speaking and when speaking in time with an isochronous metronome. There are also reports of increased right frontal connections in adults who began stuttering as children (i.e., developmental stuttering) relative to adults who began stuttering later in life following a psychological trigger and without evidence of brain injury (Chang et al., 2010). This evidence suggests that the longer a PWS has been compensating for their stuttering, the greater the activity in the right IFG.

It is worth noting that Goldberg's formulation of the lateral system (upon which the ETN partially maps) does not contain the right IFG. Why then should right IFG be considered a part of an ETN that compensates for a dysfunctional ITN in stuttering? This question is particularly relevant when considering that the simplest explanation for right IFG involvement in stuttering is that it compensates for deficits in the left IFG (see Kell et al., 2009). Kell et al. (2009) associate the left IFG with processing of rhythm and sensorimotor feedback and it is possible that the right IFG may perform a similar function. Recently, the right IFG has been recognized as part of a "core timing network" (Wiener et al., 2010) that is recognized to be strongly connected both functionally and structurally to the ITN (Kung et al., 2013; Brittain and Brown, 2014). In particular, the right IFG may only become active when a task is more demanding. That is to say, the difficulty of compensating for deficits in internal timing by external timing regions might account for why there was over-activation of only the CB during speech, but not the right IFG during rest in PWS (Lu et al., 2012). A second, though not mutually exclusive explanation is that while the CB is able to compensate for timing deficits, its ability to do so is limited. This is evident in the case of individuals with PD where behavioral performance worsened despite increases in compensatory activation in the CB (Sen et al., 2010). A similarly limited ability of the cerebellar systems to compensate for deficits in timing may be occurring in PWS as evidenced by the reduced integrity of cerebellar tracts in both the left and the right hemispheres (Connally et al., 2013). Since the ETN has a limited capacity to compensate for deficits in the ITN, the assistance of the right IFG may be required to maintain normal timing functions. 
A third possible explanation is that the model proposed by Goldberg (1985) (where the ETN comprises the CB and the PMC) is incomplete and requires the addition of the right IFG as a secondary part of the system. Importantly, the right IFG is not likely to be the only region that is compensating for stuttering. There are many other regions, like the orbitofrontal cortex, that could be found to be compensating depending on the task and motor regions involved (see Kell et al., 2009; Sowman et al., 2012). Our contention is that the right IFG forms part of a network that compensates for deficient internal timing.

#### **BEHAVIORAL STUDIES OF TIMING IN PWS**

If stuttering is the result of dysfunction in the ITN, and the ITN is important for timing, then it follows that PWS should exhibit deficits in behavioral performance on timing tasks. To this end, several groups have found significant differences in asynchrony and variability of tapping between PWS and PWDS. For example, measuring the timing variability of reading sentences or nursery rhymes or tapping, Cooper and Allen (1977) found that PWS were consistently more variable than PWDS in the length of time it took them to read sentences, paragraphs or nursery rhymes, and in their inter-tap intervals. Brown et al. (1990) found that PWS were slower and less variable than PWDS at repeating the phrase "ah" and tapping their fingers at their own pace, findings they interpreted to represent less flexible timing systems which were more susceptible to breakdown. Similarly, when examining the timing, intensity and variability of externally timed speech, Boutsen et al. (2000) showed that although both PWS and PWDS exhibited similar intensities when producing syllables, PWS were significantly more variable in their inter-onset vocalization times (analogous to the inter-tap interval in tapping tasks). Additionally, Zelaznik et al. (1997) found that PWS were more variable on bimanual finger tapping (something more demanding than unimanual finger tapping) relative to PWDS. Similarly, Hulstijn et al. (1992) found that on a task which required the coordination of finger tapping and vocal responses (tapping in time with vocalizing the word "pip"), PWS exhibited greater variability than PWDS. More recently, Olander et al. (2010) compared hand-clapping variability in CWS and CWDS. While there was no difference in mean clapping rate, there were significant differences between groups in the variability of the clapping rate. This variability was bimodally distributed, with 60% of CWS showing variability that was greater than the worst performing CWDS. 
The remaining CWS showed variability in clapping that overlapped with that of the CWDS. Interestingly, this split approximately corresponded to the proportions of children who spontaneously recover and whose stuttering persists. As a result, the authors suggested that the motor timing deficit may be predictive of recovery from stuttering. Later, Foundas et al. (2013) found that when male CWS were required to tap as fast as possible in a given time period, most were better when tapping with their left rather than right hands, whereas most male CWDS showed an advantage for their right hand. A recent behavioral study has found robust differences in tapping performance between CWS and CWDS (Falk et al., 2014). In contrast to the CWDS, the CWS not only tapped earlier and were less consistent in tapping, but also failed to improve with age.

However, a number of studies have compared the asynchrony and variability of PWS and PWDS on externally or internally timed vocal or oral motor movements and found similar levels of variance between the groups (e.g., Hulstijn et al., 1992; Melvine et al., 1995). Similar results have been obtained by Zelaznik et al. (1994), who compared PWS and PWDS on externally and internally timed manual responses for isochronous intervals and found that the groups did not differ in behavioral performance. Likewise, Max and Yudman (2003) found PWS and PWDS displayed highly similar levels of asynchrony and variability for finger tapping and producing vocalizations for multiple isochronous intervals. Overall, the behavioral studies investigating the timing abilities of PWS have produced mixed results. While some studies have found differences between PWS and PWDS, many have failed to find differences between groups. From this research, it might seem appropriate to conclude that stuttering is not a disorder of timing and that the link between stuttering and deficits in production of timed limb movements is tenuous at best. One possible explanation is that motor control of limbs and speech is different both centrally and peripherally (Kent, 2000). However, if this were indeed the case, then it would be hard to explain why some studies did find significant differences between PWS and PWDS in non-speech motor tasks. Moreover, there is evidence of common timing systems across modalities (Franz et al., 1992) and it has been stressed that the behavioral differences between PWS and PWDS are not confined to the speech production system and instead appear to be generalized deficits (Max et al., 2003). There are other possible explanations for the failure to find behavioral differences between groups which can, in part, be attributed to compensatory neural activity and task difficulty.

#### **TENTATIVE SUGGESTIONS FOR TIMING DEFICITS IN PWS**

The substantial number of studies finding no difference in timing behavior in PWS and PWDS is inconsistent with the notion that stuttering could be considered a disorder of timing. How then can we resolve these seemingly paradoxical findings with the consistent observation that neural regions involved in internal timing display anomalous function and structure in stuttering? The absence of a difference at a behavioral level does not imply the absence of differences at a *neural* level. Even a task as simple as tapping a finger or vocalizing to a metronome recruits a complex network of brain regions each with a variety of different functions (Repp and Su, 2013). Moreover, there may be differences at the neural level in the absence of differences at the behavioral level precisely because PWS are compensating for deficits in internal timing. Such a possibility is highlighted by the findings of Neef et al. (2011), who, utilizing inhibitory TMS, showed PWS did not exhibit behavioral differences in timing prior to stimulation but did exhibit behavioral differences subsequent to stimulation. If the suggestion that PWS demonstrate similar behavioral performance as a result of re-organization is plausible, then PWS should exhibit compensatory neural activity in regions associated with external timing of movement that are specifically compensating for deficits in the internal timing of movements. This indeed appears to be the case as both the CB and the right IFG seem to be compensatory regions in stuttering; both appear to be associated with timing, and both may specifically be compensating for deficits in the BG's control of timing tasks. Although speculative, this strongly suggests that the compensatory response to stuttering that occurs during speech is occurring as a result of deficits in the ITN. 
It perhaps explains why, in some studies at least, PWS have not shown differences in asynchrony (the difference in time between taps and the pacing signal) or variability (in the time between taps) on tapping tasks compared to PWDS. However, any failure to find a difference between these groups may also be attributed to task-related effects such as motoric or temporal complexity.
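The two behavioral measures defined above can be made concrete with a minimal sketch. The tap and metronome times below are hypothetical values chosen for illustration, and pairing taps with pacing events by position is a simplifying assumption rather than the procedure of any study cited here.

```python
from statistics import mean, stdev

def asynchronies(tap_times, pacing_times):
    """Signed tap-minus-pacing differences (negative = tap precedes the signal).

    Assumes the i-th tap belongs to the i-th pacing event."""
    return [tap - pace for tap, pace in zip(tap_times, pacing_times)]

def inter_tap_intervals(tap_times):
    """Time elapsed between each pair of successive taps."""
    return [later - earlier for earlier, later in zip(tap_times, tap_times[1:])]

# Hypothetical data in ms: a 500 ms metronome and taps that slightly anticipate it.
pacing = [0, 500, 1000, 1500, 2000]
taps = [-20, 470, 990, 1475, 1985]

asyn = asynchronies(taps, pacing)
itis = inter_tap_intervals(taps)

mean_asynchrony = mean(asyn)      # average offset from the pacing signal
iti_variability = stdev(itis)     # variability in the time between taps
```

In this sketch, asynchrony needs the pacing signal whereas inter-tap variability does not, so only the latter can be computed for self-paced (internally timed) tapping.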

Many of the behavioral studies investigating timing abilities in PWS employed simple motoric and temporal tasks. Tapping at isochronous intervals is, as a task, relatively easy and this ease may explain a lack of differences in behavioral performance between PWS and PWDS, a problem that may extend to differences in regional brain activation in neuroimaging studies. Imaging data from early research on finger movements show that the amount of cerebral blood flow to a particular region depends upon the complexity of the task (Shibasaki et al., 1993). Simple tasks are, *ipso facto*, not sufficiently motorically demanding to engage parts of the brain normally employed in more complex tapping tasks and which are impaired in PWS. This principle has been demonstrated experimentally in a number of studies. For example, Zelaznik et al. (1994) failed to find behavioral differences when comparing unimanual tapping performance, but successfully found differences in the same group of stuttering participants when examining bimanual tapping at an isochronous interval (Zelaznik et al., 1997). Similarly, increasing the syntactic complexity of words surrounding a to-be-repeated phrase decreased speech motor stability for PWS as compared to PWDS (Kleinow and Smith, 2000).

In the same way that increasing the difficulty of the motor movement associated with the task could better reveal differences (should they exist) in behavioral performance and neural activation, so too could placing more strain on the systems governing temporal control of movements. Whereas Webster (1985) failed to find a difference in behavioral performance for PWS during bimanual tapping in a 1:1 ratio (that is, one tap of the right hand for every tap of the left hand), Webster (1990) found that PWS took a substantially longer time than PWDS to tap the required number of times when tapping in a ratio of 2:1 (that is, two taps of the left hand for each tap of the right hand). Tapping at an uneven ratio (2:1) places significantly more demand on the neural systems governing timing than does tapping in an even ratio (1:1). This suggests that PWS are much less efficient in coordinating motor output to complex temporal patterns. Similarly, Lewis et al. (2004) demonstrated that parametrically increasing the number of different intervals in a series of tones resulted in a corresponding increase in neural activation in regions associated with timing. These studies show that increasing the demands on temporal processing is more likely to yield differences in behavior and, by extension, in neural activation. This is particularly relevant in the case of speech since speech is rarely perfectly isochronous but rather quasi-periodic (Martin, 1972). Speech contains multiple levels of temporal complexity (Kotz and Schwartze, 2010; Goswami and Leong, 2013) and is therefore substantially more demanding than tapping at an isochronous interval or in a 1:1 ratio. That is to say, differences in the complexity of rhythms required for speech and finger tapping may explain why most timed movements are relatively normal in PWS.
Additionally, the timing required for speech control is robust to interference, so difficulties in timing movements or speech may only become evident under increased cognitive loads (e.g., Saltuklaroglu et al., 2009). If PWS were compared to PWDS on a tapping task that contained a degree of temporal complexity similar to that usually required by speech, then clinically meaningful differences in behavior are likely to emerge. While there is a theoretical distinction between motor and temporal complexity, in practice, this distinction may not be so clear. Using near-infrared spectroscopy (a means to measure the level of deoxygenated blood from the scalp somewhat analogous to how fMRI measures neural activity), Koenraadt et al. (2013) found that the two may not be mutually exclusive. Tapping at multiple frequencies activated larger portions of the motor cortex than tapping at single frequencies. The extent to which manipulating motoric and temporal complexity is able to elicit behavioral differences in timing between PWS and PWDS remains to be tested by future research. Yet, even if these tasks are unable to elicit such differences in PWS, future research investigating the overlap between stuttering and timing should consider the use of neuroimaging techniques.

## **DIRECTIONS FOR FUTURE RESEARCH**

There appears to be a vast gap in the stuttering literature particularly with respect to neuroimaging and brain stimulation of timing tasks. In particular, we know of no fMRI or positron emission tomography (PET) studies that specifically examined internally or externally timed movements in PWS using either simple or complex temporal intervals despite the long theoretical history of an association between deficient timing and stuttering. The timing deficits we propose to exist in PWS are only tentative suggestions and remain to be verified by future research. Our proposal can nevertheless be used to generate a number of testable hypotheses. For example, it could be hypothesized that PWS show impaired behavioral performance and corresponding neural activation in tasks that require the internal timing of movements (the continuation phase of a finger tapping task) as opposed to the external timing of movements (the synchronization phase of a finger tapping task).

Likewise, to the best of our knowledge, there are no studies investigating neural oscillations in PWS in response to isochronous or non-isochronous tones either by passive listening, finger tapping or vocalizations. Given the role of neural oscillations in timing (Arnal, 2012), it would be interesting to investigate how they might differ between PWS and PWDS in the context of a timing task. With respect to studies of brain stimulation, no studies have yet examined the effect of disruptive TMS on the right IFG, the SMA or the CB in PWS in a timing task. Although speculative, it might be expected that tapping in time to a metronome (external timing) will be relatively unimpaired because PWS can rely on the CB and premotor cortices much in the same way as non-stuttering adults do. However, for self-paced tapping it might be expected that following inhibitory TMS to the right IFG, PWS will be significantly impaired because they cannot rely on either the right IFG or the BG. In contrast, PWDS will be able to rely on the BG, but not the right IFG. The compensatory function of the right IFG in stuttering is biologically plausible in that it forms part of a core timing-network (Wiener et al., 2010), is functionally interconnected with the BG (Kung et al., 2013) and is utilized for the processing of speech rhythm (Geiser et al., 2012).

While this article focused on the neural correlates of the ITN and the ETN during the perception and production of rhythmic movements and stimuli, there are many other tasks that probe these networks. The finger-tapping task is a continuous task that is often conducted in the presence of a regular external stimulus. It is possible that the regular external stimulus reduces behavioral variability and (possibly the associated) neural activity much in the same way that it is able to temporarily induce fluency in PWS. It would therefore be prudent to examine the timing abilities of PWS on tasks that do not contain such regular stimuli or where there is a disruption to the external stimuli. In line with the hypothesis of impaired internal timing and the hypothesized compensatory increases in regions associated with the processing of external timing of movements, it might be expected that PWS are more reliant on external cues. As such, it would be interesting to test the abilities of PWS to judge whether a "test interval" is longer or shorter than a "reference interval" and how these judgments are influenced by the presence of a "distractor interval" that they must ignore (see Rao et al., 2001). To this end, we know of no studies that have examined temporal judgment deficits in PWS either behaviorally or neurologically. More generally, if it is demonstrated that PWS exhibit deficits in timing, it would be particularly interesting to see if there is any dissociation between these different types of timing tasks or modalities; there may, for example, be a dissociation between motor timing and duration judgment, or between auditory and visual timing.

## **CONCLUDING REMARKS**

In conclusion, we provide a theoretical framework with which to view stuttering as a disorder of timing. This paper reviews converging evidence from neuroimaging and brain stimulation experiments showing a great degree of overlap between the structures engaged in the internal timing of movements and the regions thought to be causally involved in stuttering. We also provide evidence of overlap between the neural structures engaged in the external timing of movement and link them with compensatory activity in PWS. We further highlight significant gaps in the literature and suggest avenues for further research motivated by this overarching theory. More generally, this article highlights anomalies in the functional activations and the structural anatomy of the areas involved in the processing of time in stuttering, that are linked to the dysfluent production of speech and should motivate further research in the field.

## **ACKNOWLEDGMENTS**

We thank Paul Tawadros for his valuable comments on the manuscript. This work was supported by the Australian Research Council (DE130100868).

## **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 14 March 2014; accepted: 08 June 2014; published online: 25 June 2014. Citation: Etchell AC, Johnson BW and Sowman PF (2014) Behavioral and multimodal neuroimaging evidence for a deficit in brain timing networks in stuttering: a hypothesis and theory. Front. Hum. Neurosci. 8:467. doi: 10.3389/fnhum.2014.00467 This article was submitted to the journal Frontiers in Human Neuroscience. Copyright © 2014 Etchell, Johnson and Sowman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Sequencing at the syllabic and supra-syllabic levels during speech perception: an fMRI study

## *Isabelle Deschamps<sup>1,2</sup>\* and Pascale Tremblay<sup>1,2</sup>*

*<sup>1</sup> Département de Réadaptation, Université Laval, Québec City, QC, Canada*

*<sup>2</sup> Centre de recherche de l'Institut universitaire en santé mentale de Québec, Québec City, QC, Canada*

#### *Edited by:*

*Patti Adank, University College London, UK*

#### *Reviewed by:*

*Jonathan E. Peelle, Washington University in St. Louis, USA Carolyn McGettigan, Royal Holloway University of London, UK Samuel Evans, University College London, UK*

#### *\*Correspondence:*

*Isabelle Deschamps, Centre de recherche de l'Institut universitaire en santé mentale de Québec, 2601 rue de la Canardière, Office F-2424A, Québec City, QC G1J 2G3, Canada e-mail: isabelle.deschamps.1@ulaval.ca*

The processing of fluent speech involves complex computational steps that begin with the segmentation of the continuous flow of speech sounds into syllables and words. One question that naturally arises pertains to the type of syllabic information that speech processes act upon. Here, we used functional magnetic resonance imaging to profile regions, using a combination of whole-brain and exploratory anatomical region-of-interest (ROI) approaches, that were sensitive to syllabic information during speech perception by parametrically manipulating syllabic complexity along two dimensions: (1) individual syllable complexity, and (2) sequence complexity (supra-syllabic). We manipulated the complexity of the syllable by using the simplest syllable template, a consonant and vowel (CV), and inserting an additional consonant to create a complex onset (CCV). The supra-syllabic complexity was manipulated by creating sequences composed of the same syllable repeated six times (e.g., /pa-pa-pa-pa-pa-pa/) and sequences of three different syllables each repeated twice (e.g., /pa-ta-ka-pa-ta-ka/). This parametric design allowed us to identify brain regions sensitive to (1) syllabic complexity independent of supra-syllabic complexity, (2) supra-syllabic complexity independent of syllabic complexity, and (3) both syllabic and supra-syllabic complexity. High-resolution scans were acquired for 15 healthy adults. An exploratory anatomical ROI analysis of the supratemporal plane (STP) identified bilateral regions within the anterior two-thirds of the planum temporale, the primary auditory cortices as well as the anterior two-thirds of the superior temporal gyrus that showed different patterns of sensitivity to syllabic and supra-syllabic information. These findings demonstrate that during passive listening to syllable sequences, sublexical information is processed automatically, and sensitivity to syllabic and supra-syllabic information is localized almost exclusively within the STP.

**Keywords: syllabic information, supra-syllabic information, supratemporal plane, speech processing, language**

## **INTRODUCTION**

The speech signal is undoubtedly one of the most complex auditory signals that humans are exposed to, requiring multiple computational steps to parse and convert acoustic waves into discrete linguistic units from which meaning can be extracted. Unsurprisingly, given such complexity, the manner in which the human brain accomplishes the complex computational steps leading to the comprehension of speech remains far from understood.

Functional neuroimaging studies of speech perception offer converging evidence suggesting that the supratemporal plane (STP) and superior temporal sulcus (STS) play a critical role in the processing of speech sounds. This finding is quite robust, having been observed under different types of speech perception tasks (i.e., passive listening, monitoring and discrimination tasks as well as neural adaptation paradigms) and with different types of speech stimuli (words, pseudo-words, syllables, phonemes). For instance, neuroimaging studies contrasting the neural activity evoked by speech stimuli to the neural activity associated with the processing of acoustically complex non-speech sounds or silence have reliably reported clusters of activation within the STP and/or STS (Zatorre et al., 1992; Binder et al., 1996, 1997; Dhankhar et al., 1997; Celsis et al., 1999; Burton et al., 2000; Scott et al., 2000; Benson et al., 2001; Vouloumanos et al., 2001; Joanisse and Gati, 2003; Wilson et al., 2004; Liebenthal et al., 2005; Rimol et al., 2005; Wilson and Iacoboni, 2006; Obleser et al., 2007; Okada et al., 2010; Zhang et al., 2011; Tremblay et al., 2012). In addition, neuropsychological evidence demonstrates that bilateral lesions to the superior temporal lobes can result in pure word deafness, a deficit associated with impaired word comprehension but relatively intact ability to process non-speech sounds (Buchman et al., 1986; Tanaka et al., 1987; Poeppel, 1996). While both functional and neuropsychological studies provide strong evidence regarding the importance of the STP and STS for the perception of speech sounds, the specific contribution of each of the subregions that form this large cortical area to speech perception is still uncertain, in particular whether it is related to the processing of acoustical, sublexical, or lexical information.

Several neuroimaging studies have contrasted the neural activity evoked by different sublexical units (e.g., consonant clusters, phonemes, syllables) to the processing of non-speech or unintelligible speech sounds (e.g., sinewave analogs, tones, environmental sounds, noise, spectrally rotated syllables, silence) to isolate speech-specific processes. These studies reported reliable activation within supratemporal regions [e.g., the superior temporal gyrus (STG), the transverse temporal gyrus (TTG), and planum temporale (PT)], the STS, the middle temporal gyrus (MTG) and, in some instances, in the inferior parietal lobule (IPL), and the inferior frontal gyrus (IFG) (Demonet et al., 1992; Zatorre et al., 1992; Binder et al., 1994; Dhankhar et al., 1997; Giraud and Price, 2001; Vouloumanos et al., 2001; Liebenthal et al., 2005; Rimol et al., 2005; Pulvermuller et al., 2006; Obleser et al., 2007; Tremblay et al., 2012). The consistency of the STP and STS results in studies using words or sublexical units suggests that these regions might be involved in the conversion of acoustical information into phonological representations. However, because these studies have contrasted different types of sublexical units to non-speech or unintelligible speech sounds, the level of processing (e.g., acoustical/phonetic, phonemic, syllabic, suprasyllabic) at which mechanisms implemented within the STP and STS operate remains unclear.

Neuroimaging studies in which phonological mechanisms are engaged by the use of an explicit task (discrimination, rhyming) can more readily target specific mechanisms operating at different sublexical levels (phonemic, syllabic, supra-syllabic) and offer valuable insights into the functional contribution of STP regions to the perception of speech sounds. For instance, STP and STS activation has been reported in studies using a variety of auditory tasks: phonetic discrimination (Burton et al., 2000), rhyming (Booth et al., 2002), syllable identification (Liebenthal et al., 2013), monitoring (Rimol et al., 2005), and phonemic judgments (Jacquemot et al., 2003). Other studies using a neural adaptation paradigm to target phonetic processing have also identified regions within the STP that responded more strongly to stimuli that were part of different phonemic categories than those that fell within the same phonemic category (Dehaene-Lambertz et al., 2005; Joanisse et al., 2007). Taken together, these studies support the notion of a key involvement of the STP and STS in processing sounds at different levels (phonemic, syllabic). However, despite their importance, studies using explicit speech perception tasks requiring judgments on speech sounds probably recruit phonological processes to a greater extent than do more naturalistic speech tasks. It is therefore unclear whether similar regions would be recruited in the absence of a task. It is also unclear whether phonological mechanisms operating at different levels (phonemic, syllabic, supra-syllabic) engage the same or different neural networks. Despite the scarcity of studies addressing this issue, in a recent functional magnetic resonance imaging (fMRI) study, McGettigan et al. (2011) manipulated both the complexity of syllabic and supra-syllabic information in pseudowords during a passive listening task. Syllabic complexity was manipulated by varying the number of consonant clusters (0 vs. 2) and supra-syllabic complexity was manipulated by varying the number of syllables (2 vs. 4). An effect of supra-syllabic complexity was observed in the bilateral PT. However, no positive<sup>1</sup> effect of syllabic complexity was reported. In contrast, Tremblay and Small (2011), also using fMRI, varied syllabic complexity as indexed by the presence or absence of consonant clusters during the passive listening of words and found that the right PT was sensitive to the syllabic complexity manipulation, supporting the idea that the supratemporal cortex plays a role in processing syllabic information (Grabski et al., 2013).

One question that arises from this literature is whether specific sublexical processes can be localized to specific regions within the STP and STS. In the current experiment, we were interested in investigating the distinct and shared effects of syllabic and suprasyllabic complexity on brain activity during the processing of auditory sequences. To this aim, we parametrically manipulated phonological complexity along two dimensions: (1) individual syllable complexity (presence or absence of a consonant cluster in the syllable onset) and (2) sequence-level complexity (the ordering of syllables within a sequence). Given the importance of the STP and STS in the processing of auditory information, we conducted an exploratory anatomical ROI analysis focusing on a fine-grain parcellation of the supratemporal cortex and STS based on our previous work (Tremblay et al., 2012, 2013) to determine whether sub-regions within the STP and STS process similar or different kinds of sublexical information during passive speech perception (i.e., syllabic or supra-syllabic). In these prior studies, we demonstrated that sub-regions within the STP exhibited different patterns of sensitivity to speech sounds during speech perception and production, suggesting that the STP contains a mosaic of functionally distinct areas. It is therefore possible that sub-regions within the STP are processing the speech signal in different manners and at different levels, with some focusing on spectral information, and others on syllable- or sequence-level information. Based on the results from our previous studies, we hypothesized that some sub-regions within the STP (in particular the PT) and STS would show similar patterns of activation for both manipulations while others would show a preference for one manipulation. For example, we expected the primary auditory cortex to be sensitive to both manipulations, as both syllabic and supra-syllabic complexity increase acoustic complexity. We also expected the PT to be sensitive to the syllabic manipulation based on previous results (Tremblay and Small, 2011).

## **MATERIALS AND METHODS**

#### **PARTICIPANTS**

The participants were 15 healthy right-handed (Oldfield, 1971) native French speakers (9 females; 26.8 ± 4.8 years; range 21–34; education 17.3 ± 1.9 years), with normal hearing and no history of language or neurological/neuropsychological disorders. Hearing was assessed using pure tone audiometry (clinical audiometer, AC40, Interacoustic) for each ear separately for the following frequencies: 0.25, 0.5, 1, 2, 3, 4, 5, 8, 12, and 16 kHz. Then, for each participant, a standard pure tone average (PTA: average of thresholds at 0.5, 1, and 2 kHz) was computed for the left (17.13 ± 3.78 dB) and right ear (18.68 ± 3.17 dB), since most of the speech sounds fall within this range (Stach, 2010). All participants were screened for depression (Yesavage et al., 1982) and their cognitive functioning was evaluated using the Montreal Cognitive Assessment scale (MOCA) (Nasreddine et al., 2005).

<sup>1</sup>The authors reported several brain areas in which blood-oxygen-level dependent (BOLD) signal magnitude was higher for pseudowords without consonant clusters than for pseudowords containing consonant clusters.

All participants were within normal range on the MOCA (i.e., 26/30 or better) and none of the participants were depressive. The study was approved by the committee on research ethics of the Institut Universitaire en santé mentale de Québec (#280-2012).
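As an illustration of the pure tone average used in the hearing assessment, the following minimal sketch averages one ear's thresholds at 0.5, 1, and 2 kHz. The threshold values are hypothetical and are not participant data; the function name is ours.

```python
from statistics import mean

# Frequencies (kHz) entering the standard pure tone average.
PTA_FREQUENCIES_KHZ = (0.5, 1, 2)

def pure_tone_average(thresholds_db, frequencies=PTA_FREQUENCIES_KHZ):
    """Mean threshold (dB) over the PTA frequencies for one ear.

    thresholds_db maps test frequency in kHz to the measured threshold in dB."""
    return mean(thresholds_db[f] for f in frequencies)

# Hypothetical audiogram for one ear (all ten tested frequencies).
example_ear = {0.25: 15, 0.5: 15, 1: 20, 2: 16, 3: 20,
               4: 20, 5: 25, 8: 25, 12: 30, 16: 35}

example_pta = pure_tone_average(example_ear)  # uses only 0.5, 1 and 2 kHz
```

Thresholds measured at the other seven frequencies do not enter the average.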

#### **STIMULI AND TASK**

The experimental task consisted of listening passively (i.e., without performing a task) to sequences of syllables. To investigate sublexical phonological processing, we used sequences of syllables instead of pseudowords to avoid lexical effects. Prior research has demonstrated that pseudowords, given their close resemblance to words, activate regions involved in lexical access and in some cases they do so to an even greater extent than words (Newman and Twieg, 2001; Burton et al., 2005). Thus, the use of pseudowords renders the dissociation between lexical and sublexical phonological processing extremely difficult. For this reason, we decided to use syllable strings rather than words to alleviate potential lexical effects. The degree of complexity of each sequence was manipulated along two phonological dimensions: syllabic and supra-syllabic complexity. Each factor had two levels (simple or complex), resulting in a 2 × 2 experimental design matrix (see **Table 1**).

Syllabic complexity refers to the presence or absence of a consonant cluster (e.g., /gr/): simple syllables were composed of a single consonant and vowel (CV) and complex syllables were composed of a consonant cluster and a vowel (CCV). Supra-syllabic complexity refers to the number of different syllables in a sequence: simple sequences were composed of the same syllable repeated six times (e.g., /ba-ba-ba-ba-ba-ba/) and complex sequences were composed of three different syllables each repeated twice (e.g., /ba-da-ga-ba-da-ga/). While these two manipulations increase phonological complexity, they target different levels of processing: syllabic (individual unit) and supra-syllabic (sequence of units).

All syllables were created by selecting among five frequent French vowels, which included two front vowels (/i/, /ε/), two back vowels (/o/, /u/), and one central vowel (-), and 12 frequent French consonants, which included four labial consonants (/b/, /p/, /v/, /f/), four coronal consonants (/d/, /n/, /t/, /l/) and four dorsal consonants (/g/, /ñ/, /k/, /<sup>R</sup> /). These vowels and consonants were combined to form 60 simple syllables (CV) and 60 complex syllables (CCV). Each syllable was repeated a total of three times (i.e., in three different sequences). Six-syllable sequences were created by producing sequences of three different syllables twice (/pa-ta-ka-pa-ta-ka/), or by repeating one syllable six times (/pa-pa-pa-pa-pa-pa/). A native young adult male French speaker from Québec City pronounced all syllable sequences naturally in a sound-attenuated booth. Each sequence was recorded five times and the best exemplar was selected to use in the experiment. The syllable sequences were recorded at 44.1 kHz using a unidirectional microphone connected to a sound card (Fast Track C-400, M-audio), saved directly to disk using Sound Studio 4.5.4 (Felt Tip Software, NY, USA), and edited offline using Wave Pad Sound Editor 4.53 (NCH Software, Canberra, Australia). Each syllable sequence was edited to have an average duration of 2400 ms. The duration of the syllable sequences was the same across all experimental conditions (i.e., 2400 ms). The root mean square (RMS) intensity was then normalized across all sound files. Individual sequences were not repeated during the course of the experiment.
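The structure of the stimulus set can be sketched as follows. This is an illustration only: the helper names are ours, and the CCV items (`pra`, `tra`, `kra`) are hypothetical examples rather than items taken from the actual stimulus list.

```python
def simple_sequence(syllable, length=6):
    """One syllable repeated `length` times, e.g. pa-pa-pa-pa-pa-pa."""
    return "-".join([syllable] * length)

def complex_sequence(syllables, repeats=2):
    """A run of different syllables produced `repeats` times, e.g. pa-ta-ka-pa-ta-ka."""
    return "-".join(list(syllables) * repeats)

# Pairing every consonant with every vowel gives the size of the CV inventory.
N_CONSONANTS, N_VOWELS = 12, 5
n_cv_syllables = N_CONSONANTS * N_VOWELS

# 2 x 2 design, keyed as (syllabic complexity, supra-syllabic complexity).
# The CCV syllables are hypothetical placeholders.
design = {
    ("simple", "simple"): simple_sequence("pa"),
    ("simple", "complex"): complex_sequence(["pa", "ta", "ka"]),
    ("complex", "simple"): simple_sequence("pra"),
    ("complex", "complex"): complex_sequence(["pra", "tra", "kra"]),
}
```

Every cell contains six syllables, so in this sketch sequence length is held constant while the two complexity factors vary independently.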

#### **PROCEDURE**

This experimental paradigm resulted in four conditions of 30 trials each, for a total of 120 trials. Each trial lasted 6.5 s. A resting baseline condition was interleaved with the experimental conditions (60 trials). The randomization of the experimental and baseline conditions was optimized using Optseq2 (http://surfer.nmr.mgh.harvard.edu/optseq/). The four conditions were equally divided into two runs. A passive listening experimental paradigm was used; participants were not required to produce any overt response. All stimuli were presented during the delay in acquisition (see Image acquisition section below) using Presentation Software (Neurobehavioral System, CA, USA) through high-quality MRI-compatible stereo electrostatic earplugs (Nordic Neurolab, Norway), which provide 30 dB of sound attenuation.
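The trial counts above imply the following session arithmetic. This is a minimal sketch; the assumption that the 60 baseline trials were split evenly between the two runs is ours and is not stated in the text.

```python
TRIAL_DURATION_S = 6.5
N_CONDITIONS = 4
TRIALS_PER_CONDITION = 30
N_BASELINE_TRIALS = 60
N_RUNS = 2

n_experimental = N_CONDITIONS * TRIALS_PER_CONDITION   # experimental trials
n_total = n_experimental + N_BASELINE_TRIALS            # including rest trials
session_duration_s = n_total * TRIAL_DURATION_S

# Conditions were equally divided into two runs; an even split of the
# baseline trials is an assumption made for this illustration.
trials_per_condition_per_run = TRIALS_PER_CONDITION // N_RUNS
trials_per_run = n_total // N_RUNS
run_duration_s = trials_per_run * TRIAL_DURATION_S
```

Under these assumptions each run lasts just under 10 min, and rest trials make up one third of all trials.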

#### **IMAGE ACQUISITION**

A 3 T Philips Achieva TX MRI scanner was used to acquire anatomical and functional data for each participant. Structural MR images were acquired with a T1-weighted MPRAGE sequence (*TR*/*TE* = 8.2/3.7 ms, flip angle = 8°, isotropic voxel size = 1 mm<sup>3</sup>, 256 × 256 matrix, 180 slices/volume, no gap). Single-shot EPI BOLD functional images were acquired using parallel imaging, with a SENSE reduction factor of 2 to reduce the number of phase encoding steps and speed up acquisition. In order to ensure that syllables were intelligible, a sparse image acquisition technique (Eden et al., 1999; Edmister et al., 1999; Hall et al., 1999; Gracco et al., 2005) was used. A silent period of 4360 ms was interleaved between each volume acquisition. The syllable sequences were presented 360 ms after the onset of the silent period. One hundred and eighty functional volumes were acquired across 2 runs (*TR*/*TE* = 6500/30 ms, volume acquisition = 2140 ms; delay in *TR* = 4360 ms, 40 axial slices parallel to AC/PC, voxel size = 3 × 3 × 3 mm<sup>3</sup>, no gap; matrix = 80 × 80; FoV = 240 × 240 mm). This study was part of a larger project, which also included a speech production task and a speech perception in noise task<sup>2</sup>. Those two tasks will not be discussed as part of this manuscript. The speech perception task that is the focus of the present manuscript was always presented first to participants, followed by the speech production task and the speech perception in noise task. Participants were not told until the production task that they would be required to produce speech. This was done in order to avoid priming subvocal rehearsal during the speech perception task. The speech perception in noise task has been reported elsewhere (Bilodeau-Mercure et al., 2014).

<sup>2</sup>Not all participants took part in all three tasks. Here we report the data from 15 young adults, whereas Bilodeau-Mercure et al. (2014) reported the data for a subset (11) of these participants, who performed the speech perception in noise task.
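The sparse-sampling timing can be checked with a short sketch. Times are expressed relative to the onset of the silent period, which is our choice of reference point; the 2400 ms stimulus duration comes from the Stimuli and Task section.

```python
TR_MS = 6500
SILENT_MS = 4360          # delay in TR (scanner silent)
ACQUISITION_MS = 2140     # volume acquisition
STIMULUS_DELAY_MS = 360   # stimulus onset after the start of the silent period
STIMULUS_DURATION_MS = 2400

def trial_timeline(silent_onset_ms=0):
    """Event times (ms) for one trial, given when its silent period starts."""
    stimulus_onset = silent_onset_ms + STIMULUS_DELAY_MS
    stimulus_offset = stimulus_onset + STIMULUS_DURATION_MS
    acquisition_onset = silent_onset_ms + SILENT_MS
    return {
        "stimulus_onset": stimulus_onset,
        "stimulus_offset": stimulus_offset,
        "acquisition_onset": acquisition_onset,
        "acquisition_offset": acquisition_onset + ACQUISITION_MS,
        # Silence remaining between stimulus end and scanner noise.
        "quiet_gap": acquisition_onset - stimulus_offset,
    }

first_trial = trial_timeline(0)
second_trial = trial_timeline(TR_MS)
```

With these values, each sequence ends 1600 ms before the next volume is acquired, so the stimulus never overlaps with scanner noise.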

#### **DATA ANALYSIS**

#### *fMRI time-series analyses*

All functional time-series were motion-corrected, time-shifted, de-spiked and mean-normalized using AFNI (version 10.7, intel 64; Cox, 1996). All time points that occurred during excessive motion (i.e., >1 mm) (Johnstone et al., 2006) were censored. The anatomical scan of each participant was aligned to their registered EPI time series using local Pearson correlations (Saad et al., 2009). The alignment was verified and manually adjusted when necessary. For each participant and for each run, a finite impulse response ordinary least squares model was used to fit each time point of the hemodynamic response function for each of the four experimental conditions (SS, SC, CS, CC) using AFNI's tent basis function. Additional regressors for the mean, the linear and quadratic trend components as well as the six motion parameters were also included. This model-free deconvolution method allows the shape of the hemodynamic response to vary for each condition rather than assuming a single response profile for all conditions (Meltzer et al., 2008). The interval modeled covered the entire volume acquisition (2.14 s), starting with stimulus onset and continuing at intervals of 6.5 s (i.e., silent period and volume acquisition) for 13 s (i.e., 2 *TR*). All analyses (whole-brain and ROIs) focused on the first interval (i.e., the first TR). The resulting time-series were projected onto the 2-dimensional surfaces where all subsequent processing took place.
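To illustrate the idea behind a tent (piecewise-linear) basis, the sketch below evaluates a set of tent functions at a given post-stimulus time. It is a generic illustration, not AFNI's implementation; the function names are ours, and placing knots at 0, 6.5, and 13 s is an assumption made for the example.

```python
def tent(x):
    """Unit tent function: 1 at x = 0, falling linearly to 0 at |x| = 1."""
    return max(0.0, 1.0 - abs(x))

def tent_basis(t, start, stop, n_knots):
    """Values at time t (s after stimulus onset) of n_knots tent functions
    whose peaks are evenly spaced from start to stop."""
    spacing = (stop - start) / (n_knots - 1)
    return [tent((t - start - k * spacing) / spacing) for k in range(n_knots)]

# Assumed knots at 0, 6.5 and 13 s (one per TR of 6.5 s).
at_onset = tent_basis(0.0, 0.0, 13.0, 3)
at_first_tr = tent_basis(6.5, 0.0, 13.0, 3)
between_knots = tent_basis(3.25, 0.0, 13.0, 3)
```

Each coefficient fitted to such a basis estimates the response amplitude at one knot, which is why no fixed hemodynamic response shape has to be assumed. At times between knots, the two neighboring tents share the weight linearly.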

For each participant, FreeSurfer was used to create a surface representation of the participant's MRI (Dale et al., 1999; Fischl et al., 1999) by inflating each hemisphere of the anatomical volumes to a surface representation and aligning it to a template of average curvature. SUMA was used to import the surface representations into the AFNI 3D space and to project the pre-processed time-series from the 3-dimensional volumes onto the 2-dimensional surfaces. Both the surface representations and the pre-processed time-series were standardized to a common mesh reference system (Saad et al., 2004). The time-series were smoothed on the surface to achieve a target smoothing value of 6 mm using a Gaussian full width half maximum (FWHM) filter. Smoothing on the surface as opposed to the volume ensures that white matter values are not included, and that functional data located in anatomically distant locations on the cortical surface are not averaged across sulci (Argall et al., 2006).
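For reference, the 6 mm FWHM target corresponds to a Gaussian standard deviation of roughly 2.55 mm. The sketch below shows the standard conversion only; it does not reproduce the surface-based smoothing procedure itself, and the function names are ours.

```python
import math

def fwhm_to_sigma(fwhm):
    """Standard deviation of a Gaussian with the given full width at half maximum."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def gaussian_weight(distance, sigma):
    """Unnormalized Gaussian kernel weight at a given distance from the center."""
    return math.exp(-(distance ** 2) / (2.0 * sigma ** 2))

sigma_mm = fwhm_to_sigma(6.0)
# By definition, the weight falls to half its peak at FWHM / 2 from the center.
weight_at_half_width = gaussian_weight(3.0, sigma_mm)
```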

#### *Group-level node-wise analyses*

Whole-brain group analyses were performed using SUMA on the participants' beta values resulting from the first level analysis (Saad et al., 2004). The group level analyses focused on (1) the effect of passive auditory sequence perception on the blood oxygenation level dependent (BOLD) signal, (2) the effect of syllabic and supra-syllabic complexity on the BOLD signal during auditory sequence perception, (3) the contrast between the effect of syllabic and supra-syllabic complexity, and (4) the conjunction of the syllabic and supra-syllabic complexity effects. To identify regions recruited during the perception of auditory sequences, a node-wise linear regression was conducted (perception *>*0, one sample *t*-test option in the AFNI 3dttest++ program). To investigate the effect of syllabic and supra-syllabic complexity, a two-way repeated measures ANOVA (rANOVA) was conducted (AFNI's 3dANOVA program) with syllabic complexity (simple, complex) and supra-syllabic complexity (simple, complex) as within-subjects factors. To identify regions that exhibited a stronger response to one of the manipulations (i.e., syllabic or supra-syllabic), we computed, at the individual subject level, the effect of syllabic complexity (complex syllables - simple syllables) and the effect of supra-syllabic complexity (complex sequences - simple sequences). At the group level, the resulting *t*-maps were submitted to a paired sample *t*-test, to determine whether the two contrasts (i.e., syllable and sequence contrast) differed (AFNI 3dttest++ program). For the conjunction, we computed a map of the joint activation, for each subject, for syllabic and supra-syllabic complexity (syllabic ∩ supra-syllabic). Only voxels that were significant at *p* = 0*.*05 (uncorrected) in both individual maps were included in the conjunction map. A group-level average of the conjunction maps was then generated.
All resulting group maps were corrected for multiple comparisons using the Monte Carlo procedure implemented in FreeSurfer. This correction implements a cluster-size threshold procedure to protect against Type I error. For the first three analyses, based on the simulation results, it was determined that a family-wise error (FWE) rate of *p <* 0*.*001 is achieved with a minimum cluster size of 157 contiguous surface nodes, each significant at *p <* 0*.*01. For the conjunction analysis, we adopted a more lenient correction (a FWE rate of *p <* 0*.*05 was achieved with a minimum cluster size of 202 contiguous surface nodes, each significant at *p <* 0*.*05).
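The cluster-size threshold can be sketched as a connected-components search over the surface mesh. The sketch below assumes the minimum cluster size has already been obtained from the Monte Carlo simulation (157 or 202 nodes in the text) and shows only the final thresholding step. The mesh is represented by a hypothetical adjacency dictionary, and the function is not FreeSurfer's implementation.

```python
from collections import deque


def cluster_threshold(p_values, neighbors, node_alpha=0.01, min_cluster_size=157):
    """Return the set of node indices that belong to surviving clusters.

    p_values: per-node p-values; neighbors: dict mapping node -> adjacent nodes.
    A cluster is a set of contiguous nodes that all satisfy p < node_alpha.
    """
    supra_threshold = {i for i, p in enumerate(p_values) if p < node_alpha}
    visited, survivors = set(), set()
    for seed in supra_threshold:
        if seed in visited:
            continue
        # Breadth-first search collects one contiguous cluster.
        cluster, queue = {seed}, deque([seed])
        visited.add(seed)
        while queue:
            node = queue.popleft()
            for adjacent in neighbors[node]:
                if adjacent in supra_threshold and adjacent not in visited:
                    visited.add(adjacent)
                    cluster.add(adjacent)
                    queue.append(adjacent)
        if len(cluster) >= min_cluster_size:
            survivors |= cluster
    return survivors


# Toy mesh: six nodes in a chain, with a minimum cluster size of 3 for illustration.
n_nodes = 6
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < n_nodes] for i in range(n_nodes)}
toy_p = [0.001, 0.002, 0.5, 0.003, 0.004, 0.005]
surviving_nodes = cluster_threshold(toy_p, chain, node_alpha=0.01, min_cluster_size=3)
```

In the toy example, nodes 0 and 1 form a two-node cluster that is discarded, whereas nodes 3 to 5 form a three-node cluster that is retained.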

#### *Exploratory anatomical ROI analysis*

To examine the role of supratemporal regions in the processing of syllabic and supra-syllabic information, we conducted an exploratory anatomical ROI analysis focusing on a set of 16 a priori selected anatomical regions. In a previous study, using a similar fine-grain parcellation, we demonstrated that several STP regions exhibited differential sensitivity patterns to auditory categories (i.e., syllables or bird songs) and sequence regularity (Tremblay et al., 2012). Here we used a similar parcellation scheme with the addition of the STS to investigate the sensitivity of these regions to syllabic and supra-syllabic information. These bilateral ROIs included the planum polare (PP), the STG, the STS, the TTG, the transverse temporal sulcus (TTS), the PT, and the caudal segment of the Sylvian fissure (SF). These ROIs were anatomically defined on the participant's individual cortical surface representation using an automated parcellation scheme (Fischl et al., 2004; Desikan et al., 2006). This parcellation scheme relies on a probabilistic labeling algorithm based on the well-established anatomical convention of Duvernoy (1991). The anatomical accuracy of this method is high, approaching that of manual parcellation (Fischl et al., 2002, 2004; Desikan et al., 2006). The advantage of using anatomical (as opposed to functional) ROIs based on individual micro-anatomical landmarks is that it can capture inter-subject anatomical variability, something that is lost when using normalized templates (i.e., functional ROIs based on group level data or cytoarchitectonic maps). It is also more anatomically precise. Thus, given that we were specifically interested in exploring the functional anatomy of the STP/STS, the choice of an anatomical ROI approach was logical.

To augment the spatial resolution of the FreeSurfer anatomical parcellation, we manually subdivided the initial parcellation of each participant's inflated surface in the following manner: the STS, the STG, and the PT were subdivided into equal thirds whereas the SF, the TTG, and the TTS were subdivided into equal halves, resulting in 16 ROIs (refer to **Figure 1** and **Table 2** for details). The use of this modified FreeSurfer parcellation scheme is advantageous for several reasons: (1) it is based on a well-recognized anatomical parcellation scheme, (2) it is systematic, (3) it is easily replicable across participants and studies, and (4) it has been shown to reveal functional subdivisions within the STP (Tremblay et al., 2012).
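The arithmetic behind the 16 ROIs per hemisphere can be made explicit by enumerating the subdivisions. The suffixes (a, m, p for anterior, middle, posterior; m, l for medial, lateral) follow the legends of Figures 4 and 5. The dictionary layout and the `left-`/`right-` prefixes are only an illustrative convention.

```python
# Subdivisions of each parent region; the PP is left undivided.
SUBDIVISIONS = {
    "STS": ["a", "m", "p"],
    "STG": ["a", "m", "p"],
    "PT": ["a", "m", "p"],
    "SF": ["a", "p"],
    "TTG": ["m", "l"],
    "TTS": ["m", "l"],
    "PP": [""],
}

roi_labels = [region + part for region, parts in SUBDIVISIONS.items() for part in parts]
bilateral_labels = [f"{hemi}-{label}" for hemi in ("left", "right") for label in roi_labels]

n_rois_per_hemisphere = len(roi_labels)   # 3 * 3 + 3 * 2 + 1 = 16
n_rois_bilateral = len(bilateral_labels)  # 32
```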

For each participant, we extracted the mean percentage of BOLD signal change in each of the 16 resulting bilateral ROIs. First, we determined which ROIs were significantly active during the auditory perception of the sequences by testing the following hypothesis using FDR-corrected *t*-tests (Benjamini and Hochberg, 1995; Genovese et al., 2002) (*q* = 0*.*05): (i) perception *>*0, (*n* = 32, one-sample *t*-tests).
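The FDR correction cited above (Benjamini and Hochberg, 1995) is a step-up procedure. A minimal sketch follows, with hypothetical *p*-values and a hypothetical function name; it illustrates the procedure itself rather than the software actually used in the analysis.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True if rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * q.
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            largest_rank = rank
    # Reject every hypothesis ranked at or below that rank.
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p-values for four ROIs.
example_p = [0.03, 0.5, 0.02, 0.024]
example_rejected = benjamini_hochberg(example_p, q=0.05)
```

The example shows the step-up property: the smallest *p*-value (0.02) exceeds its own threshold (0.0125), yet it is rejected because a larger *p*-value further up the ranking meets its threshold.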

For each ROI that was significantly active, we conducted a three-way ANOVA with repeated measurements on the magnitude of the BOLD signal as a function of hemisphere, syllabic complexity, and supra-syllabic complexity. Within each ROI, all main effects as well as two-way and three-way interactions were examined using Bonferroni corrected paired-sample *t*-tests (α = 0*.*05). Adjusted *p*-values are reported.
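Bonferroni-adjusted *p*-values are obtained by multiplying each uncorrected *p*-value by the number of comparisons and capping the result at 1, which is why adjusted values of *p* = 1 can appear in the results. The sketch below uses hypothetical values; the number of comparisons per family is an assumption, as it is not stated in the text.

```python
def bonferroni_adjust(p_values):
    """Return Bonferroni-adjusted p-values for one family of comparisons."""
    n_comparisons = len(p_values)
    return [min(1.0, p * n_comparisons) for p in p_values]


# Hypothetical family of three uncorrected p-values.
adjusted = bonferroni_adjust([0.011, 0.25, 0.5])
significant = [p < 0.05 for p in adjusted]
```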

### **RESULTS**

#### **WHOLE BRAIN RESULTS**

The first whole-brain analysis focused on identifying brain regions that were significantly recruited during the perception of auditory sequences regardless of syllabic and supra-syllabic complexity. The node-wise linear regression identified regions within the bilateral precentral gyrus, IFG, medial superior frontal gyrus and supratemporal cortex, as well as the left cingulate gyrus and right superior frontal gyrus, that were more active during the perception of auditory sequences than during the baseline (i.e., rest) (for details, refer to **Figure 2** and **Table 3**).

The second analysis sought to identify brain regions that were sensitive to syllabic complexity and supra-syllabic complexity. The node-wise rANOVA showed significant main effects of syllabic complexity and supra-syllabic complexity within the STP (for details, refer to **Table 4** and **Figures 3A,B**). As illustrated in **Figure 3A**, for the syllabic complexity manipulation, significant clusters of activation were observed within the left TTGl extending posteriorly into the SFp, and medially into the inferior sulcus of the insula as well as the right TTGl extending posteriorly into the SFa, laterally into the STGm and medially into the inferior circular sulcus of the insula (for details, refer to **Table 4A**). These two regions were significantly more active for the complex syllables than the simple syllables. As illustrated in **Figure 3B**, an effect of supra-syllabic complexity was found within the left STGm extending medially into the STSm, and TTSl as well as the right STGa/m, the right central sulcus and the right superior frontal gyrus. Only the clusters within the STP were significantly more active for the complex sequences (see **Table 4B**). No significant two-way interaction between syllabic complexity and supra-syllabic complexity was found.

The third analysis sought to determine whether brain regions responded more to one complexity manipulation than the other. The node-wise *t*-test showed that the effect of supra-syllabic complexity was stronger than the effect of syllabic complexity within STP regions in the left STSp, STGp, and STGa, whereas the effect of syllabic complexity was stronger than the effect of supra-syllabic complexity in the left TTGl (for details, refer to **Table 5** and **Figure 3C**).
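The logic of this comparison can be sketched at a single surface node as a paired *t*-test on per-participant contrast values, ordered as [complex sequence - simple sequence] - [complex syllable - simple syllable], so that a positive *t* indicates a stronger supra-syllabic effect. The values below are hypothetical and the function is a plain-Python illustration, not the AFNI 3dttest++ program.

```python
import math
import statistics


def paired_t(first, second):
    """Paired-sample t statistic for first - second (one value per participant)."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error


# Hypothetical per-participant effects at one node.
suprasyllabic_effect = [1.0, 2.0, 3.0, 4.0]  # complex sequence - simple sequence
syllabic_effect = [0.0, 0.0, 1.0, 1.0]       # complex syllable - simple syllable
t_value = paired_t(suprasyllabic_effect, syllabic_effect)
degrees_of_freedom = len(suprasyllabic_effect) - 1
```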

The last analysis focused on identifying regions that were sensitive to both experimental manipulations. As illustrated in **Figure 3D**, the conjunction between the syllabic complexity contrast and the supra-syllabic contrast revealed overlapping activation for both experimental manipulations within left STP regions (TTSm, TTSl, PTa, STGm), the cuneus as well as right STP regions (TTSm, TTSl, SFp), the right supramarginal gyrus, and the right subparietal sulcus. For each area that responded to both manipulations, we quantified the number of participants for which the two effects overlapped. As can be seen in **Figure 3D**, fewer than five participants shared common overlapping regions.
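The conjunction and the per-node participant count can be sketched as follows. The sketch assumes that a node enters an individual conjunction map when both uncorrected *p*-values fall below 0.05, and it uses hypothetical *p*-values for three participants and four nodes. The function names are illustrative only.

```python
def conjunction_map(p_syllabic, p_suprasyllabic, alpha=0.05):
    """Per-node conjunction for one participant: True if both effects pass alpha."""
    return [a < alpha and b < alpha for a, b in zip(p_syllabic, p_suprasyllabic)]


def overlap_count(conjunction_maps):
    """Number of participants showing the conjunction at each node."""
    return [sum(values) for values in zip(*conjunction_maps)]


# Hypothetical (syllabic, supra-syllabic) p-values per participant.
participants = [
    ([0.01, 0.20, 0.03, 0.04], [0.02, 0.01, 0.30, 0.04]),
    ([0.04, 0.04, 0.50, 0.60], [0.03, 0.50, 0.01, 0.70]),
    ([0.50, 0.01, 0.01, 0.01], [0.01, 0.02, 0.90, 0.04]),
]
maps = [conjunction_map(syllabic, supra) for syllabic, supra in participants]
counts = overlap_count(maps)
group_average = [count / len(maps) for count in counts]
```

The `counts` list corresponds to the quantity described for Figure 3D (number of participants with overlapping effects at a node), and `group_average` to the group-level average of the individual conjunction maps.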

#### **Table 3 | FWE-corrected whole-brain results for the speech perception network.**


*All coordinates are in MNI space and represent the peak surface node for each cluster (FWE: p* = *0.001, minimum cluster size: 157 contiguous surface nodes, each significant at p < 0.01). When more than one activation focus is listed, this means that the cluster had multiple peaks or was not continuous.*

#### **Table 4 | FWE-corrected whole-brain BOLD results.**


*All coordinates are in MNI space and represent the peak surface node for each cluster (FWE: p* = *0.001, minimum cluster size: 157 contiguous surface nodes, each significant at p < 0.01). T-values are reported instead of F-values. T-values were obtained by contrasting the two levels of complexity for each experimental factor while collapsing across the other one. When more than one activation focus is listed, this means that the cluster had multiple peaks or was not continuous.*

**FIGURE 3 | Whole-brain analysis of BOLD response.** Activation is shown on the group average smoothed flattened surfaces. The first three analyses **(A–C)** are controlled for multiple comparisons using a cluster extent of 157 vertices, and a single node threshold of *p <* 0*.*01, to achieve a family-wise error rate of *p <* 0*.*001. The last analysis **(D)** is controlled for multiple comparisons using a cluster extent of 202 vertices, and a single node threshold of *p <* 0*.*05, to achieve a family-wise error rate of *p <* 0*.*05. Panel **(A)** illustrates regions significantly active for the contrast between levels of syllabic complexity (complex *>* simple sequences). Panel **(B)** illustrates regions significantly active for the contrast between levels of supra-syllabic complexity (complex *>* simple). Panel **(C)** illustrates regions that were differently active for the two complexity contrasts ([complex sequence - simple sequence] - [complex syllable - simple syllable]). Panel **(D)** illustrates regions significantly active for the conjunction of syllabic and supra-syllabic complexity (syllabic complexity ∩ supra-syllabic complexity). The color scheme represents the number of participants in which an overlap between the two manipulations was found (less than 5).

#### **Table 5 | FWE-corrected whole-brain BOLD results.**


*All coordinates are in MNI space and represent the peak surface node for each cluster (FWE: p* = *0.001, minimum cluster size: 157 contiguous surface nodes, each significant at p < 0.01). Two clusters or multiple clusters indicate that the activation cluster is not continuous.*

#### **EXPLORATORY SUPRA-TEMPORAL ROI ANALYSES**

Only the ROIs that were significantly activated for speech perception were included in the subsequent analyses. Of the 32 ROIs, only the bilateral STSp was not significantly activated. For each remaining ROI (*n* = 15), we investigated the main effects of hemisphere, syllabic complexity, and supra-syllabic complexity as well as the two-way interactions between hemisphere ∗ syllabic complexity, hemisphere ∗ supra-syllabic complexity, and syllabic complexity ∗ supra-syllabic complexity, and the three-way interaction between hemisphere ∗ syllabic complexity ∗ supra-syllabic complexity. Bonferroni adjusted *p*-values are reported.

As shown in **Figure 4**, a main effect of syllabic complexity was observed in the TTGl [*F*(1*,* 14) = 26*.*44, *p* = 0*.*0002], the TTGm [*F*(1*,* 14) = 31*.*11, *p* = 0*.*00007], the TTSl [*F*(1*,* 14) = 29*.*4, *p* = 0*.*00009], the TTSm [*F*(1*,* 14) = 17*.*13, *p* = 0*.*001], the STGm [*F*(1*,* 14) = 8*.*71, *p* = 0*.*011], the SFp [*F*(1*,* 14) = 5*.*90, *p* = 0*.*029], the SFa [*F*(1*,* 14) = 9*.*84, *p* = 0*.*007], the PTa [*F*(1*,* 14) = 13*.*61, *p* = 0*.*002] and the PTm [*F*(1*,* 14) = 4*.*84, *p* = 0*.*045]. We then determined the type of stimuli driving the effect. For all nine regions, a stronger effect was observed for complex than simple syllables (paired sample *t*-tests, Bonferroni corrected). For the SFa, a significant hemisphere ∗ syllabic complexity interaction was also observed [*F*(1*,* 14) = 8*.*39, *p* = 0*.*012]. Paired sample *t*-tests revealed that the source of the interaction was due to the presence of an effect of syllabic complexity for the left SFa (*t* = 4*.*39, *p* = 0*.*003) but not the right SFa (*t* = 1*.*358, *p* = 0*.*59) (for details, refer to **Figure 3**). For the PTm, a significant syllabic complexity ∗ supra-syllabic complexity interaction was noted. Paired sample *t*-tests revealed that this interaction was due to the presence of an effect of syllabic complexity for the complex (*t* = 2*.*95, *p* = 0*.*044) but not the simple sequences (*t* = 0*.*01, *p* = 1) (for details, refer to **Figure 4**). For the TTSm, a significant syllabic complexity ∗ hemisphere interaction was observed. Paired sample *t*-tests revealed that this interaction was due to a marginally significant difference when we computed a differential complexity score per hemisphere [complex - simple syllable] and compared these scores across hemispheres (*t* = −2*.*51, *p* = 0*.*06). A significant three-way interaction was observed in the STSa. To investigate the source of the three-way interaction, two-way interactions were computed. 
A two-way interaction between syllabic complexity and hemisphere was found for complex sequences [*F*(1*,* 14) = 7*.*32, *p* = 0*.*018] but not for simple sequences [*F*(1*,* 14) = 0*.*413, *p* = 0*.*531]. Paired sample *t*-tests were computed. A marginally significant difference (*t* = −2*.*67, *p* = 0*.*054) was found when we computed a differential complexity score per hemisphere [complex - simple syllable] and compared these scores across hemispheres.

The overall pattern that emerges with regard to the syllabic manipulation is a significant increase in sensitivity for complex syllables (i.e., CCV) relative to simple syllables (i.e., CV) in the TTGl, TTGm, TTSm, TTSl, STGm, SFp, SFa, PTa, and PTm. Furthermore, the SFa demonstrated a lateralization effect during the processing of syllabic information (the left SFa was sensitive to the syllabic manipulation but not the right SFa). Lastly, the PTm was the only region where an interaction between the syllabic and supra-syllabic manipulations was observed. In this region, the effect of syllabic complexity was restricted to complex sequences.

**FIGURE 4 | Patterns of syllabic complexity effects observed in exploratory STP and STS ROI analysis.** The results are shown on a flattened schematic representation of STP and STS showing the parcellation used in this study (different areas shown not to scale). Areas in dark purple exhibited a main effect of complexity and areas in lighter purple exhibited an interaction (hemisphere ∗ syllabic complexity for the SFa and syllabic complexity ∗ supra-syllabic complexity for the PTm). Legend: PP, planum polare; TTG, transverse temporal gyrus (m, medial; l, lateral); TTS, transverse temporal sulcus (m, medial; l, lateral); PT, planum temporale (a, anterior; m, middle; p, posterior); SF, caudal Sylvian fissure (a, anterior; p, posterior); STG, superior temporal gyrus (a, anterior; m, middle; p, posterior); STS, superior temporal sulcus (a, anterior; m, middle; p, posterior); <sup>∗</sup>significant contrast at pFWE = 0*.*05, Bonferroni corrected; n.s. non-significant contrast. Error bars represent standard error of the mean.

As shown in **Figure 5**, a main effect of supra-syllabic complexity was observed in the STSm [*F*(1*,* 14) = 5*.*89, *p* = 0*.*03], the STGa [*F*(1*,* 14) = 5*.*39, *p* = 0*.*036], the STGm [*F*(1*,* 14) = 27*.*38, *p* = 0*.*0001], the PTa [*F*(1*,* 14) = 8*.*64, *p* = 0*.*01], the TTSl [*F*(1*,* 14) = 10*.*95, *p* = 0*.*005], the TTSm [*F*(1*,* 14) = 11*.*67, *p* = 0*.*004], and the TTGm [*F*(1*,* 14) = 8*.*619, *p* = 0*.*011]. We determined that for all seven regions, the complex sequences were driving the main effect of supra-syllabic complexity as they elicited higher levels of BOLD signal than simple sequences (paired sample *t*-tests, Bonferroni corrected). For the STSm and SFp, a hemisphere ∗ supra-syllabic interaction was observed [STSm: *F*(1*,* 14) = 10*.*06, *p* = 0*.*007, SFp:*F*(1*,* 14) = 11*.*84, *p* = 0*.*004]. For both regions, paired sample *t*-tests revealed that the source of the interaction was due to an effect of supra-syllabic complexity in the left hemisphere (STSm: *t* = 3*.*851, *p* = 0*.*004, SFp: *t* = 2*.*55, *p* = 0*.*046) but not the right hemisphere (STSm: *t* = 0*.*64, *p* = 1, SFp: *t* = 0*.*965, *p* = 1).

The overall pattern that emerges with regard to the supra-syllabic manipulation is a significant increase in sensitivity for complex sequences (i.e., three different syllables) relative to simple sequences (i.e., same syllable repeated 3×) in the STSm, STGa, STGm, PTa, TTSl, TTSm, and TTGm. In addition, in two regions, the STSm and SFp, an effect of hemisphere was observed. For both of these regions, the effect of supra-syllabic complexity was only observed in the left hemisphere.

In sum, the pattern that emerges from the ROI analysis suggests that some ROIs (STGm, TTSl, TTGm, TTSm, PTa) are sensitive to both experimental manipulations while others are only sensitive to one experimental manipulation (i.e., syllabic: left SFa, PTm, TTGl; supra-syllabic: left STSm, left SFp; for details refer to **Figure 6**). In addition, for ROIs that were sensitive to both manipulations, the magnitude of the manipulations was equivalent given the absence of syllabic complexity ∗ supra-syllabic complexity interaction within these regions.
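The grouping of ROIs can be cross-checked with simple set operations on the regions that showed a main effect of each manipulation, as listed in the two preceding summaries. This sketch considers main effects only. Hemisphere-specific effects revealed by interactions (e.g., the supra-syllabic effect in the left SFp) are not represented, so the SFp appears here under the syllabic main effect alone.

```python
# ROIs with a main effect of syllabic complexity (as reported above).
syllabic_main = {"TTGl", "TTGm", "TTSm", "TTSl", "STGm", "SFp", "SFa", "PTa", "PTm"}
# ROIs with a main effect of supra-syllabic complexity (as reported above).
suprasyllabic_main = {"STSm", "STGa", "STGm", "PTa", "TTSl", "TTSm", "TTGm"}

both_manipulations = syllabic_main & suprasyllabic_main
syllabic_only = syllabic_main - suprasyllabic_main
suprasyllabic_only = suprasyllabic_main - syllabic_main
```

The intersection reproduces the five ROIs listed as sensitive to both manipulations (STGm, TTSl, TTGm, TTSm, PTa).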

## **DISCUSSION**

Neuroimaging studies have consistently documented the role of two large and functionally heterogeneous cortical areas, the STP and STS, in the perception of speech sounds. However, a detailed understanding of the role of the STP and STS in the processing of sublexical information has not yet emerged. This is largely related to the intrinsic complexity of the speech signal. Indeed, comprehending speech requires the interaction of complex sensory, perceptual, and cognitive mechanisms. The question, then, that naturally arises is whether these regions show differential patterns of activation as a function of the type of information being processed (syllabic vs. supra-syllabic) (functional heterogeneity) and the specific sub-region (spatial heterogeneity).

The main objective of the current study was to examine, using fMRI, whether the processing of syllabic and supra-syllabic information during a passive listening task involves similar or distinct networks, with an emphasis on the STP and the STS. A passive listening paradigm was used in order to minimize task-related cognitive/executive demands. Given the importance of the STP and STS in speech processing, we conducted an exploratory ROI analysis focusing on 16 bilateral STP/STS sub-regions to determine whether differential patterns of activation would be observed as a function of the type of information processed (i.e., syllabic or supra-syllabic). To preface the discussion, the results from the whole-brain analysis identified a network of regions involved in the perception of speech sounds that is consistent with previous neuroimaging studies that contrasted the processing of sublexical speech units to rest (Benson et al., 2001; Hugdahl et al., 2003; Wilson et al., 2004; Rimol et al., 2005; Wilson and Iacoboni, 2006). In addition, the results clearly demonstrate that the processing of auditory syllable sequences recruits both the left and right hemisphere, consistent with the notion that the processing of speech sounds is bilateral (Hickok and Poeppel, 2004, 2007; Hickok, 2009). The highly consistent results from the whole-brain and ROI analyses demonstrate that both syllabic and supra-syllabic information are processed during passive listening. The anatomical specificity afforded by the ROI analyses allowed us to go further in exploring the specific functional contribution of sub-regions within the STP and STS during the perception of speech sounds. The findings are discussed below.

**FIGURE 5 | Patterns of supra-syllabic complexity effects observed in exploratory STP and STS ROI analysis.** The results are mapped onto a flattened schematic representation of STP and STS showing the parcellation used in this study (different areas shown not to scale). Areas in dark blue represent a main effect of complexity and areas in lighter blue represent areas where an interaction was observed (hemisphere ∗ supra-syllabic complexity for the SFp and STSm). Legend: PP, planum polare; TTG, transverse temporal gyrus (m, medial; l, lateral); TTS, transverse temporal sulcus (m, medial; l, lateral); PT, planum temporale (a, anterior; m, middle; p, posterior); SF, caudal Sylvian fissure (a, anterior; p, posterior); STG, superior temporal gyrus (a, anterior; m, middle; p, posterior); STS, superior temporal sulcus (a, anterior; m, middle; p, posterior); <sup>∗</sup>significant contrast at pFWE = 0*.*05, Bonferroni corrected; n.s. non-significant contrast. Error bars represent standard error of the mean.

Legend: PP, planum polare; TTG, transverse temporal gyrus (m, medial; l, lateral); TTS, transverse temporal sulcus (m, medial; l, lateral); PT, planum temporale (a, anterior; m, middle; p, posterior); SF, caudal Sylvian fissure (a, anterior; p, posterior); STG, superior temporal gyrus (a, anterior; m, middle; p, posterior); STS, superior temporal sulcus (a, anterior; m, middle; p, posterior).

Results from the whole-brain analyses demonstrate widespread bilateral supratemporal activation resulting from the syllabic manipulation. The widespread extent of this activation was not expected based on previous fMRI results (McGettigan et al., 2011; Tremblay and Small, 2011). Of the few studies that have investigated the effect of consonant clusters during passive speech perception, in one study, activation within the right PT was scaled to syllabic complexity (Tremblay and Small, 2011) and in the other, no positive effect was reported (McGettigan et al., 2011). Our finding of widespread supratemporal effects may be related to the type of stimuli used. While in the present study we used meaningless sequences of syllables, Tremblay and Small (2011) used whole words, for which the mapping of sounds to linguistic representations may be more automatic, requiring fewer resources for the processing of syllabic information. However, if the processing of syllabic information interacts with lexical status, an effect of complexity should have been observed in the McGettigan et al. (2011) study given that pseudo-words were used, which are not overlearned stimuli with a stored lexical representation. It is possible that the absence of an effect of syllabic complexity in the latter study is attributable to a less salient experimental manipulation. In the present study, we contrasted sequences of syllables with either six or no consonant clusters, yielding a very robust effect. Although the use of a passive listening paradigm minimized attention-directed processes, mimicking more closely naturalistic speech perception situations, the use of syllables as experimental stimuli might have taxed phonological processes to a greater extent than the use of pseudo-words and words.
This line of reasoning is consistent with neuropsychological and neurophysiological evidence suggesting that language comprehension does not depend on the processing of sublexical units (i.e., units smaller than words, such as syllables, phonemes, and phonetic features). For instance, it has been shown that patients with good word-level auditory comprehension abilities can fail on syllable and phoneme discrimination tasks (Basso et al., 1977; Boatman et al., 1995). Similarly, electrocortical mapping studies have provided evidence that phonological processes (e.g., syllable discrimination) and auditory word comprehension processes are not entirely circumscribed to the same STP regions (for a review, refer to: Boatman, 2004). In sum, while syllabic complexity effects are observed in sequences of syllables, further research needs to determine whether and how syllabic information contributes to the perception of speech sounds and language comprehension.

Both whole-brain and exploratory ROI analyses identified a region that was sensitive to the presence or absence of consonant clusters: the lateral part of the primary auditory cortex (TTGl). In addition, the exploratory ROI analysis also identified the left SFa and PTm as regions sensitive to the syllabic manipulation. These results tentatively suggest that this effect stems from the addition of an extra consonant in the onset of the syllable and not from differences between adjacent syllables (i.e., two different syllables). This pattern of response is consistent with the hypothesis that these regions are sensitive to the structure of the syllable (i.e., whether it is phonologically complex or not). Whether these regions respond to the complexity of the syllabic structure in general or to a specific component of the syllable (i.e., onset, rhyme, nucleus, or coda), however, remains to be determined. Though the specific contribution of these three regions in the processing of syllabic information still awaits further specification, these three regions are nonetheless robustly activated during the perception of sublexical speech sounds (Benson et al., 2001; Hugdahl et al., 2003; Wilson et al., 2004; Rimol et al., 2005; Wilson and Iacoboni, 2006).

An alternative hypothesis that could explain the complexity effect related to the addition of a consonant to form a cluster is that these regions are responding to an increase in phonological working memory due to an increase in sequence length. This is because the addition of a consonant cluster to increase syllabic complexity also increases the length of the sequence. However, previous studies that have manipulated item length to investigate phonological working memory have reported mixed results that seem dependent upon (1) how length was modulated (CV-CCV vs. number of syllables), (2) the type of stimuli used (words, pseudowords), and (3) task demands (passive listening, judgment or naming task) (Okada et al., 2003; Strand et al., 2008; McGettigan et al., 2011). The most consistent finding is that stimulus length defined as the number of syllables yields more reliable results than the addition of consonant clusters. Moreover, if our syllabic manipulation results reflected an increase in phonological working memory, we would expect this contrast to yield clusters of activation within the pre-motor cortex, the IFG, and the IPL, that is, regions that are typically recruited during verbal working memory tasks (Paulesu et al., 1993; Honey et al., 2000; Marvel and Desmond, 2012). However, none of these regions was found in any of our contrasts.

Another alternative hypothesis is that the syllabic effect is due to an increase in acoustic/phonemic complexity. Indeed, consonant clusters are more complex than single consonants both acoustically and phonemically. Given that we parametrically varied both syllabic and supra-syllabic complexity, if this hypothesis were correct, we would expect the same regions to also exhibit an effect of supra-syllabic complexity since the presentation of three different syllables as opposed to the same syllable presented three times also increases acoustical complexity. In addition, we would also expect to see a syllabic complexity ∗ supra-syllabic complexity interaction driven by a syllabic complexity effect for both simple and complex sequences and a stronger effect of syllabic complexity for the complex sequences. This pattern of results was not found in the SFa, the TTGl, or the PTm. However, in the PTm, a region identical to the one reported by Tremblay and Small (2011), sensitivity to the syllabic manipulation was found only for the complex sequences. Combined with the observation that this region is involved in speech production (Dhanjal et al., 2008; Tourville et al., 2008; Peschke et al., 2009; Zheng et al., 2010) and that its activation magnitude varies as a function of syllabic complexity during both speech perception and production (Tremblay and Small, 2011), the result from the current study provides additional support to the hypothesis that the right PT is involved in converting external auditory input into a phonological representation. Our results are in agreement with this hypothesis because an effect of syllabic complexity only emerged in this region when the sequences were composed of three different syllables (i.e., high supra-syllabic complexity). In itself, the addition of a consonant cluster increases the complexity of the syllable template.
The additional complexity associated with processing three different sounds (high supra-syllabic complexity) enhances the syllabic manipulation, as three different consonant clusters have to be mapped onto phonological representations as opposed to three single consonants. In sum, the current results lend further support to the notion that regions within the posterior STP are important for the processing of phonological information, perhaps through a template-matching mechanism that uses spectrotemporal information to access stored syllabic representations (Griffiths and Warren, 2002; Warren et al., 2005).

Both the whole-brain and exploratory ROI analyses identified two regions, the STSm and STGa, that were sensitive only to the supra-syllabic manipulation. This pattern of response suggests that these regions are involved in tracking changes that affect the structure of the sequence. In the present study, after having heard the second syllable of a sequence, participants could determine whether they would hear the same syllable again (i.e., in the case of simple sequences) or a different syllable (i.e., in the case of complex sequences). Thus, after the second syllable, for simple sequences the continuation was completely deterministic and predictions about upcoming sounds could be made. This pattern of response is also consistent with results from studies that have investigated the perception of speech sounds using neural adaptation and oddball paradigms. In these studies, clusters of activation were observed within these regions in response to the presentation of a deviant stimulus (Vouloumanos et al., 2001; Joanisse et al., 2007). Overall, the results suggest that these regions are involved in representing sequences over time. Thus, speech perception mechanisms, even in the absence of a task, are sensitive to changes that affect the structural properties of auditory sequences, consistent with previous work (Tremblay et al., 2012).

Both whole-brain and exploratory ROI analyses also identified a group of regions that was sensitive to both manipulations. These regions included the STGm, the TTGm, the TTSl, the TTSm, the PTa, and the SFp. Sensitivity to both manipulations suggests that these regions do not differentiate between syllabic and supra-syllabic information. In a previous neuroimaging study using the same parcellation scheme of the STP, both the TTSl and PTa responded to speech and non-speech sounds, whereas the STGm, SFp, and TTGm exhibited an absolute preference for speech sounds (Tremblay et al., 2012), consistent with the idea that regions located anterior and lateral to the primary auditory cortex are involved in processing changes in spectro-temporal features (Scott and Johnsrude, 2003). These results suggest that both syllabic and supra-syllabic information recruit common mechanisms involved in processing acoustical information.

In the current study, we explored the neural mechanisms involved in the processing of syllabic and supra-syllabic information during passive speech perception. We demonstrated that both syllabic and supra-syllabic information are processed automatically during passive speech listening, a result that is consistent with the finding of distinct neural representations for syllable and sequence-level information during speech production (Bohland and Guenther, 2006; Peeva et al., 2010). Importantly, these findings suggest that processing of sublexical information is automatic, at least during the processing of meaningless syllable sequences. Future studies need to examine whether the processing of sublexical information is automatic and necessary during language comprehension using more naturalistic stimuli such as words or connected speech. It is possible that the recruitment of phonological mechanisms depends upon the context, or the kind or quality of auditory stimuli being processed. Degraded speech stimuli, for instance, could recruit sublexical phonological mechanisms to a greater extent than high-quality speech sounds. Nevertheless, the present study offers new insight into the functional neuroanatomy of the system involved in sublexical phonological processing, highlighting the importance of the anterior two-thirds of the PT, the primary auditory cortices and the middle part of the STS and STG in these processes.

#### **ACKNOWLEDGMENTS**

This work was supported by research grants from the *Natural Sciences and Engineering Research Council* of Canada (*NSERC*, grant #1958126) and from the Fonds de la Recherche du Québec—Société et Culture (FRQSC, grant #169309) to Pascale Tremblay. Support from the Centre de Recherche de l'Institut Universitaire en santé mentale de Québec is also gratefully acknowledged. Technical support was provided by the "Consortium d'imagerie en neuroscience et santé mentale de Québec" (CINQ) for protocol development and MRI data acquisition.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 26 February 2014; accepted: 17 June 2014; published online: 08 July 2014. Citation: Deschamps I and Tremblay P (2014) Sequencing at the syllabic and suprasyllabic levels during speech perception: an fMRI study. Front. Hum. Neurosci. 8:492. doi: 10.3389/fnhum.2014.00492*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Deschamps and Tremblay. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Roles of frontal and temporal regions in reinterpreting semantically ambiguous sentences

#### *Sylvia Vitello<sup>1</sup>\*, Jane E. Warren<sup>1,2</sup>, Joseph T. Devlin<sup>1</sup> and Jennifer M. Rodd<sup>1</sup>*

*<sup>1</sup> Department of Experimental Psychology, University College London, London, UK*

*<sup>2</sup> Department of Language and Communication, University College London, London, UK*

#### *Edited by:*

*Patti Adank, University College London, UK*

#### *Reviewed by:*

*Stefanie E. Kuchinsky, University of Maryland, USA*

*Glyn Paul Hallam, The University of York, UK*

#### *\*Correspondence:*

*Sylvia Vitello, Department of Experimental Psychology, University College London, Gower Street, London, WC1E 6BT, UK e-mail: s.vitello@ucl.ac.uk*

Semantic ambiguity resolution is an essential and frequent part of speech comprehension because many words map onto multiple meanings (e.g., "bark," "bank"). Neuroimaging research highlights the importance of the left inferior frontal gyrus (LIFG) and the left posterior temporal cortex in this process but the roles they serve in ambiguity resolution are uncertain. One possibility is that both regions are engaged in the process of semantic reinterpretation that follows incorrect interpretation of an ambiguous word. Here we used fMRI to investigate this hypothesis. Twenty native British English monolinguals were scanned whilst listening to sentences that contained an ambiguous word. To induce semantic reinterpretation, the disambiguating information was presented after the ambiguous word and delayed until the end of the sentence (e.g., "the teacher explained that the BARK was going to be very damp"). These sentences were compared to well-matched unambiguous sentences. Supporting the reinterpretation hypothesis, these ambiguous sentences produced more activation in both the LIFG and the left posterior inferior temporal cortex. Importantly, all but one subject showed ambiguity-related peaks within both regions, demonstrating that the group-level results were driven by high inter-subject consistency. Further support came from the finding that activation in both regions was modulated by meaning dominance. Specifically, sentences containing biased ambiguous words, which have one more dominant meaning, produced greater activation than those with balanced ambiguous words, which have two equally frequent meanings. Because the context always supported the less frequent meaning, the biased words require reinterpretation more often than balanced words. This is the first evidence of dominance effects in the spoken modality and provides strong support for the view that frontal and temporal regions support the updating of semantic representations during speech comprehension.

**Keywords: fMRI, LIFG, lexical ambiguity, speech comprehension, semantics, reinterpretation, sentence processing**

## **INTRODUCTION**

Many of the words encountered in everyday language have multiple meanings, which makes the process of mapping word form onto meaning often ambiguous. This means that listeners must routinely combine various kinds of contextual information to understand the meaning that is intended by the speaker. For example, to understand the sentence "the woman used a microphone to make the toast," listeners must use the word "microphone" to understand that the semantically ambiguous word "toast" refers to a celebratory speech rather than grilled bread. Importantly, such ambiguity is often not noticed by listeners (Rodd et al., 2005), suggesting that disambiguation is generally a highly efficient and effective process. An understanding of the neural substrates supporting this process is essential in order to gain insight into the efficiency of language comprehension and because the breakdown of this process can lead to severe communication difficulties due to the prominence of ambiguous words in everyday language (Parks et al., 1998; Rodd et al., 2002).

Cumulative evidence from recent neuroimaging studies has highlighted the importance of two brain areas for semantic ambiguity resolution: the left inferior frontal gyrus (LIFG) and the left posterior temporal cortex (Rodd et al., 2005, 2010b, 2012; Davis et al., 2007; Mason and Just, 2007; Zempleni et al., 2007; Bekinschtein et al., 2011). However, the relative contributions of these regions to ambiguity processing are uncertain. Psycholinguistic research converges on several cognitive processes that underpin semantic ambiguity resolution: accessing the alternative meanings of an ambiguous word, selecting a single meaning, and reinterpreting that meaning when an incorrect selection is initially made (e.g., Duffy et al., 1988; Gernsbacher, 1991; Simpson, 1994; Twilley and Dixon, 2000; Duffy et al., 2001; Rodd et al., 2010a).

One hypothesis of the contribution of LIFG and posterior temporal cortex to ambiguity resolution is that both regions play an important role in reinterpretation processes (Zempleni et al., 2007; Rodd et al., 2010a,b; Bekinschtein et al., 2011). Semantic reinterpretation occurs when listeners encounter context that is not consistent with their initial understanding of the ambiguous word, requiring them to suppress the initially-selected meaning and integrate the alternative, contextually-appropriate interpretation (e.g., Duffy et al., 1988; Twilley and Dixon, 2000; Rodd et al., 2010a). Evidence for the reinterpretation hypothesis comes from the finding that activation in these frontal and temporal regions is greater for sentences with a higher likelihood of reinterpretation. For example, several functional MRI (fMRI) studies have shown increased activation in these regions for sentences in which the disambiguating information is delayed until after the ambiguous word (e.g., "The teacher explained that the BARK was going to be very damp") compared to unambiguous sentences (Mason and Just, 2007; Zempleni et al., 2007; Rodd et al., 2010b, 2012; Bekinschtein et al., 2011). These are known as late-disambiguation sentences. Delaying the disambiguating information makes it impossible for listeners to determine the intended (e.g., tree) meaning of the ambiguous word when it is initially encountered. Thus, listeners will initially misinterpret the ambiguous word on some occasions (i.e., first selecting the dog meaning of "bark") and need to revise their understanding when they encounter the disambiguating information later on in the sentence (i.e., to the tree meaning).
This process of initial meaning selection followed by reinterpretation is assumed in many influential cognitive models of semantic ambiguity resolution (Swinney, 1979; Twilley and Dixon, 2000; Duffy et al., 2001), on the basis of numerous cross-modal priming studies and eye-movement research which show that listeners and readers select a meaning within a few hundred milliseconds of encountering an ambiguous word (e.g., Swinney, 1979; Seidenberg et al., 1982; Rayner and Duffy, 1986; Duffy et al., 1988). Various psycholinguistic studies, including eye-movement and dual-task research, also provide converging evidence that reinterpretation occurs for late-disambiguation sentences by showing that listeners and readers incur greater behavioral costs of processing the disambiguating regions in these sentences (e.g., longer reading times or poorer performance on an unrelated concurrent task) compared to processing equivalent regions in early-disambiguation sentences (e.g., "The hunter thought that the HARE in the field was actually a rabbit") or unambiguous sentences (Rayner and Duffy, 1986; Duffy et al., 1988, 2001; Rodd et al., 2010a). Zempleni et al. (2007) provide more direct support that frontal and temporal regions support such reinterpretation processes by showing that activation in the LIFG and posterior middle/inferior temporal gyrus was modulated by meaning dominance, that is, how frequent the intended meaning is relative to the other meanings. Specifically, activation in these regions was greater for late-disambiguation sentences that corresponded to the subordinate (i.e., less frequent) meaning than the dominant meaning (Zempleni et al., 2007). Reinterpretation is more likely in subordinate-biased sentences because people will typically select the dominant meaning in the absence of prior biasing context (Rayner and Duffy, 1986; Duffy et al., 1988; Simpson and Krueger, 1991).

Rodd et al. (2012) further proposed that the LIFG, in particular, may also be important for the initial selection of an ambiguous word's meaning that occurs when the word is initially encountered during a sentence (Twilley and Dixon, 2000). This suggestion was based on the finding that the LIFG, but not the posterior temporal cortex, was also more active for sentences in which reinterpretation was unlikely (compared to unambiguous sentences). These were sentences in which the disambiguating information *preceded* the ambiguous word. In addition, this region also showed activation that was temporally associated with the ambiguous word as well as the disambiguating information in late-disambiguation sentences. Thus, these results suggested that the LIFG may be involved in multiple processes of ambiguity resolution, not only when a meaning needs to be reinterpreted.

Supporting evidence for the involvement of the LIFG and posterior temporal cortex in reinterpretation and/or initial meaning selection, however, is not conclusive on various levels. First, the functional contributions of these regions to these two ambiguity-related processes are uncertain because not all studies have found the same response pattern to the different types of ambiguous sentences that load on these processes (Mason and Just, 2007; Bekinschtein et al., 2011). Second, different methods for examining neural responses to reinterpretation and/or initial meaning selection have been used for written sentences compared to spoken sentences. For example, studies have assessed how meaning dominance modulates ambiguity-related neural responses but these have only been conducted on visually-presented late-disambiguation sentences (Mason and Just, 2007; Zempleni et al., 2007). It is important to examine whether such dominance patterns found in the visual modality also replicate for spoken sentences in order to understand whether these ambiguity-related responses generalize across modalities. Although many of the regions reported in semantic ambiguity studies are considered modality-general (Binder et al., 2009; Novick et al., 2009; Price, 2012), it is possible that speech places different demands on ambiguity processes (particularly working memory aspects) due to the transient, fast-fading nature of the speech signal and, thus, may place different demands on the underlying neural circuitry. Third, the precise nature of the LIFG and posterior temporal cortex's involvement in ambiguity processes is also uncertain because there is considerable anatomical variability in the locus and extent of the ambiguity responses in these regions reported across studies (Rodd et al., 2005, 2010b, 2012; Mason and Just, 2007; Zempleni et al., 2007; Bekinschtein et al., 2011).
As these anatomical differences relate to different anatomical regions that have been associated with different functions (see Price, 2012, for a recent review), it is important to explore the potential sources of this variability. It is possible that such variability reflects effects of statistical thresholds, differences in ambiguous stimuli or experimental protocols or even inter-subject functional variability given the finding of looser function-anatomy mappings for high-level cognitive processes (Duncan et al., 2009; Tahmasebi et al., 2012).

Furthermore, it is unclear how these ambiguity-responsive regions relate to those associated with sentence comprehension more generally. Do semantically ambiguous words place additional demands on regions that are already involved in the processing of sentences in general or do they engage regions that are more specific to semantically demanding stimuli? Neural models of language comprehension give different answers to this question. For example, Hagoort's unification account of LIFG function argues for the former, imputing a sentence-general function to this region (Hagoort, 2005, 2013), while Novick and colleagues' conflict resolution account argues for the latter (Novick et al., 2005, 2009). Such differences in perspective are also found across theories of the posterior temporal cortex's function in language processing (e.g., Hickok and Poeppel, 2007; Jefferies, 2013).

In summary, the current literature raises several questions regarding the involvement of the LIFG and posterior temporal cortex in semantic ambiguity resolution. What functional roles do these regions play in ambiguity resolution and sentence comprehension more generally? Which specific anatomical sub-fields within these regions are engaged by semantic ambiguity? How consistent is this ambiguity network across individuals? These questions were investigated using fMRI. Neural responses to a large set of late-disambiguation sentences were compared with those to well-matched unambiguous sentences. Based on previous research, it was predicted that ambiguity-elevated responses would be broadly found in the LIFG and the left posterior temporal cortex (Rodd et al., 2005, 2010b, 2012; Davis et al., 2007; Mason and Just, 2007; Zempleni et al., 2007; Bekinschtein et al., 2011). The areas showing a significant ambiguity effect were then investigated to answer three specific questions pertaining to function and inter-subject variability:


than biased sentences because this process is more difficult since listeners have less strong preferences for one particular meaning of these words (Duffy et al., 1988, 2001; Twilley and Dixon, 2000). If regions are involved in both processes (relatively equally), then they may show equivalently strong activation to both balanced and biased sentences, since they both load on (at least) one of these processes. In addition, two types of biased sentences were compared: strongly subordinate and weakly subordinate words. This comparison allows us to examine whether responses are merely related to the likelihood of reinterpretation, where the dominance pattern would be: "strongly biased" > "weakly biased" > "balanced," or whether a less linear relationship exists between reinterpretation and ambiguity-responses. For example, a region may be especially engaged when very infrequent meanings need to be integrated, which would produce a pattern of: "strongly biased" > "weakly biased" = "balanced". See **Table 1** for example sentences in each of the ambiguous and unambiguous conditions.

3. Finally, how consistent are these neuronal ambiguity-effects across individuals? Inter-subject variability of regions showing an ambiguity effect was assessed by examining whether the regions that showed reliable activation at the group-level were activated in all subjects.
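The two dominance patterns contrasted in the second question can be made concrete as sets of weights over the three conditions. The sketch below is purely illustrative: the weight vectors and the helper names `pattern_score` and `is_valid_contrast` are assumptions introduced here, not part of the reported analysis.

```python
# Illustrative sketch only: the text describes the two patterns verbally.
# The weight vectors below are an ASSUMED way of writing them down.
CONDITIONS = ("strongly biased", "weakly biased", "balanced")

# Graded pattern: strongly biased > weakly biased > balanced
GRADED = (1.0, 0.0, -1.0)
# Step pattern: strongly biased > weakly biased = balanced
STEP = (2.0, -1.0, -1.0)


def is_valid_contrast(weights, tol=1e-9):
    """A contrast between conditions should have weights that sum to zero."""
    return abs(sum(weights)) < tol


def pattern_score(condition_means, weights):
    """Weighted sum of condition means.

    A positive value means the means differ in the direction the
    pattern predicts; it is not a significance test.
    """
    if len(condition_means) != len(weights):
        raise ValueError("one weight per condition is required")
    return sum(m * w for m, w in zip(condition_means, weights))
```

For hypothetical condition means of (3, 2, 1) the graded weights give a score of 2, whereas means of (3, 1, 1) give a score of 4 under the step weights.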

## **METHODS**

#### **PARTICIPANTS**

Twenty native monolingual British English speakers (11 female), aged 18–35 (*M* = 23.8) participated in the study. All were right-handed, had normal or corrected-to-normal vision and had no known hearing or language impairment. Participants were recruited from the UCL experimental subject pool and were paid for their participation. All gave informed consent and appropriate ethical approval was obtained from the UCL Departmental Ethics committee.

#### **STIMULI**

Ninety-two ambiguous auditory sentences were created based on items from Rodd et al. (2010a). They were all late-disambiguation sentences where the disambiguating context was presented after a semantically ambiguous noun. For example, in the sentence "the woman had to make the *toast* with a very old *microphone*," "toast" is the ambiguous word (i.e., grilled bread vs. celebratory speech) and "microphone" is the disambiguating word. On average, the ambiguous words were presented 6.70 (*SD* = 1.00) words into the sentence and were a mixture of homographic and heterographic nouns (e.g., bark, night/knight). The disambiguation was always provided by the sentence-final word, except in four sentences where it was the last two words. There were at least 4 words between the ambiguous and disambiguating words (*M* = 5.79, *SD* = 1.46) to give listeners enough time to select their preferred meaning before they hear the disambiguating information (e.g., Swinney, 1979; Seidenberg et al., 1984; Rodd et al., 2010a). To ensure that the rest of the words were neutral, the sentences were created such that only the disambiguating word needed to be changed to instigate the alternative meaning. For example, in the above example, "microphone" could be replaced by "grill." These alternative versions were not employed in the experiment.

#### **Table 1 | Sentence conditions.**

*In each example, the ambiguous word is capitalized and the disambiguating word is underlined.*

To elicit semantic reinterpretation, the disambiguating words were chosen to correspond to the less frequent meaning of the ambiguous word because psycholinguistic research demonstrates that the dominant (and in this case, incorrect) meaning is usually initially selected prior to disambiguating information (Duffy et al., 1988; Simpson and Krueger, 1991; Twilley and Dixon, 2000). The subordinate meaning was based on pre-test scores obtained by Rodd et al. (2010a). To validate these preferences, an independent group of 59 participants performed an extended version of the standard word association task typically used to measure meaning preferences (Twilley et al., 1994). Each ambiguous word was presented in isolation (e.g., fan) and participants typed the first related word that came to mind (e.g., wind, follower, cool). Because some responses could relate to more than one meaning (e.g., cool), after all the isolated words had been presented, the participants selected a definition of their intended meaning (e.g., admirer vs. ventilation device). This ensured that equivocal responses could be coded accurately (for further details see Vitello, 2014). A dominance score was subsequently calculated as the proportion of codable responses that were consistent with the meaning used in the experimental sentence (minimum 31 data-points per item). As expected, most words had low dominance scores (*M* = 0.25, *SD* = 0.20) indicating that the meaning used in the experimental sentences was the less preferred, infrequent meaning for the majority of items.
These scores spanned the four main categories of meaning dominance reported in the psycholinguistic ambiguity literature (Rayner and Duffy, 1986; Duffy et al., 1988; Sereno, 1995; Vuong and Martin, 2011): (1) 32 words were strongly subordinate-biased, where the meaning used in the experimental sentences was very infrequent, on average preferred by only 6% of listeners (dominance range: 0–0.14); (2) 27 words were weakly subordinate-biased, where the sentence meaning was fairly infrequent, on average preferred by 21% of listeners (dominance range: 0.16–0.30); (3) 27 words were balanced, where the sentence meaning was one of two (or more) relatively equally frequent meanings, on average preferred by 39% of listeners (dominance range: 0.31–0.54); (4) the remaining six sentences had high dominance scores, where the sentence meaning was, on average, preferred by 77% of listeners (dominance range: 0.65–0.84). The range of "balanced" scores coheres with studies in which the less likely meaning of the balanced words was chosen (Rayner and Duffy, 1986; Sereno, 1995; Vuong and Martin, 2011). One-way independent-measures ANOVAs showed that the strongly-biased, weakly-biased and balanced conditions were matched on sentence-level properties (duration in seconds, number of syllables, number of words, position of the ambiguous word, position of the disambiguating word, naturalness rating; all *p*s > 0.2) as well as on lexical properties of the ambiguous word [log-transformed frequency, number of letters, number of meanings and number of senses, where "meanings" refers to semantically and etymologically unrelated meanings (e.g., bark) and "word senses" are semantically related (e.g., run), Rodd et al., 2002; all *p*s > 0.09].
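The dominance score described above is a simple proportion, and a minimal sketch may help make the coding step concrete. The function names are hypothetical, and the category cut-offs (0.15, 0.305, 0.595) are assumptions chosen as midpoints between the observed ranges reported in the text; the study reports ranges, not explicit thresholds.

```python
def dominance_score(coded_responses, sentence_meaning):
    """Proportion of codable responses matching the meaning used in the sentence.

    coded_responses: one meaning label per participant, or None when the
    response could not be coded.
    """
    codable = [r for r in coded_responses if r is not None]
    if not codable:
        raise ValueError("no codable responses for this item")
    matches = sum(1 for r in codable if r == sentence_meaning)
    return matches / len(codable)


def dominance_category(score):
    """Map a score to a dominance category.

    The cut-offs are ASSUMED midpoints between the reported ranges
    (0-0.14, 0.16-0.30, 0.31-0.54, 0.65-0.84), not values from the study.
    """
    if score < 0.15:
        return "strongly subordinate-biased"
    if score < 0.305:
        return "weakly subordinate-biased"
    if score < 0.595:
        return "balanced"
    return "dominant"
```

With invented responses for "fan" (30 coded as "ventilation device", 10 as "admirer", 2 uncodable), a sentence using the "admirer" meaning would score 10/40 = 0.25 and fall in the weakly subordinate-biased range.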

Each ambiguous sentence was paired with a well-matched unambiguous sentence of similar syntactic structure that had a low-ambiguity noun in the position of the ambiguous word. For example, "the student had to wrap the *wrist* with a very old bandage." Statistical tests confirmed that the ambiguous words had significantly more meanings and senses than the unambiguous words [*t*(91) = 8.14, *p* < 0.001; *t*(91) = 8.31, *p* < 0.001, respectively] (Online Wordsmyth English Dictionary-Thesaurus, Parks et al., 1998). They, however, did not differ significantly on other lexical properties, including log-transformed word frequency (CELEX lexical database, Baayen et al., 1995), and number of letters (all *p*s > 0.1). See **Table 2** for corresponding descriptive statistics. The sets of ambiguous and unambiguous sentences were additionally matched on physical duration, number of syllables and number of words (all *p*s > 0.1). On average, both sets were also judged as highly natural, although statistically, the ambiguous sentences had lower naturalness ratings when rated on a 1 (highly unnatural) to 7 (highly natural) point scale by an independent group of 15 participants [*t*(91) = 3.98, *p* < 0.001]. See **Table 2** for sentence-level descriptive statistics. All sentences were spoken by the same female speaker (JMR).
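The reported degrees of freedom (91) are what a paired comparison across the 92 matched sentence pairs would give. As a minimal sketch under that assumption, the paired *t* statistic can be computed from the item-wise differences; the function name `paired_t` is hypothetical.

```python
import math


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for two matched lists.

    Assumes item i in `first` is matched with item i in `second`,
    as with each ambiguous sentence and its unambiguous counterpart.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two matched lists with at least two items")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Unbiased sample variance of the differences
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("differences have zero variance")
    t_value = mean_diff / math.sqrt(variance / n)
    return t_value, n - 1
```

The *p* value would then be read from the *t* distribution with the returned degrees of freedom.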

Additionally, 46 filler sentences (50% ambiguous) were employed with the same structure as the experimental sentences. Fourteen were used in an initial practice block, 24 were catch sentences and the remaining 8 constituted dummy trials at the beginning of the fMRI runs. Catch sentences were followed by a visually presented probe word, and participants had to decide whether it was related or unrelated to the sentence. The aim of these catch trials was to ensure that attention was paid to the meaning of the sentences. Thus, for each catch sentence, a probe word was selected that was either clearly semantically related (50%) or clearly semantically unrelated (50%) to the sentence's meaning. Finally, to create a low-level auditory baseline condition, 32 experimental sentences were randomly selected and converted to signal-correlated noise (SCN) using Praat software (http://www.praat.org). Conversion to SCN involved replacing all the spectral detail with noise, rendering sentences unintelligible whilst maintaining low-level acoustic properties by retaining the original spectral and amplitude profiles. SCN was chosen as the baseline condition to be able to directly compare these results with those of previous fMRI studies on ambiguity (Rodd et al., 2005, 2012; Bekinschtein et al., 2011). An additional two sentences were selected and converted to SCN for the practice block.
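The SCN conversion described above can be illustrated with a short sketch. This is not the Praat procedure used in the study, whose details are not given here; it is one possible construction, assuming that the long-term spectrum is kept by randomising phases and that the amplitude profile is a smoothed, rectified version of the waveform. The function name and the `envelope_cutoff_hz` parameter are assumptions.

```python
import numpy as np


def signal_correlated_noise(signal, sample_rate, envelope_cutoff_hz=30.0, seed=0):
    """Return noise with roughly the same long-term spectrum and envelope.

    ASSUMED construction, for illustration only:
    1. keep the magnitude spectrum, randomise the phases (removes detail);
    2. re-impose the smoothed amplitude envelope of the original;
    3. match the overall RMS level.
    """
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)

    # Step 1: spectrally shaped noise via phase randomisation
    spectrum = np.fft.rfft(signal)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    shaped = np.fft.irfft(np.abs(spectrum) * np.exp(1j * phases), n=signal.size)

    # Step 2: moving-average envelopes of the rectified waveforms
    window = max(1, int(sample_rate / envelope_cutoff_hz))
    kernel = np.ones(window) / window
    target_env = np.convolve(np.abs(signal), kernel, mode="same")
    noise_env = np.convolve(np.abs(shaped), kernel, mode="same")
    scn = shaped * target_env / np.maximum(noise_env, 1e-12)

    # Step 3: match RMS so the loudness is comparable
    scn *= np.sqrt(np.mean(signal ** 2) / np.mean(scn ** 2))
    return scn
```

The result has the same length and RMS level as the input but carries no intelligible fine structure, which is the property the baseline condition relies on.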

The auditory stimuli were delivered over Sensimetrics insert earphones (http://www.sens.com/s14/) in the scanner. EQ filtering software (Sensimetrics, Malden, MA, USA) was used to filter all sound files to ensure accurate frequency reproduction.

**Table 2 | Descriptive statistics [mean (*SD*)] for properties of the ambiguous and unambiguous target words and sentences.**

#### **DESIGN AND PROCEDURE**

An event-related, within-subject design was employed in which participants were presented with all types of sentence trials (ambiguous, unambiguous, SCN and catch sentences) as well as silent (rest) trials. The rest trials were included as another baseline condition, having, on average, the same physical duration as the sentence trials (mean = 3 s; range: 2–4 s). The experiment was divided into four sessions, each with 70 trials: 23 ambiguous; 23 unambiguous; 8 SCN and 8 rest trials as well as two dummy trials to allow for T1 equilibrium before the test trials began. The stimuli were pseudo-randomized so that each run had an equal number of each stimulus type and no ambiguous sentence was placed in the same session as its matched unambiguous sentence in order to avoid potential syntactic priming effects. Each session lasted, on average, 8.47 min. The order of the sessions was counterbalanced across participants.
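The session constraint described above (no ambiguous sentence in the same session as its matched unambiguous sentence) can be met in several ways. The sketch below shows one of them, assuming a simple rotation by one session; the study states only the constraint, not the procedure. It also checks the trial counts: the listed types sum to 64 of the 70 trials per session, and the remaining six would match the 24 catch sentences spread evenly over four sessions, although the text does not state this explicitly.

```python
import random

# Trial counts per session as listed in the text
LISTED_PER_SESSION = {"ambiguous": 23, "unambiguous": 23, "scn": 8, "rest": 8, "dummy": 2}
TRIALS_PER_SESSION = 70


def unlisted_trials_per_session():
    """Trials per session not covered by the listed types (presumably catch trials)."""
    return TRIALS_PER_SESSION - sum(LISTED_PER_SESSION.values())


def assign_sessions(n_pairs=92, n_sessions=4, seed=0):
    """Assign each sentence pair to sessions so that partners never share one.

    ASSUMED scheme: shuffle the pairs, split the ambiguous sentences into
    equal blocks, and place each unambiguous partner in the next session.
    Returns two dicts mapping pair index -> session index.
    """
    if n_sessions < 2 or n_pairs % n_sessions:
        raise ValueError("pairs must divide evenly over at least two sessions")
    rng = random.Random(seed)
    order = list(range(n_pairs))
    rng.shuffle(order)
    per_session = n_pairs // n_sessions
    ambiguous, unambiguous = {}, {}
    for position, pair in enumerate(order):
        session = position // per_session
        ambiguous[pair] = session
        unambiguous[pair] = (session + 1) % n_sessions
    return ambiguous, unambiguous
```

Each session then holds 23 ambiguous and 23 unambiguous sentences, and the order of trials within a session would still need to be randomised separately.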

Each trial commenced with a white fixation cross in the center of a black screen. After 1000 ms, an auditory sentence stimulus or rest trial was presented. Then, for all trials, except catch trials, a silent period of 1500 ms occurred, followed by a jittered inter-trial interval (ITI) of 1000–3000 ms. For catch trials, a silent period of 500 ms followed the sentence offset, then the fixation cross was replaced by a probe word that was presented for 1000 ms on the screen (36 pt bold Helvetica font). Participants indicated whether the probe was related or unrelated to the sentence they just heard by pressing a button with the right index or middle finger. Response button order was counterbalanced across subjects. To discourage participants from actively waiting for a probe to appear and ensure attention to each sentence, we emphasized that responding to the probes would be straightforward if they listened carefully to each sentence. Participants practiced the task inside the scanner before the experimental blocks. The practice block contained a higher proportion of catch trials than the experimental blocks so that participants could familiarize themselves with the task. A jittered ITI also followed the catch-trial sentences but this ranged from 2000 to 3000 ms to allow participants at least 3000 ms from probe onset to respond and prepare for the next trial.
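The trial structure above can be laid out as a sequence of timed events, which makes the 3000 ms response window on catch trials easy to verify (1000 ms probe plus at least 2000 ms ITI). This is a minimal sketch; the function names are hypothetical and the jitter is assumed to be uniform in whole milliseconds, which the text does not specify.

```python
import random


def trial_events(stimulus_ms, is_catch, rng):
    """Return the ordered (event, duration in ms) pairs for one trial.

    Durations follow the procedure described in the text; the uniform
    integer jitter is an ASSUMPTION.
    """
    events = [("fixation", 1000), ("stimulus", stimulus_ms)]
    if is_catch:
        events += [
            ("silence", 500),
            ("probe", 1000),
            ("iti", rng.randint(2000, 3000)),
        ]
    else:
        events += [("silence", 1500), ("iti", rng.randint(1000, 3000))]
    return events


def response_window_ms(events):
    """Time from probe onset to the end of the trial (catch trials only)."""
    names = [name for name, _ in events]
    start = names.index("probe")
    return sum(duration for _, duration in events[start:])
```

For a 3000 ms sentence, a non-catch trial built this way lasts between 6500 and 8500 ms, and every catch trial leaves at least 3000 ms from probe onset.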

All stimuli were presented using MATLAB (Mathworks Inc.) and COGENT 2000 toolbox (www.vislab.ucl.ac.uk/cogent/index.html). The visual stimuli were projected onto a screen and viewed via mirrors mounted on the head coil. The auditory stimuli were delivered via MRI-compatible insert earphones (Sensimetrics, Malden, MA, USA, Model S-14), which provided a 20–40 dB attenuation level. Participants wore another set of ear protectors over the insert earphones to provide additional attenuation of the scanner noise. The experimenter checked that participants could hear the sentences clearly over the noise of the functional EPI sequence prior to the experimental scanning blocks by conducting a practice run in the scanner.

#### **MRI ACQUISITION**

Participants were scanned at the Birkbeck-UCL Centre for Neuroimaging (BUCNI) on a Siemens Avanto 1.5 T scanner. Whole-brain functional images were acquired with a gradient-echo EPI sequence (*TR* = 3000 ms; *TE* = 50 ms; 3 × 3 × 3 mm resolution). Each run consisted of 180 volumes. In addition, a high-resolution anatomical scan was acquired (T1-weighted FLASH, *TR* = 12 ms; *TE* = 5.6 ms; 1 mm<sup>3</sup> resolution) for anatomical localization purposes.

#### **fMRI DATA ANALYSIS**

The functional images were preprocessed and analyzed using Statistical Parametric Mapping software (SPM8, Wellcome Department of Cognitive Neurology, London, UK). Preprocessing involved realignment, spatial normalization and smoothing (8 mm FWHM Gaussian kernel) of the functional images. Entire datasets from three participants were removed because of excessive translational head motion (*>*3 mm) in at least three of the four scanning sessions. In a further participant, a single scanning session was excluded due to excessive head motion. Finally, for two participants the final five and seven volumes of one run were excluded due to motion. Spatial normalization combined an initial affine component with subsequent non-linear warping (Friston et al., 1995) to best match the Montreal Neurological Institute's MNI-152 template. The resulting images retained their original resolution (3 × 3 × 3 mm). Two analyses were conducted with separate general linear models. The first model combined all ambiguous sentences into a single condition regardless of ambiguous word dominance so that parameter estimates of the overall ambiguity effect would not be biased by the differences in sample sizes between the dominance conditions. At the first level, three experimental conditions (ambiguous, unambiguous and SCN) and one "dummy" condition that included the dummy sentences and catch-trials were modeled separately. For each trial, the onset of the sentence/SCN and its duration were specified. For the catch-trials 1.5s was added to the duration to incorporate the presentation of the visual word. Realignment parameters and temporal and dispersion derivatives were included as additional regressors to help model structured noise in the data. The derivatives, in particular, helped accommodate variability in the onset and duration of neural responses to the ambiguous sentences. At the group-level, random effects analyses were employed for two contrasts: "Unambiguous vs. SCN" and "Ambiguous vs. 
Unambiguous." The former was conducted first to identify the general language network that is engaged by normal, low-ambiguity speech. The latter identified the more specific ambiguity-elevated network. For each contrast, the corresponding parameter estimates for each subject were entered into the group-level analysis, where one-sample *t*-tests were computed. Activations were considered significant if they reached a threshold of *p* < 0.05 FWE corrected at the voxel level (Worsley et al., 1996).
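As an illustration of the group-level step described above, the following minimal sketch computes a one-sample *t* statistic from per-subject contrast estimates at a single voxel. It is not the SPM8 implementation: the function name `one_sample_t` and the input values are hypothetical, and the voxel-level FWE correction is not reproduced here.

```python
import math
from statistics import mean, stdev


def one_sample_t(values, popmean=0.0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test.

    `values` holds one contrast estimate per subject at a single voxel.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two subjects are needed")
    standard_error = stdev(values) / math.sqrt(n)
    return (mean(values) - popmean) / standard_error, n - 1


# Hypothetical contrast estimates (made-up numbers, not data from the study).
contrast_estimates = [0.8, 1.2, 0.5, 1.0, 0.9, 1.4, 0.3, 1.1]
t_stat, dof = one_sample_t(contrast_estimates)
print(f"t({dof}) = {t_stat:.2f}")
```

In a whole-brain analysis this statistic would be computed at every voxel and then thresholded with a correction for multiple comparisons.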

The second analysis was identical to the first except that ambiguous sentence trials were modeled as separate dominance conditions. To achieve this, the first-level analysis model included four separate regressors corresponding to the four dominance conditions (strongly biased, weakly biased, balanced, and dominant). For each subject, parameter estimates were obtained for three contrasts: "Strongly biased > Unambiguous," "Weakly biased > Unambiguous" and "Balanced > Unambiguous." The dominant condition was not analyzed further due to the small number of trials in this condition. At the group level, contrast images from these comparisons were entered into a one-way repeated-measures ANOVA to assess effects of dominance across the whole brain and were also employed in region-of-interest (ROI) analyses, described in more detail in the Results section.

Participants' structural images were normalized to the T1 template and a group mean structural image was created for data display purposes.

#### **RESULTS**

#### **BEHAVIORAL RESULTS**

On the catch trials participants achieved a mean accuracy of 92% (range = 79–100%), with a mean reaction time of 1328 ms (*SD* = 345), indicating that all participants were paying attention to the meaning of the sentences.

#### **UNAMBIGUOUS SENTENCES vs. SCN**

The contrast between unambiguous sentences and the low-level baseline condition showed a significant broad cluster of activation in the left hemisphere centered laterally on the mid-superior temporal sulcus (STS), extending along the length of STS and superiorly to the anterior superior temporal gyrus (STG) (see **Figure 1** and **Table 3**). At a lower threshold (*p* < 0.001 uncorrected), the left anterior temporal activation spread inferiorly into anterior middle temporal cortex. In the right hemisphere, there was a smaller significant cluster of activation centered in mid STG that extended anteriorly into STG and STS. At the lower threshold (*p* < 0.001 uncorrected) it also extended posteriorly and inferiorly into right STS. There was also significant activation in the left dorsolateral precentral gyrus. The LIFG showed activation when the threshold was lowered to *p* < 0.001 uncorrected, specifically within dorsal pars opercularis (peak coordinate [−54, 17, 19]; *z*-score = 3.58). For completeness, the results of the Ambiguous > SCN contrast are presented in the supplementary materials (see Figure S1 and Table S1).

#### **AMBIGUOUS vs. UNAMBIGUOUS SENTENCES**

Two clusters in the left hemisphere showed significantly greater activation for ambiguous than unambiguous sentences (see **Figure 2A** and **Table 4**). One cluster was located in the LIFG, centered in pars triangularis. Note that this cluster does not overlap with the pars opercularis cluster that showed greater activation for the unambiguous sentences reported in the previous contrast. At a lower threshold (*p* < 0.001 uncorrected), the activation spread predominantly posteriorly through pars opercularis, thereafter extending primarily dorsally in middle frontal/precentral gyrus. The second cluster was located in the posterior inferotemporal cortex (pIT). Its peak was in the posterior occipitotemporal sulcus (OTS) but extended laterally, with a significant sub-peak in the inferior temporal gyrus (LITG). At a lower threshold (*p* < 0.001 uncorrected) this activation extended inferiorly into the posterior and middle portion of the fusiform gyrus, as well as superiorly through the pMTG extending along the STS (see **Figure 2A**).

**FIGURE 1 | Unambiguous sentence vs. SCN contrast displayed on the mean group structural image.** Red represents activation significant at *p* < 0.05 FWE-corrected and yellow represents activation significant at *p* < 0.001 uncorrected.

*Sub-peaks that are more than 8 mm from the main peak are indented. L, left; R, right; STS, superior temporal sulcus; STG, superior temporal gyrus; MTG, middle temporal gyrus.*

**Table 4 | Ambiguous vs. unambiguous sentences: peak activations at** *p* < **0.05 FWE corrected.**

*Sub-peaks are indented following main peak.*

*L, left; OTS, occipitotemporal sulcus; ITG, inferior temporal gyrus; IFG, inferior frontal gyrus.*

The response profiles of the regions that showed a significant ambiguity effect were further examined with two region-of-interest (ROI) analyses, performed using the MarsBaR toolbox within SPM8 (Brett et al., 2002). The first analysis assessed the nature of the ambiguity difference and the selectivity of these regions' responses to ambiguous sentences by examining their responses to ambiguous and unambiguous sentences, separately, relative to SCN. For this analysis, a LIFG and a left pIT ROI were constructed as 8 mm radius spheres centered on the LIFG and left pIT group-peak coordinates obtained from the "Ambiguous > Unambiguous" contrast. Mean parameter estimates were obtained in each region for the contrasts "Ambiguous > SCN" and "Unambiguous > SCN" for each participant. As shown in **Figure 2B**, the ambiguity difference in both regions was, importantly, driven by increased activity for the ambiguous sentences compared to SCN rather than deactivation in the unambiguous condition. In addition, one-sample *t*-tests revealed that neither the LIFG nor the left pIT ROI showed a significant response for the unambiguous sentences compared to SCN [*t*(16) = 0.17, *p* = 0.87; *t*(16) = 1.88, *p* = 0.25, respectively].
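To make the geometry of such an ROI concrete, the sketch below enumerates the voxel centres that fall within an 8 mm radius of a peak on a 3 mm isotropic grid. It is an illustration only and not MarsBaR's own routine: the function name `sphere_voxel_centres` is hypothetical, and the code assumes that the peak coincides with a voxel centre and that a voxel belongs to the ROI when its centre lies inside the sphere.

```python
import math


def sphere_voxel_centres(centre_mm, radius_mm=8.0, voxel_mm=3.0):
    """List the mm coordinates of voxel centres inside a spherical ROI.

    Assumes the sphere centre sits exactly on a voxel centre of an
    isotropic grid with spacing `voxel_mm`.
    """
    reach = int(math.floor(radius_mm / voxel_mm))  # max offset in voxels
    centres = []
    for i in range(-reach, reach + 1):
        for j in range(-reach, reach + 1):
            for k in range(-reach, reach + 1):
                offset = (i * voxel_mm, j * voxel_mm, k * voxel_mm)
                distance = math.sqrt(sum(o * o for o in offset))
                if distance <= radius_mm:
                    centres.append(
                        tuple(c + o for c, o in zip(centre_mm, offset))
                    )
    return centres


# Group-peak coordinates reported for the Ambiguous > Unambiguous contrast.
lifg_roi = sphere_voxel_centres((-45, 32, 4))
pit_roi = sphere_voxel_centres((-45, -55, -11))
print(len(lifg_roi), len(pit_roi))
```

The mean parameter estimate for an ROI would then be the average of a contrast image over these voxels, computed separately for each participant.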

A second ROI analysis assessed whether these regions were affected by meaning dominance. Mean parameter estimates for the strongly biased, weakly biased and balanced conditions relative to the unambiguous condition were obtained for the LIFG and left pIT ROIs. The resulting effect sizes for each region were normalized relative to the average effect size for that ROI across all participants and all three contrasts. This normalization adjusts for differences in overall effect size between ROIs, which could otherwise confound comparisons between regions of the magnitude of the condition differences. The normalized effect sizes were entered into a 3 × 2 repeated-measures ANOVA with Dominance (strongly biased, weakly biased and balanced) and Region (LIFG, pIT) as the two factors. The results showed a significant main effect of Dominance [*F*(2, 32) = 3.49, *p* = 0.04, η<sub>p</sub><sup>2</sup> = 0.18], no significant main effect of Region and no significant Dominance × Region interaction (*F* < 1 in each case), indicating no reliable differences between the effect of dominance in the frontal and temporal regions. Paired *t*-tests between each pair of dominance conditions (averaged across region) showed that strongly biased sentences (mean = 1.23, *SD* = 0.80) and weakly biased sentences (mean = 1.15, *SD* = 0.77) produced significantly greater activation than balanced sentences [mean = 0.62, *SD* = 0.79: *t*(16) = 2.21, *p* = 0.04; *t*(16) = 2.19, *p* = 0.04, respectively]. However, there was no significant difference between the strongly and weakly biased sentences [*t*(16) = 0.35, *p* = 0.74]. See **Figure 2C** for the patterns of dominance effects for each of the ROIs.
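The normalization step can be sketched as follows, under the assumption that "normalized relative to the average effect size" means dividing each value by the ROI's grand mean across participants and contrasts. This reading is consistent with the reported condition means, which average to (1.23 + 1.15 + 0.62) / 3 = 1.00, but the exact procedure is not stated in the text. The function name `normalize_roi_effects` and all input values are hypothetical.

```python
from statistics import mean


def normalize_roi_effects(effects):
    """Divide every effect size in one ROI by that ROI's grand mean.

    `effects` is a list of rows, one per participant, each holding the
    three contrast values (strongly biased, weakly biased, balanced),
    all relative to the unambiguous condition.
    """
    grand_mean = mean(value for row in effects for value in row)
    if grand_mean == 0:
        raise ValueError("grand mean is zero; cannot normalize")
    return [[value / grand_mean for value in row] for row in effects]


# Made-up effect sizes for three participants (not data from the study).
lifg_raw = [[2.4, 2.0, 1.0], [1.8, 2.2, 1.4], [2.6, 1.6, 0.8]]
pit_raw = [[0.9, 0.8, 0.3], [0.7, 0.9, 0.5], [1.0, 0.6, 0.4]]

lifg_norm = normalize_roi_effects(lifg_raw)
pit_norm = normalize_roi_effects(pit_raw)

# After normalization both ROIs have a grand mean of 1, so differences
# between conditions can be compared across regions on a common scale.
print(mean(v for row in lifg_norm for v in row))
print(mean(v for row in pit_norm for v in row))
```

Dividing by the grand mean preserves the ratio between conditions within each ROI while removing the overall difference in response magnitude between the frontal and temporal regions.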

No significant effects of dominance were found in the whole-brain analysis (*p* < 0.05 FWE corrected).

#### **INTER-SUBJECT VARIABILITY**

Although peak co-ordinates from the group analysis identify voxels that show the most reliable effects across subjects, it is also important to assess the inter-subject variability around these peaks. For each subject we obtained the nearest local maximum (*p* < 0.05 uncorrected) to the frontal [−45, 32, 4] and temporal [−45, −55, −11] group activation peaks from the Ambiguous > Unambiguous contrast. The location was then examined on each subject's own structural image and identified according to sulcal landmarks. Only peaks that were within the frontal and temporal cortex were considered.

As shown in **Table 5** and **Figure 3**, all subjects, except one, showed significant activation in close proximity to both the frontal and temporal group peaks. Only one subject did not show any significant activation around the frontal peak, with the nearest local maximum located 28 mm from the peak (coordinates [−21, 20, 13], *z*-score = 2.85). There was no significant difference between the two group peaks in terms of the average Euclidean distance of the local maxima [paired *t*-test: *t*(15) = 1.37, *p* = 0.19]. Interestingly, the locations of these local maxima were notably more anatomically consistent (i.e., residing in the same macroanatomic region) in the frontal than in the temporal cortex. For 13 out of the 16 subjects who showed significant activation around the frontal peak, their local maxima resided in pars triangularis, with 2 additional subjects showing activation on the border between pars triangularis (PTr) and pars orbitalis (POr). In contrast, there was more anatomical variability around the temporal peak, with some local maxima residing inferiorly within ventral occipitotemporal areas, such as OTS and fusiform gyrus (FSG), whilst others were located more laterally within MTG/ITG.
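The distance measure used here is the ordinary Euclidean distance between two coordinates in millimetres. As a worked check, the short sketch below applies it to the frontal group peak and the one outlying local maximum reported above. The function name `euclidean_distance_mm` is a hypothetical helper, not part of any analysis package.

```python
import math


def euclidean_distance_mm(a, b):
    """Straight-line distance between two (x, y, z) coordinates in mm."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


frontal_group_peak = (-45, 32, 4)   # group peak, Ambiguous > Unambiguous
outlier_local_max = (-21, 20, 13)   # nearest local maximum of one subject

distance = euclidean_distance_mm(frontal_group_peak, outlier_local_max)
print(round(distance, 1))  # 28.3
```

The squared differences are 24² + 12² + 9² = 801, whose square root is approximately 28.3, in agreement with the reported distance of 28 mm.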

## **DISCUSSION**

The results of this study replicate previous findings of increased activation in the LIFG and posterior temporal cortex for (temporarily) semantically ambiguous sentences compared to unambiguous sentences (Rodd et al., 2005, 2010b, 2012; Davis et al., 2007; Mason and Just, 2007; Zempleni et al., 2007; Bekinschtein et al., 2011). The current study employed ambiguous sentences for which listeners were likely to initially select the incorrect meaning of the ambiguous word and then need to reinterpret their understanding of the sentence later in the comprehension process. This was achieved by presenting the disambiguating information several words after the ambiguous word (e.g., "the woman had to make the TOAST with a very old *microphone*"), as various psycholinguistic models of ambiguity resolution claim that listeners make an initial meaning selection within a few hundred milliseconds of hearing an ambiguous word (Twilley and Dixon, 2000). Thus, this initial finding of ambiguity-responsive activation in the LIFG and posterior temporal cortex is consistent with the hypothesis that both of these regions may be important for reinterpreting the meaning of a word during sentence comprehension (e.g., Novick et al., 2005; Zempleni et al., 2007; Rodd et al., 2012).

This study, furthermore, explored the roles of these regions in ambiguity resolution by assessing their response profiles to different types of sentence stimuli as well as the inter-subject consistency of these regions' responses to ambiguity. The results of the functional-based analyses are discussed first, separately for the two regions, followed by discussion of the inter-subject variability.

**FIGURE 3 | Inter-subject variability around the Ambiguous vs. Unambiguous contrast group peaks displayed on the group mean structural image.** Red is the group peak and blue are individual subjects' peaks. **(A)** Variability around the LIFG group peak shown on a coronal slice where *y* = 32; **(B)** Variability around the LIFG and OTS group peak shown on a sagittal slice where *x* = −45; **(C)** Variability around the OTS group peak shown on a coronal slice where *y* = −55.


**Table 5 | Individual subjects' "Ambiguous** *>* **Unambiguous" local maxima nearest to the frontal and temporal group peaks.**

Two specific functional questions were assessed. (1) Is activation within these regions specific to ambiguous sentences or present for all sentences, albeit to a lesser extent for low-ambiguity sentences? (2) Are these regions primarily contributing to semantic reinterpretation processes or initial meaning selection components of ambiguity resolution? For these questions, two contrasts were assessed via ROIs around the frontal and temporal group peak co-ordinates separately: (1) the regions' response to unambiguous sentences compared to a low-level auditory baseline and (2) the modulation of these responses by meaning dominance (i.e., meaning frequency) by comparing biased and balanced ambiguous words. A region showing an ambiguity effect that is primarily involved in semantic reinterpretation will show larger responses for biased than balanced sentences, whereas regions that are primarily involved in initial meaning selection will show the reverse profile. Together the results of these two contrasts give insights into the ways in which these regions support ambiguity resolution and language comprehension more generally, which ultimately help constrain theories of their functions in these processes.

#### **LEFT INFERIOR FRONTAL GYRUS**

Statistically robust activation (*p <* 0*.*05 FWE corrected) for semantically ambiguous sentences was found in the middle portion of the LIFG, namely pars triangularis (**Figure 2A**). This region has been reported in nearly all published studies on semantically ambiguous sentences (Rodd et al., 2005, 2010b, 2012; Davis et al., 2007; Mason and Just, 2007; Zempleni et al., 2007; Bekinschtein et al., 2011). Thus, this study corroborates it as the most consistent site of significant ambiguity-elevated peaks in the frontal cortex.

The results of the two additional contrasts showed two important findings pertaining to the role of this region in language comprehension. First, this region showed no significant response to unambiguous sentences compared to SCN (**Figure 2B**), suggesting that it may not be routinely involved during comprehension of low-ambiguity speech and may, therefore, perform different functions to those involved in general sentence processing. Several other neuroimaging studies have also failed to find significant LIFG responses to low-ambiguity sentences (Crinion et al., 2003; Spitsyna et al., 2006; Rodd et al., 2012).

This response selectivity for ambiguous but not unambiguous sentences is most consistent with the conflict resolution account of LIFG function (Thompson-Schill et al., 1997; Novick et al., 2005, 2009). According to this theory, the LIFG is involved in sentence comprehension only when there is conflict between simultaneously active representations in order to support the selection of one alternative. It is worth noting that although this region is not recruited by the relatively simple low-ambiguity sentences used in this study, its role is very unlikely to be specific to resolving semantic ambiguity as activation in this region has been observed for a range of other types of complex sentences including syntactically ambiguous sentences and syntactically complex sentences (e.g., Santi and Grodzinsky, 2010; Tyler et al., 2011).

The lack of a response for unambiguous sentences is less easily compatible with sentence-general accounts of the LIFG. For example, Hagoort's unification theory (Hagoort, 2005, 2013) proposes that the LIFG serves to combine small units of linguistic information into larger representations of a sentence. Therefore, all sentences should engage this region to some extent. Although the lack of an unambiguous response may merely be masked by activation in the baseline condition (Binder et al., 1999), patient data provide some corroborating evidence that the LIFG may not be necessary for and, thus, not always involved in language comprehension. For example, patients with LIFG lesions have relatively preserved comprehension of words and of relatively simple sentences (Caramazza and Zurif, 1976; Caplan et al., 1996; Yee et al., 2008; Novick et al., 2009).

It is important to note that the whole LIFG was not uniform in its response to the unambiguous condition. A more posterior region, in pars opercularis, showed greater activation for unambiguous sentences as well as an additional response to the ambiguous stimuli, although both of these effects were only significant at a more lenient statistical threshold. Thus, this suggests that there may be functionally distinct regions in the LIFG, some of which perform processes that are general to sentences and others that are more specific to certain types of sentences. However, it is not clear how this can be reconciled with claims that the function of the LIFG can be fractionated on the basis of either the linguistic nature of the processes (Gough et al., 2005; Vigneau et al., 2006; Hagoort, 2013) or the nature of the cognitive operation (Novick et al., 2005; Badre and Wagner, 2007).

The second key question concerned the effect of dominance (i.e., meaning frequency). The results revealed that mid-LIFG activation was greater for ambiguous sentences that contained a biased ambiguous word, which has one particularly dominant meaning (e.g., "toast"), than for those that contained a balanced ambiguous word, whose meanings are relatively equally frequent (e.g., "bark"; **Figure 2C**). This finding further supports the reinterpretation hypothesis, as listeners are more likely to reinterpret the meaning of a biased word because biased words were always disambiguated to their subordinate meaning (e.g., the speech meaning of "toast"). Psycholinguistic research demonstrates that listeners and readers usually initially select the dominant meaning of a biased word when encountered before disambiguating context (e.g., the bread meaning of "toast"), whereas for balanced words there is less systematic bias for either alternative meaning across individuals (e.g., some may select the dog meaning of "bark" while others select the tree meaning). Thus, for biased sentences, the initial interpretation would often be incorrect and, hence, need to be reinterpreted more often than for balanced sentences. Although no significant dominance effects were found in the whole-brain voxel-wise analysis, this may reflect the fact that dominance responses are likely to be highly variable across both voxels and subjects, given the findings that meaning preferences are inherently variable across subjects (Rodd et al., 2013) and that the exact time-course of disambiguation varies across sentences (Rodd et al., 2012) and individuals depending on comprehension ability (Gernsbacher et al., 1990; Gernsbacher and Robertson, 1995).

The results of the dominance contrast directly replicate Mason and Just's (2007) finding of greater LIFG activation for biased than balanced sentences in visually-presented sentences, albeit in a more anterior ventral region, and provide the first evidence of these effects in spoken sentences. The results also converge with other dominance effects found in the LIFG, including greater activation for subordinate-biased compared with dominant-biased sentences (Zempleni et al., 2007) and the finding of a negative correlation between LIFG activation and the dominance of syntactically ambiguous sentences (Rodd et al., 2010b). Again, both of these effects reflect greater activation for sentences where reinterpretation is more likely. The finding that similar dominance effects in this region were found for this set of spoken sentences as has been reported for visually-presented sentences (Mason and Just, 2007; Zempleni et al., 2007) suggests a common system for disambiguating spoken and written sentences.

Interestingly, the current results showed that the two types of biased sentences patterned together: activation for strongly- and weakly-biased sentences was significantly greater than for balanced sentences, but the two did not differ significantly from each other. This suggests a non-linear relationship between dominance and neural response, with neural responses not simply being associated with the likelihood of semantic reinterpretation. One possible reason for this pattern is that the neural responses may, in part, reflect how *difficult* reinterpretation is because this process is more demanding for biased than balanced words regardless of the extent of the bias *per se*. This explanation is derived from a large body of psycholinguistic research that demonstrates a difference in the state of the alternative meanings of biased and balanced words during the comprehension of late-disambiguation sentences. When biased words are encountered before disambiguating context, their dominant meaning is quickly integrated (e.g., the bread meaning of "toast") while their subordinate (speech) meanings are quickly suppressed or not accessed at all. In contrast, multiple meanings of balanced words (tree and dog meanings of "bark") are initially activated and it takes longer for one meaning to be integrated (e.g., Simpson, 1994; Twilley and Dixon, 2000; Duffy et al., 2001). Thus, contextually appropriate, subordinate, meanings may be harder to (re)integrate than non-selected balanced meanings because they are less available when the disambiguating information is later encountered, and contextually-inappropriate, dominant, meanings may also be harder to override than initially-selected balanced meanings because they have been more strongly integrated (Simpson, 1994; Twilley and Dixon, 2000; Duffy et al., 2001; Gernsbacher and St John, 2001).
Thus, the dominance pattern suggests that the LIFG may be particularly important to integrate less available meanings and/or suppress dominant incorrect representations during sentence reinterpretation. This is highly consistent with a recent patient study demonstrating that patients with damage to the LIFG had particular difficulty in resolving subordinate-biased sentences compared to sentences with balanced ambiguous words (Vuong and Martin, 2011). It is also compatible with Novick et al.'s (2005) view that the LIFG's role in sentence comprehension is to resolve misanalyses, although they refer to syntactic misinterpretations, and converges with findings in non-linguistic domains, such as emotion regulation where two recent meta-analyses have shown that the LIFG (and posterior temporal cortex) is engaged during reinterpretation of emotionally eliciting events (Buhle et al., 2013; Kohn et al., 2014).

This reinterpretation-based conclusion is predicated on the assumption that the greater activation for biased words is related to processes occurring at the time of the disambiguating information. It is therefore important to rule out alternative explanations that could potentially account for these effects in terms of processing at the time that the ambiguous words are initially encountered. At face value, such accounts seem unlikely as no current cognitive theories predict that there should be greater cognitive processing when encountering ambiguous words with one strongly dominant meaning compared with balanced words with two equally-frequent meanings, when these words occur in a neutral context. While it is, in theory, possible that biased words could induce greater processing demands if participants had learnt during the course of the experiment that when they encountered an ambiguous word they should interpret it with the less preferred meaning, existing behavioral and neuroimaging research strongly suggests that such expectations are either not learnt or do not substantially influence sentence comprehension. For example, numerous behavioral studies that have examined processing of these late-disambiguation sentences show that listeners and readers do not experience behavioral processing costs (i.e., longer reading times or poorer performance on a secondary concurrent task) when they encounter biased ambiguous words in a sentence but only experience processing costs when the disambiguating information is encountered later in the sentence (e.g., Duffy et al., 1988, 2001; Rodd et al., 2010a). If biased words induced greater selection conflict at initial encounter with the word then such costs should be found at the time of the ambiguous word.
In addition, even in fMRI studies where the ambiguity is concealed, such that participants do not report noticing any ambiguity in the sentences, activation is found in broadly similar brain regions, suggesting that such activity does not reflect greater selection conflict arising from an explicit strategy employed by the listener (Rodd et al., 2005).

In contrast to the current results, which emphasize the role of this region in reinterpretation, various theories suggest that the LIFG should also be important for processes associated with initial meaning selection (in the absence of reinterpretation) whenever this induces conflict (Thompson-Schill et al., 1997; Novick et al., 2009) and/or makes unification difficult (Hagoort, 2013). In addition, Rodd et al. (2012) found evidence that this region responds to both reinterpretation and initial meaning selection stages of ambiguity processing. Although the current results showed greater activation for sentences with a higher likelihood of reinterpretation, they cannot rule out the possibility that this region also responds to initial selection demands but that the fMRI protocol was not sensitive enough to detect such responses. Because these processes occur at different times during sentence processing, future research would benefit from techniques with higher temporal resolution than the fMRI protocol used here, such as magnetoencephalography (MEG) or time-sensitive fMRI techniques (Rodd et al., 2012), and should compare both the existence and magnitude of these responses.

In summary, the results replicate the involvement of the LIFG in ambiguity resolution and additionally show that this ambiguity-responsive region of the LIFG is not significantly engaged by all types of sentences to the same extent. This region shows no significant response to unambiguous sentences and demonstrates a larger ambiguity response for ambiguous sentences that are more likely to require reinterpretation. Together, the results are most consistent with accounts of this region that do not view the LIFG as mandatory for language comprehension (e.g., the conflict resolution account) and suggest that it supports comprehension when the listener's current interpretation needs to be updated in light of new contextual information.

#### **POSTERIOR TEMPORAL CORTEX**

In the temporal lobe, statistically robust activation for the semantically ambiguous sentences was located in the left posterior inferior temporal cortex (pIT), specifically in the occipitotemporal sulcus and inferior temporal gyrus. This is in a similar location to that found by Rodd et al. (2012) and Bekinschtein et al. (2011), but is more inferior than other studies where activation centers around pMTG/ITG (Rodd et al., 2005; Davis et al., 2007; Zempleni et al., 2007).

The results of the subsequent experimental contrasts showed that this region had a highly similar response profile to the mid-LIFG. The analyses showed (1) no significant response to unambiguous sentences (**Figure 2B**) and (2) the same pattern of dominance effects, where activation was greater for biased than balanced sentences (**Figure 2C**). Together, the results suggest that this region of the pIT is also involved in semantic reinterpretation processes which are not required for comprehension of low-ambiguity sentences.

The locus of this activation is interesting because it is posterior to regions more strongly associated with multimodal semantic processing, namely the anterior fusiform gyrus (Binder et al., 2009; Price and Devlin, 2011; Seghier and Price, 2011), and the cluster is more inferior than that associated with other lexical/semantic processes such as sound-to-meaning mapping in the pMTG/ITS (Hickok and Poeppel, 2007) and semantic control in the pMTG (Jefferies, 2013). Instead, this region has been more generally attributed to high-level visual processing associated with either the visual form of words (Dehaene and Cohen, 2011) or with visual features of meaningful stimuli more generally (Martin, 2007; Price and Devlin, 2011). This region is not consistently found in auditory single word or spoken sentence studies (Binder et al., 2000; Xiao et al., 2005; Spitsyna et al., 2006; Davis and Gaskell, 2009; Obleser and Kotz, 2010), but a large body of research shows that the response of this region is strongly modulated by non-visual processes such as semantics and phonological information (Devlin et al., 2006; Song et al., 2010; Yoncheva et al., 2010; Twomey et al., 2011) and can be activated in the absence of visual information (e.g., Mellet et al., 1998; Price et al., 2003). Thus, activation in response to ambiguity may reflect top-down accessing of visual information related to orthographic representations and/or visual attributes of the objects referred to in the sentence.

This view makes no strong prediction about whether this response should also occur for unambiguous sentences since these kinds of sentences may also evoke visual information. Indeed a recent study has reported activation for low-ambiguity speech in this region (Rodd et al., 2012). The lack of a response to unambiguous sentences, however, is incompatible with accounts that claim that such visual activation is a fundamental component of semantic processing (Martin, 2007).

Ambiguous sentences may engage visual information processing in various ways. For example, the ambiguity may evoke a visual image of the ambiguous word or an image of the content of the ambiguous sentence, which is supported by a large body of research showing increased activation of visual processing areas during imagery tasks (D'Esposito et al., 1997; Mellet et al., 1998; Martin, 2007; Dehaene and Cohen, 2011). Alternatively, visual representations may be activated more automatically by the increased level of semantic competition induced by the ambiguous words (Gennari et al., 2007), given the evidence of inherent functional and anatomical connections between semantic and perceptual representations (Kherif et al., 2011; Price and Devlin, 2011).

While the locus of this temporal activation is most consistent with regions discussed in visual processing accounts, it must be emphasized that it is also close to regions implicated in other accounts of posterior temporal function. In particular, this region is just inferior to the pMTG/ITS region that is argued to support sound-meaning mapping (Hickok and Poeppel, 2007; Hickok, 2012). Thus, the finding of an ambiguity effect in the broad vicinity of this region may also be considered consistent with this account, as the mapping between sound and meaning is more uncertain for ambiguous than unambiguous words. Presumably, this mapping needs re-computing when the meaning of a word is not supported by contextual information (Rodd et al., 2012), which is further supported by the finding that this region was affected by reinterpretation load. However, it is difficult to explain the lack of significant response for unambiguous sentences in this region if it supports such a fundamental aspect of speech comprehension. Instead, the temporal areas that showed responses to unambiguous sentences were located more superiorly, along the STG/STS and anterior MTG. This distribution is consistent with previous studies on speech comprehension, where activation for low-ambiguity speech is typically confined to superior/middle temporal cortex (Humphries et al., 2001; Spitsyna et al., 2006; Adank and Devlin, 2010; Obleser and Kotz, 2010) rather than extending into inferior temporal regions in the way that is typically seen for studies of ambiguity resolution (Rodd et al., 2005; Davis et al., 2007; Zempleni et al., 2007; Bekinschtein et al., 2011).

In summary, the results replicate the involvement of the posterior inferior temporal cortex in ambiguity resolution and further show that activation in this region is not present for low-ambiguity sentences and is particularly responsive to ambiguous sentences that require reinterpretation. Like the response of the mid-LIFG, the results are most consistent with accounts of this region that impute functions that are not mandatory for sentence processing (e.g., visual-based processes) and suggest that this region also supports comprehension particularly when listeners need to update their understanding of a sentence in light of new contextual information.

#### **INTER-SUBJECT VARIABILITY**

As this study confirms, the involvement of both frontal and temporal regions in the processing of semantically ambiguous sentences is emerging as a highly consistent finding across fMRI studies. However, these results are based on group-level analyses, which do not indicate the extent to which this reflects a network in which all components are engaged by all subjects. To investigate this, inter-subject variability was assessed in relation to the frontal and temporal group peaks. All subjects except one showed ambiguity-related local maxima within 10 mm of the LIFG and posterior temporal cortex group peaks. These findings provide evidence that the group-level results reflect activation patterns that are consistent across a majority of subjects, rather than being driven by large activations in only a small proportion of individuals.
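The proximity criterion described above can be illustrated with a short sketch. It assumes that the group peak and each subject's local maxima are available as coordinates in millimetres; the coordinates, subject labels, and function name below are hypothetical and are not taken from the study.

```python
import math

def has_peak_within(group_peak, subject_maxima, radius_mm=10.0):
    """Return True if any subject-level local maximum lies within
    radius_mm (Euclidean distance) of the group-level peak."""
    return any(math.dist(group_peak, m) <= radius_mm for m in subject_maxima)

# Hypothetical coordinates (x, y, z in mm), for illustration only.
group_peak = (-48.0, 26.0, 14.0)
subjects = {
    "s01": [(-50.0, 24.0, 18.0), (-30.0, 10.0, 40.0)],
    "s02": [(-20.0, 60.0, 0.0)],
}

# Flag each subject, then compute the proportion meeting the criterion.
consistent = {sid: has_peak_within(group_peak, maxima)
              for sid, maxima in subjects.items()}
proportion = sum(consistent.values()) / len(consistent)
```

A proportion close to 1 would correspond to the pattern reported here, in which nearly all subjects show a local maximum near each group peak.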

Other interesting findings also came out of this analysis. First, the anatomical locations of the LIFG individual peaks were highly consistent, being located within pars triangularis in over 80% of subjects. This further highlights the potential importance of this particular LIFG sub-division in semantic ambiguity resolution. In contrast, the locations of the temporal peaks were more anatomically variable. While the majority of subjects showed peaks in inferior, as opposed to middle, temporal regions (ITG, occipitotemporal sulcus, fusiform gyrus vs. MTG), there was no clearly consistent anatomical field. Such anatomical inconsistency in the temporal cortex's response to ambiguity across participants might explain why different studies have reported activation in these different sub-regions. The nature of this inter-subject variability is currently unclear, although several possible explanations exist. MRI and post-mortem investigations of the morphology of the temporal lobe have found that various macroanatomical structures (e.g., the inferior temporal sulcus) are extremely variable across subjects (Ono et al., 1990; Kim et al., 2000). The posterior inferior temporal cortex in particular has also been observed to have less distinct cytoarchitectonic boundaries such that neurologists have reported difficulty in subdividing this region based on microcellular properties (von Economo, 2009). These findings suggest that the relationship between function and macroanatomically-defined regions might be less consistent in this region and, thus, across subjects. Recent fMRI work has further shown that higher-level cognitive processes, more generally, show looser function-anatomy mappings than lower-level cognitive processes (Duncan et al., 2009; Tahmasebi et al., 2012). Alternatively, it is possible that the inter-subject variability found in this study reflects functionally different responses to the ambiguous sentences across subjects, such that subjects draw on different cognitive operations to resolve the ambiguity. Although the reasons for such variability are currently uncertain, these findings clearly show inter-subject consistency of both frontal and temporal regions in processing ambiguous sentences.

#### **ADDITIONAL AMBIGUITY-RESPONSIVE REGIONS**

Inspecting the data at a lower statistical threshold revealed that ambiguity-elevated activations occurred across substantially larger clusters within the frontal and temporal cortex than those shown when applying a stringent statistical threshold. The frontal cluster extended throughout pars triangularis and pars opercularis. However, interestingly, activation was not found in its most anterior sub-division, pars orbitalis. This is particularly noteworthy as anterior LIFG has been specifically attributed to semantic processing (Poldrack et al., 1999; Gough et al., 2005; Hagoort, 2005, 2013; Vigneau et al., 2006; Badre and Wagner, 2007). This result is not entirely unexpected as the response of anterior LIFG to semantically ambiguous sentences is the least consistent of the three sub-divisions, with only two studies reporting activation across all three sub-divisions (Rodd et al., 2010b, 2012). One potential explanation is that this region serves a specific semantic-related function that is not important for resolving all types of ambiguous sentences. For example, one current theory of the anterior LIFG is that it supports controlled semantic retrieval (Badre and Wagner, 2007). In these sentences, the disambiguating word may have constituted a sufficiently strong semantic cue to the correct meaning of the word that additional retrieval processes were not needed.

Another interesting observation was the notable extension of ambiguity-related activation into frontal and temporal regions that have been strongly implicated in phonological processing, namely the posterior and mid-STS as well as the posterior LIFG (Hickok and Poeppel, 2007; Rauschecker and Scott, 2009; Hagoort, 2013). Such activation may reflect a replaying of the heard sentence in an attempt to reanalyze the meaning of these sentences. Thus, these additional results may provide working hypotheses for both cognitive and neural models of ambiguity resolution. It is also possible that these less robust regions reflect inter-subject variability in the processing of ambiguous sentences.

Together these findings replicate the involvement of the LIFG and posterior temporal cortex in semantic ambiguity resolution found in previous studies and further demonstrate that this network is highly consistent across individuals. The results, additionally, explored the potential roles of these regions in this process, supporting the hypothesis that both regions may be particularly important when listeners need to reinterpret the meaning of an ambiguous word during sentence comprehension.

#### **ACKNOWLEDGMENTS**

This work was funded by a BBSRC studentship awarded to Sylvia Vitello and a Leverhulme Trust grant awarded to Jennifer M. Rodd and Joseph T. Devlin. We would like to thank Pamela Farago, Alice Treen and Sara Watchko for help with data collection.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnhum.2014.00530/abstract


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 28 April 2014; accepted: 30 June 2014; published online: 29 July 2014. Citation: Vitello S, Warren JE, Devlin JT and Rodd JM (2014) Roles of frontal and temporal regions in reinterpreting semantically ambiguous sentences. Front. Hum. Neurosci. 8:530. doi: 10.3389/fnhum.2014.00530*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Vitello, Warren, Devlin and Rodd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Listening effort and accented speech

#### *Kristin J. Van Engen<sup>1</sup>\* and Jonathan E. Peelle<sup>2</sup>*

*<sup>1</sup> Department of Psychology, Washington University in St. Louis, St. Louis, MO, USA*

*<sup>2</sup> Department of Otolaryngology, Washington University in St. Louis, St. Louis, MO, USA*

*\*Correspondence: kvanengen@wustl.edu*

#### *Edited by:*

*Patti Adank, University College London, UK*

#### *Reviewed by:*

*Caroline Floccia, University of Plymouth, UK*

**Keywords: listening effort, speech comprehension, accent, speech perception, speech perception in noise**

Understanding spoken language requires mapping acoustic input onto stored phonological and lexical representations. Speech tokens, however, are notoriously variable: they fluctuate within speakers, across speakers, and in different acoustic environments. As listeners, we must therefore perceive speech in a manner flexible enough to accommodate acoustic signals that imperfectly match our expectations. When these mismatches are small, comprehension can proceed with minimal effort; when acoustic variations are more substantial, additional cognitive resources are required to process the signal. A schematic model of speech comprehension is shown in **Figure 1**, emphasizing that different degrees of acoustic mismatch will require varying levels of cognitive recruitment. Recent research increasingly supports a critical role for executive processes—such as verbal working memory and cognitive control—in understanding degraded speech (Wingfield et al., 2005; Eckert et al., 2008; Rönnberg et al., 2013). However, to date, the literature has focused on sources of increased acoustic challenge that originate in the listener (hearing loss) or in the listening environment (background noise). Largely unexplored are the cognitive effects of accented speech (i.e., speech produced by a speaker who does not share a native language or dialect with the listener), a ubiquitous source of variability in speech intelligibility. Here we argue that accented speech must also be considered within a framework of listening effort.

#### **LISTENING EFFORT**

Recent years have seen an increasing focus on the cognitive effects of acoustic challenge during speech comprehension (Mattys et al., 2012). A common theme is that when speech is acoustically degraded, it deviates from what listeners are used to (i.e., stored phonological and lexical representations), resulting in a mismatch between expectation and percept (Sohoglu et al., 2012, 2014). As a result, listeners must recruit additional cognitive resources to make sense of degraded speech (Rönnberg et al., 2013). Given that listeners' cognitive resources are limited, at some point the allocation of cognitive resources to resolve acoustic challenge will begin to impinge upon other types of behavior. Indeed, even mild hearing loss has been shown to impact syntactic processing (Wingfield et al., 2006), running memory for speech (McCoy et al., 2005), and subsequent memory for short stories (Piquado et al., 2012). Further support for the connection between acoustic and cognitive processing comes from the fact that behavioral challenges are exacerbated in older adults due to age-related cognitive decline (Wingfield et al., 2005).

If increased executive processing is required to deal with acoustic challenge, the effects should be apparent not only in listeners with hearing loss, but also in listeners with good hearing in cases of external auditory interference. Consistent with this view, acoustic distortion reduces the episodic recall of word pairs (Heinrich and Schneider, 2011) or word lists (Rabbitt, 1968; Cousins et al., 2014). Conversely, increasing speech clarity through the use of listener-oriented speech facilitates recognition memory for spoken sentences (Van Engen et al., 2012). Thus, listening effort appears to be a general consequence of challenging speech signals, in which acoustic mismatch can arise due to either internal factors such as hearing loss, or external factors such as background noise.

Functional neuroimaging studies have begun to link these additional executive resources to specific neural systems by identifying increased neural activity resulting from acoustic challenge during speech comprehension (Davis and Johnsrude, 2003; Eckert et al., 2008, 2009; Adank, 2012; Hervais-Adelman et al., 2012; Obleser et al., 2012; Erb et al., 2013). These increases in neural activity frequently involve areas not seen during "normal" speech comprehension—such as frontal operculum, anterior cingulate, and premotor cortex—consistent with listeners' recruitment of additional executive resources to cope with acoustic challenge. Evidence that these increases in brain activity are task-relevant comes from the fact that they vary as a function of attention (Wild et al., 2012), and modulate behavioral performance on subsequent trials (Vaden et al., 2013).

Taken together, then, there is clear evidence that when speech is acoustically degraded, listeners must rely on additional cognitive resources, supported by an extensive network of brain regions. This general principle has been shown in listeners with hearing loss and in good-hearing listeners presented with acoustically degraded materials. In the next section we consider how these findings may play out in the context of understanding accented speech.

#### **LISTENING EFFORT AND ACCENTED SPEECH**

If acoustic deviation from stored phonological/lexical representations is indeed the primary cause of increased listening effort, then speech produced in an unfamiliar accent (whether a regional accent or a foreign accent) should similarly affect not only speech intelligibility, but also the efficiency and accuracy of linguistic processing, and memory for what has been heard. Furthermore, accented speech would also be expected to involve the recruitment of compensatory executive resources.

Foreign-accented speech, for example, is characterized by systematic segmental and/or suprasegmental deviations from native language norms. Naturally, these mismatches can lead to a reduction in the intelligibility of the speech (Gass and Varonis, 1984; Munro and Derwing, 1995; Bent and Bradlow, 2003; Burda et al., 2003; Ferguson et al., 2010; Gordon-Salant et al., 2010a,b). However, even when foreign-accented speech is fully intelligible to listeners (i.e., they can correctly repeat or transcribe it), processing it requires more effort than processing native accents: listeners report that accented speech is more difficult to understand (Munro and Derwing, 1995; Schmid and Yeni-Komshian, 1999), and it is processed more slowly (Munro and Derwing, 1995; Floccia et al., 2009) and comprehended less well than native-accented speech (Anderson-Hsieh and Koehler, 1988; Major et al., 2002). Similar effects have been observed for unfamiliar regional accents: Adank et al. (2009) have shown, for example, that listeners' response times and error rates on a semantic verification task (i.e., responding to simple true/false questions spoken with different accents) are higher for speech produced in an unfamiliar regional accent. (For a review of the costs associated with processing accented speech across the lifespan, see Cristia et al., 2012.)

The behavioral consequences of listening to accented speech, therefore, include reductions in intelligibility, comprehensibility, and processing speed—all effects that mirror those seen under conditions involving acoustic degradation. To date, there are few functional neuroimaging studies investigating whether increased brain activity is also seen in response to accented speech, although published accounts suggest this is indeed the case (Adank et al., 2012). In general, we would expect that when listening to accented speech, people will recruit comparable executive resources as when listening to other forms of degradation. This would be consistent with increased activity in regions of premotor cortex, inferior frontal gyrus, and the cingulo-opercular network.

That being said, it is important to acknowledge that mismatches between incoming signals and stored representations can arise by different mechanisms. For degraded speech—including steady-state background noise, hearing impairment, or aided listening—listeners experience a loss of acoustic information. This loss is systematic insofar as it involves the inaudibility of a particular portion of the acoustic signal. In accented speech, there are systematic mismatches between the incoming signal and listeners' expectations, but these arise through phonetic and phonological deviations rather than through signal loss. The degree to which the source of acoustic mismatch affects the type and degree of compensatory cognitive processing required for understanding speech remains an open question. It could be that degraded and accented speech require similar types of executive compensation, and thus both neural and behavioral consequences are largely similar. A second option is that although listeners show similar behavioral consequences to these two types of speech, they are obtained through the use of different underlying neural mechanisms. Finally, there may be differences in both the neural and behavioral consequences of degraded compared to accented speech, or between different types of accented speech. The available preliminary evidence suggests a possible dissociation at the neural level, with different patterns of recruitment for speech in noise compared to accented speech (Adank et al., 2012), and in regional compared to foreign accents (Goslin et al., 2012). However, additional data are needed, and the results may also depend on the level of spoken language processing being tested (Peelle, 2012), task demands, and other factors that determine cognitive challenge for listeners.

#### **ADDITIONAL CONTRIBUTIONS TO LISTENING EFFORT**

There are undoubtedly a number of additional influences on the perception of accented speech which may not be relevant for acoustically degraded speech. These include familiarity with an accent (Gass and Varonis, 2006), cultural expectations (Hay and Drager, 2010), and intrinsic listener motivation (Evans and Iverson, 2007). Acoustic familiarity may be specifically related to speech, or simply reflect the experience of a particular listener (Holt, 2006). Together, this confluence of factors can interact with acoustic mismatch to determine the degree of perceptual effort experienced by listeners.

#### **WHY DO THE COGNITIVE CONSEQUENCES OF ACCENTED SPEECH MATTER?**

If understanding accented speech indeed requires additional cognitive support, then listeners are likely to have greater difficulty not only understanding their accented interlocutors (i.e., reduced intelligibility), but also comprehending and remembering what they have said, and possibly in managing other information or tasks while listening to accented speech. Given the ubiquity of accented speakers (both foreign and regional) in contemporary society, the practical implications of these problems are wide-ranging. Consider, for example, classrooms with foreign-accented teachers or medical settings where patients and medical personnel who *do* share a language may nevertheless *not* speak with similar accents. In such situations the compensatory cognitive processing that can often (though not always) maintain high intelligibility between speakers and listeners may still come at a cost to listeners' ability to encode critical information. Within the context of a broader framework for effortful listening, it is clear that such challenges will be further exacerbated in the frequently encountered case of acoustic degradation (such as from background noise or hearing loss), where mismatches between incoming speech and listeners' expectations can arise from *both* loss of acoustic information and from distortion due to accent. It has been observed, for example, that noisy or reverberant listening environments disproportionately reduce the intelligibility of foreign-accented speech as compared to native-accented speech (Van Wijngaarden et al., 2002; Rogers et al., 2006).

An important point is that effortful listening is not an all-or-none phenomenon; rather, the level of cognitive compensation required will depend on the degree of acoustic mismatch in any given listening situation. A relatively mild accent, for example, or one that is highly familiar to a particular listener, can be well understood and require little to no additional effort. Furthermore, we know that listeners can rapidly adapt to both foreign-accented speech (Clarke and Garrett, 2004; Bradlow and Bent, 2008; Sidaras et al., 2009; Baese-Berk et al., 2013) and speech produced in unfamiliar regional accents (Clopper and Bradlow, 2008; Maye et al., 2008; Adank and Janse, 2010). Assuming that understanding accented speech is cognitively challenging due to mismatches between signals and listener expectations, as suggested by the general model of effortful listening presented here, it follows that such perceptual adaptation to an accent will decrease listening effort, and thereby *increase* functional cognitive capacity: Adaptation effectively reduces the mismatch between incoming speech and listener expectations, thus lowering the demand for compensatory executive processes (**Figure 1**). Auditory training with accented speech may therefore not only be useful for improving intelligibility, but also for increasing listeners' cognitive capacity<sup>1</sup>.

#### **CONCLUSIONS**

When speech does not conform to listeners' expectations, additional cognitive processes are required to facilitate comprehension. In the case of acoustic degradation, it is increasingly accepted that this type of effortful listening can interfere with subsequent attention, language, and memory processes. Here we have argued that accented speech shares critical characteristics with acoustically degraded speech, and that considering the cognitive consequences of acoustic mismatch is critical in understanding how listeners deal with accented speech.

#### **ACKNOWLEDGMENTS**

Research reported in this publication was supported by the Dana Foundation and the National Institute on Aging of the National Institutes of Health under award number R01AG038490.


<sup>1</sup> It is possible that improvements over the course of perceptual learning also rely to some degree on executive processes. So although the most straightforward prediction is that perceptual adaptation will reduce the demand on executive resources, the degree to which this actually happens is an empirical question.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 21 May 2014; paper pending published: 08 June 2014; accepted: 14 July 2014; published online: 05 August 2014.*

*Citation: Van Engen KJ and Peelle JE (2014) Listening effort and accented speech. Front. Hum. Neurosci. 8:577. doi: 10.3389/fnhum.2014.00577*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Van Engen and Peelle. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Relationship between perceptual learning in speech and statistical learning in younger and older adults

## *Thordis M. Neger<sup>1,2</sup>\*, Toni Rietveld<sup>1</sup> and Esther Janse<sup>1,3</sup>*

*<sup>1</sup> Centre for Language Studies, Radboud University Nijmegen, Nijmegen, Netherlands*

*<sup>2</sup> International Max Planck Research School for Language Sciences, Nijmegen, Netherlands*

*<sup>3</sup> Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, Nijmegen, Netherlands*

#### *Edited by:*

*Carolyn McGettigan, Royal Holloway University of London, UK*

#### *Reviewed by:*

*Frank Eisner, Radboud University, Netherlands Julia Erb, Max Planck Institute for Human Cognitive and Brain Sciences, Germany*

#### *\*Correspondence:*

*Thordis M. Neger, Centre for Language Studies, Radboud University Nijmegen, PO Box 310, 6500 AH Nijmegen, Netherlands e-mail: thordis.neger@mpi.nl*

Within a few sentences, listeners learn to understand severely degraded speech such as noise-vocoded speech. However, individuals vary in the amount of such perceptual learning and it is unclear what underlies these differences. The present study investigates whether perceptual learning in speech relates to statistical learning, as sensitivity to probabilistic information may aid identification of relevant cues in novel speech input. If statistical learning and perceptual learning (partly) draw on the same general mechanisms, then statistical learning in a non-auditory modality using non-linguistic sequences should predict adaptation to degraded speech. In the present study, 73 older adults (aged over 60 years) and 60 younger adults (aged between 18 and 30 years) performed a visual artificial grammar learning task and were presented with 60 meaningful noise-vocoded sentences in an auditory recall task. Within age groups, sentence recognition performance over exposure was analyzed as a function of statistical learning performance, and other variables that may predict learning (i.e., hearing, vocabulary, attention switching control, working memory, and processing speed). Younger and older adults showed similar amounts of perceptual learning, but only younger adults showed significant statistical learning. In older adults, improvement in understanding noise-vocoded speech was constrained by age. In younger adults, amount of adaptation was associated with lexical knowledge and with statistical learning ability. Thus, individual differences in general cognitive abilities explain listeners' variability in adapting to noise-vocoded speech. Results suggest that perceptual and statistical learning share mechanisms of implicit regularity detection, but that the ability to detect statistical regularities is impaired in older adults if visual sequences are presented quickly.

**Keywords: perceptual learning, statistical learning, individual differences, aging, working memory, attention switching control, processing speed, vocabulary**

#### **INTRODUCTION**

Listeners' ability to rapidly learn to understand unfamiliar speech conditions such as accented, disordered or noise-vocoded speech is impressive. Within a few sentences, listeners learn to map a new type of speech input onto their old percept, some improving their speech recognition performance by more than 60% (Eisner et al., 2010). However, listeners show great variability in the amount of such perceptual learning (Eisner et al., 2010). This raises the question of which mechanisms underlie perceptual learning.

Perceptual learning can be defined as "relatively long-lasting changes to an organism's perceptual system that improve its ability to respond to its environment" (Goldstone, 1998, p. 585). As listeners are not able to describe the changes that led to their improved perception, perceptual learning is assumed to be a type of implicit learning (Fahle, 2006). A conceptual framework that accounts for changes in the perceptual system is the Reverse Hierarchy Theory (RHT) (Ahissar and Hochstein, 2004). The RHT argues that perceptual learning is a top–down guided process. When a listener is exposed to a novel speech condition, initial performance fails as the speech input can no longer be readily matched to higher-level representations such as word representations. According to the RHT, prolonged exposure modifies these higher-level representations, which subsequently enables top–down guidance to retune weights at lower levels of the processing hierarchy: the weights of task-relevant input are increased and the weights of task-irrelevant input are pruned. This process of weight retuning starts at the highest level of the hierarchy and continues gradually to the lower levels (i.e., the reverse hierarchy). When lower-level representations have been modified, performance under difficult conditions can be based on accessing these low-level representations. This is illustrated by findings that adaptation to noise-vocoded speech generalizes to novel words (Hervais-Adelman et al., 2008), to non-words (Loebach et al., 2008) and to the recognition of environmental sounds (Loebach et al., 2009). These generalization findings suggest that perceptual learning in speech modifies representations at lower levels of the hierarchy, that is, representations at a sublexical level (Hervais-Adelman et al., 2008; Banai and Amitay, 2012).

The RHT has been influential in explaining behavioral observations in visual and auditory perceptual learning (Nahum et al., 2010; Banai and Amitay, 2012; Cohen et al., 2013; Sabin et al., 2013). However, the RHT does not specify which processes take place in the initial stages of adaptation that enable the perceptual system to identify task-relevant cues in the input and to modify high-level representations. One of the basic principles in the RHT and other models of perceptual learning is the retuning of weights based on the relevance of features or dimensions for the specific task (Goldstone, 1998; Dosher and Lu, 1999; Ahissar and Hochstein, 2004; Petrov et al., 2005). This principle implies that stimuli have to share certain features, which can thus be considered task-relevant, for perceptual learning and for transfer of learning to occur. Accordingly, several studies have highlighted the importance of structural regularities (Cohen et al., 2013) and of stimulus consistencies for perceptual learning (e.g., Nahum et al., 2010). In other words, for learning to occur, participants need to detect specific regularities in the input. Therefore, individual differences in sensitivity to such regularities may indicate why listeners differ in adapting to unfamiliar speech input.

An implicit learning mechanism that has been linked to pattern sensitivity is statistical learning. Statistical or probabilistic learning describes the ability to implicitly extract regularities from an input by detecting the probabilities with which properties co-occur (Misyak and Christiansen, 2012). Statistical learning has gained increasing attention over the past years in language research, as language itself is probabilistic in nature (Auer and Luce, 2005). Accordingly, co-occurrence probabilities of units have been shown to facilitate processing at various linguistic levels (e.g., effects of phonotactic probability; Vitevitch et al., 2004) or transitional probability (e.g., Thompson and Newport, 2007). Statistical learning has been found to be of major importance in language acquisition (Saffran, 2003). Also in adulthood, individual differences in statistical learning have been shown to predict sentence processing performance (Misyak and Christiansen, 2012). Moreover, deficits in statistical learning ability have been reported for various language-related disorders such as specific language impairment (Evans et al., 2009), agrammatic aphasia (Christiansen et al., 2010), and language-based learning disabilities (Grunow et al., 2006). As statistical probabilities are provided and continuously updated by the input, relying on statistical probabilities actually enables language users to adapt to their environment, which is the essential characteristic of perceptual learning. Therefore, the present study aims to investigate whether statistical learning relates to perceptual learning in speech perception. If adaptation to a novel speech condition and statistical learning share general mechanisms of implicit regularity detection, then statistical learning performance in a non-auditory modality using non-linguistic stimuli should predict individuals' perceptual learning for speech comprehension.
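As a concrete illustration of the co-occurrence statistics referred to above, the forward transitional probability of element B following element A can be estimated as the number of times the pair AB occurs divided by the number of times A occurs with a successor. The sketch below assumes a made-up symbol stream; the function name and data are illustrative and are not the materials used in the present study.

```python
from collections import Counter

def transitional_probabilities(sequence):
    """Estimate forward transitional probabilities P(next | current)
    from adjacent pairs in a sequence of symbols."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    # Count each symbol only at positions where it has a successor.
    first_counts = Counter(sequence[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Hypothetical stream in which "A" is always followed by "B",
# whereas "B" is followed by "C" or "D".
stream = list("ABCABDABC")
tp = transitional_probabilities(stream)
```

In this toy stream the transition from A to B is fully predictable, whereas the transitions out of B are not, which is the kind of contrast a statistical learner could exploit to find structure in the input.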

Perceptual learning in speech and statistical learning may also draw (partly) on the same underlying cognitive abilities, such as working memory and attention. Therefore, we investigated whether both types of learning could be predicted from general cognitive and linguistic abilities. Ahissar and Hochstein (2004) proposed that attentional mechanisms may be engaged in choosing which neuronal populations pass on task-relevant information to the higher levels and in increasing the functional weights of these populations. Several frameworks of perceptual learning incorporate the idea that attentional mechanisms are involved in perceptual learning (e.g., Goldstone, 1998; Fahle, 2006; Dosher et al., 2010). A study on frequency discrimination found that perceptual learning even occurred after training with non-discriminable stimuli (Amitay et al., 2006). Apparently, training directed the participants' attentional focus to the relevant stimulus dimension, which was sufficient to access the relevant low-level representations during the test phase (Amitay et al., 2006). Moreover, performance on a selective attention task predicted the amount of learning in adaptation to accented speech (Janse and Adank, 2012). Further evidence that attention is involved in perceptual learning comes from studies in which listeners were simultaneously exposed to noise-vocoded speech and both auditory and visual distractors (Huyck and Johnsrude, 2012; Wild et al., 2012). Only listeners who attended to the noise-vocoded stimuli showed improved performance in recognizing noise-vocoded speech. Similar effects of attentional focus arise in tasks of visual statistical learning. When observers are asked to attend to symbols of a certain color in a two-color symbol stream, statistical learning effects unfold for regularities within the attended color but not for regularities within the unattended color (Turk-Browne et al., 2005). These findings imply that only attended features are effectively learned. It has been proposed that training procedures that help participants to *switch* their attention to appropriate perceptual features (e.g., fixed temporal presentation of multiple stimuli, repeated presentation) may particularly enhance perceptual learning (Zhang et al., 2008). Therefore, attention switching control may be involved in the process of distinguishing relevant from non-relevant features in tasks of implicit learning.

Another cognitive ability that may be involved in tasks of implicit learning is working memory, which is required to simultaneously store and process auditory or visual information (Gathercole, 1999). Performance on working memory tasks has been shown to predict performance in various speech reception tasks (for a review see Akeroyd, 2008) and, more specifically, there are indications that working memory relates to perceptual learning performance. Teenaged students with learning and reading disabilities who participated in perceptual learning tasks of frequency and duration discrimination showed improved working memory skills after training (Banai and Ahissar, 2009). Furthermore, the two students who failed to show perceptual learning were characterized by the poorest working memory capacity in the sample. During training, students were repeatedly presented with the same stimuli, which allowed them to access low-level representations, thereby improving frequency and duration discrimination. Thus, working memory may have aided perceptual learning by keeping stimuli accessible (also see Goldstone, 1998). In contrast to these findings, Erb et al. (2012) did not find an association between working memory and individual differences in adaptation to noise-vocoded speech. Note, however, that in this study working memory was measured by tasks that relied on immediate recall and, hence, on short-term memory (i.e., nonword repetition task, digit span forward task). Possibly, more complex span tasks that measure the ability to simultaneously store and process information, rather than just recall capacity, may be particularly associated with tasks of perceptual learning. With respect to statistical learning, recent studies reported correlations between working memory capacity and performance on implicit sequence learning tasks (Bo et al., 2011, 2012).
However, findings regarding the link between working memory and implicit learning of sequences are controversial (for a review see Janacsek and Nemeth, 2013) and it has been argued that working memory as an executive resource is not involved in tasks of implicit learning (Kaufman et al., 2010).

An additional cognitive ability that should be considered is processing speed. Processing speed reflects the efficiency with which a processing system performs simple operations (Kaufman et al., 2010) and, as a general index of processing efficiency, may be assumed to facilitate perceptual learning. Previous research showed that processing speed correlates with performance on tasks of implicit sequence learning (Salthouse et al., 1999; Kaufman et al., 2010). Higher efficiency of the processing system may be beneficial at various stages of the adaptation process. In the framework of the RHT, processing speed may reduce listeners' time to retrieve high-level representations and to initiate modification processes. Furthermore, processing speed may accelerate the process of weight retuning, thereby gaining faster access to low-level representations.

As the current study focuses on adaptation for spoken language understanding, perceptual learning may also draw on linguistic knowledge. Davis et al. (2005) presented data on how the so-called pop-out effect accelerates the process of perceptual learning: if listeners knew the content of what was going to be said before they actually heard the sentence in its degraded form, this benefited their perceptual learning. In line with the Eureka effect in the RHT, in which a cue regarding the content of the stimulus can trigger direct perception of the stimulus and facilitates strong and long-lasting learning effects (Ahissar and Hochstein, 2004), this pop-out finding suggests that lexical knowledge facilitates access to higher-level representations, thereby initiating top–down processes that aid sublexical retuning (Davis et al., 2005). Accordingly, vocabulary, as a measure of lexical knowledge, has been found to predict the amount of perceptual learning in listeners who were adapting to an unfamiliar foreign-sounding accent (Janse and Adank, 2012), accents being linguistic degradations of the stimulus. If we assume that lexical knowledge aids perceptual learning by guiding the top–down search, effects of lexical knowledge should also arise in non-linguistic speech degradations. Therefore, we investigate whether linguistic knowledge, as indexed by vocabulary knowledge, may also facilitate shifting of attention to relevant features of acoustically degraded speech.

As we want to investigate which cognitive processes are involved in perceptual learning in speech, we also aim to test whether our findings generalize to a heterogeneous group of listeners. Older adults typically form a highly heterogeneous group, as perceptual and cognitive processing undergo changes over the life span. Age-related changes in hearing acuity (Lin et al., 2011), processing speed, capacity on working memory tests, attentional control (for a review see Park and Reuter-Lorenz, 2009) but also lexical knowledge (Ramscar et al., 2014) may therefore help to identify relevant cognitive processes. Importantly, the ability to adapt to unfamiliar speech input is preserved throughout the life span (Peelle and Wingfield, 2005; Golomb et al., 2007; Adank and Janse, 2010; Gordon-Salant et al., 2010). However, differences in the amount and pattern of perceptual learning over exposure between younger and older adults also indicate changes in the underlying processes. While younger and older listeners show the same amount of learning in the initial adaptation phase, older listeners' performance plateaus earlier in adapting to unfamiliar speech (Peelle and Wingfield, 2005; Adank and Janse, 2010), older adults show less transfer of learning to similar conditions (Peelle and Wingfield, 2005), and exhibit slower consolidation of learning (Sabin et al., 2013). Such differences illustrate that the interdependency between cognitive functions and implicit learning processes may change as a function of age. Cognitive abilities associated with adaptation to unfamiliar speech in younger adults may not be the same as in older adults. In order to gain more insights into individual abilities associated with adaptation to unfamiliar speech across the life span, we tested both younger and older adults.

In sum, this study investigates perceptual learning for spoken language understanding in younger and older adults. We use noise-vocoded speech, an acoustic degradation of the speech signal which simulates the auditory signal of a cochlear implant. In contrast to naturally occurring variability in speech (such as accents), participants do not encounter noise-vocoded speech in everyday life. As a consequence, all participants share the same naïve exposure level. We specifically study whether perceptual learning is associated with a general ability to implicitly detect statistical regularities. By testing participants' probabilistic sequence learning with visual non-linguistic stimuli, we apply a rigorous test of the association between the two types of implicit learning. Additionally, we investigate whether both types of implicit learning are associated with individual differences in attention switching control, working memory, information processing speed or lexical knowledge.

## **MATERIALS AND METHODS**

#### **PARTICIPANTS**

In total, 60 younger and 73 older adults participated in the current study. All participants were native speakers of Dutch, neurologically intact and had no history of language disorders. One younger participant was excluded as he showed floor performance throughout the perceptual learning task (i.e., he did not understand the noise-vocoded speech at all). Younger adults were aged between 18 and 29 years (mean age 21 years, *SD* 2.5 years) and older adults were aged between 60 and 84 years (mean age 68.4 years, *SD* 5.7 years). In both age groups, the majority of participants were female (53 out of 59 participants in the younger and 47 out of 73 participants in the older sample). Participants had normal or corrected-to-normal vision. Participants were recruited via the subject database of the Max Planck Institute for Psycholinguistics and were compensated €8 per hour for their time.

#### **AUDITORY, COGNITIVE, AND LINGUISTIC BACKGROUND MEASURES**

#### *Auditory measure*

*Hearing thresholds.* Age-related hearing loss is prevalent in older adults (Lin et al., 2011). Poorer hearing may affect perceptual learning as auditory input contains less detail, thereby interfering with accessing and retuning low-level representations. Participants' auditory function was assessed by measuring air-conduction pure-tone thresholds with an Oscilla USB-300 screening audiometer. As age-related hearing loss particularly affects sensitivity to high frequencies, a high-frequency pure-tone average (PTAH) was taken as an index of hearing acuity. This PTAH was calculated as the mean hearing threshold over 1, 2, and 4 kHz (instead of the standard PTA over 0.5, 1, and 2 kHz). Only the PTAH of the best ear was entered in the analysis, as all auditory stimuli were presented binaurally. Twenty-seven older participants actually qualified for hearing aids on the basis of their hearing thresholds according to the standard of hearing-aid coverage in the Netherlands (PTAH of the worst ear ≥35 dB HL). None of the participants wore hearing aids in daily life, however. Higher thresholds reflected poorer hearing. Mean thresholds at different frequencies per age group are given in **Figure 1**.
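The PTAH computation described above can be illustrated with a minimal sketch. The function names and the audiogram values are hypothetical and serve only to show the averaging over 1, 2, and 4 kHz and the selection of the better ear; they are not data or code from the study.

```python
from statistics import mean

# Frequencies (Hz) entering the high-frequency pure-tone average (PTAH).
PTAH_FREQS = (1000, 2000, 4000)

def ptah(thresholds_db_hl):
    """Mean threshold (dB HL) over 1, 2, and 4 kHz for one ear."""
    return mean(thresholds_db_hl[f] for f in PTAH_FREQS)

def best_ear_ptah(left, right):
    """PTAH of the better ear; lower thresholds mean better hearing."""
    return min(ptah(left), ptah(right))

# Hypothetical audiogram (dB HL per frequency), not data from the study.
left = {500: 15, 1000: 20, 2000: 30, 4000: 45}
right = {500: 10, 1000: 15, 2000: 25, 4000: 35}
best = best_ear_ptah(left, right)
```

In this hypothetical case the right ear has the lower average and would be the one entered in the analysis.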

#### *Cognitive measures*

*Working memory.* Participants performed a digit span backward task as an index of working memory capacity. The test was a computerized variant of the digit span backward task included in the Wechsler Adult Intelligence Scale Test (Wechsler, 2004) and presented via E-prime 1.2 (Schneider et al., 2002). Participants were asked to report back sequences of digits in reverse order. Digits were presented in a large white font (Arial, font size 100) against a black background. Each digit was presented for 1 s with an interval of 1 s between the consecutive digits of a sequence. Sequence length increased stepwise from two to seven digits and performance on each sequence length was tested on two different trials (all participants were presented with all sequence lengths, regardless of their performance on earlier easier trials). The actual test trials were preceded by two practice trials with a sequence length of three to familiarize participants with the task. Participants had to recall 12 test sequences in total. Individual performance was operationalized as the proportion of correctly reported sequences (out of 12).

*Processing speed.* Information processing speed was assessed by means of a digit symbol substitution task. Participants had to convert as many digits as possible into assigned symbols in a fixed amount of time (90 s). The digit symbol substitution task is a paper-and-pencil test that was derived from the Wechsler Adult Intelligence Scale Test (Wechsler, 2004). Performance was measured by the number of correctly converted digits in 90 s, meaning that higher scores reflected higher information processing speed.

*Attention switching control.* The Trail Making Test was administered to obtain a measure of attention switching control. The paper-and-pencil test contained two parts. In Part A, participants were asked to connect numbers as quickly as possible in ascending order (i.e., 1-2-3...), the numbers being spread randomly over a white page. The Part B page had both numbers and letters randomly spread over the page. Participants now had to alternately join numbers and letters in ascending order (i.e., 1-A-2-B-3-C...). In both parts, 25 items had to be connected and the total time to complete each part was measured. We calculated the ratio between both parts (Part B/Part A) as a measure of attention switching control (Arbuthnott and Frank, 2000), thereby taking general slowing into account (Verhaeghen and De Meersman, 1998; Salthouse, 2011). Higher scores indicated higher costs of switching between letters and numbers and, therefore, poorer attention switching control.
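As a minimal sketch of why the Part B/Part A ratio takes general slowing into account, consider the following. The function name and completion times are hypothetical. A participant who is uniformly slower on both parts obtains the same ratio, whereas disproportionate slowing on Part B raises it.

```python
def tmt_switch_cost(time_a_s, time_b_s):
    """Ratio Part B / Part A; higher values indicate poorer switching control."""
    if time_a_s <= 0:
        raise ValueError("Part A completion time must be positive")
    return time_b_s / time_a_s

# Hypothetical completion times in seconds, not data from the study.
fast = tmt_switch_cost(25.0, 50.0)           # baseline participant
uniformly_slow = tmt_switch_cost(50.0, 100.0)  # twice as slow on both parts
poor_switcher = tmt_switch_cost(40.0, 100.0)   # extra slowing on Part B only
```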

#### *Linguistic measure*

*Vocabulary knowledge.* A vocabulary test in the form of multiple choice questions was administered to obtain a measure of linguistic knowledge (Andringa et al., 2012). The computerized test was administered in Excel (Courier font size 15). Participants had to indicate which out of five possible answers was the correct meaning of Dutch low-frequency words, the last alternative always being "I don't know." Words were not domain-specific and each target word was embedded in a different, neutral carrier phrase. The vocabulary test consisted of 60 items. There was no time limit or pressure to complete the test. Performance was measured by test accuracy, that is, the proportion of correct answers (out of 60). Higher scores thus reflected greater vocabulary knowledge.

#### **STATISTICAL LEARNING**

#### *Materials and design*

To investigate statistical learning, we adopted the artificial grammar learning–serial reaction time (RT) paradigm (Misyak et al., 2010a). This paradigm has typically been used in studies on statistical learning in language processing and has been found to link to individual language processing abilities (Misyak et al., 2010a,b; Misyak and Christiansen, 2012). As artificial grammar learning simulates language learning processes, the task makes use of auditorily presented sound sequences such as non-words. However, as we wanted to investigate whether individuals' ability to adapt to an unfamiliar speech condition could be predicted by a general ability to implicitly detect regularities, we used visual and non-linguistic stimuli in the statistical learning task. That is, we applied a rigorous test of the relationship between statistical learning and perceptual learning by ensuring that any relationship between the two measures of learning could not be specific to auditory and linguistic processing.

Participants were presented with familiar, geometrical shapes in a 2 × 2 design on the computer screen (see **Figure 2B**), in which one shape on either side of the screen served as target and one as distractor item. Target shapes were sequentially highlighted by a visual marker and participants' task was to click as fast as possible on the highlighted target. The first target was always one on the left side of the screen (i.e., upper or lower one in the first column) and the second target was always on the right side of the screen (i.e., upper or lower one in the second column). The second target was only highlighted after the participant had clicked on the first target item. Crucially, which of the two items in the right-hand column would be highlighted was predictable on the basis of the first target [e.g., in **Figure 2B**, a *triangle* would always be followed by a *star* or a *square* (*the latter is not in the display*), but never by a *heart*].

Materials consisted of eight familiar, geometrical shapes drawn with a single, continuous black line. The shapes were divided into two grammatical subsets of four shapes each (i.e., Set 1: *triangle, hexagon, star, square;* Set 2: *arrow, circle, heart, cross*). Within each set, two items were selected to appear as first targets (i.e., Set 1: *triangle, hexagon;* Set 2: *arrow, circle*) and were always followed by one of the other two items that served as second targets (i.e., Set 1: *star, square*; Set 2: *heart, cross*). Therefore, four combinations of shapes were grammatical within each set, resulting in a total set of eight grammatical combinations (see **Figure 2A**). Target items were presented along with distractors in a rectangular grid display on the computer screen (see **Figure 2B**). Distractor items were shapes from the subset that was currently not tested and the two distractor shapes on the screen formed a grammatical combination themselves. Thus, within a grammatical trial, the transitional probability from the first to the second target was 1, as the first target could only be followed by the target from the same subset. Within the grammar, however, the transitional probability between two adjacent items was 0.5, as a target was followed by a specific successor only half of the time (i.e., a *circle* being followed by either a *heart* or a *cross*, see **Figure 2A**). Target positions were randomly assigned such that it was unpredictable whether a first or second target would be displayed in the upper or lower row of a particular column.
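The structure of the grammar can be sketched as follows. The dictionary layout and function names are hypothetical, and the transitional probability is computed under the assumption that every licensed pair occurs equally often, which holds within a block because each grammatical combination is presented once.

```python
from itertools import product

# First targets and their permitted second targets, per subset (from the text).
GRAMMAR = {
    "set1": {"first": ("triangle", "hexagon"), "second": ("star", "square")},
    "set2": {"first": ("arrow", "circle"), "second": ("heart", "cross")},
}

def grammatical_pairs(grammar):
    """All (first, second) target combinations licensed by the grammar."""
    pairs = []
    for subset in grammar.values():
        pairs.extend(product(subset["first"], subset["second"]))
    return pairs

def transitional_probability(first, second, pairs):
    """P(second | first), assuming every licensed pair occurs equally often."""
    successors = [s for f, s in pairs if f == first]
    return successors.count(second) / len(successors)

pairs = grammatical_pairs(GRAMMAR)
p_circle_heart = transitional_probability("circle", "heart", pairs)
p_triangle_heart = transitional_probability("triangle", "heart", pairs)
```

Under these assumptions the sketch yields eight grammatical pairs, a probability of 0.5 for a licensed successor, and 0 for a shape from the competing subset.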

The artificial grammar learning task was composed of blocks and split into an exposure phase, a test phase and a recovery phase. During the exposure phase, participants could learn the grammar by picking up on the co-occurrence probabilities of the shapes. In total, the exposure phase consisted of 16 grammatical blocks. Within each block, all grammatical combinations were repeated once, resulting in 128 exposure trials (8 × 16). The test phase consisted of two ungrammatical blocks (2 × 8 trials). In these ungrammatical blocks, the original grammar was reversed, such that a target was followed by targets of the other (competing) subset. Participants who implicitly learned the grammar should show a drop in performance as they would need to correct their predictions, resulting in a slowed response to the second target. This measure of learning is widely accepted in the literature on implicit learning (Janacsek and Nemeth, 2013): a drop in performance due to removing the underlying regularities can only be linked to grammar sensitivity, whereas learning measures in terms of improvement during the exposure phase cannot be teased apart into general task learning and statistical learning. Therefore, statistical learning was operationalized by the difference in task performance between the last four blocks of the exposure phase (blocks 13–16) and the subsequent ungrammatical test phase (blocks 17–18). The recovery phase again consisted of two grammatical blocks and serves as a control phase. If participants learned the grammar, by re-introducing the regularities in the recovery phase, participants' performance should not decrease any further. In total, the artificial grammar learning task thus contained 20 blocks and 160 trials (8 × 20).
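The block structure described above can be laid out in a short sketch. The names are hypothetical and the code only reproduces the counts given in the text (16 exposure, 2 test, and 2 recovery blocks of 8 trials each).

```python
# (phase name, number of blocks, grammatical?) as described in the text.
PHASES = (("exposure", 16, True), ("test", 2, False), ("recovery", 2, True))
TRIALS_PER_BLOCK = 8

def block_schedule():
    """Return a list of (block_number, phase, grammatical) tuples."""
    schedule = []
    block = 1
    for phase, n_blocks, grammatical in PHASES:
        for _ in range(n_blocks):
            schedule.append((block, phase, grammatical))
            block += 1
    return schedule

schedule = block_schedule()
total_trials = len(schedule) * TRIALS_PER_BLOCK
test_blocks = [b for b, phase, _ in schedule if phase == "test"]
```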

#### *Procedure*

The artificial grammar learning task was presented in E-prime (Schneider et al., 2002) and started with five practice trials that were all grammatical. Participants were instructed to click as quickly as possible on target shapes that were marked by a small filled red cross (10 × 10 mm) in the center of the target shape. Participants were informed that they had to click on two successive targets and that the first target would be located in the first column and the second target would be located in the second column. Each trial started with the presentation of the visual display that consisted of the four shapes and two grid lines, marking the four quadrants on the screen. At the start of each trial, the mouse cursor was located in the center of the screen. Each shape was displayed in a size of 75 × 70 mm. The visual marker appeared in the middle of the first target shape 500 ms after the onset of the visual display, and was shown until the participant clicked on the marked picture. After the participant had responded, the mouse cursor was automatically set back to the center of the screen to ensure the same distance for all click responses. The second visual marker (same red cross now marking the second target shape) appeared 500 ms after the first click. This time interval was implemented in the design to allow for prediction effects, even in the adults who had slower processing. This time interval had been successfully applied in an earlier study on implicit sequence learning in older adults (Salthouse et al., 1999). Participants could not make errors: the experiment only proceeded if a participant clicked on the appropriate target shape. Clicking on a distractor shape or outside the target picture before giving a correct click resulted in a higher RT. The intertrial interval was 500 ms. After each block, a short break of 2500 ms was implemented to avoid fatigue effects. During this break, participants saw the block number of the upcoming block and a reminder to click as quickly as possible. It took approximately 20 min to complete the task.

and the second target is always displayed on the right side of the screen. **(B)** Procedure of a grammatical trial during the exposure phase.

To assess statistical learning, we measured latencies from target highlighting to the subsequent mouse response. Facilitation scores were calculated to index individuals' sensitivity to implicit regularities. The facilitation score was calculated by dividing the RT to the first, unpredictable target within a trial by the RT to the second, predictable target within the same trial. Thus, RT to the first target served as baseline performance within each trial. This was important to minimize biases of task learning and motor performance, particularly for those older adults who may have had little practice in using a computer mouse. During the course of the experiment, RTs may generally get faster as older adults get more experienced in using a mouse. By implementing a new baseline within each new trial, such motor learning should be accounted for. If participants cannot predict which target will be highlighted next, their RTs to both targets within a trial will be similar and will result in a facilitation score of 1. During the exposure phase, learning manifests itself in an increasing facilitation score. That is, if participants learn to predict the second target, RTs to the second target will be shorter than RTs to the first, unpredictable target.
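A minimal sketch of the facilitation score, and of a simple descriptive learning index based on it, is given below. The function names and RT values are hypothetical. The learning index here is a plain difference of means between blocks 13–16 and blocks 17–18 and is only illustrative; it is not the trial-level mixed-effects analysis reported in the Data Analysis section.

```python
from statistics import mean

def facilitation(rt_first_ms, rt_second_ms):
    """RT to the first (unpredictable) target divided by RT to the second."""
    return rt_first_ms / rt_second_ms

def learning_effect(trials):
    """Mean facilitation in blocks 13-16 minus mean facilitation in blocks 17-18.

    `trials` is an iterable of (block, rt_first_ms, rt_second_ms) tuples.
    """
    exposure = [facilitation(a, b) for blk, a, b in trials if 13 <= blk <= 16]
    test = [facilitation(a, b) for blk, a, b in trials if 17 <= blk <= 18]
    return mean(exposure) - mean(test)

# Hypothetical RTs (ms): faster second clicks at the end of exposure,
# no advantage once the regularities are removed.
trials = [(13, 600, 500), (16, 600, 500), (17, 600, 600), (18, 600, 600)]
effect = learning_effect(trials)
```

A positive value of `effect` corresponds to the expected drop in facilitation when the grammar is violated.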

#### **PERCEPTUAL LEARNING**

#### *Materials and design*

Sixty Dutch sentences were noise-vocoded to create an unfamiliar speech condition to which participants needed to adapt. In noise-vocoded speech, frequency information in the signal is replaced by noise while preserving the original amplitude structure over time. The speech signal was split into multiple non-overlapping frequency bands, which approximately matched equal distances on the basilar membrane (Greenwood, 1990). From each frequency band the smoothed amplitude envelope was derived and imposed on wide-band noise in the same frequency range. In a last step, these modulated noise bands were recombined, creating a speech signal that sounded like a harsh robot voice. All signal editing was done in Praat (Boersma and Weenink, 2011).

An important characteristic of noise-vocoded speech is that the comprehension level of the speech signal can easily be manipulated by varying the number of frequency bands. The more frequency bands are used to decompose the speech signal, the more detail of the original temporal and amplitude structure is preserved and the more intelligible the speech signal is. Previous research has shown that 10 frequency bands are enough for naïve listeners to immediately understand more than 90% of noise-vocoded speech (Sheldon et al., 2008). However, when presented with speech noise-vocoded with fewer bands, participants only reach this level of performance after a certain amount of exposure.

The maximal amount of learning or intelligibility improvement can be observed if the starting level is neither too high nor too low, so that sufficient information can be derived from the acoustic materials to initiate learning while at the same time allowing for sizeable improvement (see Peelle and Wingfield, 2005). We initially tried to provide participants with an individual starting level from which they could still show improvement. In a separate pilot study, we therefore assigned 23 older adults to a specific noise-vocoding condition (i.e., 4 or 6 bands) on the basis of their performance on a speech reception threshold (SRT) task in noise. Inspection of the data showed that participants' starting level clustered according to band condition. Older adults in the 4 band condition showed a very low starting level (on average they understood only 10% of the sentences correctly), whereas older adults in the 6 band condition showed a very high starting level (on average they already understood 65% of the sentences). Relatedly, the correlation between SRT result and initial performance on the noise-vocoded speech was weak. As our attempt to individualize starting levels on the basis of a speech-in-noise task was not successful, we aimed to provide a roughly similar starting level for both age groups. Based on the results of the pilot study, we decided to present older adults with speech that was vocoded with 5 bands (corner values using 5 frequency bands: 50-280-757-1742-3781-8000 Hz). As younger adults understand more when being exposed to the same degradation as older adults (Peelle and Wingfield, 2005; Sheldon et al., 2008), we presented younger adults with four-band speech (corner values using 4 frequency bands: 50-369-1161-3125-8000 Hz), thus, a more difficult speech condition than older adults (cf. Golomb et al., 2007). Consequently, we were able to see sizeable and comparable amounts of improvement over the course of exposure in both age groups.
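The band corner frequencies can be related to equal distances on the basilar membrane with a short sketch. It assumes the Greenwood (1990) frequency-position function in the common form F = A(10^(ax) − k) with A = 165.4, a = 0.06 per mm, and k = 1. These constants are an assumption, as the text does not state which parameter values were used, and the function names are hypothetical. Under this assumption, the computed edges coincide with the corner values reported above for both the 4-band and the 5-band condition.

```python
import math

# Assumed Greenwood (1990) constants for the human cochlea (position in mm).
A, ALPHA, K = 165.4, 0.06, 1.0

def freq_to_place(freq_hz):
    """Basilar-membrane position (mm from the apex) for a frequency in Hz."""
    return math.log10(freq_hz / A + K) / ALPHA

def place_to_freq(place_mm):
    """Inverse map: frequency in Hz at a basilar-membrane position."""
    return A * (10 ** (ALPHA * place_mm) - K)

def band_edges(low_hz, high_hz, n_bands):
    """Corner frequencies of n_bands bands equally spaced along the membrane."""
    lo, hi = freq_to_place(low_hz), freq_to_place(high_hz)
    step = (hi - lo) / n_bands
    return [round(place_to_freq(lo + i * step)) for i in range(n_bands + 1)]

print(band_edges(50, 8000, 4))  # [50, 369, 1161, 3125, 8000]
print(band_edges(50, 8000, 5))  # [50, 280, 757, 1742, 3781, 8000]
```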

Sentences were selected from audiological test materials (Versfeld et al., 2000) and were all produced by the same, male speaker. Each sentence had a length of eight or nine syllables and contained four keywords. Keywords in the selected set of sentences included a noun, verb and preposition. The fourth keyword was an adjective, adverb or a second noun. An example sentence "*De sneeuw glinstert in het maanlicht*" ("*The snow is glistening in the moonlight*") contained the keywords "*sneeuw*," "*glinstert*," "*in*," and "*maanlicht*." Note that five additional sentences were selected for practice purposes, so that there was no overlap in sentence content between practice and test items. Practice sentences had the same length as test items (a list of all sentences used in the current study is provided in Supplementary Material).

#### *Procedure*

An auditory sentence identification task was administered to investigate perceptual learning using the experiment program E-prime (Schneider et al., 2002). Participants listened to the noise-vocoded sentences and were asked to identify and repeat these sentences. They were encouraged to guess if they were unsure. Participants were first presented with five practice trials. First, participants listened to three clear sentences to familiarize them with the task and the speaker. Moreover, these practice trials were used to check whether participants' memory span was sufficient to perform the task given clear input, which was the case for all participants. Then participants listened to two sentences that were noise-vocoded with only two frequency bands to present them with the type of degradation. This more difficult condition with fewer bands was chosen to make sure that no learning could occur during the practice phase (e.g., Ahissar and Hochstein, 1997; Pavlovskaya and Hochstein, 2004; Liu et al., 2008). Practice trials were identical for all participants and were presented in the same order. In contrast, the 60 test sentences were presented in random order for each participant, so that observed learning effects would be independent of inherent intelligibility differences between sentences (e.g., due to differences in semantic predictability). Participants heard a short (125 ms) 3.5 kHz tone to call their attention to the upcoming stimulus 500 ms before sentence onset. After each sentence, the researcher scored the number of correctly repeated keywords (0–4) online. The next trial started immediately after the researcher had confirmed the scoring of the previous trial. Auditory stimuli were presented binaurally via dynamic closed, circumaural headphones (Sennheiser HD 215), at a level of 85 dB SPL. Participants' answers were audiorecorded to allow for later checking of their responses.

#### **EXPERIMENTAL PROCEDURE**

Measures of younger adults were obtained in a single experimental session. Testing was spread over two sessions for the older adults, as they also participated in a different study. During the first session, older adults performed the background measures described above. The second session consisted of the statistical learning and the perceptual learning task and followed within a month of the first session. In both age groups, tasks were presented in a fixed order. Although the order differed between younger and older adults, the statistical learning task was always presented before the perceptual learning task. All participants were tested individually in a sound-attenuating booth to minimize distraction. Before the start of each task, participants received verbal and printed task instructions. Participants could ask questions at any time. Between tasks, participants were encouraged to take short breaks.

#### **DATA ANALYSIS**

#### *Statistical modeling*

To assess learning performance, we implemented linear mixed-effects models using the lmer function from the lme4 package (Bates et al., 2012) in R (version 2.15.1). In this way, both participants and items could be assessed as random factors and the maximal random slope structure of models could be defined to reduce the probability of a type 1 error (Barr et al., 2013). First, we modeled statistical and perceptual learning performance as a function of age group to assess whether younger and older adults differed in their learning performance. Second, we analyzed the contributions of individual abilities in learning separately within each group as our focus was on individual differences within the respective age groups. Thus, the modeling process that is described here was applied to the statistical learning data and to the perceptual learning data of both age groups.

Linear regression models are based on the assumption that the predictors included in the analysis do not show collinearity (Baayen, 2012). Although some predictor measures were intercorrelated (see Section Performance on Background Measures), we did not control for these intercorrelations for two reasons. First, most correlations explained less than 20% of the variance in the correlated measure (i.e., with correlation coefficients below 0.45). Only age and processing speed in the older adults were moderately correlated (*r* = −0.562). Second, simultaneous inclusion of correlated measures in the analysis has been shown to provide a more reliable interpretation of estimates than inclusion of residualized variables (York, 2012; Wurm and Fisicaro, 2014).
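The link between the correlation coefficients and the proportion of explained variance quoted above follows from squaring the coefficient. As a brief worked illustration using the two values given in the text:

```latex
$$
r^2 = 0.45^2 \approx 0.20, \qquad r^2 = (-0.562)^2 \approx 0.32
$$
```

That is, a coefficient of 0.45 corresponds to about 20% shared variance, whereas the correlation between age and processing speed in the older adults corresponds to roughly 32%.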

Statistical learning was defined as a drop in performance in the test phase (blocks 17–18) compared to the performance at the end of the exposure phase (blocks 13–16). Therefore, in models of statistical learning, the fixed categorical variable phase (exposure vs. test phase) was the variable of interest to predict individuals' facilitation scores and to indicate learning. Additionally, two control variables and the corresponding two- and three-way interactions with phase were included in models of statistical learning. Control variables were the categorical variable "first target position" (was the first target displayed in the upper or lower row of the left column?) and the categorical variable "target alignment" (were the two targets in a trial aligned horizontally or diagonally?). Given the directionality of Western writing systems, we expected a first target position effect as participants may click faster on a target in the upper left quadrant than in the lower left quadrant. We also expected the drop in facilitation score during the test phase to be less distinct in trials with the first target appearing in the upper left quadrant, such that target position was expected to interact with the amount of learning. Moreover, the alignment of targets was thought to affect second target RTs. Note that the experimental program always set the mouse back to the center of the screen after each click. Despite this automatic mouse reset, participants tended to also move the mouse back to the center of the screen. By doing that, participants unintentionally initiated a movement toward the diagonal shape. Therefore, we assumed that participants would be faster in responding to the second target if targets were arranged diagonally rather than horizontally (see **Figure 2**), which would result in higher facilitation scores. 
This direction effect may interact with the effect of removing the regularities, such that the grammaticality effect would be decreased for the diagonal movements.

In models of perceptual learning, the number of correctly repeated keywords per sentence served as the index of recognition performance and was entered as a numerical dependent variable into the model. As perceptual learning was defined as the improvement in speech understanding over exposure, we split the experiment into six blocks, each containing 10 sentences, and added block as a numerical measure of exposure to the model. However, before block was included in the analysis, we performed a log-transformation of block, as perceptual learning has typically been described by fast initial learning that levels off with increasing exposure (see also **Figure 4**). The transformation of block therefore provided us with an index of exposure that took this non-linear improvement curve into account and converted the improvement over exposure into a linear scale<sup>1</sup>.
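The effect of the log-transformation can be illustrated with a short sketch. It assumes the natural logarithm (the base is not specified in the text) and simply shows that successive blocks lie closer together on the transformed scale, which mirrors fast initial learning that levels off.

```python
import math

# The six exposure blocks of the perceptual learning task.
blocks = [1, 2, 3, 4, 5, 6]

# Assumption: natural logarithm; another base would only rescale the index.
log_blocks = [math.log(b) for b in blocks]

# Distance between neighbouring blocks on the transformed scale.
steps = [later - earlier for earlier, later in zip(log_blocks, log_blocks[1:])]

for block, value in zip(blocks, log_blocks):
    print(f"block {block}: log(block) = {value:.3f}")
print("step sizes:", [round(s, 3) for s in steps])
```

Because the step sizes shrink with each block, a linear effect of log-transformed block corresponds to a decelerating improvement curve over raw block number.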

In the first step of the analysis, we identified the maximal random slope structure of our models to allow for the fact that different participants or items may vary with regard to how sensitive they are with respect to the variables at hand (Cunnings, 2012; Barr et al., 2013): if, e.g., vocabulary knowledge only matters for the understanding of some sentences in the perceptual learning task but not for others, the effect of vocabulary should be modeled individually for each sentence and removed from the fixed effect structure. Changes in the random-slope structure were evaluated by means of the Akaike information criterion (AIC). The model with the lower AIC value (difference ≥2) and, therefore, better model fit was retained. As we were interested in the predictors of individual amount of learning, a random participant slope of phase was included in all models of statistical learning. Accordingly, in models of perceptual learning, a random participant slope of block was inserted. That is, all models calculated the learning effect (i.e., the effect of phase in statistical learning and the effect of block in perceptual learning, respectively) individually for each participant.
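The AIC-based decision rule for the random slope structure can be sketched as follows. This is an illustration rather than the analysis code used in the study: the function names are hypothetical, and the behaviour for AIC differences smaller than 2 (keeping the simpler structure) is an assumption, as the text only specifies that the model with the lower AIC was retained when the difference was at least 2.

```python
def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


def select_by_aic(aic_simpler: float, aic_richer: float,
                  min_difference: float = 2.0) -> str:
    """Return which random slope structure to retain.

    Assumption: the richer structure is kept only if it lowers the AIC
    by at least `min_difference`; otherwise the simpler one is kept.
    """
    if aic_simpler - aic_richer >= min_difference:
        return "richer"
    return "simpler"


# Hypothetical AIC values for a model without and with an extra random slope.
print(select_by_aic(aic_simpler=1000.0, aic_richer=997.0))
print(select_by_aic(aic_simpler=1000.0, aic_richer=999.0))
```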

After determining the maximal random slope structure, we first performed an age group comparison by testing the interactions between age group and the respective index of learning (i.e., phase or block). As younger and older adults may differ with respect to the effects of target position and target alignment on their learning performance, all possible two-way interactions between grammaticality, age group, target alignment and first target position and the three-way interactions between (1) age group, grammaticality and target position and between (2) age group, grammaticality and target alignment were included in the age group comparison of statistical learning.

In a second step, we assessed which cognitive abilities may facilitate learning within the separate age groups. In the statistical learning analysis, the best model that explained the facilitation score on the basis of the interactions between phase, target position and target alignment was taken as initial model. In the perceptual learning analysis, the initial model only contained block. Then, measures of age (in older adults only), hearing sensitivity (in models of perceptual learning only), statistical learning performance (in models of perceptual learning only), attention switching control, working memory, processing speed and vocabulary (all evaluated as numerical covariates) and their interaction with phase (in models of statistical learning) or with block (in models of perceptual learning) were added simultaneously to the initial model. This method of forced entry was preferred, as we had no prior theoretical assumptions about the relative importance of each predictor and aimed to identify those predictors that had unique explanatory power in predicting facilitation scores. All individual predictor measures were centered around their mean prior to inclusion. After we had entered the individual predictor measures, we adopted a backward stepwise selection procedure, in which first interactions and then predictors were removed if they did not attain significance at the 5% level. Each change in the fixed effect structure was evaluated in terms of loss of model fit by means of a likelihood ratio test. Results of the analysis are indicated in estimated absolute effect sizes (β), standard errors, *t*-values and *p*-values. Note, however, that the current version of the lme4 package does not report *p*-values for *t*-tests in models with a maximal random slope structure, as it is presently unclear how to calculate the appropriate number of degrees of freedom (Baayen, 2012). 
Reported *p*-values were, therefore, derived by performing a likelihood ratio test between a model that included the specific fixed effect or interaction and a model that did not while all other model parameters were kept constant. That is, *p*-values actually reflect the significance of loss in model fit if the effect or interaction was excluded from the model.
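The logic of deriving a *p*-value from a likelihood ratio test can be sketched as below. The sketch assumes that the two nested models differ by exactly one parameter (e.g., a single fixed effect or a single interaction term), so that the test statistic is compared against a chi-square distribution with one degree of freedom. The log-likelihood values are hypothetical and the function name is introduced for illustration only.

```python
import math


def likelihood_ratio_test_1df(loglik_full: float, loglik_reduced: float):
    """Likelihood ratio test for two nested models differing by one parameter.

    Returns the test statistic and its p-value under a chi-square
    distribution with 1 degree of freedom.
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    statistic = max(statistic, 0.0)  # guard against tiny negative values
    # For 1 df, the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods of a model with and without one fixed effect.
stat, p = likelihood_ratio_test_1df(loglik_full=-1520.3, loglik_reduced=-1523.1)
print(f"chi-square(1) = {stat:.2f}, p = {p:.4f}")
```

A small *p*-value indicates that removing the term leads to a significant loss in model fit, which is how the reported *p*-values should be read.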

#### *Individual measure of statistical learning performance*

As we wanted to assess whether individual statistical learning performance predicts adaptation to noise-vocoded speech, we needed an index of statistical learning ability for each participant. We derived this index by calculating the random participant slopes of phase (individual adjustments to the general slope) on the basis of the most parsimonious model, in which facilitation scores were predicted only by phase and the control variables (i.e., we derived the measure of statistical learning ability before we included effects of individual predictor measures in the above-mentioned analysis).

Thus, we determined an individual value for each participant with which the general effect of phase (in the fixed structure of the model) had to be adjusted to resemble his/her individual learning effect. The lower the value, the more negative was a participant's slope when changing from the end of the exposure phase to the test phase, indicating a steeper drop in facilitation score and, hence, more statistical learning.
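As a minimal sketch of how such an index can be read, the individual learning effect is the general (fixed) effect of phase plus the participant's random adjustment. All numbers and participant labels below are hypothetical and serve only to illustrate the direction of the index.

```python
def individual_phase_effects(fixed_phase_effect: float, adjustments: dict) -> dict:
    """Combine the fixed effect of phase with each participant's adjustment."""
    return {pid: fixed_phase_effect + adj for pid, adj in adjustments.items()}


def rank_by_learning(adjustments: dict) -> list:
    """Order participants from most to least statistical learning.

    Lower (more negative) adjustments mean a steeper drop in the test phase.
    """
    return sorted(adjustments, key=adjustments.get)


# Hypothetical fixed effect of phase and random participant adjustments.
fixed_effect = -0.14
random_adjustments = {"p01": -0.03, "p02": 0.00, "p03": 0.04}

print(individual_phase_effects(fixed_effect, random_adjustments))
print(rank_by_learning(random_adjustments))
```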

## **RESULTS**

#### **PERFORMANCE ON BACKGROUND MEASURES**

Mean performance of younger and older adults and age group differences on all background measures are displayed in **Table 1**. As expected, hearing acuity was better in younger adults (i.e., thresholds were lower) than in older adults. Moreover, younger adults showed faster processing and larger memory capacity than older adults. On average, older adults were able to correctly repeat 5.62 test sequences in the working memory test, which corresponds to a mean digit span of four. Younger adults correctly repeated 8.08 test sequences, corresponding to a mean digit span of five. No difference could be observed in attention switching control between age groups. Older adults outperformed younger adults on the vocabulary test. However, older adults also showed relatively little variation on the vocabulary test [coefficient of variation (*SD/M*) = 6.9%]. Statistical testing confirmed that the variance in older adults' vocabulary scores was significantly lower than the variability in younger adults' data (coefficient of variation = 11.8%) (Levene's Test: *F* = 4.15, *df*<sub>1</sub> = 1, *df*<sub>2</sub> = 130, *p* = 0.044).

<sup>1</sup>Note that we ran a second analysis in which we kept the original index of block. The analysis resulted in the same best models and showed the same effects as the models reported here. However, models that included the log-transformed index of block showed an increased model fit, indicating nonlinear learning behavior.
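The coefficient of variation used to compare the spread of vocabulary scores across age groups is the standard deviation divided by the mean. The sketch below assumes the sample standard deviation and uses hypothetical scores, not the data of the study.

```python
import statistics


def coefficient_of_variation(scores) -> float:
    """Standard deviation divided by the mean (SD/M)."""
    return statistics.stdev(scores) / statistics.mean(scores)


# Hypothetical vocabulary scores for two groups with different spread.
narrow_group = [0.88, 0.90, 0.92, 0.91, 0.89]
wide_group = [0.70, 0.85, 0.95, 0.80, 0.90]

print(f"narrow group: CV = {coefficient_of_variation(narrow_group):.1%}")
print(f"wide group:   CV = {coefficient_of_variation(wide_group):.1%}")
```

Because the measure is scaled by the mean, it allows the variability of groups with different average scores to be compared directly.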

**Table 1 | Mean performance per age group and age group differences on cognitive, linguistic, and auditory measures.**

*t-tests tested two-tailed.*

Intercorrelations between background measures within each age group are reported in **Table 2**. In younger adults, significant correlations were observed between the cognitive measures of working memory and processing speed and between working memory and vocabulary. The same intercorrelations were also observed in the older adults. As expected, age correlated with hearing sensitivity and with processing speed within the older sample: older–older participants generally had poorer hearing and slower processing than younger–older participants. Moreover, processing speed was related to hearing sensitivity in older adults. However, when both measures (i.e., speed and hearing) were controlled for age, this correlation was no longer significant (*r* = −0.128, *p* = 0.279, *df* = 71).
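Controlling a correlation between two measures for a third variable amounts to a first-order partial correlation. The sketch below shows the standard formula, under the assumption that this is the computation behind the age-controlled correlation between speed and hearing. Only the age and speed correlation (−0.562) is taken from the text; the other two input values are hypothetical.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y after removing the influence of z."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator


# x = processing speed, y = hearing sensitivity, z = age.
r_speed_age = -0.562      # reported in the text
r_hearing_age = 0.40      # hypothetical
r_speed_hearing = -0.30   # hypothetical

r_partial = partial_correlation(r_speed_hearing, r_speed_age, r_hearing_age)
print(f"speed-hearing correlation controlled for age: {r_partial:.3f}")
```

If much of the raw correlation is carried by the shared dependence on age, the partial correlation shrinks towards zero, which is the pattern described for the older adults.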

#### **STATISTICAL LEARNING**

Valid facilitation scores were restricted to those within 2.5 *SD* from the mean facilitation score within each age group. **Table 3** shows the average performance of younger and older adults on the statistical learning task in terms of response times and facilitation score. As expected, younger adults were significantly faster in responding to the first target (*t* = −84.30, *df* = 23249.45, *p* < 0.001) and to the second target (*t* = −104.34, *df* = 23585.75, *p* < 0.001) than older adults. Note that all responses in the statistical learning task were accurate as the experimental task only proceeded when a participant had clicked on the correct shape. **Figure 3A** shows the average facilitation scores for both age groups over block<sup>2</sup>. **Figure 3B** displays the mean facilitation scores at the end of the exposure phase, in the test phase and in the recovery phase to illustrate the learning effect. Moreover, the range of statistical learning that was observed within each age group is displayed in **Figure 3C**. Estimates of the best model within each age group are displayed in **Table 4**.

<sup>2</sup>Note that the drop in performance that can be observed in the younger adults during the exposure phase (see **Figure 3A**, blocks 5 and 6) is not significant (beta = −0.018, *SE* = 0.010, *t* = −1.71, *p* = 0.088). This suggests that there was no general drop in performance across the group of younger adults. Likewise, the spread of the individual slope data (*M* = −0.018, *SD* = 0.019, Min = −0.078, Max = 0.027) also includes positive slope values (indicating

**Table 3 | Mean response times (in ms) and facilitation scores of younger adults (*n* = 59) and older adults (*n* = 73) on the statistical learning task.**

**Table 2 | Pearson's correlation coefficients between measures of cognitive, linguistic, and auditory functioning per age group.**

*\*p < 0.05, \*\*p < 0.01 (tested two-tailed).*

**FIGURE 3 | Performance on the statistical learning task.** A drop in facilitation score from the end of the exposure phase (blocks 7–8) to the test phase (block 9) indicates learning. Error bars indicate two standard errors from the mean. **(A)** Mean statistical learning performance per age group and block. The area between the dotted lines represents where the effect of removing the underlying regularities should be observed. **(B)** Mean statistical learning performance per age group and phase. **(C)** Boxplot of statistical learning performance in younger and older adults (individual exposure-to-test slopes from the statistical model). More negative slopes reflect more learning.
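The validity criterion for facilitation scores (within 2.5 *SD* of the mean of the respective age group) can be sketched as follows. The sketch assumes the sample standard deviation and would be applied separately to each age group; the scores are hypothetical.

```python
import statistics


def keep_valid_scores(scores, n_sd: float = 2.5):
    """Keep scores that lie within n_sd standard deviations of the group mean."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [s for s in scores if abs(s - mean) <= n_sd * sd]


# Hypothetical facilitation scores of one age group, including one extreme value.
group_scores = [0.10, 0.20, 0.15, 0.12, 0.18, 0.14, 0.16,
                0.13, 0.17, 0.11, 0.19, 0.15, 5.00]

valid = keep_valid_scores(group_scores)
print(f"{len(group_scores) - len(valid)} score(s) excluded")
```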

The age group comparison showed a significant effect of phase (beta = −0.137, *SE* = 0.045, *t* = −3.03, *p* = 0.002), indicating statistical learning in the group of younger adults, who were placed on the intercept. This effect of phase was modified by age group (beta = 0.125, *SE* = 0.061, *t* = 2.06, *p* = 0.039), suggesting that older adults learned less than younger adults and (given the almost equal beta values) that older adults were not affected by removal of the underlying regularities. This interaction between age group and phase tended to be less pronounced in diagonal trials (beta = −0.070, *SE* = 0.036, *t* = −1.91, *p* = 0.056). A fixed effect of age group indicated that, overall, older adults showed a lower facilitation score than younger adults (beta = −0.189, *SE* = 0.044, *t* = −4.28, *p* < 0.001). This effect of age group was influenced by both control variables. That is, the difference in facilitation score between younger and older adults was less distinct in diagonal (beta = 0.065, *SE* = 0.025, *t* = 2.54, *p* = 0.011) and in upper left trials (beta = 0.060, *SE* = 0.027, *t* = 2.25, *p* = 0.025). As expected, facilitation scores were higher in diagonal trials (beta = 0.120, *SE* = 0.019, *t* = 6.33, *p* < 0.001) and lower in trials in which the first target appeared upper left (beta = −0.100, *SE* = 0.020, *t* = −4.98, *p* < 0.001). Moreover, the effect of phase was modified by both target position and by target alignment, implying that effects of statistical learning were less pronounced in diagonal trials (beta = 0.062, *SE* = 0.027, *t* = 2.29, *p* = 0.022) and in trials with an upper left target (beta = 0.064, *SE* = 0.027, *t* = 2.38, *p* = 0.017). The random slope structure indicated that participants differed in the degree to which they were affected by target position and by target alignment.
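The remark about the almost equal beta values can be made explicit with a small calculation. The sketch assumes treatment (dummy) coding with younger adults as the reference level, as implied by their placement on the intercept, and considers the reference levels of the two control variables only. Under these assumptions, the effect of phase in older adults is the sum of the effect of phase and the phase by age group interaction.

```python
# Estimates reported for the age group comparison.
PHASE_EFFECT_YOUNGER = -0.137   # effect of phase for the reference group
PHASE_BY_AGE_GROUP = 0.125      # adjustment of the phase effect for older adults


def phase_effect(is_older: bool) -> float:
    """Effect of phase for a given age group under dummy coding (assumed)."""
    return PHASE_EFFECT_YOUNGER + (PHASE_BY_AGE_GROUP if is_older else 0.0)


print(f"younger adults: {phase_effect(False):+.3f}")
print(f"older adults:   {phase_effect(True):+.3f}")
```

The resulting estimate for older adults is close to zero, which is why the interaction is interpreted as an absence of a test phase effect in this group.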

In the younger adults, the best-fitting model showed a significant effect of phase: the facilitation score of younger adults was lower in the test phase than at the end of the exposure phase, indicating that younger adults were affected by removing the underlying regularities. However, none of the individual listener characteristics interacted significantly with test phase, suggesting that the amount of statistical learning was not associated with any of the selected measures of cognitive or linguistic abilities. Only processing speed showed a significant fixed effect on facilitation score, indicating that participants with higher processing speed had higher facilitation scores at the end of the exposure phase. As expected, facilitation scores were lower if the first target was displayed upper left and higher if targets were aligned diagonally. Both effects modulated learning in the anticipated direction: the effect of statistical learning was smaller in diagonal trials and in trials in which the first target was displayed upper left. In addition to the random slope of phase, the maximal random slope structure included random effects of first target position and target alignment on participant. Inclusion of these effects suggests that younger participants differed in the degree to which they were affected by target alignment, that is, whether they had to move the cursor horizontally or diagonally. Removing the random slope of phase within subject from the maximal random slope structure did not result in a significant loss in model fit, indicating that the amount of statistical learning did not differ considerably among younger adults (see **Figure 3C**).

Overall, older adults showed no significant effect of test phase, suggesting that they generally did not pick up the subtle regularities in the input. Age was the only individual background measure that predicted performance: the older the participants were, the lower was their facilitation score at the end of the exposure phase. In older adults, facilitation score was mainly influenced by the control variables. That is, diagonal alignment of targets enhanced facilitation scores and upper left position of the first target decreased facilitation benefit. A significant interaction between target position and target alignment indicated that effects of one control variable were modified by the other control variable: the effect of the first target being located upper left was smaller when participants could make a diagonal mouse movement to the second target and, conversely, the benefit in facilitation score based on a diagonal movement was decreased when the first target was displayed in the upper left corner of the screen. The maximal random slope structure showed that older adults differed in the degree to which they were affected by changes in target position (random slope of first target position within subject) and target alignment (random slope of target alignment within subject). However, in modeling the statistical learning data of the older adults, we had kept in a random slope of phase to allow that participants may vary in how much their performance was affected by removing the regularities (we also needed this random slope parameter as the individual measure of statistical learning). Importantly, inclusion of this random effect of phase did not increase the model fit, implying that older participants did not differ much in their sensitivity to statistical regularities: they were all relatively insensitive to the probabilistic sequence information. Note that older adults continued to show increased facilitation throughout the exposure phase (cf. **Figure 3A**). As their performance was unaffected by the removal of the underlying regularities in the test phase, this suggests that the improvement over block in older adults reflects effects of task learning rather than effects of statistical learning.

improvement, rather than decreased performance). Moreover, a paired samples *t*-test shows that the size of the unexpected drop in the exposure phase is significantly smaller than the drop in the test phase that is considered to reflect learning (*t* = 17.91, *df* = 58, *p* < 0.001).

#### **PERCEPTUAL LEARNING**

As we wanted to include statistical learning performance as a predictor in the analyses of the perceptual learning data alongside the auditory and cognitive measures, we checked for intercorrelations between statistical learning ability and other individual background measures. In the older adults, no correlations were observed. In the younger adults, intercorrelations between statistical learning performance and both working memory (*r* = −0.263, *p* = 0.044; rho = −0.297, *df* = 57, *p* = 0.022) and information processing speed (*r* = −0.279, *p* = 0.032; rho = −0.223, *df* = 57, *p* = 0.089) were significant: more learning was associated with better working memory and with higher processing speed.

In **Figure 4**, the average recognition score per block is displayed to illustrate perceptual learning of the noise-vocoded speech within age group. Moreover, **Figure 4** shows the range of perceptual learning that could be observed within each age group. Although younger adults were presented with a more difficult noise-vocoding condition (4 bands) than older adults (5 bands) and showed a lower starting performance, both age groups showed similar progress in perceptual learning. This indicates that speech conditions were appropriately selected to elicit sizeable and comparable amounts of improvement over the course of exposure in both age groups. Estimates of the best model to predict sentence identification performance within each age group are displayed in **Table 5**.

**FIGURE 4 | Performance on the perceptual learning task.** Error bars indicate two standard errors from the mean. **(A)** Mean improvement in speech understanding per age group over block. **(B)** Improvement in speech understanding performance (in %) relative to baseline level. **(C)** Box plot of perceptual learning performance in younger and older adults (individual block slopes from the statistical model). More positive slopes reflect more learning.

*n.s. = p > 0.05.*

The age group comparison showed a significant effect of block (beta = 0.710, *SE* = 0.034, *t* = 20.78, *p* < 0.001) that was not modified by age group (beta = −0.071, *SE* = 0.046, *t* = −1.56, *p* = 0.120), indicating that both age groups showed a similar amount of perceptual learning over the course of the experiment<sup>3</sup>. As older adults were presented with an easier condition (5 instead of 4 band vocoded speech), a fixed effect of age group showed that older adults repeated more key words correctly than younger adults (beta = 0.971, *SE* = 0.103, *t* = 9.41, *p* < 0.001). Our results suggest that we were successful in providing older and younger adults with a starting level that allowed for a comparable amount of perceptual learning within both age groups.

<sup>3</sup>Note that we also performed an age group comparison including the data of the pilot study. In this analysis, we compared younger adults' performance on the 4 bands speech (placed on the intercept) to older adults' performance across the different band conditions (i.e., 4 bands, 5 bands and 6 bands). In the 4 band condition, younger adults showed a higher starting performance than older adults (beta = −0.639, *SE* = 0.160, *t* = 3.99 and *p* < 0.001). More importantly, in the 4 band condition, we found an interaction between age group and improvement over blocks. That is, older adults showed a significantly smaller effect of block and, hence, less learning, than younger adults (beta = −0.098, *SE* = 0.026, *t* = −3.80 and *p* < 0.001). This result emphasizes that it is not possible to elicit similar amounts of perceptual learning in the two age groups by presenting younger and older adults with the same signal degradation condition.

In younger adults, none of the predictor measures showed a fixed effect, suggesting that none of the predictor measures could be used to predict initial speech recognition performance (i.e., for block 1 performance, being on the intercept). The best fitting model showed that younger participants identified more keywords correctly with increasing exposure over blocks, indicating that they generally adapted to noise-vocoded speech. This effect of perceptual learning was modified by statistical learning ability: the more participants had picked up the implicit regularities in the statistical learning task (and thus the steeper their drop in performance in the test phase), the more they improved in understanding noise-vocoded speech. This result provides first evidence that perceptual learning and statistical learning are associated. Further, the effect of perceptual learning was modified by vocabulary knowledge: younger adults who had greater vocabulary knowledge showed faster speech adaptation over blocks, underscoring the involvement of linguistic knowledge in perceptual learning of speech. Note that we had excluded an interaction between block and processing speed during the modeling process, as its inclusion led only to a marginal improvement of model fit. This marginal interaction suggested that higher processing speed tended to be associated with faster adaptation. The maximal random slope structure included the effect of block within subject. Removing this effect from the maximal random slope structure reduced the model fit significantly, indicating that individuals differed considerably in perceptual learning ability. As random slopes of individual predictor measures within items did not improve the model fit, this indicated that the effects of predictor measures could be generalized across sentences.

In the older adults, initial sentence identification performance was associated with hearing sensitivity and processing speed: hearing loss considerably affected initial speech understanding, whereas those with higher processing speed showed better initial speech recognition performance. Like the younger adults, older adults showed perceptual learning of noise-vocoded speech, which was indicated by a significant improvement in identification performance over blocks. This block effect was modified by age, indicating that the older participants within the older age group improved less over the course of exposure than the younger participants within that group. As age in the older adult sample was intercorrelated with processing speed and hearing sensitivity (see intercorrelations in **Table 2**), we also investigated whether either variable would have surfaced as a predictor for adaptation if we left out age. The variance in amount of perceptual learning that was assigned to age was not taken over by any of the other predictors included in the analysis. This indicates that the effect of age explains unique variance in perceptual learning performance that is not captured by the included cognitive and perceptual predictors. Importantly, statistical learning ability did not predict the amount of improvement over the course of the experiment. The maximal random slope structure included effects of age and hearing sensitivity on item, suggesting that the effects of age and hearing sensitivity on recall of noise-vocoded sentences differed across sentences. That is, hearing and age affected speech understanding of some sentences more than of others, in addition to the general impact these predictors had on sentence recall. Moreover, inclusion of the random effect of block within participant significantly improved the model fit, implying that older participants differed in their improvement in understanding noise-vocoded speech over the course of exposure.

## **DISCUSSION**

This study investigated the contribution of general cognitive abilities to listeners' capacity to adapt to novel speech conditions. In order to gain more insight into individual abilities associated with adaptation to unfamiliar speech across the life span, we tested both younger and older adults. Specifically, we aimed to test the hypothesis that listeners' improvement in understanding unfamiliar types of speech could be predicted from individual differences in statistical learning ability and in general cognitive skills.

The ability to implicitly learn has been argued to remain stable over the life span (Midford and Kirsner, 2005). In line with this, several studies reported that older adults are sensitive to probabilistic sequences (Salthouse et al., 1999; Negash et al., 2003; Simon et al., 2011; Campbell et al., 2012) and found the ability to adapt to novel speech conditions to be preserved in older adults (Peelle and Wingfield, 2005; Golomb et al., 2007; Adank and Janse, 2010; Gordon-Salant et al., 2010). Our findings support the notion that perceptual learning ability remains stable over the life span, as both younger and older listeners showed significant improvement in understanding noise-vocoded speech over exposure. Moreover, the observed amount of learning was comparable in both age groups. This suggests that older adults can reach the same amount of perceptual learning as younger adults given better starting level intelligibility. However, only younger adults were sensitive to statistical regularities in the input. As we found a significant learning by age group interaction, this indicated age-related declines in the ability to detect statistical regularities if visual sequences are presented quickly.

Possibly, certain aspects of our statistical learning task may be responsible for the absence of a statistical learning effect in older adults. In particular, we had incorporated an inter-target interval of 500 ms (following Salthouse et al., 1999) between both clicks within a trial to allow for prediction effects, even in older adults with slower processing. As we tested statistical learning in a speeded computer mouse task, and movement control on computer mouse tasks is reduced in older adults (Smith et al., 1999), the implemented inter-target interval may have been too short for older adults to show prediction effects. Moreover, to prevent associations between both measures of implicit learning due to modality-specific processing, we opted for a rigorous test of the association between the two types of learning by testing statistical learning ability in a non-auditory (i.e., visual) domain with non-linguistic stimuli. As older adults were able to implicitly learn in the auditory task, it may be argued that we did not observe implicit learning in the visual paradigm due to age-specific modality effects. In both implicit learning tasks, task-relevant information was presented sequentially (i.e., speech unfolding over time in the auditory task and successive highlighting of targets in the visual task). Visual stimuli have been shown to have less salient temporal relations than auditory stimuli (Kubovy, 1988). Consequently, auditory learning is superior to visual learning in sequence learning tasks (Conway and Christiansen, 2005). Additionally, a recent study found that statistical learning performance is decreased if visual stimuli are presented at a fast rate (Emberson et al., 2011). 
Although stimulus presentation in our statistical learning task was not timed as it depended on participants' performance speed (i.e., participants who clicked faster saw the visual stimuli for a shorter time), the time pressure induced by the speeded task, as well as relatively fast and sequential presentation of visual stimuli, may have interfered with statistical learning performance in older adults. That is, results of the current study suggest that older adults' statistical learning ability is affected if fast, sequential processing of visual stimuli is required. However, as previous studies have shown that older adults remain sensitive to probabilistic information in the input (Salthouse et al., 1999; Negash et al., 2003; Simon et al., 2011; Campbell et al., 2012), our failure to observe statistical learning in older adults should not be taken as evidence that older adults are generally insensitive to probabilistic information in the input, or that probabilistic information in the input is generally unimportant for perceptual learning in older adults. Obviously, further research is required to investigate possible links between statistical and perceptual learning in a setting where older adults do show both types of learning.

Overall, limited variability could be observed on the measure of statistical learning ability in both age groups and the amount of individual statistical learning could not be explained by individual differences in cognitive or linguistic abilities in our analyses. However, note the correlations between statistical learning on the one hand and speed and working memory on the other hand in the younger adults. These correlations suggest that, despite relatively little variation in statistical learning, there was some systematicity in younger adults' statistical learning differences. In contrast, participants showed great variability in the amount of adaptation to degraded speech and individual differences in learning to understand noise-vocoded speech could be associated with listeners' cognitive abilities. This finding supports the claim of the RHT that perceptual learning is a top–down guided process, implying that higher cognitive processes are indeed involved in the top–down search to identify task-relevant cues in the input. However, links between cognitive abilities and perceptual learning performance seem to undergo age-related changes, as different associations between perceptual learning ability and cognitive measures emerged in younger and older adults.

In younger adults, initial performance in identifying noise-vocoded speech was not predicted by general cognitive or linguistic abilities. However, differences in the amount of improvement over the course of exposure were associated with individual sensitivity to probabilistic information and with individual vocabulary knowledge. In line with our hypothesis, our results suggest that adaptation to novel speech conditions and statistical learning share mechanisms of implicit regularity detection. Our results contribute to earlier literature indicating a relationship between statistical learning performance and individual differences in language processing (Misyak et al., 2010a). As statistical learning was tested using visual and non-linguistic stimuli, this suggests that general abilities that are neither modality-specific nor specific for language processing drive this association.

As argued in the Introduction, the link between statistical learning and perceptual learning in speech can be twofold. On the one hand, statistical and perceptual learning may be associated as they draw on the same underlying abilities. Our findings do not support this "mediation account": the observed association between perceptual learning in speech and statistical learning performance does not seem to be mediated by the specific cognitive abilities tested in the current study. On the other hand, perceptual learning processes may directly rely on statistical properties in the input. In novel speech conditions, perceptual learning may be facilitated by sensitivity to statistical properties as language itself conveys probabilistic information e.g., in terms of phonotactic (Vitevitch et al., 2004) and transitional probability (Thompson and Newport, 2007). Listeners have been shown to make use of this probabilistic information to segment speech streams into words (Saffran et al., 1997). In the framework of the RHT, listeners who are more sensitive to statistical regularities may, hence, be faster in identifying subunits (e.g., words) in novel speech input, thereby facilitating faster access to high-level representations. Moreover, the information that is transferred from lower to higher levels of the hierarchy may itself be probabilistic in nature. Recent theories in visual perceptual learning argue that the process of input reweighting is based on such probabilistic decisions (e.g., Petrov et al., 2005; Zhang et al., 2010). For example, assuming that the information that is conveyed from lower levels to higher levels is normally distributed and that the mean of the distribution resembles the most relevant input, each incoming input could be reweighted based on its relative distance from the mean, with distance serving as index of informational relevance (Zhang et al., 2010). 
First evidence that probabilistic information may be encoded in the input from lower to higher hierarchical levels comes from a study in which neuronal network models that relied on probabilistic inferences could explain neurophysiological changes in early sensory areas in visual perceptual learning tasks that could not be accounted for by other models (Bejjanki et al., 2011).

The finding that improvement in understanding noise-vocoded speech in younger adults is predicted by participants' vocabulary size confirms the link between increased lexical knowledge and success in perceptual learning that has previously been reported in adapting to novel-accented speech (Janse and Adank, 2012). Thus, lexical knowledge is not only associated with adaptation to linguistic degradations, e.g., systematic phonological deviations in how a foreign-accented speaker pronounces words, but also relates to perceptual learning of acoustically degraded speech. Previous research has shown that younger and older individuals with higher scores on vocabulary tests also show better performance on measures of verbal fluency (e.g., Kemper and Sumner, 2001; Hedden et al., 2005). Thus, individuals with greater vocabulary knowledge may be more efficient processors of linguistic information (Kemper and Sumner, 2001), and linguistic knowledge may improve perceptual learning in speech by facilitating access to higher-level representations. As access to higher-level representations aids sublexical retuning by enabling and guiding top–down search processes (Ahissar and Hochstein, 2004), effects of lexical knowledge should in fact arise irrespective of type of systematic speech degradation. Given that Janse and Adank (2012) observed a relationship between vocabulary knowledge and adaptation (to accented speech) in older adults, this raises the question of why older adults' perceptual learning performance was not predicted by their linguistic knowledge here. Older adults outperformed younger adults on the measure of lexical knowledge, but note that older adults also showed relatively little variation on the vocabulary test (see Section Performance on Background Measures). Consequently, there was less room to relate lexical knowledge to individual differences in perceptual learning ability in older adults than in younger adults.
We checked the variation for older adults' vocabulary scores in the sample of Janse and Adank (2012) (coefficient of variation = 10.3%), which was close to the variation we now observed in the younger adults. Therefore, variation in older adults' vocabulary scores in the current study may have indeed been insufficient to predict perceptual learning.

In the older adult group, listeners' starting level in understanding noise-vocoded speech was associated with higher processing speed and affected by hearing loss, whereas listeners' age predicted how well they adapted to the novel speech condition. That is, younger listeners in the group of older adults showed more learning than older–older listeners. This effect of age had unique explanatory power that was not captured by the included cognitive and perceptual predictors. This finding seems to be consistent with previous research which reported declines in the general identification of noise-vocoded speech with increasing age that were independent of hearing sensitivity (Souza and Boike, 2006; Sheldon et al., 2008) and which may have reflected limited improvement over exposure. Importantly, the current design allowed us to differentiate between effects of individual predictors on both starting level speech identification performance and on amount of perceptual learning. Thus, our results complement earlier findings, suggesting that hearing loss affects initial recognition of noise-vocoded speech, whereas age-related deficits specifically constrain improvement in adaptation to a novel speech input. Younger adults generally outperform older adults when being exposed to the same speech degradation (Peelle and Wingfield, 2005; Sheldon et al., 2008). Importantly, providing younger and older adults with the same speech degradation also has consequences for age groups' ability to improve their performance over exposure (cf. our pilot result data discussed in Section 3.3). In order to have similarly large amounts of perceptual learning for the two age groups, older adults have to be presented with an easier condition than younger adults (Golomb et al., 2007), which was also done in the current study. It is unclear, however, what the age effect on perceptual learning ability among the older adults reflects. 
A possible account may come from recent studies reporting that coherence between activated brain regions is decreased in older adults (Andrews-Hanna et al., 2007; Peelle et al., 2010), relative to younger adults. Importantly, these deteriorations in connectivity were associated with declines in speech understanding performance under difficult listening conditions (Peelle et al., 2010) and with poorer performance on cognitive tasks (Andrews-Hanna et al., 2007). In the framework of the RHT, we may speculate that a reduced coordination between neuronal regions may hinder effective information flow between hierarchical levels, thereby constraining processes of input reweighting. Consequently, this decreased information flow would then impede modifications to the lower-level representations. Thus, an age-related decrease in the ability to coordinate activity between brain regions may affect adaptation to challenging novel speech input.

In short, our results suggest that individual differences in general cognitive and linguistic abilities can explain listeners' variability in adaptation to noise-vocoded speech, thereby highlighting the involvement of listener-based abilities in perceptual learning. As noise-vocoded speech simulates the auditory signal of a cochlear implant, findings of the current study may provide valuable insights for aural rehabilitation in younger and older adults. Amount of adaptation over the course of exposure was specifically associated with vocabulary knowledge and with individuals' sensitivity to probabilistic regularities. These combined results emphasize the importance of pattern recognition and linguistic knowledge for perceptual learning and adaptation in speech processing.

## **AUTHOR CONTRIBUTIONS**

Thordis M. Neger was intensively involved in literature search, research design, experiment preparation, data collection, data analyses, data interpretation, and article preparation. Esther Janse was intensively involved in research design, data analyses, data interpretation, and article preparation. Toni Rietveld was involved in data analyses and article preparation, and contributed to the research design. All authors approved the final article.

#### **ACKNOWLEDGMENTS**

We wish to thank Willemijn van den Berg for her help in data acquisition. This research was supported by the Netherlands Organization for Research (NWO) under Project No. 276-75-009 (grant awarded to Esther Janse).

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnhum.2014.00628/abstract

### **REFERENCES**


**Conflict of Interest Statement:** The Review Editor Frank Eisner declared to the Guest Associate Editor as being affiliated to the same institution as the authors prior to accepting the review assignment, and the disclosure made was deemed sufficient in this case. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 03 March 2014; accepted: 28 July 2014; published online: 01 September 2014. Citation: Neger TM, Rietveld T and Janse E (2014) Relationship between perceptual learning in speech and statistical learning in younger and older adults. Front. Hum. Neurosci. 8:628. doi: 10.3389/fnhum.2014.00628*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Neger, Rietveld and Janse. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Effects of navigated TMS on object and action naming

## *Julio C. Hernandez-Pavon<sup>1,2</sup>\*†, Niko Mäkelä<sup>1,2</sup>†, Henri Lehtinen<sup>3</sup>, Pantelis Lioumis<sup>2</sup> and Jyrki P. Mäkelä<sup>2</sup>*

*<sup>1</sup> Department of Biomedical Engineering and Computational Science, Aalto University School of Science, Espoo, Finland*

*<sup>2</sup> BioMag Laboratory, HUS Medical Imaging Center, Helsinki University Central Hospital, Helsinki, Finland*

*<sup>3</sup> Epilepsy Unit, Department of Pediatric Neurology, Helsinki University Central Hospital, Helsinki, Finland*

#### *Edited by:*

*Carolyn McGettigan, Royal Holloway University of London, UK*

#### *Reviewed by:*

*Joseph T. Devlin, University College London, UK; Olaf Hauk, MRC Cognition and Brain Sciences Unit, UK; Patti Adank, University College London, UK*

#### *\*Correspondence:*

*Julio C. Hernandez-Pavon, Department of Biomedical Engineering and Computational Science, Aalto University School of Science, PO Box 12200, FI-00076 Aalto, Espoo, Finland e-mail: julio.hpavon@aalto.fi; julio.hpavon@gmail.com*

*†These authors have contributed equally to this work.*

Transcranial magnetic stimulation (TMS) has been used to induce speech disturbances and to affect speech performance during different naming tasks. Lately, repetitive navigated TMS (nTMS) has been used for non-invasive mapping of cortical speech-related areas. Different naming tasks may give different information that can be useful for presurgical evaluation. We studied the sensitivity of object and action naming tasks to nTMS and compared the distributions of cortical sites where nTMS produced naming errors. Eight healthy subjects named pictures of objects and actions during repetitive nTMS delivered to semi-random left-hemispheric sites. Subject-validated image stacks were obtained in the baseline naming of all pictures before nTMS. Thereafter, nTMS pulse trains were delivered while the subjects were naming the images of objects or actions. The sessions were video-recorded for offline analysis. Naming during nTMS was compared with the baseline performance. The nTMS-induced naming errors were categorized by error type and location. nTMS produced no-response errors, phonological paraphasias, and semantic paraphasias. In seven out of eight subjects, nTMS produced more errors during object than action naming. Both intrasubject and intersubject analysis showed that object naming was significantly more sensitive to nTMS. When the number of errors was compared according to a given area, nTMS to postcentral gyrus induced more errors during object than action naming. Object naming is apparently more easily disrupted by TMS than action naming. Different stimulus types can be useful for locating different aspects of speech functions. This provides new possibilities in both basic and clinical research of cortical speech representations.

**Keywords: transcranial magnetic stimulation, speech mapping, left hemisphere, object naming, action naming**

## **INTRODUCTION**

Transcranial magnetic stimulation (TMS) is a noninvasive technique where a strong and brief magnetic pulse is delivered to the brain and induces electrical currents. This produces depolarization of cellular membranes and neuronal activation (Barker et al., 1985; Ilmoniemi et al., 1999). TMS has become an important tool for studying speech and language at both the cognitive and neural level (Devlin and Watkins, 2007). TMS may produce both inhibition and facilitation during different phases of speech processing either by directly stimulating a specific speech-related cortical region or indirectly through intracortical networks (Epstein, 1998). TMS has been used for studying the functional localization of speech in healthy subjects, with variable results (Pascual-Leone et al., 1991; Epstein et al., 1999; Devlin and Watkins, 2007; Vigliocco et al., 2011).

Navigated TMS (nTMS) is considered the state-of-the-art technique in performing TMS studies (Siebner et al., 2009). In nTMS, the stimulated cortical site can be defined anatomically from the individual's brain magnetic resonance images (MRI). In addition, orientation and strength of the induced electric field can be estimated (Siebner et al., 2009; Ruohonen and Karhu, 2010). The information provided by nTMS is useful for surgical planning, and it can be transferred into the operating theater via surgical neuronavigation systems.

So far, nTMS has been used in preoperative localization of the motor cortex (Picht et al., 2009; Vitikainen et al., 2009). It localizes the cortical representations of hand muscles as accurately as direct cortical stimulation (DCS) (Picht et al., 2011; Krieg et al., 2012) and more accurately than functional magnetic resonance imaging (fMRI) (Forster et al., 2011; Krieg et al., 2012). In addition, neuromodulation of Broca's area in speech-related tasks is reported to be more robust by nTMS than by conventional TMS based on external landmarks on the head (Kim et al., 2013). These results motivated us to develop a protocol for preoperative localization of speech-related brain areas by utilizing object naming and nTMS (Lioumis et al., 2012). This novel approach has been compared to DCS during awake craniotomy (Picht et al., 2013). The results imply that nTMS is remarkably sensitive but relatively non-specific in detecting the sites producing speech disturbance in DCS. Discordance between nTMS and DCS was observed particularly in the posterior cortical regions (Picht et al., 2013; Tarapore et al., 2013). Preoperative speech mapping by nTMS can give important a priori information to the neurosurgeons. It may aid in objective preoperative risk-benefit balancing of the planned surgery, more targeted and smaller craniotomies, faster and safer intraoperative mapping, and safer surgeries for patients that cannot undergo awake craniotomy (Picht et al., 2013). Recently, two studies have used object naming and nTMS to compare language mapping on patients with brain tumors and healthy subjects, suggesting tumor-induced plasticity of speech representation areas (Krieg et al., 2013; Rösler et al., 2013). Thus, better understanding of the effects of TMS during naming tasks may have an impact on surgery planning and provide information about the cortical organization of speech in general.

Picture naming has been extensively studied in both healthy subjects and patients with various neurological diseases. In magnetoencephalography (MEG) and fMRI studies on healthy subjects, action and object naming activate cortical networks including left inferior frontal gyrus, left dorsal premotor, bilateral occipitotemporal, and bilateral parietal areas (Sörös et al., 2003; Petrovich Brennan et al., 2007; Liljeström et al., 2008, 2009). Some functional neuroimaging studies suggest different cortical representations of action and object naming (for a review, see Vigliocco et al., 2011). It has been suggested that action naming activates particularly the left premotor (Valyear et al., 2007; Canessa et al., 2008), parietal (Noppeney et al., 2005), and frontal cortex (Vigliocco et al., 2011), whereas object naming activates the left temporal areas most strongly (Vigliocco et al., 2011). In line with this, patients with aphasia due to lesions in left frontal areas have shown more severe deficits in action naming, whereas lesions in the left temporal areas are associated with deficits in object naming (Mätzig et al., 2009; Vigliocco et al., 2011).

Action naming appears to be a demanding process that requires more extensive neural processing than object naming (Mätzig et al., 2009). TMS studies indicate that left prefrontal and motor cortices are involved in processing verbs and actions (Pulvermüller et al., 2005; Cappelletti et al., 2008; Gerfo et al., 2008). However, the areas stimulated by Cappelletti et al. (2008) and Gerfo et al. (2008) do not match the areas of greater activation for verbs and nouns in imaging studies using similar tasks (Vigliocco et al., 2011). Intraoperative cortical mapping by DCS during awake craniotomy in tumor surgery has systematically revealed widely distributed and highly individual effective cortical sites (Whitaker and Ojemann, 1977; Sanai et al., 2008; Corina et al., 2010) and dissociation of sites inducing errors in action and object naming (Corina et al., 2005; Lubrano et al., 2014). The cortical sites activated specifically by action naming resided mainly in the parietal cortex (Corina et al., 2005).

We mapped the left-hemispheric speech-related areas by nTMS during object and action naming tasks. The errors induced during object and action naming were categorized by type and by location of the stimulated cortical site, and compared with each other. We were particularly interested to see whether action naming would be disrupted more by nTMS in the posterior cortical areas, where discordant results between nTMS and DCS were seen in an object naming task (Picht et al., 2013; Tarapore et al., 2013). If so, action naming tasks might add information in detecting speech-related cortical areas from this region by means of nTMS, as suggested in previous studies (Corina et al., 2005; Noppeney et al., 2005). As action naming is considered more demanding than object naming (Berndt et al., 1997; Mätzig et al., 2009), we hypothesized that action naming would be more easily disturbed by nTMS than object naming.

## **METHODS**

#### **SUBJECTS**

Eight neurologically normal right-handed subjects (native speakers of Finnish; mean age 26 ± 2 years, four females) participated in the study. The subjects had normal or corrected-to-normal vision. The study was approved by the Ethics Committee of Helsinki University Central Hospital and was in compliance with the Declaration of Helsinki. The subjects gave their written informed consent before the experiments.

#### **OBJECT AND ACTION NAMING**

We used two sets of color pictures with a white background, one with 131 images depicting objects and another with 98 images depicting actions. Object images illustrated a simple object (e.g., a chair; **Figure 1A**; see also the video in the Supplementary Material). The action images represented a simple event (e.g., playing an instrument; **Figure 1B**). The subjects were asked to name objects or actions in Finnish as quickly and precisely as possible. Two subjects performed action naming before object naming. The experiment consisted of two baseline sessions without nTMS (one for object and another for action naming) and two nTMS sessions (one with object naming and another with action naming). All sessions were video-recorded for offline analysis. The baseline sessions were done before the nTMS sessions. Images that were unfamiliar or named incorrectly in the baseline session were removed from the image set used during nTMS (see the Supplementary Table). Thus, only fluently named images were used during the nTMS sessions. The numbers of rejected object and action images did not differ significantly (Mann–Whitney *U*-test; *p* = 0.26). The images were displayed in random order within the object naming and action naming sessions (see the **Supplementary Video**). For each subject, all TMS measurements were performed in a row.

#### **STIMULATION**

Two recording setups were used. In setup 1, we used eXimia Navigated Brain Stimulation (NBS) version 3.2 (Nexstim Ltd., Helsinki, Finland); for details, see Lioumis et al. (2012). In setup 2, we used eXimia NBS version 4.3 and a commercial speech-mapping module (NexSpeech, Nexstim Ltd., Helsinki, Finland). Both navigation systems calculate the strength of the maximum electric field, which is overlaid online on the 3-D reconstruction of the individual's brain (Ruohonen and Karhu, 2010). Each stimulation site is tagged to the MR image for subsequent studies.

All stimulations were done with a biphasic figure-of-eight coil. The outer diameter of the coil was 70 mm. The resting motor threshold (MT) was determined from the right abductor pollicis brevis (APB) muscle, and the strength of the induced electric field at the cortex was registered. These electric fields varied between 40 and 100 V/m at approximately 25 mm from the head surface (i.e., at the navigation depth). The stimulus intensity for the speech mapping was adjusted to produce an electric field of roughly the same strength in the perisylvian cortical regions. Navigated TMS of temporal areas occasionally produces some discomfort. We took particular care to avoid this; if the stimulation caused discomfort to the subject due to muscle contraction (in a short test session before the actual measurements), the stimulation intensity was lowered in decrements of 5–10% until it was tolerable. Moreover, the experimental setup has been validated by DCS, where discomfort due to scalp or muscle stimulation is not an issue, and found to match well with DCS particularly in ventral anterior areas (see e.g., Picht et al., 2013; Tarapore et al., 2013; Krieg et al., 2014). Consequently, the stimulation intensity varied somewhat across subjects (80–110% of the APB MT; 30–40% of the stimulator output). The stimulation was done with nTMS trains of five pulses at 5 Hz (Epstein et al., 1996; Lioumis et al., 2012). The subjects wore earplugs during all sessions.

The object and action pictures were displayed for 700 ms on a computer screen once every 2.5 s. The nTMS trains were delivered with a 300 ms delay after the picture onset (**Figure 1C**; **Supplementary Video**). The nTMS onset time was chosen on the basis of MEG studies on dynamics of cortical language processing (Salmelin et al., 2000; Sörös et al., 2003); essentially, we did not want to interfere with the visual inspection, but to disturb other stages of language production (e.g., conceptual processing, lexical selection, phonological encoding, and articulatory preparation). The coil was hand-held and was moved freely between the pulse trains. Approximately 200 sites were stimulated in the left hemisphere by moving the coil semi-randomly between the trains of pulses, following a grid-like pattern so that the tested target sites systematically covered a wide fronto-temporo-parietal cortical area. The same areas were stimulated for both tasks. The orientation of the coil was adjusted to induce current primarily perpendicular to the fibers of the temporalis muscle to minimize muscle twitching, and secondarily perpendicular to the sulcus at the stimulation target. The cortical sites where nTMS-induced errors were observed online were revisited to evaluate the repeatability of the effect (see **Supplementary Video**). On average, 257 stimulus trains were delivered to the left hemisphere during object naming and 243 during action naming in each subject. The maximum difference between repetitions for two different images was one, as the content of the subject-validated image stack was randomized each time a new round of the images started.
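The trial timing described above can be laid out with a few lines of arithmetic. The following is a minimal sketch in Python, not part of the original protocol; it assumes that the 300 ms delay refers to the first pulse of each train and that the five pulses are evenly spaced at the 5 Hz rate given in the text. The function and variable names are hypothetical.

```python
def pulse_times_ms(delay_ms=300, n_pulses=5, rate_hz=5.0):
    """Pulse onsets in ms relative to picture onset.

    Assumption: the delay marks the first pulse and the pulses are evenly spaced.
    """
    interval_ms = 1000.0 / rate_hz
    return [delay_ms + k * interval_ms for k in range(n_pulses)]


PICTURE_MS = 700   # picture display duration given in the text
TRIAL_MS = 2500    # one picture every 2.5 s

times = pulse_times_ms()
# Pulses delivered while the picture is still on the screen.
pulses_during_picture = [t for t in times if t < PICTURE_MS]
# Pulse-free interval between the last pulse and the next picture onset.
gap_before_next_picture_ms = TRIAL_MS - times[-1]
```

Under these assumptions the train spans 800 ms from the first to the last pulse, so most pulses fall after the picture has disappeared, and a pulse-free interval remains before the next picture appears.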

#### **DATA ANALYSIS**

A neuropsychologist with expertise in effects of DCS on speech (HL) analyzed naming performance in the recorded videos. During the analysis, the stimulation sites were not visible. The baseline naming responses were compared with those recorded during nTMS. The observed errors were categorized as no-response errors, semantic paraphasias, and phonological paraphasias according to previous studies (Corina et al., 2010; Picht et al., 2013; Rösler et al., 2013). *No-response errors*: stimulation leads to a complete lack of naming response. *Phonologic paraphasias*: characterized by unintended phonemic modification of the target word. The spoken word resembles the target word, but is phonetically different. For example, the target word "pants" is replaced with "plants." *Semantic paraphasias*: errors in which the patient substitutes a semantically related or associated word for the target word. For example, the target word "cow" is replaced by the word "horse." When a naming error occurred, the corresponding nTMS location was marked as speech-related and tagged by the observed error type. Thereafter, the nTMS sites eliciting naming errors were grouped into cortical regions. For the anatomical labeling, we used the anatomical atlas shown in Corina et al. (2010) as in previous publications (Lioumis et al., 2012; Picht et al., 2013). The cortical surface of each subject was separated into anatomical regions according to this template.

The statistical significance of the results was evaluated at both the single-subject and the group level. For the single-subject analysis, the statistical significance of the observed effects of nTMS on performance in the naming tasks was evaluated separately for each subject and stimulated area. The null hypothesis was that the observed errors occur due to chance. If so, the number of observed errors should follow a Poisson distribution with the parameter λ = number of observed errors (per area)/the total number of nTMS trials (per area). The probability that the observed number of naming errors in an area could have arisen by chance rather than due to the effect of nTMS was computed by comparing the number of observed errors with one million simulated Poisson samples. The proportion of simulated samples that were greater than or equal to the observed number of errors gives the probability that the observed effect could have occurred by chance. The significance level was set at 5%. False discovery rate (FDR) correction was applied to the *p*-values collected from the area-wise analyses of each subject to correct for multiple comparisons (Storey, 2002).
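The simulation-based test described above can be illustrated with a short Python sketch. It is an illustration under stated assumptions rather than the authors' code: the Poisson parameter `lam` is simply passed in as an input, the sampler uses Knuth's multiplication method, the default number of simulations is smaller than the one million used in the study, and the multiple-comparison step uses the Benjamini–Hochberg adjustment as a simpler stand-in for the Storey (2002) procedure cited in the text. All function names are hypothetical.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def monte_carlo_p(observed, lam, n_sim=100_000, seed=0):
    """Fraction of simulated Poisson counts that are >= the observed count."""
    rng = random.Random(seed)
    hits = sum(poisson_sample(lam, rng) >= observed for _ in range(n_sim))
    return hits / n_sim


def exact_p(observed, lam):
    """Exact upper tail P(X >= observed) for X ~ Poisson(lam), as a cross-check."""
    below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                for k in range(observed))
    return 1.0 - below


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumption: used here in place of the Storey (2002) q-value procedure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, one p-value would be computed per subject and area with `monte_carlo_p`, and the resulting list passed to `benjamini_hochberg` before comparison with the 5% level. The exact tail is included only to show that the simulated proportion converges on it as the number of simulations grows.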

For the group level analysis, the 2-tailed Mann–Whitney *U*-test was used to compare the number of errors during object and action naming. The statistical analysis was done to the total number of naming errors in the left hemisphere, and within each error type and gyrus. The significance level was set at 5%.
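For small groups such as the eight subjects studied here, the two-tailed Mann–Whitney comparison can be computed exactly by enumeration. The sketch below is a hypothetical illustration, not the authors' analysis code: it computes the *U* statistic from mid-ranks and obtains a two-sided p-value as the proportion of all possible group assignments whose *U* lies at least as far from its null expectation as the observed value. Statistical packages may instead use a normal approximation, so their p-values can differ slightly.

```python
from itertools import combinations


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def mann_whitney_exact(x, y):
    """Return (U statistic for x, two-sided exact permutation p-value)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    offset = n1 * (n1 + 1) / 2
    u_obs = sum(ranks[:n1]) - offset
    centre = n1 * n2 / 2  # expected U under the null hypothesis
    extreme = total = 0
    for group in combinations(range(n1 + n2), n1):
        u = sum(ranks[i] for i in group) - offset
        total += 1
        if abs(u - centre) >= abs(u_obs - centre) - 1e-9:
            extreme += 1
    return u_obs, extreme / total
```

With two groups of eight values, the enumeration covers 12,870 assignments, which runs quickly. The inputs would be the per-subject error counts for object and action naming.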

To visually summarize the speech mapping results, the stimulation sites that were associated with naming errors from all eight subjects were projected on the standardized MNI brain template (Mazziotta et al., 2001), using the FSL (Smith et al., 2004; Woolrich et al., 2009; Jenkinson et al., 2012) and FreeSurfer (Fischl et al., 1999) software packages. The brain was segmented from the individual T1-weighted MRIs of each subject and registered with the standard brain template in MNI space. Thereafter, the [...] brain template (**Figure 2**).

**FIGURE 2 |** [...] **visualized on an inflated reconstruction of the cortex.** Red spheres: no-response errors; green spheres: semantic paraphasias; yellow spheres: phonological paraphasias. (**A**,**B**) All cortical sites that elicited nTMS-induced [...] S7. The number, type, and location of the naming errors vary between subjects. The white asterisks indicate the sites of repeated errors at the same location.

## **RESULTS**

Overall, 93 nTMS trains (4.5% of a total of 2056 trains) induced errors during object naming. During action naming, 33 nTMS trains (1.7% of a total of 1944 trains) induced errors (**Figure 3A**). In seven out of eight subjects, TMS elicited more object naming than action naming errors. In one subject, the total number of induced errors was equal in both tasks (**Table 1**). Naming errors were induced when nTMS was delivered to angular gyrus (anG), inferior frontal gyrus (IFG), middle frontal gyrus (MFG), postcentral gyrus (PoG), precentral gyrus (PrG), superior temporal gyrus (STG), middle temporal gyrus (MTG), and supramarginal gyrus (SMG) (**Table 1** and **Figure 2**).
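The reported proportions can be checked directly from the counts given in the text. The short Python sketch below uses only those counts; the variable names are hypothetical.

```python
# Counts taken from the text above.
object_errors, object_trains = 93, 2056
action_errors, action_trains = 33, 1944
n_subjects = 8


def percent(part, whole):
    """Share of `part` in `whole`, in percent (not rounded)."""
    return 100.0 * part / whole


object_rate = percent(object_errors, object_trains)   # about 4.5%
action_rate = percent(action_errors, action_trains)   # about 1.7%
overall_rate = percent(object_errors + action_errors,
                       object_trains + action_trains)
# Mean number of trains per subject, (object naming, action naming).
trains_per_subject = (object_trains / n_subjects, action_trains / n_subjects)
```

The per-subject means recover the 257 and 243 trains reported in the Methods, and the overall rate corresponds to the roughly 3.2% of trials with naming errors mentioned in the Discussion.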

In the object naming task, 25% of the sites associated with naming errors were located in the PoG, 23% in the STG, 19% in the IFG, 12% in the PrG, 9% in the SMG, 8% in the MTG, 3% in the MFG, and 2% in the anG (**Table 1** and **Figure 2**). A subanalysis by type showed that 24% of the no-response errors (73% of all naming errors, see **Figure 3B**) were induced from the IFG, 24% from the STG, 21% from the PoG, and 12% from the PrG (**Table 1**). Thus, 81% of all sites producing no-response errors during nTMS were concentrated on these areas.

In the action naming task, 34% of the sites associated with naming errors were located in the STG, 21% in the IFG, 12% in the PoG, 12% in the PrG, 12% in the SMG, 3% in the anG, and 3% in the MFG and in the MTG (**Table 1** and **Figure 2**). A subanalysis by type showed that 53% of the sites associated with no-response errors (46% of all naming errors, see **Figure 3C**), were located in the STG, 20% in the PoG, and 13% in the IFG (**Table 1**). Thus, more than 80% of all sites producing no-response errors during nTMS were concentrated on these areas. **Figure 2** depicts the sites where nTMS elicited naming errors in object and action naming tasks. Fused results for all subjects are shown in **Figures 2A,B**. Results for two individual subjects are shown in **Figures 2C–F** to reveal the inter-subject variability. Overall, the number, type, and location of the naming errors varied between the subjects.

The area-dependent subject-level analysis showed significant effects of nTMS in IFG, MFG, PoG, PrG, STG, MTG, and SMG for object naming (*p* < 0.05; see **Table 2**) and in IFG, PoG, STG, and SMG for action naming (*p* < 0.05; see **Table 2**). The most sensitive cortical sites were IFG, PoG, PrG, STG, and SMG (see **Table 2** for summary). The largest difference of nTMS-sensitive sites in object and action naming tasks was in the PoG, where 7 subjects had a significant effect of nTMS on object naming and only one on action naming (**Table 2**). Clear individual differences between the subjects in the distributions of the speech-related areas were evident (**Table 1** and **Figure 2**).

In group-level analysis, the total number of nTMS-induced errors in object naming was significantly larger than in action naming (*p* = 0.002; see **Table 1**). No-response errors were significantly more frequent in object than action naming (*p* = 0.002); the number of semantic and phonological paraphasias did not differ significantly between the tasks. When the total number of errors within each gyrus was compared, object naming was more effectively disturbed by nTMS in PoG (*p* = 0.014) than action naming. No significant differences were observed for nTMS in the other gyri (**Table 1**).

### **DISCUSSION**

We observed that object naming was consistently more disturbed by nTMS to the left hemisphere than action naming. The induced error types varied between subjects, but no-response errors were the most frequent in both tasks. In line with our results, object naming errors were more frequent than action naming errors during left-hemisphere DCS of neurological patients (Lubrano et al., 2014). Apparently, object naming is more sensitive to perturbations elicited by nTMS than action naming. DCS is probably more efficient than nTMS; in our study 3.2% of all trials induced naming errors, whereas 11.5% of the tested DCS sites were associated with induced language interferences (Lubrano et al., 2014). However, DCS mapping is limited by the extent of the craniotomy, whereas our nTMS speech mapping covered a wide cortical area. Hence, nTMS mapping is more likely than DCS mapping to include stimulation sites that are not speech-related.

TMS induced naming errors from virtually all perisylvian sites (**Figure 2**). However, across subjects the location of these individual punctate regions varied and there were no regionally specific effects of nTMS, which is in line with the results obtained by DCS studies (Corina et al., 2005; Lubrano et al., 2014). The classical Broca's area (Brodmann area 44/45) in IFG and Wernicke's area (Brodmann area 22) in STG were both sensitive to nTMS in most subjects. It is evident, however, that the classical modular brain–language model is insufficient to explain our results. Instead, the results support the current state-of-the-art models of a widely distributed language network (Poeppel and Hickok, 2004; Hagoort and Indefrey, 2014; Hope et al., 2014).

In our study, we did not measure the time-line aspect of language processing *per se*; instead we used repetitive TMS to induce speech disturbances. We assumed that our rTMS train was delivered early enough (from 300 ms onwards) to be able to disturb semantic processing, phonological code retrieval, syllabification, phonetic encoding, and articulation components of the language processing (Indefrey and Levelt, 2004; Indefrey, 2011; Strijkers and Costa, 2011) needed in overt object and action naming. However, as speech processing is not only sequential but probably also occurs in parallel during several phases of the processing, the specific identification of the affected processes seems unreliable.

Recently, 300 and 0 ms nTMS pulse train onsets were compared to study the effects on sensitivity and specificity of picture naming during language mapping with navigated TMS. The 0-ms onset produced more specific results in the parietal areas when compared to DCS data (Krieg et al., 2014). The 0-ms paradigm resembles more precisely the one applied in DCS, so it is not surprising that the match between the 0-ms onset time and DCS is better. Nevertheless, the early onset of the nTMS may also influence conceptual preparation, lexical concept selection, and lemma retrieval attributed to the early stages of picture naming processes (Indefrey and Levelt, 2004; Indefrey, 2011). However, using the 300-ms latency for the nTMS pulse trains should not have biased our results to make object naming more sensitive to TMS than action naming, because action naming is a more demanding and time-consuming process (e.g., Vigliocco et al., 2004; Mätzig et al., 2009).

*Areas where nTMS induced statistically significant effects on object and action naming. The p-values are indicated; "-" means p > 0.05 and "0" means p < 0.001. Notation: anG, angular gyrus; IFG, inferior frontal gyrus; MFG, middle frontal gyrus; PoG, postcentral gyrus; PrG, precentral gyrus; STG, superior temporal gyrus; MTG, middle temporal gyrus; SMG, supramarginal gyrus.*

Our results do not allow conclusions on cortical areas essential for processing of object-related or action-related words. Instead, they emphasize the network nature of language processing, in line with previous studies (Vigneau et al., 2006; Mätzig et al., 2009; Vigliocco et al., 2011; Lubrano et al., 2014). We did not confirm the previously described particular sensitivity of action naming for parietal cortical DCS (Corina et al., 2005). Our results were in line with more recent DCS results (Lubrano et al., 2014).

As action naming was not specifically influenced by nTMS to posterior cortical areas, its use in preoperative speech mapping probably does not increase the sensitivity of nTMS in these regions. Therefore, the discordant results between nTMS and DCS of the posterior cortical areas (Picht et al., 2013; Tarapore et al., 2013) would probably not be improved by replacing the object naming task with an action naming task.

Speakers name pictures of objects faster than those of actions, and action naming is more difficult than object naming in terms of accuracy and latencies (Vigliocco et al., 2002, 2004, 2011; Mätzig et al., 2009; Strijkers and Costa, 2011). This would suggest that action naming would be more easily disrupted by nTMS than object naming. However, the reverse was true in our experiment. Naming of words related to actions has been reported to involve the motor cortex (Pulvermüller, 2005; Pulvermüller et al., 2005) and the middle frontal gyrus (Lubrano et al., 2014) more strongly. In our study, we did not stimulate those areas extensively enough to reach such conclusions. However, if the motor areas are more involved in action naming than object naming, it is possible that this "extra support" makes action naming less sensitive to TMS than object naming when perisylvian regions are stimulated.

Object naming was particularly sensitive for nTMS to PoG, which is not typically studied by DCS when comparing object and action naming (Corina et al., 2005; Lubrano et al., 2014). However, in direct cortical recordings, spectral activity in PoG is modified during naming (Wu et al., 2011; Cogan et al., 2014). In fMRI, action naming induces stronger activation than object naming in PoG (Liljeström et al., 2008, 2009). It is possible that this stronger activation by action naming is less vulnerable to nTMS-induced perturbation.

Both fMRI and DCS have been used for language mapping. DCS during awake craniotomy is considered the gold standard for intraoperative brain mapping of cortical speech representations. However, it is demanding for the patient, strongly invasive, and may produce after-discharges, making the results difficult to interpret (Giussani et al., 2010). Moreover, the studied cortical regions are limited by the extent of craniotomy and demands of the surgery. Results from fMRI vary between different language paradigms and individuals, and its spatial accuracy in patients with gliomas has been questioned (Giussani et al., 2010; Wang et al., 2012). A recent case report suggests that nTMS may be more sensitive in defining speech lateralization than fMRI (Sollmann et al., 2013). The results of our study support the usefulness of picture naming combined with nTMS in presurgical planning (Krieg et al., 2013, 2014; Picht et al., 2013; Rösler et al., 2013). This approach also provides new possibilities for basic research of cortical speech representation. In addition, it may offer complementary information in comparison to other non-invasive methods (e.g., MEG and fMRI). Moreover, our results suggest that the efficacy of TMS in inducing naming errors can be modulated by the task: if a higher sensitivity is required, object naming is preferred; if a sparser set of nTMS-sensitive sites is required, action naming can be used.

Static pictures have limitations in exploring action naming performance, and some research groups have used videos of actions as stimuli to overcome this issue (e.g., Corina et al., 2005). Nevertheless, static images have been widely used in studies of action naming (see Mätzig et al., 2009). We did not match the frequency, familiarity, length, or visual complexity of the pictures of objects vs. actions. Instead, subject-validated image stacks for objects and actions were obtained in the baseline naming session. It should be emphasized that we did not directly compare the naming of objects vs. actions, but we compared the sensitivity of fluently named objects or actions to nTMS (for a similar approach, see Lubrano et al., 2014).

In summary, we have compared the naming error distributions induced by nTMS during object and action naming tasks. We suggest that object naming is more easily disrupted by nTMS than action naming. In particular, nTMS to PoG induced more errors during object naming than during action naming. Thus, use of action naming instead of object naming tasks most likely would not improve the specificity of nTMS in mapping posterior speech-related areas (Picht et al., 2013; Tarapore et al., 2013). These features, however, can be used in varying the sensitivity of functional mapping by nTMS for different cognitive paradigms in basic research and for presurgical planning. To summarize, TMS applied to 8 subjects induced 93 errors during object naming and 33 during action naming. We find this surprisingly convincing for a relatively small sample, but believe that increasing the number of subjects will provide further important information on cortical speech organization.

#### **ACKNOWLEDGMENTS**

This study was financially supported by grants from the Helsinki University Hospital research fund and by the SalWe Research Program for Mind and Body (Tekes—the Finnish Funding Agency for Technology and Innovation grant 1104/10). We thank Dr. Thomas Picht, Department of Neurosurgery, Charité Hospital, Berlin, Germany, for providing us the action and object images, Antonios K. Thanellas for helping in the segmentation and registration of the MRIs to produce **Figure 2**, Dr. Ilkka Nissilä for his suggestions in the statistical analysis and the English examiner, Steve Lipson, M.A., for revising the English of our manuscript. Julio C. Hernandez-Pavon wants to thank CONACYT (Consejo Nacional de Ciencia y Tecnologia) for funding.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnhum.2014.00660/abstract

#### **Supplementary Video | Clips from baseline recordings and from errors in object and action naming during navigated TMS recorded with experimental setup 1.**

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 13 May 2014; accepted: 08 August 2014; published online: 02 September 2014.*

*Citation: Hernandez-Pavon JC, Mäkelä N, Lehtinen H, Lioumis P and Mäkelä JP (2014) Effects of navigated TMS on object and action naming. Front. Hum. Neurosci. 8:660. doi: 10.3389/fnhum.2014.00660*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Hernandez-Pavon, Mäkelä, Lehtinen, Lioumis and Mäkelä. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Processing of acoustic and phonological information of lexical tones in Mandarin Chinese revealed by mismatch negativity

#### **Keke Yu<sup>1</sup>, Ruiming Wang<sup>1</sup>\*, Li Li<sup>2</sup> and Ping Li<sup>3</sup>\***

<sup>1</sup> Center for Studies of Psychological Application, School of Psychology, South China Normal University, Guangzhou, China

<sup>2</sup> College of International Culture, South China Normal University, Guangzhou, China

<sup>3</sup> Department of Psychology and Center for Brain, Behavior, and Cognition, Pennsylvania State University, Pennsylvania, PA, USA

#### **Edited by:**

Patti Adank, University College London, UK

#### **Reviewed by:**

Nikhil Sharma, University College London, UK

Xiaoqing Li, Institute of Psychology, Chinese Academy of Sciences, China

#### **\*Correspondence:**

Ruiming Wang, Center for Studies of Psychological Application, School of Psychology, South China Normal University, No. 55, West Zhongshan Ave., Tianhe District, Guangzhou 510631, China e-mail: wangrm@scnu.edu.cn; Ping Li, Department of Psychology and Center for Brain, Behavior, and Cognition, Pennsylvania State University, University Park, 201 Old Main, Pennsylvania, PA 16802, USA e-mail: pul8@psu.edu

The accurate perception of lexical tones in tonal languages involves the processing of both acoustic information and phonological information carried by the tonal signal. In this study we evaluated the relative role of the two types of information in native Chinese speakers' processing of tones at a preattentive stage with event-related potentials (ERPs), particularly the mismatch negativity (MMN). Specifically, we distinguished the acoustic from the phonological information by manipulating phonological category and acoustic interval of the stimulus materials. We found a significant main effect of phonological category for the peak latency of MMN, but a main effect of both phonological category and acoustic interval for the mean amplitude of MMN. The results indicated that the two types of information, acoustic and phonological, play different roles in the processing of Chinese lexical tones: acoustic information only impacts the extent of tonal processing, while phonological information affects both the extent and the time course of tonal processing. Implications of these findings are discussed in light of neurocognitive processes of phonological processing.

**Keywords: Chinese lexical tones, acoustic processing, phonological processing, mismatch negativity (MMN), preattentive stage**

#### **INTRODUCTION**

The use of lexical tones to differentiate lexical semantics is a characteristic of Chinese and other tonal languages. According to Yip (2002), the world's languages can be categorized into three types depending on the role that the pitch information plays in the expression of meaning: tone language (e.g., Chinese), intonation language (e.g., English), and pitch-accent language (e.g., Japanese). In languages like Chinese, different lexical semantics are expressed through the variations of pitch height and pitch contour at the syllable level, as opposed to intonation languages where pitch variations occur only at the phrase or sentence level, or pitch-accent language where pitch variations occur between syllables. In recent years, researchers have become interested in the neurocognitive processes associated with tonal languages (see Gandour, 2006; Jongman et al., 2006 for reviews). Several studies have also used neuroimaging techniques including event-related potential (ERP) and functional magnetic resonance imaging (fMRI) to study the processing of lexical tones in Chinese (e.g., Gandour et al., 2000; Klein et al., 2001; Li et al., 2008; Zhang et al., 2011; Wang et al., 2013a).

When a native speaker processes lexical tones in Chinese,<sup>1</sup> the speaker deals with at least two types of information: the acoustic information that includes the physical features of auditory input such as the fundamental frequency (F0), and the phonological information that expresses lexical semantics on the basis of which word categories are identified (Luo et al., 2006; Xi et al., 2010). To investigate the difference between the processes of the acoustic vs. the phonological information in lexical tones, one line of previous research has been to understand whether lexical tonal processing in Chinese is left or right lateralized in the brain. The functional hypothesis claims that when tones are processed as phonological units they are lateralized to the left hemisphere, whereas when they are processed as purely acoustic information they are lateralized to the right hemisphere (van Lancker, 1980; Wong, 2002). Chinese lexical tones contain both the acoustic information and the phonological information. Thus, based on this hypothesis, Chinese tonal processing involves both hemispheres for different processing of these two kinds of information. A competing theory, the acoustic hypothesis, claims that the brain lateralization of tonal processing depends on the acoustic properties of the auditory input. Spectral variations such as those contained in pitch information are preferentially processed in the right hemisphere, whereas temporal variations such as those contained in vowels are processed more strongly by the left hemisphere (Zatorre and Belin, 2001; Zatorre et al., 2002). According to this hypothesis, the acoustic contrasts and the phonological contrasts in Chinese tones are both spectral variations, and so both acoustic processing and the phonological processing are lateralized to the right hemisphere.

<sup>1</sup>Although our experimental materials involve lexical tones only from Mandarin Chinese, the findings from this study apply to all Chinese dialects. We therefore use "Chinese" as a generic term henceforth in the paper to refer to all Chinese dialects.

In light of these views on brain lateralization, Gandour et al. have put forward a comprehensive hypothesis on this issue (Gandour et al., 2000, 2004; Gandour, 2006). They suggested that lexical tonal processing engages both hemispheres, depending on the types of information involved during processing. In particular, for the same lexical tones, the left hemisphere is more involved in the semantic processing whereas the right hemisphere more in the acoustic processing of the pitch information. Given that processing of lexical tones involves the processing of both acoustic information and phonological information (which is semantically differentiating), this hypothesis suggests that there is no simple lateralization pattern associated with the brain's processing of tones.

The key issue under consideration here is therefore whether processing involves acoustic features of tones or phonological features of tones. Luo et al. (2006) used mismatch negativity (MMN) to study the features of tonal processing. MMN is a powerful method to examine the early stages of acoustic and phonological processing, as previous studies have indicated (e.g., Näätänen et al., 1978, 2007; Näätänen and Alho, 1997). It is an electrophysiological component that reflects the brain's automatic detection of deviant patterns that do not match the general pattern that has been observed; it typically peaks around 200–250 ms after stimulus onset, mostly in the frontocentral areas. MMN responses can be elicited by an oddball paradigm in which several infrequent deviant stimuli are embedded in frequent standard stimuli during auditory presentation. Luo et al. (2006) found that the mean amplitude of MMN elicited by tone deviants in the right hemisphere was larger than that in the left hemisphere. Their conclusion was that at the pre-attentive stage, a stage characterized by automatic processing in which people are unaware of the detailed properties of the stimuli (Kubovy et al., 1999), listeners mainly process the acoustic information of lexical tones. At this stage, the processing is usually lateralized to the right hemisphere. At a second stage of tone processing, the attentive stage, in which people process the presented stimuli consciously, listeners tended to process the semantic information of tones via the left hemisphere. Luo et al.'s novel finding, which adds to the work of Gandour et al., was that whether the cognitive processing of Chinese lexical tones involves both hemispheres depends on the processing stage. Thus, the MMN provides an excellent measure of the time course of processing, contributing additional insights into lexical tone processing.

In addition to the lateralization debate of tone processing in Chinese, recent studies have also highlighted the role of categorical perception of tones. Categorical perception refers to the ability of human listeners to perceive continuous acoustic signals as discrete linguistic representations: listeners are sensitive to the boundaries between different phonetic categories, but are insensitive to acoustic changes within the boundaries of the same phonetic category (Liberman et al., 1957, 1967). This across-category vs. within-category perception difference has been extensively studied in previous work with segmental phonemes, such as consonants and vowels (e.g., with VOT characteristics), but is less well understood in studies with suprasegmental features such as tones. It has been found recently, however, that the perception of lexical tones shows categorical perception just as phonemes do: native speakers of tonal languages are more sensitive to across-category tonal variations than within-category variations (see Francis et al., 2003; Hallé et al., 2004; Xu et al., 2006; Xi et al., 2010).

Xi et al. (2010) used MMN to examine categorical perception of tones in Chinese. The authors found that although both across-category and within-category variations of tones elicited MMNs in bilateral frontal-central areas, the former elicited larger MMNs in the left than in the right hemisphere, whereas the latter elicited larger MMNs in the right than in the left hemisphere. These patterns provide support for the categorical perception of tones, while at the same time providing evidence for the hypothesis that both hemispheres are involved in the processing of tones (Gandour, 2006), in that the processing of within-category stimuli mainly involves the acoustic processing of pitch information while the processing of across-category stimuli mainly involves the phonological information. Moreover, within the time window of MMN (200–250 ms), both acoustic information and phonological information are processed in parallel, which contrasts with the two-stage hypothesis of Luo et al. (2006).

An fMRI study by Zhang et al. (2011) further indicated an interaction between acoustic processing and phonological processing in the two hemispheres: across-category variations elicited stronger activation in the left middle temporal gyrus than did the within-category variations, whereas within-category variations elicited stronger activation in the right superior temporal gyrus. These fMRI findings indicate how low-level acoustic analysis (within-category) is modulated by high-level phonological representations (across-category), and are therefore consistent with and complementary to the MMN findings reported in Xi et al. (2010). In another ERP study of categorical perception of tones at the attentive stage, Zhang et al. (2012a) found that the conscious processing of the within-category stimuli and the across-category stimuli involved the N2b and P3b components. The data were compatible with the response patterns reported by Xi et al. (2010): for both N2b and P3b, across-category stimuli elicited a larger response in the left recording sites than in the right, while the within-category stimuli elicited the same response in both hemispheres.

Such findings of categorical perception of tones also have significant implications for understanding disorders during processing. For example, Zhang et al. (2012b) found that Chinese-speaking children who were at risk for dyslexia showed no significant differences between the across-category vs. within-category stimuli, in contrast to both adults and age-matched normally developing children. It could be that children with reading disorders perceive the phonological information in the same way as they do acoustic information.

One important question that remains unclear from previous research is whether acoustic information and phonological information of tones are fundamentally different (i.e., different kinds of information), or whether they are only different on a continuum (same kinds of information). As suggested by the functional hypothesis (van Lancker, 1980), the pitch information contained in Chinese lexical tones can be divided into two types depending on whether it serves as acoustic signal or phonological unit: these are two distinct kinds and also have different cognitive processing consequences. Gandour et al. (2000) further distinguished between the pure acoustic features vs. acoustic features plus semantic features in the perception of tones such as lexical tones in Chinese and Thai. An alternative view to the above is that there is no fundamental difference between the two in terms of cognitive processing: the pitch contrasts in tones are spectral variations, and the processing of these contrasts always involves the same type of acoustic analysis regardless of its specific features (e.g., Zatorre and Belin, 2001; Warrier and Zatorre, 2004; Ren et al., 2009). This alternative view suggests that the fundamental features of the acoustic information and the phonological information both lie in the spectral variations and are therefore similar with regard to neural mechanisms: the processing of phonological information and that of acoustic information involve the same neural patterns, both in the right hemisphere. This argument contrasts with the suggestion of different neural correlates of acoustic vs. phonological processing, as discussed above (e.g., MMN and fMRI evidence of left vs. right lateralized patterns).

In the present study, we designed an experiment to further explore the different roles that acoustic vs. phonological information may play in the perception of Chinese tones. In particular, we relied on the MMN patterns that have been previously used successfully in this area (Luo et al., 2006; Xi et al., 2010). Previous work, including Xi et al. (2010) and Zhang et al. (2011), has examined the contrast between acoustic information and phonological information only as the contrast between within-category and across-category variations. In the MMN paradigm, this means that the across- or within-category deviants are equally spaced when compared with the standard stimuli of tonal contrasts. In the current study, to further differentiate the two types of information, we added a new variable, the acoustic interval, which refers to the size of the F0 interval between the deviant and the standard stimuli. This new variable has two levels, small acoustic interval vs. large acoustic interval.

The stimulus materials used by Xi et al. (2010) and Zhang et al. (2011) were chosen from a Chinese lexical tonal continuum, from the high-rising tone (tone 2, /pa2/) to the falling tone (tone 4, /pa4/). The differences between the two tones can be acoustically manipulated according to the F0 contour, to produce a range of in-between tones, resulting in a continuum of stimulus 1 to stimulus 11 (see **Figure 1** and Section Materials and Methods for details). Previous studies were focused on categorical perception of tones, so only stimuli 3, 7, 11 were used in their experiments (e.g., Xi et al., 2010; Zhang et al., 2011). In the current study, we continued to use the 11-stimulus set, but we included stimuli 5 and 9 in our experiment, to produce an orthogonal design that involved both phonological categories and acoustic intervals (see Section Materials under Methods for details). This design would allow us to systematically test the role of acoustic vs. phonological information in the processing of Chinese lexical tones, with both small and large interval differences in the acoustic signal and across and within differences in the phonological category. Specifically, we predict that the MMN patterns may reflect different impact of these two types of information, with regard to both the magnitude and the peak latency. We hypothesize that if the acoustic and the phonological information belong to the same type of information (different dimensions of the auditory input), as proposed by Zatorre et al., the variations in the MMN mean amplitudes and peak latencies would be similar. On the other hand, if these two types of information are fundamentally different and reflect different cognitive processes, we may see variations of MMN patterns as a function of their differences, for example, in amplitude and time course.

## **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Thirty-six neurologically healthy volunteers (21 females, mean age 20 years, range 19–21 years) took part in the study. All participants had normal hearing and minimal musical experience, and were native speakers of Mandarin Chinese recruited from South China Normal University. The participants were all right-handed based on their self-assessments. They gave written consent before they took part in the experiment and received monetary compensation for their participation. This study was approved by the ethics review board of South China Normal University.

#### **EXPERIMENTAL DESIGN**

Our study used a within-subject design with two factors: phonological category (within-category/across-category), and acoustic interval (large/small). See *Materials* section below for details. The dependent variables were the mean amplitude and the peak latency of the MMNs elicited by the stimuli.
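The two dependent variables can be illustrated with a minimal sketch, assuming the MMN is taken as the deviant-minus-standard difference wave and measured within a fixed search window. The 500 Hz sampling rate is the one stated in the EEG recording section; the epoch onset (-100 ms), the 100–300 ms window, and all function names are assumptions made for illustration, not the parameters or code of the original analysis.

```python
SRATE_HZ = 500.0                # sampling rate stated in the EEG recording section
EPOCH_START_MS = -100.0         # assumed epoch onset relative to stimulus onset
MMN_WINDOW_MS = (100.0, 300.0)  # assumed search window for the MMN


def difference_wave(deviant, standard):
    """Deviant-minus-standard ERP, sample by sample."""
    if len(deviant) != len(standard):
        raise ValueError("ERPs must have the same number of samples")
    return [d - s for d, s in zip(deviant, standard)]


def sample_index(time_ms):
    """Index of the sample closest to a latency given in ms."""
    return round((time_ms - EPOCH_START_MS) * SRATE_HZ / 1000.0)


def mmn_measures(deviant, standard, window_ms=MMN_WINDOW_MS):
    """Return (mean amplitude, peak latency in ms) of the difference wave."""
    diff = difference_wave(deviant, standard)
    first, last = sample_index(window_ms[0]), sample_index(window_ms[1])
    segment = diff[first:last + 1]
    mean_amplitude = sum(segment) / len(segment)
    # The MMN is a negativity, so the peak is the most negative sample.
    peak = min(range(len(segment)), key=segment.__getitem__)
    peak_latency_ms = EPOCH_START_MS + (first + peak) * 1000.0 / SRATE_HZ
    return mean_amplitude, peak_latency_ms


# Toy ERPs (hypothetical data): a single negative deflection at 220 ms.
n_samples = 301  # -100 ms to 500 ms at 500 Hz
standard_erp = [0.0] * n_samples
deviant_erp = [0.0] * n_samples
deviant_erp[sample_index(220.0)] = -3.0
mean_amp, latency = mmn_measures(deviant_erp, standard_erp)
```

In this toy case the peak latency is recovered at 220 ms. A mean amplitude taken in a narrow window centred on the peak, rather than over the whole search window, would be an equally plausible choice.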

#### **MATERIALS**

The materials were the same as those used in Xi et al. (2010; see **Figure 1**), except as noted below. They were acoustically manipulated based on Chinese lexical tones, to produce a continuum from the high-rising tone (tone 2) to the falling tone (tone 4). Two Chinese monosyllables, /pa2/ and /pa4/, were first produced by a native speaker and digitally edited with Sound-Forge (SoundForge9, Sony Corporation, Japan) to the same duration (200 ms). Then the two monosyllables were further edited with the Praat software<sup>2</sup> to keep the same acoustic features except the pitch contour. They were then used as the endpoint stimuli to create the 10-interval lexical tone continuum in Matlab (Mathworks Corporation, USA) using the STRAIGHT toolbox (Kawahara et al., 1999). Thus there were 11 artificially generated stimuli in the continuum that differed only in the F0 (labeled as stimulus 1 to stimulus 11). The F0 intervals between any two adjacent stimuli were also acoustically manipulated to be the same, for the purpose of the experiment.
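The idea of an 11-step continuum with equal F0 intervals between adjacent stimuli can be sketched as a plain linear interpolation between two endpoint F0 contours. This is a simplified stand-in and not the STRAIGHT morphing procedure itself; the endpoint contours below are invented values in Hz, and interpolating on a linear Hz scale (rather than, say, a semitone scale) is an assumption.

```python
def make_continuum(f0_start, f0_end, n_stimuli=11):
    """Interpolate linearly between two F0 contours sampled at the same time points."""
    if len(f0_start) != len(f0_end):
        raise ValueError("endpoint contours must have the same length")
    steps = n_stimuli - 1  # 10 equal intervals for 11 stimuli
    continuum = []
    for k in range(n_stimuli):
        w = k / steps  # 0.0 at the tone 2 endpoint, 1.0 at the tone 4 endpoint
        continuum.append([(1 - w) * a + w * b for a, b in zip(f0_start, f0_end)])
    return continuum


# Hypothetical endpoint contours (Hz), five time points across the 200 ms syllable.
tone2_f0 = [180, 190, 205, 225, 250]  # rising
tone4_f0 = [260, 240, 215, 190, 165]  # falling
continuum = make_continuum(tone2_f0, tone4_f0)
```

Here `continuum[0]` corresponds to stimulus 1 and `continuum[10]` to stimulus 11, so stimulus *n* is `continuum[n - 1]`.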

In the Xi et al. (2010) study, stimuli 3, 7, and 11 were used to construct the across-category pair (3 vs. 7) and the within-category pair (7 vs. 11) in their experiment. These contrast pairs, 3 vs. 7 and 7 vs. 11, had large acoustic intervals for both across- and within-category contrasts. Given the focus of the previous studies on categorical perception, they served to address the research questions well. In the present study, we constructed an orthogonal design incorporating two variables, phonological category (as in Xi et al., 2010), and acoustic interval, in the following way: stimuli 5 vs. 7 (small interval, across category), stimuli 3 vs. 7 (large interval, across category), stimuli 7 vs. 9 (small interval, within category), stimuli 7 vs. 11 (large interval, within category). This new design would allow us to address additional research questions, as discussed in the Introduction section.
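The 2 × 2 structure of these four contrasts can be written out explicitly. The sketch below assumes that stimulus 7 serves as the standard, since every contrast listed above includes it; the variable names are illustrative only.

```python
STANDARD = 7  # assumed standard stimulus: every contrast in the text includes stimulus 7

# Deviant stimulus -> design cell, as listed in the text.
DESIGN = {
    5: {"category": "across", "interval": "small"},
    3: {"category": "across", "interval": "large"},
    9: {"category": "within", "interval": "small"},
    11: {"category": "within", "interval": "large"},
}


def interval_steps(deviant, standard=STANDARD):
    """Acoustic distance from the standard, in continuum steps."""
    return abs(deviant - standard)


# Each (category, interval) combination should occur exactly once.
cells = {(cell["category"], cell["interval"]) for cell in DESIGN.values()}

# Collect the step sizes that realise each interval level.
steps_by_interval = {}
for deviant, cell in DESIGN.items():
    steps_by_interval.setdefault(cell["interval"], set()).add(interval_steps(deviant))
```

Under this reading, a small interval corresponds to 2 continuum steps and a large interval to 4 steps, on either side of the standard.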

To ensure that the materials met the needs of our ERP experiment, we conducted a norming study of the experimental materials with a separate group of participants. Fifteen participants took part in an identification task and a discrimination task. In the identification task, they were asked to identify whether each stimulus (stimulus 3, 5, 7, 9, 11) was the high-rising tone (tone 2) or the falling tone (tone 4). Twenty trials for each stimulus were randomly presented in isolation, with no stimulus being presented three times consecutively. In the discrimination task, participants were asked to judge whether a pair of presented stimuli were the same or not. The stimulus pairs were presented randomly, and the pairs which contained different stimuli were presented in both directions, each for 10 trials. To balance the number of "yes" or "no" responses, we included pairs of the same stimulus, each presented 16 times (80 in total). Practice trials were given to the participants before the actual norming experiment.

### **PROCEDURE**

An improved passive oddball paradigm, proposed by Duncan et al. (2009), was used for this experiment. Classic oddball paradigms usually contain one kind of deviant stimulus and one kind of standard stimulus in a block, whereas the improved paradigm contains more than one kind of deviant stimulus in a block. The improved oddball paradigm has been shown to produce MMNs more quickly and effectively. A total of 1015 stimuli were presented to the participants during the experiment: 15 standard stimuli at the beginning, followed by 600 standard stimuli and 400 deviant stimuli (100 of each deviant type). The 15 initial standard stimuli were presented to help the participants adapt to the experiment. The deviants were presented pseudo-randomly among the standards, and any two adjacent deviants were different. Each stimulus was presented for 200 ms. The stimulus-onset asynchrony (SOA) was set to 800 ms. The stimulus presentation lasted for about 15 min.
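One way to build a stimulus sequence satisfying these constraints is sketched below. It is an illustration only, not the presentation script used in the study: the labels (`S`, `D3`, `D5`, `D9`, `D11`), the function name, the seed, and the block-shuffling strategy are assumptions, and "adjacent deviants" is read here as successive deviants in order of occurrence.

```python
import random


def oddball_sequence(n_lead_in=15, n_standards=600,
                     deviants=("D3", "D5", "D9", "D11"),
                     n_per_deviant=100, seed=0):
    """Return a list of stimulus labels: lead-in standards, then a mixed block."""
    rng = random.Random(seed)
    # Deviant order: concatenate shuffled blocks that contain each deviant once,
    # reshuffling a block whenever it would repeat the previous deviant.
    order = []
    for _ in range(n_per_deviant):
        block = list(deviants)
        rng.shuffle(block)
        while order and block[0] == order[-1]:
            rng.shuffle(block)
        order.extend(block)
    # Scatter the deviants over randomly chosen slots; all other slots are standards.
    n_main = n_standards + len(order)
    deviant_slots = sorted(rng.sample(range(n_main), len(order)))
    main = ["S"] * n_main
    for slot, label in zip(deviant_slots, order):
        main[slot] = label
    return ["S"] * n_lead_in + main


sequence = oddball_sequence()
```

Shuffling in blocks guarantees exactly 100 trials per deviant type and rules out immediate repetitions of the same deviant type, at the cost of a somewhat more regular order than unconstrained sampling would give.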

Participants performed a passive auditory task. They were instructed to watch a movie (Whisper of the Heart) with the soundtrack turned off (e.g., Duncan et al., 2009). They watched the movie for 5 min before the experiment started. No overt responses were required of the participants. To ensure that the participants paid attention to the movie, the experimenter asked them to answer five questions about the movie after the experiment. The experiment lasted 22 min in total (5 min of movie viewing plus 15 min for the experimental session proper).

#### **ELECTROENCEPHALOGRAM (EEG) RECORDING**

EEG was recorded using a 64-channel (Ag–AgCl) NeuroScan system (NeuroScan). Electrodes were positioned following the 10–20 system convention. The reference electrode was placed at the tip of the nose and the ground electrode was placed at FPz. The vertical electro-oculogram (EOG) was recorded supra- and infra-orbitally from the left eye. The horizontal EOG was recorded from the left vs. right orbital rim. The impedance of each electrode was kept below 5 kΩ. EEG and EOG signals were digitized online at 500 Hz and band-pass filtered from 0.05 to 100 Hz.

#### **DATA ANALYSIS**

Off-line signal processing was carried out using Scan 4.3 (NeuroScan). The data were first re-referenced to the bilateral mastoids (M1 and M2), and some artifacts were rejected manually. Data from two participants were excluded from further analyses because of excessive eye blinking. The data were then corrected for horizontal and vertical eye movements and segmented into 700-ms epochs, including a 100-ms pre-stimulus baseline. The baseline was then corrected according to Zhao (2010), and trials with eye blinks or other activity beyond the range of −80 to 80 μV were rejected. The data from the whole-head recordings were off-line band-pass filtered (1–30 Hz) with a finite impulse response filter. Finally, the ERPs evoked by the standard stimuli and the deviant stimuli were calculated by averaging the individual trials for each subject. Only data with at least 80 accepted deviant trials in each deviant condition were retained. With this criterion, data from another two participants were excluded from further analyses. MMNs were then derived by subtracting the ERPs evoked by the standard stimuli from those evoked by the deviant stimuli.

<sup>2</sup>http://www.fon.hum.uva.nl/praat/

On the basis of findings from previous studies, we selected three recording sites for statistical analyses: F3, F4, and FZ. In previous studies, the MMN component typically peaks around 200–250 ms (Näätänen et al., 1978, 2007; Näätänen and Alho, 1997). Researchers usually choose a time window around this peak based on the grand-average waveforms of their particular data, for example 100–350 ms (e.g., Kaan et al., 2008), 150–300 ms (e.g., Tsang et al., 2011), or 230–360 ms (Wang et al., 2013b). Considering the usual range of the MMN peak and the grand-average waveforms of the present study, we chose an MMN time window of 200–350 ms. The MMN peak for each subject in each of the four conditions was detected within this time window by using the "Peak detection" procedure in Scan 4.3 (NeuroScan). The MMN mean amplitudes were calculated by averaging the responses within the time window ranging from 20 ms before to 20 ms after the MMN peak recorded from electrode FZ. The mean amplitudes and peak latencies at the three chosen recording sites (F3, F4, FZ) were used for further statistical analyses.
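The difference-wave and peak measures described above can be sketched for a single channel as follows. This is a simplified illustration, not the Scan 4.3 procedure: the function name and the synthetic waveform are hypothetical, the epoch is assumed to start at −100 ms with 2-ms steps (matching the 500 Hz sampling rate), and the peak is taken as the most negative point in the 200–350 ms window.

```python
def mmn_measures(deviant_erp, standard_erp, t_start_ms=-100.0, step_ms=2.0,
                 window_ms=(200.0, 350.0), half_width_ms=20.0):
    """Return (peak latency in ms, mean amplitude around the peak) of the difference wave."""
    diff = [d - s for d, s in zip(deviant_erp, standard_erp)]
    times = [t_start_ms + i * step_ms for i in range(len(diff))]
    # The MMN is a negativity, so the peak is the minimum inside the search window.
    in_window = [i for i, t in enumerate(times) if window_ms[0] <= t <= window_ms[1]]
    peak = min(in_window, key=lambda i: diff[i])
    # Mean amplitude over the samples within +/- half_width_ms of the peak.
    around = [diff[i] for i, t in enumerate(times)
              if abs(t - times[peak]) <= half_width_ms]
    return times[peak], sum(around) / len(around)


# Synthetic example: a flat standard and a deviant with a triangular dip at 250 ms.
times = [-100.0 + 2.0 * i for i in range(350)]
standard = [0.0] * 350
deviant = [-3.0 * max(0.0, 1.0 - abs(t - 250.0) / 50.0) for t in times]
latency, amplitude = mmn_measures(deviant, standard)
```

In the study the ±20 ms window was centered on the peak at FZ; applying the same window to F3 and F4 would require passing the FZ latency in rather than re-detecting the peak per channel.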

## **RESULTS**

#### **NORMING EXPERIMENT**

In the identification task, participants were asked to identify whether each stimulus (stimulus 3, 5, 7, 9, 11) was the high-rising tone (tone 2) or the falling tone (tone 4). In the discrimination task, participants were asked to judge whether the two stimuli in a pair were the same or not. Following Xi et al. (2010), we analyzed the proportions of the different tone judgments in the identification task and the proportions of "yes" and "no" responses in the discrimination task. In the identification task, the proportions of trials on which stimuli 3, 5, 7, 9, and 11 were identified as tone 4 were 4.4%, 8%, 87.8%, 97.4%, and 98.2%, respectively. These results showed that the participants identified stimuli 3 and 5 as tone 2 and stimuli 7, 9, and 11 as tone 4. In the discrimination task, the proportions of "no" (i.e., not the same) judgments for pairs 3–7, 5–7, 9–7, and 11–7 were 89.7%, 89.7%, 15%, and 30%, respectively. These results showed that the participants perceived stimuli 3 and 7, and 5 and 7, as different tones, and stimuli 7 and 9, and 7 and 11, as the same tone.

The norming experiment showed that we can reliably treat stimuli 3 and 7, and 5 and 7, as across-category pairs, and stimuli 7 and 9, and 7 and 11, as within-category pairs. Considering the F0 intervals described above, stimuli 3 and 7, and 11 and 7, were treated as the large-interval stimulus pairs, with an interval of four F0 units in each pair. Similarly, stimuli 5 and 7, and 9 and 7, were treated as the small-interval stimulus pairs, with an interval of two F0 units in each pair. On the basis of these results, we took stimulus 7 in the tonal continuum (**Figure 1**) as the standard stimulus, stimulus 3 as the across-category deviant with a large interval, stimulus 5 as the across-category deviant with a small interval, stimulus 9 as the within-category deviant with a small interval, and stimulus 11 as the within-category deviant with a large interval.

#### **ERP EXPERIMENT**

**Figure 2** presents the grand average waveforms of the ERPs elicited by the standard stimulus and four deviants at F3, FZ, F4 electrode locations. As shown in **Figure 2**, different waveforms of the ERPs to the standard stimuli and the four deviants were observed at the three electrode locations.

To ensure that the deviant stimuli elicited MMNs, we conducted four paired-samples *t*-tests comparing the grand average amplitudes of each of the four deviants with those of the standard stimuli in the MMN time window at the three electrode locations. The differences between the grand average amplitudes of stimuli 3, 5, and 11 and those of the standard stimuli were significant (*t*(1,31) = 6.124, *p* < 0.001; *t*(1,31) = 3.928, *p* < 0.001; *t*(1,31) = 2.625, *p* = 0.013 < 0.05). The difference between stimulus 9 and the standard stimuli was marginally significant (*t*(1,31) = 1.718, *p* = 0.096 < 0.1). Thus the four deviant stimuli elicited reliable MMNs.

**Figure 3** presents the different MMNs obtained by subtracting the ERP waveforms of the standard stimuli from those of the deviant stimuli at F3, FZ, F4 electrode locations. In the figure, distinct MMNs were displayed in the MMN time window (200–350 ms) at the three electrode locations.

We conducted two separate repeated-measures 2 × 2 ANOVAs, with phonological category (within-category/across-category) and acoustic interval (large/small) as independent variables, one for the mean amplitude and another for the peak latency of the MMNs. For all analyses, degrees of freedom were adjusted according to the Greenhouse–Geisser method when appropriate.
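Because each factor in this design has only two levels, every effect in the 2 × 2 repeated-measures ANOVA has one numerator degree of freedom, and its *F* value equals the squared one-sample *t* statistic of a per-participant contrast score. The sketch below illustrates that equivalence in plain Python; the function names and the four-participant data set are hypothetical and are neither the study's data nor the software the authors used.

```python
from statistics import mean, stdev


def contrast_f(scores):
    """F(1, n-1) for a one-df within-subject contrast: the squared one-sample t."""
    n = len(scores)
    t = mean(scores) / (stdev(scores) / n ** 0.5)
    return t * t


def anova_2x2_within(data):
    """data: one (across_large, across_small, within_large, within_small) tuple per participant."""
    category = [(al + as_) - (wl + ws) for al, as_, wl, ws in data]
    interval = [(al + wl) - (as_ + ws) for al, as_, wl, ws in data]
    interaction = [(al - as_) - (wl - ws) for al, as_, wl, ws in data]
    return {"category": contrast_f(category),
            "interval": contrast_f(interval),
            "interaction": contrast_f(interaction)}


# Hypothetical MMN mean amplitudes (microvolts) for four participants.
toy_data = [(-3.0, -2.0, -2.0, -1.5),
            (-4.0, -2.5, -2.5, -1.0),
            (-2.5, -2.0, -1.0, -1.0),
            (-3.5, -3.0, -2.0, -1.0)]
effects = anova_2x2_within(toy_data)
```

With two levels per factor, a sphericity correction such as Greenhouse–Geisser leaves the degrees of freedom unchanged, so each effect is evaluated on (1, n − 1) degrees of freedom.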

#### **MMN peak latency**

**Figure 4** presents the mean peak latencies of MMNs at the F3, F4, and FZ electrodes (error bars represent one standard error). ANOVA results showed a significant main effect of phonological category (*F*(1,31) = 15.256, *p* < 0.001, across-category < within-category), indicating that the across-category deviants were processed earlier than the within-category deviants. There was no significant main effect of acoustic interval and no interaction between phonological category and acoustic interval (*p*s > 0.10).

#### **MMN mean amplitudes**

**Figure 5** presents the mean amplitudes of MMNs at the F3, F4, and FZ electrodes. The main effect of phonological category was significant (*F*(1,31) = 20.312, *p* < 0.001, across-category > within-category), showing that across-category deviants elicited larger MMNs than within-category deviants. There was also a significant main effect of acoustic interval (*F*(1,31) = 9.924, *p* < 0.01, large interval > small interval), indicating that large-interval deviants elicited larger MMNs than small-interval deviants. There was no significant interaction between phonological category and acoustic interval (*p* > 0.10).

## **DISCUSSION**

The perception of lexical tones in Chinese involves the processing of both acoustic and phonological information. It has been debated in the literature, however, whether these two types of information are fundamentally different or whether they belong to the same auditory input that involves similar cognitive processes. As we reviewed in the Introduction section, previous studies have investigated the differences between acoustic and phonological information from the perspective of brain lateralization (Xi et al., 2010; Zhang et al., 2011, 2012a). The present study focused on the different roles of acoustic vs. phonological information in native Chinese listeners' processing by manipulating acoustic interval and phonological category. In particular, we used a paradigm from MMN studies of the categorical perception of Chinese lexical tones, examining four kinds of stimuli that differed in phonological category (across vs. within) and acoustic interval (large vs. small).

The results from this study show that the acoustic and the phonological information of Chinese lexical tones have distinct impacts on processing, as revealed by MMNs. The main effects of phonological category and acoustic interval on MMN mean amplitude were both statistically significant. In particular, MMNs elicited by across-category deviants were larger than those elicited by within-category deviants, and MMNs in response to large-interval deviants were larger than those in response to small-interval deviants. The results revealed that both types of information influenced MMN mean amplitude. However, only phonological category showed a significant main effect on MMN peak latency: the MMN peak latencies for the across-category deviants were shorter than those for the within-category deviants.

In the ERP literature, the mean amplitude of ERP components has been used to reflect the extent of neural resources engaged during cognitive processing, and the peak latency has been argued to indicate the time course of the process (e.g., Duncan et al., 2009). The present study found that the acoustic information with different intervals affected only the MMN mean amplitude, whereas the phonological information influenced both the mean amplitude and the peak latency of the MMNs. The results revealed that the acoustic information of Chinese lexical tones impacted the extent of tonal processing, while the phonological information affected not only the extent but also the time course of tonal processing. These patterns showed the different roles that the acoustic information and the phonological information play in the processing of Chinese lexical tones at the pre-attentive stage.

As discussed earlier, Gandour et al. have argued for the functional hypothesis of tonal processing, according to which the phonological information contained in pitch information, such as in tones, is different from general acoustic information, because the former expresses lexical semantic differences (e.g., Gandour et al., 2000, 2002; Wong, 2002). An alternative view, the acoustic hypothesis proposed by Zatorre et al., treats the pitch information in tones as no different from general acoustic information (e.g., Zatorre and Belin, 2001; Warrier and Zatorre, 2004). In the present study, distinct profiles in MMN mean amplitudes and peak latencies were observed to be associated with the acoustic and the phonological variations. These results support the functional hypothesis rather than the acoustic hypothesis.

Some previous studies, such as Luo et al. (2006), proposed that acoustic information of tones is mainly processed at the pre-attentive stage while phonological information is processed at the attentive stage. This proposal agrees with the view that the processing of acoustic information and the processing of phonological information are different. However, it emphasizes that the two kinds of information are processed at different stages. It is generally agreed that people can automatically process stimuli without attention at the pre-attentive stage. Recent studies, in contrast to Luo et al. (2006), have suggested that acoustic and phonological information may be processed in parallel at both the attentive and the pre-attentive stages (Xi et al., 2010; Zhang et al., 2012b). In the present study, we differentiated the acoustic and the phonological information at a more fine-grained level, especially with regard to the acoustic interval of the stimulus. That is, the stimuli contained not only phonological category differences (different tones), but also acoustic interval changes (same tone with different F0). The results showed that both the acoustic and the phonological information of Chinese lexical tones can be processed at the pre-attentive stage by native listeners. In addition, the present study found that the peak latencies of MMNs elicited by the across-category deviants were earlier than those elicited by the within-category deviants. Thus, phonological information associated with lexical tones may be processed even before the acoustic information at the pre-attentive stage. This is a rather surprising finding that directly contrasts with the two-stage model of Luo et al. (2006).

How do we account for the surprising finding that phonological information may be activated even earlier than acoustic information in Chinese lexical tone processing? As discussed above, MMN reflects the automatic detection of distinct stimuli, acoustic or phonological, by the human brain. Considering the auditory stimuli in the present study, the within-category stimuli differed only in the acoustic information, while the across-category stimuli involved differences in both the phonological and the acoustic dimension. Thus, MMN may reflect the latter differences more easily and quickly, which is why we observed shorter peak latency of the across-category stimuli.

Some researchers have proposed that the MMNs elicited by tone variations may result from the long-term memory trace of Chinese lexical tones. Chandrasekaran et al. (2007a) compared the MMN responses of Chinese and English listeners to non-speech stimuli involving pitch variations similar to those of Chinese lexical tones. Their results showed that Chinese listeners had larger MMN responses than English listeners, which indicated that the same non-speech stimuli involving pitch variations may activate long-term memory traces of lexical tones for Mandarin listeners (Chandrasekaran et al., 2007a). The patterns from the present study showed that the MMNs elicited by the stimuli involving both semantic variations and acoustic variations were larger in mean amplitude and earlier in peak latency than those elicited by the stimuli involving only acoustic variations. It seems that long-term memory representations of tones may have contributed to these patterns. Thus, based on Chandrasekaran et al.'s finding and our own results, the activation of the long-term phonological memory trace may play a significant role in enhancing tone perception with regard to the processing of phonological information.

Pitch contour refers to the direction of change in F0 (Gandour, 1983). Chandrasekaran et al. (2007b) investigated whether different pitch contours of Chinese lexical tones have different impacts on tonal processing, using tone 1, tone 2, and tone 3. Their results showed that the MMN peak latency elicited by tone 1 vs. tone 3 was earlier than that elicited by tone 2 vs. tone 3. As the difference in pitch contour shape between tone 1 and tone 3 was larger than that between tone 2 and tone 3, they concluded that different pitch contour shapes affected the MMN peak latencies. In the present study, we compared the within-category stimuli (9 vs. 7; 11 vs. 7) with the across-category stimuli (3 vs. 7; 5 vs. 7), which differ in the direction of the pitch contour (see **Figure 1**). That is, the difference in pitch contour shape between the across-category stimuli and the standard stimulus was larger than that between the within-category stimuli and the standard stimulus. Our results showed that the peak latency of across-category deviants was earlier than that of within-category deviants. This pattern is consistent with Chandrasekaran et al. (2007b) and further indicates the impact that pitch contour shapes have on the MMN peak latency. In addition, as discussed earlier, the present study examined two independent variables together, showing the different impacts of acoustic interval and phonological category in lexical tone processing. However, there remains the question of whether changes in pitch contour shape and changes in phonological information have the same amount of impact on tonal processing. Our current study design cannot yet address this question, and it should be examined in future research.

In conclusion, our study has explored MMN patterns (mean amplitude and peak latency) to identify the cognitive processes associated with the pre-attentive processing of both acoustic and phonological information involved in the perception of Chinese lexical tones. The results showed that the acoustic information of tones impacts only the extent of tonal processing, whereas the phonological information impacts the time course as well as the extent of the processing. Our data suggest that the acoustic information and the phonological information of tones are distinct auditory inputs, at least for native listeners of Chinese, who have long-term experience with the representation of pitch information that differentiates lexical meanings.

#### **ACKNOWLEDGMENTS**

We thank Hua Shu, Linjun Zhang, and Yang Zhang for providing the experimental materials and helping with the experimental design. This work was supported by the National Natural Science Foundation of China (31200785), National Social Science Foundation of China (11CYY023, CBA130125), and Graduate Research Innovation Foundation of South China Normal University (2013kyjj083).

#### **REFERENCES**

Chandrasekaran, B., Krishnan, A., and Gandour, J. T. (2007a). Experience-dependent neural plasticity is sensitive to shape of pitch contours. *Neuroreport* 18, 1963–1967. doi: 10.1097/wnr.0b013e3282f213c5


Zhao, L. (2010). *Experimental Course of ERPs.* Chinese: Southeast University Press.

**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 21 June 2014; accepted: 29 August 2014; published online: 16 September 2014*.

*Citation: Yu K, Wang R, Li L and Li P (2014) Processing of acoustic and phonological information of lexical tones in Mandarin Chinese revealed by mismatch negativity. Front. Hum. Neurosci. 8:729. doi: 10.3389/fnhum.2014.00729*

*This article was submitted to the journal Frontiers in Human Neuroscience*.

*Copyright © 2014 Yu, Wang, Li and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms*.

## Anterior insular cortex activity to emotional salience of voices in a passive oddball paradigm

## *Chenyi Chen<sup>1</sup>, Yu-Hsuan Lee<sup>1</sup> and Yawei Cheng<sup>1,2,3</sup>\**

*<sup>1</sup> Institute of Neuroscience, National Yang-Ming University, Taipei, Taiwan*

*<sup>2</sup> Department of Rehabilitation, National Yang-Ming University, Yilan, Taiwan*

*<sup>3</sup> Department of Education and Research, Taipei City Hospital, Taipei, Taiwan*

#### *Edited by:*

*Sonja A. Kotz, Max Planck Institute Leipzig, Germany*

#### *Reviewed by:*

*Sascha Frühholz, University of Geneva, Switzerland; Alessandro Tavano, University of Leipzig, Germany*

#### *\*Correspondence:*

*Yawei Cheng, Institute of Neuroscience, National Yang-Ming University, 155, Sec. 2, St. Linong, Dist. Beitou, Taipei 112, Taiwan e-mail: ywcheng2@ym.edu.tw*

The human voice, which has a pivotal role in communication, is processed in specialized brain regions. Although a general consensus holds that the anterior insular cortex (AIC) plays a critical role in negative emotional experience, previous studies have not observed AIC activation in response to hearing disgust in voices. We used magnetoencephalography to measure the magnetic counterparts of mismatch negativity (MMNm) and P3a (P3am) in healthy adults while the emotionally meaningless syllables *dada*, spoken with neutral, happy, or disgusted prosodies, along with acoustically matched simple and complex tones, were presented in a passive oddball paradigm. The results revealed that disgusted relative to happy syllables elicited stronger MMNm-related cortical activities in the right AIC and precentral gyrus along with the left posterior insular cortex, supramarginal cortex, transverse temporal cortex, and upper bank of superior temporal cortex. The AIC activity specific to disgusted syllables (corrected *p* < 0.05) was associated with the hit rate of the emotional categorization task. These findings may clarify the neural correlates of emotional MMNm and lend support to the role of AIC in the processing of emotional salience already at the preattentive level.

**Keywords: mismatch negativity (MMN), magnetoencephalography (MEG), anterior insular cortex (AIC), emotional salience**

### **INTRODUCTION**

Mismatch negativity (MMN) has recently been used as an index of the salience of emotional voice processing (Schirmer et al., 2005; Cheng et al., 2012; Fan et al., 2013; Hung et al., 2013; Fan and Cheng, 2014; Hung and Cheng, 2014). MMN reflects the early saliency detection of auditory stimuli, that is, stimulus discrimination based on the perceptual processing of physical features (Pulvermüller and Shtyrov, 2006; Thönnessen et al., 2010). Considering that the anterior insular cortex (AIC) plays a critical role in negative emotional experience (Craig, 2002, 2009), particularly in perceiving disgust, and that magnetoencephalography (MEG) can capture the spatiotemporal dynamics of responses in a passive auditory oddball paradigm, we hypothesized that AIC activation would be associated with emotional MMN.

The AIC is a polysensory cortex involved in the awareness of bodily sensations and subjective feelings (Craig, 2002, 2009). Perceiving a disgusting odor, disgusted faces, and imagining feeling disgust have consistently activated the AIC (e.g., Phillips et al., 1997, 2004; Adolphs et al., 2003; Krolak-Salmon et al., 2003; Wicker et al., 2003; Jabbi et al., 2008). Menon and Uddin (2010) suggested that the AIC is a key region of the emotional salience network that integrates external stimuli with internal states to guide behaviors. However, previous studies have failed to identify the AIC activation associated with disgusted vocal expressions (Phillips et al., 1998).

MEG enables non-invasive measurements of neural activity with sufficient spatial resolution and excellent temporal resolution. MMN and its magnetic equivalent (MMNm), as well as P3a/P3am, can be elicited using a passive auditory oddball paradigm in which participants engage in a task and must ignore stimuli that are presented in a random series, with one stimulus (standard) occurring more frequently than the other stimuli (deviants). P3a/P3am is associated with involuntary attention switches to sound changes (Alho et al., 1998). As a preattentive change detection index, MMN/MMNm can reflect N-methyl-D-aspartate receptor function (Näätänen et al., 2011), which mediates sensory memory formation and emotional reactivity in various neuropsychiatric disorders (Campeau et al., 1992; Barkus et al., 2010). MMNm elicited by emotional (happy and angry) deviants in an oddball paradigm can reflect early stimulus processing of emotional prosodies (Thönnessen et al., 2010). Recent studies have indicated that, in addition to being used as an index of the acoustic features of sounds, such as frequency, duration, and phonetic contents (e.g., Ylinen et al., 2006; Horvath et al., 2008), MMN can also be used as an index of the salience of emotional voices (Cheng et al., 2012; Fan et al., 2013; Hung et al., 2013; Fan and Cheng, 2014; Hung and Cheng, 2014).

Various approaches have been used for the acoustic control of emotional voices. For example, scrambling voices enables the amplitude envelope to be preserved (Belin et al., 2000). In one study, simple tones synthesized from the strongest formant of the vowel were used as the control stimuli in the mismatch paradigm (Čeponienė et al., 2003). In other studies, physically identical stimuli were presented as both standards and deviants (Schirmer et al., 2005, 2008). Because no single acoustic parameter can fully explain strong neural responses to emotional prosodies (Wiethoff et al., 2008), the present study, following the same principles as Belin et al. (2000), employed two stringent sets of acoustic control stimuli, simple tones and complex tones, to control the temporal envelope and core spectral elements of emotional voices [spectral centroid (fn) and fundamental frequency (f0)], respectively.

To elucidate the neural correlates underpinning the emotional salience of voices, we measured MMNm and P3am in a passive auditory oddball paradigm while presenting the neutrally, happily, and disgustedly spoken syllables *dada* to young adults. We hypothesized that, if the AIC is involved in the preattentive processing of emotional salience, then AIC activation would be observed in the source distribution of MMNm, in accordance with the auditory cortices and early response latencies. If AIC activation is specific to voices, MMNm in response to acoustic attributes, i.e., simple and complex tone deviants, would not elicit the AIC activation. If neurophysiologic changes can guide behaviors (Menon and Uddin, 2010), then people exhibiting stronger AIC activation in response to hearing emotional syllables are expected to perform more favorably in emotional recognition. Furthermore, because men and women might engage in dissimilar neural processing of emotional stimuli (Hamann and Canli, 2004), the gender factor was introduced into the analyses.

## **MATERIALS AND METHODS**

#### **PARTICIPANTS**

Twenty healthy participants (10 men), aged 18–30 years (mean ± *SD*: 22 ± 1.9), underwent MEG recording and structural MRI scanning after providing written informed consent. The study was approved by the ethics committee of National Yang-Ming University and conducted in accordance with the Declaration of Helsinki. One person was excluded from data analysis because of motion artifacts. All participants were right-handed without hearing or visual impairments. They had no neurological or psychiatric disorders. Participants received monetary compensation for their participation.

#### **AUDITORY STIMULI**

The stimulus material consisted of three categories: emotional syllables, simple tones, and complex tones. For the emotional syllables, a young female speaker produced the syllables *dada* with two sets of emotional (happy and disgusted) prosodies and one set of neutral prosodies. Within each type of emotional or neutral prosodies, the speaker produced the *dada* syllables more than ten times to enable validation. Sound Forge 9.0 and Cool Edit Pro 2.0 were used to edit the syllables so that they were equally long (550 ms) and loud (max: 62 dB, mean: 59 dB).

Each syllable set was rated for emotionality on a 5-point Likert scale by a total of 120 listeners (60 men). For the disgusted set, listeners rated each stimulus from *extremely disgusted* to *not disgusted at all*. For the happy set, listeners rated from *extremely happy* to *not happy at all*, and for the neutral set, listeners rated from *extremely emotional* to *not emotional at all*. The syllables that were consistently identified as the most disgusted and the most happy (i.e., the highest ratings) as well as the most emotionless (i.e., the lowest rating) were used as the stimuli. The Likert-scale ratings (mean ± *SD*) of the happy, disgusted, and neutral syllables were 4.34 ± 0.65, 4.04 ± 0.91, and 2.47 ± 0.87, respectively.

Although firmly controlling the spectral power distribution may result in the loss of temporal flow associated with formant contents in voices (Belin et al., 2000), synthesizing the temporal envelope and the core spectral elements of voices should enable the maximal control of the spectral and temporal features of vocal and corresponding non-vocal sounds (Schirmer et al., 2007; Remedios et al., 2009). In order to create a set of stimuli that retained acoustical correspondence with the emotional syllables, we synthesized simple and complex tones by using Praat (Boersma, 2001) and MATLAB (The MathWorks, Inc., Natick, MA, USA). Using a sine waveform, we extracted the fundamental frequencies (f0) and the spectral centroid (fn) of each original syllable to produce complex and simple tones, respectively (Supplementary Figures S1, S2). The lower end of the spectrogram at each time point determined the fundamental frequency (f0). For the complex tones, the f0 over time was extracted to preserve the pitch contour. For the simple tones, the spectral centroid (fn), indicating the center of mass of the spectrum, was extracted to reflect the brightness of sounds. The extracted frequencies were then multiplied by the original syllable envelope. Hence, to control temporal features, the three categories (emotional syllables, complex tones, and simple tones) were assigned identical temporal envelopes. To control spectral features, complex tones retained the f0 whereas simple tones retained the fn of the emotional syllables. The length (550 ms) and loudness (max: 62 dB; mean: 59 dB) of all stimuli were controlled.
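One reading of this synthesis step is a sine carrier whose instantaneous frequency follows the extracted track (f0 or fn) and whose amplitude follows the syllable envelope. The sketch below illustrates that idea in plain Python; it is not the authors' Praat/MATLAB code, and the function name, sample rate, frequency track, and envelope are all hypothetical.

```python
import math


def synthesize_tone(freq_hz, envelope, sample_rate=44100):
    """Sine carrier that follows freq_hz sample by sample, scaled by envelope."""
    if len(freq_hz) != len(envelope):
        raise ValueError("frequency track and envelope must have the same length")
    phase = 0.0
    samples = []
    for f, env in zip(freq_hz, envelope):
        samples.append(env * math.sin(phase))
        # Integrate the instantaneous frequency into phase to avoid discontinuities.
        phase += 2.0 * math.pi * f / sample_rate
    return samples


# Hypothetical 550-ms tone at 8 kHz: a rising frequency track and a smooth envelope.
n = 4400
track = [200.0 + 60.0 * i / (n - 1) for i in range(n)]
envelope = [0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1))) for i in range(n)]
tone = synthesize_tone(track, envelope, sample_rate=8000)
```

Accumulating phase, rather than computing `sin(2πft)` directly, keeps the waveform continuous when the frequency changes from sample to sample.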

#### **PROCEDURES**

During MEG recording, participants lay in a magnetically shielded chamber and watched a silent movie with subtitles while the task-irrelevant vocally spoken or synthesized stimuli were presented. To ensure that the auditory stimuli were sufficiently irrelevant, participants attentively watched the movie and answered questions regarding the movie content after data recording.

Three sessions (emotional syllables, complex tones, and simple tones) were conducted. The session order was pseudorandomized among participants. In the emotional session, an oddball paradigm was used in which neutral syllables served as the standard (S) and happy and disgusted syllables served as two equiprobable deviants (D1, D2). During the complex and simple sessions, we applied an identical oddball paradigm to the corresponding synthesized tones so that the relative acoustic features among S, D1, and D2 were controlled across all three categories. Each session consisted of 800 standards, 100 D1s, and 100 D2s. A minimum of two standards was presented between any two deviants. Successive deviants were always different. The stimulus onset asynchrony was 1200 ms.

After MEG recording, participants performed a forced-choice emotional categorization task. While listening to the forty-five stimuli, comprising five Ss, five D1s, and five D2s of each stimulus category, participants identified the emotional characteristic of each stimulus as one of three types (emotionless, happy, or disgusted) in a self-paced manner. The chance level was 33.33% based on three alternatives.

#### **APPARATUS AND RECORDINGS**

The data were recorded by using a 157-channel axial gradiometer whole-head MEG system (Kanazawa Institute of Technology, Kanazawa, Japan). Prior to data acquisition, the locations of five head position indicator coils attached to the scalp and several additional scalp surface points were recorded with respect to fiducial landmarks (nasion and two preauricular points) by using a 3-D digitizer, which digitized each participant's head shape and localized the position of the participant's head inside the MEG helmet. Data were collected at a sampling frequency of 1 kHz. Participants kept their heads steady during MEG recording. The head-shape and head-position indicator locations were digitized at the onset of recording and were later used to coregister the MEG coordinate system with the structural MRI of each participant. Structural MR images were acquired on a 3 T Siemens Magnetom Trio-Tim scanner using a 3D MPRAGE sequence (TR/TE = 2530/3.5 ms, FOV = 256 mm, flip angle = 7°, matrix = 256 × 256, 176 slices/slab, slice thickness = 1 mm, no gap).

#### **MEG PREPROCESSING AND ANALYSIS**

In offline processing of the MEG data, we applied a low-pass filter at 20 Hz (Luck, 2005) and reduced the noise using the time-shifted principal component analysis algorithm (de Cheveigné and Simon, 2007; Hsu et al., 2011). The MEG data were then epoched for each trial type, time-locked to stimulus onset, with a 100-ms prestimulus interval and a 700-ms poststimulus interval. Epochs with a signal range exceeding 1.5 fT at any channel were excluded from the averaging and subsequent statistical analyses, in which the deviant-stimulus averages were calculated based on at least 90 trials per participant. The amplitudes of averaged MEG response waveforms were measured with respect to a 100-ms prestimulus baseline.
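The epoching, range-based rejection, and baseline steps can be sketched in Python with NumPy as follows. This is a simplified illustration rather than the authors' processing code: the function name `epoch_and_average` and its arguments are hypothetical, filtering and denoising are omitted, and the baseline is subtracted per epoch before averaging (which gives the same result as subtracting it from the average, because both operations are linear).

```python
import numpy as np

def epoch_and_average(data, onsets, sfreq=1000, tmin=-0.1, tmax=0.7, reject=None):
    """Cut epochs around stimulus onsets, reject noisy ones, baseline-correct, average.

    data   : array (n_channels, n_samples) of continuous recordings
    onsets : sample indices of stimulus onsets for one trial type
    reject : maximum allowed signal range (max - min) per channel, same unit as data
    """
    pre = int(round(-tmin * sfreq))    # samples before onset (100 at 1 kHz)
    post = int(round(tmax * sfreq))    # samples after onset (700 at 1 kHz)
    kept = []
    for onset in onsets:
        if onset - pre < 0 or onset + post > data.shape[1]:
            continue                   # epoch would run past the recording
        epoch = data[:, onset - pre:onset + post].astype(float)
        signal_range = epoch.max(axis=1) - epoch.min(axis=1)
        if reject is not None and np.any(signal_range > reject):
            continue                   # exclude epochs exceeding the range criterion
        # Subtract the mean of the prestimulus interval from every channel.
        epoch = epoch - epoch[:, :pre].mean(axis=1, keepdims=True)
        kept.append(epoch)
    if not kept:
        raise ValueError("no epochs survived rejection")
    return np.mean(kept, axis=0), len(kept)
```

The second return value gives the number of surviving epochs, which can be checked against a minimum trial count such as the 90 deviant trials per participant mentioned above.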

#### **EVENT-RELATED FIELDS (ERF)**

For amplitude and latency analyses, we used the Isofield Contour Map to identify the channels with the strongest signal in the direction. Because head position variation might unequally contribute to the differential activity observed at individual sensors, we created a composite map by grand-averaging nine conditions, where three D1s, three D2s and three Ss of three categories were pooled together. For the ERF difference, the difference maps [disgusted MMNm (D2-S); happy MMNm (D1-S)] were averaged for each category. Based on the composite maps of each ERF component, we selected the four clusters with the strongest signal in the direction. The amplitudes of the sensory ERF peaks (N1m and P2m), MMNm and P3am were measured as an average within a 60-ms window centered at each participant's individual peak latencies. P3am was defined as the component immediately following MMNm, peaking at 300-500 ms. Two-tailed *t* tests were used to determine the statistical presence (difference from 0 fT) of ERF peaks related to the stimuli.
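The peak-centered amplitude measure (the average within a 60-ms window around an individual peak latency) can be illustrated with a short Python sketch. The function name, the search-window argument, and the polarity handling are assumptions made for illustration; the 300-500 ms search range in the defaults corresponds to the P3am definition given above.

```python
import numpy as np

def mean_amplitude_around_peak(waveform, times, search=(0.3, 0.5), width=0.060, polarity=1):
    """Mean amplitude in a window of `width` seconds centered on the individual peak.

    waveform : 1-D averaged response for one participant and sensor cluster
    times    : 1-D array of time points in seconds, same length as waveform
    search   : (start, end) interval in seconds in which the peak is looked for
    polarity : +1 to find a maximum, -1 to find a minimum
    """
    waveform = np.asarray(waveform, dtype=float)
    times = np.asarray(times, dtype=float)
    idx = np.flatnonzero((times >= search[0]) & (times <= search[1]))
    peak_idx = idx[np.argmax(polarity * waveform[idx])]
    peak_time = times[peak_idx]
    window = (times >= peak_time - width / 2) & (times <= peak_time + width / 2)
    return waveform[window].mean(), peak_time
```

Averaging over a window rather than reading off the single peak sample makes the measure less sensitive to residual noise in the individual waveform.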

Statistical analysis involved Three-Way mixed ANOVAs with two within-subject factors, category (emotional syllables, complex tones, or simple tones) and stimulus [neutral (S), happy (D1), or disgusted (D2)], and one between-subject factor, gender (males vs. females). The dependent variables were the amplitudes and latencies of each component. Bonferroni tests were conducted only when preceded by significant effects.

#### **MEG SOURCE ANALYSIS**

The structural MR images were processed using FreeSurfer (CorTechs Labs, La Jolla, CA and MGH/HMS/MIT Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA) to create a cortical reconstruction of each brain. Minimum-norm estimates (MNEs) (Hämäläinen and Ilmoniemi, 1994) were computed from combined anatomical MRI and MEG data by using the MNE toolbox (MGH/HMS/MIT Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA). For inverse computations, the cortical surface was decimated to 5000–10,000 vertices per hemisphere. We used the boundary-element model method to compute the forward solution, which was an estimate of the magnetic field at each MEG sensor resulting from the activity at each of the vertices. The forward solution was then employed to create the inverse solution, which identified the spatiotemporal distribution of source activity that most accurately accounted for each participant's average MEG data. The noise covariance matrix was estimated according to the prestimulus baselines of the individual trials. Only the components of activation that were in a direction normal to the cortical surface were retained in the minimum-norm solution. The MNE results were then converted into dynamic statistical parameter maps (dSPM), which measured the noise-normalized activation at each source and avoided several inaccuracies of standard minimum-norm calculations (Dale et al., 2000).

To test whether the evoked response significantly differed between conditions, the problem of multiple comparisons was addressed by conducting a cluster-level permutation test across space. For each cortical location within each region of interest (ROI), a paired-samples *t* value was computed for testing the deviant-standard contrast or the contrast between two deviants (*p* = 0.05). We then selected all of the samples for which this *t* value exceeded an a priori threshold (uncorrected *p* < 0.05). Finally, the selected samples were clustered according to spatial adjacency. By clustering neighboring cortical locations that exhibited the same effect, we addressed the multiple comparisons problem while considering the dependency of the data. Cortical dipoles were considered to be neighbors if the distance between them was less than 12 mm. A sample was included in the cluster only when there were at least two neighboring samples in space.
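The thresholding and spatial clustering step can be sketched in Python as below. The sketch covers only the formation of clusters from paired *t* values; the permutation step that yields cluster-level significance is omitted. The function names, the use of Euclidean distance between vertex coordinates, the single-pass neighbor count, and the critical *t* value passed as `t_thresh` are assumptions for illustration and are not taken from the authors' implementation.

```python
import numpy as np

def paired_t(a, b):
    """Paired-samples t value per vertex; a and b are (n_subjects, n_vertices)."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = d.shape[0]
    return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n))

def spatial_clusters(tvals, coords_mm, t_thresh, max_dist=12.0, min_neighbors=2):
    """Group supra-threshold vertices into clusters of spatially adjacent vertices.

    tvals     : (n_vertices,) t values
    coords_mm : (n_vertices, 3) vertex positions in millimetres
    Returns a list of arrays holding the vertex indices of each cluster.
    """
    tvals = np.asarray(tvals, dtype=float)
    coords_mm = np.asarray(coords_mm, dtype=float)
    supra = np.flatnonzero(np.abs(tvals) > t_thresh)
    pts = coords_mm[supra]
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    sign = np.sign(tvals[supra])
    # Neighbors: closer than max_dist, same direction of effect, not the vertex itself.
    adj = (dist < max_dist) & (sign[:, None] == sign[None, :])
    adj &= ~np.eye(len(supra), dtype=bool)
    # Keep only vertices with at least `min_neighbors` supra-threshold neighbors.
    keep = adj.sum(axis=1) >= min_neighbors
    adj = adj[np.ix_(keep, keep)]
    ids = supra[keep]
    # Label connected components with a depth-first search.
    labels = -np.ones(len(ids), dtype=int)
    n_clusters = 0
    for start in range(len(ids)):
        if labels[start] >= 0:
            continue
        labels[start] = n_clusters
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(adj[i]):
                if labels[j] < 0:
                    labels[j] = n_clusters
                    stack.append(j)
        n_clusters += 1
    return [ids[labels == k] for k in range(n_clusters)]
```

In a full permutation test, a cluster statistic (for example the summed *t* value) would be recomputed after randomly swapping condition labels within participants, and the observed clusters compared against that null distribution.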

#### **REGIONS OF INTEREST (ROI) AND SOURCE-SPECIFIC TIME-COURSE EXTRACTION**

The cortical surface of each participant was normalized onto a standard brain supported by FreeSurfer, and the dSPM solutions of all participants were subsequently averaged so that they could be used in the defined regions and time windows of interest. Because the trial number contributes considerably to the inverse source estimation, we selected only the standard immediately preceding each deviant to estimate cortical activity. The dSPM solutions estimated for the standards that immediately preceded D1 and D2 were then averaged to represent the cortical activity for standard sound processing. To further qualitatively clarify the underlying neural correlates of MMNm and P3am, a functional map, using inclusive masking to display significant dSPM activation for either deviant (D1 or D2), was used to select the ROIs for all of the stimulus categories. Specifically, the grand averaged functional maps evoked by D1 and D2 were each overlaid onto a common reconstructed cortical sphere. The ROIs were drawn along the border of the functional maps as well as according to the anatomical criterion where the vertices were optimally parceled using the gyral-sulcal patterns (Fischl et al., 1999; Sereno et al., 1999; Leonard et al., 2010). The ROIs for D1 and D2 were then combined to form an inclusive mask displaying significant dSPM activations for either deviant (D1 or D2). Then, the dSPM time courses were extracted from the predetermined ROIs, and their amplitudes were measured and calculated as described previously (**Figure 1**).

#### **RESULTS**

#### **BEHAVIORAL PERFORMANCE**

**Table 1** shows the performance on the emotional categorization task. A Three-Way mixed ANOVA targeting category (emotional syllables, complex tones, or simple tones) and stimulus (neutral, happy, or disgusted) as the within-subject variables and gender (male vs. female) as the between-subject variable was computed in terms of the hit rate. The category effect [*F*(2, 32) = 4.90, *p* = 0.01] and the interaction between category and stimulus [*F*(4, 64) = 2.80, *p* = 0.03] were significant (**Figure 2**). No significance was observed regarding gender and gender-related interaction. The emotional syllables exhibited more favorable performance than did the complex (*p* = 0.031) and simple (*p* = 0.005) tones. *Post-hoc* comparisons showed that the neutral- relative to emotional-derived tones yielded more favorable performance in the complex [Neutral- > Happy-derived tones: *t*(17) = 1.9, Cohen's *d* = 0.70, one-tailed *p* = 0.035; Neutral- > Disgusted-derived tones: *t*(17) = 2.5, Cohen's *d* = 0.92, one-tailed *p* = 0.01] and simple tones [*t*(17) = 1.8, Cohen's *d* = 0.67, one-tailed *p* = 0.04; *t*(17) = 1.8, Cohen's *d* = 0.67, one-tailed *p* = 0.04], but this pattern was not observed in the emotional syllables (*p* = 0.11; *p* = 0.94). Only the emotional syllables, rather than the complex and simple tones, attained above-chance hit rates (>33.33%) in all emotions, indicating emotional neutrality of acoustic controls.

**FIGURE 1 | The grand averaged functional map evoked by emotional syllables.** The functional map of dSPM solutions estimated for each category (emotional syllables, complex, and simple tones) was superimposed onto a reconstructed anatomical criterion where the vertices were optimally parceled out using the gyral-sulcal patterns. We overlaid the grand averaged functional map in response to emotional syllables onto the common reconstructed cortical sphere as an example.

#### **Table 1 | Behavioral performance.**

#### **SENSORY ERF**

Each stimulus type of each category reliably elicited an N1-P2 complex, which is typically obtained in adults during fast stimulus presentation (Supplementary Table S1 and Figure S3) (Näätänen and Picton, 1987; Pantev et al., 1988; Tremblay et al., 2001; Shahin et al., 2003; Ross and Tremblay, 2009).

Statistical analyses for each identified cluster on the Isofield Contour Map revealed that N1m had the category effect for the cluster over the right anterior region [*F*(2, 36) = 5.33, *p* = 0.009] and P2m had the category effect for the clusters over the left and right posterior regions [*F*(2, 36) = 15.03, *p* < 0.001; *F*(2, 36) = 13.65, *p* < 0.001]. *Post-hoc* analyses indicated that the emotional syllables elicited stronger N1m than did the complex tones (*p* = 0.005) and simple tones (*p* = 0.02) for the cluster over the right anterior region. For those over the left and right posterior regions, the simple tones elicited stronger P2m amplitudes than did the emotional syllables (left: *p* = 0.03; right: *p* = 0.02) and complex tones (*p* = 0.01; *p* < 0.001). There was no significance for gender and gender-related interaction.

#### **MMNm**

All deviant stimuli of each category elicited a significant MMNm. Statistical analyses for each identified cluster on the Isofield Contour Map revealed that there were significant interactions between category and stimulus over the left [*F*(2, 36) = 4.73, *p* = 0.015] and right posterior [*F*(2, 36) = 5.05, *p* = 0.012] clusters. In the left posterior cluster, *post-hoc* analyses showed that the stimulus effect, in which the disgusted (D2) MMNm was larger in amplitude than the happy (D1) MMNm, was present in the emotional syllables (*p* = 0.004), but none was detected in the simple tones (*p* = 0.27) and complex tones (*p* = 0.41). In the right posterior cluster, the stimulus effect was found in the emotional syllables (*p* < 0.001) and simple tones (*p* = 0.005). Gender and gender-related interaction were not significant (*p* > 0.05).

**FIGURE 2 | Hit rate in the emotional categorization task.** The emotional syllables exhibited more favorable performance than did the complex and simple tones. Only the emotional syllables attained above-chance hit rates for each emotion. The asterisk (∗*p* < 0.05) indicates that the hit rate is statistically higher than the chance level (dashed line).

#### **MMNm-RELATED CORTICAL ACTIVITIES**

**Table 2** lists the peak latencies used for analyzing source-specific amplitudes for each ROI. Statistical analyses, using a Three-Way mixed ANOVA targeting category (emotional syllables, complex tones, or simple tones) and stimulus [neutral (S), happy (D1), or disgusted (D2)] as the within-subject variables and gender (males vs. females) as the between-subject variable for each ROI, revealed that the brain regions exhibiting an interaction between the category and stimulus were the right AIC [*F*(4, 68) = 3.35, *p* = 0.015], right precentral gyrus [*F*(4, 68) = 5.54, *p* = 0.001], left supramarginal cortex [*F*(4, 68) = 3.03, *p* = 0.023], upper and lower bank of superior temporal sulcus (uSTS and lSTS) [*F*(4, 68) = 2.78, *p* = 0.03; *F*(4, 68) = 3.41, *p* = 0.03] together with the left posterior insular cortex (PIC) [*F*(4, 68) = 5.09, *p* = 0.006], left and right transverse temporal cortex [*F*(4, 68) = 4.71, *p* = 0.002; *F*(4, 68) = 3.02, *p* = 0.02] (**Figure 3**). Gender and gender-related interaction were non-significant. *Post-hoc* tests indicated that the processing of emotional salience (**Figure 4**), as indicated by stronger cortical activities for disgusted syllables relative to happy syllables (D2 > D1: FDR corrected *p* < 0.05), occurred in the right AIC [*F*(2, 34) = 7.84, *p* = 0.002] and precentral gyrus [*F*(2, 34) = 7.98, *p* = 0.005] along with the left PIC [*F*(2, 34) = 5.55, *p* = 0.008], supramarginal cortex [*F*(2, 34) = 6.71, *p* = 0.004], transverse temporal cortex [*F*(2, 34) = 8.95, *p* = 0.001], and uSTS [*F*(2, 34) = 8.49, *p* = 0.001].

#### **Table 2 | MMNm-related cortical activities.**

In addition, the right AIC activities, which surpassed the dSPM criterion, specifically responded to the disgusted syllables (**Figure 5**). The correlation analysis revealed that the MMNm-related AIC activities were associated with the hit rates for emotional syllables in the emotional categorization task [*r*(18) = 0.49, *p* = 0.036]. Participants exhibiting larger amplitudes in the right AIC activation triggered by disgusted syllables were likely to perform better in the emotional categorization task.

#### **P3am**

Paired *t*-tests used to determine the statistical presence (difference from 0 fT/cm) indicated that all deviants from each category elicited P3am, temporally following MMNm (Supplementary Table S2). The ANOVA model on P3am amplitudes for each identified cluster on the Isofield Contour Map did not find any significant effect of category, stimulus, gender, or their interactions (all *p* > 0.05).

#### **P3am-RELATED CORTICAL ACTIVITIES**

P3am and MMNm had similar brain sources and manifested as two contiguous peaks in the dSPM source-specific time course (**Figure 3** and **Table 3**). Statistical analyses on P3am-related cortical activities used a Three-Way mixed ANOVA targeting category (emotional syllables, complex tones, or simple tones) and stimulus [neutral (S), happy (D1), or disgusted (D2)] as the within-subject variables and gender (male vs. female) as the between-subject variable. The brain regions exhibiting the stimulus effect included the right supramarginal cortex [*F*(2, 34) = 25.37, *p* = 0.032], uSTS [*F*(2, 34) = 23.48, *p* = 0.018], lSTS [*F*(2, 34) = 30.94, *p* = 0.006], and posterior superior temporal sulcus (pSTS) [*F*(2, 34) = 17.44, *p* = 0.004], as well as the left transverse temporal cortex [*F*(2, 34) = 16.67, *p* = 0.002], supramarginal cortex [*F*(2, 34) = 12.00, *p* = 0.008], PIC [*F*(2, 34) = 16.25, *p* = 0.006], uSTS [*F*(2, 34) = 14.81, *p* = 0.003], and lSTS [*F*(2, 34) = 26.42, *p* = 0.001]. None of the ROIs showed an interaction between the category and stimulus. Gender and gender-related interaction were non-significant.

*S, standard (neutral); D1, happy; D2, disgusted.*

## **DISCUSSION**

Although the AIC plays a critical role in the negative experience of emotions (Craig, 2002, 2009; Menon and Uddin, 2010), including disgust (e.g., Phillips et al., 1997, 2004; Adolphs et al., 2003; Krolak-Salmon et al., 2003; Wicker et al., 2003; Jabbi et al., 2008), previous studies have not observed AIC activation in response to hearing disgusted voices. This leaves room for more research to clarify whether AIC activation is specific to disgust or, alternatively, reflects general aversive arousal in response to negative emotions. In contrast to the predicted correlation between AIC activation and disgust recognition, we determined that the AIC activation predicted the performance of emotion recognition in general within the emotional category. This may be partially attributed to the functional role of the AIC in salience processing (Seeley et al., 2007; Sridharan et al., 2008; Menon and Uddin, 2010; Legrain et al., 2011). One MEG study on disgusted faces reported that early insular activation occurs at approximately 200 ms after the stimulus onset of emotionally arousing stimuli, regardless of valence, whereas the later insular response (350 ms) differentiates disgusted from happy facial expressions (Chen et al., 2009). Accordingly, AIC activation occurred in response to disgusted voices at approximately 250 ms in our study. The AIC is a brain region underpinning error awareness and saliency detection (Sterzer and Kleinschmidt, 2010; Harsay et al., 2012). This passive oddball study required no target detection. The AIC activation, which surpassed the dSPM criterion, specifically responded to disgusted syllables rather than happy syllables. Participants exhibiting stronger AIC activities were likely to have higher hit rates in the emotional categorization task (see **Figure 5**). In addition, disgusted relative to happy syllables exhibited stronger MMN-related cortical activities, lending support to the notion that disgusted relative to happy voices might be more acoustically salient (Banse and Scherer, 1996; Simon-Thomas et al., 2009; Sauter et al., 2010).

Using MEG in a passive auditory oddball paradigm, we demonstrated the involvement of AIC in the preattentive perception of disgusted voices. MMNm-related AIC activation was specific to disgusted syllables, but not happy syllables. In addition, acoustically matched simple and complex tones did not activate the AIC in the same manner. Participants who exhibited stronger MMNm-related AIC activations were more prone to obtaining higher hit rates in the emotional categorization task. The involvement of AIC in emotional MMNm appears to be consistent between genders.

The MMNm response was sensitive to the positive and negative valence of emotional voices, as indicated by stronger amplitudes elicited by disgusted syllables than by happy syllables. Particularly, in the left hemisphere, the emotional salience processing was specific to voices rather than their acoustic attributes (see **Figure 4**). This should not be surprising because affective discrimination beyond acoustical distinction emerges early in the neonatal period (Cheng et al., 2012). Hearing angry and fearful syllables relative to happy syllables elicited stronger MMN (Schirmer et al., 2005; Fan et al., 2013; Hung et al., 2013; Fan and Cheng, 2014; Hung and Cheng, 2014). From an evolutionary perspective, disgust, an aversive emotion, exhibits a negativity bias that elicits stronger responses than neutral events do (Lange, 1922; Huang and Luo, 2006).

**FIGURE 5 | The anterior insular cortex activity in response to disgusted syllables.** Disgusted deviants exclusively elicited the activation in the AIC, as indicated by MMNm-related cortical activities. Grand average (*n* = 19 participants) time courses of the mean estimated current strength were extracted from the right AIC (red line: disgust, D2; blue line: happy, D1; black line: standard, S). The AIC activation for disgusted deviants and the hit rates in the emotional categorization task were positively correlated [*r*(18) = 0.49, *p* = 0.036].

**Table 3 | P3am-related cortical activities.**

*S, neutral; D1, happy; D2, disgusted.*

MMNm-related AIC activation may reflect emotional salience at the preattentive level. The significance of this finding lies in the lack of similar findings in the auditory modality in previous studies, despite the widespread evidence of the involvement of the AIC in the experience of negative emotions. In particular, attentively hearing disgusted voices did not activate the AIC (Phillips et al., 1998), indicating that the AIC may be involved in the preattentive processing of disgusted voices. Theoretically, the passive auditory oddball paradigm should be the optimal approach for probing the preattentive processing of emotional voices because MMN can indicate the neural activity in a comatose or deep-sleeping brain (Kotchoubey et al., 2002). Although a silent movie presentation is unable to guarantee the lack of awareness of auditory stimuli, limited attention resources can indeed modulate the neurophysiological processing of emotional stimuli (Pessoa and Adolphs, 2010). We do not assert that the preattentive processing of the emotional salience of voices was dominated only by the AIC. Through the AIC, the cortical-subcortical interactions for coordinating the function of cortical networks might contribute to the neural mechanism underpinning the evaluation of the biological significance of affective voices.

The PIC exhibited stronger MMNm-related cortical activities for disgusted syllables relative to happy syllables and is possibly involved in the representation of emotional salience. The PIC, which has been functionally identified as a portion of the extended auditory cortex, responds preferentially to vocal communication sounds (Remedios et al., 2009). Salient sensory information would reach the multimodal cortical areas, such as the PIC, directly from the thalamus, bypassing primary sensory cortices. This direct thalamocortical transmission is parallel to the modality-specific processing of stimulus attributes via the transmission from the thalamus to the relevant primary sensory cortices (Liang et al., 2013). In the present study, the failure to observe any activation in the thalamus and anterior cingulate cortex in any condition could be attributed to the stringent statistical dSPM criterion as well as the absence of target detection in a passive oddball task. The insular cortex, including the AIC and PIC, can monitor salience (appetitive and aversive) and integrate it with the stimulus effect on the state of the body (Deen et al., 2011).

In addition, the left transverse temporal cortex, a part of primary auditory cortex, was sensitive to the processing of emotional salience [disgust (D2) > happy (D1)]. This finding supports the notion that vocal emotional expression might be processed beyond the right hemisphere, being anchored within sensory, cognitive, and emotional processing systems at an early auditory discrimination stage (Schirmer and Kotz, 2006). The activation of the precentral gyrus observed across all three categories may reflect general attention and memory enhancement during information processing (Chen et al., 2010).

Importantly, the present study identified several future areas of inquiry. First, familiarity might potentially confound the effect of affective modulation observed here. Simple and complex tones are less familiar than emotional voices, and it might be impossible to categorize synthesized tones in relation to natural speech sounds. Second, using a pseudoword, such as *dada*, as an example might limit the generalizability for emotion representation. By using non-linguistic emotional vocalizations (Fecteau et al., 2007), additional studies are needed to verify whether the passive oddball paradigm is optimal for detecting emotional salience. Third, based on three alternatives, we defined the hit rate as the number of hits divided by the total number of trials in each stimulus category. This study did not control false alarm rates with the traditional approaches within the framework of Signal Detection Theory [SDT: d′ = Z(hit rate) − Z(false alarm rate)]. The performance for the acoustic controls showed a skewed distribution, in which participants tended to classify the emotional-derived tones as neutral. Accordingly, a higher hit rate for neutral-derived tones than for happy-/disgusted-derived tones might potentially violate the assumption that acoustic controls are emotionless. On the other hand, the sum of the hit rates for emotional syllables in the emotional categorization task, which prevents false alarms, was a better predictor of the AIC activation than was the hit rate specific to disgusted syllables. The hit rate of 75% for disgusted syllables and 55% for happy syllables corroborated existing findings in the identification of prototypical disgust and pleasure vocal bursts (Simon-Thomas et al., 2009). However, the hit rate of only 55% for happy syllables has to be interpreted with caution. Future research should include the pre-evaluation of voices not only on the target emotion but also on other valence scales.
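For reference, the SDT sensitivity index mentioned above can be computed with a few lines of Python using the standard library. The function name `d_prime` and the optional clipping of extreme rates (a common convention to avoid infinite z-scores) are choices made for this sketch; the example rates are hypothetical and are not data from this study, which did not record false alarm rates in this form.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate, n_trials=None):
    """Sensitivity index d' = Z(hit rate) - Z(false alarm rate).

    If n_trials is given, rates of exactly 0 or 1 are clipped to
    0.5/n_trials and 1 - 0.5/n_trials so that the z-transform stays finite.
    """
    if n_trials is not None:
        lo, hi = 0.5 / n_trials, 1.0 - 0.5 / n_trials
        hit_rate = min(max(hit_rate, lo), hi)
        false_alarm_rate = min(max(false_alarm_rate, lo), hi)
    z = NormalDist().inv_cdf   # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

# Hypothetical example: 75% hits and 25% false alarms.
example = d_prime(0.75, 0.25)
```

Unlike the raw hit rate, d′ separates sensitivity from response bias, which matters when participants favor one response category, as with the tendency to label tones as neutral.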

This MEG study demonstrated right AIC activation in response to disgusted deviants in a passive auditory oddball paradigm. The MMNm-related AIC activity was associated with the emotional categorization performance. The findings may clarify the neural correlates of emotional MMN and support the view that the AIC is involved in the processing of emotional salience at the preattentive level.

#### **ACKNOWLEDGMENTS**

The study was funded by the Ministry of Science and Technology (MOST 103-2401-H-010-003-MY3), National Yang-Ming University Hospital (RD2014-003), Health Department of Taipei City Government (10301-62-009), and Ministry of Education (Aim for the Top University Plan). None of the authors have any conflicts of interest.

#### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnhum.2014.00743/abstract

#### **REFERENCES**


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 06 March 2014; accepted: 03 September 2014; published online: 22 September 2014.*

*Citation: Chen C, Lee Y-H and Cheng Y (2014) Anterior insular cortex activity to emotional salience of voices in a passive oddball paradigm. Front. Hum. Neurosci. 8:743. doi: 10.3389/fnhum.2014.00743*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Chen, Lee and Cheng. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Talking hands: tongue motor excitability during observation of hand gestures associated with words

#### **Naeem Komeilipoor<sup>1,2</sup>, Carmelo Mario Vicario<sup>3</sup>, Andreas Daffertshofer<sup>2</sup>\* and Paola Cesari<sup>1</sup>**

<sup>1</sup> Department of Neurological and Movement Sciences, University of Verona, Verona, Italy

<sup>2</sup> MOVE Research Institute Amsterdam, VU University Amsterdam, Amsterdam, Netherlands

<sup>3</sup> School of Psychology, Bangor University, Bangor, UK

#### **Edited by:**

Carolyn McGettigan, Royal Holloway University of London, UK

#### **Reviewed by:**

Claudia Gianelli, University of Potsdam, Germany

Benjamin Straube, Philipps University, Germany

#### **\*Correspondence:**

Andreas Daffertshofer, MOVE Research Institute Amsterdam, VU University Amsterdam, Van der Boechorststraat 9, Amsterdam, 1081 BT, Netherlands

e-mail: a.daffertshofer@vu.nl

Perception of speech and gestures engages common brain areas. Neural regions involved in speech perception overlap with those involved in speech production in an articulator-specific manner. Yet, it is unclear whether motor cortex also has a role in processing communicative actions like gesture and sign language. We asked whether the mere observation of hand gestures, paired and not paired with words, may result in changes in the excitability of the hand and tongue areas of motor cortex. Using single-pulse transcranial magnetic stimulation (TMS), we measured the motor excitability in tongue and hand areas of left primary motor cortex while participants viewed video sequences of bimanual hand movements associated or not associated with nouns. We found higher motor excitability in the tongue area during the presentation of meaningful gestures (noun-associated) as opposed to meaningless ones, while the excitability of the hand motor area was not differentially affected by gesture observation. Our results suggest that the observation of gestures associated with a word results in activation of the articulatory motor network accompanying speech production.

**Keywords: transcranial magnetic stimulation, tongue motor excitability, speech perception, gesture perception, sign language**

## **INTRODUCTION**

The processes underlying sign and spoken language perception are known to involve overlapping neural populations. Apparently, linguistic information conveyed through gestures and sounds is processed in similar ways (Damasio et al., 1986; Hickok et al., 1996; Neville, 1998; MacSweeney et al., 2002; Newman et al., 2002; Xu et al., 2009; Straube et al., 2012, 2013). This overlap led McNeill (1996) to speculate about a unified communication system.

It has been proposed that the evolutionary transition from gesticulation to speech was mediated by the mirror neuron system, which is believed to underlie the understanding of others' actions and intentions (Rizzolatti and Arbib, 1998). Interestingly, mirror neurons were first discovered in monkey area F5, which is considered the homolog of human area 44 (Broca's area), which hosts speech production (Rizzolatti et al., 1996, 2002; Kohler et al., 2002). Nonetheless, no proper evidence supporting the evolution of language from gesture has emerged. According to this idea, vocal communication has become more and more autonomous at the expense of gestures, which gradually lost their importance. In *the motor theory of speech perception*, Liberman and colleagues have already proposed the motor system to be involved in sensory perception (Liberman et al., 1967; Liberman and Mattingly, 1989). Hence, listeners may perceive spoken language by generating forward models in the motor system by activating articulatory phonetic gestures used to produce acoustic speech signals.

Imaging and transcranial magnetic stimulation (TMS) experiments revealed that speech perception triggers activity in brain areas that are involved in speech production in a somatotopic manner (Fadiga et al., 2002; Watkins et al., 2003; Pulvermüller et al., 2006; D'Ausilio et al., 2014; Möttönen et al., 2014). Repetitive TMS over the left premotor or primary motor cortex causes the capacity of phonetic discrimination to be significantly reduced (Meister et al., 2007; Möttönen and Watkins, 2009; Sato et al., 2009; Möttönen et al., 2014), indicating a causal relationship between the motor system and speech perception. Neural controllers of the articulator's movement seemingly contribute to both production and perception of speech. Nevertheless it has been argued that the activation of motor areas during listening to speech is neither essential in speech perception nor does it reflect phonetic processing of the speech signal as suggested in motor theory of speech perception (Scott et al., 2009). Evidence from functional lesion studies also supports the idea that involvement of motor areas during speech production does not necessarily contribute to speech perception (for review see, e.g., Hickok and Poeppel, 2000). Scott et al. (2009) argued that several different linguistic functions could be served by motor cortex during speech perception, including a specific role in sensorimotor processing in conversation. But is motor cortex activated differently during the observation of communicative actions such as gesture and sign language?

Recently, Möttönen et al. (2010) reported that motor evoked potentials (MEPs) elicited by stimulating the hand representation in the primary motor cortex (M1) did not differ when participants observed signs with known vs. signs with unknown meanings. If the M1 hand area is insensitive to the distinction between actions associated and not associated with words, then other regions in M1, such as the tongue or lip areas, might be better candidates for this (cf. Fadiga et al., 2002; Watkins et al., 2003; Roy et al., 2008; Sato et al., 2010; D'Ausilio et al., 2014). It remains unclear, however, whether the motor representations of tongue and lips are capable of distinguishing between those actions that symbolically represent words (e.g., an object or a state) and those that do not.

In this study we investigated whether observation of newly learned hand gestures paired and not paired with words may result in changes in the excitability of the hand and tongue areas of motor cortex. We studied MEPs recorded from tongue and hand muscles in a group of healthy Italian participants who had been taught some signs in American Sign Language (ASL). Participants were asked to observe signs associated and not associated with words, i.e., trained and untrained signs. We first trained participants to learn the associated words for half of the signs (through visual presentation of the signs with the associated words as subtitles), while the associated words for the other half of the signs were not taught (these signs were presented without subtitles). To ensure that all participants had learned the associated words, they underwent a testing session during which they observed the videos of all the signs, this time without subtitles. They were asked to choose the corresponding words for the observed signs among four possible alternatives displayed on the screen. Finally, participants underwent a TMS session, during which we measured the motor excitability in the tongue and hand areas of the left primary motor cortex while participants were observing the stimuli. We expected that the observation of hand gestures alone would lead to similar excitability of the hand motor representation, regardless of whether they represent a word or not. We also expected that only the observation of hand gestures associated with words would modulate the excitability of the tongue motor representation.

### **MATERIALS AND METHODS**

#### **EXPERIMENTAL PROCEDURE**

The experiment followed a 2 × 2 repeated-measures design with two sign types (meaningful and meaningless, i.e., hand movements associated and not associated with words, respectively) and two muscles (tongue and hand). During the experiment, TMS-induced MEPs were recorded from tongue and hand muscles while participants observed video sequences of hand movements associated or not associated with words. For each experimental condition, 18 MEPs were recorded. The experiment was divided into three sessions: training, test, and TMS.

#### **PARTICIPANTS**

Ten adult non-signers, all native Italian speakers (5 females; mean age ± SD: 23.5 ± 2.6 years), participated in the study. All were right-handed and had normal or corrected-to-normal vision with no history of speaking or hearing disorders. None of the participants were experienced in ASL. The experimental protocol had been approved by the members of the Ethics Committee of the Department of Neurological, Neuropsychological, Morphological and Movement Sciences of the University of Verona. All participants provided their informed consent prior to entering the study, which had been approved by the institutional review board.

#### **STIMULI**

Stimuli consisted of six short (duration 3 s) black and white videos depicting hands performing bimanual movements (the actor's hands and trunk were presented against a gray background). The hand movements were signs in ASL, which were not related in movement structure to any Italian symbolic gestures. Moreover, ASL rather than Italian Sign Language was chosen to rule out the possibility that participants were familiar with the signs used (having seen the signs and learned their related meanings). All the signs were nouns or adverbs (Necklace, Night, Land, Collision, Below, Current) whose Italian translations (Collana, Notte, Terre, Collisione, Sotto, Corrente) contain the double consonants "rr", "ll", or "tt", which require strong tongue mobilization for proper pronunciation; see **Figure 1**. The signs were chosen to share the following features: (1) contraction and visibility of the right hand first dorsal interosseous (FDI) muscle in the videos; (2) having associated words that require strong tongue mobilization when pronounced. The FDI muscle was visible and contracted in all the videos to reduce the variability among stimuli, because previous studies showed that action observation under different circumstances may lead to modulation of corticospinal motor excitability (for a review, see Rizzolatti and Craighero, 2004). Further, to reduce variability among the associated words, so that they shared a common element analogous to the visibility and contraction of the FDI muscle in the videos, we used words with double consonants. These words require strong tongue mobilization when produced, and listening to them has already been shown to modulate tongue motor excitability (Fadiga et al., 2002).

#### **TRAINING SESSION**

Before training, participants were informed that they were going to view six videos of various hand movements, each repeated ten times, three of which had a related word in the form of a subtitle and three of which did not. We restricted the study to six stimuli to ensure that all participants could readily learn the three associated words. Participants were instructed to memorize the gesture-word association from each of the three videos with subtitles. One group learned the words associated with the gestures for Collana, Notte, and Terre, and the other group learned those for Collisione, Sotto, and Corrente. The training was set up as follows: first, a screen-centered fixation cross was displayed for 1000 ms; subsequently, video stimuli (three with subtitles and three without) were presented in random order for 3 s. Participants were asked to be silent during the entire experiment. To test whether the participants had learned the meaning of the signs, they underwent a test session after training.

#### **TEST SESSION**

During the test session the videos of the training session were presented at random without subtitles, each repeated ten times. After each stimulus presentation, participants were asked to choose the corresponding word for the observed sign among four possible alternatives displayed on the screen, responding until they chose the correct answer by clicking the right mouse button with the index finger of the left hand. The four possible choices were the three learned words plus a question mark indicating "I do not know the answer". The choices were displayed centered in the four quadrants of the screen. For every answer participants received feedback on correctness (knowledge of results). The feedback for a correct answer was displayed in white on a black background in the center of the screen, and the feedback for an incorrect answer was displayed in red on a white background. Stimulus order and target position on the screen were randomized. All the participants completed the test session successfully (100% correct responses) without any errors, rendering ongoing ASL learning unlikely.

#### **TMS/EMG SESSION**

#### **Procedure**

The experiment was implemented in E-Prime 2 (Psychology Software Tools, Inc., USA) running on a PC with a Windows XP operating system, which controlled stimulus presentation and randomization of trials and triggered the TMS and EMG recordings. TMS-induced EMG activity was collected from all participants.

During the experiment, the subjects were comfortably seated in an armchair in a dimly lit room at a distance of 80 cm from a computer screen (Asus, 17", 60 Hz refresh rate). Each trial started with a fixation cue (the "+" symbol) presented for 1000 ms, immediately followed by the stimulus, which lasted for 3 s. The left M1 was stimulated via single-pulse TMS delivered through a figure-of-eight coil at 120% of the individual resting motor threshold (over both tongue and hand motor areas). The TMS pulses were generated randomly within the last 2 s of stimulus presentation (from the beginning of the second to the end of the third second), when the FDI muscle was contracting in the observed actions and the meaning had already been conveyed. This was done to ensure that the FDI muscle was clearly observable when the TMS pulses were delivered, and to give the participants more time to recognize the associated word. After each stimulus presentation, participants were asked to choose the corresponding word for the observed sign in the same way they did during the test session. After participants pressed the space bar to continue, the next stimulus was delivered with an inter-stimulus interval of 8 s. Every TMS session took about 15 min and consisted of 36 trials (18 per condition). The two sessions (tongue and hand stimulation) were carried out on the same day and their order was counterbalanced across participants.
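The trial structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the E-Prime script used in the study: the function name `build_session`, the dictionary keys, and the use of a uniform distribution for the pulse onset are all hypothetical; the text only states that pulses fell randomly within the last 2 s of the 3-s stimulus.

```python
import random

def build_session(n_per_condition=18, pulse_window_s=(1.0, 3.0), seed=0):
    """Return a shuffled list of trials for one TMS session (hypothetical sketch)."""
    rng = random.Random(seed)
    # 18 meaningful and 18 meaningless trials, as stated in the text
    conditions = ["meaningful"] * n_per_condition + ["meaningless"] * n_per_condition
    rng.shuffle(conditions)
    trials = []
    for condition in conditions:
        trials.append({
            "condition": condition,
            "fixation_s": 1.0,   # fixation cross, 1000 ms
            "stimulus_s": 3.0,   # video duration
            # Assumption: pulse onset drawn uniformly from the last 2 s of the video
            "tms_onset_s": rng.uniform(*pulse_window_s),
        })
    return trials

session = build_session()
```

Each entry gives the pulse onset relative to stimulus onset, so a value of 1.0 corresponds to the beginning of the second second of the video.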

#### **Data acquisition**

Focal TMS was applied with a 70-mm figure-of-eight coil that was powered by a STM9000 Magnetic Stimulator (ATES Medical Device, IT) producing a maximum output of 2 T at the coil surface. Before each session, the coil was moved over the scalp in order to determine the optimal site from which maximal amplitude MEPs were elicited in the tongue and hand muscles separately. The coil was held tangentially to the scalp with the handle pointing 45° away from the nasion–inion line in a posterolateral direction (Mills et al., 1992) to find the FDI representation area. Following the same procedure pursued in a previous work of our group (Vicario et al., 2014), the tongue area was stimulated with the coil handle oriented at 90° directed straight posteriorly.

The resting motor threshold of the muscles was determined according to standard methods as the minimal intensity capable of evoking MEPs with an amplitude of at least 50 µV in 5 out of 10 trials of the relaxed muscles (Rossini et al., 1994). Bipolar EMG from the tongue muscles was acquired using a pair of Ag-AgCl surface electrodes (Ø 1 cm). The electrodes were pasted on plastic buttons and fixed on a spring of iron zinc. Before recording, electrodes were immersed in a disinfectant solution (Amuchina, sodium hypochlorite 1.1 grams per 100 ml of purified water) for 5 min and rinsed in drinking water. Participants were asked to place their tongue between these two electrodes, adjust the spring so that it fitted the tongue perfectly, and remain as relaxed as possible for the full duration of the experiment. The ground electrode was placed on the forehead of the participant. In separate sessions, EMG activity was recorded from the FDI muscle of the right hand by placing surface electrodes over the muscle belly (active electrode) and over the tendon of the muscle (reference electrode). The ground electrode was placed over the dorsal part of the elbow. The activity of the two muscles was registered in separate blocks, counterbalanced across participants. Electromyography signals were band-pass filtered online (20–3000 Hz), amplified (Digitimer, Hertfordshire, England) and sampled at a rate of 5 kHz (CED Micro 1401, Cambridge Electronic Design, Cambridge, England). Peak-to-peak MEP amplitudes (in millivolts) were calculated off-line using Spike 2 (version 6, Cambridge Electronic Design) and stored on a computer. We determined muscle pre-activation through visual inspection and excluded contaminated trials from the analysis (6.3% of trials; see **Figure 2** for MEP examples).
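For readers unfamiliar with these measures, the following minimal Python sketch illustrates the peak-to-peak amplitude computation and the 5-out-of-10 threshold criterion. It is not the Spike 2 procedure used in the study. The function names and, in particular, the 10–50 ms post-pulse search window are assumptions made for illustration only; the text does not specify the window.

```python
def peak_to_peak(samples_mv, fs_hz=5000, window_ms=(10.0, 50.0)):
    """Peak-to-peak amplitude (mV) within a post-pulse window.

    samples_mv: EMG samples starting at the TMS pulse, sampled at fs_hz.
    window_ms:  hypothetical search window; not specified in the text.
    """
    start = int(window_ms[0] * fs_hz / 1000)
    stop = int(window_ms[1] * fs_hz / 1000)
    segment = samples_mv[start:stop]
    return max(segment) - min(segment)

def reaches_motor_threshold(amplitudes_uv, min_uv=50.0, min_hits=5):
    """True if at least min_hits of the trials evoke an MEP of >= min_uv microvolts."""
    return sum(a >= min_uv for a in amplitudes_uv) >= min_hits

# Toy trace: flat baseline with one positive and one negative deflection
trace = [0.0] * 500
trace[100] = 0.4    # 20 ms after the pulse at 5 kHz
trace[120] = -0.3   # 24 ms after the pulse
amplitude = peak_to_peak(trace)
```

With the toy trace above, the amplitude is the difference between the 0.4 mV peak and the -0.3 mV trough. In practice the threshold test would be applied to ten consecutive trials at a fixed stimulator intensity.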

### **STATISTICS AND RESULTS**

MEP amplitude values were normalized (*z*-scored) for every subject and muscle. A two-way repeated measures ANOVA was performed with two sign types (meaningful and meaningless) and two muscles (tongue and hand). *Post hoc* comparisons were performed by means of *t*-tests applying a Bonferroni correction for multiple comparisons when required. A partial-eta-squared statistic served as effect size estimate. The interaction between sign type and muscle was significant (*F*(1,9) = 7.875, *p* = 0.021, η<sup>2</sup> = 0.46). Tongue cortical excitability was enhanced during the presentation of meaningful (trained) as compared to meaningless (untrained) signs (*p* = 0.02). That is, the presentation of word-associated gestures yielded an increase in tongue MEPs compared to the observation of signs that were not associated with words. The hand MEP *z*-scores did not reveal significant differences between the two types of signs (and, therefore, the mean *z*-scores were close to zero) (*p* > 0.05)<sup>1</sup>. Further *post hoc* analysis (Bonferroni test) indicated that observation of word-associated signs elicited significantly larger MEP *z*-scores in the tongue than in the FDI muscle (*p* = 0.025). By contrast, meaningless signs were accompanied by a relative decrement of MEP *z*-scores in the tongue as compared to the FDI muscle (*p* = 0.019); see **Figure 3**. Moreover, the raw MEP amplitudes recorded from the hand for each individual participant were greater than those recorded from the tongue muscle (cf. **Figure 2**). Because MEP amplitude values were normalized using *z*-scores for each muscle, differences between MEP amplitudes of the two muscles (as shown in **Figure 3**) are not necessarily indicative of differences in the magnitude of excitability.
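The normalization step can be illustrated with a short Python sketch. It assumes that *z*-scores were computed across all trials of a given subject and muscle (pooling both sign types) and that the sample standard deviation was used; neither detail is stated in the text, and the function names and tuple layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def zscore_meps(trials):
    """trials: list of (subject, muscle, sign_type, amplitude_mv) tuples.

    Returns the same tuples with the amplitude replaced by a z-score computed
    within each subject/muscle cell (assumption: sign types pooled).
    """
    by_cell = defaultdict(list)
    for subject, muscle, _, amp in trials:
        by_cell[(subject, muscle)].append(amp)
    # Assumption: sample standard deviation (n - 1 in the denominator)
    stats = {cell: (mean(v), stdev(v)) for cell, v in by_cell.items()}
    normalized = []
    for subject, muscle, sign_type, amp in trials:
        m, s = stats[(subject, muscle)]
        normalized.append((subject, muscle, sign_type, (amp - m) / s))
    return normalized

def condition_means(z_trials):
    """Mean z-score per subject, muscle and sign type (the ANOVA input cells)."""
    cells = defaultdict(list)
    for subject, muscle, sign_type, z in z_trials:
        cells[(subject, muscle, sign_type)].append(z)
    return {cell: mean(v) for cell, v in cells.items()}
```

Because each subject/muscle cell is centered on its own mean, the two condition means within a cell sum to zero when trial counts are equal, which is why values for the two muscles cannot be compared in absolute magnitude.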

### **DISCUSSION**

To the best of our knowledge we have provided the first experimental evidence for the modulation of excitability in the tongue area of M1 as a function of observation of word-associated movements. We found higher cortical excitability in the tongue area during the presentation of word-associated gestures compared with gestures not associated with any words (meaningless). In contrast, the hand motor area presented the same level of excitability for both types of gestures. Our results are in line with the TMS study by Möttönen et al. (2010) showing that MEPs elicited by the stimulation of the hand representation in the left M1 did not significantly differ when participants observed signs with known vs. signs with unknown meanings. To unravel motor cortex modulation during gesture observation, they recorded TMS-induced MEPs from hand muscles of participants during sign language observation. They also compared the MEPs obtained before and after individuals learned the meanings of the signs presented and found that the excitability of left and right hand representation in M1 was equally lateralized before participants knew that the presented hand movements were signs. By contrast, after learning both known and unknown signs, the motor cortical excitability significantly increased only on the left M1 side, supporting the left hemispheric dominance for language processing (Knecht, 2000). Moreover, it has been reported that brief inactivation of Broca's area by use of repetitive TMS affects verbal responses to gesture observation, suggesting the involvement of Broca's area in the instantaneous control of gestures and word pronunciation (Gentilucci et al., 2006). In addition, a recent study by Vicario et al. (2013) showed that M1 might be indirectly involved in the mapping process of newly acquired, action-related, categorical associations.

The current work complements these findings and underscores the contribution of the tongue but not the hand motor area in the processing of communicative hand actions associated with words. Several TMS studies have demonstrated modulation in the excitability of the tongue motor area during speech perception (Fadiga et al., 2002; Watkins et al., 2003; Roy et al., 2008; Sato et al., 2010; D'Ausilio et al., 2014). It has been argued that motor activation during speech perception emerges as a result of different task demands or experimental conditions rather than being an essential activity underlying speech perception (for review see, e.g., Lotto et al., 2009; Scott et al., 2009). Moreover, whether articulatory commands are activated automatically and involuntarily during speech perception is still a matter of debate (McGettigan et al., 2010). Here we have shown that it is not the hand but the tongue motor area that is specifically involved during the observation of gestures associated with a word, although individuals were not required to pronounce that word. Note that an additional control condition, such as videos showing objects, symbols, or fractals associated with specific words, would have enabled us to determine whether the excitability in the M1 tongue area was a function of sign language observation or due to the effects of covert speech associated with the observed video. This should be addressed in future work.

<sup>1</sup>To test whether this non-significant result was due to a lack of statistical power, we conducted an analysis using G\*Power (Faul et al., 2009) with *t*-test as test family, "Means: difference between two dependent means (matched pairs)" as statistical test, and "*A priori*: compute required sample size - given α, power, and effect size" as analysis type. The input parameters and their values were set as follows: Tail(s) = Two, Effect size *dz* = 0.3234 (derived from the difference in the means (0.024) and the standard deviation (0.0742)), α err prob = 0.05 and Power (1 – β err prob) = 0.8. This analysis indicated that the sample size ought to be increased to *N* = 77 for the hand muscle to reach statistical significance at a 0.05 level. It is hence unlikely that the non-significant results found for hand muscle could be attributed to a limited sample size.
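The a priori power analysis reported in footnote 1 can be approximated without G\*Power. The sketch below is a rough cross-check only: it uses the normal approximation to the paired *t*-test, plus a commonly used small-sample correction term, rather than the noncentral *t* distribution that G\*Power evaluates. The function name `required_n_paired` is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def required_n_paired(dz, alpha=0.05, power=0.80):
    """Approximate sample size for a two-tailed paired t-test.

    Returns (n_normal, n_corrected): the plain normal approximation and the
    same value with a small-sample correction of z_alpha**2 / 2 added.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    n_normal = ((z_alpha + z_beta) / dz) ** 2
    n_corrected = n_normal + z_alpha ** 2 / 2
    return ceil(n_normal), ceil(n_corrected)

# Effect size as reported in footnote 1
print(required_n_paired(0.3234))  # → (76, 77)
```

Under these assumptions the corrected approximation agrees with the *N* = 77 reported in the footnote, while the uncorrected normal approximation is one participant lower.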

Previous TMS studies have reported facilitation of the corticospinal tract excitability during the mere observation of another person's actions (for a review see Fadiga et al., 2005). The mirror system is active under various circumstances. For instance, somatotopic activation is present in the motor cortex when individuals observe and imagine actions (for a review, see Rizzolatti et al., 1996; Fadiga et al., 1999; Rizzolatti and Craighero, 2004). Even more critical is the mirror system involvement when the actions are not directly visible to the observer but implicitly presented (Bonaiuto et al., 2007). Building on this extensive literature, one may speculate that humans have internal representations of the movements either observed or imagined and that these internal representations resemble very closely the action when it is actually performed. In the present study we aimed to determine whether and how observation of hand gestures linked and not linked to specific words involves an internal motor simulation. We showed that while the observation of hand movements required similar internal motor simulations within the hand area of M1, regardless of whether they were associated with a word or not, only the observation of hand movements associated with words activated the tongue area of M1, indicating an extra level of coding. We here suggest that disentangling word-associated gestures, i.e., meaningful signs vs. meaningless signs, leads to internal simulations within the language motor regions (i.e., tongue).

The so-called gestural-origins theory of speech ascribes a precise role in language evolution to gestures (Corballis, 2003; see also Vicario, 2013 for a recent discussion). It has been suggested that spoken language evolved from an ancient communication system using arm gestures. Gestures of the mouth might have been added to the manual system to form a combined manuofacial gestural system (Corballis, 2003; Gentilucci and Corballis, 2006). Our results suggest that the perception of sign language might require similar neural activity in speech motor centers as speech perception does. In this sense, our findings contribute to the view that the perception of speech and gesture share common neural substrates. Recent neuroimaging studies have revealed that semantic processing of speech and gestures engages a common brain network with a specific involvement of left motor cortex (Xu et al., 2009; Straube et al., 2012, 2013). This is particularly interesting because it implies that motor cortex may be activated in response to language information independently of the communication modality.

In deaf signers, viewing sign language activates the classical language areas (left frontal and temporal areas), similar to the pattern of activity present when hearing participants listen to spoken words (Neville, 1998; MacSweeney et al., 2002; Newman et al., 2002). Damage to the left hemisphere often produces sign language aphasia just like aphasia in spoken language, suggesting left cerebral hemisphere dominance for both signed and spoken languages (Damasio et al., 1986; Hickok et al., 1996). Taken together, we conclude that the involvement of the tongue region of the primary motor cortex is not limited to the perception and production of speech but might rather play a general role in encoding linguistic information (possibly related to phonological retrieval), even during perception of actions paired with words.

#### **ACKNOWLEDGMENTS**

We thank the Netherlands Organization for Scientific Research for financial support (NWO grant #400-08-127).

#### **REFERENCES**


**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 27 June 2014; accepted: 10 September 2014; published online: 30 September 2014*.

*Citation: Komeilipoor N, Vicario CM, Daffertshofer A and Cesari P (2014) Talking hands: tongue motor excitability during observation of hand gestures associated with words. Front. Hum. Neurosci. 8:767. doi: 10.3389/fnhum.2014.00767*

*This article was submitted to the journal Frontiers in Human Neuroscience*.

*Copyright © 2014 Komeilipoor, Vicario, Daffertshofer and Cesari. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution and reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms*.

## The neural processing of foreign-accented speech and its relationship to listener bias

## *Han-Gyol Yi<sup>1</sup>, Rajka Smiljanic<sup>2</sup> and Bharath Chandrasekaran<sup>1,3</sup>\**

*<sup>1</sup> SoundBrain Lab, Department of Communication Sciences and Disorders, Moody College of Communication, The University of Texas at Austin, Austin, TX, USA*

*<sup>2</sup> UT Sound Lab, Department of Linguistics, College of Liberal Arts, The University of Texas at Austin, Austin, TX, USA*

*<sup>3</sup> Institute for Neuroscience, The University of Texas at Austin, Austin, TX, USA*

#### *Edited by:*

*Carolyn McGettigan, Royal Holloway University of London, UK*

#### *Reviewed by:*

*Patti Adank, University College London, UK*

*Daniel A. Abrams, Stanford University, USA*

#### *\*Correspondence:*

*Bharath Chandrasekaran, SoundBrain Lab, Department of Communication Sciences and Disorders, Moody College of Communication, The University of Texas at Austin, 1 University Station, C7000, 2504A Whitis Ave. (A1100), Austin, TX 78712, USA e-mail: bchandra@utexas.edu*

Foreign-accented speech often presents a challenging listening condition. In addition to deviations from the target speech norms related to the inexperience of the nonnative speaker, listener characteristics may play a role in determining intelligibility levels. We have previously shown that an implicit visual bias for associating East Asian faces and foreignness predicts the listeners' perceptual ability to process Korean-accented English audiovisual speech (Yi et al., 2013). Here, we examine the neural mechanism underlying the influence of listener bias to foreign faces on speech perception. In a functional magnetic resonance imaging (fMRI) study, native English speakers listened to native- and Korean-accented English sentences, with or without faces. The participants' Asian-foreign association was measured using an implicit association test (IAT), conducted outside the scanner. We found that foreign-accented speech evoked greater activity in the bilateral primary auditory cortices and the inferior frontal gyri, potentially reflecting greater computational demand. Higher IAT scores, indicating greater bias, were associated with increased BOLD response to foreign-accented speech with faces in the primary auditory cortex, the early node for spectrotemporal analysis. We conclude the following: (1) foreign-accented speech perception places greater demand on the neural systems underlying speech perception; (2) face of the talker can exaggerate the perceived foreignness of foreign-accented speech; (3) implicit Asian-foreign association is associated with decreased neural efficiency in early spectrotemporal processing.

**Keywords: foreign-accented speech, speech perception, fMRI, implicit association test, neural efficiency, primary auditory cortex, inferior frontal gyrus, inferior supramarginal gyrus**

#### **INTRODUCTION**

Foreign-accented speech (FAS) can constitute an adverse listening condition (Mattys et al., 2012). Perception of FAS is often less accurate and more effortful compared to native-accented speech (NAS; Munro and Derwing, 1995b; Schmid and Yeni-Komshian, 1999; Van Wijngaarden, 2001). The reduced FAS intelligibility has been attributed to deviations from native speech in terms of segmental (Anderson-Hsieh et al., 1992; Van Wijngaarden, 2001) and suprasegmental (Anderson-Hsieh and Koehler, 1988; Anderson-Hsieh et al., 1992; Tajima et al., 1997; Munro and Derwing, 2001; Bradlow and Bent, 2008) cues. Nevertheless, listeners can adapt to FAS following exposure or training (Clarke and Garrett, 2004; Bradlow and Bent, 2008; Sidaras et al., 2009; Baese-Berk et al., 2013). Thus, listeners' perception of FAS can improve over time (Bradlow and Pisoni, 1999; Bent and Bradlow, 2003).

The neuroimaging literature on FAS perception is scant. However, perception of foreign phonemes has been shown to engage multiple neural regions. These include the superior temporal cortex, which matches the auditory input to the preexisting phonological representations ("signal-to-phonology mapping") in the articulatory network, encompassing the motor cortex, inferior frontal gyrus, and the insula (Golestani and Zatorre, 2004; Wilson and Iacoboni, 2006; Hickok and Poeppel, 2007; Rauschecker and Scott, 2009). In particular, the inferior frontal gyrus exhibits phonetic category invariance, in which the response patterns differ according to between-category phonological variances but not to within-category acoustical variances (Myers et al., 2009; Rauschecker and Scott, 2009; Lee et al., 2012). Accordingly, processing of artificially distorted speech, which reduces speech intelligibility but does not necessarily introduce novel phonological representations, has been shown to involve additional recruitment of the superior temporal areas, the motor areas, and the insula, but not the inferior frontal gyrus (for review, see Adank, 2012).

These findings lead to two predictions regarding neural activity during FAS processing. First, lack of adaptation to FAS would manifest in increased activity in the superior temporal auditory areas, due to the increased demand on auditory input processing. The primary auditory cortex is sensitive not only to rudimentary acoustic information such as frequency, intensity, and complexity of the auditory stimuli (Strainer et al., 1997), but also to the stochastic regularity in the input (Javit et al., 1994; Winkler et al., 2009) and attention (Jäncke et al., 1999; Fritz et al., 2003). The response patterns of the primary auditory cortex are modulated by task demands (attentional focus: Fritz et al., 2003; target properties: Fritz et al., 2005), attention, training effects (frequency discrimination: Recanzone et al., 1993), and predictive regularity in the auditory input (Winkler et al., 2009). Furthermore, early acoustic signal processing time for speech stimuli has been shown to be reduced with accompanying visual information, indicating that the primary auditory cortex activity is modulated by crossmodal input (Van Wassenhove et al., 2005). Thus, the primary auditory cortex is attuned to analyzing the details of the incoming acoustic signals, but is also influenced by contextual information and modulated by experience. Second, difficulty in resolving phonological categories would manifest in increased activity in the articulatory network. In contrast to the early spectrotemporal analyses of speech, later stages of phonological processing are largely insensitive to within-category acoustic differences and exhibit enhanced sensitivity to across-category differences. Such phonological categorization is achieved via a complex network involving the inferior frontal cortex, insula, and the motor cortex (Myers et al., 2009; Lee et al., 2012; Chevillet et al., 2013).

Signal-to-phonology mapping, however, is not the only factor that modulates FAS perception. Listener beliefs regarding talker characteristics have been shown to modify the perceptual experience of speech (Campbell-Kibler, 2010; Drager, 2010). Specifically, different assumptions held about speaker properties by individual listeners can potentially alter perception of otherwise identical speech sounds. For instance, explicit talker labels (e.g., Canadian vs. Michigan) and indexical properties (e.g., gender, age, socioeconomic status) implied in visual representation of the talkers can change phonemic perception for otherwise identical speech sounds, even when the listeners are aware that this information is not accurate (Niedzielski, 1999; Strand, 1999; Hay et al., 2006a,b; Drager, 2011). The impact of perceived talker characteristics on FAS perception can be complex. Explicit labels have been linked to increased response times in lexical tasks for FAS, thus indicating increased processing load (Floccia et al., 2009), while visual presentation of race-matched faces has been shown to increase intelligibility for Chinese-accented speech (McGowan, 2011). These findings suggest that listener variability in FAS intelligibility may be partly accounted for using measures of listeners' susceptibility to these indexical cues (Hay et al., 2006b).

In social psychology, the implicit association test (IAT) has been used extensively to quantify the degree of implicit bias which may not be measured using explicit self-reported questionnaire entries (Greenwald et al., 1998, 2009; McConnell and Leibold, 2001; Bertrand et al., 2005; Devos and Banaji, 2005; Kinoshita and Peek-O'Leary, 2005). During an IAT, the participants are instructed to make associations between two sets of stimuli (e.g., American vs. Foreign scenes; Caucasian vs. Asian faces). The response times (RTs) in two conditions (e.g., Caucasian-American and Asian-Foreign vs. Caucasian-Foreign and Asian-American) are compared, and the magnitude of the difference between the mean RTs is considered to reflect the degree of implicit bias toward the corresponding association. In non-speech research domains, the IAT measures have been shown to be positively correlated with neural responses to dispreferred stimuli in various networks, including the amygdala, prefrontal cortex, thalamus, striatum, and the anterior cingulate cortex (Richeson et al., 2003; Krendl et al., 2006; Luo et al., 2006; Suslow et al., 2010). A recent study has shown that native American English listeners with greater implicit bias toward making Asian-to-foreign and Caucasian-to-American associations experienced greater relative difficulties in transcribing English sentences in background noise when these were produced by native Korean speakers than when they were produced by native English speakers. This relationship between racial bias and FAS intelligibility was only observed when the auditory stimuli were paired with video recordings of the speakers producing the sentences (**Figure 1**; Yi et al., 2013). In spite of the novelty of the finding, that study did not reach a conclusive interpretation of the behavioral results, but rather cautiously suggested that the listener bias likely led to altered incorporation of visual cues which are beneficial for enhancing speech intelligibility in adverse listening situations (Sumby and Pollack, 1954; Grant and Seitz, 2000). The precise neural mechanism underlying the relationship between listener bias and FAS perception remains unclear.
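As a concrete illustration of the RT comparison described above, the following Python sketch computes the difference between mean response times in the two pairing conditions. It is a simplified illustration, not the scoring procedure of the present study: the function name, the condition labels, and the sign convention (positive values indicating faster responses for the Asian-Foreign and Caucasian-American pairing) are assumptions, and published IAT scoring algorithms typically add further steps such as error penalties and standardization.

```python
from statistics import mean

def iat_effect_ms(rt_compatible_ms, rt_incompatible_ms):
    """Difference in mean RT (ms) between the two pairing conditions.

    rt_compatible_ms:   RTs for the Caucasian-American / Asian-Foreign pairing
    rt_incompatible_ms: RTs for the Caucasian-Foreign / Asian-American pairing
    Assumed convention: a positive value means faster responses in the
    first pairing, i.e., a stronger Asian-Foreign association.
    """
    return mean(rt_incompatible_ms) - mean(rt_compatible_ms)

# Hypothetical RTs for one participant
effect = iat_effect_ms([600, 650, 700], [800, 850, 900])
print(effect)  # → 200
```

A participant whose mean RTs were equal in both pairings would obtain an effect of zero under this convention.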

In this fMRI study, monolingual native English speakers (*N* = 19) were presented with English sentences produced by native English or native Korean speakers in an MR scanner. The sentences were presented either along with video recordings of the speakers producing the sentences ("audiovisual modality") or without ("audio-only modality"). A rapid event-related design was used to acquire functional images. This setup allowed us to independently estimate BOLD responses to stimulus presentation and motor response on a trial-by-trial basis. Outside the scanner, the participants performed an IAT which was designed to measure the extent of the association between Asian faces and foreignness. Whole brain analyses were conducted to test the prediction that FAS perception would involve increased activation in the superior temporal cortex and the articulatory-phonological network, consistent with previous research on foreign phoneme processing (Golestani and Zatorre, 2004; Wilson and Iacoboni, 2006), speech intelligibility processing (Adank, 2012), and categorical perception in the inferior frontal gyrus (Myers et al., 2009; Rauschecker and Scott, 2009; Lee et al., 2012). ROI analyses were conducted to test whether the degree of implicit association between Asian faces and foreignness would be associated with modifications in the signal-to-phonology mapping process. For this purpose, the ROI analysis was restricted to the primary auditory cortex, involved in auditory input processing, and the inferior frontal gyrus, involved in phonological processing. Previous neuroimaging studies using the IAT as a covariate have consistently shown a positive correlation between IAT scores and the neural response to dispreferred stimuli, which led us to hypothesize that higher IAT scores (stronger Asian-Foreign and Caucasian-American association) would be associated with greater BOLD response to FAS, especially in the audiovisual modality.

### **MATERIALS AND METHODS**

#### **MATERIALS**

#### *Participants*

Nineteen young adults (age range: 18–35; 11 female) were recruited from the Austin community. All participants passed a hearing-screening exam (audiological thresholds < 25 dB HL across octaves from 500 to 4000 Hz), had normal or corrected-to-normal vision, and self-reported to be right-handed. Potential participants were excluded if their responses to a standardized language history questionnaire revealed significant exposure to any language other than American English (LEAP-Q; Marian et al., 2007). Data from one participant (male) were excluded from all analyses due to detection of a structural anomaly. All recruitment and participation procedures were conducted in adherence to the University of Texas at Austin Institutional Review Board.

#### *Audiovisual speech stimuli*

Four native American English (2 female) and four native Korean speakers (2 female) produced 80 meaningful sentences with four keywords each (e.g., "the GIRL LOVED the SWEET COFFEE"; Calandruccio and Smiljanic, 2012). The speakers were between 18 and 35 years of age. The speakers were instructed to read text provided on the prompter as if they were talking to someone familiar with their voice and speech patterns. The NAS stimuli had been rated to be 96.2% native-like, and the FAS stimuli had been rated to be 20.7% native-like (converted from a 1-to-9 Likert scale; Yi et al., 2013). Twenty non-overlapping sentences from each speaker were selected, resulting in 80 sentence stimuli used in the experiment. The video track was recorded using a Sony PMW-EX3 studio camera, and the audio track was recorded with an Audio Technica AT835b shotgun microphone placed on a floor stand in front of the speaker. Camera output was processed through a Ross crosspoint video switcher and recorded on an AJA Pro video recorder. The recording session was conducted on a sound-attenuated sound stage at The University of Texas at Austin. The raw video stream was exported using the following specifications. Codec: DV Video (dvsd); resolution: 720 × 576; frame rate: 29.969730 (**Figure 1**). The raw audio stream was RMS amplitude normalized to 62 dB SNL and exported using the following specifications. Codec: PCM S16 LE (araw); mono; sample rate: 48 kHz; 16 bits per sample.

#### *IAT*

Ten young adult Asian (5 female) and 10 Caucasian (5 female) face images were used for Caucasian vs. Asian face categories (Minear and Park, 2004). All face images had been edited to exclude hair, face contour, ear, and neck information, then rendered into grayscale with constant luminosity (Goh et al., 2010). Public domain images of 10 iconic American scenes (Grand Canyon, Statue of Liberty, Wrigley Field, Golden Gate Bridge, Pentagon, Liberty Bell, White House, Capitol, New York Central Park, Empire State Building) and 10 non-American foreign scenes (Eiffel Tower, Pyramids, Angkor Wat, London Bridge, Brandenburg Gate, Stonehenge, Great Wall of China, Leaning Tower of Pisa, Sydney Opera House, Taj Mahal) were obtained online and used for American vs. Foreign scene categories. No scene image contained face information. All images were cropped to a square proportion. The stimuli and the design used for the IAT were identical to those used in our previous study (**Figure 1**; Yi et al., 2013).

#### **METHODS**

#### *Scan parameters*

The participants were scanned using a Siemens Magnetom Skyra 3T MRI scanner at the Imaging Research Center of the University of Texas at Austin. High-resolution whole-brain T1-weighted anatomical images were obtained via an MPRAGE sequence (*TR* = 2.53 s; *TE* = 3.37 ms; FOV = 25 cm; 256 × 256 matrix; 1 × 1 mm voxels; 176 axial slices; slice thickness = 1 mm; distance factor = 0%). T2∗-weighted whole-brain blood oxygen level dependent (BOLD) images were obtained using a gradient-echo multi-band EPI pulse sequence (flip angle = 60°; *TR* = 1.8 s; 166 repetitions; *TE* = 30 ms; FOV = 25 cm; 128 × 128 matrix; 2 × 2 mm voxels; 36 axial slices; slice thickness = 2 mm; distance factor = 50%) using GRAPPA with an acceleration factor of 2. Three hundred and thirty-four time points were collected, resulting in a scanning duration of approximately 10 min. This was part of a larger scanning protocol which lasted approximately 1 h for each participant.

#### *fMRI task*

Participants were instructed to listen to the recorded sentences and rate the clarity of each one. After the presentation of each stimulus, a screen prompting the response was presented, upon which the participants rated the clarity of the stimulus by pressing one of the four buttons on the button boxes, ranging from 1 (not clear) to 4 (very clear). This was done to ensure that participants were attending to the presentation of the stimuli. The audio track for the sentences was presented via MR-compatible insert earphones (ER30; Etymotic Research), and the visual track was presented via a projector visible through an in-scanner mirror. The stimuli were spoken by native English or native Korean speakers. There were two experimental conditions: an audio-only condition, in which only the acoustic signals were presented while a fixation cross was displayed, and an audiovisual condition, in which the video recordings of the talkers' faces producing the sentences were also presented. All sentences were presented only once in a single session without breaks. Therefore, the 80 sentences were subdivided into 20 sentences for each of the four conditions (native with visual cues; native without visual cues; nonnative with visual cues; nonnative without visual cues). We used a rapid event-related design with jittered interstimulus intervals of 2–3 s. The order of the stimuli followed a pseudorandom sequence predetermined to avoid consecutive runs of stimuli of a given condition.
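One simple way to construct a trial order of this kind is rejection sampling: shuffle the 80 trials and reshuffle until no condition repeats too many times in a row. The following is a minimal sketch under stated assumptions, not the procedure actually used in the study. The condition labels, the run-length limit (`max_run = 2`), the random seed, and the uniform jitter distribution are all hypothetical; the text specifies only 20 trials per condition and a 2–3 s interstimulus range.

```python
import random

# Hypothetical labels for the four conditions (accent x modality).
CONDITIONS = ["native_AV", "native_A", "foreign_AV", "foreign_A"]


def longest_run(seq):
    """Length of the longest stretch of identical consecutive items."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


def make_sequence(n_per_condition=20, max_run=2, seed=0):
    """Shuffle trials until no condition occurs more than max_run times in a row."""
    rng = random.Random(seed)
    trials = [c for c in CONDITIONS for _ in range(n_per_condition)]
    while True:
        rng.shuffle(trials)
        if longest_run(trials) <= max_run:
            break
    # Assumed jitter: one uniform draw between 2 and 3 s per trial.
    isis = [rng.uniform(2.0, 3.0) for _ in trials]
    return list(trials), isis


order, isis = make_sequence()
```

Fixing the seed makes the sequence reproducible, which matches the idea of a predetermined order presented to every participant.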

#### *IAT*

Following the fMRI acquisition session, the IAT was conducted outside the scanner in a soundproof testing room. The IAT procedures were identical to those used in our previous study (Yi et al., 2013). For each trial, a face or scene stimulus was displayed on the screen. The face stimuli differed from those in the main task in the scanner in that they were still images unrelated to sentence production. In the congruous category condition, participants had to press a key on the keyboard when they saw a Caucasian face or an American scene, and a different key for an Asian face or a Foreign scene. In the incongruous category condition, participants had to press a key for a Caucasian face or a Foreign scene, and a different key for an Asian face or an American scene (Devos and Banaji, 2005). Participants were instructed to respond as quickly as possible without sacrificing accuracy. Each condition was presented twice with the key designations switched in a randomized order. This yielded four test blocks. Four practice blocks were included prior to the test blocks, in which only scenes or faces were presented. An incorrect response led to corrective feedback of "Error!" (Greenwald et al., 2003).

#### **ANALYSES**

#### *IAT*

In a standalone analysis, a linear mixed effects analysis (Bates et al., 2012) was run with the response times in milliseconds as the dependent variable to directly quantify the delay in response times due to the incongruous association. The category condition (congruous vs. incongruous) was entered as the fixed effect to measure the delay caused by face-scene pairings incongruent with the implicit association. By-subject random intercepts were included. The optimizer was set to BOBYQA (Powell, 2009). Individual IAT scores were calculated following the standard guidelines (Greenwald et al., 2003). Trials with response times longer than 10,000 ms or shorter than 400 ms were excluded. Response times for incorrect trials were replaced by the mean of the response times for correct trials within the same block, increased by 600 ms. For each of the two pairs of congruous vs. incongruous blocks, the difference in average response time was divided by the standard deviation of response times in the two blocks. These two discrepancy measures were averaged to yield the final IAT score, which was used as a covariate in other analyses.
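The scoring steps described above can be sketched as follows. This is an illustrative outline only, not the scoring script used in the study: the function names and the toy trial data are hypothetical, and computing the standard deviation after the error penalty has been applied is an assumption, since the text does not state the order of these two steps.

```python
from statistics import mean, stdev


def clean_block(trials, lo=400, hi=10_000, penalty=600):
    """trials: list of (rt_ms, correct) pairs for one block.

    Drops trials outside [lo, hi] ms, then replaces each error RT with the
    block mean of correct RTs plus the penalty.
    """
    kept = [(rt, ok) for rt, ok in trials if lo <= rt <= hi]
    correct_mean = mean(rt for rt, ok in kept if ok)
    return [rt if ok else correct_mean + penalty for rt, ok in kept]


def d_for_pair(congruous, incongruous):
    """Mean RT difference divided by the SD of all RTs in the two blocks."""
    con, inc = clean_block(congruous), clean_block(incongruous)
    # Assumption: the SD is taken over the penalised RTs of both blocks.
    return (mean(inc) - mean(con)) / stdev(con + inc)


def iat_score(pairs):
    """Average the discrepancy measure over the block pairs."""
    return mean(d_for_pair(c, i) for c, i in pairs)


# Hypothetical toy data: one too-fast trial, one too-slow trial, one error.
congruous = [(600, True), (700, True), (800, True), (300, True)]
incongruous = [(900, True), (1000, True), (1100, False), (12000, True)]
score = iat_score([(congruous, incongruous), (congruous, incongruous)])
```

A positive score indicates slower responses in the incongruous blocks, that is, a stronger Asian-Foreign and Caucasian-American association.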

#### *Clarity rating*

Clarity ratings for all sentences from each participant were entered as the dependent variable, after being mean-centered to 0, in a linear mixed effects analysis. In order to counteract different clarity criteria across the participants, the model was corrected for by-participant random intercepts. In the first analysis, the fixed effects included the accent and modality of the stimuli, the individual IAT measures, and the ensuing interactions. The optimizer was set to BOBYQA (Powell, 2009).

#### *fMRI preprocessing*

fMRI data were analyzed using FMRIB's Software Library Version 5.0 (Smith et al., 2004; Woolrich et al., 2009; Jenkinson et al., 2012). BOLD images were motion corrected using MCFLIRT (Jenkinson et al., 2002). All images were brain-extracted using BET (Smith, 2002; Jenkinson et al., 2005). Registration to the high-resolution anatomical image (*df* = 6) and the MNI 152 template (*df* = 12; Grabner et al., 2006a) was conducted using FLIRT (Jenkinson and Smith, 2001; Jenkinson et al., 2002). Six separate block-wise first-level analyses were run within-subject. The following pre-statistics processing was applied: spatial smoothing using a Gaussian kernel (FWHM = 5 mm); grand-mean intensity normalization of the entire 4D dataset by a single multiplicative factor; and highpass temporal filtering (Gaussian-weighted least-squares straight line fitting, with sigma = 50.0 s). Each event was modeled as an impulse convolved with a canonical double-gamma hemodynamic response function (phase = 0 s). Motion estimates were modeled as nuisance covariates. The temporal derivative of each event regressor, including the motion estimates, was added. Time-series statistical analysis was carried out using FILM with local autocorrelation correction (Smith et al., 2004). The event regressors consisted of stimulus, response screen, and clarity response. The stimulus regressors were subdivided into accent (native vs. foreign) and modality (audiovisual vs. audio-only) conditions. The missed trials were separately estimated as nuisance variables. Three sets of *t*-test contrast pairs were tested, which examined modality (audiovisual – audio-only; audio-only – audiovisual), accent (native-accented – foreign-accented; foreign-accented – native-accented), and the interaction effects (audiovisual native – audiovisual foreign – audio-only native + audio-only foreign; audiovisual foreign – audiovisual native – audio-only foreign + audio-only native).
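The three contrast pairs listed above can be written as weight vectors over the four stimulus regressors. The sketch below is for illustration only; the regressor order, the dictionary keys, and the example parameter estimates are assumptions and are not taken from the study's design files.

```python
# Assumed order of the four stimulus regressors (hypothetical labels).
REGRESSORS = ["AV_native", "AV_foreign", "A_native", "A_foreign"]

CONTRASTS = {
    # audiovisual - audio-only
    "modality": [1, 1, -1, -1],
    # native-accented - foreign-accented
    "accent": [1, -1, 1, -1],
    # AV native - AV foreign - audio-only native + audio-only foreign
    "interaction": [1, -1, -1, 1],
}


def reverse(weights):
    """The opposite-direction contrast of each pair is the negated vector."""
    return [-w for w in weights]


def apply_contrast(weights, betas):
    """Weighted sum of parameter estimates for one voxel."""
    return sum(w * b for w, b in zip(weights, betas))


# Made-up parameter estimates for a single voxel, in REGRESSORS order.
example_betas = [0.2, 0.5, 0.3, 0.4]
interaction_effect = apply_contrast(CONTRASTS["interaction"], example_betas)
```

Each vector sums to zero, and the interaction weights are the element-wise product of the modality and accent weights, which is why that contrast isolates the difference of differences.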

#### *Whole brain analysis*

Group analysis was performed for each contrast using FLAME1 (Woolrich et al., 2009). To correct for multiple comparisons, post-statistical analysis was performed using randomise in FSL to run permutation tests (*n* = 50,000) for the GLM and yield threshold-free cluster enhancement (TFCE) estimates of statistical significance. The corresponding family-wise error corrected *p*-values are presented in the results (Freedman and Lane, 1983; Kennedy, 1995; Bullmore et al., 1999; Anderson and Robinson, 2001; Nichols and Holmes, 2002; Hayasaka and Nichols, 2003). The results are presented in **Table 1**.

#### **Table 1 | Whole brain analysis results for the contrasts of interest.**

*Clusters are based on the p < 0.025 threshold as well as the size criterion of 10 voxels.*
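The logic of permutation-based family-wise error correction in the whole brain analysis can be illustrated with a much simplified sketch. It is not FSL's implementation and does not include TFCE. It assumes a one-sample, one-sided group test, permutes by random sign flipping, and uses the maximum *t* statistic across voxels to form the null distribution. The function names, the toy data, and the reduced number of permutations are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def t_stat(values):
    """One-sample t statistic against zero."""
    return mean(values) / (stdev(values) / sqrt(len(values)))


def max_t_permutation(data, n_perm=2000, seed=0):
    """data[v][s] is the contrast estimate for voxel v and subject s.

    Returns one family-wise error corrected, one-sided p-value per voxel.
    """
    rng = random.Random(seed)
    n_sub = len(data[0])
    observed = [t_stat(v) for v in data]
    exceed = [1] * len(data)  # the unpermuted data count as one permutation
    for _ in range(n_perm - 1):
        # The same sign flips are applied to every voxel of a subject.
        signs = [rng.choice((-1, 1)) for _ in range(n_sub)]
        max_t = max(t_stat([s * x for s, x in zip(signs, v)]) for v in data)
        for i, t in enumerate(observed):
            if max_t >= t:
                exceed[i] += 1
    return [e / n_perm for e in exceed]


# Toy data for 18 subjects: one voxel with a consistent positive effect,
# one voxel whose values alternate in sign.
effect_voxel = [1.0 + 0.1 * i for i in range(18)]
null_voxel = [(-1) ** i * 0.1 * (i + 1) for i in range(18)]
p_values = max_t_permutation([effect_voxel, null_voxel])
```

Comparing each voxel against the distribution of the maximum statistic is what controls the family-wise error rate: a voxel is significant only if its statistic is unusual relative to the largest value expected anywhere in the image under the null hypothesis.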

#### *ROI analysis*

The ROIs were anatomically defined as the left and right primary auditory cortices (combination of Te 1.0, 1.1, and 1.2; Morosan et al., 2001) and the left inferior frontal gyrus (Brodmann area 44; Amunts et al., 1999) using the Jülich histological atlas (threshold = 25%; Eickhoff et al., 2005, 2006, 2007). Percent changes in BOLD responses for the stimuli in four conditions (native-accented with faces; native-accented without faces; foreign-accented with faces; foreign-accented without faces) were calculated by first linearly registering the ROIs to the individual BOLD spaces using FLIRT with the appropriate transformation matrices generated from the first level analysis and nearest neighbor interpolation (Jenkinson and Smith, 2001; Jenkinson et al., 2002). Then, the parameter estimate images were masked for the transformed ROIs, multiplied by height of the double gamma function for the stimulus length of 2 s (0.4075), converted into percent scale, divided by mean functional activation, and averaged within the ROI, using fslmaths (Mumford, 2007). The percent signal change was entered as the dependent variable in a linear mixed effects analysis. In the mixed effects analysis, the fixed effects included the accent (native vs. foreign), face (faces vs. no faces), individual IAT values and their interaction terms. The model was corrected for by-participant random intercepts (Bates et al., 2012). The optimizer was set to BOBYQA (Powell, 2009).
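The percent signal change conversion described above amounts to a per-voxel rescaling followed by an average over the ROI. A minimal sketch, assuming the parameter estimates and mean functional values have already been extracted for the voxels of one ROI; the function name and the toy numbers are hypothetical, and only the scaling constant 0.4075 comes from the text.

```python
# Height of the double-gamma regressor for a 2 s stimulus (value from the text).
PEAK_HEIGHT_2S = 0.4075


def percent_signal_change(pe_values, mean_func_values, scale=PEAK_HEIGHT_2S):
    """Average percent signal change across the voxels of one ROI.

    pe_values: parameter estimates, one per voxel.
    mean_func_values: mean functional intensity at the same voxels.
    """
    per_voxel = [
        100.0 * pe * scale / mf
        for pe, mf in zip(pe_values, mean_func_values)
    ]
    return sum(per_voxel) / len(per_voxel)


# Toy example with two voxels.
psc = percent_signal_change([50.0, 100.0], [10000.0, 10000.0])
```

Dividing by the mean functional intensity before averaging, as here, weights every voxel equally regardless of its baseline signal.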

### **RESULTS**

#### **BEHAVIORAL RESULTS**

#### *Clarity ratings*

The overall mean clarity rating was 2.94 (*SD* = 1.09). The mean clarity rating for the NAS was 3.28 (*SD* = 1.10) in the audio-only condition and 3.34 (*SD* = 1.06) in the audiovisual condition, while the rating for the FAS was 2.55 (*SD* = 0.96) in the audio-only condition and 2.57 (*SD* = 0.96) in the audiovisual condition. In this analysis, modality, accent, the IAT scores, and their interaction terms were included as fixed effects for the dependent variable of clarity ratings for each sentence, which was mean-centered to 0. The three-way interaction was not significant and was excluded from the final model. The accent effect was significant, *b* = −0.89, *SE* = 0.095, *t* = −9.36, *p* < 0.0001, 95% CI [−1.08, −0.70], indicating that FAS was perceived to be less clear than NAS. The accent by IAT interaction was significant, *b* = 0.33, *SE* = 0.15, *t* = 2.13, *p* = 0.034, 95% CI [0.026, 0.63], indicating that higher IAT values were associated with higher perceived clarity ratings for FAS relative to NAS. The intercept was significant, *b* = 0.97, *SE* = 0.39, *t* = 2.45, *p* = 0.024, 95% CI [0.15, 1.78]. The modality effect was not significant, *b* = −0.015, *SE* = 0.094, *t* = −0.16, *p* = 0.88, 95% CI [−0.20, 0.17], failing to provide evidence that perceived clarity was modified by the availability of visual cues. This is in contrast to the extensive previous literature that has indicated an intelligibility benefit from the audiovisual modality (Sumby and Pollack, 1954; Macleod and Summerfield, 1987; Ross et al., 2007). We ascribe this null finding to the task properties, which did not require the participants to actively decipher the sentences, but only to rate their clarity (Munro and Derwing, 1995a). The IAT effect was not significant, *b* = −0.34, *SE* = 0.70, *t* = −0.49, *p* = 0.63, 95% CI [−1.79, 1.10]. The modality by accent interaction was not significant, *b* = −0.054, *SE* = 0.077, *t* = −0.71, *p* = 0.48, 95% CI [−0.20, 0.096]. The modality by IAT interaction was not significant, *b* = 0.15, *SE* = 0.15, *t* = 0.97, *p* = 0.33, 95% CI [−0.15, 0.45]. These results altogether suggest that FAS is perceived to be less clear by the listeners. Participants with higher IAT scores, i.e., those who were more likely to implicitly associate East Asian faces with foreignness, had a decreased tendency to perceive FAS to be unclear, compared to NAS.

#### *IAT*

The overall mean response time was 948 ms (*SD* = 586 ms). The mean RT was 824 ms (*SD* = 408 ms) in the congruous condition, and 1073 ms (*SD* = 700 ms) in the incongruous condition. One fixed-effect term was included in the model: incongruity of the stimulus pairing. The intercept was significant, *b* = 823.83, *SE* = 50.92, *t* = 19.50, *p* < 0.0001, 95% CI [718.84, 928.81], showing an approximately 820 ms baseline response time. The incongruity effect was significant, *b* = 249.24, *SE* = 19.90, *t* = 12.52, *p* < 0.0001, 95% CI [210.22, 288.27], suggesting that incongruous stimulus pairing delayed each response by approximately 250 ms. The mean IAT score was calculated to be 0.51 (*SD* = 0.25), indicating a general trend of implicit bias toward making the Asian-Foreign association.

#### **fMRI RESULTS**

#### *Audio-only vs. Audiovisual*

BOLD signals were compared across the audiovisual and audio-only stimuli. The [audiovisual – audio-only] contrast revealed extensive activity in the occipital cortex, as the visual information in the faces required computations in the visual modality. Activity in the bilateral middle temporal gyri, left posterior superior temporal gyrus, and the right temporal pole was also observed, presumed to reflect the effort of integrating the visual cues available in the facial stimuli (Sams et al., 1991; Möttönen et al., 2002; Pekkola et al., 2005). The [audio-only – audiovisual] contrast revealed activity in the bilateral superior and middle frontal gyri, right motor and somatosensory cortices, and the bilateral supramarginal gyri (**Figure 2**). The greater activation in the motor and somatosensory areas for audio-only speech than for audiovisual speech is in contrast to previous research that has shown the opposite pattern (Skipper et al., 2005). It is possible that the absence of visual cues induced more effortful processing in these areas. The activity in these regions is presumed to reflect the necessity of additional computation in the speech processing network.

#### *Native- vs. Foreign-accented speech*

BOLD signals were compared across the speaker accent. The [native – foreign] contrast revealed greater activity in the right angular gyrus, supramarginal gyrus, and the posterior middle and inferior temporal gyri. The supramarginal gyri have been suggested to be involved in making phonological decisions, and their activity in the context of this study is presumed to reflect improved phonological processing for NAS relative to FAS (Hartwigsen et al., 2010). The [foreign – native] contrast revealed greater activity in the motor cortex, somatosensory cortex, inferior frontal gyrus, insula, and the anterior cingulate cortex. These areas have been previously indicated to be additionally recruited for perception of foreign phonemes (Golestani and Zatorre, 2004; Wilson and Iacoboni, 2006) or distorted speech (Adank, 2012). The absence of the superior temporal areas from these results is notable, as it runs counter to our initial hypothesis regarding increased computational demand due to unfamiliar auditory input. A potential interpretation of this null finding is that the activity in the superior temporal cortex was more variable than that in the motor and frontal areas, an idea which was tested in the subsequent ROI analysis (**Figure 3**). The modality by accent interaction contrasts did not yield significant results.

#### *ROI analyses*

The ROI analyses were constrained to the left and right primary auditory cortices and the left inferior frontal gyrus. The fixed effects included accent (foreign- vs. native-accented speech), modality (audiovisual vs. audio-only), IAT scores, and their interaction terms. In the left primary auditory cortex, no three-way or two-way interactions were significant, leaving the model with only three main effects of accent, modality and IAT to be considered. The modality effect was significant, *b* = −0.047, *SE* = 0.014, *t* = −3.34, *p* = 0.0016, 95% CI [−0.075, −0.020], suggesting that the audiovisual stimuli reduced computational demand in this region, relative to the audio-only stimuli. The accent effect was not significant, *b* = 0.010, *SE* = 0.014, *t* = 0.71, *p* = 0.48, 95% CI [−0.018, 0.038]. The IAT effect was not significant, *b* = −0.27, *SE* = 0.20, *t* = −1.31, *p* = 0.21, 95% CI [−0.66, 0.13]. The intercept was significant, *b* = 0.45, *SE* = 0.11, *t* = 3.89, *p* = 0.0013, 95% CI [0.22, 0.67]. In the right primary auditory cortex, the three-way interaction across modality, accent, and IAT was significant, *b* = 0.25, *SE* = 0.11, *t* = 2.26, *p* = 0.028, 95% CI [0.042, 0.46], suggesting that higher IAT scores were associated with increased response to FAS with faces (**Figure 4**). The interaction between accent and modality was significant, *b* = −0.15, *SE* = 0.062, *t* = −2.43, *p* = 0.019, 95% CI [−0.27, −0.034], suggesting that the decreased neural efficiency due to FAS was ameliorated by the availability of faces. The accent effect was significant, *b* = 0.090, *SE* = 0.044, *t* = 2.04, *p* = 0.047, 95% CI [0.0070, 0.17], suggesting that FAS increased the computational demand in this region. The accent by IAT interaction was not significant, *b* = −0.061, *SE* = 0.078, *t* = −0.78, *p* = 0.44, 95% CI [−0.21, 0.087]. The IAT by modality interaction was not significant, *b* = −0.13, *SE* = 0.078, *t* = −1.65, *p* = 0.11, 95% CI [−0.28, 0.018]. The IAT main effect was not significant, *b* = −0.23, *SE* = 0.22, *t* = −1.03, *p* = 0.32, 95% CI [−0.65, 0.20]. The modality main effect was not significant, *b* = 0.011, *SE* = 0.044, *t* = 0.25, *p* = 0.81, 95% CI [−0.072, 0.094]. The intercept was significant, *b* = 0.32, *SE* = 0.12, *t* = 2.57, *p* = 0.020, 95% CI [0.077, 0.55].

In the left inferior frontal gyrus, no three-way or two-way interactions were significant, leaving the model with only three main effects of accent, modality and IAT to be considered. The accent effect was significant, *b* = 0.058, *SE* = 0.018, *t* = 3.15, *p* = 0.0028, 95% CI [0.022, 0.094], suggesting that FAS increased computational demand in this region. The IAT effect was not significant, *b* = −0.37, *SE* = 0.20, *t* = −1.83, *p* = 0.086, 95% CI [−0.76, 0.024]. The modality effect was not significant, *b* = −0.026, *SE* = 0.018, *t* = −1.41, *p* = 0.16, 95% CI [−0.062, 0.010]. The intercept was not significant, *b* = 0.12, *SE* = 0.11, *t* = 1.02, *p* = 0.32, 95% CI [−0.11, 0.34]. In the right inferior frontal gyrus, no three-way or two-way interactions were significant, leaving the model with only three main effects of accent, modality, and IAT to be considered. The accent effect was not significant, *b* = 0.0080, *SE* = 0.019, *t* = 0.43, *p* = 0.67, 95% CI [−0.029, 0.045]. The IAT effect was not significant, *b* = −0.36, *SE* = 0.19, *t* = −1.89, *p* = 0.077, 95% CI [−0.74, 0.011]. The modality effect was not significant, *b* = −0.0090, *SE* = 0.019, *t* = −0.48, *p* = 0.63, 95% CI [−0.046, 0.028]. The intercept was not significant, *b* = 0.12, *SE* = 0.11, *t* = 1.10, *p* = 0.29, 95% CI [−0.093, 0.33].

### **DISCUSSION**

Listening to FAS can be challenging compared to NAS. This difficulty can be partly attributed to the demanding process of mapping somewhat unreliable incoming signals to phonology. We hypothesized that FAS perception would require additional spectrotemporal analysis of the acoustic signal and place a greater demand on the phonological processing network. We therefore predicted increased functional activity in the superior temporal cortex and the inferior frontal gyrus, insula, and the motor cortex (Hickok and Poeppel, 2007; Rauschecker and Scott, 2009; Adank and Devlin, 2010; Adank et al., 2012a,b). Furthermore, FAS perception is additionally modulated by listeners' underlying implicit bias (Greenwald et al., 1998, 2009; Mcconnell and Leibold, 2001; Bertrand et al., 2005; Devos and Banaji, 2005; Kinoshita and Peek-O'leary, 2005; Yi et al., 2013). Thus, we hypothesized that individual variability in implicit Asian-foreign association would be associated with functional activity during early spectrotemporal analysis in the primary auditory cortex or during later, more categorical processing in the inferior frontal gyrus (Hickok and Poeppel, 2007; Rauschecker and Scott, 2009).

#### **INCREASED COMPUTATIONAL DEMAND FOR FOREIGN-ACCENTED SPEECH PROCESSING**

Relative to native speech, FAS was associated with increased BOLD response in the bilateral superior temporal cortices, potentially reflecting increased computational demand on these regions. The anterior and posterior portions of the superior temporal cortex have been associated with spectrotemporal analysis of the speech signal (Hickok and Poeppel, 2007), as well as with speech intelligibility processing (Scott et al., 2000; Narain et al., 2003; Okada et al., 2010; Abrams et al., 2013). While these previous studies have observed increased activation of the superior temporal cortex for intelligible speech compared to unintelligible acoustically complex stimuli, we found greater activation for the FAS stimuli than for the NAS stimuli, although FAS is less intelligible than NAS (Yi et al., 2013). This apparent contradiction can be resolved by considering the nature of the comparisons involved in previous neuroimaging studies examining mechanisms underlying speech intelligibility. Both the NAS and FAS stimuli used in the current study have semantic and syntactic content, which is absent in the non-speech stimuli used as controls in the previous studies (e.g., spectrally-rotated speech), and both functions have been suggested to occur within the superior temporal cortex (Friederici et al., 2003). The superior temporal cortex is a large region with possibly multiple functional roles in processing information in the speech signal. A direct recording study has shown that speech sound categorization is represented in the posterior superior temporal cortex (Chang et al., 2010), while a direct stimulation study has indicated a role of the anterior superior temporal cortex in comprehension but not auditory perception (Matsumoto et al., 2011).

We found that presentation of FAS was associated with greater activity in the articulatory-phonological network, encompassing bilateral inferior frontal gyri, insula, and the right motor cortex. The inferior frontal gyrus is thought to be responsible for mapping auditory signals onto articulatory gestures (Myers et al., 2009; Lee et al., 2012; Chevillet et al., 2013). It has been suggested that the role of the inferior frontal gyrus is defined by the linkage between motor observation and imitation, which allows for abstraction of articulatory gestures from the auditory signals, along with the motor cortex and the insula (Ackermann and Riecker, 2004, 2010; Molnar-Szakacs et al., 2005; Pulvermüller, 2005; Pulvermüller et al., 2005, 2006; Skipper et al., 2005; Galantucci et al., 2006; Meister et al., 2007; Iacoboni, 2008; Kilner et al., 2009; Pulvermüller and Fadiga, 2010). On the other hand, both fMRI and transcranial magnetic stimulation (TMS) studies have indicated a functional heterogeneity within the inferior frontal cortex, which includes semantic processing (Homae et al., 2002; Devlin et al., 2003; Gough et al., 2005). The FAS and NAS stimuli had been controlled for syntactic, semantic, and phonological complexity (Calandruccio and Smiljanic, 2012). Since the task for each stimulus had also been identical (clarity rating), the increased activation across the speech processing network—including the superior temporal cortex and the articulatory network—during FAS perception is interpreted to reflect decreased neural efficiency for FAS processing (Grabner et al., 2006b; Rypma et al., 2006; Neubauer and Fink, 2009).

#### **IMPLICIT ASIAN-FOREIGN ASSOCIATION ASSOCIATED WITH EARLY SPECTROTEMPORAL ANALYSIS**

Previous behavioral studies have shown that FAS perception is modulated not only by signal-driven factors, but also by listener-driven factors. These listener factors can include listeners' familiarity and experience with the talkers (Bradlow and Pisoni, 1999) or language experience (Bradlow and Pisoni, 1999; Bent and Bradlow, 2003). Multiple studies have shown that listeners are also sensitive to information regarding talker properties (Campbell-Kibler, 2010; Drager, 2010), either through explicit labels (Niedzielski, 1999; Hay et al., 2006a; Floccia et al., 2009) or facial cues (Strand, 1999; Hay et al., 2006b; Drager, 2011; McGowan, 2011; Yi et al., 2013). Listeners vary in their susceptibility to these talker cues (Hay et al., 2006b), which can override their explicit knowledge (Hay et al., 2006a). Accordingly, listeners' implicit association between faces and foreignness (Greenwald et al., 1998, 2009; Mcconnell and Leibold, 2001; Bertrand et al., 2005; Devos and Banaji, 2005; Kinoshita and Peek-O'leary, 2005) modulates FAS perception only when the faces are present, through a neural mechanism hitherto unknown (Yi et al., 2013). In this study, the IAT was used to measure the degree of listener bias in which East Asian faces are associated with foreignness of the speakers (Greenwald et al., 1998, 2009; Mcconnell and Leibold, 2001; Bertrand et al., 2005; Devos and Banaji, 2005; Kinoshita and Peek-O'leary, 2005). Previous fMRI studies that have used the IAT as a covariate have consistently shown a pattern in which higher measures of implicit association are associated with higher activation in various neural areas for dispreferred stimuli (Richeson et al., 2003; Krendl et al., 2006; Luo et al., 2006; Suslow et al., 2010).

Examining the connection between FAS perception and listener bias, we found that listeners' implicit Asian-foreign association was reflected in the functional activity in the right primary auditory cortex. Participants with higher IAT scores showed greater activity in the primary auditory cortex for Korean-accented sentences when audiovisual information was presented. The primary auditory cortex is the site for early spectrotemporal analysis for the speech signal, sensitive to acoustic properties of the signal (Strainer et al., 1997), as well as task demands (Fritz et al., 2003, 2005), attention (Jäncke et al., 1999), context (Javit et al., 1994), and training effects (Recanzone et al., 1993). In contrast, IAT scores were not associated with the activity in the inferior frontal gyrus. Past findings indicated that individual listeners' perceived talker properties from pictorial stimuli differentially modulate the perceptual experience (Hay et al., 2006b). In the case of FAS, the presentation of race-matched faces enhanced perception of Chinese-accented English speech (McGowan, 2011), and the individual variability in implicit Asian-foreign association predicted Korean-accented speech intelligibility (Yi et al., 2013). The present findings suggest that the listeners' implicit bias for associating Asian speakers with foreignness may be related to the early neural processing for FAS, specifically low-level spectrotemporal analysis of the acoustic properties of the signal.

#### **LIMITATIONS AND FUTURE DIRECTIONS**

In this study, Korean-accented speech was used as the proxy for FAS. Accordingly, all foreign-accented talkers appeared East Asian. In order to extend our results to the general phenomenon of FAS perception, we propose a multifactorial design in future studies where, in addition to the stimuli produced by Caucasian native speakers and Asian nonnative speakers, those produced by Asian native speakers and Caucasian nonnative speakers are incorporated into the study design. Also, an additional explicit questionnaire on listener experience and exposure to foreign-accented stimuli could be administered to augment our understanding of the complex ways in which underlying listener biases modulate speech perception. Finally, we acknowledge that the current study did not incorporate parametric variations in the intelligibility or accentedness of the FAS stimuli. Therefore, it is impossible to determine whether the increased BOLD response in the speech processing areas and the anterior cingulate cortex reflects increased difficulty in comprehension or the degree of perceived foreign accent *per se* (Peelle et al., 2004; Wong et al., 2008).

## **CONCLUSIONS**

In this study, we presented evidence of increased computational demand for FAS perception. The reduced neural efficiency for FAS processing was associated with the variability in the underlying listener biases (Yi et al., 2013). These results suggest that implicit racial association is associated with the early neural response to FAS. Future studies on speech perception should examine the contribution of visual cues and listener implicit biases in order to obtain a more comprehensive understanding of the phenomenon of FAS processing.

## **ACKNOWLEDGMENTS**

Research reported in this publication was supported by the National Institute on Deafness and Other Communication Disorders of the National Institutes of Health under Award Number R01DC013315 awarded to Bharath Chandrasekaran as well as the Longhorn Innovation Fund for Technology to Bharath Chandrasekaran and Rajka Smiljanic. The authors thank Kirsten Smayda, Jasmine E. B. Phelps, and Rachael Gilbert for significant contributions in data collection and processing; the faculty and the staff of the Imaging Research Center at the University of Texas at Austin for technical support and counsel; and the Texas Advanced Computing Center at The University of Texas at Austin for providing computing resources.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 23 May 2014; accepted: 10 September 2014; published online: 08 October 2014.*

*Citation: Yi H, Smiljanic R and Chandrasekaran B (2014) The neural processing of foreign-accented speech and its relationship to listener bias. Front. Hum. Neurosci. 8:768. doi: 10.3389/fnhum.2014.00768*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Yi, Smiljanic and Chandrasekaran. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

## Multi-talker background and semantic priming effect

## *Marie Dekerle<sup>1,2</sup>\*, Véronique Boulenger<sup>2,3</sup>, Michel Hoen<sup>2,4</sup> and Fanny Meunier<sup>1,2</sup>*

*<sup>1</sup> Laboratoire sur le Langage, le Cerveau et la Cognition, Centre National de la Recherche Scientifique, UMR 5304, Lyon, France*

*<sup>2</sup> University of Lyon, Lyon, France*

*<sup>3</sup> Laboratoire Dynamique Du Langage, Centre National de la Recherche Scientifique, UMR5596, Lyon, France*

*<sup>4</sup> Centre National de la Recherche Scientifique, UMR5292, Institut National de la Santé et de la Recherche Médicale, U1028, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France*

#### *Edited by:*

*Sonja A. E. Kotz, Max Planck Institute Leipzig, Germany*

#### *Reviewed by:*

*Carolyn McGettigan, Royal Holloway University of London, UK; Frederic Marmel, Universidad de Salamanca, Spain*

#### *\*Correspondence:*

*Marie Dekerle, Groupe de Recherche ALP, Laboratoire sur le Langage, le Cerveau et la Cognition, Institut des Sciences Cognitives, 67, Bd Pinel, 69675 BRON CEDEX, France*

*e-mail: marie.dekerle@isc.cnrs.fr*

The reported studies aimed to investigate whether informational masking in a multi-talker background relies on semantic interference between the background and target using an adapted semantic priming paradigm. In 3 experiments, participants were required to perform a lexical decision task on a target item embedded in backgrounds composed of 1–4 voices. These voices were Semantically Consistent (SC) voices (i.e., pronouncing words sharing semantic features with the target) or Semantically Inconsistent (SI) voices (i.e., pronouncing words semantically unrelated to each other and to the target). In the first experiment, backgrounds consisted of 1 or 2 SC voices. One and 2 SI voices were added in Experiments 2 and 3, respectively. The results showed a semantic priming effect only in the conditions where the number of SC voices was greater than the number of SI voices, suggesting that semantic priming depended on prime intelligibility and strategic processes. However, even when backgrounds were composed of 3 or 4 voices, which reduced intelligibility, participants were able to recognize words from these backgrounds, although no semantic priming effect on the targets was observed. Overall, this finding suggests that informational masking can occur at a semantic level if intelligibility is sufficient. Based on the Effortfulness Hypothesis, we also suggest that when there is an increased difficulty in extracting target signals (caused by a relatively high number of voices in the background), more cognitive resources are allocated to formal processes (i.e., acoustic and phonological), leading to a decrease in available resources for deeper semantic processing of background words and therefore preventing semantic priming from occurring.

**Keywords: semantic priming, informational masking, cocktail party, cognitive load, effortfulness hypothesis**

## **INTRODUCTION**

In daily life, speech is rarely perceived in silence, but with interference from wind, music or other people's conversation. Although often used to study psychoacoustic topics (Brungart, 2001; Brungart et al., 2001; McDermott, 2009), speech-in-noise and cocktail party situations (i.e., speech-in-speech, Cherry, 1953) also appear to be interesting paradigms to tackle linguistic processes and competition occurring between backgrounds and targets (Hoen et al., 2007; Boulenger et al., 2010). Our study aimed to investigate the extent to which multi-talker background is processed semantically when listening to speech-in-speech and therefore how the cocktail party situation can be used to study the automaticity of word semantic activation.

The cocktail party situation is described as involving two types of masking effects: energetic and informational masking (Brungart, 2001). Energetic masking relies on the spectrotemporal features of sounds and results from different sounds stimulating the same part of the cochlea at the same time so that one of them cannot be heard (i.e., as two signals increasingly share spectrotemporal characteristics, energetic masking becomes more efficient). In multi-talker background situations, the magnitude of energetic masking is proportional to the number of voices that comprise the background (Simpson and Cooke, 2005). Informational masking, however, usually refers to masking effects that cannot be attributed to energetic masking. Specifically, it is related to the overlap of information carried by the different signals at a higher level (e.g., lexical level and working memory; see Durlach, 2006; Cooke et al., 2008; Mattys et al., 2009; Mattys and Wiget, 2011). Whereas background noise mainly elicits energetic masking, a speech background produces both energetic and informational masking (Brungart et al., 2006). Despite the masking, it is still possible to detect and recognize a word or linguistic token embedded in a babble. Of course, as more voices are present in the babble, participants become less accurate (Freyman et al., 2004). However, it is interesting to note that Simpson and Cooke (2005) showed, using a −6 dB SNR, that intelligibility decreases as a monotonic function of the number of speakers in babbles of up to 8 voices. Specifically, participants' accuracy in detecting the target token decreases as the number of voices increases up to 8 voices. Further increasing the number of voices does not lead to a further decrease in accuracy. These results suggest that if energetic masking is too high, informational masking decreases with the diminution of the available linguistic cues. For example, with more than 8 talkers, phonetic cues are less available or unavailable and therefore cannot be attributed incorrectly to the target.

The first aim of this paper is to test whether semantic features are involved in informational masking. It has been established that informational masking is not monolithic and occurs at many linguistic levels. Indeed, a multi-talker background will create less interference on a target word if it is pronounced in a different language (Van Engen and Bradlow, 2007; Brouwer et al., 2012), and different languages will not have the same masking power (Gautreau et al., 2013). By manipulating the number of talkers in the background, Boulenger et al. (2010) revealed lexical competition between a 2-talker background and target speech using a lexical decision task. Increasing the number of voices in the background, however, led to the disappearance of lexical interference because of increased energetic masking (i.e., words from the background became less intelligible and therefore competed less with target processing). However, using the same paradigm but with an intelligibility task, Hoen et al. (2007) showed that lexical processing of a background could be performed with up to 4 concurrent voices; beyond that threshold, masking was too high and seemed to prevent linguistic processes. Although it has been shown that phonological and lexical information contribute to the informational masking effect, our experiments tested the role of semantic information.

Processing of the background's semantic content has already been highlighted with 2 talkers, pronouncing either semantically correct sentences (e.g., "rice is often served in round bowls") or incorrect sentences (e.g., "the great car met the milk," Brouwer et al., 2012). Semantic incoherence in the background impacts the recognition of the target sentence. This result suggests that the background signal with 2 talkers is processed semantically. Our experiments aimed to identify how many talkers this semantic processing can tolerate when the background consists of words, and how semantic information from the backgrounds interferes with the identification of target words.

The ability to semantically process auditory words presented outside of the attentional focus is traditionally studied using dichotic listening. This paradigm allows the study of pure informational masking, as no energetic masking occurs in dichotic listening. However, discrepant results have been reported (Cherry, 1953; Lewis, 1970; Eich, 1984; Wood et al., 1997; Dupoux et al., 2003). In 1984, Eich showed a semantic effect on the recognition of words presented in the unattended channel. However, this effect resulted, at least partially, from an attentional shift toward the to-be-ignored channel (as suggested by Wood et al., 1997). As the speech rate was very slow in Eich's experiment (85 words per minute), participants could listen to the supposedly unattended channel without disturbing the primary task (in the case of Eich's study, a shadowing task). In replicating Eich's study using the same speech rate, Wood et al. (1997) observed the same semantic effect; however, it disappeared if the speech rate was increased to 170 words per minute, corresponding to a more ecologically valid rate. The authors concluded that as this faster speech rate demanded more cognitive resources, participants could no longer shift attention to the unattended channel while performing the primary task, suggesting that, at least in dichotic listening, informational masking does not involve semantic information. The issue raised by this paradigm is that the spatial separation of auditory signals creates a masking release compared to a binaural condition and therefore facilitates stream segregation, which could prevent competition between the to-be-ignored and target speech (Drullman and Bronkhorst, 2000; Hawley et al., 2004).

Concerning semantic activation and according to traditional theoretical models, semantic memory is organized into networks. The recognition of one word leads to its activation in semantic memory, and this activation is supposed to spread automatically to other related concepts (Collins and Quillian, 1969; Collins and Loftus, 1975). This supposition is derived from semantic priming paradigms, shown in auditory and visual modalities, in which the presentation of a prime word leads to faster recognition of a semantically related target word (Meyer and Schvaneveldt, 1971; Donnenwerth-Nolan et al., 1981; Radeau, 1983; Schacter and Church, 1992; Radeau et al., 1998; Spruyt et al., 2012). For example, the presentation of the prime "nurse" before the target "doctor" facilitates the recognition of the target word "doctor" compared to a condition in which the prime is unrelated to the target (Meyer and Schvaneveldt, 1971). Adapting this paradigm to the cocktail party situation will allow us to investigate whether the semantic content of the background is processed and interferes with the target word, despite decreased intelligibility. Some background words will therefore act as primes.

In the current study, we used the rationale of a priming paradigm by manipulating the association between words pronounced in the background and target words. Additionally, we varied the amount of masking to evaluate how it modulates semantic priming effects. Participants were required to perform a lexical decision task on a target item (i.e., decide whether the target item is a word or a pseudo-word) embedded in backgrounds composed of 1 to 4 voices depending on the experiment. These voices could pronounce words that were semantically related to each other and that were related or unrelated to the target. They acted as primes and were called Semantically Consistent (SC) voices. Additional voices pronounced words that were always unrelated to each other and unrelated to the target, acting as maskers. They were called Semantically Inconsistent (SI) voices.

Across experiments, we manipulated the ratio between SC and SI voices. The aim was to test the preservation of the semantic processing of SC voices despite increased masking (i.e., more SI voices). In Experiment 1, backgrounds were composed of 1 or 2 SC voices. In Experiments 2 and 3, respectively, 1 and 2 SI voices were added to each background to increase masking and therefore, decrease the intelligibility of the SC voices. Consequently, in Experiment 2, backgrounds in one condition consisted of 1 SC voice and 1 SI voice and in a second condition of 2 SC voices and 1 SI voice. In Experiment 3 they comprised 1 SC voice and 2 SI voices in one condition and 2 SC voices and 2 SI voices in the other condition.

Overall, increasing the number of voices allowed us to examine if and how semantic priming is affected by the number of talkers in the background. Additionally, the variation in the number of SC voices compared to the number of SI voices allowed us to study the effect of prime saliency on semantic processing and therefore its participation in informational masking. Indeed, across experiments, backgrounds could consist of the same number of voices while the number of SC voices compared to the number of SI voices differed (e.g., 3 voices in the background: either 2SC/1SI in Experiment 2 or 1SC/2SI in Experiment 3).
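The design described above can be summarized as a small data structure. The following is only an illustrative sketch: the names `Background`, `DESIGN`, and `same_size_backgrounds` are hypothetical and do not come from the study materials. It shows how backgrounds with the same total number of voices can differ in their SC/SI ratio across experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Background:
    """One background condition (hypothetical helper, not from the study)."""
    experiment: int
    n_sc: int  # Semantically Consistent voices (act as primes)
    n_si: int  # Semantically Inconsistent voices (act as maskers)

    @property
    def n_voices(self) -> int:
        return self.n_sc + self.n_si

    @property
    def label(self) -> str:
        if self.n_si == 0:
            return f"{self.n_sc}SC"
        return f"{self.n_sc}SC/{self.n_si}SI"


# Conditions as described in the text for Experiments 1 to 3.
DESIGN = [
    Background(1, 1, 0), Background(1, 2, 0),
    Background(2, 1, 1), Background(2, 2, 1),
    Background(3, 1, 2), Background(3, 2, 2),
]


def same_size_backgrounds(design):
    """Group condition labels by total number of background voices."""
    groups = {}
    for bg in design:
        groups.setdefault(bg.n_voices, []).append(bg.label)
    return groups
```

Grouping by total number of voices makes the key comparison explicit: the 3-voice backgrounds are 2SC/1SI (Experiment 2) and 1SC/2SI (Experiment 3).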

If semantic processing can occur automatically, semantic priming should be observed at least as long as background words are intelligible and should not be disturbed by increased masking and decreased prime saliency. Indeed, automaticity is defined as strategy-free processing that occurs without using the resources of a limited-capacity central processor (Neely, 1977). Therefore, if semantic processing is strategy-free, it should occur even if participants are not aware that a given word is presented to them (as is done in the visual modality in classical masked priming paradigms, see Forster and Davis, 1984).

## **EXPERIMENT 1**

The first aim of this experiment was to set up and test our paradigm and experimental materials. Backgrounds were composed of 1 or 2 SC voices that pronounced words sharing semantic features with each other. In the related condition, target words belonged to the same semantic field as the prime, but they did not in the unrelated condition. We therefore expected to observe a semantic priming effect: participants should more quickly and accurately identify target words in the related compared to the unrelated condition. The second aim of this first experiment was to test whether the presence of 2 voices in the background would affect participants' performance, as suggested by the psychoacoustic literature (Brungart, 2001; Brungart et al., 2001). We therefore hypothesized that target words would be responded to more slowly and less accurately in the 2SC condition compared to the 1SC condition. Finally, we examined whether the semantic priming effect was modulated by the increased energetic and informational masking caused by the larger number of voices in the background.

#### **METHOD**

#### *Participants*

Twenty-seven participants (20 females) volunteered for this experiment. All were right-handed, French native speakers and reported no known hearing or language disorder. Subjects' ages ranged from 18 to 25 years old. All participants gave written informed consent and were not aware of the experiment's purpose. They were compensated for their participation. The protocol that was used in this experiment was approved by the local ethics committee (CPP Sud-Est IV, Lyon; ID RCB: 2008-A0 0708-47).

#### *Stimuli*

Forty-eight disyllabic target words (*M*<sub>lexical frequency</sub> = 21.94 per million, *SD* = 18.75 according to the French database Lexique 3, New et al., 2001) were selected, and each word belonged to a specific semantic field (e.g., *CAROTTE* "carrot"; *MÉTRO* "subway"). Each target word was matched to 10 words belonging to the same semantic field (e.g., *CAROTTE* "carrot" was associated with *légume, chou, céleri, salade, tomate* "vegetable, cabbage, celery, lettuce, tomato"). As participants had to perform a lexical decision task, 48 pseudo-words respecting French phonotactic rules were created (e.g., *PLARO*, *HUMEL*). Ten words sharing semantic features with each other were arbitrarily associated with each pseudo-word target, resulting in a total of 96 groups of 10 words (*M*<sub>lexical frequency</sub> = 21.86, *SD* = 18.20; see Supplementary Material). As each background comprised 1 or 2 SC voices (related or not to the target), each group was divided into two subgroups of 5 words: one subgroup was spoken by a first speaker (S1) and the other by a second speaker (S2).

Target words were presented with a semantically related (related condition) or semantically unrelated background (unrelated condition). In the unrelated condition, SC voices pronounced words that were semantically related to each other but not to the target (see **Figure 1**). Backgrounds comprised 1 SC voice (1SC condition) or 2 SC voices (2SC condition). The 48 target words were divided into 4 groups of 12 words; the mean frequency did not differ significantly between the groups (*F* < 1), nor did the number of phonemes [*M* = 6.97, *SD* = 5.65; *F*(3, 44) = 1.1, *n.s.*] or the number of phonological neighbors [*M* = 4.75, *SD* = 0.81; *F*(3, 44) = 2.2, *n.s.*]. Each group of twelve target words was assigned to a condition (1SC related, 1SC unrelated, 2SC related, 2SC unrelated) depending on the experimental list. The same was true for pseudo-words. Four experimental lists of 96 stimuli (i.e., 48 target words and 48 target pseudo-words) were created so that each target word was presented in each condition, but only once in a list (each participant was presented with one list only).
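The counterbalancing of target groups across the four lists can be sketched as a rotation. This is a minimal illustration under an assumption: the article states only that each target appeared in each condition across lists and once per list, so the simple rotation below (and the names `CONDITIONS` and `build_lists`) is hypothetical rather than the authors' actual assignment.

```python
CONDITIONS = ["1SC related", "1SC unrelated", "2SC related", "2SC unrelated"]


def build_lists(n_targets=48, n_groups=4):
    """Return one {target_index: condition} mapping per experimental list.

    Targets are split into n_groups consecutive groups; in list k, group g
    is assigned condition (g + k) mod n_groups (assumed rotation scheme).
    """
    group_size = n_targets // n_groups
    lists = []
    for list_idx in range(n_groups):
        assignment = {}
        for target in range(n_targets):
            group = target // group_size
            assignment[target] = CONDITIONS[(group + list_idx) % n_groups]
        lists.append(assignment)
    return lists
```

With this rotation, every list contains 12 targets per condition, and every target is assigned to each of the four conditions exactly once across the four lists.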

Targets and SC voices were recorded by 3 different French native female speakers (age: 21–22) in a sound-proof room (22 kHz, mono, 16 bits). Auditory sequences of 5 words from Speakers 1 and 2 (S1 and S2) were segmented into 3 s periods. The periods were then normalized at an intensity of 60 dB-A and mixed together to create backgrounds. All audio files were synchronized at the beginning, so all voices started to speak at the same time. However, as the voices pronounced words of different lengths, they soon became desynchronized, and there was always one speaker talking in the background. Targets recorded by the Target Speaker (TS; also normalized at an intensity of 60 dB-A) were inserted 2 s after the start of the backgrounds (so that each participant always had the same exposure to the background before the target speech was presented), with a 0 dB SNR (Signal/Noise Ratio; see **Figure 1**). Because the backgrounds, which comprised 1 or 2 voices, generated different amounts of energy, the intensity of all stimuli was varied over a ±3 dB range in 1 dB steps to prevent participants from predicting the condition from the intensity of individual stimuli.
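The stimulus construction steps described above (setting the target-to-background SNR, inserting the target 2 s after background onset, and roving the overall level) can be sketched as follows. This is an illustration only, not the authors' processing pipeline: the function names are hypothetical, levels are expressed as RMS amplitude rather than calibrated dB-A, and the SNR is assumed to be defined relative to the whole mixed background, which the article does not specify.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def scale_to_snr(target, background, snr_db=0.0):
    """Rescale target so that 20*log10(rms(target)/rms(background)) == snr_db."""
    gain = rms(background) / rms(target) * 10 ** (snr_db / 20)
    return [x * gain for x in target]


def insert_target(background, target, onset_s=2.0, sample_rate=22050):
    """Add target to a copy of background, starting onset_s seconds in."""
    onset = int(round(onset_s * sample_rate))
    padding = max(0, onset + len(target) - len(background))
    mixed = list(background) + [0.0] * padding
    for i, x in enumerate(target):
        mixed[onset + i] += x
    return mixed


def rove(signal, rng=random):
    """Apply a random overall level change of -3 to +3 dB in 1 dB steps."""
    step_db = rng.choice(range(-3, 4))
    gain = 10 ** (step_db / 20)
    return [x * gain for x in signal], step_db
```

The rove is applied to the complete stimulus (background plus target), so it changes the overall presentation level without altering the SNR.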

#### *Procedure*

Participants sat in front of a computer screen and heard the stimuli binaurally through headphones at a comfortable level (mean level 65 dB-A, ranging from 62 dB-A to 68 dB-A, normalized using an artificial ear). A fixation cross was presented on the screen at the beginning of each trial and remained on the screen during stimulus presentation. Participants were asked to listen to the stimuli to decide as quickly and accurately as possible whether the target was a word or a pseudo-word, by pressing one of two pre-specified keys. After a response was given, a string of hash marks indicated that the trial was over; participants could then press a key to start the next trial. Half of the participants gave the response to "word" with their left hand and to "pseudo-word" with their right hand. As all participants were right-handed, they might answer faster with their right hand than their left hand. To avoid this confounding effect, the other half were given the opposite instruction. A training session composed of twelve trials (different from the experimental stimuli) preceded the test session so that participants could acclimate to the stimuli and the task.

#### **RESULTS**

Two two-way repeated-measures analyses of variance (ANOVAs) by participants (*F*<sub>1</sub>) and by items (*F*<sub>2</sub>) were conducted, with Response Times (RTs, in ms) and Error Rates (ERs) for target word identification as dependent variables. We included Number of Voices in the background (1 Voice, 1SC or 2 Voices, 2SC) and Semantic Link between prime and target (related or unrelated) as within-subjects factors. Three participants were excluded from analyses because of very high ERs (more than 40%). Four target words with error rates greater than 50% were also excluded from item analyses (*POIGNET*, *RIDEAU*, *RATON,* and *RACINE* "wrist, curtain, baby rat, root"). Trials with RTs more than 2.5 standard deviations below or above the individual means (4.5%) and trials in which participants made mistakes (19.5%) were not included in the RT analysis. Means and Standard Deviations (SDs) of RTs and ERs are summarized in **Table 1**.
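The trial-level exclusion rule can be illustrated with a short sketch. It is an assumption-laden example rather than the authors' script: the trial format (dictionaries with `participant`, `rt`, and `correct` keys) and the function name `trim_rts` are hypothetical, and the individual mean and SD are assumed to be computed over each participant's correct trials, which the article does not state.

```python
from statistics import mean, stdev


def trim_rts(trials, n_sd=2.5):
    """Keep correct trials whose RT lies within n_sd SDs of the participant's mean.

    trials: list of dicts with keys 'participant', 'rt' (ms), 'correct' (bool).
    Assumption: mean and SD are computed per participant on correct trials only.
    """
    correct = [t for t in trials if t["correct"]]
    rts_by_participant = {}
    for t in correct:
        rts_by_participant.setdefault(t["participant"], []).append(t["rt"])

    kept = []
    for t in correct:
        rts = rts_by_participant[t["participant"]]
        if len(rts) < 2:
            kept.append(t)  # SD is undefined for a single trial; keep it
            continue
        m, sd = mean(rts), stdev(rts)
        if abs(t["rt"] - m) <= n_sd * sd:
            kept.append(t)
    return kept
```

Error trials are dropped first, then RT outliers, so the two exclusion percentages reported in the text correspond to two successive filters.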

The ANOVA by participants first revealed a significant main effect of the Number of Voices: participants were faster [*F*<sub>1</sub>(1, 23) = 4.25, *p* = 0.05] and more accurate [*F*<sub>1</sub>(1, 23) = 9.12, *p* < 0.01] to identify targets in the 1SC condition (*M*<sub>RT</sub> = 1008 ms, *SD* = 157; *M*<sub>ER</sub> = 11.7%, *SD* = 10.9) than in the 2SC condition (*M*<sub>RT</sub> = 1042 ms, *SD* = 161; *M*<sub>ER</sub> = 20.1%, *SD* = 16.5). The Item analysis, however, did not highlight an effect of the Number of Voices on RT [*F*<sub>2</sub>(1, 43) = 1.9, *p* = 0.1], although target words were better categorized as words in the 1SC condition [*F*<sub>2</sub>(1, 43) = 13.43, *p* < 0.001].

**Table 1 | Means and Standard Deviations (SDs) of Response Times (RTs) and Error Rates (ERs) depending on the number of voices in the background and the semantic link between prime and target in Experiment 1.**

*1SC, 1 SC voice condition; 2SC, 2 SC voices condition; related, semantic link between the prime and the target; unrelated, no semantic link between the prime and the target.*

The main effect of Semantic Link was also significant for RTs [*F*<sub>1</sub>(1, 23) = 14.24, *p* < 0.001; *F*<sub>2</sub>(1, 43) = 4.63, *p* < 0.05]: participants responded faster if targets shared semantic features with the prime (*M* = 997 ms, *SD* = 169) than if they did not (*M* = 1053 ms, *SD* = 145), which resulted in a 56 ms priming effect. This effect was also significant for ERs in the participant analysis [*F*<sub>1</sub>(1, 23) = 3.93, *p* = 0.05], whereas there was only a trend in the item analysis [*F*<sub>2</sub>(1, 43) = 3.07, *p* < 0.10]. Participants tended to be more accurate in the related condition (*M* = 15.4%, *SD* = 15.4) than in the unrelated condition (*M* = 18.55%, *SD* = 13.2). There was no significant interaction between the two factors for RTs (*F*<sub>1</sub> < 1; *F*<sub>2</sub> < 1) or ERs [*F*<sub>1</sub>(1, 23) = 2.34, *n.s.*; *F*<sub>2</sub>(1, 43) = 1.97, *n.s.*], suggesting that the semantic priming effect was not modulated by the Number of Voices (one or two) in the background.
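For readers who wish to see how a main effect such as the one for Semantic Link is obtained in a 2 × 2 within-subjects design, the sketch below uses the standard equivalence between the *F* test for a two-level within-subject factor and the squared paired *t* statistic computed on each participant's means averaged over the other factor. The function name `main_effect_f` is hypothetical, and any input values are for illustration only; the study's raw data are not reproduced here.

```python
import math
from statistics import mean, stdev


def main_effect_f(level_a, level_b):
    """F and degrees of freedom for a two-level within-subject factor.

    level_a, level_b: per-participant means for each level (same order),
    already averaged over the levels of the other factor.
    Returns (F, (1, n - 1)), where F equals the squared paired t statistic.
    """
    diffs = [a - b for a, b in zip(level_a, level_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t * t, (1, n - 1)
```

With 24 participants this yields the (1, 23) degrees of freedom reported for the by-participant analyses; the by-item analyses follow the same logic with items in place of participants.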

#### **DISCUSSION**

These results first highlight that participants were slowed by the increase in the number of voices in the background. This effect is certainly attributable to enhanced target masking in the two-voice condition (Brungart, 2001; Brungart et al., 2001). Interestingly, participants' performance was improved by the semantic relationship between the prime and target, and this effect was independent of the number of voices, suggesting that the increase in masking from one to two background voices was not sufficient to prevent semantic processing. However, in this first experiment, the prime was salient in both conditions (1 SC voice and no SI voice, or 2 SC voices and no SI voice). To test whether participants could still take advantage of the semantic relationship between target and prime if the intelligibility of the SC voices was further decreased, we conducted a second experiment in which an SI voice was added to each background.

## **EXPERIMENT 2**

This second experiment aimed to investigate whether the semantic priming effect would resist increased masking. An SI voice was therefore added to each background. This voice pronounced words sharing no semantic features with each other or with the target word, whatever the condition. The purpose was to use the same material and procedure as in Experiment 1 with the addition of a masker on the SC voices. In Experiment 2, backgrounds were composed of 2 voices (1 SC voice + 1 SI voice) or 3 voices (2 SC voices + 1 SI voice). A deleterious effect of the number of voices on participants' performance was predicted, and we expected that the presence of an SI voice would not affect the semantic priming effect if this latter effect is automatic.

Another change was made in Experiment 2 regarding target items. In Experiment 1, target items were pronounced by a female speaker and were consequently difficult to detect among the other female speakers (S1 and S2). These difficulties might partly explain the low accuracy and long response times to target words inserted in babbles that were only composed of one or two voices. To avoid stream segregation difficulties (Festen and Plomp, 1990; Brungart et al., 2001), target items were therefore pronounced by a male speaker (Target Speaker 2; TS2) in the two following experiments.

#### **METHOD**

#### *Participants*

Twenty-four right-handed French native speakers (18 females), aged 18–30 years, participated in this second experiment. They had no known auditory or language disorders. All participants gave their written informed consent and were compensated for their participation. None of the participants had been tested in Experiment 1 and they were not aware of the aim of the study before testing.

#### *Stimuli*

To add an SI voice to each background used in Experiment 1, 96 groups of 5 words (*M*<sub>lexical frequency</sub> = 18.15, *SD* = 9.75), not semantically related to each other, were generated. Average lexical frequency did not differ between SC voices (from Experiment 1) and the SI voice as shown by an ANOVA [*F*(2, 190) = 1.16, *n.s.*]. Each group was selected to mask a specific prime (composed of 1 or 2 SC voices), with which it shared no semantic link (e.g., the prime *légume, chou, céleri, salade, tomate* "vegetable, cabbage, celery, lettuce, tomato" was always presented with the SI voice pronouncing *policier, intéressant, cour, affiche, étagère* "policeman, interesting, yard, poster, shelf"). This SI voice was recorded by another French native female speaker (S3, age = 20) using the same method as in Experiment 1.

Backgrounds composed of 1 SC voice (S1) in Experiment 1 were now composed of 1 SC voice and 1 SI voice (S1 + S3), corresponding to the 1SC/1SI condition, and backgrounds composed of 2 SC voices (S1 + S2) in Experiment 1 were now composed of 2 SC voices and 1 SI voice (S1 + S2 + S3), corresponding to the 2SC/1SI condition. The 4 groups of 12 target words and pseudo-words created for Experiment 1 were used. Target words were presented in each of the 4 conditions: 1SC/1SI related, 1SC/1SI unrelated, 2SC/1SI related and 2SC/1SI unrelated. The corresponding number of pseudo-words was also presented. Four experimental lists were created so that each target word was presented in each condition but only once in a list.

Recordings of S1 and S2, used in the previous experiment, were mixed with S3 following the previously established experimental lists. Targets were recorded by a French native male speaker (Target Speaker 2; age = 20) and were inserted into backgrounds 2 s after the start of the sequence (see **Figure 2**).

#### *Procedure*

The procedure was the same as in Experiment 1.

#### **RESULTS**

Analyses similar to those in Experiment 1 were performed by subjects (*F*<sub>1</sub>) and by items (*F*<sub>2</sub>), with RTs (in ms) and ERs for target word identification as the dependent variables. We included Number of Voices in the background (2 voices, 1SC/1SI or 3 voices, 2SC/1SI) and Semantic Link between prime and target (related or unrelated) as within-subjects factors. As in Experiment 1, target words for which less than 50% of participants answered correctly were not analyzed. The target words *BILLET* and *CHIGNON* ("ticket" and "bun") were therefore not included. Moreover, 15% of the data were excluded from the RT analysis (13% errors and 2% extreme values). Mean RTs and ERs with SDs are summarized in **Table 2**.

The two-way repeated-measures ANOVAs showed a significant main effect of the Number of Voices in the backgrounds [*F*<sub>1</sub>(1, 23) = 6.08, *p* < 0.05; *F*<sub>2</sub>(1, 45) = 4.34, *p* < 0.05]. Participants responded faster to target words if backgrounds comprised 2 voices (*M* = 946 ms, *SD* = 157) compared to 3 voices (*M* = 973 ms, *SD* = 175). This main effect was also significant for ERs [*F*<sub>1</sub>(1, 23) = 7.18, *p* < 0.05] but only in the participants' analysis [*F*<sub>2</sub>(1, 45) = 3.72, *p* < 0.10]. Responses were more accurate in the 1SC/1SI condition (*M* = 8.6%, *SD* = 8.2) than in the 2SC/1SI condition (*M* = 12.5%, *SD* = 9.1).

The main effect of Semantic Link was also significant [*F*1(1, 23) = 5.32, *p* < 0.05; *F*2(1, 45) = 5.38, *p* < 0.05]: responses were 40 ms faster if the prime and target shared semantic features (*M* = 939 ms, *SD* = 162) than if they did not (*M* = 979 ms, *SD* = 170). This effect also reached significance for ERs, but only in the participants' analysis [*F*1(1, 23) = 5.33, *p* < 0.05; *F*2(1, 45) = 1.98, *p* = 0.10]. ERs were 3.9 percentage points lower if the prime and target were semantically related (*M* = 8.61%, *SD* = 8.5 in the related condition and *M* = 12.53%, *SD* = 8.8 in the unrelated condition).

There was no significant interaction between the Number of Voices in the backgrounds and the Semantic Link between prime and target for RTs [*F*1(1, 23) = 1.92, *n.s.*; *F*2 < 1] or ERs (*F*1 < 1; *F*2 < 1), indicating that the priming effect was not affected by the increase in the number of voices in the backgrounds.

**FIGURE 2 | (A)** Example of a background in the 1SC/1SI condition of Experiment 2, presented with a semantically related target word (left; related condition) or not (right; unrelated condition). S3, speaker 3 (see the legend of **Figure 1** for other abbreviations). **(B)** Example of a background in the 2SC/1SI condition of Experiment 2, presented with a semantically related target word (left; related condition) or not (right; unrelated condition).

**Table 2 | Means and SDs of RTs and ERs depending on the number of voices in the background and the semantic link between prime and target in Experiment 2.**


*1SC/1SI, 1 SC voice and 1 SI voice condition; 2SC/1SI, 2 SC voices and 1 SI voice condition.*

#### **DISCUSSION**

As in Experiment 1, performance improved if the prime and target were semantically related, although participants were slower in the 2SC/1SI condition than in the 1SC/1SI condition because of increased masking (Bronkhorst, 2000; Brungart, 2001; Brungart et al., 2001). Interestingly, there was again no indication that increased masking reduced the semantic priming effect. To further test this resistance of the semantic priming effect to the loss of prime intelligibility, we conducted a third experiment in which a second SI voice was added to each background to further decrease prime saliency. The intelligibility of the target item may also decrease with the addition of a second SI voice in the background. However, as the target item is pronounced by a male voice and embedded in a female voice background, it remains quite salient compared to the background voices (Brungart et al., 2001). Indeed, Brungart and collaborators showed that participants recognized a target sentence more easily if it was embedded in a background composed of voices of a different sex than if all the sentences (background and target) were pronounced by speakers of the same sex. Additionally, in our experiment, target items were always presented at the same time point (2 s after the beginning of the stimulus), and this regular timing helps participants detect the target, as they know when to listen for it. Consequently, in our paradigm, the masking effect of the SI voice on the target item was quite low compared to its effect on SC voices.

#### **EXPERIMENT 3**

This last experiment was the same as Experiment 2, except that a second SI voice was added to each background (i.e., backgrounds comprised 3 or 4 voices). The same method was used as in Experiment 2. The aim was to further increase the masking of SC voices to test the resistance and automaticity of semantic processing. As the number of voices in the backgrounds increased (i.e., up to 4 voices), we wanted to verify whether participants could still identify words from the background. Therefore, after the main experiment, we asked participants to perform a recognition test. This post-test was designed to examine whether, in agreement with Hoen et al. (2007), words in the background were still intelligible. In the case of significant semantic priming, it aimed to clarify, by testing lexical access, whether this effect resulted from preserved intelligibility of the prime words or from automatic processing. If intelligibility is preserved, we cannot prove that the semantic effect results from automatic processing; it may be either automatic or strategic. However, if a priming effect is found without preserved intelligibility, semantic processing is automatic (as shown with masked priming paradigms in the visual modality; see Dehaene et al., 1998; Naccache and Dehaene, 2001; Spruyt et al., 2012). Proof of non-intelligibility was consequently needed in the case of a significant semantic priming effect to provide evidence for an automatic process. Otherwise, we would not be able to determine whether automatic components are present in semantic processes. As in our previous experiments, we expected to observe a significant effect of the number of voices in the background; increasing the number of voices has been shown to decrease intelligibility for up to 8 voices (Simpson and Cooke, 2005). No interaction between the number of voices and the semantic association between prime and target was expected, at least if the words composing the background were still intelligible.

#### **METHOD**

#### *Participants*

Twenty-four participants (19 females) were recruited for this experiment (age: 18–34). All were right-handed French native speakers and reported no hearing or speech disorder. They gave written informed consent and were compensated for their participation. None of them had participated in Experiments 1 or 2, and they were unaware of the experiment's purpose prior to testing.

#### *Stimuli*

To create an extra SI voice, pronounced by a fourth speaker (S4), 96 groups of 5 words that were semantically unrelated to each other were generated (lexical frequency: *M* = 20.88, *SD* = 1.22). The mean word frequency did not differ between the different voices (SC and SI voices; *F* < 1). As in Experiment 2, each group of words was matched to a prime (composed of 1 or 2 SC voices) with which it did not share any semantic features, and it was systematically presented with that prime (e.g., the prime *légume, chou, céleri, salade, tomate* "vegetable, cabbage, celery, lettuce, tomato" was always presented with the masker *étui, liberté, drôle, global, sympathie* "case, freedom, funny, global, sympathy"). Therefore, with the addition of this second SI voice, backgrounds composed of 2 voices (S1 + S3) from Experiment 2 became 3-voice backgrounds (S1 + S3 + S4; 1 SC voice and 2 SI voices; 1SC/2SI condition) and 3-voice backgrounds (S1 + S2 + S3) from Experiment 2 became 4-voice backgrounds (S1 + S2 + S3 + S4; 2 SC voices and 2 SI voices; 2SC/2SI condition). The 4 groups of 12 target words and pseudo-words created in Experiment 1 were used and presented in the following conditions: 1SC/2SI related, 1SC/2SI unrelated, 2SC/2SI related, and 2SC/2SI unrelated. Four experimental lists were created so that each target word was presented in each condition but only once in a list.

The second SI voice (S4) was recorded by a French native female speaker (age = 23), using the same procedure as in previous experiments. S1, S2, and S3's auditory sequences were mixed with S4's. Targets used in Experiment 2 (recorded by TS2) were embedded in the backgrounds 2 s after their start (see **Figures 3**, **4**).

#### *Recognition post-test*

A recognition test was devised to test whether participants could recognize words previously heard during the experiment. Fifty words were presented to participants on a sheet of paper after the main experiment: 20 had been previously presented in the backgrounds, whereas 30 were new words not used as stimuli in the experiment. The lexical frequency of the new words (*M* = 23.15, *SD* = 48.29) did not significantly differ from that of the previously heard words (*M* = 18.47, *SD* = 20.61; *F* < 1).

#### *Procedure*

The same procedure was used as in Experiments 1 and 2. At the end of the experiment, a post-test was given to participants, instructing them to decide (i.e., write down) whether they had heard the given word during the experiment. They were asked to do the best they could but not to think about it too much, and to simply answer what they thought was right.

#### **RESULTS**

#### *Test*

The same statistical analyses as in Experiments 1 and 2 were performed by participants (*F*1) and by items (*F*2), with RTs (in ms) and ERs for target word identification as dependent variables. We included Number of Voices in the background (3 voices, 1SC/2SI condition or 4 voices, 2SC/2SI condition) and Semantic Link between prime and target (related or unrelated) as within-subjects factors. Target words answered correctly by fewer than half of the participants were excluded from analyses (*CHIGNON, EPAULE,* and *HIBOU,* "bun, shoulder, owl"). In total, 16.8% of the data were excluded from the RT analysis because of errors (15%) or extreme values (1.8%). Means and SDs for RTs and ERs are summarized in **Table 3**.

The analysis by participants revealed a significant main effect of the Number of Voices on RTs [*F*1(1, 23) = 4.01, *p* = 0.05; *F*2 < 1]. Participants identified target words more quickly if backgrounds were composed of 3 voices (1SC/2SI condition; *M* = 925 ms, *SD* = 155) than if they were composed of 4 voices (2SC/2SI condition; *M* = 941 ms, *SD* = 155). The ER analysis also showed that participants responded significantly more accurately [*F*1(1, 23) = 6.19, *p* < 0.05; *F*2(1, 44) = 5.80, *p* < 0.05] in the 1SC/2SI condition (*M* = 10.02%, *SD* = 10.10) than in the 2SC/2SI condition (*M* = 13.21%, *SD* = 8.95). The main effect of Semantic Link was not significant for either RTs [*F*1(1, 23) = 1.51, *n.s.*; *F*2(1, 44) = 3.21, *n.s.*] or ERs [*F*1(1, 23) = 3.19, *n.s.*; *F*2(1, 44) = 2.92, *n.s.*]. No significant interaction emerged between the 2 factors for RTs [*F*1(1, 23) = 1.39, *n.s.*; *F*2(1, 44) = 1.10, *n.s.*] or ERs (*F*1 < 1; *F*2 < 1). These results indicate that participants performed better in the lexical decision task, both in terms of speed and accuracy, if the backgrounds were composed of 3 voices rather than 4 voices. This finding highlights the impact of masking on target recognition. However, no significant semantic priming effect was observed in these conditions, suggesting that informational masking from 1SC/2SI and 2SC/2SI backgrounds was so efficient that it prevented semantic processing of the prime, which could therefore not affect target word identification. Additionally, energetic masking was stronger in the 2SC/2SI condition, which might also contribute to the masking of the prime and to the lack of semantic priming.

**FIGURE 3 | (A)** Example of a background in the 1SC/2SI condition of Experiment 3, presented with a semantically related target word (left; related condition) or not (right; unrelated condition). S4, speaker 4 (see the legend of **Figures 1**, **2** for other abbreviations). **(B)** Example of a background in the 2SC/2SI condition of Experiment 3, presented with a semantically related target word (left; related condition) or not (right; unrelated condition).

#### *Post-test*

Analysis of the participants' answers showed a mean ER of 44.5% (*SD* = 7) and a mean *d*′ of 0.26 (*SD* = 0.5). Although these results are close to chance, both differed significantly from chance, as shown by one-sample *t*-tests (ER: *p* < 0.01; *d*′: *p* < 0.05). No significant correlation was found between *d*′ and the priming effect (*r* = −0.39, *n.s.*). This finding implies that participants heard some background words, but those words were not used to improve performance. Additionally, a repeated-measures ANOVA was performed with ERs as the dependent variable and Number of Voices (3 or 4) in the background and Semantic Link between presented word and target as within-subjects factors. Neither the effect of the Number of Voices nor the effect of Semantic Link was significant (*F*Number of Voices < 1; *F*Semantic Link < 1).
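As an illustration of the sensitivity index used here, *d*′ is conventionally the difference between the z-transformed hit rate and false-alarm rate. The sketch below assumes a log-linear correction to keep rates away from 0 and 1; the text does not say how *d*′ was actually computed, and the function name and counts are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, n_old, false_alarms, n_new):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction is applied to both rates; this correction is an
    assumption of this sketch, not something stated in the article.
    """
    hit_rate = (hits + 0.5) / (n_old + 1)
    fa_rate = (false_alarms + 0.5) / (n_new + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Post-test layout from the text: 20 previously heard words, 30 new words.
# The response counts below are hypothetical.
at_chance = d_prime(hits=10, n_old=20, false_alarms=15, n_new=30)
above_chance = d_prime(hits=12, n_old=20, false_alarms=12, n_new=30)
```

A participant who says "heard" to half of the old words and half of the new words obtains *d*′ = 0, whereas more hits than false alarms (proportionally) yields a positive value.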

#### **DISCUSSION**

In this third experiment, target word identification was again disturbed by the increase in the number of voices in the backgrounds, confirming that masking is more efficient in the 4-voice than in the 3-voice condition. The disappearance of the semantic priming effect also suggests that the number of SC voices was too small relative to the number of SI voices. To compare the data of the 3 experiments, we considered the ratio of SC voices over the total number of voices. The ratio of SC/total voices is 1 in the 1SC and 2SC conditions, 2/3 in 2SC/1SI, 1/2 in 1SC/1SI and 2SC/2SI, and 1/3 in 1SC/2SI. An ANCOVA on the priming effect across all data, with the number of voices as the independent variable and the ratio of SC/total voices as a covariate, confirmed this hypothesis (effect of ratio: *p* < 0.05): when the ratio was too low, SC voices were not salient enough for participants to perform semantic processing. In a post-test, participants were asked to perform a recognition task immediately after the experiment. Participants scored significantly better than chance on the recognition test, showing that at least some words in the background were identified. This finding is consistent with the results of Hoen et al. (2007), who showed that in a transcription task, with up to 4 voices in the background, participants gave words from the background as responses instead of target words. Overall, Experiment 3 showed that participants were unable to process the prime at a semantic level (i.e., no priming effect was observed), although the post-test results suggest that they heard it well enough to recognize some of its words afterwards. This finding suggests that a word can be heard and implicitly encoded without being processed deeply enough to elicit semantic priming.

**Table 3 | Means and SDs of RTs and ERs depending on the number of voices in the background and the semantic link between prime and target in Experiment 3.**


*1SC/2SI, 1 SC voice and 2 SI voices condition; 2SC/2SI, 2 SC voices and 2 SI voices condition.*

## *POST-HOC* **ANALYSES**

To better analyze the impact of the SC/total voices ratio, we conducted *post-hoc* analyses. A Tukey HSD test showed that, with up to 4 voices in the background, there was no significant semantic priming effect if the ratio of SC/total voices was 1/2 or lower (Experiment 2: 1SC/1SI; Experiment 3: 1SC/2SI and 2SC/2SI). However, if the ratio was greater than 1/2, semantic priming was significant (Experiment 1: 1SC, *p* < 0.05, Cohen's *d* = 0.31; 2SC, *p* < 0.05, Cohen's *d* = 0.35; Experiment 2: 2SC/1SI, *p* < 0.05, Cohen's *d* = 0.31; cf. **Figure 5** and **Table 4**).
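The ratio of SC voices over total voices, and the pattern of significant priming reported for each condition, can be tabulated with a short sketch. The names `sc_ratio` and `priming_reported` are illustrative, and the "greater than 1/2" rule simply restates the post-hoc pattern described in the text; it is not a fitted model.

```python
from fractions import Fraction

# (number of SC voices, number of SI voices) per condition, as given in the text.
CONDITIONS = {
    "1SC": (1, 0),      # Experiment 1
    "2SC": (2, 0),      # Experiment 1
    "1SC/1SI": (1, 1),  # Experiment 2
    "2SC/1SI": (2, 1),  # Experiment 2
    "1SC/2SI": (1, 2),  # Experiment 3
    "2SC/2SI": (2, 2),  # Experiment 3
}


def sc_ratio(n_sc, n_si):
    """Ratio of SC voices over the total number of background voices."""
    return Fraction(n_sc, n_sc + n_si)


def priming_reported(n_sc, n_si):
    """True where the text reports significant priming (ratio above 1/2)."""
    return sc_ratio(n_sc, n_si) > Fraction(1, 2)


ratios = {name: sc_ratio(*voices) for name, voices in CONDITIONS.items()}
primed = {name for name, voices in CONDITIONS.items() if priming_reported(*voices)}
```

Using exact fractions rather than floats keeps the boundary case (a ratio of exactly 1/2, as in 1SC/1SI and 2SC/2SI) unambiguous.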

## **GENERAL DISCUSSION**

#### **SUMMARY**

The aim of this study was to investigate to what extent the semantic content of the multi-talker babble could interfere with the processing of targets in a cocktail party situation. In the three reported experiments, participants were required to perform a lexical decision task on a target item embedded in a multi-talker background. The backgrounds were composed of 1 or 2 SC voices pronouncing words that shared semantic features with each other and could be semantically related to the target or not. The ratio of SC voices over the total number of voices in the background was varied across experiments. In Experiment 1, ratios were 1/1 and 2/2 (i.e., only SC voices were presented in the background). In Experiments 2 and 3, 1 and 2 SI voices, respectively, pronouncing semantically unrelated words were added to the backgrounds to decrease the intelligibility of the SC voices, which acted as the prime. The ratios of SC/total voices were therefore 1/2 and 2/3 in Experiment 2, and 1/3 and 2/4 in Experiment 3.

*\* Indicates significant priming. The ratio is expressed in terms of the number of SC voices over the total number of voices in the background: 1, 1 SC voice or 2 SC voices; 2/3, 2 SC voices, 1 SI voice; 1/2, 1 SC voice, 1 SI voice or 2 SC voices, 2 SI voices; 1/3, 1 SC voice, 2 SI voices.*

The main effect of the Number of Voices was significant in each experiment; participants responded faster and more accurately to target words if a smaller number of voices composed the backgrounds. This delayed response time with increasing number of voices is a well-known phenomenon (Brungart et al., 2001; Boulenger et al., 2010). As the number of voices increases, energetic masking is enhanced so that the signal is saturated, and target items are consequently more difficult to process. In addition to physical masking, more information is perceived and must be processed (i.e., informational masking), leading to slower response times.

Overall, the main effect of Semantic Link between prime and target seems to be an all-or-none phenomenon. The semantic priming effect could have decreased gradually with an increasing number of voices in the background, but our results suggest that it does not, as no interaction between the Number of Voices and the Semantic Link was observed in any of the three experiments. Instead, the main effect of Semantic Link depends on the ratio of SC voices over the total number of voices in the background. A semantic priming effect emerged in our experiments, which spanned from 1 to 4 voices in the background, but only if the number of SC voices was higher than the number of SI voices. This ratio of SC voices is interesting because it highlights the necessity of prime saliency for semantic processing to occur; it also gives an objective measure of this prime saliency across experiments. This finding suggests that informational masking does occur at the semantic level but only if the prime is sufficiently salient. If informational and energetic masking had the same role, a significant semantic priming effect would appear in both conditions containing 3 voices (i.e., Experiment 2 2SC/1SI and Experiment 3 1SC/2SI). However, this was not the case: priming emerged in the former but not in the latter. We therefore conclude that informational masking can be semantic only if the semantic content of the background is sufficiently salient to be used to improve performance.

#### **LEXICAL PROCESSING WITHOUT SEMANTIC ACTIVATION**

Our results suggest that semantic processing in a cocktail party situation is not automatic. In fact, it seems that in challenging listening situations, one can hear and activate a mental representation of a word without deeply processing it at a semantic level. Background words can be semantically processed with up to 3 voices (2SC/1SI, Experiment 2); however, no priming effect emerged in Experiment 3, even though the results of the recognition post-test suggest that primes were heard and recognized. This finding is consistent with the way the language system has been modeled; most models of word processing, in both the auditory and the visual modalities, suggest a distinction between lexical and semantic levels of processing (Marslen-Wilson and Welsh, 1978; McClelland and Elman, 1986; Grainger and Holcomb, 2009). If one considers that these processes are independent stages of word recognition, it seems reasonable that one of these stages can be reached (i.e., lexical) without strongly activating the deepest stage (i.e., semantic).

Many studies have highlighted the automaticity of semantic processing using masked semantic priming (Naccache and Dehaene, 2001; Klauer et al., 2007; Kouider and Dehaene, 2009; Spruyt et al., 2012). However, masked semantic priming is only found in very specific conditions in the visual modality and with semantic categorization tasks (see Van den Bussche et al., 2009, for a review). One can therefore assume that semantic activation remains superficial in masked priming paradigms and only activates very close concepts such as superordinates, which is enough to create a priming effect in semantic categorization tasks. In our experiments, however, deeper semantic processing was necessary. For example, in the semantic field of birds, the SC voice pronounced *corbeau, rossignol, cage, voler, nid* ("crow, nightingale, cage, fly, nest") and *PIGEON* ("pigeon") was the target word. In a condition of high intelligibility, these words primed the target word; however, with decreased intelligibility, if the participants only heard the word "cage," this word may have activated its superordinate (e.g., "object") but not its associates such as "bird." Consequently, it may not have primed "pigeon." The absence of a semantic priming effect, together with the decrease in intelligibility, seems to show that greater difficulty in processing auditory signals at a superficial level causes decreased processing at a higher, linguistic level.

Consistent with our results, and using an auditory masked priming paradigm, Kouider and Dupoux (2005) demonstrated lexical access without semantic priming. In their experiments, the prime was a time-compressed word embedded in a masker composed of time-reversed and compressed words. They manipulated the compression rate of the prime to vary its intelligibility. If the prime was not intelligible, they found a significant repetition priming effect on words but not on non-words, suggesting that this effect "involved a lexical activation of abstract word form." In the same condition of prime compression, they did not find any semantic priming effect. This finding would therefore suggest a dissociation between lexical and semantic processing.

#### **COGNITIVE LOAD**

Overall, our results are compatible with the idea of cognitive load, in which simultaneously processed information and interactions can either under-load or overload finite processing capacities. As shown by our results, it is more difficult to process items with more voices in the background. This finding might be partly due to the high perceptual load and to the fact that the target item was embedded in backgrounds that could not be completely ignored (Lavie, 1995, 2005). Processing words in the background might have been particularly effortful, and semantic processing of a word can be delayed and even prevented if participants have to perform an additional task (i.e., high cognitive load; Hohlfeld et al., 2004; Hohlfeld and Sommer, 2005; Van Petten, 2014). We argue that, in Experiment 3, some background words were heard but not deeply semantically processed because doing so was both very demanding and irrelevant to the task. In Experiment 2, 2SC/1SI background words were also very difficult to hear and process (as in Experiment 3 1SC/2SI); however, they were semantically processed, as revealed by the semantic priming effect observed. As SC voices were more salient in Experiment 2 2SC/1SI than in Experiment 3 1SC/2SI, participants might have heard more related words, and one could argue that semantic priming in our study in fact relied on the chance for participants to hear an SC word. Although this hypothesis is interesting, it does not seem sufficient to explain our results. Indeed, according to this hypothesis, a priming effect should also have appeared in Experiment 3, where participants, in line with previous results from the literature (Brungart, 2001; Hoen et al., 2007), recognized SC words presented in the post-test.

Given the overall results of our experiments, we argue that if the ratio of SC/total voices was 1/2 or lower, participants heard some SC words but, because of high cognitive load (i.e., intelligibility was low and they focused on the target item), these words were not processed deeply enough to lead to semantic priming (a similar dissociation between lexical and semantic processing was also found by Kouider and Dupoux, 2005). However, if the ratio was greater than 1/2, SC words were salient, and participants thus processed them sufficiently to improve their performance. As this interpretation relies on the post-test effect, which is quite small although significant, more experiments should be performed. For example, using only SC voices and degrading intelligibility by adding noise or filtering the signal, instead of adding SI voices, would be a good way to test this hypothesis in future studies.

Our results suggest that the increased cognitive load necessary to reconstruct the degraded signal reduced the resources available for higher-level processes. This claim is consistent with the Effortfulness Hypothesis (Rabbitt, 1968), which states that degraded signals require the allocation of substantial cognitive resources to formal processes (i.e., orthographic or phonological), leaving fewer resources available for higher-level processes (e.g., lexical). It has been shown that hearing-impaired participants are less accurate at recalling previously heard sentence-final words than their control peers (Pichora-Fuller et al., 1995). The underlying assumption is that for hearing-impaired participants, the auditory signal is highly degraded and therefore demands more cognitive resources to be formally (i.e., phonologically) processed. As studies usually use recall tasks with lists of unrelated words or digits (Surprenant, 1999; Murphy et al., 2000; Wingfield et al., 2005), verbal working memory only relies on the phonological loop (Baddeley and Hitch, 1974; Baddeley, 2000) and only involves formal processes. Our results showed that higher levels of processing, such as semantic activation, may be specifically impacted by signal degradation.

Recently, difficulties in semantic and syntactic integration with degraded signals have also been reported in the visual modality (Gao et al., 2011, 2012). Experiments conducted by Gao et al. (2011) using visual noise (i.e., variation in pixel brightness) showed that if participants allocate more resources to formal processes, semantic integration is affected. Indeed, after reading an entire text, participants were worse at recalling the main proposition in the noisy condition. Altogether, these findings suggest that the availability of cognitive resources is involved at various levels during language processing. Whereas previous studies have shown that noise impairs memory systems, our study provides evidence that semantic activation is linked to cognitive resources, independently of memory.

## **CONCLUSION**

This study explored the semantic nature of informational masking in a cocktail party situation. The results of three behavioral studies reveal that the emergence of semantic priming effects relies on prime intelligibility and saliency. These findings question the assumption that signal degradation has no effect on speech processing if target signals can be recognized. The results reveal that high-level processes, such as semantic processing, might not be as automatic as previously thought but are subject to the limits of cognitive resources. Our study also demonstrates how the cocktail party situation can be used to study the automaticity of linguistic processes.

## **ACKNOWLEDGMENTS**

The first author is funded by a PhD grant from Rhône-Alpes region, France. This research was supported by a European Research Council grant to the SpiN project (no. 209234).

### **SUPPLEMENTARY MATERIAL**

The Supplementary Material for this article can be found online at: http://www.frontiersin.org/journal/10.3389/fnhum.2014.00878/abstract

#### **REFERENCES**


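Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001). Informational and energetic masking effects in the perception of multiple simultaneous talkers. *J. Acoust. Soc. Am.* 110, 2527–2538. doi: 10.1121/1.1408946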


Wood, N. L., Stadler, M. A., and Cowan, N. (1997). Is there implicit memory without attention? A reexamination of task demands in Eich's (1984) procedure. *Mem. Cognit.* 25, 772–779. doi: 10.3758/BF03211320

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 20 March 2014; accepted: 12 October 2014; published online: 31 October 2014.*

*Citation: Dekerle M, Boulenger V, Hoen M and Meunier F (2014) Multi-talker background and semantic priming effect. Front. Hum. Neurosci. 8:878. doi: 10.3389/fnhum.2014.00878*

*This article was submitted to the journal Frontiers in Human Neuroscience.*

*Copyright © 2014 Dekerle, Boulenger, Hoen and Meunier. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*
