# OVERLAP OF NEURAL SYSTEMS FOR PROCESSING LANGUAGE AND MUSIC

EDITED BY: McNeel Gordon Jantzen, Edward W. Large and Cyrille Magne

PUBLISHED IN: Frontiers in Psychology and Frontiers in Neuroscience


ISSN 1664-8714 ISBN 978-2-88919-911-2 DOI 10.3389/978-2-88919-911-2



Topic Editors:

**McNeel Gordon Jantzen,** Western Washington University, USA

**Edward W. Large,** University of Connecticut, USA

**Cyrille Magne,** Middle Tennessee State University, USA

The interplay between musical training and speech perception continues to intrigue researchers in the areas of language and music alike. Historically, language function has been attributed to brain regions localized predominantly in the left hemisphere, whereas music has been attributed to right-hemisphere-dominant regions. Recent studies demonstrating neural overlap for processing speech and music, and enhanced speech perception and production in musicians, suggest that these regions may be inextricably intertwined. The extent of neural overlap between music and speech remains hotly debated, with surprisingly little empirical research exploring specific neural homologs and analogs. Moreover, despite recognition that shared processes likely exist throughout development and depend upon an individual's acoustic experiences, even less research exists on how overlapping neural structures for music and language are affected by developmental trajectories.

Nonetheless, the field is well poised to address key empirical questions, in part because of the recent development of new theories that address the neural and developmental interaction between music and language processing in conjunction with the broad availability of sophisticated tools for quantifying brain activity and dynamics. To understand the overlap of neural structures for language and music processing, research is needed to identify those specific functions of each that influence the other, with areas for enhanced perception of pitch and onset time having already been targeted. Research is also needed to identify the extent to which this overlap is developed in infancy or early childhood and the process by which it affects neural reorganization, plasticity, and trainability in adulthood.

For this research topic, we would like to further explore the relationship between language and music in the brain from two perspectives: 1) understanding the nature of shared neural and cognitive processing for music and language and 2) understanding the developmental trajectory of these neural systems and how they are influenced by experience. We seek to gather technically diverse original research articles that present new empirical findings relevant to understanding:

1. When, in the brain, acoustic information becomes processed specifically as language or music.

2. The shared and independent neural structures for processing music and language.

3. How acoustic experiences such as musical training influence overlap of neural structures for language and music.

4. How the overlap of processing regions changes over time due to experiences at any developmental stage.

**Citation:** Jantzen, M. G., Large, E. W., Magne, C., eds. (2016). Overlap of Neural Systems for Processing Language and Music. Lausanne: Frontiers Media. doi: 10.3389/978-2-88919-911-2

# Table of Contents


Allison R. Fogel, Jason C. Rosenberg, Frank M. Lehman, Gina R. Kuperberg and Aniruddh D. Patel


Franziska Degé, Claudia Kubicek and Gudrun Schwarzer


Sydney L. Lolli, Ari D. Lewenstein, Julian Basurto, Sean Winnik and Psyche Loui

*Cross-domain processing of musical and vocal emotions in cochlear implant users*

Alexandre Lehmann and Sébastien Paquette

*Music and literature: are there shared empathy and predictive mechanisms underlying their affective impact?*

Diana Omigie

# Editorial: Overlap of Neural Systems for Processing Language and Music

McNeel G. Jantzen<sup>1</sup> \*, Edward W. Large<sup>2</sup> and Cyrille Magne<sup>3</sup>

<sup>1</sup> Language and Neural Systems Laboratory, Department of Psychology and Behavioral Neuroscience Program, Western Washington University, Bellingham, WA, USA, <sup>2</sup> Music Dynamics Laboratory, Department of Psychology, University of Connecticut, Storrs, CT, USA, <sup>3</sup> Brain and Language Laboratory, Department of Psychology and Interdisciplinary Program in Literacy Studies, Middle Tennessee State University, Murfreesboro, TN, USA

Keywords: musical training, language, speech, music, auditory perception, neural plasticity, development, neural overlap

**The Editorial on the Research Topic**

### **Overlap of Neural Systems for Processing Language and Music**

The relationship between musical training and speech perception has intrigued researchers in language and music for decades, from Bever and Chiarello's (1974) work emphasizing hemispheric specialization to Tallal and Gaab's (2006) findings of shared neural circuitry. Recent studies demonstrating neural overlap for processing speech and music, and enhanced speech perception and production in musicians, suggest that these regions may be inextricably intertwined (Sammler et al., 2007; Wong P.C. et al., 2007; Wong P. et al., 2007; Rogalsky et al., 2011; Schulze et al., 2011). Patel's OPERA hypothesis and Hickok and Poeppel's (2000, 2007) neuroanatomical models continue to evolve and guide this field of research. However, the extent of neural overlap between music and speech remains hotly debated (Norman-Haignere et al., 2015; Peretz et al., 2015), with surprisingly little empirical research exploring specific neural homologs and analogs. Emerging evidence suggests that shared processes likely exist throughout development, depend upon an individual's acoustic experiences, and are affected by developmental trajectories. Moreover, developing theories that address the neural and developmental interaction between music and language processing in conjunction with the broad availability of sophisticated tools for quantifying brain activity and dynamics offer the perfect opportunity for researchers to address these key empirical questions. Taken together, this field of research has begun to elucidate the complex dynamics of overlapping neural areas for processing language and music. This special issue highlights the development of this overlap in early childhood and explores how the interaction between language and musical training enhances cognitive functioning in adults.

This E-Book comprises 10 opinion, perspective, and research papers that focus on the overlap of neural systems for processing language and music. Eight of these papers report original research and new findings that support overlapping neural systems for processing language and music. LaCroix et al. performed a meta-analysis of 171 neuroimaging studies to examine the role of context in processing music and language. Their findings suggest that observed neural overlaps for speech and music might be task-dependent. Fogel et al. developed a novel method for studying and quantifying predictions in musical tasks that is consistent with language tasks. Their melodic cloze probability task can be used to test computational models of melodic expectation and allows for a more precise examination of the relationship between predictive mechanisms in music and language. Using a garden-path design, Jung et al. demonstrated that rhythmic expectancy is crucial to the interaction of processing musical and linguistic syntax. Additionally, their findings support the incorporation of dynamic models of attentional entrainment into existing theories of musical and linguistic syntactical processing. Margulis et al. used the speech-to-song illusion to examine the role of pronunciation difficulty and temporal regularity. Their finding that difficult-to-pronounce languages, rather than differing temporal intervals, elicited a stronger speech-to-song illusion suggests a stronger speech representation for native and easy-to-pronounce languages. Miles et al. demonstrated that females have an advantage for recognizing familiar musical melodies. They believe this advantage is related to superior declarative memory, which may underlie the storage and knowledge of both the mental lexicon in language (e.g., Ullman, 2001) and some aspects of familiar melodies in music (Miranda and Ullman, 2007). Two papers report finding that musical training during development enhances literacy skills, including phonological awareness and reading fluency, via neural mechanisms for both language and music (Degé et al.; Gordon et al.). Moreover, Degé and colleagues provide evidence that music production and music perception are associated with multiple precursors of reading. Finally, Lolli et al. examined the effect of sound frequency on judgments of emotion in speech by congenital amusics. Using both high- and low-pass filtered speech in a pitch discrimination and emotion identification task, their findings demonstrate the important role of low frequency information in identifying the emotional content of speech.

Edited and reviewed by: Isabelle Peretz, Université de Montréal, Canada

> \*Correspondence: McNeel G. Jantzen mcneel.jantzen@wwu.edu

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

> Received: 03 May 2016 Accepted: 27 May 2016 Published: 14 June 2016

### Citation:

Jantzen MG, Large EW and Magne C (2016) Editorial: Overlap of Neural Systems for Processing Language and Music. Front. Psychol. 7:876. doi: 10.3389/fpsyg.2016.00876

In addition to these eight research papers, there are two perspective and opinion papers that emphasize the affective and emotive commonalities between music and language (Lehmann and Paquette; Omigie). Lehmann and Paquette provide a neurobehavioral approach for examining cross-domain processing of musical and vocal emotions, suggesting that studying cochlear implant users may allow for a richer understanding of neural overlap between music and language. Omigie (2015) provides evolutionary evidence for shared underlying neural mechanisms for our emotive responses to music and literature.

This E-Book provides a comprehensive snapshot of the research examining the complex overlap of neural systems for processing language and music. Both musical experience and training enhance the development of linguistic representations, emotion perception, and other cognitive skills. Furthermore, the research presented here contributes to current knowledge of neuroplastic reorganization and repair in clinical populations, and may aid in the design of new and more effective rehabilitative protocols.

# AUTHOR CONTRIBUTIONS

MJ prepared and wrote the editorial, soliciting feedback from EL and CM. EL and CM also provided feedback regarding organization.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Jantzen, Large and Magne. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The relationship between the neural computations for speech and music perception is context-dependent: an activation likelihood estimate study

### Arianna N. LaCroix, Alvaro F. Diaz and Corianne Rogalsky \*

Communication Neuroimaging and Neuroscience Laboratory, Department of Speech and Hearing Science, Arizona State University, Tempe, AZ, USA

The relationship between the neurobiology of speech and music has been investigated for more than a century. There remains no widespread agreement regarding how (or to what extent) music perception utilizes the neural circuitry that is engaged in speech processing, particularly at the cortical level. Prominent models such as Patel's Shared Syntactic Integration Resource Hypothesis (SSIRH) and Koelsch's neurocognitive model of music perception suggest a high degree of overlap, particularly in the frontal lobe, but also perhaps more distinct representations in the temporal lobe with hemispheric asymmetries. The present meta-analysis study used activation likelihood estimate analyses to identify the brain regions consistently activated for music as compared to speech across the functional neuroimaging (fMRI and PET) literature. Eighty music and 91 speech neuroimaging studies of healthy adult control subjects were analyzed. Peak activations reported in the music and speech studies were divided into four paradigm categories: passive listening, discrimination tasks, error/anomaly detection tasks and memory-related tasks. We then compared activation likelihood estimates within each category for music vs. speech, and each music condition with passive listening. We found that listening to music and to speech preferentially activate distinct temporo-parietal bilateral cortical networks. We also found music and speech to have shared resources in the left pars opercularis but speech-specific resources in the left pars triangularis. The extent to which music recruited speech-activated frontal resources was modulated by task. While there are certainly limitations to meta-analysis techniques particularly regarding sensitivity, this work suggests that the extent of shared resources between speech and music may be task-dependent and highlights the need to consider how task effects may be affecting conclusions regarding the neurobiology of speech and music.

Keywords: music perception, speech perception, fMRI, meta-analysis, Broca's area

### Edited by:

McNeel Gordon Jantzen, Western Washington University, USA

### Reviewed by:

Lutz Jäncke, University of Zurich, Switzerland

Yi Du, McGill University, Canada

### \*Correspondence:

Corianne Rogalsky, Department of Speech and Hearing Science, Arizona State University, PO Box 570102, Tempe, AZ 85287-0102, USA corianne.rogalsky@asu.edu

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

> Received: 08 April 2015 Accepted: 22 July 2015 Published: 11 August 2015

### Citation:

LaCroix AN, Diaz AF and Rogalsky C (2015) The relationship between the neural computations for speech and music perception is context-dependent: an activation likelihood estimate study. Front. Psychol. 6:1138. doi: 10.3389/fpsyg.2015.01138

# Introduction

The relationship between the neurobiology of speech and music has been investigated and debated for nearly a century (Henschen, 1924; Luria et al., 1965; Frances et al., 1973; Peretz, 2006; Besson et al., 2011). Early evidence from case studies of brain-damaged individuals suggested a dissociation of aphasia and amusia (Yamadori et al., 1977; Basso and Capitani, 1985; Peretz et al., 1994, 1997; Steinke et al., 1997; Patel et al., 1998b; Tzortzis et al., 2000; Peretz and Hyde, 2003). However, more recent patient work examining specific aspects of speech and music processing indicates at least some overlap in deficits across the two domains. For example, patients with Broca's aphasia have both linguistic and harmonic structure deficits, and patients with amusia exhibit pitch deficits in both speech and music (Patel, 2003, 2005, 2013). Electrophysiological (e.g., ERP) studies also suggest shared resources between speech and music; for example, syntactic and harmonic violations elicit indistinguishable ERP responses such as the P600 response, which is hypothesized to originate from anterior temporal or inferior frontal regions (Patel et al., 1998a; Maillard et al., 2011; Sammler et al., 2011). Music perception also interacts with morphosyntactic representations of speech: the early right anterior negativity (ERAN) ERP component sensitive to chord irregularities interacts with the left anterior negativity's (LAN's) response to morphosyntactic violations or irregularities (Koelsch et al., 2005; Steinbeis and Koelsch, 2008b; Koelsch, 2011).

Several studies of trained musicians and individuals with absolute pitch also suggest an overlap between speech and music as there are carry-over effects of musical training onto speech processing performance (e.g., Oechslin et al., 2010; Elmer et al., 2012; for a review see Besson et al., 2011).

There is a rich literature of electrophysiological and behavioral work regarding the relationship between music and language (for reviews see Besson et al., 2011; Koelsch, 2011; Patel, 2012, 2013; Tillmann, 2012; Slevc and Okada, 2015). This work has provided numerous pieces of evidence of overlap between the neural resources of speech and music, including in the brainstem, auditory cortex and frontal cortical regions (Koelsch, 2011). This high degree of interaction between speech and music coincides with Koelsch et al.'s view that speech and music, and therefore the brain networks supporting them, cannot be separated because of their numerous shared properties, i.e., there is a "music-speech continuum" (Koelsch and Friederici, 2003; Koelsch and Siebel, 2005; Koelsch, 2011). However, evidence from brain-damaged patients suggests that music and speech abilities may dissociate, although there are also reports to the contrary (see above). Patel's (2003, 2008, 2012) Shared Syntactic Integration Resource Hypothesis (SSIRH) is in many ways a remedy to the shared-vs.-distinct debate in the realm of structural/syntactic processing. Stemming in part from the patient and electrophysiological findings, Patel proposes that language and music utilize overlapping cognitive resources but also have unique neural representations. Patel proposes that the shared resources reside in the inferior frontal lobe (i.e., Broca's area) and that distinct processes for speech and music reside in the temporal lobes (Patel, 2003).

The emergence of functional neuroimaging techniques such as fMRI has continued to fuel the debate over the contributions of shared vs. distinct neural resources for speech and music. fMRI lacks the high temporal resolution of electrophysiological methods and can introduce high levels of ambient noise, potentially contaminating recorded responses to auditory stimuli. However, the greater spatial resolution of fMRI may provide additional information regarding the neural correlates of speech and music, and MRI scanner noise can be minimized using sparse sampling scanning protocols and reduced-noise continuous scanning techniques (Peelle et al., 2010). Hundreds of fMRI papers have investigated musical processes, and thousands have investigated the neural substrates of speech. Conversely, to our knowledge and as Slevc and Okada (2015) noted, only a few studies have directly compared activations to hierarchical speech and music (i.e., sentences and melodies) using fMRI (Abrams et al., 2011; Fedorenko et al., 2011; Rogalsky et al., 2011). Findings from these studies conflict with the ERP literature (e.g., Koelsch, 2005; Koelsch et al., 2005) in that the fMRI studies identify distinct neuroanatomy and/or activation response patterns for music and speech processing, although there are notable differences across these studies, particularly relating to the involvement of Broca's area in speech and music.

The differences found across neuroimaging studies regarding the overlap of the neural correlates of speech and music likely arise from the tasks used in each of these studies. For example, Rogalsky et al. used passive listening and found no activation of Broca's area to either speech or music compared to rest. Conversely, Fedorenko et al. used a reading/memory probe task for sentences and an emotional ranking for music and found Broca's area to be preferentially activated by speech but also activated by music compared to rest. There is also evidence that the P600, the ERP component that is sensitive to both speech and music violations, is only present when subjects are actively attending to the stimulus (Besson and Faita, 1995; Brattico et al., 2006; Koelsch, 2011). The inclusion of a task may affect not only the brain regions involved, but also the reliability of results: an fMRI study of visual tasks reported that tasks with high attentional loads also had the highest reliability measures compared to passive conditions (Specht et al., 2003). This finding in the visual domain suggests the possibility that greater within- and between-subject variability in passive listening conditions may lead to null effects in group-averaged results.

Given the scarcity of within-subject neuroimaging studies of speech and music, it is particularly critical to examine across-study, between-subjects findings to build a better picture regarding the neurobiology of speech and music. A major barrier in interpreting between-subject neuroimaging results is the variety of paradigms and tasks used to investigate speech and music neural resources. Most scientists studying the neurobiology of speech and/or music would likely agree that they are interested in understanding the neural computations employed in naturalistic situations that are driven by the input of speech or music, and the differences between the two. However, explicit tasks such as discrimination or error detection are often used to drive brain responses in part by increasing the subject's attention to the stimuli and/or particular aspects of the stimuli. This may be problematic: the influence of task demands on the functional neuroanatomy recruited by speech is well documented (e.g., Baker et al., 1981; Noesselt et al., 2003; Scheich et al., 2007; Geiser et al., 2008; Rogalsky and Hickok, 2009) and both speech and music processing engage domain-general cognitive, memory, and motor networks in likely distinct, but overlapping ways (Besson et al., 2011). Task effects are known to alter inter- and intra-hemispheric activations to speech (Noesselt et al., 2003; Tervaniemi and Hugdahl, 2003; Scheich et al., 2007; Geiser et al., 2008; Rogalsky and Hickok, 2009). For example, there is evidence that right hemisphere fronto-temporal-parietal networks are significantly activated during an explicit task (rhythm judgment) with speech stimuli but not during passive listening to the same stimuli (Geiser et al., 2008).
The neurobiology of speech perception, and auditory processing more generally, also can vary based on the type of explicit task even when the same stimuli are used across tasks (Platel et al., 1997; Ni et al., 2000; Von Kriegstein et al., 2003; Geiser et al., 2008; Rogalsky and Hickok, 2009). This phenomenon is also well documented in the visual domain (Corbetta et al., 1990; Chawla et al., 1999; Cant and Goodale, 2007). For example, in the speech domain, syllable discrimination and single-word comprehension performance (as measured by a word-picture matching task) doubly dissociate in stroke patients with aphasia (Baker et al., 1981). Syllable discrimination implicates left-lateralized dorsal frontal-parietal networks, while speech comprehension and passive listening tasks engage mostly mid and posterior temporal regions (Dronkers et al., 2004; Schwartz et al., 2012; Rogalsky et al., 2015). Similarly, contextual effects have been reported regarding pitch: when pitch is needed for linguistic processing, such as in a tonal language, there is a left hemisphere auditory cortex bias, while pitch processing in a melody discrimination task yields a right hemisphere bias (Zatorre and Gandour, 2008). Another example of the importance of context in pitch processing is in vowel perception: vowels and tones have similar acoustic features and when presented in isolation (i.e., just a vowel, not in a consonant-vowel (CV) pair as would typically be perceived in everyday life) no significant differences have been found in temporal lobe activations (Jäncke et al., 2002). However, there is greater superior temporal activation for CVs than tones suggesting that the context of the vowel modulates the temporal networks activated (Jäncke et al., 2002).

One way to reduce the influence of a particular paradigm or task is to use meta-analysis techniques to identify areas of activation that consistently activate to a particular stimulus (e.g., speech, music) across a range of tasks and paradigms. Besson and Schön (2001) noted that meta-analyses of neuroimaging data would provide critical insight into the relationship between the neurobiology of language and music. They also suggested that meta-analyses of music-related neuroimaging data were not feasible due to the sparse number of relevant studies. Now, almost 15 years later, there is a large enough corpus of neuroimaging work to conduct quantitative meta-analyses of music processing with sufficient power. In fact, such meta-analyses have begun to emerge, for specific aspects of musical processing, in relation to specific cognitive functions [e.g., Slevc and Okada's (2015) cognitive control meta-analysis in relation to pitch and harmonic ambiguity], in addition to extensive qualitative reviews (e.g., Tervaniemi, 2001; Jäncke, 2008; Besson et al., 2011; Grahn, 2012; Slevc, 2012; Tillmann, 2012).
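As a rough illustration of how a coordinate-based meta-analysis can pool peaks across studies, the toy sketch below follows the general idea behind activation likelihood estimation: each study's reported foci are blurred into a per-study "modeled activation" map, and the maps are combined as a probabilistic union. This is not the pipeline used in the present study. The one-dimensional grid, the kernel width `sigma`, the peak probability `peak`, and the function names are all assumptions made for illustration; real analyses work in 3D stereotaxic space, use sample-size-dependent kernels, and test the result against a null distribution.

```python
import math

def modeled_activation(foci, grid, sigma, peak):
    """One study's map: at each grid point, the largest Gaussian value over its foci.

    Assumes every study reports at least one focus.
    """
    return [
        max(peak * math.exp(-((x - f) ** 2) / (2.0 * sigma ** 2)) for f in foci)
        for x in grid
    ]

def ale_map(studies, grid, sigma=2.0, peak=0.5):
    """Probabilistic union across studies: ALE(x) = 1 - prod_i (1 - MA_i(x))."""
    maps = [modeled_activation(foci, grid, sigma, peak) for foci in studies]
    ale = []
    for i in range(len(grid)):
        p_no_study_active = 1.0
        for study_map in maps:
            p_no_study_active *= 1.0 - study_map[i]
        ale.append(1.0 - p_no_study_active)
    return ale

# Hypothetical data: two studies report nearby foci (10 and 11), one reports an isolated focus (30).
grid = list(range(41))
studies = [[10], [11], [30]]
ale = ale_map(studies, grid)
print(round(ale[10], 3), round(ale[30], 3))
```

In this toy example the location where two studies converge receives a higher value than the isolated focus, which is the sense in which such methods highlight regions that activate consistently across tasks and paradigms.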

The present meta-analysis addresses the following outstanding questions: (1) has functional neuroimaging identified significant distinctions between the functional neuroanatomy of speech and music and (2) how do specific types of tasks affect how music recruits speech-processing networks? We then discuss the implications of our findings for future investigations of the neural computations of language and music.

# Materials and Methods

An exhaustive literature search was conducted via Google Scholar to locate published fMRI and PET studies reporting activations to musical stimuli. The following search terms were used to locate papers about music: "fMRI music," "fMRI and music," "fMRI pitch," and "fMRI rhythm." To the best of our knowledge, all relevant journal research articles have been collected for the purposes of this meta-analysis.

All journal articles that became part of the meta-analysis reported peak coordinates for relevant contrasts. Peak coordinates reported in the papers identified by the searches were divided into four categories that encompassed the vast majority of paradigms used in the articles: music passive listening, music discrimination, music error detection, and music memory<sup>1</sup>. Passive listening studies included papers in which participants listened to instrumental melodies or tone sequences with no explicit task as well as studies that asked participants to press a button when the stimulus concluded. Music discrimination studies included those that asked participants to compare two musical stimuli (e.g., related/unrelated, same/different). Music error detection studies included studies that instructed participants to identify a dissonant melody, unexpected note or deviant instrument. The music memory category included papers that asked participants to complete an n-back task, familiarity judgment, or rehearsal (covert or overt) of a melodic stimulus.
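The four-way sorting of music paradigms described above can be sketched as a simple lookup. The task labels, the dictionary `TASK_TO_CATEGORY`, and the record layout below are hypothetical shorthand invented for this example; the text does not specify how the coding was actually recorded.

```python
from collections import defaultdict

# Hypothetical task labels mapped onto the four paradigm categories named in the text.
TASK_TO_CATEGORY = {
    "no_task": "passive listening",
    "button_press_at_stimulus_end": "passive listening",
    "same_different": "discrimination",
    "related_unrelated": "discrimination",
    "dissonant_melody": "error detection",
    "unexpected_note": "error detection",
    "deviant_instrument": "error detection",
    "n_back": "memory",
    "familiarity_judgment": "memory",
    "rehearsal": "memory",
}

def group_peaks_by_category(contrasts):
    """Pool peak coordinates by paradigm category, skipping uncategorized tasks."""
    grouped = defaultdict(list)
    for contrast in contrasts:
        category = TASK_TO_CATEGORY.get(contrast["task"])
        if category is None:
            continue  # paradigm falls outside the four categories
        grouped[category].extend(contrast["peaks"])
    return dict(grouped)

# Made-up coordinates, for illustration only.
contrasts = [
    {"task": "n_back", "peaks": [(-40, 20, 10)]},
    {"task": "no_task", "peaks": [(50, -20, 4), (-52, -18, 6)]},
    {"task": "emotion_rating", "peaks": [(0, 0, 0)]},
]
grouped = group_peaks_by_category(contrasts)
print(sorted(grouped))
```

Keeping the mapping in one table makes the coding decisions explicit, so that a borderline paradigm can be reassigned without touching the pooling logic.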

Only coordinates from healthy adult, non-musician, control subjects were included. In studies that included a patient group and a control group, only the control group's coordinates were included. Studies were excluded from the final activation likelihood estimate (ALE) if the data did not meet the requirements for being included in ALE calculations, including for the following reasons: coordinates not reported, only approximate anatomical location reported, stereotaxic space not reported, inappropriate contrasts (e.g., speech > music only), activations corresponding to participant's emotional reactions to music, studies of professional/trained musicians, and studies of children.
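The inclusion and exclusion rules listed above amount to a checklist applied to each candidate study. A minimal sketch follows, assuming a hypothetical `StudyRecord` whose field names and group labels are invented for illustration; `group` refers to the sample whose coordinates are being considered (for a patient study, the control group).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StudyRecord:
    """Hypothetical record for one candidate study (field names are illustrative)."""
    name: str
    reports_peak_coordinates: bool     # False if only approximate anatomy is given
    stereotaxic_space: Optional[str]   # e.g. "MNI" or "Talairach"; None if unreported
    group: str                         # e.g. "healthy_adult_nonmusician", "musician", "child"
    contrast_appropriate: bool         # False for, e.g., a speech > music only contrast
    emotional_reaction_contrast: bool  # True if activations reflect emotional reactions

def exclusion_reasons(study: StudyRecord) -> List[str]:
    """Return every reason a study fails the criteria; an empty list means include."""
    reasons = []
    if not study.reports_peak_coordinates:
        reasons.append("coordinates not reported")
    if study.stereotaxic_space is None:
        reasons.append("stereotaxic space not reported")
    if study.group != "healthy_adult_nonmusician":
        reasons.append("group is " + study.group)
    if not study.contrast_appropriate:
        reasons.append("inappropriate contrast")
    if study.emotional_reaction_contrast:
        reasons.append("emotional-reaction activations")
    return reasons

# Made-up candidates, for illustration only.
candidates = [
    StudyRecord("A", True, "MNI", "healthy_adult_nonmusician", True, False),
    StudyRecord("B", True, None, "healthy_adult_nonmusician", True, False),
    StudyRecord("C", True, "Talairach", "musician", True, False),
]
included = [s.name for s in candidates if not exclusion_reasons(s)]
print(included)
```

Returning the full list of reasons, rather than a single yes/no flag, makes it possible to report how many studies were dropped for each criterion.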

In addition to collecting the music-related coordinates via an exhaustive search, we also gathered a representative sample of fMRI and PET studies that reported coordinates for passive listening to intelligible speech compared to some type of non-speech control (e.g., tones, noise, rest, visual stimuli).

<sup>1</sup>The music categories included studies with stimuli of the following types: instrumental unfamiliar and familiar melodies, tone sequences and individual tones. In comparison, the speech categories described below included studies with stimuli such as individual phonemes, vowels, syllables, words, pseudowords, sentences, and pseudoword sentences. For the purposes of the present study, we have generated two distinct groups of stimuli to compare. However, music and speech are often conceptualized as being two ends of a continuum with substantial gray area between the two extremes (Koelsch, 2011). For example, naturally spoken sentences contain rhythmic and pitch-related prosodic features and a familiar melody likely automatically elicits a mental representation of the song's lyrics.

Coordinates corresponding to the following tasks were also extracted: speech discrimination, speech detection, and speech memory. The purpose of these speech conditions is to act as comparison groups for the music groups. Coordinates for this purpose were extracted from six sources: five well-cited review papers (Price, 2010; Zheng et al., 2010; Turkeltaub and Coslett, 2010; Rogalsky et al., 2011; Adank, 2012) and the brain imaging meta-analysis database Neurosynth.org. The five review papers yielded a total of 42 studies that fit the aforementioned criteria. An additional 49 relevant papers were found using the Neurosynth.org database with the search criteria "speech perception," "speech processing," "speech," and "auditory working memory." These methods resulted in 91 studies in which control subjects passively listened to speech or completed an auditory verbal memory, speech discrimination, or speech detection task. The passive listening speech condition included studies in which participants listened to speech stimuli with no explicit task as well as studies that asked participants to press a button when the stimulus concluded. Papers were included in the speech discrimination category if they asked participants to compare two speech stimuli (e.g., a same/different task). The speech detection category contained papers that asked participants to detect semantic, intelligibility, or grammatical properties or detect phonological, semantic, or syntactic errors. Studies included in the speech memory category were papers that instructed participants to complete an n-back task or rehearsal (covert or overt) of a speech (auditory verbal) stimulus.

Analyses were conducted using the meta-analysis software GingerALE to calculate ALEs for each condition based on the coordinates collected (Eickhoff et al., 2009, 2012; Turkeltaub et al., 2012). All results are reported in Talairach space. Coordinates originally reported in MNI space were transformed to Talairach space using GingerALE's stereotaxic coordinate converter. Once all coordinates were in Talairach space, each condition was analyzed individually using the following GingerALE parameters: less conservative (larger) mask size, Turkeltaub nonadditive ALE method (Turkeltaub et al., 2012), subject-based FWHM (Eickhoff et al., 2009), corrected threshold of p < 0.05 using false discovery rate (FDR), and a minimum cluster volume of 200 mm<sup>3</sup>. We obtained subtraction contrasts between two given conditions by directly comparing activations between two conditions. To correct for multiple comparisons, each contrast's threshold was set to p < 0.05, whole-brain corrected following the FDR algorithm with p value permutations set at 10,000, and a minimum cluster size of 200 mm<sup>3</sup> (Eickhoff et al., 2009). ALE statistical maps were rendered onto the Colin Talairach template brain using the software MRIcron (Rorden and Brett, 2000).
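The nonadditive ALE computation described above can be illustrated with a minimal sketch. This is not GingerALE's code: the function names `modeled_activation` and `ale_map`, the toy grid, the example foci, the per-study FWHM values, and the voxel-volume normalization are all assumptions made for illustration. Each focus is modeled as a 3D Gaussian, each study's modeled activation (MA) map takes the maximum across its own foci (the nonadditive step), and the ALE value at each voxel is the probabilistic union of the MA maps across studies.

```python
import numpy as np

# Conversion from full width at half maximum to Gaussian standard deviation.
FWHM_TO_SIGMA = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))


def modeled_activation(foci_mm, grid_mm, fwhm_mm, voxel_volume_mm3=8.0):
    """Modeled activation (MA) map for one study (hypothetical helper).

    Nonadditive variant: a voxel takes the MAXIMUM probability over the
    study's foci, so nearby foci from one study do not sum.
    """
    sigma = fwhm_mm * FWHM_TO_SIGMA
    # Assumed normalization: Gaussian density times voxel volume.
    norm = voxel_volume_mm3 / ((2.0 * np.pi) ** 1.5 * sigma ** 3)
    ma = np.zeros(grid_mm.shape[0])
    for focus in np.atleast_2d(np.asarray(foci_mm, dtype=float)):
        dist_sq = np.sum((grid_mm - focus) ** 2, axis=1)
        prob = np.clip(norm * np.exp(-dist_sq / (2.0 * sigma ** 2)), 0.0, 1.0)
        ma = np.maximum(ma, prob)
    return ma


def ale_map(studies, grid_mm):
    """ALE as the union of MA maps: 1 - prod_i (1 - MA_i)."""
    not_active = np.ones(grid_mm.shape[0])
    for foci_mm, fwhm_mm in studies:
        not_active *= 1.0 - modeled_activation(foci_mm, grid_mm, fwhm_mm)
    return 1.0 - not_active


# Toy 2 mm grid spanning -10..10 mm on each axis (illustrative only).
axis = np.arange(-10.0, 11.0, 2.0)
grid = np.array(np.meshgrid(axis, axis, axis, indexing="ij")).reshape(3, -1).T

# Two made-up studies: (list of foci in mm, FWHM in mm).
studies = [
    ([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], 10.0),
    ([[0.0, 2.0, 0.0]], 9.0),
]
ale = ale_map(studies, grid)
```

Note that only the location of each focus enters the computation; the sketch, like the ALE method as described here, uses no beta or statistical value from the original studies. In practice the FWHM would be derived from each study's sample size (Eickhoff et al., 2009) rather than set by hand as it is here.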

### Results

### Search Results

The literature search yielded 80 music studies (76 fMRI studies, 4 PET studies) and 91 relevant speech papers (88 fMRI, 3 PET studies) meeting the inclusion criteria described above. **Table 1** indicates the number of studies, subjects, and coordinates in each of the four music conditions, as well as for each of the four speech conditions.

### Passive Listening To Music vs. Passive Listening To Speech

The music passive listening ALE identified large swaths of voxels bilaterally, spanning the length of the superior temporal gyri (STG), as well as additional smaller clusters, including in the bilateral inferior frontal gyrus (pars opercularis), bilateral postcentral gyrus, bilateral insula, left inferior parietal lobule, left medial frontal gyrus, right precentral gyrus, and right middle frontal gyrus (**Figure 1A**, **Table 2**). The speech passive listening ALE also identified bilateral superior temporal regions as well as bilateral precentral and inferior frontal (pars opercularis) regions. Notably, the speech ALE identified bilateral anterior STG, bilateral superior temporal sulcus (i.e., both banks, the middle and superior temporal gyri) and left inferior frontal gyrus (pars triangularis) regions not identified by the music ALE (**Figure 1A**, **Table 2**). ALEs used a threshold of p < 0.05, FDR corrected.
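For readers unfamiliar with FDR correction, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure at q = 0.05. It is an assumption that this matches the FDR option used in GingerALE; the function name `fdr_threshold` and the example p values are invented for illustration and are not data from this meta-analysis.

```python
import numpy as np


def fdr_threshold(p_values, q=0.05):
    """Largest p value passing the Benjamini-Hochberg criterion.

    Returns None when no test survives. Hypothetical helper for illustration.
    """
    p = np.sort(np.asarray(p_values, dtype=float).ravel())
    n = p.size
    # The k-th smallest p value passes if p_(k) <= q * k / n.
    passes = p <= q * np.arange(1, n + 1) / n
    if not passes.any():
        return None
    return p[np.nonzero(passes)[0].max()]


# Made-up voxelwise p values.
p_vals = np.array([0.001, 0.008, 0.039, 0.041, 0.042,
                   0.060, 0.074, 0.205, 0.212, 0.216])
threshold = fdr_threshold(p_vals, q=0.05)   # 0.008 for these values
significant = p_vals <= threshold           # two voxels survive
```

In this toy example several p values fall below the uncorrected 0.05 level, but only the two smallest survive correction. A minimum cluster volume (200 mm<sup>3</sup> in the present analyses) would then be applied to the surviving voxels as a separate step.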

Pairwise contrasts of passive listening to music vs. passive listening to speech were calculated to identify any brain regions that were significantly activated more by speech or music, respectively. Results were as follows (p < 0.05, FDR corrected): the speech > music contrast identified significant regions on both banks of the bilateral superior temporal sulcus extending the length of the left temporal lobe and mid/anterior right temporal lobe, left inferior frontal lobe (pars triangularis), left precentral gyrus, and left postcentral gyrus regions. Music > speech identified bilateral insula and bilateral superior temporal/parietal operculum clusters as well as a right inferior frontal gyrus region (**Figure 1B**, **Table 2**). These results coincide with previous reports of listening to speech activating a lateral temporal network particularly in the superior temporal sulcus and extending into the anterior temporal lobe, while listening to music activated a more dorsal medial temporal-parietal network (Jäncke et al., 2002; Rogalsky et al., 2011). These results also coincide with Fedorenko et al.'s (2011) finding that Broca's area, the pars triangularis in particular, is preferentially responsive to language stimuli.

### Music Tasks vs. Speech Tasks

The passive listening ALE results identify distinct and overlapping regions of speech and music processing. We now turn to the question of how these distinctions change as a function of the type of task employed. First, ALEs were computed for each music task condition, p < 0.05 FDR corrected (**Figure 1**, **Table 2**). The music task conditions' ALEs all significantly identified bilateral STG, bilateral precentral gyrus, and inferior parietal regions, overlapping with the passive listening music ALE (**Figure 2**). The tasks also activated additional inferior frontal and inferior parietal regions not identified by the passive listening music ALE; these differences are discussed in a subsequent section.

To compare the brain regions activated by each music task to those activated by speech in similar tasks, pairwise contrasts of the ALEs for each music task vs. its corresponding speech task group were calculated (**Figure 3**, **Table 2**). Music discrimination > speech discrimination identified regions including bilateral inferior frontal gyri (pars opercularis), bilateral pre and postcentral gyri, bilateral medial frontal gyri, left inferior parietal lobule, and left cerebellum, whereas speech discrimination > music discrimination identified bilateral regions in the anterior superior temporal sulci (including both superior and middle temporal gyri). Music detection > speech detection identified a bilateral group of clusters spanning the superior temporal gyri, bilateral precentral gyri, bilateral insula and bilateral inferior parietal regions, as well as clusters in the right middle frontal gyrus. Speech detection > music detection identified bilateral superior temporal sulci regions as well as left inferior frontal regions (pars triangularis and pars opercularis). Music memory > speech memory identified a left posterior superior temporal/inferior parietal region and bilateral medial frontal regions; speech memory > music memory identified left inferior frontal gyrus (pars opercularis and pars triangularis) and bilateral superior and middle temporal gyri.

In sum, the task pairwise contrasts in many ways mirror the passive listening contrast: music tasks activated more dorsal/medial superior temporal and inferior parietal regions, while speech tasks activated superior temporal sulcus regions, particularly in the anterior temporal lobe. In addition, notable differences were found in Broca's area and its right hemisphere homolog: in discrimination tasks music significantly activated Broca's area (specifically the pars opercularis) more than speech. However, in detection and memory tasks speech activated Broca's area (pars opercularis and pars triangularis) more than music. The right inferior frontal gyrus responded equally to speech and music in both detection and memory tasks, but responded more to music than speech in discrimination tasks. Also notably, in the memory tasks, music activated a lateral superior temporal/inferior parietal cluster (in the vicinity of Hickok and Poeppel's "area Spt") more than speech while an inferior frontal cluster including the pars opercularis was activated more for speech than music. Both area Spt and the pars opercularis previously have been implicated in a variety of auditory working memory tasks (including speech and pitch working memory) in both lesion patients and control subjects (Koelsch and Siebel, 2005; Koelsch et al., 2009; Buchsbaum et al., 2011) and are considered to be part of an auditory sensory-motor integration network (Hickok et al., 2003; Hickok and Poeppel, 2004, 2007).

### TABLE 2 | Locations, peaks and cluster size for significant voxel clusters for each condition's ALE and for each contrast of interest.

The x, y, z coordinates are in Talairach space and refer to the peak voxel activated in each contrast. All contrasts are thresholded at p = 0.05. Asterisks indicate anatomical location of peak voxel.

FIGURE 2 | Representative sagittal slices of the ALEs for the (A) music discrimination, (B) music error detection and (C) music memory task conditions, p < 0.05, corrected, overlaid on top of the passive music listening ALE for comparison.

memory task conditions, compared to the corresponding speech task, p < 0.05, corrected.

### Music Tasks vs. Passive Listening To Speech

Findings from various music paradigms and tasks are often reported as engaging language networks because of location; a music paradigm activating Broca's area or superior temporal regions is frequently described as recruiting classic language areas. However, it is not clear if these music paradigms are in fact engaging the language networks engaged in the natural, everyday process of listening to speech. Thus, pairwise contrasts of the ALEs for listening to speech vs. the music tasks were calculated (**Figure 4**; **Table 2**). Music discrimination > speech passive listening identified regions in bilateral precentral gyri, bilateral medial frontal gyri, left postcentral gyrus, left inferior parietal lobule, left cerebellum, right inferior and middle frontal gyri, and right superior temporal gyrus. Music error detection > speech identified bilateral precentral gyri, bilateral superior temporal gyri, bilateral insula, bilateral basal ganglia, left postcentral gyrus, left cerebellum, bilateral inferior parietal lobe, right middle frontal gyrus, right inferior frontal gyrus and the right thalamus. Music memory > speech identified portions of bilateral inferior frontal gyri, bilateral medial frontal gyri, left inferior parietal lobe, left pre- and postcentral gyri, and right insula. Compared to all three music tasks, speech significantly activated bilateral superior temporal sulcus regions and only activated Broca's area (specifically the pars triangularis) more than music detection. The recruitment of Broca's area and adjacent regions for music was task dependent: compared to listening to speech, music detection and discrimination activated additional bilateral inferior precentral gyrus regions immediately adjacent to Broca's area and music memory activated the left inferior frontal gyrus more than speech (in all three subregions: pars opercularis, pars triangularis, and pars orbitalis). In the right hemisphere homolog of Broca's area, all three music tasks activated this region more than listening to speech as well as adjacent regions in the right middle frontal gyrus. Altogether, these results suggest that the recruitment of neural resources used in speech for music processing depends on the experimental paradigm. The finding of music memory tasks eliciting widespread activation in Broca's area compared to listening to speech is likely due to the inferior frontal gyrus, and the pars opercularis in particular, being consistently implicated in articulatory rehearsal and working memory (Hickok et al., 2003; Buchsbaum et al., 2005, 2011), resources that are likely recruited by the music memory tasks.

### Music Tasks vs. Passive Listening To Music

Lastly, we compared the music task ALEs to the music passive listening ALE using pairwise contrasts to better characterize task-specific activations to music. Results (p < 0.05, FDR corrected) include: (1) music discrimination > music listening identified bilateral inferior precentral gyri, bilateral medial frontal regions, left postcentral gyrus, left inferior parietal lobule, left cerebellum, right middle frontal gyrus and right insula; (2) music error detection > music listening identified bilateral medial frontal, bilateral insula, bilateral inferior parietal areas, bilateral superior temporal gyri, bilateral basal ganglia, left pre- and postcentral gyri, right inferior and middle frontal gyri and right cerebellum; (3) music memory > passive listening identified bilateral inferior frontal gyri (pars opercularis, triangularis and orbitalis in the left hemisphere, only the latter two in the right hemisphere), bilateral medial frontal gyri, bilateral insula, bilateral cerebellum, left middle frontal gyrus, left inferior parietal lobe, left superior and middle temporal gyri, right basal ganglia, right hippocampus and right parahippocampal gyrus (**Figure 5**, **Table 2**). The medial frontal and inferior parietal activations identified in the tasks compared to listening likely reflect increased vigilance and attention due to the presence of a task, as activation in these regions is known to increase as a function of effort and performance on tasks across a variety of stimuli types and domains (Petersen and Posner, 2012; Vaden et al., 2013). To summarize the findings in Broca's area and its right hemisphere homolog, music memory tasks activated Broca's area more than just listening to music, while music discrimination and detection tasks activated right inferior frontal gyrus regions more than listening to music. Also note that all three music tasks compared to listening to music implicate regions on the anterior bank of the inferior portion of the precentral gyrus immediately adjacent to Broca's area. Significant clusters more active for music passive listening than for each of the three task conditions are found in the bilateral superior temporal gyri (**Table 2**).

### Discussion

The present meta-analysis examined data from 80 functional neuroimaging studies of music and 91 studies of speech to characterize the relationship between the brain networks activated by listening to speech vs. listening to music. We also compared the brain regions implicated in three frequently used music paradigms (error detection, discrimination, and memory) to the regions implicated in similar speech paradigms to determine how task effects may change how the neurobiology of music processing is related to that of speech. We replicated, across a large collection of studies, previous within-subject findings that speech activates a predominately lateral temporal network, while music preferentially activates a more dorsal medial temporal network extending into the inferior parietal lobe. In Broca's area, we found overlapping resources for passive listening to speech and music in the pars opercularis, but speech "specific" resources in pars triangularis; the right hemisphere homolog of Broca's area was equally responsive to listening to speech and music. The use of a paradigm containing an explicit task (error detection, discrimination or memory) altered the relationship between the brain networks engaged in music and speech. For example, speech discrimination tasks do not activate the pars triangularis (i.e., the region identified as "speech specific" by the passive listening contrast) more than music discrimination tasks, and both speech detection and memory tasks activate the pars opercularis (i.e., the region responding equally to music and speech passive listening) more than the corresponding music tasks, while music discrimination activates pars opercularis more than speech discrimination. These findings suggest that inferior frontal contributions to music processing, and their overlap with speech resources, may be modulated by task. The following sections discuss these findings in relation to neuroanatomical models of speech and music.

### Hemispheric Differences for Speech and Music

The lateralization of speech and music processing has been investigated for decades. While functional neuroimaging studies report bilateral activation for both speech and music (Jäncke et al., 2002; Abrams et al., 2011; Fedorenko et al., 2011; Rogalsky et al., 2011), evidence from amusia, aphasia and other patient populations has traditionally identified the right hemisphere as critical for music and the left for basic language processes in most individuals (Gazzaniga, 1983; Peretz et al., 2003; Damasio et al., 2004; Hyde et al., 2006). Further evidence for hemispheric differences comes from asymmetries in early auditory cortex: left hemisphere auditory cortex has better temporal resolution and is more sensitive to rapid temporal changes critical for speech processing, while the right hemisphere auditory cortex has higher spectral resolution and is more modulated by spectral changes, which optimize musical processing (Zatorre et al., 2002; Poeppel, 2003; Schönwiesner et al., 2005; Hyde et al., 2008). Thus, left auditory cortex has been found to be more responsive to phonemes than chords, while right auditory cortex is more responsive to chords than phonemes (Tervaniemi et al., 1999, 2000). This hemispheric specialization coincides with evidence from both auditory and visual domains, suggesting that the left hemisphere tends to be tuned to local features, while the right hemisphere is tuned to more global features (Sergent, 1982; Ivry and Robertson, 1998; Sanders and Poeppel, 2007).

Hemispheric differences in the present study for speech and music vary by location. We did not find any qualitative hemispheric differences between speech and music in the temporal lobe. Speech bilaterally activated lateral superior and middle temporal regions, while music bilaterally activated more dorsal medial superior temporal regions extending into the inferior parietal lobe. However, these bilateral findings should not be interpreted as evidence against hemispheric asymmetries for speech vs. music. The hemispheric differences widely reported in auditory cortex almost always are a matter of degree, e.g., phonemes and tones both activate bilateral superior temporal regions, but a direct comparison indicates a left hemisphere preference for the speech and a right hemisphere preference for the tones (Jäncke et al., 2002; Zatorre et al., 2002). These differences would not be reflected in our ALE results because both conditions reliably activate the same regions although to different degrees and the ALE method does not assign weight to coordinates (i.e., all the significant coordinates reported for contrasts of interest in the studies used) based on their beta or statistical values.

The frontal lobe results, however, did include some laterality differences of interest: passive listening to speech activated portions of the left inferior frontal gyrus (i.e., Broca's area), namely in the pars triangularis, significantly more than listening to music. A right inferior frontal gyrus cluster, extending into the insula, was activated significantly more for listening to music than speech. These findings in Broca's area coincide with Koelsch's neurocognitive model of music perception, in that right frontal regions are more responsive to musical stimuli and that the pars opercularis, but not the pars triangularis, is engaged in structure building of auditory stimuli (Koelsch, 2011). It is also noteworthy that the inclusion of a task altered hemispheric differences in the frontal lobes: the music discrimination tasks activated the left pars opercularis more than speech discrimination, while speech detection and memory tasks activated all of Broca's area (pars opercularis and pars triangularis) more than music detection and memory tasks; music detection and discrimination tasks, but not music memory tasks, activated the right inferior frontal gyrus more than corresponding speech tasks. These task-modulated asymmetries in Broca's area for music are particularly important when interpreting the rich electrophysiological literature of speech and music interactions. For example, both the early right anterior negativity (ERAN) and early left anterior negativity (ELAN) are modulated by speech and music, and are believed to have sources in both Broca's area and its right hemisphere homolog (Friederici et al., 2000; Maess et al., 2001; Koelsch and Friederici, 2003). Thus, the lateralization patterns found in the present study emphasize the need to consider that similar ERP effects for speech and music may arise from different underlying lateralization patterns that may be task-dependent.

### Speech vs. Music in the Anterior Temporal Lobe

Superior and middle posterior temporal regions on the banks of the superior temporal sulcus were preferentially activated in each speech condition compared to each corresponding music condition in the present meta-analysis. This is not surprising, as these posterior STS regions are widely implicated in lexical semantic processing (Price, 2010) and STS regions have been found to be more responsive to syllables than tones (Jäncke et al., 2002). Perhaps more interestingly, the bilateral anterior temporal lobe (ATL) also was activated more for each speech condition than for each corresponding music condition. The role of the ATL in speech processing is debated (e.g., Scott et al., 2000 cf. Hickok and Poeppel, 2004, 2007), but the ATL is reliably sensitive to syntactic structure in speech compared to several control conditions including word lists, scrambled sentences, spectrally rotated speech, environmental sound sequences, and melodies (Mazoyer et al., 1993; Humphries et al., 2001, 2005, 2006; Xu et al., 2005; Spitsyna et al., 2006; Rogalsky and Hickok, 2009; Friederici et al., 2010; Rogalsky et al., 2011). One hypothesis is that the ATL is implicated in combinatorial semantic processing (Wong and Gallate, 2012; Wilson et al., 2014), although pseudoword sentences (i.e., sentences lacking meaningful content words) also activate the ATL (Humphries et al., 2006; Rogalsky et al., 2011). Several of the speech activation coordinates included in the present meta-analysis were from studies that used sentences and phrases as stimuli (with and without semantic content). It is likely that these coordinates are driving the ATL findings. Our finding that music did not activate the ATL supports the idea that the ATL is not responsive to hierarchical structure per se but rather needs linguistic and/or semantic information for it to be recruited.

### Speech vs. Music in Broca's Area

There is no consensus regarding the role of Broca's area in receptive speech processes (e.g., Fedorenko and Kanwisher, 2011; Hickok and Rogalsky, 2011; Rogalsky and Hickok, 2011). Results from the present meta-analysis indicate that listening to speech activated both the pars opercularis and pars triangularis portions of Broca's area, while listening to music only activated the pars opercularis. The pars triangularis has been proposed to be involved in semantic integration (Hagoort, 2005) as well as in cognitive control processes such as conflict resolution (Novick et al., 2005; Rogalsky and Hickok, 2011). It is likely that the speech stimuli contain more semantic content than the music stimuli, and thus semantic integration processes may account for the speech-only response in pars triangularis. However, there was no significant difference in activations in the pars triangularis for the music discrimination and music detection tasks vs. passive listening to speech, and the music memory tasks activated portions of the pars triangularis more than listening to speech. These music task-related activations in the pars triangularis may reflect the use of semantic resources for categorization or verbalization strategies to complete the music tasks, but may also reflect increased cognitive control processes to support reanalysis of the stimuli to complete the tasks. The activation of the left pars opercularis for both speech and music replicates numerous individual studies implicating the pars opercularis in both speech and musical syntactic processing (e.g., Koelsch and Siebel, 2005; Rogalsky and Hickok, 2011) as well as in a variety of auditory working memory paradigms (e.g., Koelsch and Siebel, 2005; Buchsbaum et al., 2011).

### Implications for Neuroanatomical Models of Speech and Music

It is particularly important to consider task-related effects when evaluating neuroanatomical models of the interactions between speech and music. It has been proposed that inferior frontal cortex (including Broca's area) is the substrate for shared speech-music executive function resources, such as working memory and/or cognitive control (Patel, 2003; Slevc, 2012; Slevc and Okada, 2015) as well as auditory processes such as structure analysis, repair, working memory and motor encoding (Koelsch and Siebel, 2005; Koelsch, 2011). Of particular importance here is Slevc and Okada's (2015) proposal that cognitive control may be one of the shared cognitive resources for linguistic and musical processing when reanalysis and conflict resolution are necessary. Different tasks likely recruit cognitive control resources to different degrees, and thus may explain task-related differences for the frontal lobe's response to speech and music. There is ample evidence to support Slevc and Okada's hypothesis: classic cognitive control paradigms such as the Stroop task (Stroop, 1935; MacLeod, 1991) elicit overlapping activations in Broca's area when processing noncanonical sentence structures (January et al., 2009). Unexpected harmonic and melodic information in music interferes with Stroop task performance (Masataka and Perlovsky, 2013). The neural responses to syntactic and sentence-level semantic ambiguities in language also interact with responses to unexpected harmonics in music (Koelsch et al., 2005; Steinbeis and Koelsch, 2008b; Slevc et al., 2009; Perruchet and Poulin-Charronnat, 2013). The present results suggest that this interaction between language and music, possibly via cognitive control mechanisms, localized to Broca's area, may be task driven and not inherent to the stimuli themselves. In addition, many language/music interaction studies use a reading language task with simultaneous auditory music stimuli; it is possible that a word-by-word presentation reading paradigm engages additional reanalysis mechanisms that may dissociate from resources used in auditory speech processing (Tillmann, 2012).

Slevc and Okada suggest that future studies should use tasks designed to drive activation of specific processes, presumably including reanalysis. However, the present findings suggest it is possible that these task-induced environments may actually drive overlap of neural resources for speech and music not because they are taxing shared sensory computations but rather because they are introducing additional processes that are not elicited during typical, naturalistic music listening. For example, consider the present findings in the left pars triangularis: this region is not activated during listening to music, but is activated while listening to speech. However, by presumably increasing the need for reanalysis mechanisms via discrimination or memory tasks, music does recruit this region.

There may be inferior frontal shared mechanisms that are stimulus driven while others are task driven: Broca's area is a diverse region in terms of its cytoarchitecture, connectivity and response properties (Amunts et al., 1999; Anwander et al., 2007; Rogalsky and Hickok, 2011; Rogalsky et al., in press). It is possible that some networks are task driven and some are stimulus driven. The hypotheses of Koelsch et al. are largely grounded in behavioral and electrophysiology studies that indicate an interaction between melodic and syntactic information (e.g., Koelsch et al., 2005; Fedorenko et al., 2009; Hoch et al., 2011). It is not known if these interactions are stimulus driven; a variety of tasks have been used in this literature, including discrimination, anomaly/error detection (Koelsch et al., 2005; Carrus et al., 2013), grammatical acceptability (Patel et al., 1998a; Patel, 2008), final-word lexical decision (Hoch et al., 2011), and memory/comprehension tasks (Fedorenko et al., 2009, 2011). In addition, there is substantial variability across individual subjects, both functionally and anatomically, within Broca's area (Amunts et al., 1999; Schönwiesner et al., 2007; Rogalsky et al., in press). Thus, future within-subject studies are needed to better understand the role of cognitive control and other domain-general resources in musical processing independent of task.

Different tasks, regardless of the nature of the stimuli, may require different attentional resources (Shallice, 2003). Thus, it is possible that the inferior frontal differences between the music tasks and passive listening to music and speech are due to basic attentional differences, not the particular task per se. However, we find classic domain-general attention systems in the anterior cingulate and medial frontal cortex to be significantly activated across all conditions: music tasks, speech tasks, passive listening to music and passive listening to speech. These findings support Slevc and Okada's (2015) claim that domain-general attention mechanisms facilitated by anterior cingulate and medial frontal cortex are consistently engaged for music as they are for speech. Each of our music task conditions does activate these regions significantly more than passive listening, suggesting that the midline domain-general attention mechanisms engaged by music can be further activated by explicit tasks.

### Limitations and Future Directions

One issue in interpreting our results may be the proximity of distinct networks for speech and music (Peretz, 2006; Koelsch, 2011). Overlap in fMRI findings, particularly in a meta-analysis, does not necessarily mean that speech and music share resources in those locations. It is certainly possible that the spatial resolution of fMRI is not sufficient to visualize separation occurring at a smaller scale (Peretz and Zatorre, 2005; Patel, 2012). However, our findings of spatially distinct regions for music and speech clearly suggest the recruitment of distinct brain networks for speech and music.

Another potential issue related to the limitations of fMRI is that of sensitivity. Continuous fMRI scanning protocols (i.e., stimuli are presented simultaneously with the noise of scanning) and sparse temporal sampling fMRI protocols (i.e., stimuli are presented during silent periods between volume acquisitions) are both included in the present meta-analyses. It has been suggested that the loud scanner noise may reduce sensitivity to detecting hemodynamic response to stimuli, particularly complex auditory stimuli such as speech and music (Peelle et al., 2010; Elmer et al., 2012). Thus, it is possible that effects only detected by a sparse or continuous paradigm are not represented in our ALE results. However, a comparison of continuous vs. sparse fMRI sequences found no significant differences in speech activations in the frontal lobe between the pulse sequences (Peelle et al., 2010).

Priming paradigms measuring neurophysiological responses (ERP, fMRI, etc.) are one way to possibly circumvent task-related confounds in understanding the neurobiology of music in relation to that of speech. Tillmann (2012) suggests that priming paradigms may provide more insight into an individual's implicit musical knowledge than is demonstrated by performance on an explicit, overt task (e.g., Schellenberg et al., 2005; Tillmann et al., 2007). In fact, there are ERP studies that indicate that musical chords can prime processing of target words if the prime and target are semantically (i.e., emotionally) similar (Koelsch et al., 2004; Steinbeis and Koelsch, 2008a). However, most ERP priming studies investigating music or music/speech interactions have included an explicit task (e.g., Schellenberg et al., 2005; Tillmann et al., 2007; Steinbeis and Koelsch, 2008a). It is not known how the presence of an explicit task may affect priming mechanisms via top-down mechanisms. Priming is not explored in the present meta-analysis; to our knowledge there is only one fMRI priming study of music and speech, which focused on semantic (i.e., emotion) relatedness (Steinbeis and Koelsch, 2008a).

The present meta-analysis examines networks primarily in the cerebrum. Even though almost all of the studies included in our analyses focused on cortical structures, we still identified some subcortical task-related activations: music detection compared to music passive listening activated the basal ganglia and music memory tasks activated the thalamus, hippocampus and basal ganglia compared to music passive listening. No significant differences between passive listening to speech and music were found in subcortical structures. These findings (and null results) in subcortical regions should be interpreted cautiously: given the relatively small size of these structures, activations in these areas are particularly vulnerable to spatial smoothing filters and group averaging (Raichle et al., 1991; White et al., 2001). There is also strong evidence that music and speech share subcortical resources in the brainstem (Patel, 2011), which are not addressed by the present study. For example, periodicity is a critical aspect of both speech and music and known to modulate networks between the cochlea and inferior colliculus of the brainstem (Cariani and Delgutte, 1996; Patel, 2011). Further research is needed to better understand where speech and music processing networks diverge downstream from these shared early components.

### Conclusion

Listening to music and listening to speech engage distinct temporo-parietal cortical networks but share some inferior and medial frontal resources (at least at the resolution of fMRI). However, the recruitment of inferior frontal speech-processing regions for music is modulated by task. The present findings highlight the need to consider how task effects may be interacting with conclusions regarding the neurobiology of speech and music.

### Acknowledgments

This work was supported by a GRAMMY Foundation Scientific Research Grant (PI Rogalsky) and Arizona State University. We thank Nicole Blumenstein and Dr. Nancy Moore for their help in the preparation of this manuscript.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 LaCroix, Diaz and Rogalsky. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Studying Musical and Linguistic Prediction in Comparable Ways: The Melodic Cloze Probability Method

*Allison R. Fogel1\*, Jason C. Rosenberg2, Frank M. Lehman3, Gina R. Kuperberg1,4,5 and Aniruddh D. Patel1*

*<sup>1</sup> Department of Psychology, Tufts University, Medford, MA, USA, <sup>2</sup> Department of Arts and Humanities, Yale-NUS College, Singapore, Singapore, <sup>3</sup> Department of Music, Tufts University, Medford, MA, USA, <sup>4</sup> MGH/HST Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA, <sup>5</sup> Department of Psychiatry, Massachusetts General Hospital, Charlestown, MA, USA*

Prediction or expectancy is thought to play an important role in both music and language processing. However, prediction is currently studied independently in the two domains, limiting research on relations between predictive mechanisms in music and language. One limitation is a difference in how expectancy is quantified. In language, expectancy is typically measured using the cloze probability task, in which listeners are asked to complete a sentence fragment with the first word that comes to mind. In contrast, previous production-based studies of melodic expectancy have asked participants to sing continuations following only one to two notes. We have developed a melodic cloze probability task in which listeners are presented with the beginning of a novel tonal melody (5–9 notes) and are asked to sing the note they expect to come next. Half of the melodies had an underlying harmonic structure designed to constrain expectations for the next note, based on an implied authentic cadence (AC) within the melody. Each such 'authentic cadence' melody was paired with a 'non-cadential' (NC) melody matched in length, rhythm and melodic contour, but differing in implied harmonic structure. Participants showed much greater consistency in the notes sung following AC vs. NC melodies on average. However, significant variation in degree of consistency was observed within both AC and NC melodies. Analysis of individual melodies suggests that pitch prediction in tonal melodies depends on the interplay of local factors just prior to the target note (e.g., local pitch interval patterns) and larger-scale structural relationships (e.g., melodic patterns and implied harmonic structure). We illustrate how the melodic cloze method can be used to test a computational model of melodic expectation. Future uses for the method include exploring the interplay of different factors shaping melodic expectation, and designing experiments that compare the cognitive mechanisms of prediction in music and language.

Keywords: music, language, prediction, music cognition, melodic expectation, cloze probability

### *Edited by:*

*McNeel Gordon Jantzen, Western Washington University, USA*

### *Reviewed by:*

*Elizabeth Hellmuth Margulis, University of Arkansas, USA; E. Glenn Schellenberg, University of Toronto, Canada*

> *\*Correspondence: Allison R. Fogel, allison.fogel@tufts.edu*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

*Received: 01 June 2015; Accepted: 26 October 2015; Published: 12 November 2015*

### *Citation:*

*Fogel AR, Rosenberg JC, Lehman FM, Kuperberg GR and Patel AD (2015) Studying Musical and Linguistic Prediction in Comparable Ways: The Melodic Cloze Probability Method. Front. Psychol. 6:1718. doi: 10.3389/fpsyg.2015.01718*

# INTRODUCTION

Recent years have seen growing interest in cognitive and neural relations between music and language. Although there are clear differences between the two (for example, language can convey specific semantic concepts and propositions in a way that instrumental music cannot; Slevc and Patel, 2011), they share several features. For example, both language and music involve the generation and comprehension of complex, hierarchically structured sequences made from discrete elements combined in principled ways (Patel, 2003; Koelsch et al., 2013), and both rely heavily on implicit learning during development (Tillmann et al., 2000).

While neuropsychology has provided clear cases of selective deficits in linguistic or musical processing following brain damage (e.g., Peretz, 1993), several neuroimaging studies of healthy individuals suggest overlap in the brain mechanisms involved in processing linguistic and musical structure. One early demonstration of this overlap came from event-related potential (ERP) research, which revealed that a component known as the P600 is observed in response to syntactically challenging or anomalous events in both domains (Patel et al., 1998). Later research using MEG and fMRI provided further suggestions of neural overlap in structural processing, e.g., by implicating Broca's region in the processing of tonal-harmonic structure (e.g., Maess et al., 2001; Tillmann et al., 2003; LaCroix et al., 2015; Musso et al., 2015; though see Fedorenko et al., 2012). To resolve the apparent contradiction between evidence from neuropsychology and neuroimaging, Patel (2003) proposed the shared syntactic integration resource hypothesis (SSIRH). The SSIRH posits a distinction between domain-specific representations in long-term memory (e.g., stored knowledge of words and their syntactic features, and of chords and their harmonic features), which can be separately damaged, and shared neural resources which act upon these representations as part of structural processing. This "dual-system" model proposes that syntactic integration of incoming elements in language and music involves the interaction (via long-distance neural connections) of shared "resource networks" and domain-specific "representation networks" (see Patel, 2013 for a detailed discussion, including relations between the SSIRH and Hagoort's (2005) "memory, unification, and control" model of language processing).

The SSIRH predicted that simultaneous demands on linguistic and musical structural integration should produce interference. This prediction has been supported by behavioral and neural research (for a review, see Kunert and Slevc, 2015). For example, behavioral studies by Fedorenko et al. (2009) and Slevc et al. (2009) have shown that it is particularly difficult for participants to process complex syntactic structures in both language and music simultaneously (see also Hoch et al., 2011; Carrus et al., 2013; though cf. Perruchet and Poulin-Charronnat, 2013). Additionally, Koelsch et al. (2005) conducted an ERP study that observed an interaction between structural processing in language and music, as reflected by effects of music processing on the left anterior negativity (LAN, associated with processing syntax in language) and effects of language processing on the early right anterior negativity (ERAN, associated with processing musical syntax).

In addition to structural integration, it has been suggested that prediction may be another process that operates similarly in language and music (Koelsch, 2012a,b; Patel, 2012). Prediction is increasingly thought to be a fundamental aspect of human cognition (Clark, 2013), and is a growing topic of research in psycholinguistics (Van Petten and Luka, 2012; see Kuperberg and Jaeger, in press for a recent review). It has become clear that we regularly use context to predict upcoming words when comprehending language (Tanenhaus et al., 1995; Altmann and Kamide, 1999; Wicha et al., 2004; DeLong et al., 2005). This has been demonstrated using ERPs, a brain measure with millisecond-level temporal resolution that allows one to study cognitive processing during language comprehension. Recent evidence from ERP research has suggested that prediction in language processing occurs at multiple distinguishable levels (e.g., syntactic, semantic, phonological) (Pickering and Garrod, 2007; Kuperberg and Jaeger, in press).

Strong lexical predictions for a specific word occur when multiple types of information within a linguistic context constrain strongly for the semantic features, the syntactic properties, and the phonological form of a specific word. For example, the sentence *"The piano is out of \_\_\_\_"* leads to a strong expectation for the word "tune", so one can refer to this as a high lexical constraint sentence. It is well established that unexpected words following these contexts evoke a larger N400 ERP component (occurring 300–500 ms after the presentation of the final word) than expected words (Kutas and Hillyard, 1980; Kutas and Hillyard, 1984; Kutas and Federmeier, 2011). Such unexpected words do not necessarily need to be anomalous to produce an N400: predictions can also be violated with words that are perfectly coherent and non-anomalous. For example, if the final word delivered in the above sentence is "place" (i.e., "*The piano is out of place*") this word still violates a lexical prediction for the highly expected word "tune." As in the previous example, the N400 elicited by "place" would be larger than that elicited by "tune," as it is less expected. Moreover, in recent ERP research, violations of specific lexical predictions with other plausible words have also been observed to elicit a late anteriorly distributed positive component. This late frontal positivity has been observed at various time points after the N400, often peaking around 500–900 ms after the presentation of a critical item (Federmeier et al., 2007; Van Petten and Luka, 2012). Importantly, unlike the N400, the late frontal positivity is not produced by words that follow non-constraining contexts, when comprehenders have no strong prediction for a particular word (e.g., "place" following the context, "After a while, the boy saw the...").

Predictions in language are not always at the level of specific lexical items: they can also be generated at the level of semantic-syntactic statistical contingencies that determine the structure of an event ('who does what to whom') (Kuperberg, 2013). For example, at a certain point in a sentence we might expect a certain syntactic category of word, like a noun-phrase, with certain coarse conceptual features, such as animacy. In the sentence *"Mary went outside to talk to the \_\_\_\_"*, for instance, there is no strong indication of which word will come next, but it is clear that it must be an animate noun-phrase (Mary would likely not talk to an inanimate object like a truck). Violations of these semantic-syntactic structural predictions have been observed to elicit a different neural response from the anterior positivity discussed above, namely the P600 (a late posterior positivity, peaking from around 600 ms after onset of the violating word; see Kuperberg, 2007 for a review). This provides evidence that distinct neural signatures may be associated with violations of strong predictions at different representational levels (e.g., a late anterior positivity evoked by violations of strong lexical predictions, Federmeier et al., 2007; a late posterior positivity evoked by violations of strong semantic-syntactic predictions, Kuperberg, 2007; see Kuperberg, 2013 for discussion). The functional significance of these late positivities (both frontal and posterior) evoked by strong prediction violations remains unclear. One possibility, however, is that they reflect the neural consequences of suppressing the predicted (but not presented) information and adapting one's internal representation of context in order to generate more accurate predictions in the future (e.g., see Kuperberg, 2013; Kuperberg and Jaeger, in press, for discussion).

Turning to music, expectation has long been a major theme of music cognition research. Meyer (1956) first suggested a strong connection between the thwarting of musical expectations and the arousal of emotion in listeners. In recent years, theories of musical expectation have been brought into a modern cognitive science framework (e.g., Margulis, 2005; Huron, 2006; Huron and Margulis, 2010; Pearce et al., 2010), and expectation has been studied empirically with both behavioral and neural methods (e.g., Steinbeis et al., 2006). It is increasingly recognized that multiple sub-processes are involved in musical expectation (see Huron, 2006, for one theoretical treatment). Empirical research has shown that predictions are generated for multiple aspects of music, such as harmony, rhythm, timbre, and meter (Rohrmeier and Koelsch, 2012). Such expectations are thought to be automatically generated by enculturated listeners (Koelsch et al., 2000; Koelsch, 2012a).

Here, we focus on melodic prediction, and specifically on expectations for upcoming notes in monophonic (single-voice) melodies based on implicit knowledge of the melodic and harmonic structures of Western tonal music (Tillmann et al., 2000). For those interested in relations between predictive mechanisms in music and language, melodic expectancy provides an interesting analog to linguistic expectancy in sentence processing. Like sentences, monophonic melodies consist of a single series of events created by combining perceptually discrete elements in principled ways to create hierarchically structured sequences (Jackendoff and Lerdahl, 2006). Sentences and melodies have regularities at multiple levels, including local relations between neighboring elements and larger-scale patterns, e.g., due to underlying linguistic-grammatical or tonal structure.

In order to study relations between the cognitive mechanisms of prediction in sentences and melodies, it is necessary to measure prediction in these two types of sequences in comparable ways. In sentence processing, lexical expectancy has typically been measured using the cloze probability task, in which participants are asked to complete a sentence fragment with the first word that comes to mind (Taylor, 1953). For a given context, the percentage of participants providing a given continuation is taken as the "cloze probability" of that response. The cloze probability of an item is therefore a straightforward measure of how expected or probable it is. In addition to measuring the cloze probability of a particular word in relation to its context, it is also possible to use the cloze task to measure the 'lexical constraint' of a particular context by calculating the proportion of participants who produce a given word (see Federmeier et al., 2007). For example, a sentence such as *"The day was breezy so the boy went outside to fly a* ...*"* would likely elicit the highly expected continuation *"kite"* from most participants, and thus be a 'strongly lexically constraining' context. In contrast, a sentence such as *"Carol always wished that she'd had a* ...*"* would elicit a more varied set of responses, and thus be a 'weakly lexically constraining' context.

While expectancy in music has been measured in various ways over the years, to date there has been nothing comparable to the standard cloze probability method in language, i.e., a production-based task in which a person is presented with the beginning of a short coherent sequence and then asked to produce the event she thinks comes next.<sup>1</sup> Most behavioral studies of expectancy in music have used perceptual paradigms, such as harmonic priming paradigms or ratings of how well a tone continues an initial melodic fragment. Harmonic priming paradigms consist of a prime context followed by a target event, in which the degree of tonal relatedness between the two is manipulated. Typically, harmonically related targets are processed faster and more accurately than unrelated targets (Tillmann et al., 2014). These studies have shown that chords that are more harmonically related to the preceding context are easier to process, while there is a cost of processing chords that are less related or unrelated to the context (Tillmann et al., 2003). Another genre of priming studies has shown that timbre identification is improved when a pitch is close in frequency to the preceding pitch and harmonically congruent with the preceding context (Margulis and Levine, 2006). In studies using explicit ratings of expectancy, listeners are asked to rate how well a target note continues a melodic opening, e.g., on a scale of 1 (very bad continuation) to 7 (very good continuation) (e.g., Schellenberg, 1996). More recently, a betting paradigm has been used in which participants place bets on a set of possible continuations for a musical passage, and bets can be distributed across multiple possible outcomes (Huron, 2006). The betting paradigm has the advantage of providing a measure of the *strength* of an expectation for a specific item. However, like the "continuation rating" task, this task requires *post hoc* judgments, and is therefore not an online measure of participants' real-time expectations. ERPs and measures of neural oscillatory activity can provide online measures of expectation in musical sequences (e.g., Pearce et al., 2010; Fujioka et al., 2012), but such studies have focused on perception, not production.

A handful of studies have used production tasks to measure musical expectancy, but they differ in important ways from the standard linguistic cloze probability task. Some studies have used extremely short contexts, in which participants are asked to sing a continuation after hearing only a single two-note interval, or even a single note (Carlsen, 1981; Unyk and Carlsen, 1987; Povel, 1996; Thompson et al., 1997; Schellenberg et al., 2002). Lake (1987) presented two-note intervals after establishing a tonal context consisting of major chords and a musical scale. However, no prior singing-based study of melodic expectation has used coherent melodies as the context (some studies using piano performance have used very long contexts, in which pianists have been asked to improvise extended continuations for entire piano passages, Schmuckler, 1989, 1990). Also, in all of these studies (and unlike in the linguistic cloze probability task), participants were asked to produce continuations of whatever length they chose in response to brief stimuli. The closest analog to a musical cloze task comes from a study of implicit memory for melody, in which listeners first heard a set of novel tonal melodies and then heard melodic stems of several notes and were asked to "sing the note that they thought would come next musically" (Warker and Halpern, 2005). However, the structure of the melodic stems was not manipulated, and the focus of the study was on implicit memory, not on expectation.

<sup>1</sup>Waters et al. (1998) used what they refer to as a "musical 'cloze' task," but theirs was a multiple-choice task where participants selected one of several pre-composed sections of musical notation.

In order to advance the comparative study of prediction in language and music, it is necessary to develop comparable methods for studying prediction in the two domains. To this end, we have developed a melodic cloze probability task. In this task, participants are played short melodic openings drawn from novel coherent tonal melodies, and are asked to sing a single-note continuation. In an attempt to manipulate the predictive constraint of the melodies, the underlying harmonic structure of each opening (henceforth, 'melodic stem') was designed to either lead to a strong expectancy for a particular note, or not (see Materials and Methods for details). For each melodic stem, the cloze probability of a given note is calculated as the percentage of participants producing that note. The predictive constraint of a melodic stem is determined by examining the degree of agreement between participants' responses. For example, if all participants sing the same note after a particular stem, the stem has 100% constraint. On the other hand, if the most commonly sung note is produced by 40% of the participants, then the stem has 40% constraint.
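The two quantities defined above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function names and the ten responses are hypothetical, and responses are assumed to be already collapsed to pitch-class labels.

```python
from collections import Counter

def cloze_probabilities(responses):
    """Map each sung note to the proportion of participants who produced it."""
    counts = Counter(responses)
    total = len(responses)
    return {note: n / total for note, n in counts.items()}

def constraint(responses):
    """Predictive constraint of a stem: cloze probability of its most common response."""
    return max(cloze_probabilities(responses).values())

# Hypothetical responses from 10 participants to a single stem (illustrative only).
responses = ["D", "D", "D", "D", "C#", "A", "D", "D", "F#", "D"]
probs = cloze_probabilities(responses)
print(probs["D"])             # 0.7
print(constraint(responses))  # 0.7
```

Defining constraint as the maximum cloze probability means that ties between two equally common notes do not change the value, only which note is reported as the modal response.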

The melodic cloze probability method allows the cloze probabilities of notes to be quantitatively measured, and thus provides a novel way to study how different structural factors (e.g., local melodic interval patterns vs. larger-scale harmonic structure) interact in shaping melodic expectation. As demonstrated below, the method can also be used to test quantitative models of melodic expectation, such as Narmour's (1990) "Implication-Realization" model, using naturalistic musical materials. In the future, the method can facilitate the design of studies comparing predictive mechanisms in language and music, e.g., by systematically manipulating constraint and cloze probabilities across linguistic and musical stimuli in behavioral or ERP studies of expectancy (cf. Tillmann and Bigand, 2015).

# MATERIALS AND METHODS

### Participants

Fifty participants (29 female, 21 male, age range 18–25 years, mean age 20.3 years) took part in the experiment and were included in the data analysis (eight further participants were excluded due to difficulties with singing on pitch; see "Data Analysis"). All participants were self-identified musicians with no hearing impairment who had a minimum of 5 years of musical experience within the past 10 years (playing an instrument, singing, or musical training); 22 (44%) reported "voice" as one of their instruments. Participants had received a mean of 9.0 years of formal musical training on Western musical instruments (*SD* = 4.8) and reported no significant exposure to non-Western music. Participants were compensated for their participation and provided informed consent in accordance with the procedures of the Institutional Review Board of Tufts University.

# Materials

The stimuli consisted of 45 pairs of short novel tonal melodies created by the second author (JCR), a professional composer. Stimuli were truncated in the middle, creating "melodic stems." The melodies ranged across all 12 major keys and employed a variety of meters (3/4, 4/4, and 6/8 time signatures). Each stem was 5–9 notes long (*M* = 8.38 notes, *SD* = 0.83), and was played at a tempo of 120 beats per minute (bpm). Note durations varied from eighth notes (250 ms) to half notes (1000 ms). Stems contained no rests, articulation indications, dynamic variability, or non-diatonic pitches. All stimuli were created using Finale software with sampled grand piano sounds. Across all melodies, the highest and lowest pitch were A5 (880.0 Hz) and D3 (146.8 Hz), respectively, and the mean pitch was near E4 (329.6 Hz). On average, stems had a pitch range of 11.4 semitones (distance between the highest and lowest pitch in the stem, *SD* = 3.2 st). Male participants heard the melodic stems transposed down one octave. The average stem duration was 5.02 s (*SD* = 1.23).

Each stimulus pair consisted of two stems in the same musical key: one was an "authentic cadence" version, which was designed to create a strong expectation for a particular note, and the other was a "non-cadence" (NC) version, which was designed to *not* generate a strong expectation for a particular note. AC stems ended preceding a strong beat within the meter on the 2nd, 5th, or 7th scale degree and with an implied AC that would typically be expected to resolve to a tonic function. NC stems ended with an implied IV, iv, or ii harmony, with the last presented note never on the 2nd or 7th scale degree and rarely on the 5th. The two stems in each pair were identical in length, rhythm, and melodic contour; they differed only in the pitch of some of their notes, which influenced their underlying harmonic structure (see **Figure 1** for an example). On average, the two stems of an AC-NC melodic pair differed in 48.3% of their notes (*SD* = 28.5%). When notes of an AC-NC pair differed, they remained close in overall pitch height, on average 1.90 semitones apart (*SD* = 0.38).

The extent to which the two groups of stems projected a sense of key was compared using the Krumhansl-Schmuckler key-finding algorithm (Krumhansl, 1990). This model is based on "key-profiles" of each potential key, which represent the stability of each pitch in the key, i.e., how well it fits in a tonal context (Krumhansl and Kessler, 1982). The pitch distribution of a given melody, weighted by duration, is compared to the key-profile of each key, and a correlation value is calculated. When correlations with the profiles of each potential key were calculated for each stem, the mean correlation with the correct key for AC stems [*r*(22) = 0.70] did not differ significantly from the mean correlation with the correct key of NC stems [*r*(22) = 0.73], *t*(44) = 1.24, *p* = 0.22 (averaging and statistics were performed on Fisher transformed correlation coefficients). The two groups of stems therefore did not differ in the degree to which they projected a sense of key.
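The key-finding comparison described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: the profile values are the commonly cited Krumhansl and Kessler (1982) probe-tone ratings, melodies are assumed to be lists of (MIDI pitch, duration) pairs, and the example melody and all function names are hypothetical.

```python
import math

# Commonly cited Krumhansl-Kessler key profiles, indexed in semitones above the tonic.
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def duration_profile(notes):
    """Total duration per pitch class for a list of (midi_pitch, duration) pairs."""
    profile = [0.0] * 12
    for pitch, duration in notes:
        profile[pitch % 12] += duration
    return profile

def key_correlations(notes):
    """Correlate the melody's duration-weighted profile with all 24 key profiles."""
    profile = duration_profile(notes)
    result = {}
    for tonic in range(12):
        for mode, key_profile in (("major", MAJOR), ("minor", MINOR)):
            rotated = [key_profile[(pc - tonic) % 12] for pc in range(12)]
            result[(NAMES[tonic], mode)] = pearson(profile, rotated)
    return result

def mean_r_fisher(rs):
    """Average correlations via the Fisher z transform, then transform back."""
    return math.tanh(sum(math.atanh(r) for r in rs) / len(rs))

# Hypothetical stem (MIDI pitch, duration in beats); not one of the study's stimuli.
stem = [(60, 1.0), (62, 0.5), (64, 0.5), (65, 0.5), (67, 1.0), (64, 0.5), (62, 0.5), (60, 2.0)]
correlations = key_correlations(stem)
best_key = max(correlations, key=correlations.get)
print(best_key, round(correlations[best_key], 2))
```

For a stem composed in a known key, the value of interest is `correlations[(tonic, mode)]` for that key; those per-stem values can then be averaged across a condition with `mean_r_fisher`, mirroring the averaging on Fisher transformed coefficients mentioned above.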

# Procedure

Stimuli were played to participants over Logitech Z200 computer speakers at a comfortable listening volume within a sound attenuated room. The experiment was presented using PsychoPy (v1.79.01) on a MacBook Pro laptop, and sung responses were recorded as .wav files using the computer's built-in microphone.

Each participant was instructed that s/he would hear the beginnings of some unfamiliar melodies and would need to "sing the note you think comes next." Participants were asked to *continue* the melody (not necessarily complete it) on the syllable "la." Each trial began when the participant pressed a button to hear a melodic stem. Immediately after the end of the last note of each stem, the word "Sing" appeared on the screen and participants were given 5 s to sing the continuation, after which they rated their confidence in their response on a 7-point Likert scale (1 = *low*, 7 = *high*).

Each participant was presented with 24 AC and 24 NC melodic stems (only one version from each AC-NC pair) in one of eight randomized presentation orders. (Three pairs were removed from analysis due to differences in the melodic contours of the two stems, hence data from 45 pairs was analyzed.) At the beginning of the experiment, each participant completed a pitch-matching task in which they heard and were asked to sing back a series of individual tones (F4, A4, B3, G#4, A#3, D4, C#4, and E♭4 [corresponding to 349.2, 440.0, 246.9, 415.3, 233.1, 293.7, 277.2, 311.1 Hz, respectively]; one octave lower for male participants). This was used to evaluate participants' singing accuracy. Before the experimental trials began, participants were familiarized with the experimental procedure with a block of practice items, which ranged from simple scales and familiar melodies to unfamiliar melodies.

### Data Analysis

We extracted the mean fundamental frequency of the sung note using Praat (Boersma, 2002). The pitch of the sung note was determined by rounding the measured mean fundamental frequency to the closest semitone in the Western chromatic scale (e.g., A4 = 440 Hz), with the deviation from the frequency of this chromatic scale tone recorded (in cents, i.e., in hundredths of a semitone). The sung response was also represented in terms of its scale degree within the key of the stem in question. Responses were generalized across octaves for the purpose of this study. Participants' responses to the pitch-matching portion of the experiment were also analyzed; if any participant's pitch-matching responses did not round to the same note that was presented, or if their responses to at least 25% of the experimental trials were more than 40 cents away from the nearest semitone, the participant's responses were excluded from further analysis (eight participants were omitted for these reasons). Additionally, reaction times were measured using a sound onset measurement script in Praat (a sound's onset was detected when the sound reached a level −25 dB below its maximum intensity for a minimum of 50 ms) to determine how quickly the continuation was sung after the offset of the last note of the stem.
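The rounding and exclusion steps described above can be sketched as follows. This is a minimal illustration assuming twelve-tone equal temperament with A4 = 440 Hz and MIDI note numbering; the function names are hypothetical, and the Praat extraction step itself is not reproduced.

```python
import math

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_semitone(f0_hz, a4_hz=440.0):
    """Return (midi_number, note_name, deviation_in_cents) for a mean F0 in Hz."""
    midi_float = 69 + 12 * math.log2(f0_hz / a4_hz)  # MIDI 69 is A4
    midi = round(midi_float)
    cents = 100 * (midi_float - midi)                # signed deviation from the rounded note
    name = NAMES[midi % 12] + str(midi // 12 - 1)
    return midi, name, cents

def semitones_above_tonic(midi, tonic_pitch_class):
    """Octave-generalized position of a note within a key (0 = tonic)."""
    return (midi - tonic_pitch_class) % 12

def exceeds_cents_criterion(cents_per_trial, max_cents=40, max_fraction=0.25):
    """True if at least 25% of trials deviate by more than 40 cents."""
    off_pitch = sum(abs(c) > max_cents for c in cents_per_trial)
    return off_pitch / len(cents_per_trial) >= max_fraction

midi, name, cents = nearest_semitone(450.0)
print(name, round(cents, 1))  # a slightly sharp A4
```

Taking the result modulo 12 in `semitones_above_tonic` corresponds to generalizing responses across octaves; mapping the chromatic offset to a diatonic scale degree would need an additional lookup that is omitted here.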

# RESULTS

Participants found the task intuitive and uncomplicated, suggesting that the melodic cloze probability task provides a naturalistic way to measure melodic expectations. On average, participants sang a continuation note with a reaction time of 899 ms (*SD* = 604 ms), and their sung notes were an average of 1896 ms long (*SD* = 808 ms). Given that the melodies had a tempo of 120 BPM, this corresponds to an average time interval of 1.80 beats after the offset of the stem, and a sung note duration of 3.79 beats.
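The conversion from milliseconds to beats follows directly from the tempo: at 120 BPM one beat lasts 60,000 / 120 = 500 ms. The short sketch below reproduces the two reported values; the function name is an assumption of this example.

```python
def ms_to_beats(duration_ms, tempo_bpm=120):
    """Convert a duration in milliseconds to beats at the given tempo."""
    beat_ms = 60_000 / tempo_bpm  # 500 ms per beat at 120 BPM
    return duration_ms / beat_ms

print(round(ms_to_beats(899), 2))   # mean reaction time in beats
print(round(ms_to_beats(1896), 2))  # mean sung-note duration in beats
```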

# Constraint

The primary dependent variable in our study was the predictive constraint of a melodic stem, as measured by the percentage of participants that sang the most common note after the stem. **Figure 2** illustrates how this was computed, based on the AC-NC melodic pair in **Figure 1**. **Figures 2A,B** show the distributions of sung notes after the AC and NC stems in **Figure 1**, respectively. **Figure 2A** shows that 92% of participants that heard the AC stem produced the most commonly sung note (the tonic, D), while **Figure 2B** shows that no more than 24% of participants that heard the NC stem produced any one note (in this case, there was a tie between C# and A, but in most cases, one pitch class was most common). Thus the constraint of this melodic pair was 92% (or 0.92) for the AC melody and 24% (or 0.24) for the NC melody. For this pair, the AC melody was indeed far more constraining than the NC melody, as predicted.
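The constraint measure can be computed from a list of sung pitch classes, as in the sketch below. The response lists are made-up counts chosen only to reproduce the 92% and 24% figures quoted above (they assume 25 listeners per stem, which is not stated in the text), and the function name is hypothetical.

```python
from collections import Counter

def constraint(sung_pitch_classes):
    """Return (proportion singing the most common note, list of modal notes)."""
    counts = Counter(sung_pitch_classes)
    top = max(counts.values())
    # More than one modal note means a tie, as in the NC example.
    modal = sorted(pc for pc, n in counts.items() if n == top)
    return top / len(sung_pitch_classes), modal

# Hypothetical responses, not the study's raw data.
ac_responses = ["D"] * 23 + ["A", "F#"]
nc_responses = ["C#"] * 6 + ["A"] * 6 + ["B"] * 5 + ["E"] * 4 + ["G"] * 4

print(constraint(ac_responses))
print(constraint(nc_responses))
```

Returning the modal notes alongside the proportion makes ties visible, which matters because a tie leaves the constraint value unchanged but means there is no single most expected continuation.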

For each AC and NC stem, we computed the constraint as described above. After AC stems, the average constraint was 69% (i.e., on average, 69% of participants sang the same note after hearing an AC stem), while after NC stems, the average constraint was 42% (i.e., on average, only 42% of participants sang the same note after hearing an NC stem). Thus on average, melodic stems in the AC condition did prove to be more constraining than NC stems (AC *M* = 0.692, *SD* = 0.171; NC *M* = 0.415, *SD* = 0.153), [*t*(44) = 7.79, *p* < 0.001]. This pattern of higher constraint for the AC vs. NC stem was observed in 38 of the 45 item pairs (**Figure 3**).
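The degrees of freedom reported above (44, for 45 item pairs) are consistent with a paired comparison across items. A paired *t* statistic of that kind can be sketched with the standard library alone. This assumes a paired design, and the constraint values below are toy numbers, not the study's data.

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom for matched lists."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev uses n - 1
    return mean(diffs) / standard_error, n - 1

# Toy constraint values for four hypothetical AC-NC pairs.
ac = [0.9, 0.7, 0.6, 0.8]
nc = [0.5, 0.4, 0.5, 0.3]
t_value, df = paired_t(ac, nc)
print(f"t({df}) = {t_value:.2f}")
```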

On average, participants responded significantly more quickly after AC stems (mean *RT* = 767 ms, *SD* = 265 ms) than after NC stems (mean *RT* = 1033 ms, *SD* = 302 ms), *t*(49) = 9.78, *p* < 0.001. Additionally, on average participants were significantly more confident in their responses to AC stems (*M* = 5.14, *SD* = 0.95) than to NC stems (*M* = 4.36, *SD* = 1.04), *t*(49) = 9.60, *p* < 0.001.

# Scale Degree

When responses were represented in terms of their scale degree in the key of the stem in question, and compiled across all items in each condition, the distributions for AC and NC items were strikingly different. For six of the seven diatonic scale degrees, the frequency of response differed significantly between AC and NC items based on *t*-tests of each scale degree with a Bonferroni correction applied (see **Figure 4** for *p*-values). For AC items, responses were heavily weighted around the first note of the scale, or tonic (known as 'do' in solfege). For NC items, responses were more widely distributed; however, they were mainly restricted to in-key diatonic scale degrees.
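Mapping a sung pitch class to a scale degree amounts to measuring its distance in semitones above the tonic. The sketch below assumes major keys and seven tests (one per diatonic degree) for the Bonferroni threshold; both are assumptions of this example rather than details confirmed in the text.

```python
# Semitones above the tonic -> diatonic degree in a major key (assumed mode).
MAJOR_DEGREES = {0: 1, 2: 2, 4: 3, 5: 4, 7: 5, 9: 6, 11: 7}

def scale_degree(pitch_class, tonic_pitch_class):
    """Return the diatonic scale degree (1-7), or None for out-of-key notes."""
    offset = (pitch_class - tonic_pitch_class) % 12
    return MAJOR_DEGREES.get(offset)

def bonferroni_alpha(alpha=0.05, n_tests=7):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests

# In D major (tonic pitch class 2), C# (1) is degree 7 and A (9) is degree 5.
print(scale_degree(1, 2), scale_degree(9, 2), scale_degree(3, 2))
```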

# Variability

While AC stems were on average significantly more constraining than their matched NC stems, there was considerable variability across AC-NC pairs in the degree of difference in constraint between members of a pair (see **Figure 3**). Thirty-eight out of 45 pairs demonstrated the expected pattern, with the AC stem proving more constraining than the NC stem. For instance, the stem pair in **Figure 5A** has a highly constraining AC stem, with 92% of participants singing the same note, the melody's tonic pitch, C (in **Figures 5** and **6**, the most commonly sung note is shown as a red note head after the end of each stem). Why might this be? This stem is short, contains only one rhythmic value, and has very clear harmonic implications, beginning with an unambiguously arpeggiated tonic triad (C-E-G) and concluding with a similarly outlined complete dominant triad (G-B-D). This stem also ends on the leading tone of B, i.e., the seventh scale degree of the diatonic major scale, which customarily resolves to the tonic scale degree, particularly near the end of a phrase. Further structural factors that may contribute to the high degree of agreement on the final pitch are (1) the melody's consistent downward contour, which seems to close in on middle C, and (2) the fact that the tonic note is heard very close to the end of the phrase, which may make it more likely to be replicated. Turning to the NC stem in **Figure 5A**, it is similar in many respects to the AC stem, yet very different in constraint, with the most commonly sung note (F) being produced by just 24% of participants who heard this stem. What might account for this? The NC stem does not have any resolution-demanding dominant pitches at its conclusion, and as a result lacks a clear sense of harmonic direction. Instead, the melody follows a downward pattern of melodic thirds (E–G, C–E, A–C) whose continuation is ambiguous. The most commonly chosen completion of F could be explained as the next logical pitch in the chain of descending thirds, after A–C. Thus, when faced with a stem where harmonic direction is underdetermined, subjects may have recruited an alternative strategy of melodic pattern continuation.

Another example of an AC stem that proved to be highly constraining is shown in **Figure 5B** (same melodic pair as in **Figure 1**). As with the melody in **Figure 5A**, the AC stem begins on the tonic note and returns to it as the most expected continuation, with an overall melodic range that emphasizes the octave generated above the first scale degree. The melody's interior arpeggiates two chords, first the tonic (D-F#-A) in measure 1, then the subdominant (G-B-D) in measure 2. The subdominant chord frequently serves a syntactic role of "predominant," a harmonic function that signals the initiation of a cadence. This is indeed how measure three is structured, with a heavily implied dominant harmony via scale degrees 2 and 7, and a melodic contour that ensures D as a plausible completion due to an implied F#-E-D melodic descent and an unresolved leading tone of C#. The less constraining NC stem in **Figure 5B**, by contrast, ends on the sixth scale degree (the submediant). Unlike the leading tone, this note lacks a strong tendency to resolve in a particular way. It may plausibly serve as part of a stepwise motion to or away from the dominant, or as part of an arpeggiation of a predominant harmony; in either case, it negates the cadential function of the third measure and points to no obvious melodic completion.

Contrasting with these stems, where subjects' responses to stems adhered to the AC/NC designations, there were several items where the constraint of the NC stem unexpectedly *exceeded* that of the AC stem. For example, after the NC stem in **Figure 6A**, 80% of participants sang the same note (F#, the 5th scale degree). In this particular melody, we believe this reflects the tendency for a large melodic interval to be followed by stepwise motion in the opposite direction. This "gap-fill" pattern (Meyer, 1956; Narmour, 1990) likely strongly influenced the continuation most participants chose, which involved singing a note (F#) one step down from the last note of the stem (G#), following a large leap of a sixth to an already contextually unstable note (scale degree six). Additionally, this stem has a strongly implied compound melody, wherein most of the topmost notes form a rising, stepwise pattern of B-C#-D#-E, which leads to an F# if this pattern is continued. Meanwhile, the unexpectedly low constraint of the AC stem in **Figure 6A** was perhaps due to the lack of a strong tendency note (like the leading tone) as its last pitch, and the obscuring of the underlying harmonic implications by the relative rhythmic complexity of the melody. That is, the unpredictable and syncopated rhythm may have reduced the strength of the expectancy for the tonic scale degree (Schmuckler and Boltz, 1994). Similarly, in the stem pair in **Figure 6B**, the most common continuation for the NC stem was a gap-filling motion to fill the exceptionally wide upward leap of an octave from Bb4–Bb5. Landing on Ab, which 56% of subjects agreed on, helps close that gap with a downward step and continues the melody on the more stable pitch of scale degree 5. This note also has the advantage of mirroring the first note of the melody, thus promoting melodic symmetry. The AC stem of this melodic pair presented no such clearly determined ending. 
If subjects opted to fill in the large upward octave gap to Ab with a downward step, they would land on the unstable fourth scale degree (Gb). On the other hand, if they were to resolve the melody with a cadence on the tonic note (Db), they would land far from the final note of the stem, going against a general tendency in melodic expectation for pitches that are proximate in frequency to the previous note (see section on modeling below).

Based on the above observations, it is clear that underlying harmonic structure, which was manipulated in the AC vs. NC stems, does not alone determine melodic expectation. Melodic factors that likely contributed to increased constraint in our melodies include (but are not limited to) rhythmic simplicity, gap-fill pattern, compound-line implication, leading-tone resolution, and pattern completion. In this way, stems in which linear, contrapuntal, rhythmic and harmonic parameters were closely coordinated produced reliable agreement on melodic completions, while examples with a conflict or ambiguity between those factors were prone to considerably less consensus.

### Musical Experience

Prior research suggests that musical training enhances sensitivity to underlying harmonic structure (Koelsch et al., 2002). Since implicit harmony was used to guide the listeners' expectation for a tonic note after AC stems, we sought to determine if participants with greater degrees of musical training were more likely to sing the tonic after AC stems. Thus across AC stems, we correlated each participant's total years of formal musical training with their frequency of responding with the tonic. (Thus for example, if a participant sang the tonic after half of the AC stems they heard, their frequency of responding with the tonic to an AC stem would be 0.5.) When all AC items were included in the analysis, there was no significant correlation with years of formal musical training, *r*(48) = 0.035, *p* = 0.812. However, when we divided AC stems according to the scale degree of their final note, an interesting pattern emerged. On average, after AC stems that ended on the 7th scale degree, participants sang the tonic 81% of the time, and in these melodies, there was a significant correlation between participants' years of formal training and their frequency of responding with the tonic, *r*(48) = 0.45, *p* = 0.001 (see **Figure 7**). This relationship with musical training was also observed with AC stems that ended on the 5th scale degree, where participants sang the tonic 55% of the time on average, *r*(48) = 0.33, *p* = 0.02. (The relationship was not seen for AC stems that ended on the 2nd scale degree, where participants sang the tonic 57% of the time on average.)
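The two quantities in this analysis, each participant's frequency of singing the tonic and its correlation with years of training, can be sketched as follows. The function names are hypothetical and the example inputs are toy values, not data from the study.

```python
import math

def tonic_frequency(sung_scale_degrees):
    """Proportion of a participant's AC responses that were the tonic (degree 1)."""
    hits = sum(1 for d in sung_scale_degrees if d == 1)
    return hits / len(sung_scale_degrees)

def pearson_r(x, y):
    """Pearson correlation and its degrees of freedom (n - 2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy), n - 2

# A participant who sang the tonic after half of the AC stems scores 0.5.
print(tonic_frequency([1, 1, 5, 3]))
```

With 50 participants, `pearson_r` would return 48 degrees of freedom, which matches the *r*(48) values reported above.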

### Model Comparison

One potential use of the melodic cloze probability task is to test models of melodic expectation. While different forms of musical expectancy (e.g., melodic, rhythmic, harmonic) have been the subject of many important theoretical and empirical investigations (e.g., Schmuckler, 1989; Narmour, 1990; Schellenberg, 1996; Krumhansl et al., 1999; Large and Jones, 1999; Huron, 2006), melodic expectancy in particular has been a focus for quantitative modeling (e.g., Schellenberg, 1996; Krumhansl et al., 1999; Pearce et al., 2010). While comparison of behavioral and modeling data is not the primary focus of this paper, we present one such comparison to illustrate how melodic cloze data can be used for this purpose. We focus on the simplified version of the implication-realization (I-R) model of melodic expectancy (Narmour, 1990) developed by Schellenberg (1997).

This model computes the probability of each possible continuation of a melody based on two factors. The first of these factors is "pitch proximity," which states that listeners expect the next tone of a melody to be proximate in pitch to the last tone heard. (Another way of stating this is that listeners generally expect melodies to move by small steps.) The second factor is "pitch reversal," which states that after a leap, listeners expect the next tone to reverse direction (e.g., after an upward leap, they expect a downward pitch interval), and also expect the upcoming tone to land in a pitch region proximate to the penultimate tone (the first tone of the leap). A third factor relating to tonal stability was also included, based on values from the probe-tone profiles of Krumhansl and Kessler (1982). This factor reflects expectation for notes that fit well into the existing key context, with higher values for structurally more important/stable notes in key. Based on the equations for the simplified I-R model as codified in Schellenberg (1997) these three factors (proximity, reversal, and tonality) were weighted evenly by equalizing their maximum values, and were used to compute expectancies for all notes within two octaves of the final note of each melodic stem, using the MIDI toolbox (Eerola and Toiviainen, 2004).
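To make the three factors concrete, the sketch below scores candidate continuations with a proximity term, a reversal term, and a tonality term, each scaled to a maximum of 1. It is a toy illustration only: the functional forms, the leap threshold, and the bonus for returning near the penultimate tone are assumptions of this sketch, and they are not the published equations of Schellenberg (1997) or the MIDI toolbox implementation. Only the tonality values are taken from the Krumhansl and Kessler (1982) major-key profile. Pitches are MIDI-style note numbers, and a major key is assumed.

```python
# Krumhansl and Kessler (1982) major-key probe-tone ratings,
# indexed by semitones above the tonic.
KK_MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
            2.52, 5.19, 2.39, 3.66, 2.29, 2.88]

def proximity(last, cand, span=24):
    """Assumed form: 1 for a repeated note, falling linearly to 0 at `span`."""
    return 1 - abs(cand - last) / span

def reversal(prev, last, cand, leap=6):
    """Assumed form: reward a change of direction after a leap."""
    implied, realized = last - prev, cand - last
    if abs(implied) <= leap or realized == 0:
        return 0.0                      # no leap, or no motion to evaluate
    if (implied > 0) == (realized > 0):
        return 0.0                      # continues in the same direction
    score = 1.0
    if abs(cand - prev) <= 2:           # lands near the first tone of the leap
        score += 0.5
    return score / 1.5                  # scale so the maximum is 1

def tonality(cand, tonic):
    """Key-profile rating of the candidate, scaled so the tonic scores 1."""
    return KK_MAJOR[(cand - tonic) % 12] / max(KK_MAJOR)

def predict_next(prev, last, tonic, span=24):
    """Candidate within `span` semitones of the last note with the highest sum."""
    candidates = range(last - span, last + span + 1)
    return max(candidates, key=lambda c: proximity(last, c, span)
               + reversal(prev, last, c) + tonality(c, tonic))

# After an upward octave leap C4-C5 in C major, this toy scheme favors C4.
print(predict_next(prev=60, last=72, tonic=60))
```

Because each term peaks at 1, no single factor dominates by construction, which loosely mirrors the equal weighting described above.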

In order to compare the model's predictions to cases where humans had strong expectations, we focused on high-constraint stems where most participants sang the same continuation (stems with constraint >69%, the mean of all AC stems). In the 22 stems satisfying this criterion, the simplified I-R model (including the tonality factor) correctly predicted the note most often sung by participants in 12 stems, i.e., 54.5% of the time. In the remaining 10 of these high-constraint stems (i.e., 45.5% of the time), the model's predictions were an average of 4.9 semitones away from participants' sung note (*SD* = 0.32 st) (see **Figure 8** for the distribution of distances between human data and model predictions). For the 10 stems where the model's predictions differed from the most commonly sung note, we checked if the note predicted by the model was the *second*-most-commonly produced note by participants. This was true in only one stem. Overall, the model's performance suggests that our data cannot be accounted for solely by local factors of proximity and reversal, combined with tonality. This suggests that larger-scale factors need to be taken into account, as further discussed below.
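The two summary figures in this comparison, the proportion of stems where the model matched the most commonly sung note and the mean distance on the remaining stems, can be computed as in the sketch below. The note lists are placeholders that reproduce only the 12-of-22 hit count; they are not the study's data, and the function name is hypothetical.

```python
def evaluate(model_notes, sung_notes):
    """Return (hit rate, mean absolute semitone distance on misses)."""
    pairs = list(zip(model_notes, sung_notes))
    misses = [abs(m - s) for m, s in pairs if m != s]
    hit_rate = (len(pairs) - len(misses)) / len(pairs)
    mean_miss = sum(misses) / len(misses) if misses else 0.0
    return hit_rate, mean_miss

# Placeholder predictions: 12 matches and 10 misses across 22 stems.
model = [60] * 12 + [65] * 10
sung = [60] * 22
hit_rate, mean_miss = evaluate(model, sung)
print(f"{hit_rate:.1%} correct, {mean_miss:.1f} st off on misses")
```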

### DISCUSSION

We introduce the melodic cloze probability task, in which participants hear the opening of a short, novel tonal melody and sing the note they expect to come next. This task, which is modeled on the well-known cloze probability task in psycholinguistics, has not previously been used to study expectancy in the field of music cognition. Participants found the melodic cloze task easy to do, demonstrating that expectancy can be measured in a comparable way across linguistic and musical domains.

Prior work using singing to study melodic expectancy has focused on responses to two-note intervals (see introduction for references). Of these studies, the closest task to ours is Lake (1987), who had participants sing extended continuations in response to a two-note interval preceded by a tonal context. Unlike the current study, the tonal context was not the opening of a novel coherent melody, but a sequence of notes consisting of a major chord, a scale, and another major chord, which served to establish a strong sense of key before the two-note interval. One might ask how our results compare to those of Lake, since one can conceive of our stimuli as also consisting of a key-inducing context followed by a final two-tone interval (i.e., the final two tones of the melodic stem).

While the last two notes of our stems clearly contribute to our results, our findings cannot be attributed to only hearing this final interval in a generic tonal context. A number of our stems are identical in the scale degrees of their final two notes, yet they elicit very different patterns of results from participants (see **Figure 9** for an example). This different pattern of responding to the same final interval reflects differences in the *structure* of the preceding notes. Thus our paradigm and results are not simply a replication of Lake (1987), and show the relevance of using melodically coherent materials as contexts for production-based studies of melodic expectation. Similarly, we note that our results are not simply a replication of the well-known probe-tone results of Krumhansl and Kessler (1982), since the pattern of responding was not just a reflection of the tonal hierarchy, and depended on the structure of the heard melody (e.g., **Figures 1** and **2**).

In addition to being the first study to obtain cloze probabilities for musical notes, to our knowledge the current study is also the first to manipulate the predictive *constraint* of musical sequences as part of research on melodic expectation. By using pairs of monophonic melodic openings (or 'stems') matched in length, rhythm, and melodic contour, but differing in implied harmonic structure, we show that underlying harmonic progressions can strongly guide melodic expectations. Specifically, there was significantly more consistency in participants' responses to melodic stems ending on an implied authentic cadence (AC condition) than in their responses to stems ending non-cadentially (NC condition), as reflected by a higher percentage of participants singing the most common continuation for items in the AC condition. In other words, AC stems were more highly constraining than NC stems on average.

However, our data also clearly indicate that expectations based on larger-scale implied harmony interact with expectations based on melodic structure. That is, despite the fact that the harmonic differences between the AC and NC melodies in each pair were similar, we observed considerable variability in the constraint of melodies. In some pairs, the AC stem was considerably more constraining than the NC stem, but in other pairs the difference in constraint was mild, and in seven pairs the NC stem was actually equal to or more constraining than the AC stem (**Figure 3**). Analysis of two such 'reversed constraint' pairs (**Figure 6**) suggested that factors related to rhythmic simplicity, gap-fill pattern, compound line implication, and pattern completion may have been involved in overwhelming harmonic expectations. Further investigation of the factors driving the observed large variation in constraint among melodies is clearly warranted. From our results it is clear that expectancies related to melodic patterns (e.g., gap-fill) may sometimes trump those related to tonality.

Indeed, the variability in constraint observed in our data (**Figure 3**) suggests that the melodic cloze task is well suited for use in future studies aimed at exploring the relative contributions of melodic and harmonic patterns in shaping melodic expectation. Such studies can help test and improve quantitative models of melodic expectation (e.g., Schellenberg, 1996, 1997; Krumhansl et al., 1999; Eerola and Toiviainen, 2004; Margulis, 2005; Pearce, 2005; Pearce and Wiggins, 2006). In the current study, we compared human melodic expectations to predictions based on Schellenberg's (1997) simplified version of Narmour's (1990) Implication-Realization (I-R) model of melodic expectation, with an added tonality factor. For the 22 AC melodies with a high degree of measured constraint (i.e., where >69% of participants sang the same note), the model correctly predicted the sung pitch in 54.5% of these melodies. In the remaining 45.5% of these melodies, the model predicted a pitch that was on average 4.9 semitones from the pitch actually sung by participants. This discrepancy between human expectations and model predictions likely stems from the fact that the simplified I-R model focuses on just the last interval of a melody, and does not take larger-scale structural patterns into account (such as harmonic progressions and recurring motivic patterns). Successful models of melodic expectation will almost certainly need to operate at multiple timescales, reflecting the human tendency to integrate both local and global information in processing melodic sequence structure (Dowling, 2010). In the future, it will be interesting to use the melodic cloze method to test models which are sensitive to patterns at multiple timescales, including Margulis' (2005) model of melodic expectation, and Pearce's (2005) IDyOM model (cf. Pearce and Wiggins, 2006). Such models can be tested and improved by comparing their predictions with observed cloze probabilities from human participants.

The musical cloze probability task has further uses in the field of music cognition. For example, this paradigm can be used to investigate how different factors influence melodic expectancy. While we manipulated only the harmonic structure of melodies in the present experiment, the influence of any other factor (e.g., melodic contour, rhythm, dynamics, etc.) on musical expectations could be explored in subsequent studies by composing melodies in pairs and manipulating the one factor while keeping other factors constant. Additionally, the task could be varied to have participants sing multiple-note continuations, as has been done in previous studies (Carlsen, 1981; Lake, 1987; Unyk and Carlsen, 1987; Thompson et al., 1997; Schellenberg et al., 2002). This would allow responses to be examined on longer timescales than just the first sung note. In addition, it would reduce the possibility that participants are responding by *completing* the melodic sequences with the sung note, instead of *continuing* them (as instructed). This is an important issue, as the note sung after the stem may differ depending on whether listeners treat it as a continuation or a completion (Aarden, 2003, cf. Huron, 2006).

Of course, the melodic cloze paradigm does have its limitations. By focusing on what pitch a person sings, it cannot give independent measures of all the different types of expectations which may be at play at a given point in a melody, such as timbral expectations (if listening to complex textures) or rhythmic expectations. To study these sorts of expectations, modifications of the paradigm presented here would be necessary. For example, if studying rhythmic expectations, at the end of each stem one could ask participants to press a bar for as long as they think the next note will last.

The melodic cloze task can also be used to examine musical expectations in different populations. We observed a significant correlation between formal musical training and a tendency to sing the tonic after AC stems that ended on the 7th or 5th scale degrees. It has been suggested that having more musical experience leads to greater sensitivity to harmonic cues, which is consistent with our finding and with neural research on harmonic processing (Koelsch et al., 2002). Future studies could use the melodic cloze method to investigate how different kinds of musical experience might impact expectancy formation. For example, expectations may differ between musicians who have been educated in music theory vs. those who have experience singing or improvising without reading music. Additionally, the melodic cloze paradigm could be used in studies with children, to investigate how melodic expectations develop (cf. Corrigall and Trainor, 2014).

Obtaining melodic cloze probabilities is crucial for future research comparing predictive processing in music and language, as it allows for the comparison of the effects of violating predictions of comparable strength in the two domains (cf. Tillmann and Bigand, 2015). Previous studies comparing expectancy violations in music and language have typically chosen violations that are intuitively thought to be comparable in the two domains. By using a cloze paradigm to quantify cloze probabilities for possible continuations in both domains, it is possible to compare effects of violations of the same degree, using normed stimuli (cf. Featherstone et al., 2012). For example, this will allow comparison of brain responses to plausible violations of expectations, instead of to frank structural violations (which rarely occur in naturalistic sequences). Also, studies that probe interactions between simultaneously presented music and language expectancy violations can be more precisely calibrated, in order to further elucidate cognitive and neural relations between language and music processing.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Fogel, Rosenberg, Lehman, Kuperberg and Patel. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Rhythmic Effects of Syntax Processing in Music and Language

Harim Jung, Samuel Sontag, YeBin S. Park and Psyche Loui\*

*Music, Imaging, and Neural Dynamics Lab, Psychology and Neuroscience and Behavior, Wesleyan University, Middletown, CT, USA*

Music and language are human cognitive and neural functions that share many structural similarities. Past theories posit a sharing of neural resources between syntax processing in music and language (Patel, 2003), and a dynamic attention network that governs general temporal processing (Large and Jones, 1999). Both make predictions about music and language processing over time. Experiment 1 of this study investigates the relationship between rhythmic expectancy and musical and linguistic syntax in a reading time paradigm. Stimuli (adapted from Slevc et al., 2009) were sentences broken down into segments; each sentence segment was paired with a musical chord and presented at a fixed inter-onset interval. Linguistic syntax violations appeared in a garden-path design. During the critical region of the garden-path sentence, i.e., the particular segment in which the syntactic unexpectedness was processed, expectancy violations for language, music, and rhythm were each independently manipulated: musical expectation was manipulated by presenting out-of-key chords and rhythmic expectancy was manipulated by perturbing the fixed inter-onset interval such that the sentence segments and musical chords appeared either early or late. Reading times were recorded for each sentence segment and compared for linguistic, musical, and rhythmic expectancy. Results showed main effects of rhythmic expectancy and linguistic syntax expectancy on reading time. There was also an effect of rhythm on the interaction between musical and linguistic syntax: effects of violations in musical and linguistic syntax showed significant interaction only during rhythmically expected trials. 
To test the effects of our experimental design on rhythmic and linguistic expectancies, independently of musical syntax, Experiment 2 used the same experimental paradigm, but the musical factor was eliminated—linguistic stimuli were simply presented silently, and rhythmic expectancy was manipulated at the critical region. Experiment 2 replicated effects of rhythm and language, without an interaction. Together, results suggest that the interaction of music and language syntax processing depends on rhythmic expectancy, and support a merging of theories of music and language syntax processing with dynamic models of attentional entrainment.

Keywords: syntax, music, harmony, language, rhythm, expectancy

### Edited by:

*Edward W. Large, University of Connecticut, USA*

### Reviewed by:

*Theodor Rueber, Bonn University Hospital, Germany; Reyna L. Gordon, Vanderbilt University Medical Center, USA*

> \*Correspondence: *Psyche Loui ploui@wesleyan.edu*

### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

Received: *31 January 2015* Accepted: *03 November 2015* Published: *23 November 2015*

### Citation:

*Jung H, Sontag S, Park YS and Loui P (2015) Rhythmic Effects of Syntax Processing in Music and Language. Front. Psychol. 6:1762. doi: 10.3389/fpsyg.2015.01762*

# INTRODUCTION

Music and language are both universal human cognitive functions, but the degree to which they share cognitive resources is a long-standing debate in cognition. Theorists have argued for a shared evolutionary origin (Mithen, 2006), as well as extensive structural similarities between music and language (Lerdahl and Jackendoff, 1983; Botha, 2009), while others have argued for significant differences between music and language processing and domain specificity of the two domains (Peretz and Coltheart, 2003). Although syntax usually refers to the rules that govern how words and phrases are arranged in language, syntactic structure also exists in other domains, such as music. Musical syntax can be understood as the rules that define how pitches are organized to form melody and harmony. Western tonal harmony, like language, is organized in hierarchical structures that are built upon discrete and combined elements (Lerdahl and Jackendoff, 1983). Syntax in Western music can be realized in the structured organization of the 12 chromatic tones into diatonic scale degrees within tonal centers, which form chords within harmonic progressions. Both musical and linguistic structures unfold syntactically over time.

One theory that has influenced research in the structures of music and language is the Shared Syntactic Integration Resource Hypothesis (SSIRH), which postulates an "overlap in the neural areas and operations which provide the resources for syntactic integration" (Patel, 2003). The hypothesis reconciles contrasting findings between neuropsychology and neuroimaging studies on syntax processing, by suggesting that the same syntactic processing mechanisms act on both linguistic and musical syntax representations. The SSIRH predicts that the syntactic processing resources are limited, and thus studies with tasks combining musical and linguistic syntactic integration will show patterns of neural interference (Patel, 2003). While topics of ongoing debate concern the nature of the resources that are shared (Slevc and Okada, 2015) and the extent to which such resources are syntax-specific (Perruchet and Poulin-Charronnat, 2013), convergent studies do provide evidence for some shared processing of music and language, with evidence ranging from behavioral manipulations of syntactic expectancy violations in music and language (e.g., Fedorenko et al., 2009; Slevc et al., 2009; Hoch et al., 2011) to cognitive neuroscience methods such as ERP and EEG studies that track the neural processing of syntax and its violations (e.g., Koelsch et al., 2005; Steinbeis and Koelsch, 2008; Fitzroy and Sanders, 2012).

One piece of evidence in support of the shared processing of musical and linguistic syntax comes from a reading time study in which musical and linguistic syntax were manipulated simultaneously (Slevc et al., 2009). Reading time data for a self-paced reading paradigm showed interactive effects when linguistic and musical syntax were simultaneously violated, suggesting the use of the same neural resources for linguistic and musical syntax processing. In this self-paced reading paradigm, linguistic syntax was violated using garden path sentences, whereas musical syntax was violated using harmonically unexpected musical chord progressions.

As both musical and linguistic syntax unfold over time, the timing of both musical and linguistic events may affect such sharing of their processing resources. Rhythm, defined as the pattern of time intervals in a stimulus sequence, is usually perceived as the time between event onsets (Grahn, 2012a). As a pattern of durations that engenders expectancies, rhythm may represent its own form of syntax and thus be processed similarly to both musical and linguistic syntax in the brain (Fitch, 2013). It has also been suggested that rhythm is an implicitly processed feature of environmental events that affects attention and entrainment to events in various other domains such as music and language (Large and Jones, 1999). Specifically, the Dynamic Attending Theory (DAT) posits a mechanism by which internal neural oscillations, or attending rhythms, synchronize to external rhythms (Large and Jones, 1999). In this entrainment model, rhythmic processing is seen as a fluid process in which attention is involuntarily entrained, in a periodic manner, to a dynamically oscillating array of external rhythms, with attention peaking with stimuli that respect the regularity of a given oscillator (Large and Jones, 1999; Grahn, 2012a). This process of rhythmic entrainment has been suggested to occur via neural resonance, where neurons form a circuit that is periodically aligned with the stimuli, allowing for hierarchical organization of stimuli with multiple neural circuits resonating at different levels, or subdivisions, of the rhythm (Large and Snyder, 2009; Grahn, 2012a; Henry et al., 2015). One piece of evidence in support of the DAT comes from Jones et al. (2002), in which a comparative pitch judgment task was presented with interleaving tones that were separated temporally by regular inter-onset intervals (IOIs) that set up a rhythmic expectancy. 
Pitch judgments were found to be more accurate when the tone to be judged was separated rhythmically from the interleaving tones by a predictable IOI, compared to an early or late tone that was separated by a shorter or longer IOI, respectively. The temporal expectancy effects from this experiment provide support for rhythmic entrainment of attention within a stimulus sequence.
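The idea that attention peaks at expected time points and falls off for early or late events can be illustrated with a toy calculation. The sketch below is not the oscillator model of Large and Jones (1999); it simply assumes a periodic, bell-shaped "attentional pulse" whose period matches the stimulus IOI. The function name, the `focus` parameter, and the 600 ms period and 100 ms shift are hypothetical values chosen only for illustration.

```python
import math

def attentional_pulse(onset_ms, period_ms, focus=4.0):
    """Toy attentional energy at a given onset time.

    Equals 1.0 at multiples of the period (expected time points) and
    decays symmetrically for earlier or later onsets. `focus` is a
    hypothetical sharpness parameter, not a value from the literature.
    """
    phase = 2.0 * math.pi * onset_ms / period_ms
    return math.exp(focus * (math.cos(phase) - 1.0))

PERIOD_MS = 600.0  # hypothetical regular IOI
SHIFT_MS = 100.0   # hypothetical early/late displacement

on_time = attentional_pulse(PERIOD_MS, PERIOD_MS)
early = attentional_pulse(PERIOD_MS - SHIFT_MS, PERIOD_MS)
late = attentional_pulse(PERIOD_MS + SHIFT_MS, PERIOD_MS)

print(f"on-time: {on_time:.3f}, early: {early:.3f}, late: {late:.3f}")
```

Under these assumptions the on-time event receives the most attentional energy, and equally early and late events receive the same reduced amount, which mirrors the qualitative pattern of the pitch judgment results.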

Both SSIRH and DAT make predictions about how our cognitive system processes events as they unfold within a stimulus sequence, but predictions from SSIRH pertain to expectations for linguistic and musical structure, whereas those from DAT pertain to expectations for temporal structure. The two theories should converge in cases where expectations for music, language, and rhythm unfold simultaneously.

# Aims and Overall Predictions

The current study aims to examine the simultaneous cognitive processing of musical, linguistic, and rhythmic expectancies. We extend the reading time paradigm of Slevc et al. (2009), borrowing from the rhythmic expectancy manipulations of Jones et al. (2002), to investigate how the introduction of rhythmic expectancy affects musical and linguistic syntax processing. Rhythmic expectancy was manipulated through rhythmically early, on-time, or late conditions relative to a fixed, expected onset time. As previous ERP data have shown effects of temporal regularity in linguistic syntax processing (Schmidt-Kassow and Kotz, 2008), rhythmic expectancy is expected to affect syntax processing. The current behavioral study more specifically assesses how rhythmic expectancy may differentially modulate the processing of musical and linguistic syntax.

# EXPERIMENT 1

# Methods

Participants read sentences that were broken down into segments, each of which was paired with a chord from a harmonic chord progression. Linguistic syntax expectancy was manipulated using syntactic garden-path sentences, musical expectancy was manipulated using chords that were either in key or out of key, and rhythmic expectancy was manipulated by presenting critical region segments early, on time, or late.

### Participants

Fifty-six undergraduate students from Wesleyan University participated in this study in return for course credit. A recording error resulted in the loss of data for 8 of the 56 students, and so 48 participants' data were used in the final analysis. Of the remaining participants, all reported normal hearing. Twenty-eight participants (58.3%) reported having prior music training, averaging 6.8 years (SD = 3.4). Twenty-five (52%) participants identified as female, and 23 as male. Thirty-eight (79.1%) reported that their first language was English, three were native speakers of English and one other language, and seven had a language other than English as their first language. Other than English, participants' first languages included Chinese (Mandarin), Arabic, Thai, Japanese, Spanish, French, German, Vietnamese, and Bengali. Sixteen participants (33.3%) spoke more than one language. All participants had normal or corrected-to-normal vision and reported being free of psychiatric or neurological disorders. Informed consent was obtained from all subjects as approved by the Ethics Board of Psychology at Wesleyan University.

### Materials

All experiments were conducted in Judd Hall of Wesleyan University. An Apple iMac and Sennheiser HD280 pro headphones were used for the experiments, with MaxMSP software (Zicarelli, 1998) for all stimulus presentation and response collection.

### Stimuli

The current study used 48 sentences from Slevc et al. (2009). These sentences were divided into segments of one or several words, and presented sequentially on the iMac screen using MaxMSP. Twelve of the sentences were syntactic garden paths, which were manipulated to be either syntactically expected or unexpected at the critical region (by introducing a garden path effect; see **Figure 2**). Reading time (RT) comparisons between different conditions were controlled for length of segment because the critical regions are always the same number of words (as shown in **Figure 1**) in the different conditions. Sentence segments with the paired harmonic progression were presented at a critical region, either on-time (at the regular inter-onset interval of 1200 ms) or "jittered" to be either early or late. The early jitter was 115 ms earlier than the on-time presentation, and the late jitter was 115 ms later than the on-time presentation. Thus, the IOIs were either 1200 − 115 = 1085 ms (early), 1200 ms (on-time), or 1200 + 115 = 1315 ms (late; **Figure 2**). The jitter of 115 ms was selected based on pilot testing and on the IOIs used in Experiment 2 of Jones et al. (2002) in their manipulation of temporal expectancy. Accompanying chord progressions were played in MIDI using a grand piano timbre. These 48 different progressions were also taken from Slevc et al. (2009), followed the rules of Western tonal harmony, and were all in the key of C major. Out-of-key chords violated harmonic expectancy given the context, but were not dissonant chords by themselves (**Figure 1**). A yes-or-no comprehension question was presented at the end of each trial (sentence). Participants' task was to press the spacebar on the keyboard as soon as they had read each sentence segment, and to answer "yes" or "no" to the comprehension questions. Ninety-six unique comprehension questions, two for each sentence, were written so that each sentence would have one comprehension question with the correct answer "yes" and another with the correct answer "no." The comprehension questions are given in the Supplementary Materials accompanying this manuscript.
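The timing manipulation reduces to adding a signed offset to the regular IOI at the critical region. A minimal sketch of that arithmetic follows; the constant and function names are hypothetical and are not taken from the MaxMSP patch used in the study.

```python
BASE_IOI_MS = 1200  # regular inter-onset interval reported in the text
JITTER_MS = 115     # early/late displacement reported in the text

def critical_ioi(condition):
    """Return the IOI (ms) preceding the critical segment for a rhythm condition."""
    offsets = {"early": -JITTER_MS, "on-time": 0, "late": JITTER_MS}
    return BASE_IOI_MS + offsets[condition]

for condition in ("early", "on-time", "late"):
    print(condition, critical_ioi(condition))
```

The three values reproduce the 1085, 1200, and 1315 ms intervals stated above.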

Twelve unique experimental modules were created in order to counterbalance the experimental design. Each module contained all 48 sentences, with violation and filler conditions rotated through the sentences in order to control for systematic effects of content, length, and sentence order. Each module contained: 4 rhythmic violation trials (2 early and 2 late), 3 musical syntax violation trials, 1 linguistic syntax violation trial, 5 musical syntax plus rhythmic violation trials, 1 linguistic plus musical syntax violation trial, 2 linguistic syntax plus rhythmic violation trials, 2 trials with all 3 violations, and 30 sentences with no violations. Therefore, in a given module only 37.5% of trials contained any violation. Half of the sentences in a given module were assigned a "yes" question, and the other half were assigned a "no" question. The order of the trials was randomized for each subject.
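The module composition can be checked with a short tally. The sketch below only restates the trial counts listed above; the dictionary keys are hypothetical labels chosen for readability.

```python
# Trial counts per module, as listed in the text (labels are illustrative).
MODULE_TRIALS = {
    "rhythm only": 4,  # 2 early + 2 late
    "music only": 3,
    "language only": 1,
    "music + rhythm": 5,
    "language + music": 1,
    "language + rhythm": 2,
    "language + music + rhythm": 2,
    "no violation": 30,
}

total_trials = sum(MODULE_TRIALS.values())
violation_trials = total_trials - MODULE_TRIALS["no violation"]
violation_proportion = violation_trials / total_trials

print(total_trials, violation_trials, f"{violation_proportion:.1%}")
```

The counts sum to the 48 sentences in a module, and 18 of the 48 trials contain at least one violation, which gives the 37.5% stated above.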

### Procedure

Before beginning the experiment, the participants gave informed consent and completed a short background survey. The participants were then instructed to pay close attention to the sentences being read, rather than the chord progressions that were heard over the headphones. Then, the participants ran through a set of practice trials. After the practice trials, in the actual experiment the experimenter selected one of the 12 possible experimental modules at random. Participants were instructed to press the spacebar on the keyboard as soon as they had read the sentence segment, and then wait for the next segment to be presented. Pressing the spacebar caused the current sentence segment to disappear and an indicator button labeled "I read it" to light up. The following segment appeared at a fixed IOI regardless of when the current segment disappeared. After the end of each sentence, a yes-or-no comprehension question was displayed, at which point participants answered the question by pressing Y or N on the keyboard. Answering the comprehension question cued a new trial. The experiment lasted ∼20 min. Examples of different types of trials are shown in a video demo in the Supplementary Materials accompanying this manuscript.

### Data Analysis

RT and response data were saved as text files from MaxMSP, and imported into Microsoft Excel and SPSS for statistical analysis. RTs were log-transformed to approximate a normal distribution for statistical testing. Only RTs at the pre-critical, critical, and post-critical regions for each trial were used for analysis. Filler trials were, therefore, excluded from analysis (21 trials per subject). Of the remaining trials, trials with RTs that were two or more standard deviations from the mean of log-transformed critical region RTs were excluded as outliers, resulting in a range of 102.76–816.74 ms. These criteria led to the exclusion of 92 observations (7.20%) from critical regions in Experiment 1.

[Figure caption fragment] ...expected and unexpected conditions during rhythmically early (A), on-time (B), and late (C) conditions. Error bars show standard error.
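The outlier rule in the Data Analysis paragraph (log-transform the RTs, then drop values two or more standard deviations from the mean) can be sketched as follows. The text does not state the log base, whether the mean was computed per subject or across all observations, or which SD estimator was used; this sketch assumes natural logs, a single grand mean, and the sample SD. The RT values are made up for illustration.

```python
import math
import statistics

def exclude_outliers(rts_ms, n_sd=2.0):
    """Keep RTs whose log value lies strictly within n_sd SDs of the mean log RT.

    Assumptions (not specified in the text): natural log, one grand mean
    over all supplied RTs, and the sample standard deviation.
    """
    logs = [math.log(rt) for rt in rts_ms]
    mean_log = statistics.mean(logs)
    sd_log = statistics.stdev(logs)
    return [rt for rt, lg in zip(rts_ms, logs)
            if abs(lg - mean_log) < n_sd * sd_log]

# Hypothetical critical-region RTs (ms), including one very slow response.
sample_rts = [300, 320, 340, 310, 330, 305, 325, 315, 335, 2500]
kept_rts = exclude_outliers(sample_rts)
print(kept_rts)
```

In this made-up sample only the 2500 ms response falls outside the two-SD band and is removed.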

No significant differences were observed in log-transformed RTs between native English speakers (n = 41) and non-native English speakers [non-native n = 7, t(46) = 0.42, n.s.]. Similarly, no significant differences were observed between participants who reported musical training (n = 29) and those who reported no musical training [n = 19, t(46) = 1.53, n.s.]. To check for interactions between linguistic syntax and native English speaker experience, an ANOVA was run on the dependent variable of log-transformed RT with the fixed factor of linguistic syntax (congruent vs. incongruent) and the random factor of native English speaker status (native vs. non-native English speaker). No significant interaction between native English speaker status and linguistic syntax was observed [F(1, 92) = 0.53, MSE = 0.01, p = 0.47]. Similarly, to check for interactions between musical syntax and musical training, an ANOVA with the fixed factor of musical syntax (congruent vs. incongruent) and the random factor of musical training (musically trained vs. no musical training) showed no interaction between musical syntax and musical training [F(1, 92) = 0.091, MSE = 0.008, p = 0.764]. As we observed no main effects or interactions that were explainable by native English speaking experience or musical training, results were pooled between native and non-native English speakers, and between musically trained and untrained subjects.

### Results

On comprehension questions, participants performed significantly above chance in all conditions [overall M = 78.95%, s = 12.24, two-tailed t-test against chance level of 50% correct: t(47) = 16.38, p < 0.0001].
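The comparison against chance is a one-sample t-test, which can be recomputed from the reported summary statistics alone. The sketch below assumes that s is the sample standard deviation of per-participant accuracy and that n = 48; the function name is hypothetical.

```python
import math

def one_sample_t(sample_mean, sample_sd, n, mu0):
    """Return (t, df) for a one-sample t-test computed from summary statistics."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error, n - 1

# Values reported in the text: M = 78.95%, s = 12.24, 48 participants, chance = 50%.
t_value, df = one_sample_t(78.95, 12.24, 48, 50.0)
print(f"t({df}) = {t_value:.2f}")
```

Under these assumptions the statistic comes out at roughly 16.4 with 47 degrees of freedom, in line with the reported t(47) = 16.38; the small discrepancy is what rounding of the reported mean and SD would produce.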

A three-way ANOVA on the dependent variable of log-transformed RT during the critical region (log\_RT\_CR) was run with fixed factors of language (two levels: congruent vs. incongruent), music (two levels: congruent vs. incongruent), and rhythm (three levels: early, on-time, and late), with subject number as a random factor. Results showed a significant three-way interaction among the factors of linguistic, musical, and rhythmic expectancies [F(2, 52) = 5.02, MSE = 0.008, p = 0.01], as well as a significant main effect of language [F(1, 54) = 12.5, MSE = 0.006, p = 0.001], a significant main effect of rhythm [F(2, 99) = 13.2, MSE = 0.01, p < 0.001], and a marginally significant effect of music [F(1, 53) = 3.7, MSE = 0.01, p = 0.059]. Means and SDs of RTs are given in **Table 1** for each condition, and in **Table 2** for each cell.

To investigate any possible interactive effects between music and language syntax at different rhythmic conditions, an RT difference was computed between RTs for the critical region and for the pre-critical region. Two-way ANOVAs with fixed factors of language and music were used to test for interactions between music and language at each of the three rhythm conditions (early, on-time, and late). Results showed that for the rhythmically on-time condition, there was an interaction between language and music [F(1, 170) = 4.9, MSE = 4776.9, p = 0.027]. In contrast, the interaction between language and music was not significant at the rhythmically early condition [F(1, 170) = 0.27, MSE = 12882.0, p = 0.603] or the rhythmically late condition [F(1, 170) = 2.34, MSE = 5155.2, p = 0.127] (see **Figure 2**). These results suggest that the interaction between linguistic and musical syntax varies by rhythmic expectancy.
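In a 2 x 2 design, the language-by-music interaction amounts to asking whether the effect of linguistic incongruence differs between in-key and out-of-key chords. The sketch below computes that "difference of differences" from cell means; the ANOVA then tests whether this contrast differs reliably from zero. All cell means here are hypothetical numbers chosen for illustration, not values from the study.

```python
def interaction_contrast(cell_means):
    """Difference of differences for a 2 x 2 (language x music) design.

    `cell_means` maps (language, music) to a mean RT difference score in ms.
    A value of 0 means the language effect is the same for both music
    conditions (purely additive effects).
    """
    language_effect_in_key = (cell_means[("incongruent", "in-key")]
                              - cell_means[("congruent", "in-key")])
    language_effect_out_of_key = (cell_means[("incongruent", "out-of-key")]
                                  - cell_means[("congruent", "out-of-key")])
    return language_effect_out_of_key - language_effect_in_key

# Hypothetical cell means (critical minus pre-critical RT, in ms).
additive = {
    ("congruent", "in-key"): 10, ("incongruent", "in-key"): 30,
    ("congruent", "out-of-key"): 15, ("incongruent", "out-of-key"): 35,
}
interactive = {
    ("congruent", "in-key"): 10, ("incongruent", "in-key"): 30,
    ("congruent", "out-of-key"): 15, ("incongruent", "out-of-key"): 70,
}

print(interaction_contrast(additive), interaction_contrast(interactive))
```

In the first hypothetical pattern the language effect is 20 ms for both chord types, so the contrast is zero; in the second it grows from 20 ms to 55 ms with out-of-key chords, which is the kind of over-additive pattern an interaction test is designed to detect.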

Further investigation of the degree to which factors interacted at the critical region required comparing RTs across the pre-critical, critical, and post-critical time regions. For this comparison, difference scores were calculated by subtracting linguistically congruent from linguistically incongruent RTs, and these difference scores were compared for musically in-key and out-of-key trials across time regions for each rhythmic condition (see **Figure 3**). We found a significant effect of time region: RT was longer in the critical region in the rhythmically early condition only [F(2, 92) = 4.67, p = 0.012]. In the rhythmically late condition only, musical syntax violations produced larger difference scores at the critical region; however, this difference was not significant. In the rhythmically early and on-time conditions, musically in-key trials yielded larger difference scores than musically out-of-key trials at the critical regions, although these differences were not significant (see **Figure 3**).

TABLE 1 | Mean critical region RTs (ms) under different conditions of linguistic syntax, musical syntax, and rhythmic expectancies.

TABLE 2 | Mean critical region RTs (ms) under different combinations of conditions of linguistic syntax, musical syntax, and rhythmic expectancies.

# Discussion

Experiment 1 tested how rhythmic expectancy affected the processing of musical and linguistic syntax. Results from log-transformed RTs during the critical region (**Table 2**) and RT differences between critical and pre-critical regions (**Figure 2**) showed significant main effects of language and rhythm, a significant three-way interaction of language, music, and rhythm, and a significant two-way interaction between linguistic and musical syntax in the on-time condition only. These findings extend the results of past research (Slevc et al., 2009) to show that the sharing of cognitive resources for music and language appears specific to rhythmically expected events.

In contrast to critical region RTs, however, RT differences between linguistically incongruent and congruent trials (**Figure 3**) showed slower RTs within the critical region only during rhythmically early trials. The interaction patterns between musical and linguistic syntax over different time regions were inconclusive. This differs from the original findings of Slevc et al. (2009), who observed a synergistic interaction between musical syntax and time region on the reaction time difference between linguistically congruent minus incongruent trials, suggestive of a language and music interaction specifically during the critical region, when rhythm was not a factor. The less robust effect of critical region in this experiment may arise from spillover effects of linguistic incongruence that last beyond the critical region.

While neither SSIRH nor DAT makes specific predictions about this possible spillover effect, the main finding of a three-way interaction among language, music, and rhythm is generally consistent with both theoretical accounts and does suggest that any synergy or sharing of neural resources between music and language depends on rhythmic expectancy. Violations in rhythmic expectancy may disrupt the shared resources that are generally recruited for syntax processing, such as cognitive control (Slevc and Okada, 2015). As music and language both unfold over time, it stands to reason that our expectations for rhythm, defined here as the pattern of time intervals within a stimulus sequence (Grahn, 2012a), would govern any sharing of neural resources between music and language. This is consistent with the DAT (Large and Jones, 1999), as well as with prior behavioral data on rhythmic entrainment (Jones et al., 2002) and studies on the neural underpinnings of rhythmic entrainment (Henry et al., 2015) and their effects on linguistic syntax processing (Schmidt-Kassow and Kotz, 2008).

The three-way interaction between language, music, and rhythm is accompanied by significant main effects of language and rhythm, and a marginally significant main effect of musical expectancy. The main effect of rhythm is similar to that reported by Jones et al. (2002) and others, in which perturbed temporal expectations resulted in longer RTs. Incongruent garden-path sentences elicit longer RTs during the critical region compared to their congruent counterparts. This is consistent with Slevc et al. (2009) and Perruchet and Poulin-Charronnat (2013), as well as with previous uses of the self-paced reading time paradigm (Ferreira and Henderson, 1990). The main effect of musical expectancy was only marginally significant. While it is worth noting that Slevc et al. (2009) also did not report a significant main effect of musical expectancy, this weak effect may also be due to task instructions to pay close attention to the sentence segments rather than to the chord progressions heard over headphones. To determine whether music generally taxed cognitive or attentional resources away from subjects' monitoring of the sentence segments, it was necessary to compare comprehension accuracy with and without musical stimuli. This was a motivation for Experiment 2, in which the experiment was re-run without musical stimuli.

While previous studies that used a self-paced reading paradigm (Ferreira and Henderson, 1990; Trueswell et al., 1993; Slevc et al., 2009; Perruchet and Poulin-Charronnat, 2013) required subjects to activate the next sentence segment as part of the task, in order to implement a factor of rhythmic expectancy our design featured a fixed inter-onset interval of sentence segments, and subjects were asked instead to press a button to indicate that they had read each segment. To our knowledge this type of implementation is new for psycholinguistic studies. One of the goals of Experiment 2 is to check for the validity of this type of implementation by testing for an effect of linguistic congruency with fixed IOI presentations of sentence segments, even in the absence of musical stimuli.

# EXPERIMENT 2

Our modification of the standard self-paced reading paradigm resulted in fixed IOIs with the task of indicating that subjects had read the displayed sentence segment. This was a different task from the standard self-paced reading paradigm in which subjects' task was to advance the following sentence segment, and our task had yet to be confirmed as effective in detecting effects of linguistic syntax, even without the presence of musical stimuli. Furthermore, it was possible that the three-way and two-way interactions from Experiment 1 resulted from the complexity of our experimental design, and that the processing of multiple violations could affect attending and development of expectancy to task-irrelevant stimuli, as well as syntax processing per se. Experiment 2 thus follows up on Experiment 1 by investigating effects of rhythmic violations on comprehension and the processing of linguistic syntax stimuli, removing the variable of musical stimuli. A significant effect of linguistic syntax as well as rhythmic expectancy could validate the current manipulation of the self-paced reading paradigm, and a significant interaction between language and rhythm would suggest that the two domains tap into the same specific neural resources whereas no interaction might suggest more parallel processing.

# Methods

In Experiment 2, participants again read sentences broken down into segments. Linguistic syntax expectancy was manipulated using syntactic garden-path sentences, and rhythmic expectancy was manipulated by presenting critical region segments early, on-time, or late.

### Participants

A new group of 35 undergraduate students from Wesleyan University participated in Experiment 2 in return for course credit. Of these participants, all reported normal hearing, normal or corrected-to-normal vision, and no psychiatric or neurological disorders. Twenty-five participants (71.4%) reported having prior music training, averaging 5.9 years (SD = 3.0). Twenty (57.1%) participants identified as female, and 15 (42.9%) as male. Twenty-eight (80%) reported that their first language was English, and seven had a language other than English as their first language. Other than English, participants' first languages included Spanish, Chinese, and Thai. Twenty-four participants (68.6%) spoke more than one language. Informed consent was obtained from all subjects as approved by the Ethics Board of Psychology at Wesleyan University.

### Materials

The second experiment was conducted in the Music, Imaging, and Neural Dynamics (MIND) Lab Suite in Judd Hall at Wesleyan University. An Apple iMac was used for the experiment, with MaxMSP software for all stimulus presentation and response collection.

### Stimuli

The same MaxMSP experimental patch and 12 experimental modules, with the 48 sentences borrowed from Slevc et al. (2009), were reused from the first experiment. However, to investigate how rhythmic violations would affect reading and interact with violations in linguistic syntax, independent of violations in musical syntax, the experimental patch was muted, so that chords were not heard with each sentence segment. The IOIs of sentence segments remained unaltered, and the same "yes" or "no" comprehension questions were also asked at the end of each trial, with randomized order of the trials for each subject.

### Procedure

Similar to Experiment 1, participants were instructed to read sentences carefully, and hit the spacebar as soon as they had read a sentence segment. After running through a practice set, the participants began the actual experiment. The experimenter selected one of the twelve possible experimental modules at random. At the end of each trial, participants answered the "yes" or "no" comprehension question, cueing the next trial.

### Data Analysis

RTs and comprehension question responses were saved as text files from MaxMSP, and imported into Microsoft Excel and SPSS for statistical analysis. Only RTs at the pre-critical, critical, and post-critical regions for each trial were used for analysis. Filler trials were, again, excluded from analysis (21 trials per subject). The same parameters and methods of outlier exclusion were used as in the previous experiment, resulting in an RT range of 123.63–1121.40 ms. These criteria led to the exclusion of 19 observations (1.97%) in Experiment 2. RTs were also log-transformed to approximate a normal distribution for statistical tests.

Results between musically trained and non-musically trained subjects were pooled because music was not a factor in this experiment. No significant differences were observed in log-transformed RTs between native English speakers and non-native English speakers [t(34) = 0.96, n.s.]. Similarly, an ANOVA with the fixed factor of linguistic syntax and the random factor of native English experience showed no significant interaction [F(1, 523) = 1.059, MSE = 0.018, p = 0.30]. As we observed no differences that were explainable by native English speaking experience, results were pooled between native and non-native English speakers.

### Results

Participants performed significantly above chance (M = 86.93%, s = 6.21) on comprehension questions in all conditions. To compare comprehension accuracy with and without musical stimulus presentation, a one-way ANOVA on average comprehension accuracy as the dependent variable was run with the factor of experiment, comparing average comprehension accuracy for subjects between Experiment 1 and 2. Results showed a significant main effect of experiment on comprehension accuracy, with subjects from Experiment 2 performing better on average on comprehension questions than those from Experiment 1 [F(1, 81) = 12.51, MSE = 0.01, p = 0.001]. This suggests that the added variable of musical expectancy further taxed participants' attention from the task-relevant comprehension questions in Experiment 1.
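The between-experiment comparison is a one-way ANOVA with two groups, so its degrees of freedom follow directly from the sample sizes: 1 between groups and 48 + 35 − 2 = 81 within groups. The sketch below implements the standard F ratio from scratch to make that bookkeeping explicit. The accuracy values are hypothetical placeholders (only the group sizes match the study), so the printed F is not the reported statistic.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-way between-subjects ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.fmean(all_values)
    group_means = [statistics.fmean(group) for group in groups]
    ss_between = sum(len(group) * (m - grand_mean) ** 2
                     for group, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for group, m in zip(groups, group_means) for x in group)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical per-participant accuracies; only the group sizes (48, 35) follow the text.
exp1_accuracy = [0.70 + 0.01 * (i % 10) for i in range(48)]
exp2_accuracy = [0.80 + 0.01 * (i % 10) for i in range(35)]

f_ratio, df_between, df_within = one_way_anova([exp1_accuracy, exp2_accuracy])
print(f"F({df_between}, {df_within}) = {f_ratio:.2f}")
```

With two groups this F ratio equals the square of the independent-samples t statistic, so either test would lead to the same conclusion.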

A two-way ANOVA on the dependent variable of log-transformed RT during the critical region was run with the factors of language and rhythm. Results showed a significant main effect of language [F(1, 34) = 7.69, MSE = 0.001, p = 0.009], a significant effect of rhythm [F(2, 68) = 9.69, MSE = 0.001, p < 0.001], and no significant two-way interaction [F(2, 68) = 1.07, MSE = 0.001, p = 0.83]. Mean and SD RTs are shown for each condition in **Table 3** and for each cell in **Table 4**.

# Discussion

Results from Experiment 2 showed main effects of language and rhythm, validating the use of this novel task. There was also a higher comprehension accuracy compared to Experiment 1, but no interactions between the two factors of linguistic syntax and rhythmic expectancy (see **Table 4**).

Experiment 2 further investigates the effects of rhythmic expectancy on linguistic syntax processing. When the factor of music was removed, main effects of language and rhythm were still observed. RTs were longer for syntactically unexpected sentences, replicating results from Experiment 1 as well as previous experiments that used the self-paced reading time paradigm (Ferreira and Henderson, 1990; Trueswell et al., 1993). Notably, this finding of longer RTs during syntactically unexpected critical regions within the garden path sentences provides a validation of the current adaptation of the self-paced reading time paradigm: while previous studies that used the self-paced reading time paradigm (Ferreira and Henderson, 1990; Trueswell et al., 1993; Slevc et al., 2009; Perruchet and Poulin-Charronnat, 2013) required subjects to advance the sentence segments manually, in the current study we adapted the paradigm with fixed IOIs to enable simultaneous investigations of rhythmic and linguistic syntax expectancy.

TABLE 4 | Mean critical region RTs (ms) under different combinations of conditions of linguistic syntax and rhythmic expectancies.

Effects of rhythmic expectancy were also observed, as participants were slower to respond to critical regions presented earlier or later than the expected IOI. This replicates results from Experiment 1 and suggests that temporal entrainment was possible even with a visual-only reading task, and thus is not limited to the auditory modality. This effect of rhythm on visual processing is consistent with prior work on rhythmic effects of visual detection (Landau and Fries, 2012) and visual discrimination (Grahn, 2012b).

Although main effects of language and rhythm were observed, there was no significant interaction. An explanation for this lack of interaction could be that removing the factor of music resulted in the implemented violations no longer being sufficiently attention-demanding to lead to an interaction between the remaining factors, resulting in parallel processing of language and rhythm. In this view, the data suggest that rhythm affects a general, rather than a syntax-specific, pool of attentional resources. When the factor of music was removed, fewer attentional resources were demanded from the available pool, reducing the interactive effects of language and rhythm on each other and resulting in no interaction and higher comprehension accuracy. Alternatively, it could be that the rhythm only affected peripheral visual processing, without also affecting syntax processing at a central level. While the present experiment cannot tease apart these possible explanations, considering the extant literature on relationships between rhythm and grammar (Schmidt-Kassow and Kotz, 2009; Gordon et al., 2015b) it is clear that rhythm can affect central cognitive processes such as syntactical or grammatical computations.

Finally, another finding from Experiment 2 is that comprehension accuracy was higher compared to Experiment 1, suggesting that eliminating the factor of music restored some attentional resources to the task of comprehension. When the primary task was to read sentence segments for comprehension, musical stimuli in the background could have functioned as a distractor in a seeming dual-task condition of comprehending the entire sentence while responding to each segment (by pressing the spacebar).

Taken together, Experiment 2 helps to validate the paradigm used in Experiment 1. By simplifying the experiment to remove the factor of music, some attentional resources may have been restored, resulting in higher comprehension accuracy overall, as well as main effects of language and rhythm with no interaction between the two.

# GENERAL DISCUSSION

The goal of the current study is to examine how rhythmic expectancy affects the processing of musical and linguistic syntax. Experiment 1 shows main effects of language, music, and rhythm, and specificity of the interaction between musical and linguistic syntax in the rhythmically expected condition only. These data patterns confirm that rhythm affects the sharing of cognitive resources for music and language, and are largely consistent with SSIRH (Patel, 2003) and DAT (Large and Jones, 1999). However, some of the follow-up analyses are inconclusive as to the exact nature of these interactions over time. In particular, only in rhythmically early trials did we find that the critical region significantly affected the difference in RT between incongruent and congruent language trials, with no significant interactions with musical expectancy, unlike in Slevc et al. (2009). The reason for this specific effect of critical region in rhythmically early trials is unclear. It might arise from some spillover effects of linguistic incongruence that last beyond the critical region in rhythmically on-time and late trials. Alternatively, it might be a consequence of the complexity of our task in this experiment design. Although the significant main effects suggest that our manipulations were effective, this inconclusive data pattern may nevertheless result from low power due to relatively few trials per cell in the design of Experiment 1.

As it is possible that results were due to the complexity of our design, Experiment 2 simplifies the design by eliminating the factor of music altogether. Results of Experiment 2 show superior comprehension accuracy compared to Experiment 1, and main effects of language and rhythm without an interaction between the two factors. The main effects help to validate our adaptation of the original self-paced reading time paradigm (Ferreira and Henderson, 1990; Trueswell et al., 1993) for research in rhythmic expectancy. The null interaction, when accompanied by significant main effects, suggests that given the task conditions and attentional allocation in Experiment 2, rhythm and language were processed in parallel and did not affect each other.

The superior comprehension accuracy in Experiment 2 may be explained by an increase in general attentional resources that are now available to subjects in Experiment 2 due to the removal of music as a factor. While it was not specifically tested whether these general attentional mechanisms may be the same or different from the temporal attention that is taxed by temporal perturbations of rhythmic expectancy, other literature on voluntary (endogenous) vs. involuntary (exogenous) attention might shed light on this distinction (Hafter et al., 2008; Prinzmetal et al., 2009). Voluntary or endogenous attention, such as that tested in dual-task situations when the task is to attend to one task while ignoring another, is similar to the general design of the present studies where subjects are instructed to pay attention to sentence segments while ignoring music that appears simultaneously. Involuntary or exogenous attention, in contrast, is driven by stimulus features such as rhythmic properties as tapped by our rhythmic expectancy manipulations. Previous research has shown that voluntary attention tends to affect accuracy whereas involuntary attention affects reaction time (Prinzmetal et al., 2005). This fits with our current findings where comprehension accuracy is affected by the removal of music as a factor (by comparing Experiments 1 and 2), whereas reading time is affected by rhythmic perturbations of the presentation of sentence segments.

In both experiments, effects of rhythm were observed in response to visually-presented sentence segments. While the rhythmic aspect of language might generally manifest itself more readily in the auditory than the visual modality, this effect observed from the visual manipulations suggests that rhythmic expectation for language is not limited to auditory processing, but may instead pervade the cognitive system in a modality-general manner, affecting even the visual modality. As visual detection and discrimination are both modulated by rhythm (Grahn, 2012b; Landau and Fries, 2012) and musical expectation can cross-modally affect visual processing (Escoffier and Tillmann, 2008), the current study provides support for the view that rhythmic, musical, and linguistic expectations are most likely not tied to the auditory modality, but instead affect the cognitive system more centrally.

Results appear to be independent of musical training and native English speaker experience. The link between linguistic and musical grammar processing could have been expected to vary by musical and linguistic expertise: children who perform well on phonemic or phonological tasks also outperform their counterparts in rhythmic discrimination as well as pitch awareness (Loui et al., 2011; Gordon et al., 2015b). At a neural level, brain areas and connections that subserve language differ in structure and function in professional musicians (Sluming et al., 2002; Halwani et al., 2011), and some highly trained populations, such as jazz drummers, process rhythmic patterns in the supramarginal gyrus, a region of the brain that is thought to be involved in linguistic syntax (Herdener et al., 2014). Despite these effects of training and expertise, the current study found no effects of musical training or linguistic background, converging with the original study (Slevc et al., 2009) as well as prior reports of the language-like statistical learning of musical structure (Loui et al., 2010; Rohrmeier et al., 2011). It is possible that only some types of task performance, such as those that tap more sensory or perceptual resources, might be affected by music training via selective enhancement of auditory skills (Kraus and Chandrasekaran, 2010).

In sum, the current study demonstrates that rhythmic expectancy plays an important role in the shared processing of musical and linguistic structure. The subject of shared processing of musical and linguistic structure has been central to music cognition, as has the question of how rhythm affects attentional entrainment. While providing support for an overlap in processing resources for musical and linguistic syntax, the current results also suggest that perturbations in the rhythmicity of stimulus presentation tax these attentional resources. By offering a window into how perturbations of rhythmic and temporal expectancy affect musical and linguistic processing, results may be translatable toward better understanding and possibly designing interventions for populations with speech and language difficulties, such as children with atypical language development (Przybylski et al., 2013; Gordon et al., 2015a). Toward that goal, the specific neural underpinnings of these shared processing resources remain to be addressed in future studies.

### ACKNOWLEDGMENTS

Supported by startup funds from Wesleyan University, a grant from the Grammy Foundation and the Imagination Institute to PL, and the Ronald E. McNair Scholars Program to HJ. We thank all our participants and L. Robert Slevc for helpful comments at an early stage of this project.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2015.01762


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Jung, Sontag, Park and Loui. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Pronunciation difficulty, temporal regularity, and the speech-to-song illusion

### *Elizabeth H. Margulis\*, Rhimmon Simchy-Gross and Justin L. Black*

Music Cognition Lab, University of Arkansas, Fayetteville, AR, USA

### *Edited by:*

McNeel Gordon Jantzen, Western Washington University, USA

### *Reviewed by:*

Erin E. Hannon, University of Nevada, USA; Cyrille Magne, Middle Tennessee State University, USA

*\*Correspondence:* Elizabeth H. Margulis, Music Cognition Lab, University of Arkansas, MUSC 201, Fayetteville, AR 72701, USA; e-mail: ehm@uark.edu

The speech-to-song illusion (Deutsch et al., 2011) tracks the perceptual transformation from speech to song across repetitions of a brief spoken utterance. Because it involves no change in the stimulus itself, but a dramatic change in its perceived affiliation to speech or to music, it presents a unique opportunity to comparatively investigate the processing of language and music. In this study, native English-speaking participants were presented with brief spoken utterances that were subsequently repeated ten times. The utterances were drawn either from languages that are relatively difficult for a native English speaker to pronounce, or languages that are relatively easy for a native English speaker to pronounce. Moreover, the repetition could occur at regular or irregular temporal intervals. Participants rated the utterances before and after the repetitions on a 5-point Likert-like scale ranging from "sounds exactly like speech" to "sounds exactly like singing." The difference in ratings before and after was taken as a measure of the strength of the speech-to-song illusion in each case. The speech-to-song illusion occurred regardless of whether the repetitions were spaced at regular temporal intervals or not; however, it occurred more readily if the utterance was spoken in a language difficult for a native English speaker to pronounce. Speech circuitry seemed more liable to capture native and easy-to-pronounce languages, and more reluctant to relinquish them to perceived song across repetitions.

**Keywords: speech-to-song illusion, repetition, music and language, music perception, meter**

### **INTRODUCTION**

Music and speech offer excellent comparative cases to illuminate the mechanisms subserving human communication (cf. Patel, 2008). They share many acoustic features, but differ in salient ways too – music tends to feature slower pitch changes, more stable fundamental frequencies, and rhythmic structure that gives rise to the perception of an isochronous beat. Music and speech may share not only developmental origins (McMullen and Saffran, 2004), but also common evolutionary origins (Wallin et al., 2001), yet they often seem quite phenomenologically distinct. It can seem that music is heard as music, and speech is heard as speech, and that is that. Several years ago, however, Deutsch et al. (2008) reported a striking illusion where repeating a segment of speech could engender a perceived transformation from speech to song. In this illusion, participants first hear an ordinary spoken utterance. Then they hear a segment from this utterance repeated 10 times in succession. Finally, they rehear the original utterance, but on this hearing, the segment that had been repeated sounds as if it were being sung rather than spoken. Rhythmic and pitch content comes strikingly to the fore, and this change in perceptual orientation results in a change in the category to which listeners attribute the stimulus.

Since the discovery of this illusion, various studies have sought to examine what qualities must be in place for this perceptual transformation to occur. Deutsch et al. (2011) showed that no illusory change to song occurred if the repetitions were inexact – if they were slightly differently transposed in pitch on each repetition, or if the syllables were jumbled into different orderings on each repetition. Tierney et al. (2013) were able to collect a set of spoken utterances that tended to transform to song after repetition, and a set of spoken utterances that did not tend to transform. The utterances that did transform were distinguished from the others by slightly more stable fundamental frequency contours within syllables, and by more regular spacing of inter-accent intervals. When speech was perceived as song, regions associated with pitch processing such as the superior temporal gyrus and regions associated with auditory-motor integration such as the precentral gyrus were differentially activated. These results suggest that not only does a shift from speech to song reflect increased attention to pitch, but it might also entail more imagined motor involvement. When we hear a song, we tend to sing along in our heads in a way that is quite different from how we listen to speech.

Falk et al. (2014) showed that when the utterance's pitch contour was made up of stable tonal targets, people perceived the transformation to song earlier and more frequently. Rhythmic aspects of the utterance did not play as large a role. They also manipulated the regularity of the pause between utterances, but found it had no effect on the speech-to-song transformation. These findings are consistent with Mantell and Pfordresher (2013), which used a vocal imitation task to show that people could replicate the absolute pitch of song more accurately than the absolute pitch of speech, but there was no difference in accuracy between song and speech on replication of timing. People with and without formal musical training experienced the illusion the same way.

Given the increased auditory-motor integration for song perception revealed in Tierney et al. (2013), we wondered whether part of what distinguishes attending to music from attending to speech is a participatory stance, where the listener begins to sing through a tune in her head while it is playing after she has heard it a few times – a hypothesis explored in Margulis (2013). To address this hypothesis, languages of varying pronunciation difficulty were used. It should be easy to imaginatively reproduce native language speech after a few repetitions, but progressively harder as the language gets more difficult to pronounce relative to the native language. For example, since Catalan might be judged by English speakers to be easier to pronounce than Hindi, a few repetitions of a Catalan sentence might allow English speakers more accurate auditory imagery of the phrase than a few repetitions of a Hindi sentence, resulting in a stronger tendency for the Catalan sentence to transform to music. This hypothesis suggests, then, that the differences between pre- and post-repetition speech-to-song ratings should be greatest in the native language (English), and progressively smaller as the languages get more difficult to pronounce.

We also wondered whether the higher-level temporal regularity produced by spacing the repetitions at identical intervals was necessary for the illusion to occur. Falk et al. (2014) found that temporal regularity was not necessary, but we used a different method for making the repetitions temporally irregular, a method that made the difference between the regular and irregular versions more salient. We sought to confirm that higher-level temporal regularity was not required for repetition to transform speech into song.

Our study used recordings from an archive of native speakers telling the same story in different languages as stimuli. Half of the recordings were from languages hypothesized to be easier for English speakers to pronounce, and half were from languages hypothesized to be harder for English speakers to pronounce. The English language recording of the story was also included for comparison. Half of the participants heard these recordings in a temporally regular condition, where each repetition followed after an identical temporal interval, and half of the participants heard them in a temporally irregular condition, where the repetition occurred at unpredictable intervals. They rated a phrase from each utterance on a 5-point scale from speech to song both before and after the repetitions. The difference in ratings was taken as an index of the transformation from speech to song. At the end of the session, participants responded to various questions about the languages in the study, including how difficult each might be to pronounce, so that the results could be interpreted in terms of participants' actual ratings of pronunciation difficulty, in addition to the hypothesized categories.

### **MATERIALS AND METHODS**

### **PARTICIPANTS**

The 24 participants (8 male, 16 female) ranged in age from 18 to 22 with a mean age of 19.6 years (*SD* = 1.2). In exchange for participating, they received extra credit in a general music appreciation course aimed at non-majors called Music Lecture. Only one participant reported being enrolled as a music major. Only six of the participants reported formal training in music, all of it at a young age and all of it short-lived. Thus, unlike Deutsch et al. (2011), which used musically trained listeners as participants, this study focuses predominantly on people without formal musical training. Since results did not change when the one music major participant was excluded, we retained all participants in the reported analyses. All participants were native English speakers, and all reported normal hearing.

One participant reported fluency in each of the following languages: Vietnamese, Japanese, Chinese, and Swedish. Twelve participants reported some experience with Spanish. Of these, three reported they were fluent, three reported their level of Spanish ability to be advanced, three reported it to be at the beginner level, and the rest reported an intermediate ability. One participant had studied beginning Japanese, and one participant reported proficiency in Vietnamese. The Chinese speaker, the Japanese speaker, and one Spanish speaker reported using the language in childhood. None of the participants reported receiving training in any of the languages used in the experiment.

All participants signed an informed consent form before starting the experiment. The protocol was approved by the University of Arkansas Institutional Review Board.

### **MATERIALS**

Seven excerpts from non-tonal languages were selected from the examples used in the Handbook of the International Phonetic Association (1999), available at http://web.uvic.ca/ling/resources/ipa/handbook\_downloads.htm. Each excerpt consisted of a person speaking the following utterance "The north wind and the sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak. They agreed that the one who first succeeded in making the traveler take his cloak off, should be considered stronger than the other" in one of seven languages: English, Catalan, Portuguese, French, Croatian, Hindi, or Irish. Aside from English, three languages (Catalan, Portuguese, and French) were hypothesized to be easier for English speakers to pronounce, and three languages (Croatian, Hindi, and Irish) were hypothesized to be harder for English speakers to pronounce. All the excerpts except the Catalan one were spoken by a female speaker.

The mean utterance length was 12.1 s (*SD* = 2.7). For each language, a segment was extracted from the utterance using Audacity 2.0.3. The segment extraction was made at about the three-quarter mark of each utterance. The mean segment length was 2.7 s (*SD* = 0.3).

Two stimuli were created for each language: a temporally regular and a temporally irregular version. The regular versions consisted of the full utterance followed by 10 segment repetitions, each separated by 1000 ms. The irregular versions were more complex, consisting of the full utterance followed by 10 segments, each separated by time intervals that were random percentages (between 1% and 50%) shorter and longer than 1000 ms. For each randomly selected percentage, one interonset interval was created by shortening the 1000 ms span the appropriate amount, and another was created by lengthening it. To increase the salience of the temporal shifts, 400 ms was subtracted from each of the sub-1000 ms values, and added to each of the over-1000 ms values. For example, the randomly selected percentage 15% generated the interonset intervals 450 ms (850–400 ms) and 1550 ms (1150 ms + 400 ms).

A total of 10 interonset intervals were generated from 5 randomly selected percentages. The order of the 10 time interval lengths was randomized. In a few cases, this randomization resulted in two similar time intervals placed back to back (e.g., 510 ms followed by 520 ms); when this happened, one of the time intervals was moved to a different position in the sequence. The object was to create a series of time intervals that made the extraction of meter as unlikely as possible. One advantage of the procedure is that the total duration of all the repetitions was the same in the regular and the irregular condition, eliminating an explanation based on exposure length rather than temporal regularity.
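The interval construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual script: the function names (`interval_pair`, `irregular_intervals`) are hypothetical, the percentages are assumed to be whole numbers, and the manual step of relocating similar back-to-back intervals is not reproduced.

```python
import random

BASE_MS = 1000   # regular spacing between repetitions, as reported in the text
SHIFT_MS = 400   # extra offset added to make the irregularity more salient


def interval_pair(percent):
    """Return the (shorter, longer) intervals in ms derived from one percentage."""
    delta = BASE_MS * percent // 100      # e.g. 15% of 1000 ms is 150 ms
    shorter = BASE_MS - delta - SHIFT_MS  # shortened span, then minus 400 ms
    longer = BASE_MS + delta + SHIFT_MS   # lengthened span, then plus 400 ms
    return shorter, longer


def irregular_intervals(rng, n_pairs=5):
    """Draw n_pairs percentages in [1, 50] (assumed integers) and shuffle the
    resulting 2 * n_pairs intervals into a random order."""
    intervals = []
    for _ in range(n_pairs):
        intervals.extend(interval_pair(rng.randint(1, 50)))
    rng.shuffle(intervals)
    return intervals


if __name__ == "__main__":
    print(interval_pair(15))                        # the worked example in the text
    print(irregular_intervals(random.Random(1)))
```

Because each shortened interval is paired with a lengthened one, every pair sums to 2000 ms, so the 10 intervals always total 10,000 ms, the same as ten regular 1000 ms intervals. This is the property the text relies on to rule out exposure length as an explanation.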

### **PROCEDURE**

Participants were seated at a computer terminal in a Whisper-Room 4 by 4- Enhanced, Double Wall Isolation Booth and outfitted with Sennheiser HD 600 headphones. Instructions were presented on screen and stimuli were presented over the headphones. Participants made all responses using the keyboard and mouse.

Participants were randomly assigned to one of two groups. Group one heard the repetitions in temporally regular form; group two heard the repetitions in temporally irregular form. All other procedures for the two groups were the same.

First, participants answered a series of demographic questions. Next, they performed the task for each of the 7 languages, with the language order randomized. For each language, they were told to listen carefully to an utterance. After the utterance was complete, they were told they would hear a segment from the utterance and be asked to rate it on a scale from 1 to 5, with 1 signifying "sounds exactly like speech" and 5 signifying "sounds exactly like singing." They were played the segment, and asked to rate it. Next, they were told they would rehear the utterance, followed by 10 repetitions of the segment, followed by a restatement of the entire utterance. It was explained that they should then rate how the segment sounded within that utterance on the same 1 to 5 scale. Thus, participants rated the segment twice: once before and once after the repetitions. Finally, participants answered a series of questions about each of the languages in the experiment, with the language order randomized. For each language, they were replayed the utterance and asked to enter the name of the language. Next, they rated the familiarity of the language on a scale from 1 to 5. Finally, they were asked how easy they thought it would be to pronounce the words in the language accurately, on a scale from 1 to 5.

### **RESULTS**

A linear mixed model was used with the difference in speech–song ratings pre and post repetition as the dependent variable, language difficulty (native, easy, and hard) and temporal structure (regular vs. irregular) as fixed factors, and language (English, Catalan, Portuguese, French, Croatian, Hindi, Irish) as a repeated variable. As shown in **Figure 1**, there was a main effect of the hypothesized pronunciation difficulty of the language on the change in speech–song ratings, *F*(2,140) = 6.45, *p* = 0.002; however, there was no main effect of temporal regularity, *F*(1,27) = 0.03, *p* = 0.87.

**Table 1** shows the speech–song ratings for each language category before and after the repetitions. The pre and post repetition ratings were different for every category except English, signifying that a transformation from speech to song occurred in every foreign language, but not the native one. Rating changes from pre to post repetition increased from the native to easy to hard categories, signifying an intensification of the speech-to-song illusion for languages hypothesized to be difficult for native English speakers to pronounce.

**Table 2** shows the mean rating change for each language in the temporally regular and irregular condition. Patterns were broadly similar between the two groups, with easier to pronounce languages engendering less dramatic transformation from speech to song and harder to pronounce languages engendering more dramatic transformation. **Table 3** summarizes this effect for each of the three language difficulty categories.

As shown in **Table 4**, participants rated how difficult they thought each language would be to pronounce accurately. Participants' ratings generally correlated with the hypothesized difficulty ratings, but their judgments tended to group into four categories rather than three – Native (English); Easy (Catalan, Portuguese); Medium (French, Croatian); and Hard (Hindi, Irish). In the hypothesized categories, French was grouped with Easy and Croatian with Hard. The data were reanalyzed using these four categories rather than the original three as predictors. Rating change varied significantly according to the difficulty of each language as rated by the participants, *F*(3,138) = 5.30, *p* = 0.002, as shown in **Figure 2**. **Table 5** lists the means for these categories.

**Figure 2** shows the speech–song ratings before and after the repetition for each of the three hypothesized language difficulty categories. **Figure 3** shows the same trend for speech–song ratings within each of the four participant-rated language difficulty categories. For each breakdown of the categories, harder-to-pronounce languages were rated as more songlike to begin with; however, harder-to-pronounce languages also experienced a larger speech-to-song transformation than easier-to-pronounce languages. The native language experienced the least transformation, the easy and medium more, and the hard the most.

**Table 1 | Mean speech-to-song ratings for each language difficulty category before and after repetition.**

**Table 2 | Mean changes in speech-to-song rating from pre to post repetition for each language in each condition.**

**Table 3 | Mean changes in speech-to-song rating from pre to post repetition for each category in each condition.**

Participants also rated the familiarity of each language (also shown in **Table 4**). These ratings were marginally predictive of speech–song rating changes post repetition, *F*(1,155) = 3.80, *p* = 0.05.

Ninety-six percent (all but one) of the participants correctly identified the English language. Eighty-three percent correctly identified the French language. Every other language was correctly identified by one participant (4% of respondents), except Portuguese, which was correctly identified by 4 (16%). A large percentage of participants misidentified Catalan as Spanish, potentially accounting for the high familiarity ratings for Catalan despite the low success with identifying its name. This pattern underscores the distinction between perceived pronunciation difficulty and mere familiarity; the sound of the French language was quite familiar to participants, and most were able to identify it correctly; however, they still rated the language as moderately difficult to pronounce. Responses did not differ by gender.

In order to ascertain whether there was something inherently more music-like about stimuli in some categories, we used two measures from Tierney et al. (2013), one to assess the degree of fundamental frequency stability across syllables in each utterance, and the other to assess the degree of temporal regularity among syllable stresses. Following the procedure outlined in that paper, we first used Praat to assess the fundamental frequency stability across each syllable in each utterance, by calculating the average fundamental frequency change in semitones per second. In Tierney et al. (2013), utterances that were more likely to transform to song had more within-syllable fundamental frequency stability (less change). **Table 6** lists the mean fundamental frequency change for each syllable in each of the seven languages. These means do not vary significantly between pronunciation difficulty categories, except between the Native and Easy participant-rated categories. Stimuli in the category rated by participants as Easy exhibited more fundamental frequency variability per syllable than stimuli in the Native category. If acoustic characteristics were driving the effect, we would expect to see the stimuli with less intrasyllable frequency variability (the Native stimuli) transform to song more easily; however, the opposite effect occurred. This reinforces the notion that the pronunciation difficulty, rather than some more basic acoustic characteristic, influenced the degree to which particular utterances were susceptible to the speech-to-song illusion.

**Table 4 | Participants' difficulty and familiarity ratings for each language compared with their hypothesized categories.**
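As a rough illustration of a fundamental frequency change measure expressed in semitones per second, the following Python sketch converts successive F0 samples within one syllable to semitone steps and divides the summed absolute change by the syllable's duration. The exact computation in Tierney et al. (2013) is not specified here, so the aggregation rule, the function names (`semitones`, `f0_change_rate`), and the input format (matched lists of times and F0 values, for instance exported from a Praat pitch track) are all assumptions.

```python
import math


def semitones(f_from_hz, f_to_hz):
    """Signed interval between two frequencies in semitones (12 per octave)."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)


def f0_change_rate(times_s, f0_hz):
    """Summed absolute F0 change across one syllable, in semitones per second.

    times_s and f0_hz are matched samples of a voiced F0 track (hypothetical
    input format); lower values indicate a more stable fundamental frequency.
    """
    if len(times_s) != len(f0_hz) or len(times_s) < 2:
        raise ValueError("need at least two matched time/F0 samples")
    total = sum(abs(semitones(a, b)) for a, b in zip(f0_hz, f0_hz[1:]))
    return total / (times_s[-1] - times_s[0])


if __name__ == "__main__":
    # A flat contour versus an octave glide over half a second.
    print(f0_change_rate([0.0, 0.1, 0.2], [200.0, 200.0, 200.0]))
    print(f0_change_rate([0.0, 0.5], [220.0, 440.0]))
```

A per-language value comparable to those described in the text would then be the mean of this quantity over all syllables in the utterance.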

Next, following another procedure in Tierney et al. (2013), we identified the timepoints of the onsets of stressed syllables in each utterance. To assess the temporal regularity of the speech segment, we measured the SD of the duration between successive onsets of stressed syllables. The results for each language are shown in **Table 7**.
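The temporal regularity measure amounts to a standard deviation over inter-onset durations, which can be sketched as follows. The function name `stress_interval_sd` and the onset values are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not say which was used.

```python
import statistics


def stress_interval_sd(onsets_s):
    """SD of the durations between successive stressed-syllable onsets.

    onsets_s is a list of onset times in seconds, in ascending order.
    Smaller values indicate more evenly spaced stresses.
    """
    gaps = [later - earlier for earlier, later in zip(onsets_s, onsets_s[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three onsets to estimate an SD")
    return statistics.stdev(gaps)  # sample SD; an assumption, see lead-in


if __name__ == "__main__":
    print(stress_interval_sd([0.0, 0.5, 1.0, 1.5]))  # perfectly regular stresses
    print(stress_interval_sd([0.0, 0.4, 1.0, 1.4]))  # uneven stresses
```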

If the results were driven by these acoustic characterizations rather than by pronunciation difficulty, we would have expected to see fundamental frequency change correlate negatively with the size of the speech–song rating change across repetitions; languages with large intrasyllable fundamental frequency changes (lower frequency stability) should transform to song less easily, as shown by smaller speech–song rating changes. Instead, no consistent pattern emerged (*p* > 0.05). We would also have expected to see the standard deviations of stressed syllable onsets vary negatively with speech–song rating change across repetitions; more temporally irregular utterances should transform to song less easily. Again, however, no consistent pattern emerged (*p* > 0.05).

### **DISCUSSION**

Contrary to our initial hypothesis, utterances spoken in languages more difficult to pronounce relative to the listener's native tongue were actually *more* susceptible to the speech-to-song illusion. Since it should have been easier to imaginatively simulate the pronunciation of the syllables in easier to pronounce languages, it seems on first pass that this kind of virtual participation must not be essential to musical attending. Yet there is another way of understanding this result.

**Table 5 | Mean speech-to-song ratings for each of the participants' language difficulty categories.**

**Table 7 | Variability of duration between stressed syllable onsets for each language.**

The pre-repetition ratings from this experiment show that harder to pronounce languages *started out* sounding more musical to listeners, even before any repetitions had contributed the illusory transformation. When the data are reanalyzed using the same methods except substituting initial ratings rather than rating differences as the dependent variable, there is a main effect of hypothesized language difficulty, *F*(2,40) = 5.05, *p* = 0.008. If the speech-to-song illusion had been independent of the pronunciation difficulty, the solid lines on **Figures 2** and **3** would have moved up in parallel to the dotted lines, signifying that languages in each of the categories transformed to song after repetition to roughly the same degree. But instead the slope of the solid lines is steeper; the languages that were more difficult to pronounce, and more songlike to start with, became *even more* songlike after repetition than did the easier to pronounce languages. This suggests that when speech circuitry captures acoustic input, it is more resistant to releasing it to other perceptual mechanisms. Speech circuitry seems more likely to capture acoustic input when it is easy to pronounce than when it is hard to pronounce.

To imagine what this release might entail, consider the semantic satiation effect (Severance and Washburn, 1907). It is normally very difficult to perceive a word independently of its semantic correlate. It takes many repetitions before the meaning starts to disintegrate and the sounds can be heard on their own terms. Across the course of these repetitions, it is almost possible to feel the release as the lexicon's grip on the word recedes. The harder to pronounce languages may not have elicited as strong a grip by language regions in the first place, allowing repetition to effect a starker shift to song.

Our results supported those in Falk et al. (2014) showing that the illusion occurred whether the repetitions were spaced regularly or irregularly. Temporal regularity does not seem to be a necessary factor in the speech-to-song illusion. The illusion seems to be driven by repetition itself rather than by the emergence of larger-scale temporal regularity.

Additionally, the transformation to song does not seem to be driven by the acoustic characteristic of fundamental frequency stability within syllables, or the acoustic characteristic of regularity between stressed syllable onsets. This strengthens the case that pronunciation difficulty—and perhaps associatedly, the degree to which an utterance is captured by speech circuitry—can influence any particular utterance's susceptibility to the speech-to-song illusion.

Because the speech-to-song illusion exposes a border between the perception of language and the perception of music, it is especially useful for illuminating how different aspects of acoustic input get emphasized in different contexts. Listeners may start with more acute perception of the prosody and songlike aspects of foreign languages, especially if they are very difficult to pronounce relative to their native tongue. The more closely acoustic input conforms to the sounds of their native language, the tighter a grip the language circuitry may have on that input, and the less accessible language-irrelevant (or less language-relevant) aspects of the sound may be.

To return to the initial hypothesis, although the harder-to-pronounce languages may have been difficult to imaginatively *speak* along with, they might have actually been easier to imaginatively *sing* along with. If language circuitry was less dominant in the processing of utterances in these languages, it may have been easier to disregard formant transitions and tune into the prosodic contour and timing of the pitch changes, features that are already more traditionally musical. Future work might examine people's capacity for vocal imitation in languages relatively easier or harder to pronounce, similar to Mantell and Pfordresher (2013), to investigate this hypothesis.

### **REFERENCES**


Patel, A. D. (2008). *Music, Language, and the Brain*. New York, NY: Oxford University Press.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Received: 08 October 2014; accepted: 09 January 2015; published online: 29 January 2015.*

*Citation: Margulis EH, Simchy-Gross R and Black JL (2015) Pronunciation difficulty, temporal regularity, and the speech-to-song illusion. Front. Psychol. 6:48. doi: 10.3389/fpsyg.2015.00048*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology.*

*Copyright © 2015 Margulis, Simchy-Gross and Black. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Sex Differences in Music: A Female Advantage at Recognizing Familiar Melodies

### Scott A. Miles<sup>1,2</sup>\*<sup>†</sup>, Robbin A. Miranda<sup>1,3</sup>\*<sup>†</sup> and Michael T. Ullman<sup>1</sup>\*

<sup>1</sup> Brain and Language Laboratory, Department of Neuroscience, Georgetown University, Washington, DC, USA, <sup>2</sup> Interdisciplinary Program in Neuroscience, Georgetown University, Washington, DC, USA, <sup>3</sup> Infinimetrics Corporation, Vienna, VA, USA

### Edited by:

McNeel Gordon Jantzen, Western Washington University, USA

### Reviewed by:

Cyrille Magne, Middle Tennessee State University, USA; Aaron J. Newman, Dalhousie University, Canada

### \*Correspondence:

Michael T. Ullman michael@georgetown.edu; Scott A. Miles sam337@georgetown.edu; Robbin A. Miranda raw25@georgetown.edu

<sup>†</sup>These authors have contributed equally to the work.

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 30 July 2015 Accepted: 12 February 2016 Published: 01 March 2016

### Citation:

Miles SA, Miranda RA and Ullman MT (2016) Sex Differences in Music: A Female Advantage at Recognizing Familiar Melodies. Front. Psychol. 7:278. doi: 10.3389/fpsyg.2016.00278

Although sex differences have been observed in various cognitive domains, there has been little work examining sex differences in the cognition of music. We tested the prediction that women would be better than men at recognizing familiar melodies, since memories of specific melodies are likely to be learned (at least in part) by declarative memory, which shows female advantages. Participants were 24 men and 24 women, with half musicians and half non-musicians in each group. The two groups were matched on age, education, and various measures of musical training. Participants were presented with well-known and novel melodies, and were asked to indicate their recognition of familiar melodies as rapidly as possible. The women were significantly faster than the men in responding, with a large effect size. The female advantage held across musicians and non-musicians, and across melodies with and without commonly associated lyrics, as evidenced by an absence of interactions between sex and these factors. Additionally, the results did not seem to be explained by sex differences in response biases, or in basic motor processes as tested in a control task. Though caution is warranted given that this is the first study to examine sex differences in familiar melody recognition, the results are consistent with the hypothesis motivating our prediction, namely that declarative memory underlies knowledge about music (particularly about familiar melodies), and that the female advantage at declarative memory may thus lead to female advantages in music cognition (particularly at familiar melody recognition). Additionally, the findings argue against the view that female advantages at tasks involving verbal (or verbalizable) material are due solely to a sex difference specific to the verbal domain. Further, the results may help explain previously reported cognitive commonalities between music and language: since declarative memory also underlies language, such commonalities may be partly due to a common dependence on this memory system. More generally, because declarative memory is well studied at many levels, evidence that music cognition depends on this system may lead to a powerful research program generating a wide range of novel predictions for the neurocognition of music, potentially advancing the field.

Keywords: music, music cognition, melody, declarative memory, recognition, sex differences, musical training, language

# INTRODUCTION

fpsyg-07-00278 March 1, 2016 Time: 16:24 # 2

Sex differences have been observed in various cognitive domains. For example, it has been suggested that boys and men have advantages at aspects of visuospatial cognition, while girls and women are better at aspects of verbal cognition (Kimura, 1999; Halpern, 2013). Sex differences in a variety of other domains have also been examined, though inconsistent findings and variability in the magnitude of the effects have led to questions about the existence of sex differences in cognition (Hyde, 2005).

There has been little examination, however, of sex differences in the cognition of music. This seems somewhat surprising, given the surge of research on music cognition in recent decades (Levitin and Tirovolas, 2009; Tirovolas and Levitin, 2011), as well as the apparent sex differences found in verbal cognition. Recent evidence suggests that the processing of language and music may be subserved by at least partially overlapping neural substrates (Patel, 2003; Brown et al., 2006). It is possible that some of the sex differences observed in language are driven by sex differences in these common substrates, suggesting they may extend to music cognition as well.

A relatively small number of neurocognitive studies have examined behavioral sex differences in aspects of music cognition. These studies have focused mainly on the low-level perception of single auditory events, such as those involved in spontaneous and click-evoked otoacoustic emissions (Snihur and Hampson, 2011), transient evoked otoacoustic emissions (Cassidy and Ditty, 2001), and pitch memory (Gaab et al., 2003). Music is, however, a complex phenomenon, consisting of several such events unfolding and interacting in time. It is possible that this focus on the low-level perception of single auditory events has left undetected behavioral sex differences in higher-level aspects of music cognition.

A useful distinction can be made between two higher level aspects of music cognition: knowledge of the general patterns of a musical system, often referred to as knowledge of musical syntax (Koelsch and Friederici, 2003; Koelsch et al., 2013; Sammler et al., 2013; Matsunaga et al., 2014) or schematic knowledge (Bharucha, 1994; Tillmann and Bigand, 2001; Huron, 2006); and knowledge of the idiosyncratic representations in music, such as of specific melodies, sometimes referred to as veridical knowledge (Bharucha, 1994; Huron, 2006). It has been proposed that much of the aesthetic value of music comes from the adherence to and violation of expectations generated by each of these two types of knowledge (Bharucha, 1994). It has also been proposed that the two types of knowledge can be dissociated, and may depend on different memory systems in the brain (Huron, 2006; Miranda and Ullman, 2007). This proposal is supported by an event-related potential (ERP) study demonstrating a double dissociation between the processing of violations of musical syntax and violations of familiar melodies, which involve idiosyncratic representations (Miranda and Ullman, 2007). Given these dissociations, it is possible that sex differences may be found in either syntactic (schematic) or idiosyncratic (veridical) aspects of music cognition, but not in both.

We are aware of two studies that have examined behavioral sex differences in higher-level aspects of music cognition (Koelsch et al., 2003a,b). Both of these focused on musical syntax, probing responses to violations of syntactic expectations. Though sex differences in electrophysiological brain responses (as measured by ERPs) were observed in both studies, neither found sex differences in performance. Of course, such null effects could be due to many factors. The possibility remains, however, that there are indeed performance advantages for one sex over the other in tasks of higher-level music cognition, but that these involve knowledge of idiosyncratic aspects of music rather than knowledge of musical syntax.

Indeed, as we shall see, some previous evidence suggests that knowledge regarding specific aspects of melodies is stored, at least in part, in declarative memory, a general-purpose memory system that is critical for learning idiosyncratic information in general, including in language. Crucially, declarative memory also shows sex differences, in particular a female advantage, including in the recognition of previously learned idiosyncratic verbal material such as vocabulary items. Thus it is possible that this female advantage might extend to aspects of music cognition that depend on this memory system. Specifically, a female advantage may be expected in the recognition of familiar melodies, which involve idiosyncratic representations. We tested this prediction in the present study by examining the performance of men and women in a familiar melody recognition task.

In the remainder of the Introduction, we first briefly summarize the nature of declarative memory and evidence suggesting sex differences in this system. We then lay out the evidence suggesting that in music cognition, the storage and retrieval of knowledge about specific melodies depends, at least in part, on declarative memory. Finally, we summarize the present study.

# Declarative Memory: Overview and Sex Differences

Declarative memory is quite well understood (for reviews, see Ullman, 2004, 2016; Henke, 2010; Squire and Wixted, 2011; Eichenbaum, 2012; Cabeza and Moscovitch, 2013). As its name suggests, this memory system underlies the learning, storage, and retrieval of explicit knowledge, which is available to conscious awareness – although increasing evidence indicates that it also subserves implicit knowledge (Henke, 2010; Ullman, 2016). The system is rooted in the hippocampus and other medial temporal lobe structures. These structures are critical for the learning and consolidation of new knowledge. The subsequent storage of much of this knowledge, however, eventually relies largely on neocortical regions, especially in the temporal lobes. Declarative memory may be specialized for learning arbitrary bits of information and binding them together (Henke, 2010; Squire and Wixted, 2011). Indeed, the system may be necessary for learning such idiosyncratic information. This may help explain evidence that damage to the declarative memory system can severely impair or even prevent the learning of knowledge about words and other idiosyncratic information (Squire and Wixted, 2011; Ullman, 2016).

Increasing evidence suggests a female advantage at declarative memory, including in idiosyncratic aspects of language (for a discussion and review of the literature, see Ullman et al., 2008). Studies have shown female advantages for a wide variety of episodic memory tasks (which crucially depend on declarative memory), including those testing verbal material, landmarks, objects, object locations, novel faces, and complex abstract patterns (Ullman et al., 2008). A female advantage has also been reported for word learning (Kaushanskaya et al., 2011) and for the retrieval of well-established (previously learned) knowledge, including in tests of vocabulary, lexical retrieval, and verbal fluency (Ullman et al., 2008). These behavioral female advantages are consistent with anatomical sex differences (Ullman et al., 2008). For example, the hippocampus seems to develop at a faster rate, with respect to the rest of the brain, in girls than in boys between the ages of one and sixteen (Pfluger et al., 1999). The behavioral and anatomical sex differences may be at least partly mediated by estrogen, which is found in higher levels in girls and (pre-menopausal) women than in boys and men (Wilson et al., 1998), and affects declarative memory and hippocampal structure and function, through both organizational effects in utero and activation effects later on (Phillips and Sherwin, 1992).

Given the dependence of idiosyncratic (and other) aspects of language on declarative memory (Ullman, 2004, 2016), many if not most of the previously reported sex differences in language may in fact be explained by broader, domain-independent sex differences in the declarative memory system (Ullman, 2004, 2016; Ullman et al., 2008). Accordingly, the female advantage at the storage and retrieval of idiosyncratic representations may extend beyond previously studied verbal and non-verbal domains and functions to music cognition – in particular to the storage and retrieval of knowledge about specific melodies.

# Melodies, Declarative Memory, and Expected Sex Differences

As we have seen, the cognition of music, like that of language, requires the memorization of specific, idiosyncratic representations, including of familiar melodies. Melodies contain specific sequences of notes that must be veridically learned, even though the sequences are also schematically constrained by the syntax of a musical system – much like words involve particular sequences of phonemes that are also constrained by the rules of phonotactics. Given that declarative memory seems to underlie the learning and storage of knowledge about words, and more generally may be necessary for learning arbitrary bits of information and binding them together, it may be expected that this system is also critical for learning idiosyncratic representations in music, including knowledge about specific melodies.

Some evidence already suggests that this may be the case. In an electrophysiological study, an ERP component characterized as an N400 was observed in response to expectation violations resulting from altered notes within melodies that were well known (and thus likely to be familiar to participants), but not to violations of notes within novel melodies (Miranda and Ullman, 2007). N400s, which originate in part in the medial temporal lobe (McCarthy et al., 1995; Meyer et al., 2005), and are found in response to a variety of lexical stimuli, as well as to idiosyncratic non-verbal stimuli such as objects and faces (Kutas and Federmeier, 2011), have been linked to declarative memory (Ullman, 2001, 2016). The findings of the music ERP study (Miranda and Ullman, 2007) thus suggest that, like knowledge of these various types of non-musical idiosyncratic information, knowledge about familiar melodies may also be stored in and retrieved from declarative memory.

Given the female advantages observed in other tasks involving declarative memory, including in both the learning of new knowledge and the retrieval of previously learned information, such advantages might also extend to knowledge of idiosyncratic representations in music, including of familiar melodies. We thus predicted a female advantage at recognizing familiar melodies.

# The Present Study

To test this prediction we examined the recognition of well-known melodies in adults. We focused on the recognition of already-known melodies, rather than the learning of new melodies, because previous evidence suggests that consolidation – even over the course of months or longer – can significantly affect outcomes (Marshall and Born, 2007; Morgan-Short et al., 2012).

Healthy men and women were presented with both well-known and novel melodies. Participants were asked to indicate as quickly and accurately as possible during the presentation of each melody whether they were familiar with it. Response time (RT) as well as accuracy measures were obtained. RTs typically provide greater variability than accuracy, and minimize the likelihood of ceiling effects. In addition, some previous evidence suggests that the time element may be important in revealing the hypothesized female advantages (Walenski et al., 2008).

We examined both musicians and non-musicians. This allowed us to test how broadly the findings may hold across musical training. Testing across musicians and non-musicians is also important because previous studies examining neural sex differences have found interactions between sex and musical training (Evers et al., 1999; Hutchinson et al., 2003). Musicians might be expected to show stronger representations of familiar melodies simply due to greater exposure (Besson and Faïta, 1995). It is also plausible that members of either sex might have had greater previous exposure to the well-known melodies than members of the other sex. To attempt to address these issues, after each of their timed recognition responses, participants were asked to report a familiarity rating for the melody. By covarying out these ratings in our analyses, we were able to test whether any group differences in performance held even when familiarity was held constant.

All of the stimuli were presented instrumentally. However, since many of the melodies in the study are commonly associated with lyrics, any observed female advantages could in principle be due to advantages in the verbal domain, rather than in familiar melody recognition itself. We therefore separated the melodies into those that are or are not associated with lyrics, to be able to test whether any sex differences might hold across both.

Finally, it is possible that any observed sex differences in the recognition of melodies might be due to sex differences in basic motor processes, rather than differences in music cognition. To help rule out this possibility, we also gave participants a control task, in which they were asked to respond to single tones as quickly as possible. If the sex differences were due to lower-level motor processes, any differences in the experimental task might also be reflected in the results of the control task.

Overall, given the hypothesis that the female advantage in declarative memory should extend to knowledge about familiar melodies, we predicted that women would show faster and perhaps more accurate recognition of well-known melodies than men. Moreover, we expected this advantage to hold broadly, over both musicians and non-musicians, and across melodies with and without lyrics, and that the advantage would not be fully explained by sex differences in familiarity or in basic motor processes.

# MATERIALS AND METHODS

# Participants

Participants were right-handed native speakers of American English. They had no known developmental, neurological, or psychiatric disorders. Since familiarity with the well-known melodies used in this study is largely culture-dependent, we selected only participants who had not lived outside of the United States for more than 6 months before the age of 18. Research methods were approved by the Institutional Review Board at Georgetown University. All participants gave written informed consent and received monetary compensation for their participation.

Two groups of participants were tested: 24 men and 24 women. Half of the participants within each group were musicians and half were non-musicians. The musicians had at least 4 years of formal musical training, which was defined as private instrument or voice lessons, or participation in a musical ensemble. The non-musicians had 1 year or less of formal musical training. In our initial analysis of RTs to well-known melodies (described below), we found that two of the participants were outliers (one female musician and one female non-musician), each having a mean RT greater than two standard deviations from the mean RT for their respective participant subgroup. The data from these two participants were excluded and replaced with data from two newly tested participants: one female musician and one female non-musician.

The final two groups of participants therefore also consisted of 24 participants each. **Table 1** shows information for each of the four 12-member subgroups regarding age, years of education, handedness (Oldfield, 1971), years of formal musical training, and (for the musicians only) age when formal musical training began, number of years since last formal musical training, number of instruments played (including voice), and number of participants who still regularly played an instrument or sang at the time of testing. Results from 2 × 2 analyses of variance (ANOVAs), with the factors Sex (male/female) and Musical Training (musician/non-musician), confirmed that the four subgroups did not differ significantly in age [Sex: F(1,44) = 0.20, p = 0.656, Musical Training: F(1,44) = 0.20, p = 0.656, Sex by Musical Training: F(1,44) = 0.04, p = 0.848], years of education [Sex: F(1,44) = 0.34, p = 0.561, Musical Training: F(1,44) = 0.18, p = 0.677, Sex by Musical Training: F(1,44) = 0.03, p = 0.868], or handedness [Sex: F(1,41) = 0.03, p = 0.870, Musical Training: F(1,41) = 0.46, p = 0.500, Sex by Musical Training: F(1,41) = 3.27, p = 0.078; note that values were missing from three participants; see **Table 1**]. Importantly, the male and female musicians did not differ significantly in the number of years of formal musical training [t(22) = 0.46, p = 0.653]; the same was true for male and female non-musicians [t(22) = 0.67, p = 0.511]. Furthermore, there were no significant differences between male and female musicians regarding the age when musical training began [t(22) = 1.47, p = 0.156], the number of years since last formal musical training [t(22) = 1.11, p = 0.278], the number of instruments (including voice) played by each participant [t(22) = 0.99, p = 0.335], or the number of participants who regularly played a musical instrument or sang at the time of the experiment [t(22) = 0.80, p = 0.430].

# Stimuli

The musical stimuli consisted of 260 melodies ranging from 4.1 to 15.8 s in length (mean = 8.2 s, SE = 0.17). The stimuli were created in MIDI format using Finale Version 3.5.1 (Coda Music) and then converted to WAV files with a "grand piano" sound font using MidiSyn Version 1.9 (Future Algorithms). All melodies were in the key of C-major or C-minor. Half of the melodies (130) were segments from well-known tunes (see Appendix, in Supplementary Material), including traditional, folk, children's, patriotic, holiday, classical, and pop music, as well as themes from movies, television, and Broadway musicals. The other half (130) were novel melodies composed by one of the authors (RM). The novel melodies served only as foils for the familiar melody recognition task, and are not reported or analyzed here. Each novel melody was composed to correspond to one of the well-known melodies. More specifically, the tempo and implied harmony (possible accompanying chords that are not present, but strongly suggested by the sequence of notes in the melody) of each novel melody were identical to those of its corresponding well-known melody; moreover, pitch range was closely matched. Distinctive rhythms were slightly altered in some of the novel melodies in order to minimize false recognition of these melodies based on rhythm. False recognition of novel melodies based on rhythm was not of great concern, in any case, since pitch structure has been found to be a better cue for the recognition of melodies than rhythmic structure (Hébert and Peretz, 1997).

# Experimental Task

TABLE 1 | Participant information on age, education, and musical training.

For the purpose of counterbalancing, the 260 melodies were presented over the course of three runs, with each run containing a similar number (43 or 44) of well-known and novel melodies. Any given well-known melody and its matched novel melody were always presented in separate runs. The order of the three runs was counterbalanced across participants, such that for every six participants in each of the subgroups, the runs were presented in all possible orders. The presentation order of well-known and novel melodies was randomized within each run for each participant. Completion time for each run was approximately 15 min.
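The counterbalancing scheme described above can be illustrated with a minimal sketch. Three runs can be ordered in 3! = 6 ways, so cycling through the six permutations gives each block of six participants every order exactly once. The run labels and the function name `assign_run_orders` are hypothetical, and the actual assignment procedure used in the study is not specified beyond the description in the text.

```python
from itertools import permutations

# Hypothetical labels for the three runs; the study does not name them.
RUNS = ("A", "B", "C")


def assign_run_orders(n_participants):
    """Cycle through all 3! = 6 run orders so that every consecutive
    block of six participants receives each order exactly once."""
    orders = list(permutations(RUNS))
    return [orders[i % len(orders)] for i in range(n_participants)]


# One 12-member subgroup: each of the six orders is used twice.
subgroup_orders = assign_run_orders(12)
for participant, order in enumerate(subgroup_orders, start=1):
    print(participant, "-".join(order))
```

With 12 participants per subgroup, this yields two complete cycles of the six possible orders.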

Melodies were presented on a laptop computer running Microsoft Windows, using Meds 2002 Revision B-1 (UCLA, Los Angeles). Participants were instructed to listen to each melody and to press the space bar as soon as the melody sounded familiar. If the melody was not recognized as familiar, the participant was instructed to wait until the end of the melody and then press the space bar to advance (only the keystrokes that occurred prior to the end of the melody were analyzed as responses). The full melody was presented regardless of when the space bar was pressed.

Immediately after the melody was completed and the space bar was pressed (whichever came last), the participant was prompted to rate the familiarity of the melody from 0 to 100, with 0 being most familiar (we selected this rating scale due to software constraints). Prior to testing, each participant received written instructions specifying that a rating of "0" should indicate "very familiar" melodies that the participant would be able to hum along with, whereas a rating of "100" should indicate melodies that were not familiar at all to the participant. The rating scale was shown on the screen as a horizontal scroll bar with "0" on the left and "100" on the right, with the words "Familiar" and "Unfamiliar" positioned under the left and right sides of the bar, respectively. The participant used a mouse to move a marker on the scroll bar to select the rating of his or her choice. As expected, the participants were indeed broadly familiar with the well-known melodies (mean rating of 17.9, SD = 9.0).

All participants were instructed to press the space bar with the left hand and to operate the mouse with the right hand, keeping the left hand just over the space bar at all times in order to minimize RTs. Before starting the experiment, each participant was given a practice run that included eight melodies, four of which were well known and four of which were novel.

### Control Task

After five participants had been tested on the experimental task, a control task was added to determine whether possible RT differences between participant groups could be attributed to group-wide differences in basic motor functions. The remaining participants (9 male musicians, 11 male non-musicians, 11 female musicians, and 12 female non-musicians) were given this task after completing all three runs of the experimental task. The control task included 20 tones of different pitches, each 500 ms long, presented at staggered intervals (between 0.3 and 2.1 s) after the participant's previous response. Each participant was instructed to press the space bar with the left hand as soon as s/he heard a tone. Analysis of these RTs for each participant group revealed that three participants (one male musician, one female musician, and one female non-musician) were outliers, each having a mean RT greater than two standard deviations from the mean RT of their corresponding participant subgroup. Data from these participants were excluded from analyses of this task, and the data from the remaining eight male musicians, 11 male non-musicians, 10 female musicians, and 11 female non-musicians were subjected to full analysis.

Table 1 notes (partially recovered): [...] than 12 participants, since Edinburgh Handedness Inventory data were missing from one participant in each of these subgroups. NA, not applicable. See main text for details.

# RESULTS

# Response Times to Well-Known Melodies

Means for the recognition RTs to well-known melodies – that is, the latencies of responses registered during the presentation of these melodies – are shown for each of the four subgroups in the first column of **Table 2**. Prior to analysis, these were natural log transformed. Next, we eliminated very slow trials, which might result from diminished attention to the task. Specifically, for each participant, we eliminated trials with RTs that were greater than two standard deviations (SDs) above that participant's mean. This resulted in the exclusion of a total of 2.69% of responses as outliers (135 out of 5,012 correct responses to well-known melodies). To maintain an overall Type I Error probability of 0.05, we applied the Bonferroni correction: since six AN(C)OVAs were performed on the data from the experimental task, the significance level was set at 0.05/6 = 0.0083.
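The preprocessing steps just described can be sketched as follows. This is an illustrative sketch only: the function name `trim_slow_trials` and the example RTs are hypothetical, and it assumes that the two-SD cutoff was applied to the log-transformed values using the sample standard deviation, which the text does not state explicitly.

```python
import math
from statistics import mean, stdev


def trim_slow_trials(rts_ms, n_sd=2.0):
    """Natural-log transform one participant's RTs, then drop trials more
    than n_sd sample SDs above that participant's own mean (slow trials only)."""
    log_rts = [math.log(rt) for rt in rts_ms]
    cutoff = mean(log_rts) + n_sd * stdev(log_rts)
    return [x for x in log_rts if x <= cutoff]


# Bonferroni-corrected significance level for six AN(C)OVAs.
ALPHA = 0.05
N_TESTS = 6
bonferroni_alpha = ALPHA / N_TESTS

# Made-up RTs (ms) for one participant; the last trial is unusually slow.
example_rts = [2100, 2300, 1900, 2500, 2200, 2000, 2400, 2150, 2250, 9000]
trimmed = trim_slow_trials(example_rts)
print(len(example_rts) - len(trimmed), "trial(s) removed")
print("corrected alpha:", round(bonferroni_alpha, 4))
```

Only the upper tail is trimmed, matching the stated aim of removing very slow trials that might reflect diminished attention.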

These transformed and filtered RTs were then entered into a 2 × 2 ANOVA, with Sex (male/female) and Musical Training (musician/non-musician) as between-group factors. The ANOVA yielded a significant (i.e., following Bonferroni correction) main effect of Sex [F(1,44) = 11.09, p = 0.002, $\eta_p^2$ = 0.201], with a large effect size (Cohen, 1988), indicating that women were significantly faster than men at responding to well-known melodies; see **Figure 1**. There was no significant main effect of Musical Training [F(1,44) = 6.27, p = 0.016, $\eta_p^2$ = 0.125] (though there was a tendency for musicians to respond faster than non-musicians), nor was there any interaction between Sex and Musical Training [F(1,44) = 0.001, p = 0.981, $\eta_p^2$ < 0.001], suggesting that the female advantage held similarly for musicians and non-musicians.
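As a cross-check on the effect sizes reported above, partial eta squared can be recovered from an F statistic and its degrees of freedom through the standard identity, partial eta squared = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the reported F values; the function name is hypothetical and this is not presented as the authors' own computation.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic:
    SS_effect / (SS_effect + SS_error) = F*df_effect / (F*df_effect + df_error)."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F values and degrees of freedom as reported for the 2 x 2 ANOVA on RTs.
sex_effect = partial_eta_squared(11.09, 1, 44)
training_effect = partial_eta_squared(6.27, 1, 44)
interaction_effect = partial_eta_squared(0.001, 1, 44)

print(round(sex_effect, 3), round(training_effect, 3), interaction_effect < 0.001)
```

The values obtained this way agree with the reported effect sizes to three decimal places.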

### Familiarity as a Possible Confound

There was a significant correlation between participants' mean recognition RTs and their mean familiarity ratings for well-known melodies [r(46) = 0.62, p < 0.001]. Accordingly, it is possible that women were faster at responding to well-known melodies simply because they were more familiar with the melodies, as compared to men. If this were the case, then including familiarity ratings as a covariate in the analysis would be expected to eliminate the finding of sex differences in RTs.

To examine this issue, a 2 (Sex) × 2 (Musical Training) analysis of covariance (ANCOVA) was performed on recognition RTs, with the covariate constituting each participant's mean familiarity rating over all of the well-known melodies. The pattern of significance was identical to that described above. The analysis yielded a main effect of Sex [F(1,44) = 9.79, p = 0.003, $\eta_p^2$ = 0.185], with a large effect size, but there was no significant effect of Musical Training [F(1,44) = 4.04, p = 0.051, $\eta_p^2$ = 0.086], nor an interaction between Sex and Musical Training [F(1,44) = 1.22, p = 0.276, $\eta_p^2$ = 0.028]. These results suggest that the effects of Sex on recognition RTs could not be explained by group differences in familiarity. [Note that an ANOVA on mean familiarity ratings over all well-known melodies revealed no effects of Sex (F(1,44) = 1.57, p = 0.217, $\eta_p^2$ = 0.034) or Musical Training (F(1,44) = 1.96, p = 0.168, $\eta_p^2$ = 0.043), nor an interaction between them (F(1,44) = 2.23, p = 0.142, $\eta_p^2$ = 0.048)].

### Response Bias as a Possible Confound

It is possible that the observed sex difference in RTs could be explained by differential response biases between the men and women. In particular, if the women had a greater tendency to respond with a recognition key press to all stimuli (novel as well as well-known melodies), this might account for their RT advantage in recognizing familiar melodies.

To address this concern, we performed a 2 (Sex) × 2 (Musical Training) ANOVA on bias scores [c = −0.5 × (z(Hit rate) + z(False Alarm rate))]. This analysis revealed no main effects and no interaction [Sex: F(1,44) = 2.19, p = 0.146, η<sub>p</sub><sup>2</sup> = 0.048; Musical Training: F(1,47) = 2.95, p = 0.093, η<sub>p</sub><sup>2</sup> = 0.063; Sex by Musical Training: F(1,44) = 0.12, p = 0.726, η<sub>p</sub><sup>2</sup> = 0.003], suggesting that there were no differences between the groups in their response biases. This in turn suggests that the advantage for women over men at RTs in recognizing familiar melodies could not be explained by group differences in response biases.
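As an illustration of the bias score, the sketch below computes the signal-detection criterion c from a hit rate and a false alarm rate. It is a minimal example rather than the authors' code: the function name `criterion_c` and the example rates are hypothetical, and z is taken to be the inverse of the standard normal cumulative distribution function.

```python
from statistics import NormalDist


def criterion_c(hit_rate: float, false_alarm_rate: float) -> float:
    """Signal-detection criterion: c = -0.5 * (z(hit rate) + z(false alarm rate)).

    Both rates must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at exactly 0 or 1.
    """
    z = NormalDist().inv_cdf  # z-transform: inverse standard normal CDF
    return -0.5 * (z(hit_rate) + z(false_alarm_rate))


# Hypothetical participant who presses the recognition key readily
liberal = criterion_c(hit_rate=0.95, false_alarm_rate=0.30)

# Hypothetical participant with symmetric hit and false alarm rates
neutral = criterion_c(hit_rate=0.90, false_alarm_rate=0.10)

print(liberal < 0)          # negative c: bias toward "recognized" responses
print(abs(neutral) < 1e-9)  # c near zero: no bias in either direction
```

Under this convention, a group with a stronger tendency to respond "recognized" to all stimuli would show more negative c values, which is the pattern the ANOVA on bias scores was designed to detect.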

### Verbal Ability as a Possible Confound

As mentioned above, although the musical stimuli were presented without lyrics, many of the well-known melodies used in the study are often associated with lyrics. Thus, it might be argued that the female participants' speed advantage at recognizing familiar melodies may have been specifically due to faster RTs for those melodies associated with lyrics, which women recognized more quickly because of their verbal associations. On this view, the sex differences observed here might be explained by a female advantage at processing verbal information, rather than an advantage at recognizing purely musical aspects of familiar melodies. If this were the case, we might expect to see an interaction between the factors of Sex and "Lyricness" (i.e., whether or not melodies are associated with lyrics). On the other hand, no such interaction would be expected if the sex difference held similarly across melodies that are associated with lyrics and those that are not.

Means (and standard deviations), computed over participants' untransformed data (i.e., without natural log or arcsine transformations). ms, milliseconds.

To examine this issue, we first assessed each well-known melody's association with lyrics by testing six native speakers of American English (four women, two men), ages 19–36 (mean = 23.8 years), with 1–14 years of musical training (mean = 9.2 years), none of whom had lived outside the United States for more than 6 months before age 18. None of these six participants were included in the larger experiment. The participants listened to all of the 130 well-known melodies. After each melody, they were presented with two questions, to which they responded "Yes" or "No." The questions were presented, one after another, on a computer screen: "(1) Are you familiar with this melody?" and "(2) Do you associate this melody with any lyrics?" For the second question, participants were instructed to answer "Yes" to any melody for which they thought they knew either the actual lyrics or any other (informal) lyrics (e.g., any lyric that they had ever heard or sung with that particular melody). To determine the strength of the association between each melody and its possible lyrics, a "lyric familiarity" score was calculated as the percentage of participants who associated lyrics with the melody, only out of those participants who were familiar with the melody itself (since unfamiliarity with a melody inevitably resulted in unfamiliarity with that melody's lyrics). Of the 130 melodies, 105 received a lyric familiarity score of 50% or higher (mean = 86.0%) and were considered "lyrics" melodies, while the remaining 25 melodies received a score below 50% (mean = 8.6%) and were considered "no-lyrics" melodies.
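The scoring rule described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' procedure: the function names and the example responses are hypothetical, and each rater's answers are represented as a pair of booleans (familiar with the melody, associates lyrics with it).

```python
from typing import List, Optional, Tuple


def lyric_familiarity(responses: List[Tuple[bool, bool]]) -> Optional[float]:
    """Percentage of raters who associate lyrics with a melody,
    computed only over raters who are familiar with the melody itself.

    Each response is (familiar_with_melody, associates_lyrics).
    Returns None when no rater is familiar with the melody.
    """
    familiar = [has_lyrics for is_familiar, has_lyrics in responses if is_familiar]
    if not familiar:
        return None
    return 100.0 * sum(familiar) / len(familiar)


def classify_melody(score: float, threshold: float = 50.0) -> str:
    """Label a melody 'lyrics' at or above the 50% threshold, else 'no-lyrics'."""
    return "lyrics" if score >= threshold else "no-lyrics"


# Hypothetical answers from six raters for one melody
example = [(True, True), (True, True), (True, False),
           (False, False), (True, True), (True, True)]

score = lyric_familiarity(example)  # 4 of the 5 familiar raters report lyrics
print(score, classify_melody(score))  # 80.0 lyrics
```

Restricting the denominator to raters who know the melody keeps unfamiliarity with a tune from being counted as evidence that the tune has no lyrics.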

We then performed an ANOVA with the between-group factors Sex and Musical Training, and the within-group factor Lyricness (lyrics/no-lyrics melodies). This yielded a main effect of Sex [F(1,42) = 8.87, p = 0.005, η<sub>p</sub><sup>2</sup> = 0.168] as well as of Musical Training [F(1,42) = 7.78, p = 0.008, η<sub>p</sub><sup>2</sup> = 0.150], both with large effect sizes, but no interaction between Sex and Musical Training [F(1,42) = 0.05, p = 0.833, η<sub>p</sub><sup>2</sup> = 0.001]. Importantly, there was no significant main effect of Lyricness [F(1,42) = 5.450, p = 0.024, η<sub>p</sub><sup>2</sup> = 0.110], nor any significant interactions between Sex and Lyricness [F(1,42) = 0.930, p = 0.340, η<sub>p</sub><sup>2</sup> = 0.021], between Musical Training and Lyricness [F(1,42) = 0.390, p = 0.536, η<sub>p</sub><sup>2</sup> = 0.009], nor among Sex, Musical Training, and Lyricness [F(1,42) = 0.532, p = 0.470, η<sub>p</sub><sup>2</sup> = 0.012]. This analysis suggests that the RT advantage for women at the recognition of familiar melodies held similarly for melodies that were associated with lyrics and those that were not.

### Basic Motor Processes as Possible Confounds

To test for the possibility that sex differences in basic motor processes could account for the women's RT advantage over men, we administered a control task (see Materials and Methods for details, and **Table 2** for mean RTs by subgroup). Prior to analyses, the RTs were natural log transformed. Next, negative RTs (1.9% of all responses) resulting from premature responses were excluded from analysis. There were no very slow RTs (RTs greater than two SDs above each participant's mean), so none were eliminated.
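The preprocessing steps described here can be sketched for a single participant as follows. This is a minimal illustration, not the authors' pipeline. It assumes that premature (negative) RTs are dropped before the log transform, since the logarithm of a negative number is undefined, and that the two-SD cutoff is applied to the log-transformed values; the text does not state either detail. The function name `preprocess_rts` and the example data are hypothetical.

```python
import math
import statistics
from typing import List


def preprocess_rts(rts_ms: List[float], sd_cutoff: float = 2.0) -> List[float]:
    """Clean one participant's reaction times (in milliseconds).

    1. Drop non-positive RTs (premature responses).
    2. Natural-log transform the remaining RTs.
    3. Drop values more than `sd_cutoff` SDs above the participant's mean.
    """
    log_rts = [math.log(rt) for rt in rts_ms if rt > 0]
    if len(log_rts) < 2:
        return log_rts  # too few values to estimate a standard deviation
    upper = statistics.mean(log_rts) + sd_cutoff * statistics.stdev(log_rts)
    return [value for value in log_rts if value <= upper]


# Hypothetical RTs: one premature response (-120) and one very slow response (2900)
raw = [-120, 310, 295, 330, 305, 320, 300, 315, 2900]
cleaned = preprocess_rts(raw)
print(len(cleaned))  # 7
```

Because the cutoff is computed per participant, a generally slow responder is not penalized relative to a fast one; only responses that are slow relative to that participant's own distribution are removed.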

The 2 (Sex) × 2 (Musical Training) ANOVA on these RTs yielded no main effects of Sex [F(1,36) = 0.094, p = 0.760, η<sub>p</sub><sup>2</sup> = 0.003] or of Musical Training [F(1,36) = 0.778, p = 0.383, η<sub>p</sub><sup>2</sup> = 0.021], and no interaction between them [F(1,36) = 0.736, p = 0.396, η<sub>p</sub><sup>2</sup> = 0.020]. This suggests that the group differences in recognition RTs to well-known melodies are not likely to be explained by group differences in basic motor processes (at least those measured by this task).

### Accuracy

To examine whether the findings of a female advantage might extend beyond RTs, we also examined accuracy. Each participant's percentage of correct recognition responses to all well-known melodies constituted the dependent variable in this analysis; see **Table 2**. These percentages were arcsine-transformed prior to analyses. A 2 (Sex) × 2 (Musical Training) ANOVA revealed no significant main effects, that is, neither of Sex [F(1,47) = 6.755, p = 0.013, η<sub>p</sub><sup>2</sup> = 0.132], nor of Musical Training [F(1,47) = 6.189, p = 0.017, η<sub>p</sub><sup>2</sup> = 0.123], nor an interaction between them [F(1,47) = 0.008, p = 0.928, η<sub>p</sub><sup>2</sup> < 0.001].
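For readers unfamiliar with the arcsine transformation, the sketch below shows one common variant, the arcsine of the square root of the proportion. The text does not specify which variant was used (some analyses multiply the result by two), so this form is an assumption, and the function name `arcsine_transform` is hypothetical.

```python
import math


def arcsine_transform(percent_correct: float) -> float:
    """Arcsine square-root transform of a percentage (0-100), in radians.

    Stretches values near 0% and 100%, where the variance of a
    proportion is compressed, before entering them into an ANOVA.
    """
    proportion = percent_correct / 100.0
    return math.asin(math.sqrt(proportion))


# Equal 10-point steps in accuracy are not equal steps after the transform
gain_mid = arcsine_transform(60) - arcsine_transform(50)
gain_high = arcsine_transform(100) - arcsine_transform(90)
print(gain_high > gain_mid)  # True: differences near ceiling are expanded
```

The transform matters most when many participants score close to ceiling, as is plausible for recognition of well-known melodies.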

### DISCUSSION

This study examined the prediction that women would have an advantage at recognizing familiar melodies, as compared to men. Indeed, women were significantly faster than men at recognizing familiar melodies, based on a Bonferroni corrected significance level. This sex difference yielded a large effect size (defined as η<sub>p</sub><sup>2</sup> ≥ 0.138; Cohen, 1988). The result held across musicians and non-musicians, as reflected by the absence of an interaction between sex and musicianship.

Unlike in the case of recognition RTs, we did not find a significant female advantage in our measure of accuracy, after correcting for multiple comparisons. However, as discussed above, accuracy is a less sensitive indicator of performance than RT. Perhaps for this reason, a female advantage was found for RTs but not accuracy in a recent study of lexical retrieval (Walenski et al., 2008). Indeed, it is possible that women are more accurate than men in their familiarity recognition responses, but our sample sizes (two groups of 24 participants each) were not large enough to demonstrate this effect. The finding of a significant female advantage in accuracy prior to correction for multiple comparisons is consistent with this view – especially since Bonferroni correction is quite conservative.

The female RT advantage was not explained by a number of potentially confounding factors. First, there were no significant group differences in various demographic variables that might have otherwise accounted for the observed advantages. The four subgroups (male musicians, male non-musicians, female musicians, and female non-musicians) did not differ in age, years of education, or handedness. Additionally, the male and female musicians did not differ in years of formal musical training, and likewise for the male and female non-musicians. The male and female musicians also did not differ regarding the age when their musical training began, the years since their last musical training, the number of instruments played (including voice), or the number of participants in each subgroup who were currently engaged in instrumental or vocal activities. Second, the advantages were not explained by group differences in familiarity with the well-known melodies. It might be suggested that the women were faster at recognizing well-known melodies because they were simply more familiar with the melodies than the men. However, the female advantage was observed even when familiarity ratings were covaried out. Third, since there were no group differences or interactions on bias scores, group differences in bias are also not likely to explain the observed female advantage. Fourth, the advantages could not be fully accounted for by associations between the melodies and lyrics. It might be argued that a female advantage in the verbal domain could explain the sex difference observed here, rather than an advantage in the recognition of familiar melodies per se. In particular, since quite a few of the melodies in the study are associated with lyrics, it might have been the case that the female advantage held only or mainly for these items. However, there were no significant interactions between lyricness and sex, suggesting the speed advantage for women held across melodies that are and are not commonly associated with lyrics. This in turn suggests that the findings cannot be explained by a female advantage purely in the verbal domain. Fifth, it is not likely that group differences in basic motor processes accounted for the female advantage in melody recognition, since there were no significant differences between the groups in performance during a simple tone detection control task. This suggests that at least the basic motor processes examined in this task did not differ between the groups, and thus were not likely to have explained the observed differences in melody recognition.

We suggest instead that the sex differences in recognition RTs are at least partly explained by the previously reported female advantage at declarative memory. As discussed in Section "Introduction," this advantage has been found not only for learning new material, but also for the retrieval of previously learned material, as was tested in the present study. Together with independent electrophysiological evidence suggesting that the processing of familiar melodies depends at least in part on declarative memory (see Introduction, and Miranda and Ullman, 2007), the data from the present study suggest that the female advantage at declarative memory may indeed extend to music cognition, in particular to the retrieval of stored knowledge about melodies. However, given that this is the first study to examine sex differences in familiar melody recognition, some caution in interpreting the findings is warranted; see Section "Limitations and Future Studies" below for further discussion.

The claim that knowledge about familiar melodies depends on declarative memory does not presuppose that this is the only memory or other cognitive system involved in the learning, storage, or retrieval of such knowledge. For example, attention and working memory systems may be expected to play roles, at least in part because of their interactions with the declarative memory system for learning and retrieval (Ullman, 2004, 2016).

A role for declarative memory in stored knowledge of melodies also would not preclude additional roles for this system in music cognition. One interesting possibility is that declarative memory might, to some extent, play redundant roles with procedural memory in certain aspects of music cognition – for example, in learning and processing syntactic (schematic) knowledge, that is, knowledge about the regularities of musical systems. Increasing evidence suggests that such redundancy between declarative and procedural memory exists for language and other domains (Ullman, 2004, 2016). For example, individuals or groups with declarative memory advantages, or with deficits of procedural memory, appear to rely more on declarative memory, relative to procedural memory, for various grammatical functions (Ullman and Pullman, 2015; Ullman, 2016). Of particular interest here, girls and women seem to rely more on declarative memory than boys and men for aspects of grammar, likely due in part to the female advantages at declarative memory (Ullman et al., 2008; Ullman, 2016). It is plausible that such a sex difference might be found analogously for syntactic aspects of music cognition. Intriguingly, two studies have reported more bilateral negativities in girls and women than boys and men in response to syntactic anomalies within musical stimuli (Koelsch et al., 2003a,b). Although these negativities had primarily anterior distributions, their bilaterality suggests the possibility that they may be related to N400s, consistent with a greater dependence of musical syntactic processing on declarative memory in females than males. Indeed, such redundancy is consistent with the lack of sex differences in performance reported in these studies, since the errors may be processed equally well in the two systems (Ullman, 2004, 2016). However, this interpretation of these studies should be treated with caution, and future research is needed.

Although the goal of the present study was to test sex differences in melody recognition, and the observed female advantage was indeed the most robust effect, an advantage of musicians over non-musicians was also found. Musicians showed a significant (i.e., following Bonferroni correction) RT advantage in the analysis with lyricness as a factor, as well as RT and accuracy advantages that were significant or borderline significant prior to Bonferroni correction, in other analyses. The cause of this apparent effect is not entirely clear. One possibility is that musicians simply have greater familiarity with the melodies. Another possibility is that the training involved in learning to perform music results in improvements in declarative memory. Indeed, some evidence hints at declarative memory improvements from other types of training (Draganski et al., 2006; Woollett and Maguire, 2011). Alternatively (or in addition), perhaps individuals with better declarative memory (and maybe other advantages as well) are more likely to become musicians, or to stick with musical training. Finally, the fact that a significant musician advantage only emerged in the analysis with lyricness may be attributed to a reduction of the error term in this analysis due to the inclusion of this factor. Future studies examining the apparent musician advantage at familiar melody recognition seem warranted.

### Implications

The present study has implications for various disciplines and endeavors. In the domain of music cognition, together with the ERP results of Miranda and Ullman (2007), it provides evidence suggesting that knowledge of melodies depends at least in part on declarative memory. This, in turn, has further implications. First of all, it suggests that, like language, music cognition may depend on general-purpose brain systems. We emphasize, however, that portions of these systems could become subspecialized for aspects of music cognition, both evolutionarily and developmentally, as has been suggested for language (Ullman, 2004, 2016).

Importantly, because declarative memory has been well studied at multiple levels (including its behavioral, computational, neuroanatomical, physiological, cellular, molecular, genetic, and pharmacological correlates), this vast independent knowledge about the memory system could also pertain to music cognition (Ullman, 2004, 2016). Thus, as with language, linking music cognition to declarative memory could generate a wide range of novel predictions that there might be no independent reason to make based on the more circumscribed study of music cognition alone (Ullman, 2016). For example, the anatomical, developmental and genetic correlates of declarative memory might also be expected to underlie music, in particular ways. An understanding of the dependence of music cognition on declarative memory may therefore provide important insights regarding the evolution and development of music cognition. Overall, linking music to declarative memory could prove to be a powerful approach that may lead to substantial advances in the understanding of the neurocognition of music. These advances could include efforts to understand how knowledge about specific melodies contributes to the development, within the brains of listeners, of musical expectations. Such an understanding is crucial to the effort to understand how music is able to evoke powerful emotions and pleasure in listeners.

Linking aspects of music cognition to declarative memory could also help clarify commonalities between the cognition of music and language. Unlike proposals that have suggested that music cognition has 'piggybacked' on language circuitry (e.g., Pinker, 1997), here we suggest that the language/music neurocognitive commonality lies at least in part with declarative memory (also see Miranda and Ullman, 2007). On this view, this general-purpose system may underlie the cognition of both language and music, rather than music cognition depending directly on language circuitry. Of course, such a common dependence on declarative memory does not preclude any additional 'piggybacking' of music cognition on language (or vice versa) – either in portions of declarative memory that have evolutionarily or developmentally become specialized for language, or in any additional circuitry that might be specific to language (Ullman, 2004, 2016). Moreover, a joint language/music dependence on declarative memory does not preclude any additional joint dependence on other brain systems, including working memory and procedural memory (Miranda and Ullman, 2007).

From the perspective of memory systems, the findings presented here and in Miranda and Ullman (2007) underscore the view that declarative memory seems to underlie a wide range of types of knowledge, functions, domains, and modalities, and is not limited to episodic (event) and semantic (fact) knowledge as has traditionally been suggested (for discussion, see Ullman and Pullman, 2015; Ullman, 2016).

From a language perspective, the findings of the present study underscore the plausibility that highly specialized areas of knowledge, which are moreover found across human cultures, may depend importantly on general-purpose brain systems. This underscores the plausibility of the reliance of language on declarative memory and other general-purpose cognitive systems (Ullman, 2004, 2016).

The findings also have important implications for the study of sex differences. They reveal, for the first time, that women seem to have an advantage at recognizing familiar melodies, as compared to men. The findings also show for the first time that there are behavioral sex differences in higher-level aspects of music cognition. Importantly, the observed female superiority does not seem to be due to an exclusively verbal advantage, since the female advantage did not interact with lyricness. This not only strengthens the evidence of an overall female advantage at tasks involving declarative memory, and evidence of its extension to the domain of music, but also crucially throws doubt on the claim that the female advantage at many verbal tasks is specific to the verbal domain. Rather, many if not most of the previously observed female advantages at verbal tasks may instead be partly if not largely due to female advantages in declarative memory (Ullman et al., 2008). This controversial issue seems to warrant further research.

The findings of the present study also have educational and clinical implications. Pedagogical techniques that have been shown to improve learning and retention in declarative memory, such as spaced presentation and the testing (retrieval practice) effect (Cepeda et al., 2006; Roediger and Butler, 2011; Ullman and Lovelett, under review) may also be expected to enhance music learning, in particular the learning of specific melodies, just as they seem to enhance language, in particular the learning of words (Ozemir et al., in preparation; Ullman and Lovelett, under review). Also, understanding the neural substrates of the learning of knowledge about specific melodies could help guide music therapy, an approach that has been shown to be effective in helping patients with conditions involving deficits of both language and memory, such as aphasia and Alzheimer's disease (Norton et al., 2009; Ueda et al., 2013).

### Limitations and Future Studies

This study has various limitations. Perhaps most importantly, it does not directly tie the observed sex differences in melody recognition to female advantages in declarative memory. Thus, some other factor or factors could at least partially account for the findings. For example, it is possible that females generally make quicker decisions than males regarding information on which confidence is not high, or that sex differences in other aspects of music cognition involved in melody recognition could lead to the observed findings.

However, the sex differences found here were predicted on the basis of independent findings of sex differences in declarative memory, and moreover, analyses suggested they were not due to a wide range of potentially confounding factors or alternative explanations. Additionally, previous evidence has linked knowledge of familiar melodies to declarative memory (Miranda and Ullman, 2007). Together, this suggests that the study provides initial support for the view that the female advantage at declarative memory extends to music cognition, and can at least partly explain the observed sex differences in melody recognition.

Importantly, the findings constitute a useful foundation for future studies to more directly examine the issue. For example, further studies might examine whether participants' ability at melody recognition correlates with their ability at various declarative memory tasks. One could also examine the neural underpinnings of the observed sex differences, for example with fMRI or ERPs. Further research should also probe how broadly the apparent female advantage might hold, for example across different musical systems (e.g., in the Javanese or North Indian classical musical systems), age groups, and so on. One might also examine whether the female advantage would also hold in the actual identification of melodies (as in the game show "Name that Tune"), or whether it might be limited to binary familiarity judgments. Given the importance of sex hormones for cognition, including declarative memory (Hausmann et al., 2000; Hausmann, 2005; Ullman et al., 2008), the influence of estrogen and other sex hormones, and their variability throughout the menstrual cycle, also warrants investigation. For example, further studies may examine whether the findings obtained here might be due in part to elevated levels of estrogen during particular points along the menstrual cycle. The possibility of cultural influences (Hoffman et al., 2011) on the observed sex differences should also be investigated. Although the control task examined very simple aspects of auditory processing (i.e., the participants heard various tones and responded with a simple key press to any tone), the task did not directly examine pitch processing (since the same response was made to any pitch), nor other aspects of auditory processing such as rhythm. Future studies could control for such aspects of auditory processing, for example with different responses for different pitches or rhythms. It would additionally be highly informative to examine the learning of new specific melodies, and whether and how this depends on declarative memory. Finally, future studies might extend the investigation of music cognition to procedural memory, to examine whether and how the learning or use of musical syntax, or other aspects of music, might depend on this system.

### CONCLUSION

This study revealed, for the first time, a female advantage at recognizing familiar melodies, as compared to males. This pattern, which showed a large effect size, held across musicians and non-musicians, and over melodies with and without commonly associated lyrics. We predicted the female advantage based on independent evidence suggesting both a female advantage at declarative memory and a dependence of knowledge of familiar melodies on this system. Although some caution is warranted because this is the first study to examine sex differences in melody recognition, the findings lend support to the hypothesis that knowledge pertaining to specific melodies indeed depends on declarative memory, which in turn leads to a female advantage at familiar melody recognition, thanks to a more general female advantage at declarative memory. The finding that the female advantage held across melodies that are and are not associated with lyrics argues against the view that the commonly observed female advantage at tasks involving verbal (or verbalizable) material is best explained by a sex difference specific to the verbal domain. Additionally, because declarative memory also underlies language, it seems likely that the cognitive commonalities between music and language may be explained, at least in part, by a common dependence on declarative memory. More generally, because declarative memory is well studied at many levels, evidence that aspects of music cognition rely on this system could lead to a powerful research program capable of generating a wide range of novel predictions for the neurocognition of music.

### AUTHOR CONTRIBUTIONS

All authors listed have made substantial, direct, and intellectual contributions to the work, and approved it for publication.

### ACKNOWLEDGMENTS

We thank Matthew Walenski, Matthew Moffa, Jocelyn Curchack, Marco Piñeyro, Harriet Bowden, João Veríssimo, Natasha Janfaza, Rochelle Tractenberg, and Benson Stevens for help on various aspects of this study. Support was provided to RAM from an NSF Graduate Research Fellowship, to SM from an NSF Graduate Research Fellowship, and to MU from NIH R01 MH58189 and R01 HD049347.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.00278



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Miles, Miranda and Ullman. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Associations between musical abilities and precursors of reading in preschool aged children

*Franziska Degé\*, Claudia Kubicek and Gudrun Schwarzer*

*Department of Developmental Psychology, Justus Liebig University Giessen, Giessen, Germany*

The association between music and language, in particular the overlap in their processing, raises the possibility of using one domain to enhance the other. Especially in the preschool years, music may be a valuable tool for training language abilities (e.g., precursors of reading). Therefore, detailed knowledge about associations between musical abilities and precursors of reading can be of great use for designing future music intervention studies that target language-related abilities. Hence, the present study investigated the association between music perception as well as music production and precursors of reading. Not only phonological awareness, the most studied precursor of reading, was investigated, but other precursors were examined as well. We assessed musical abilities (production and perception) and precursors of reading (phonological awareness, working memory, and rapid retrieval from long-term memory) in 55 preschoolers (27 boys). Fluid intelligence was measured and controlled in the analyses. Results showed that phonological awareness, working memory, and rapid retrieval from long-term memory were related to music perception as well as to music production. Our data suggest that *several* precursors of reading were associated with music perception as well as music production.

Keywords: musical abilities, precursors of reading, phonological awareness, working memory, preschoolers

### *Edited by:*

*McNeel Gordon Jantzen, Western Washington University, USA*

### *Reviewed by:*

*Mireille Besson, Centre National de la Recherche Scientifique – Institut de Neurosciences Cognitives de la Méditerranée, France*

*Chantel Spring Prat, University of Washington, USA*

### *\*Correspondence:*

*Franziska Degé, Department of Developmental Psychology, Justus Liebig University Giessen, Otto-Behaghel-Straße 10F, 35394 Giessen, Germany*

*franziska.dege@psychol.uni-giessen.de*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

> *Received: 31 March 2015 Accepted: 03 August 2015 Published: 17 August 2015*

### *Citation:*

*Degé F, Kubicek C and Schwarzer G (2015) Associations between musical abilities and precursors of reading in preschool aged children. Front. Psychol. 6:1220. doi: 10.3389/fpsyg.2015.01220*

# Introduction

The non-musical benefits of music lessons have intrigued the public and fascinated researchers. However, the potential of music lessons to enhance cognitive abilities (e.g., IQ or memory) still remains under discussion. There are only a few experiments that clearly demonstrated a causal relationship between music lessons and cognitive abilities (e.g., IQ; Schellenberg, 2004). The vast majority of studies only established an association between music lessons and cognitive abilities, but did not examine the effect of music lessons on cognitive abilities. However, regarding language-related abilities, several studies showed that music interventions could cause improvements. Particularly, intervention studies demonstrated that music training can enhance vocabulary (Moreno et al., 2011), reading (Moreno et al., 2009), and phonological awareness – a precursor of reading (Degé and Schwarzer, 2011). These results indicate that music interventions might be able to support the development of language-related abilities (e.g., reading and phonological awareness). Until now, the majority of studies have primarily focused on music perception abilities (i.e., rhythm, pitch, meter, and timbre) and their relationship to language-related abilities, mostly addressing only one precursor (i.e., phonological awareness). However, there are other precursors of reading that have been identified: working memory, and rapid retrieval from long-term memory (Jansen et al., 2002). But up to now there has not been much research concerning musical abilities and precursors of reading other than phonological awareness. Another issue concerns the inclusion of musical abilities: although musical abilities comprise perception as well as production abilities, the association between musical production abilities and precursors of reading is still understudied. Therefore, our exploratory study is aimed at investigating the association between musical perception abilities as well as musical production abilities and *several* precursors of reading, such as phonological awareness, working memory, and rapid retrieval from long-term memory.

### Explanations for the Associations between Music and Language

Explanations draw upon the functional overlap of brain structures that are involved in music and speech processing (Besson et al., 2011). It is assumed that domain-general abilities (abilities that are used for music as well as for speech processing) build the basis of the connection between music and language. Explanatory approaches only differ in the auditory features that they promote as the connecting domain-general abilities. Besson et al. (2011) assume that musicians have an enhanced sensitivity to auditory parameters (e.g., frequency and duration) that are important for music processing. Because these parameters are involved in music and speech processing, the higher sensitivity results in a more elaborated auditory perception of speech. This enhancement on lower levels of processing can also create an advantage for higher levels of speech processing (e.g., phonological processing).

While the former approach focuses on frequency and duration, another one postulates timing as the important acoustical feature. Tierney and Kraus (2014) put forward the precise auditory timing hypothesis that explains the connection between auditory motor entrainment and phonological skills. They assume that music training requires entrainment, and entrainment necessitates the precise perception of acoustic event timing. Hence, musical training over extended periods of time might result in higher timing precision in the automated representation of acoustic events in the auditory system. This higher precision also benefits speech sound perception, which is important for phonological skills. Taken together, it is highly likely that common auditory features trained by musical experience have a positive effect on speech processing. As different explanations focus on different auditory features, it might be possible that more mechanisms drive the effect of music training on language abilities (Tierney and Kraus, 2014).

### Music Training and Vocabulary

There is correlational (i.e., quasi-experimental) and experimental evidence of associations between music training and vocabulary. Piro and Ortiz (2009) found in a study with second-grade students that *music training* was associated with improvements in vocabulary. Also, musically trained 10-year-old children outperformed their untrained counterparts in vocabulary tasks (Forgeard et al., 2008). However, these correlational studies allowed no inferences about causation; it therefore remained unclear whether music training actually caused improvements in vocabulary. Nonetheless, there is one experimental study that used a computerized music-listening training as a potential intervention for vocabulary (Moreno et al., 2011). Preschoolers were pseudo-randomly assigned to music training or visual arts training. Before and after 4 weeks (20 days) of training, vocabulary was assessed. Only the music group showed increases from pre- to post-test on the vocabulary task, which unequivocally demonstrates that the music intervention indeed enhanced vocabulary.

### Music Aptitude, Music Training, and Reading

Regarding *music aptitude*, Anvari et al. (2002) showed that music perception abilities (pitch and rhythm) were related to reading abilities in 4-year-old children. This association remained reliable when phonological awareness was held constant. In 5-year-old children only pitch perception was associated with reading, whereas rhythm was not related to reading. The relationship between pitch perception and reading was again independent of phonological awareness. This independence might indicate that, apart from phonological awareness, other precursors of reading might mediate the association between musical abilities and reading. In 7- to 8-year-old children, Douglas and Willatts (1994) demonstrated an association between reading and pitch perception as well as rhythm perception.

Correlational research regarding *music training* and reading skills demonstrated an association between music training and spelling in 8- to 9-year-old children (Hille et al., 2011). Additionally, an association between music training and reading was revealed; this association disappeared when socioeconomic status (SES) was controlled (Hille et al., 2011). However, a study with 6- to 9-year-old children revealed an association between length of music training and reading comprehension even when SES was controlled (Corrigall and Trainor, 2011). Some experimental or longitudinal studies even supported the assumption of a causal relation between music training and reading skills. Douglas and Willatts (1994) performed a small-scale intervention study. They trained the music group and a discussion control group for 6 months, and measured reading skills before and after the training. The music group showed small improvements in reading, whereas the control group did not improve from pre- to post-test. In a recent study with a larger sample size, Moreno et al. (2009) trained 8-year-old children either in music or painting and tested reading skills (i.e., reading of inconsistent words; inconsistent with respect to phoneme-grapheme mapping and pronunciation) before and after 6 months of training. The music group improved its reading skills, while the painting group showed no improvements.

Two meta-analyses were conducted on music training and reading skills. Butzlaff (2000) reported a strong association between music training and reading in correlational studies. For experimental studies, however, he found no reliable results because some studies demonstrated a positive effect of music training on reading skills, while others did not. A more recent meta-analysis (Standley, 2008) did reveal a modest but significant positive effect of music training on reading skills.

### Music Aptitude, Music Training, and Phonological Awareness

Phonological awareness describes the insight into the phonological structure of language. It refers to the ability to analyze and manipulate language on two levels. On the word level, phonological awareness describes the ability to manipulate and analyze larger phonological units (e.g., rhyming and blending words). On the phoneme level, phonological awareness refers to the ability to analyze and manipulate the individual sound units (phonemes) within a word. It has been shown that phonological awareness is an important precursor of later reading ability (Pratt and Brady, 1988; Bruck, 1992).

Phonological awareness is related to music aptitude as well as to music training. A few studies investigated the effect of *music training* on phonological awareness, and revealed that music training can indeed enhance phonological skills. In a quasi-experiment, Gromko (2005) investigated the effect of music training on phonological awareness. Children in the treatment kindergarten received music training for 4 months, while children in the control kindergarten received no treatment. Gromko (2005) revealed significantly greater gains in phonological awareness in the treatment kindergarten children than in the control kindergarten children. The pseudorandom assignment of the preschoolers to the treatment and the control group, however, precludes firm conclusions. Children were not assigned randomly on an individual basis, but the kindergartens were chosen to be the control or the treatment kindergarten. Therefore, children in the treatment group may have systematically differed (e.g., in SES) from children in the control group. Furthermore, the control group did not receive an alternative training. Degé and Schwarzer (2011) randomly assigned preschoolers to music production and perception training, phonological skills training, and sports training. Children in all three groups received training for 20 weeks. The phonological skills training as well as the music training enhanced phonological awareness, whereas the sports training did not. The advancement of phonological awareness in the music group and in the phonological skills group was mainly driven by improvements in phonological awareness on the word level. These results demonstrated unequivocally that music training could enhance phonological awareness.

Several studies revealed positive associations between *music aptitude* and phonological awareness. Huss et al. (2011) found a relationship between metrical perception and phonological awareness in a sample of 10-year-old dyslexic and non-dyslexic children. Also, Norton et al. (2005) found that audiation (the ability to hear, feel, and comprehend music for which the sound is not physically present) was correlated with phonological awareness. The test that was applied in this study comprises pitch perception and rhythm perception. Hence, phonological awareness was associated with a global music perception factor. Moreover, in a study with 4- and 5-year-old children, Lamb and Gregory (1993) investigated the association between phonological awareness and pitch as well as timbre perception. They found that pitch perception (but not timbre perception) was related to phonological awareness (Lamb and Gregory, 1993). The association between pitch perception and phonological awareness was reliable even when age and fluid intelligence were controlled. By testing the same age group as Lamb and Gregory (1993), Anvari et al. (2002) investigated the association between musical abilities and phonological awareness. They assessed melody perception, chord perception, chord analysis, rhythm perception, and rhythm production. However, they ran factor analyses and found that one factor for the 4-year-olds and two factors for the 5-year-olds represented the musical abilities best. Thus, for their further analyses, they created one music macro-variable that contained all assessed musical abilities for the 4-year-old children and two music variables (i.e., pitch and rhythm) for the 5-year-old children. For the 4-year-old children, the music macro-variable was correlated with phonological awareness, and for the 5-year-olds both music variables (i.e., pitch and rhythm) were correlated with phonological awareness.
In sum, these data suggest that pitch perception as well as rhythm perception is associated with phonological awareness.

### Objectives

Music intervention studies showed that music training could enhance vocabulary, reading skills, and phonological awareness. An explanation could be that music interventions that mostly target music perception abilities enhanced music listening skills. This enhancement was then accompanied by improvements in speech perception and might have promoted some aspects of language processing (Corrigall et al., 2013). This language enhancing potential of music training might be particularly valuable. Therefore, analyzing the association between musical abilities and language-related abilities might be of great importance in order to understand relevant underlying mechanisms, and to design effective interventions for reading or phonological awareness. So far, studies investigating the association between musical abilities and language-related abilities focused on music perception abilities only. Thus, the first aim of our study was to examine the association between precursors of reading in preschoolers and music perception as well as music production. Furthermore, we applied a musical test battery that assessed music perception (melody perception, pitch perception, rhythm perception, meter perception, tone length perception) as well as music production (singing a song, rhythm production, meter production) on a detailed level. This detailed investigation represents a more complete picture of associations between musical abilities and precursors of reading. This will in turn help to identify reasonable tasks for music training programs that focus, for example, on the improvement of reading.

Anvari et al. (2002) revealed that there might be abilities other than phonological awareness involved in the association between musical abilities and reading skills. However, most studies have investigated only one precursor of reading (i.e., phonological awareness). Therefore, our second aim was to investigate the association between musical abilities and several precursors of reading (i.e., phonological awareness, working memory, rapid retrieval from long-term memory) in preschool children. This exploratory approach, which comprises several correlational analyses, will broaden our understanding of associations between musical abilities and precursors of reading.

# Materials and Methods

The study was conducted in full accordance with the Ethical Guidelines of the German Association of Psychologists (DGPs). In accordance with these guidelines, informed consent was obtained from the parents for each participant.

### Participants

The sample comprised 55 preschoolers (27 boys, 28 girls; mean age = 75.13 months; SD = 4.02 months) from five different kindergartens in Giessen, Germany. Participants had a mean fluid intelligence score of *M* = 113.25 (SD = 11.06), see below. Hence, average fluid intelligence scores were higher than the published norms. The sample showed diversity with respect to parents' education: for 40% of the children neither parent had a university degree, for 27.3% of the children one parent had a university degree, and for 32.7% of the children both parents had a university degree.

### Measures

Possible confounding variables such as age, gender, SES, and intelligence were assessed. Musical abilities were measured as predictor variables and precursors of reading as criterion variables.

Parents completed a demographic questionnaire that asked for information about their education as one possible measure of SES. Mothers' and fathers' education was initially coded as a dichotomous variable (0 for "no university degree" and 1 for "a university degree"). For the statistical analyses, parents' education was collapsed into a single variable: 0, 1, or 2 parents with a university degree. Although parental income or parental profession could also be used as a measure of SES, we decided to ask for parents' education, because in former studies parents have been mostly willing to share this information. This questionnaire was also used to assess gender and age of the participants.
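The coding scheme for parents' education described above can be sketched in a few lines. This is only an illustration of the 0/1/2 coding, not the authors' scoring script; the function names and the example answers are hypothetical.

```python
def code_degree(has_university_degree: bool) -> int:
    """Dichotomous coding: 1 = a university degree, 0 = no university degree."""
    return 1 if has_university_degree else 0


def parental_education(mother_has_degree: bool, father_has_degree: bool) -> int:
    """Collapse both parents' codes into one variable: 0, 1, or 2 parents with a degree."""
    return code_degree(mother_has_degree) + code_degree(father_has_degree)


# Hypothetical questionnaire answers (mother, father); not data from the study.
answers = [(False, False), (True, False), (True, True)]
ses_scores = [parental_education(mother, father) for mother, father in answers]
print(ses_scores)
```

The resulting variable is ordinal with three levels, which is how it enters the correlational analyses as the SES measure.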

To measure intelligence, the culture fair test (CFT1; Weiß and Osterland, 1977), which measures fluid intelligence, was employed. The test consisted of five subtests (substitution, mazes, classification, similarities, and matrices) and was administered in groups that did not exceed six children. The duration of test administration was 60 min including instructions and breaks. Age norms were used to determine the intelligence score for each participant.

Precursors of reading were measured with the Bielefelder Screening (BISC; Jansen et al., 2002). The screening allows the assessment of different precursors of reading: phonological awareness, working memory, and rapid retrieval from long-term memory.

Phonological awareness was assessed with the following four subtests: rhymes, word segmentation, phoneme synthesis, and phoneme recognition. Two tests (rhymes and word segmentation) measured phonological awareness for large phonological units (words) and the other two subtests (phoneme synthesis and phoneme recognition) assessed phonological awareness for small phonological units (phonemes). Each subtest consisted of two to four practice items and 10 test items. In the rhymes task, children were asked whether two words rhyme or do not rhyme (e.g., Do train and rain rhyme?). Children were asked to segment words by clapping their hands in the word segmentation task. The phoneme synthesis task requested the synthesis of the initial sound and the remaining word (e.g., m-ouse) into one word. The phoneme recognition task required recognition of a particular phoneme in a word (e.g., Is there a "u" in elephant?). A composite score of all of the subtest scores was calculated. For the statistical analyses the subtest scores as well as the composite score were used.

Working memory was assessed with recall of non-sense words (e.g., gor-ki-ra-si-mi). In this task the children had to listen to a non-sense word and recall it immediately after listening. The test consisted of two practice items and 10 test items. Seven of these test items were four syllables long, two were five syllables long, and one test item was six syllables long. The practice items were three and four syllables long, respectively. If any syllable was recalled incorrectly or omitted the non-sense word was marked as incorrect. For each correctly recalled word children received one point.

Rapid retrieval from long-term memory was assessed with a speeded naming task. This task consisted of two parts. In the first part, children were asked to name the appropriate color of black and white fruits as fast as possible. Reaction time was measured and the number of correct answers was registered. This part was designed to assess rapid retrieval from long-term memory. In the second part, children were asked to name the appropriate color of wrongly colored fruits (e.g., yellow salad or blue lemon). As in the first part, reaction time and correctness were registered. This part was intended to assess interference of rapid retrieval from long-term memory. The interference score was built by subtracting the "correct answer and time score" from part one from the "correct answer and time score" from part two. This difference score (small difference indicating little interference) was then transformed into the interference score (high score indicating little interference).

Musical abilities (music perception and production abilities) were measured with the music screening for children (Jungbluth and Hafen, 2005). We applied five subtests to measure music perception: melody perception, pitch perception, rhythm perception, tone length perception, and meter perception. Each subtest consisted of 10 items with increasing difficulty. All subtests required same-different discriminations, wherein the position or direction of changes should be indicated. In the melody perception subtest, children were asked to identify a change and the position of the change in two consecutive melodies. On the pitch perception task, children had to decide whether the second tone was higher, lower, or the same as the first. In the subtest rhythm perception, children had to decide whether two short rhythmic patterns were the same or different and they had to indicate the position of the difference. In the subtest tone length perception, two tones of the same pitch were played to the children and they had to indicate whether the second tone was longer, shorter, or of same duration as the first. In the subtest meter perception, two different meters were presented. Each consisted of five beats. The children had to decide whether the second meter was faster, slower, or the same as the first. A music perception total score was built for each child by adding the scores reached in each subtest.

We used three subtests to assess music production abilities: singing a song, rhythm production, and meter execution. In the subtest singing a song, children learned and sang a 4-bar-song. Two independent raters analyzed the recorded performance. They rated melody contour, rhythm, starting tone, and intonation. The interrater reliability was *r* = 0.94. In the subtest rhythm production, 10 rhythms of increasing levels of difficulty were presented from a CD and reproduced by the children on a keyboard. This subtest was recorded and scored by two independent raters, as well. The interrater reliability was *r* = 0.96. On the meter execution subtest, children had to perform four different tasks, while they always listened to the same piece of music. Firstly, children had to walk in the meter of the musical piece. Secondly, children had to clap their hands in the meter of the musical piece. In task three and four, children had to clap their hands in the meter of the music and continue clapping in the correct meter when the music had stopped. They continued until they heard "stop" from the CD. All four tasks were recorded and were coded by two raters. The interrater reliability was *r* = 0.80. In addition to the subtest scores, a music production total score was built for each child by adding the scores reached in each subtest.
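The interrater reliabilities above are reported as *r* values. Assuming they are Pearson correlations between the two raters' scores (the text does not name the coefficient), the computation can be sketched as follows. The function name `pearson_r` and the ratings are hypothetical and serve only as an illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical scores given by two raters to the same five recordings.
rater_a = [3, 5, 2, 4, 1]
rater_b = [3, 4, 2, 5, 1]
print(f"interrater r = {pearson_r(rater_a, rater_b):.2f}")
```

A value close to 1 indicates that the two raters ordered and spaced the performances in nearly the same way; it does not by itself show that they used identical absolute scores.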

### Procedure

Prior to testing, the informed consent of the parents was obtained. Additionally, the demographic questionnaire was sent to the participants, who were asked to hand it back to a person working in the kindergarten so that the experimenter could collect it. All test sessions took place in the kindergarten during the daily routine. All kindergartens provided a quiet room for the test sessions. The intelligence test was performed in groups of five to six children with two experimenters present. One experimenter instructed the children, while the second experimenter made sure that the children remained seated and concentrated on their own sheet of paper. The precursors of reading were assessed in individual sessions. Two experimenters applied the tests that measured music perception abilities in groups of five to six children. Again one experimenter instructed the group, while the other experimenter made sure that the children were focused on the tasks. The items of the music perception test were presented via speakers and the children indicated their responses on a sheet of paper. Items were coded with little cartoons or pictures to guide the children through the test. When the children had to indicate the positions of differences (melody and rhythm), they could mark the notes or drums, respectively, on the sheet of paper. Music production abilities were assessed in individual test sessions. The assessments were performed on consecutive days. All in all, children participated in four group sessions (two sessions for the intelligence test, two sessions for the music perception tasks) and two individual sessions (one session for the precursors of reading, one session for the music production tasks). At the end of the project, each child received a present and a certificate for participation.

# Results

### Preliminary Analyses

We correlated possible confounding variables (age, gender, SES, and IQ) with music perception as well as production abilities. Only IQ was significantly correlated with musical abilities, namely with music production (*r* = 0.321, *p* = 0.017) and with music perception (*r* = 0.497, *p* = 0.000). Age, gender, and SES were not significantly correlated with musical abilities (**Table 1**). We also correlated the possible confounding variables with the precursors of reading. Age was significantly correlated with interference of rapid retrieval (*r* = 0.313, *p* = 0.020). SES was significantly correlated with working memory (*r* = 0.269, *p* = 0.047). IQ was significantly correlated with working memory (*r* = 0.325, *p* = 0.016). Gender was not significantly correlated with any precursor of reading; for details see **Table 1**. Because IQ was significantly correlated with musical abilities and precursors of reading, it was controlled in further statistical analyses.

### Principal Analyses

### Correlations between Musical Abilities and Precursors of Reading

Correlations (with IQ partialed out) between musical abilities (music perception and production total scores) and precursors of reading revealed significant associations between musical abilities and phonological awareness, working memory, and rapid retrieval from long-term memory (**Table 2**).
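Partialing IQ out of a correlation can be illustrated with the standard first-order partial correlation formula, which needs only the three zero-order correlations. The sketch below is not the authors' analysis script; the function name `partial_r` and the input values are hypothetical.

```python
from math import sqrt


def partial_r(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between x and y with the control variable z partialed out.

    r_xy: zero-order correlation between x (e.g., a music score) and y (e.g., a precursor)
    r_xz: zero-order correlation between x and z (e.g., IQ)
    r_yz: zero-order correlation between y and z
    """
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical zero-order correlations; not values from the study.
r_music_reading = 0.5
r_music_iq = 0.5
r_reading_iq = 0.5
print(f"partial r = {partial_r(r_music_reading, r_music_iq, r_reading_iq):.3f}")
```

When the control variable is unrelated to both other variables, the partial correlation equals the zero-order correlation; the more strongly it relates to both, the more the association shrinks once it is held constant.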

Phonological awareness was correlated with music perception (*r* = 0.417, *p* = 0.002) and music production (*r* = 0.650, *p* = 0.000). Also, working memory was correlated significantly with music perception (*r* = 0.363, *p* = 0.007) as well as music production abilities (*r* = 0.280, *p* = 0.040). We observed a significant correlation for interference of rapid retrieval and music perception (*r* = 0.337, *p* = 0.013). Higher scores in music perception (total) were associated with less interference in rapid retrieval. Music production was not significantly correlated with interference of rapid retrieval (*r* = 0.262, *p* = 0.056). In further analyses the associations between musical abilities and precursors of reading were explored in more detail.

TABLE 1 | Correlations among possible confounding variables [age, gender, socioeconomic status (SES), and IQ], musical abilities (music perception and music production), and precursors of reading (phonological awareness, working memory, and rapid retrieval from long-term memory).

∗*p < 0.05,* ∗∗*p < 0.001.*

*p-values in parentheses. Significant results in bold.*

### Correlations between Music Perception, Music Production, and Phonological Awareness

The more detailed (on subtest level) analyses showed that phonological awareness total score was significantly correlated (IQ controlled) with pitch perception (*r* = 0.321, *p* = 0.018), rhythm perception (*r* = 0.335, *p* = 0.018), and tone length perception (*r* = 0.322, *p* = 0.018). The subtests of phonological awareness on the word level were significantly correlated with rhythm perception (rhymes: *r* = 0.320, *p* = 0.018) and marginally significantly correlated with tone length perception (rhymes: *r* = 0.264, *p* = 0.053) and pitch perception (word segmentation: *r* = 0.269, *p* = 0.050). For the subtests regarding phonological awareness on the phoneme level only a correlation between phoneme recognition and tone length perception (*r* = 0.347, *p* = 0.010) was revealed. All the other correlations between phonological awareness and music perception were not significant (see **Table 3** for more details). The phonological awareness total score was significantly correlated with singing a song (*r* = 0.529, *p* = 0.000) and rhythm production (*r* = 0.632, *p* = 0.000). Both phonological awareness subtests on the word level were significantly correlated with singing a song (rhymes: *r* = 0.347, *p* = 0.010; word segmentation: *r* = 0.344, *p* = 0.011) and rhythm production (rhymes: *r* = 0.332, *p* = 0.014; word segmentation: *r* = 0.473, *p* = 0.000). Additionally, rhymes were also significantly correlated with meter execution (*r* = 0.443, *p* = 0.001). Only one subtest operating on the phoneme level was correlated with singing a song. Phoneme synthesis was significantly correlated with singing a song (*r* = 0.387, *p* = 0.004), whereas the correlation between phoneme recognition and singing a song was not significant (*r* = 0.257, *p* = 0.061). However, phoneme recognition was significantly correlated with rhythm production (*r* = 0.391, *p* = 0.003). 
None of the other correlations between phonological awareness and music production reached significance (see **Table 3**).

### Correlations between Music Perception, Music Production, and Working Memory

We calculated partial correlations with IQ controlled between working memory and the music perception subtests. Working memory was associated significantly with rhythm perception (*r* = 0.435, *p* = 0.001). For the other subtests (melody perception, pitch perception, tone length perception, and meter perception) no significant correlations were found (**Table 4**). Partial correlations between the music production subtests (singing a song, rhythm production, and meter execution) and working memory revealed a marginally significant relationship between rhythm production and working memory (*r* = 0.265, *p* = 0.053). Singing a song and meter execution were not significantly correlated with working memory (**Table 4**).

TABLE 3 | Associations between phonological awareness (total score, rhymes, word segmentation, phoneme synthesis, phoneme recognition) and the subtests of music perception as well as the subtests of music production.


*p-values in parentheses. Significant results in bold.*

TABLE 4 | Associations among working memory and the subtests of music perception as well as the subtests of music production.

| | Rhythm perception | Rhythm production | Meter execution |
|---|---|---|---|
| Working memory | **0.435 (0.001)** | 0.265 (0.053) | 0.154 (0.267) |

Remaining music perception subtests (melody, pitch, tone length, and meter perception; column assignment not recoverable): 0.225 (0.102), 0.214 (0.120), 0.147 (0.290), 0.102 (0.456).

*p-values in parentheses. Significant results in bold.*

### Correlations between Music Perception, Music Production, and Interference of Rapid Retrieval from Long-Term Memory

Partial correlations (IQ controlled) between interference of rapid retrieval from long-term memory and music perception revealed a significant association with rhythm perception (*r* = 0.344, *p* = 0.011). No significant associations between interference of rapid retrieval from long-term memory and any other tested music perception ability were found (see **Table 5** for details). With respect to music production abilities, only rhythm production was significantly correlated with interference of rapid retrieval from long-term memory (*r* = 0.295, *p* = 0.030). Neither singing a song nor meter execution was significantly related with interference of rapid retrieval from long-term memory (see **Table 5**).

Because this study was exploratory in nature, several correlations were calculated, and this approach inflates the risk of false-positive findings at the nominal alpha level. Because 64 correlations were calculated in the principal analyses, it might be reasonable to adjust the alpha level. At an adjusted alpha level (α = 0.0008), only four correlations remained significant: the correlation between music production and phonological awareness total score, the correlation between singing a song and phonological awareness total score, the correlation between rhythm production and phonological awareness total score, and the correlation between rhythm production and word segmentation.
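The adjusted alpha level of 0.0008 is consistent with dividing a conventional alpha of 0.05 by the 64 correlations, that is, a Bonferroni-style correction. The text does not name the method, so the sketch below rests on that assumption; the function names and the example *p*-values are hypothetical.

```python
def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test alpha level that keeps the family-wise error rate at family_alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


def survives_correction(p_value: float, family_alpha: float, n_tests: int) -> bool:
    """True if a single p-value is below the adjusted per-test alpha level."""
    return p_value < bonferroni_alpha(family_alpha, n_tests)


# 64 correlations at a conventional alpha of 0.05 (the 0.05 is an assumption).
adjusted = bonferroni_alpha(0.05, 64)
print(round(adjusted, 4))

# Hypothetical p-values for illustration only.
for p in (0.0004, 0.001, 0.018):
    print(p, survives_correction(p, 0.05, 64))
```

Under this assumption, a correlation reported as *p* = 0.018 is significant at the nominal level but not after correction, which is why so few of the reported associations remain.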

# Discussion

In the present study, we investigated associations between music perception as well as music production and precursors of reading. In particular, we assessed musical abilities on a detailed (subtest) level and examined their associations with several precursors of reading (phonological awareness, working memory, and rapid retrieval from long-term memory).

song

0.208 (0.131)

The total scores of music perception as well as of music production were associated with phonological awareness. Furthermore, we found correlations between the music perception total score and working memory as well as between the music production total score and working memory. Finally, the music perception total score correlated significantly with interference of rapid retrieval from long-term memory.

All in all, our results indicated that music production as well as music perception was associated with several precursors of reading. Thus, our study complements already existing studies by showing associations with music production abilities. With respect to production abilities, rhythm production was associated with three of the precursors, singing a song was correlated with three of the four phonological awareness subtests (all except phoneme recognition), and meter execution showed only one significant relationship, namely with rhymes, one subtest of phonological awareness. Our assessment of several precursors of reading and their relationship to musical abilities showed that, above and beyond phonological awareness, working memory and interference of rapid retrieval from long-term memory were associated with musical abilities. Taken together, our data suggest that there are several links between musical abilities and precursors of reading.

In a next step we explored the revealed associations between musical abilities and phonological awareness, working memory, and interference of rapid retrieval from long-term memory in more detail (i.e., on the subtest level).

TABLE 5 | Associations among interference of rapid retrieval from long-term memory and the subtests of music perception as well as the subtests of music production.

*p-values in parentheses. Significant results in bold.*

Phonological awareness on the word level (rhymes and word segmentation) was correlated with pitch perception (marginally), rhythm perception, and tone length perception (marginally). Regarding phonological awareness on the word level and music production abilities, we found associations with singing a song, rhythm production, and meter execution. Phonological awareness on the phoneme level (phoneme recognition) was correlated with tone length perception. Furthermore, we revealed a significant correlation between phonological awareness on the phoneme level (phoneme recognition) and rhythm production and a significant correlation between phonological awareness on the phoneme level (phoneme synthesis) and singing a song. Thus, first of all, our study is in accordance with earlier findings that indicate a relationship between phonological awareness and musical abilities (Lamb and Gregory, 1993; Anvari et al., 2002). Furthermore, our results demonstrated that phonological awareness on the word level is involved in more associations with musical abilities than phonological awareness on the phoneme level. This fits the results of Degé and Schwarzer (2011), who found an improvement especially for phonological awareness on the word level after a music training program. Our research indicates that more associations between phonological awareness on the word level and musical abilities are evident. Thus, it might have been easier to observe an effect of music training on phonological awareness on the word level in the Degé and Schwarzer study. Hence, musical abilities had more ways of interacting with phonological awareness on the word level. Therefore, a positive effect might have emerged earlier for phonological awareness on the word level than for phonological awareness on the phoneme level. Additionally, it is also possible that music training is not suitable to train phonological awareness on the phoneme level. However, we also found associations between phonological awareness on the phoneme level and musical abilities. Therefore, it should be possible to train phonological awareness on the phoneme level with a music training program.
Because we found only a few associations between musical abilities and phonological awareness on the phoneme level, it could be speculated that these associations might be weaker than those on the word level, which in turn suggests that simply increasing training length might drive effects of music training on phonological awareness on the phoneme level. It remains for future research to test this specific hypothesis. As in Lamb and Gregory (1993), our results showed an association between phonological awareness and pitch perception that remained reliable after controlling for fluid intelligence. Hence, this association was not due to the influence of a third variable (i.e., fluid intelligence) but reflected a direct association between pitch perception and phonological awareness. In contrast to the results of Lamb and Gregory (1993), we also found reliable associations between phonological awareness and rhythm perception as well as rhythm production. This finding, though, is supported by the results of Anvari et al. (2002), who also found associations between pitch as well as rhythm and phonological awareness. Taken together, pitch and rhythm seem to be related to phonological awareness. Thus, both aspects of musical abilities might contribute to positive effects of music training on phonological awareness.

With respect to meter, we found a significant relationship only between meter production and phonological awareness. Meter perception was not significantly correlated with phonological awareness. Hence, our results are only partly in accordance with the study by Huss et al. (2011), insofar as both studies found a link between meter and phonological awareness. Huss et al. (2011) found correlations between phonological awareness and metrical perception, whereas we found an association between phonological awareness and meter production (i.e., meter execution). Possibly, the difference in the applied meter perception tasks was responsible for the slightly different results. In the study by Huss et al. (2011), beats per minute remained stable and only the accents were changed between consecutive stimuli, whereas in the task we applied, two sequences differed in beats per minute and not in accents. In the light of the precise auditory timing hypothesis (Tierney and Kraus, 2014), it is surprising that meter did not show more associations with several aspects of phonological awareness. Because meter execution and meter perception rely heavily on a precise perception of auditory timing, one might have expected meter to be strongly related to phonological awareness.

For working memory, results regarding perception and production showed an association between working memory and the rhythm subtests. Working memory was correlated with rhythm perception and rhythm production (marginally significant). These findings are in line with the results of Anvari et al. (2002), who showed an association between musical abilities and working memory. Considering the tasks applied to assess rhythm perception (comparing two drum sequences), rhythm production (reproducing a rhythm on a keyboard), and working memory (reproducing nonsense words), it seems reasonable to conclude that all of them were processed by the phonological loop (i.e., the subsystem responsible for auditory/verbal input of Baddeley's working memory model; Baddeley, 1986). Therefore, the phonological loop was possibly the common basis of these tasks and was reflected in task outcome. This might indicate that in intervention studies music training might have trained phonological loop processes and thereby produced a benefit for language processing. Moreover, the rhythm perception and production tasks were to some extent quite similar. For both of them the children had to keep a rhythm in mind and either reproduce it or compare it to a second one. Thus, it comes as no surprise that rhythm perception and rhythm production tasks are related to the same precursor of reading: working memory. However, it might not be possible to generalize our findings to all kinds of working memory tasks. Because the task we applied uses nonsense words to assess working memory, it relies not only on working memory capacity but also on phonological processing and articulatory acuity. For example, if a child is not able to reproduce all letters of a syllable correctly, this might be due to working memory capacity, but it is also possible that this child has difficulties with phonological processing.
In the context of working memory as a precursor of reading it might be reasonable to use language material (nonsense words) in the working memory task. However, to claim a general link between rhythmical abilities and working memory future studies should investigate this association by using tests that do not rely on phonological abilities.

Interference of rapid retrieval from long-term memory was correlated with rhythm perception as well as with rhythm production. Hence, for interference of rapid retrieval from long-term memory, associations with perception and production showed the same pattern of results. To the best of our knowledge, our study is the first to demonstrate an association between musical abilities and interference of rapid retrieval from long-term memory. As mentioned above, the rhythm tasks placed similar cognitive demands on the children. Therefore, it is plausible that they showed comparable relationships. This result again points toward the importance of memory for language processing. Although the task assessed rapid retrieval from long-term memory, it might be speculated that again a subsystem of working memory (the central executive) could form the common ground of the music and the language task. The central executive is, among other things, responsible for providing a link between working memory and long-term memory (Baddeley, 2007). Furthermore, the nature of the task (speeded naming of color incongruent fruits) is comparable to other set shifting tasks that typically assess the central executive. Hence, the central executive, in particular set shifting abilities, might be important for musical abilities and rapid retrieval from long-term memory. Indeed, there is correlational evidence of an association between set shifting and music lessons (Degé et al., 2011).

With regard to the design of music intervention studies focusing on enhancing precursors of reading, our results suggest that music training should target music listening skills as well as music production skills. A combination of perception and production tasks will probably provide a more successful training of precursors of reading. Our most important finding is that music production should be part of a music training program. Additionally, our results show that rhythm perception and production tasks may be a powerful tool to enhance memory-related precursors of reading. Moreover, the present results suggest that not only pitch perception in the sense of discriminating pitches of different frequencies, but also discriminating tones of different lengths may be important to train in a music intervention. Lastly, body movements, as required in meter execution, may be helpful in training phonological awareness. However, they probably play only a minor role, because meter execution was only related to the rhymes subtest of phonological awareness. It is important to note that all of the above-mentioned suggestions were inspired by correlational data. Our study provides no evidence of a causal effect of music training on the mentioned precursors of reading, but sheds light on the path a music training might take to improve precursors of reading.

As already mentioned, our study aimed at analyzing associations between musical abilities and precursors of reading on a detailed level. Therefore, a detailed and exploratory approach was chosen. However, there is a tradeoff between the detailed picture we could show and the large number of correlations tested. Because of these multiple comparisons, the level of significance should be adjusted. If the alpha level is adjusted, only associations between music production and phonological awareness remain reliable. Although alpha inflation is a problem in our analyses, we believe that this approach is seminal. It draws a more complete picture of the associations between musical abilities and precursors of reading than former research has done. Moreover, for future studies it is now possible to test specific relations and hypotheses.
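The alpha adjustment mentioned above can be illustrated with a short sketch. The text does not state which correction was applied, so the Bonferroni procedure used here is an assumption, and the p-values and the function name `bonferroni` are purely hypothetical illustrations, not data from the study.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the Bonferroni-adjusted threshold and a per-test significance flag.

    Assumption: Bonferroni is used only as an example of an alpha adjustment;
    the study does not name the correction it refers to.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    threshold = alpha / m  # each test is held to alpha divided by the number of tests
    flags = [p <= threshold for p in p_values]
    return threshold, flags


# Hypothetical p-values from three exploratory correlations (not study data).
example_p_values = [0.001, 0.02, 0.2]
adjusted_threshold, still_significant = bonferroni(example_p_values)
```

In this hypothetical case, the second correlation would pass an unadjusted 0.05 threshold but not the adjusted one, which is the kind of shrinkage in reliable associations the paragraph describes.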

Future studies should replicate our results with a larger sample size and should extend them by using different measures to assess musical abilities as well as precursors of reading to investigate the stability of the revealed associations. Finally, it remains for future research to employ an experimental design to allow inferences about causation.

# Acknowledgments

This research was supported by a grant for educational research (TP9 6050085) from the Justus Liebig University Giessen. The authors would like to thank the participating kindergartens as well as all of the participants and their parents.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Degé, Kubicek and Schwarzer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Does Music Training Enhance Literacy Skills? A Meta-Analysis

### Reyna L. Gordon<sup>1,2</sup>\*, Hilda M. Fehd<sup>3</sup> and Bruce D. McCandliss<sup>4</sup>

<sup>1</sup> Music Cognition Lab, Program for Music, Mind and Society, Department of Otolaryngology, Vanderbilt University Medical Center, Nashville, TN, USA, <sup>2</sup> Vanderbilt Kennedy Center, Vanderbilt University Medical Center, Nashville, TN, USA, <sup>3</sup> Institute for Software Integrated Systems, School of Engineering, Vanderbilt University, Nashville, TN, USA, <sup>4</sup> Department of Psychology, Graduate School of Education, Stanford University, Stanford, CA, USA

Children's engagement in music practice is associated with enhancements in literacy-related language skills, as demonstrated by multiple reports of correlation across these two domains. Training studies have tested whether engaging in music training directly transfers benefit to children's literacy skill development. Results of such studies, however, are mixed. Interpretation of these mixed results is made more complex by the fact that a wide range of literacy-related outcome measures are used across these studies. Here, we address these challenges via a meta-analytic approach. A comprehensive literature review of peer-reviewed music training studies was built around key criteria needed to test the direct transfer hypothesis, including: (a) inclusion of music training vs. control groups; (b) inclusion of pre- vs. post-comparison measures; and (c) indication that reading instruction was held constant across groups. Thirteen studies were identified (n = 901). Two classes of outcome measures emerged with sufficient overlap to support meta-analysis: phonological awareness and reading fluency. Hours of training, age, and type of control intervention were examined as potential moderators. Results supported the hypothesis that music training leads to gains in phonological awareness skills. The effect isolated by contrasting gains in music training vs. gains in control was small relative to the large variance in these skills (d = 0.2). Interestingly, analyses revealed that transfer effects for rhyming skills tended to grow stronger with increased hours of training. In contrast, no significant aggregate transfer effect emerged for reading fluency measures, despite some studies reporting large training effects. The potential influence of other study design factors was considered, including intervention design, IQ, and SES. Results are discussed in the context of emerging findings that music training may enhance literacy development via changes in brain mechanisms that support both music and language cognition.

Keywords: music training, reading, literacy, phonological awareness, meta-analysis, brain development

### Edited by:

McNeel Gordon Jantzen, Western Washington University, USA

### Reviewed by:

Virginia Penhune, Concordia University, Canada; Franziska Degé, Justus-Liebig-University, Germany

> \*Correspondence: Reyna L. Gordon reyna.gordon@vanderbilt.edu

### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 22 July 2015 Accepted: 05 November 2015 Published: 01 December 2015

### Citation:

Gordon RL, Fehd HM and McCandliss BD (2015) Does Music Training Enhance Literacy Skills? A Meta-Analysis. Front. Psychol. 6:1777. doi: 10.3389/fpsyg.2015.01777

# INTRODUCTION

Acquiring fluency in reading requires children to transform symbolic information provided by print into mental representations based on their prior language experience. This literacy acquisition relies heavily on the process of phonological awareness. In particular, children's ability to focus their attention on sub-syllabic phonological units within words is a critical factor for mastering the early challenge of alphabetic decoding. Phonological awareness has also been linked to neural mechanisms that help explain individual differences in early literacy (Schlaggar and McCandliss, 2007). Moreover, a growing number of studies have linked music skills and music training to differences in speech perception (Wong et al., 2007; François and Schön, 2011), basic auditory perception (Shahin et al., 2003; Hyde et al., 2009), and acquisition of a second language or an artificial language (Slevc and Miyake, 2006; Brod and Opitz, 2012). Basic auditory processing appears to be a building block of phonological awareness (Walker et al., 2006), and music training is associated with both superior auditory perception (Seither-Preisler et al., 2014) and enhanced language skills (see Patel, 2008, for a review).

Understanding the potential connection between music training and literacy skills is informed by two areas of research literature. The first is a well-established body of research showing that some language-related skills, such as phonological awareness, are a fundamental precursor of reading skills (see meta-analysis by Melby-Lervag et al., 2012), and the second is an emerging literature investigating the potential role of music training as an activity that may induce plastic changes and perceptual enhancements within neural systems crucial for reading (e.g., Kraus et al., 2014b). Learning to play an instrument or to sing requires a complex series of neural transformations in order to process fine-grained acoustic variations in timing, frequency, spectral characteristics, and intensity into musically relevant auditory-motor actions to create rhythm, pitch, timbre, and dynamics. The OPERA hypothesis (Patel, 2011, 2014) provides a framework for highlighting the multiple perceptual demands that musical training imposes and the benefits such demands may bestow on neural systems that are important for literacy and language skills. Together, these two literatures provide constraints on understanding pathways through which musical training may enhance early literacy acquisition.

A rapidly accumulating body of evidence has shown associations between language and music skills in children. For instance, 7- to 9-year-old musicians outperformed their nonmusician peers at detecting small prosodic (pitch) incongruities in sentences (Magne et al., 2006). Likewise, 9-year-old musicians (vs. non-musicians) showed enhanced brain responses and behavioral performance on detection of deviants in the voice-onset time, frequency, and duration of syllables (Chobert et al., 2011). Foreign language pronunciation skills and brain response to duration deviants (in music and speech) were better in 10- to 12-year-olds with musical training (Milovanov et al., 2009). Even without explicit music training, some of the variability in language skills can be accounted for by measuring individual differences in music aptitude. Measures of music aptitude have been found to account for over 40% of the variance in reading performance in typically developing 8- to 13-year-old children with little to no music training (Strait et al., 2011). Rhythm perception skills were robustly correlated with grammar production skills in 6-year-olds (Gordon et al., 2015b); a follow-up study of grammatical categories and musical rhythm revealed that musical rhythm explains production of complex sentence structure in particular (Gordon et al., 2015a).

Reading is one language skill that has received recent attention in the neuroscience community regarding potential shared neural resources with music. Anvari et al. (2002) showed that pitch and rhythm skills in 4- and 5-year-olds correlated with phonological awareness and early reading skills, converging with prior findings of a correlation between pitch discrimination and both phonemic awareness and early reading abilities in a similar age group (Lamb and Gregory, 1993). Musical rhythm in particular has been linked to reading skills in prior work using a wide variety of methods for measuring rhythm in young children, across many native languages. American-English-speaking preschoolers who excelled at synchronizing to an acoustic beat ("Synchronizers") outperformed their "Nonsynchronizer" peers at phonological awareness and rapid naming tasks (Woodruff Carr et al., 2014). A French study with a large sample size (n = 695) showed that kindergarteners' ability to reproduce musical rhythms was significantly predictive of their second grade reading skills (Dellatolas et al., 2009). Interestingly, Banai and Ahissar (2013) found a stronger relationship between reading and auditory processing skills in Israeli children without musical training, while the musician children in the study showed better auditory processing but no advantage in reading skills.

The relation between rhythm and reading-related skills continues to be significant in later stages of language development. Tierney and Kraus (2013b) found that beat tapping variability (to an isochronous metronome at a 2 Hz rate) negatively correlated with reading skills in adolescents, such that those who tapped to the beat more consistently were more likely to have better performance on the reading measures. Correlational studies in adults have shown that musicians have greater sensitivity to speech rhythm (Marie et al., 2011) and better reading-related skills (e.g., phoneme discrimination: Zuk et al., 2013a), and that individual differences in speech rhythm sensitivity are related to variability in musical aptitude when participants with a wide range across the continuum of musical abilities are studied (Magne et al., in revision). Over the course of aging, there is evidence that early musical training is associated with protection against age-related linguistic and cognitive declines (Parbery-Clark et al., 2011; Bidelman et al., 2014; Bidelman and Alain, 2015), even in adults with hearing loss (Parbery-Clark et al., 2013). However, as noted in Butera (2015), associations with musical training in these correlational studies cannot be interpreted in favor of causality in the absence of longitudinal data that rule out other genetic and environmental contributions to the observed findings of neural enhancements in individuals with musical training.

If enhanced language skills and musical skills are correlated, then would individuals with language disorders also have deficits in musical processing? Research on reading disabilities and language impairment suggests that this is often the case (e.g., Goswami, 2011; Gordon et al., 2015a). Seminal work by Overy (2003) revealed that a small group of children with reading disability improved their phonological awareness and spelling skills faster during an 8-week period of music instruction than during the same amount of time with no music training. Sensitivity to musical rhythm predicted significant variance in phonological awareness concurrently and longitudinally in 10-year-olds with dyslexia (Huss et al., 2011; Goswami et al., 2013). Difficulties processing the prosodic aspect of speech (i.e., variations in timing and pitch that mark linguistic events) are thought to be reflected in both musical deficits and weaknesses in phonological awareness (Goswami et al., 2010; Power et al., 2013; Leong and Goswami, 2014) in individuals with reading disabilities. Given these connections, musical practice holds promise as a tool to contribute to reading skills, potentially via a pathway of enhancing children's sensitivity to prosodic aspects of speech.

Correlational evidence does not, of course, exclude potential effects of self-selection or environmental and genetic differences that could alternatively account for enhanced language skills in musicians (Schellenberg, 2015). Evidence from longitudinal studies that administer a controlled and specific amount of musical training is crucial for investigating a possible causal influence of music on non-musical skills. The potential that music training could enhance reading skills is especially pertinent now that there are ongoing debates in educational systems about the most effective strategies for impacting academic achievement in the core curriculum. However, it is important to note that much of this work has focused on training-related brain changes (rather than behavioral outcomes); the significance for academic achievement of these modifications in brain activity is difficult to ascertain in the absence of reporting of behavioral gains in language skills (as discussed in Evans et al., 2014; Schellenberg, 2015). As reviewed in the present study, a considerable collection of controlled training studies has provided positive evidence for the hypothesis that musical training transfers to literacy-related skills. Taken as a whole, however, the range of studies published to date presents a rather mixed set of results, marked by a large range of potential outcome measures related to literacy skills. To assess and quantify the state of the evidence that may potentially support the hypothesis that musical training in children transfers into enhancements in literacy-related skills, we first set out to delineate the subset of peer-reviewed papers that directly address this issue via training and pre-/post-assessment designs.

A meta-analytic approach is useful in assessing the efficacy of music training for language outcomes and identifying the attributes of music training paradigms that are relevant to specific reading outcomes. The present meta-analysis is thus aimed at synthesizing previous research on music training and reading-related outcomes. The following research questions were examined:

(1) Does music training improve reading-related outcomes when other reading instruction is controlled for? Are certain aspects of learning how to read (i.e., reading fluency and phonological awareness) particularly susceptible to transfer from music training?


# MATERIALS AND METHODS

# Literature Search

### Search Strategies

The goal of this meta-analysis is to evaluate the effectiveness of musical interventions on reading-related measures. To find all articles that met our criteria, we conducted a literature search using the PubMed, Web of Knowledge, and ProQuest article databases. ProQuest functioned as a meta-database, allowing us to search 12 databases simultaneously: ERIC, International Index to Music Periodicals Full Text, Linguistics and Language Behavior Abstracts, MLA International Bibliography, ProQuest Education Journals, ProQuest Psychology Journals, ProQuest Research Library, ProQuest Science Journals, ProQuest Social Science Journals, PsycARTICLES, PsycINFO, and RILM Abstracts of Music Literature. The search terms used in each of the three searches are listed in Supplementary Table 1. The initial search was conducted in November 2013, and it was repeated/updated in March 2014. In total, the search returned 4855 articles, whose titles were screened for relevance to the topic. Additionally, to pass this first screening phase, an article could not be a conference presentation, thesis or dissertation, or trade newspaper or magazine article, and had to be written in English. This title screening narrowed the potentially relevant articles down to 178. The abstracts of these remaining articles were then reviewed for inclusion criteria and relevance. The criteria in this second phase of screening required that articles not be a review or meta-analysis, that they have a music intervention with a control group, and that they investigate reading-related outcomes.

### Inclusion and Exclusion Criteria

In our literature review, we defined inclusion and exclusion criteria based on meta-analysis guidelines for distinguishing features of studies (e.g., characteristics of the participants, key variables, research methods, and publication type; Lipsey and Wilson, 2001). Only articles that met the following criteria were included in the study:


with the National Reading Panel's standards for meta-analysis (Lonigan and Shanahan, 2009) and with previous meta-analyses on literacy education (e.g., Bus and van IJzendoorn, 1999).


Out of 178 studies that were reviewed at the abstract level (with full-text examination if necessary to determine inclusion based on the above criteria), 17 articles met these criteria. The types of interventions used and the contrasting control groups were found to vary substantially across the studies, with some showing confounds of uneven amounts of reading instruction across the groups or failing to provide more musical training to one of the groups. We thus added the following constraint to study design for inclusion:


After applying this final design constraint, an additional 5 studies were excluded (Register, 2001; Register et al., 2007; Bolduc, 2009; Darrow, 2009; Bhide et al., 2013) and only 12 papers still qualified, as listed in **Table 1** (Register, 2004; Gromko, 2005; Myant et al., 2008; Moreno et al., 2009, 2011; Yazejian and Peisner-Feinberg, 2009; Degé and Schwarzer, 2011; Herrera et al., 2011; Bolduc and Lefebvre, 2012; Cogo-Moreira et al., 2013; Moritz et al., 2013; Thomson et al., 2013). Herrera et al. presented results from two independent samples (each with its own control group) that received the same intervention; these were thus coded as two separate studies in our analysis, giving a final study count of k = 13 for the meta-analysis.
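The screening counts reported in this section can be tallied in a few lines as a consistency check. The numbers below are the ones stated in the text; the variable names are illustrative only.

```python
# Counts as reported in the Literature Search section (variable names are illustrative).
articles_returned = 4855           # total hits across the three database searches
abstracts_reviewed = 178           # articles remaining after title screening
met_initial_criteria = 17          # articles meeting the inclusion criteria
excluded_by_design_constraint = 5  # removed by the added study-design constraint

papers_retained = met_initial_criteria - excluded_by_design_constraint

# Herrera et al. (2011) reported two independent samples, coded as two studies,
# so the paper contributes one study beyond the paper count.
extra_independent_samples = 1
k_studies = papers_retained + extra_independent_samples
```

Here `papers_retained` evaluates to 12 and `k_studies` to 13, matching the counts given in the text.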

# Coding Procedures

### Procedure and Outcome Variables

A custom data entry system was created for the study using the Research Electronic Data Capture (REDCap) tools (Harris et al., 2009) hosted at Vanderbilt University (REDCap is a secure, web-based application designed to support data capture for research studies, providing an intuitive interface for validated data entry and automated export procedures for seamless data downloads to common statistical packages). All study characteristics and data were coded and entered into the custom forms.

The outcome measures used within these 13 studies are somewhat variable; each can be classified into one of the two broad categories of Reading Fluency and Phonological Awareness. For studies that reported more than one measure in an outcome category, we selected the measure that most directly tapped into the category. For Reading Fluency, measures that emphasized fluent use of known words and letters were chosen over those that used non-words. Within Phonological Awareness, two subcategories were identified: Rhyming and Other Phonological measures. For Rhyming, measures that involved discrimination of rhymes were chosen over those that involved producing rhymes. For Other Phonological, measures that involved identification, discrimination, or manipulation of phonemes were chosen over those that dealt with non-word reading fluency or syllabic segmentation. All measures included are reported in **Table 1**.

### Potential Moderating Variables

These 13 studies were then carefully coded for the following study design features, which are reported in **Tables 2**, **3**.


### TABLE 1 | Study characteristics.


Study information, primary language of participants, age, and outcome measures of studies included in the meta-analyses.

### TABLE 2 | Training components.


Hours of music training and components of the music intervention for each study.

# Statistical Analysis

### Effect Size Calculation

For each outcome and measure, a single effect size was computed in the following manner, where ES = effect size:

$$\text{ES} = \frac{(\text{Posttest Mean}_{Tx} - \text{Pretest Mean}_{Tx}) - (\text{Posttest Mean}_{Control} - \text{Pretest Mean}_{Control})}{\text{Pooled Pretest SD}}$$

$$\text{Pooled Pretest SD} = \sqrt{\frac{\text{Pretest SD}_{Tx}^{2} \, (N_{Tx} - 1) + \text{Pretest SD}_{Control}^{2} \, (N_{Control} - 1)}{N_{Tx} + N_{Control} - 2}}$$

### Data-analysis

Meta-analysis was performed using the open-source statistical software package R (R Core Team, 2015), employing the "metafor" package (Viechtbauer, 2010). Heterogeneity was computed as I<sup>2</sup> = residual heterogeneity divided by unaccounted variability, and H<sup>2</sup> = unaccounted variability divided by sampling variability (Higgins and Thompson, 2002). Meta-analysis was carried out using two different approaches: a random effects model for the separate analysis of each of the three outcome types (Reading Fluency, Rhyming, and Other Phonological outcomes), and a mixed effects model for the moderator analysis. A mixed effects model was also used for the broader All Phonological Outcomes category since it included non-independent samples from studies that included both Rhyming and Other Phonological Outcomes. Moderator analysis was used to test the influence of age, control intervention type, and number of training hours on the efficacy of music interventions. Given the relatively small number of studies included in the meta-analysis, it was not possible to test additional moderators for each component of training and level of random assignment.
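The effect size formula from this section, together with a basic random-effects pooling step, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' analysis: the study used R with the "metafor" package, whereas the pooling function below uses the DerSimonian-Laird estimator and Q-based I<sup>2</sup> and H<sup>2</sup> as a simple stand-in, which may differ from the estimator used in the paper. All function names and input numbers are hypothetical.

```python
import math


def pooled_pretest_sd(sd_tx, n_tx, sd_control, n_control):
    """Pool the two pretest SDs, weighting each variance by its degrees of freedom."""
    numerator = sd_tx ** 2 * (n_tx - 1) + sd_control ** 2 * (n_control - 1)
    return math.sqrt(numerator / (n_tx + n_control - 2))


def effect_size(pre_tx, post_tx, pre_control, post_control, pooled_sd):
    """Gain in the training group minus gain in the control group, in pooled pretest SD units."""
    return ((post_tx - pre_tx) - (post_control - pre_control)) / pooled_sd


def random_effects_dl(effects, variances):
    """Random-effects pooling with the DerSimonian-Laird estimator.

    Assumption: DerSimonian-Laird is chosen for simplicity and is not
    confirmed as the estimator used in the original analysis.
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")
    weights = [1.0 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))  # Cochran's Q
    df = k - 1
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at zero
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # percent
    h2 = q / df
    return {"pooled": pooled, "tau2": tau2, "Q": q, "I2": i2, "H2": h2}


# Hypothetical single study: both groups start at 10; training gains 4, control gains 2.
sd = pooled_pretest_sd(sd_tx=4.0, n_tx=20, sd_control=4.0, n_control=20)
es = effect_size(pre_tx=10.0, post_tx=14.0, pre_control=10.0, post_control=12.0, pooled_sd=sd)
```

With these hypothetical inputs the pooled pretest SD is 4 and the effect size is 0.5. The sampling variance of each effect size, which `random_effects_dl` takes as an input, is not specified in this section and would have to be supplied separately.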

### RESULTS

### Characteristics of the Studies Included

Publication information, language, age of participants, and outcomes measured are reported in **Table 1**. Participants ranged in mean age from 4.53 to 9.33 years, with a weighted mean of 6.25 years. Participants identified with a wide range of native languages (English, Portuguese, German, French, Spanish, and Tamazight). The components of music training are reported in **Table 2** and varied greatly across studies; total hours of training ranged from 3 (Thomson et al., 2013) to 90 (Moritz et al., 2013). Many studies included singing (k = 12), rhythm (k = 9), instruments (k = 7), and movement/kinesthetics (k = 8); fewer than half used phonology in a music context (k = 6), rhyming (k = 5), clapping/marching (k = 5), or visual representations of musical concepts (k = 5); and only k = 3 included music notation.

### TABLE 3 | Study controls.


This table reports population, IQ, SES, type of assignment, and control interventions for each study.

Several aspects of control factors in the study design are reported in **Table 3**. All but two studies (Cogo-Moreira et al., 2013; Thomson et al., 2013) were conducted with typically developing children. IQ was reported as equivalent across groups in k = 9 studies, and SES was reported as equivalent across groups in only k = 6 studies. Many different types of group assignment were found, and only k = 6 studies used "true" student random assignment. The remaining studies assigned preexisting classes (or schools) to different treatment conditions. Control interventions included k = 3 studies in which the control group received phonological training, k = 3 studies with non-auditory control activities such as art or sports, k = 6 studies with no special extra-curricular activities (no-treatment control), and one study where the control group also received music lessons but to a much lesser extent ("less intensive music" control).

### Effect Sizes

Means, standard deviations, pre- and post-training, N's per group, and the computed effect sizes are reported in **Table 4**. Given that this meta-analysis was designed to investigate (1) how music training affects different types of reading-related measures; and (2) how selected aspects of study design (age of participants, hours of training, and type of control intervention) would moderate outcomes, the choice to limit the moderator analysis to these three moderator variables was also constrained by the statistical power of conducting meta-regression on only a small number of studies that met the criteria. Thus, meta-analyses were computed separately on reading fluency and phonological awareness, and moderator analyses tested the influence of each of the abovementioned factors on the outcomes.

### TABLE 4 | Effect sizes.

Means and SD's for each group, and effect sizes, are listed for each study (grouped by outcome category type).

# Meta-analysis Results for Phonological Awareness

Due to the non-independence of the studies that reported both types of phonological awareness outcomes (Rhyming and Other Phonological) in the same sample, mixed effects analysis was employed to test overall Phonological Awareness. This analysis on All Phonological Awareness (k = 18) revealed an effect size of 0.20 (95% CI [0.04, 0.36], p = 0.01), indicating small but significant gains of music training on phonological skills (see the forest plot in **Figure 1**). The test for heterogeneity [Q(df = 17) = 28.8, p = 0.04] was significant, indicating potential influence of other factors. To investigate these factors and their relation with moderators, phonological outcomes were then further broken down into two separate categories corresponding to Rhyming and Other Phonological outcomes (see Methods section for more information on how measures/outcomes were chosen).
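As a reading aid, the reported effect size, confidence interval, and p-value can be checked against one another. The sketch below assumes a symmetric normal (Wald-type) interval, which may not be exactly how the original software computed it. The helper name `wald_summary` is hypothetical.

```python
import math

Z_95 = 1.959964  # two-sided 95% critical value of the standard normal


def wald_summary(estimate, ci_lower, ci_upper):
    """Recover the standard error, z statistic, and two-sided p from a 95% CI."""
    se = (ci_upper - ci_lower) / (2 * Z_95)
    z = estimate / se
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p_two_sided


# Values reported for All Phonological Awareness: ES = 0.20, 95% CI [0.04, 0.36].
se, z, p = wald_summary(0.20, 0.04, 0.36)
print(f"SE = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

Under this assumption the interval implies a standard error of about 0.08 and a p-value of about 0.014, which agrees with the reported p = 0.01 after rounding.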

### Rhyming Outcomes

Random-effects analysis on the subset of rhyming outcomes (k = 7 studies) yielded a weighted average effect size of 0.18 (95% CI [−0.06, 0.42]), which was non-significant at p = 0.14. A mixed effects analysis then revealed no significant influence of age (p = 0.31) or control intervention type (p = 0.75) on the results, but a significant influence (p = 0.04) of training hours on rhyming outcomes. These results suggest that an increase in the length of training by 1 h corresponds to an increase of 0.01 (95% CI [0, 0.03]) in the effectiveness of music intervention on rhyming outcomes. The results of this model were then used to predict values of effectiveness given different amounts of training hours. Using the range of training hours across all studies in the meta-analysis (3–90 h), and assuming a constant age (5 years) and constant control intervention type, the model predicts that at least 40 h of training are needed to have a significant effect on Rhyming outcomes, as shown in **Figure 2**. These results should be interpreted with caution, given that the study showing the strongest positive relationship between hours of training and rhyming outcomes (Moritz et al., 2013) had only 15 participants in each group.
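The logic of predicting effectiveness from training hours, and of finding the smallest number of hours at which the predicted effect is significant, can be sketched as follows. This is a simplified illustration, not the published model. It fits a fixed-effect weighted regression with training hours as the only moderator, whereas the reported model was a mixed effects model that also included age and control intervention type. All data values and function names are hypothetical.

```python
import math


def fit_weighted_line(hours, effects, variances):
    """Weighted least squares for effect = b0 + b1 * hours, with weights 1 / variance."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, hours))
    sy = sum(wi * y for wi, y in zip(w, effects))
    sxx = sum(wi * x * x for wi, x in zip(w, hours))
    sxy = sum(wi * x * y for wi, x, y in zip(w, hours, effects))
    det = sw * sxx - sx ** 2
    # Coefficients, followed by the entries of the covariance matrix (X' W X)^-1.
    return {"b0": (sxx * sy - sx * sxy) / det,
            "b1": (sw * sxy - sx * sy) / det,
            "var_b0": sxx / det, "var_b1": sw / det, "cov": -sx / det}


def predict(fit, h, z_crit=1.96):
    """Predicted effect at h training hours with an approximate 95% CI."""
    estimate = fit["b0"] + fit["b1"] * h
    se = math.sqrt(fit["var_b0"] + h * h * fit["var_b1"] + 2 * h * fit["cov"])
    return estimate, estimate - z_crit * se, estimate + z_crit * se


def first_significant_hours(fit, max_hours=90):
    """Smallest whole number of hours whose lower CI bound exceeds zero, if any."""
    for h in range(0, max_hours + 1):
        if predict(fit, h)[1] > 0:
            return h
    return None


# Hypothetical study-level data (hours, effect size, sampling variance).
hours = [3, 20, 40, 60, 90]
effects = [0.02, 0.15, 0.35, 0.50, 0.80]
variances = [0.05, 0.04, 0.06, 0.05, 0.13]
fit = fit_weighted_line(hours, effects, variances)
print(fit["b1"], first_significant_hours(fit))
```

The threshold returned by `first_significant_hours` depends entirely on the slope and on the precision of the included studies, which is one reason a single small study with many training hours can strongly influence such an estimate.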

### Other Phonological Outcomes

Random effects analysis on Other Phonological outcomes (k = 11) yielded an average effect size of 0.20 (95% CI [−0.03, 0.42]), which weakly trended toward significance (p = 0.08). A mixed effects analysis revealed no significant influence of age (p = 0.24), control group type (p = 0.34), or training hours (p = 0.09) on the model. Heterogeneity was moderate (I<sup>2</sup> = 40.2%; H<sup>2</sup> = 1.67) but residual heterogeneity did not reach significance [QE(df = 7) = 11.89, p = 0.10].

# Meta-analysis Results for Reading Fluency

Random effects analysis on the five studies that included Reading Fluency outcomes showed a weighted average effect size of 0.16 (95% CI [−0.03, 0.35], p = 0.10), thus showing only a weak trend toward significance of music intervention on reading fluency. Results are shown in **Figure 3**. Heterogeneity was low (I<sup>2</sup> = 0%; H<sup>2</sup> = 1), and given the small number of studies (k = 5), moderator analysis was not pursued.
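A random effects pooled estimate with the associated heterogeneity statistics can be sketched in a few lines. The version below uses the DerSimonian-Laird estimator of between-study variance because it has a simple closed form. This is an assumption made for illustration, since the estimator used in the original analysis is not stated here. Function and variable names, as well as the input values, are hypothetical.

```python
import math


def random_effects_dl(effects, variances):
    """Pool effect sizes using the DerSimonian-Laird between-study variance."""
    df = len(effects) - 1
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)
    # Random effects weights add the between-study variance to each study.
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * y for wi, y in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return {"pooled": pooled, "se": se,
            "ci": (pooled - 1.96 * se, pooled + 1.96 * se),
            "q": q, "tau2": tau2,
            "i2": 100.0 * (q - df) / q if q > df else 0.0,
            "h2": max(1.0, q / df)}


# Hypothetical effect sizes and sampling variances for five studies.
result = random_effects_dl([0.10, 0.25, 0.05, 0.30, 0.15],
                           [0.04, 0.06, 0.05, 0.09, 0.03])
print(result["pooled"], result["i2"], result["h2"])
```

The two heterogeneity statistics are linked by H<sup>2</sup> = 1 / (1 − I<sup>2</sup>/100), so I<sup>2</sup> = 0% corresponds to H<sup>2</sup> = 1 and I<sup>2</sup> = 40.2% corresponds to H<sup>2</sup> of roughly 1.67, matching the pairs of values reported in these results.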

### Test for Publication Bias

The Rank Correlation Test for Funnel Plot Asymmetry indicated no publication bias for either Reading Fluency (Kendall's tau = 0.60, p = 0.23) or Phonological Awareness (Kendall's tau = 0.18, p = 0.33).
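The rank correlation approach asks whether standardized effect sizes are associated with their sampling variances, using Kendall's tau. The sketch below is a minimal pure-Python version that assumes no tied values and computes an exact p-value by enumerating permutations, which is only practical for a small number of studies. The function names are hypothetical, and the original analysis may have computed the p-value differently.

```python
import math
from itertools import permutations


def standardized_deviates(effects, variances):
    """Deviation of each effect from the fixed-effect pooled estimate, standardized."""
    w = [1.0 / v for v in variances]
    pooled = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    pooled_var = 1.0 / sum(w)
    return [(y - pooled) / math.sqrt(v - pooled_var)
            for y, v in zip(effects, variances)]


def kendall_s(x, y):
    """Number of concordant pairs minus number of discordant pairs."""
    s = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            product = (x[i] - x[j]) * (y[i] - y[j])
            s += (product > 0) - (product < 0)
    return s


def kendall_tau_exact(x, y):
    """Kendall's tau and an exact two-sided p-value (assumes no ties, small n)."""
    n = len(x)
    if n > 9:
        raise ValueError("exact enumeration is only practical for small n")
    s_obs = kendall_s(x, y)
    base = list(range(n))
    total = extreme = 0
    for perm in permutations(base):
        total += 1
        if abs(kendall_s(base, perm)) >= abs(s_obs):
            extreme += 1
    return s_obs / (n * (n - 1) // 2), extreme / total


# Hypothetical effects and variances for five studies.
deviates = standardized_deviates([0.10, 0.25, 0.05, 0.30, 0.15],
                                 [0.04, 0.06, 0.05, 0.09, 0.03])
print(kendall_tau_exact(deviates, [0.04, 0.06, 0.05, 0.09, 0.03]))
```

With five studies there are only ten pairs, so tau can take few distinct values. For example, two discordant pairs out of ten give tau = 0.60 with an exact two-sided p of 28/120, or about 0.23, which is consistent with the values reported for Reading Fluency and illustrates how little power the test has at k = 5.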

### DISCUSSION

The current meta-analysis was carried out to assess the impact of music intervention on reading-related skills in children, and adds to the literature by specifically highlighting effects of music training transferring to reading-related skills when non-musical reading training is held constant. Results of the meta-analysis on the broad category of Phonological Awareness outcomes suggest modest gains (a small effect size of d = 0.20) for music vs. control groups. This finding is in line with a number of other studies showing better phonological awareness skills in musicians compared to their non-musician peers (Forgeard et al., 2008; Zuk et al., 2013b), and also converges with work showing correlations between music aptitude and phonological skills in children (Lamb and Gregory, 1993; Anvari et al., 2002; Peynircioglu et al., 2002; Dellatolas et al., 2009; Tierney and Kraus, 2013a).

When broken down into subcategories (Rhyming and Other Phonological outcomes), moderator analysis revealed that the effectiveness of music intervention on Rhyming outcomes was dependent on the number of training hours. Total music intervention training hours ranged between 3 and 90 h in the studies included here, and the model estimated that at least 40 h are needed to improve Rhyming skills. To put this number in perspective, other work (e.g., Hambrick et al., 2014) has shown that thousands of hours are typically involved in reaching adult levels of musical expertise. Consideration of how children's music training improves rhyming skills must assess the possibility that results could merely reflect the inclusion of greater rhyming practice within the music interventions relative to the control conditions. Indeed, early childhood music education in group settings typically includes activities such as singing and chanting rhyming lyrics. However, several aspects of the studies that support the positive transfer effect for rhyming outcomes suggest that this effect cannot be entirely attributed to this explanation. First, it should be noted that the study with the strongest positive relationship between rhyming outcomes and hours of training (Moritz et al., 2013) reported no rhyming-related training activities, and rather emphasized rhythmic aspects of musical training. Furthermore, of the seven studies with rhyming outcomes, only four were coded as including any report of rhyming training (see **Table 2**). These results, taken together with reports of robust associations between musical rhythm skills and rhyme awareness, in both children with typical development and reading disabilities (e.g., Huss et al., 2011), suggest that other aspects of musical training may impact rhyming skills. Future work is needed to draw more definitive conclusions regarding whether intensive rhythm training can improve rhyming and phonological skills in general, given the links between rhythm and reading skills in the literature (e.g., Strait et al., 2011).

The separate meta-analysis on eleven datasets with Other Phonological Outcomes was inconclusive: the effect size was small (d = 0.20) and only trended toward significance, with no moderators (age, control intervention type, or training hours) reaching significance. This pattern of results could potentially be due to variability in the many different types of phonological tasks that were included in this category (e.g., Initial Phoneme Oddity, Alliteration, and Spoonerisms; see **Table 1**) or even to the wide variety of native languages spoken by participants. Further study is needed to determine if certain phonological skills are more susceptible to a positive transfer from music training than others.

The effect size for the separate meta-analysis assessing the impact of music training on reading fluency outcomes was also small (d = 0.16) and did not reach significance; moderator analysis was precluded because only five studies fell in this category. However, it should be pointed out that two of the studies (Cogo-Moreira et al., 2013; Thomson et al., 2013) were on children with reading disabilities, and while there are solid theoretical reasons (see Overy, 2003; Tierney and Kraus, 2013a) to believe that music training could improve reading skills in struggling readers, the intensity of the intervention would likely be an important factor in such attempts.

Moreover, previous meta-analyses with different parameters than the present study have found both a non-significant effect of music on reading skills (Butzlaff, 2000) and significant effects (Standley, 2008). The present study extends these results by including data from additional studies published between 2008 and 2014, and by limiting the scope of studies included to a more rigorously defined comparison, for which reading instruction is controlled across groups. The data quality and variability of study outcomes and confidence intervals are comparable to studies included in other meta-analyses on literacy education (e.g., Lonigan and Shanahan, 2009) and this heterogeneity should be taken into account in the interpretation (as discussed below).

It is interesting to note that a previous meta-analysis on literacy development found medium-to-strong effects of phonological awareness training on reading skills (yet longer term studies produced only small effects), and that phonological awareness was a necessary, but not sufficient condition for reading (Bus and van IJzendoorn, 1999). One could hypothesize that music skills share more variance with phonological skills (due to their auditory bases) than with reading fluency skills, and thus music training may have larger effects on phonological awareness than on reading. Nonetheless, it is also possible that music training could impact reading fluency via a more gradual pathway: beginning more generally by improving auditory discrimination, then affecting rhyming skills and using them to bootstrap further phonological awareness. More intensive training may be needed for these improvements to occur at a level that produces measurable improvements in reading fluency across heterogeneous participant populations.

Overall, the findings of the current meta-analyses are somewhat inconclusive with regard to the hypothesized impact of music education on reading-related skills. The literature search revealed a large amount of variability in outcomes studied, content and intensity of music training, native language of participants, type of subject populations (typically developing vs. reading disordered) and age of participants. In addition, some of the study designs in the set of studies included in this meta-analysis are laden with potential biases that make it difficult to draw broader conclusions from the findings (see **Table 3**). These inconsistencies include variability in control group activities and a lack of information about IQ differences or equivalence across groups; in addition, only 6 studies of 12 reported controlling for socio-economic status across groups. Importantly, most of the studies were quasi-experimental and did not use random assignment to create treatment and control groups. In the case of studies that compared a class (or school) receiving the intervention vs. another control class or school, it is possible that teacher/student dynamics and educational environment differed across the groups (and therefore either diminished or exaggerated the gains from music training). Although we were able to code and report many of the above characteristics, there were too few studies included in the total meta-analysis to allow a sufficiently powered moderator analysis that would effectively shed light on whether these study characteristics were linked with different trends of results. Thus, the limitations of the present meta-analysis are the heterogeneity of approaches and study designs used, and that the dataset was underpowered to test all of the potentially influential moderator variables that were coded.
Nevertheless, it is interesting to note that all three of the studies (Moreno et al., 2009, 2011; Degé and Schwarzer, 2011) in which SES and IQ were equivalent, and student random assignment was used, also showed large effect sizes on at least one reading-related outcome, indicating a robustness of music training efficacy for improving reading-related skills under methodologically sound circumstances. The quality and breadth of all studies included in the present meta-analyses also provides complementary information to results of a prior meta-analysis on the impact of music on reading skills (i.e., Standley, 2008) in which aspects of the music training may have confounded the findings (e.g., some studies included in their meta-analysis included contrasts where both groups received different types of music training and whether a given group got more music training was unclear). Suggestions for creating a standard of implementation steps for reducing heterogeneity and bias are summarized in **Table 5**.

Moreover, the small effects of music on reading-related outcomes observed in this meta-analysis stand in contrast to the robust results seen in the correlational literature reporting (broadly defined) linguistic advantages in musician children (Magne et al., 2006; Chobert et al., 2011) and adults who had musical training as children (Skoe and Kraus, 2012). One key difference is that the correlational studies tend to include children who have already had several years of individual instrumental instruction, whereas the intervention studies included here have shorter and less intense music training, and all were conducted in a group rather than individual setting. It could be that the music training in a group setting is less demanding and therefore less likely to make a large impact in terms of transferring to language skills (see OPERA hypothesis for a theoretically driven set of criteria for plasticity; Patel, 2011). Nevertheless, Hyde et al. (2009) showed neural plasticity and improvements in auditory and motor tasks, along with structural brain changes in auditory and motor areas, after 15 months of music training on an instrument. Furthermore, other experimental studies administering group-setting music training to participants randomly assigned to a music group (vs. a non-music control group, i.e., Moreno et al., 2009; Chobert et al., 2014) also found transfer to language perception skills; thus, individual instruction does not appear to be a pre-requisite for music-training-driven improvements in language skills. However, less is known about whether individual lessons and intensive instruction on an instrument are needed to improve reading-related skills.

The literature review encompassed by the present study revealed two somewhat opposing trends: on the one hand, an approach that favors the contextual use of music as a fun and motivational context to teach reading and other skills (Standley and Hughes, 1997; Standley, 2008; Darrow, 2009); and on the other hand, an auditory neuro-development framework that attributes music-training-related language gains primarily to auditory neural plasticity (Kraus and Chandrasekaran, 2010; Patel, 2011). In the "contextual" approach, phonological awareness and other literacy skills are taught in a musical context: for example, one intervention was described as teaching "literacy skills such as rhyming, letter sounds, vocabulary, or decoding sounds that were accompanied by a chant or song; children's storybooks that were either read or sung or accompanied by the students on musical instruments as they recognized a previously identified vocabulary word; rearrangement of storybook parts with students asked to put the story pages in order and to retell the story in their own words" (Darrow, 2009, p. 14). Use of nursery rhymes is common and constitutes the foundation of one of the intervention curricula described in a study in the meta-analysis (Bolduc and Lefebvre, 2012). A number of studies have specifically targeted literacy skills within the music training, with musical activities designed to increase print awareness (Standley and Hughes, 1997); letter-naming, letter-sound correspondence, and word building (Register, 2004); and decoding (Register et al., 2007). Interestingly, in many of the contextual studies, music is thought of as a positive reinforcer of reading-related exercises, and little mention is made of the auditory system or its physiological underpinnings.

In contrast, the auditory neurodevelopment framework posits that music training strengthens basic auditory and speech processing, which in turn influence phonological perception and reading skills. These gains have been described as domain-general improvements in auditory brain mechanisms underlying temporal and frequency resolution, auditory processing, and phonological awareness (Tierney and Kraus, 2013a). Experience-based plasticity of brain networks involved in language acquisition is a plausible explanation for the putative transfer of music training to language and literacy skills (reviewed in Kraus and Chandrasekaran, 2010). Randomized study designs conducted with neuro-imaging methods have shown that music lessons (in typically developing children) enhance neural responses to voice-onset-times and syllable durations (Chobert et al., 2014), detection of pitch variations in speech (Moreno et al., 2009), speech segmentation skills (François and Schön, 2011), and discrimination of consonants (Kraus et al., 2014b). Moreover, an association between brain responses to syllables (using the complex Auditory Brainstem Response method) and degree of active engagement (i.e., better classroom participation and attendance) in a music program suggests that the amount of training and level of engagement are important factors in music-training-driven plasticity (Kraus et al., 2014a).

Another important aspect of the neurodevelopmental framework, thus far not definitively investigated in the literature, is that individual differences in innate (or pre-existing) musical traits may differentially affect music-training-driven plasticity and transfer to language skills. The extant literature does suggest that the relationship between language and music skills varies with different levels of music aptitude (Banai and Ahissar, 2013) and that pre-existing genetic differences likely account for some variation in level of music achievement attained (reviewed in Schellenberg, 2015). Given that individual differences in music abilities can predict some aspects of linguistic competence, even in non-musician children (Strait et al., 2011; Woodruff Carr et al., 2014; Gordon et al., 2015b), taking these individual differences into account could potentially provide a significant path to predicting response to music intervention. In this vein, Seither-Preisler et al. (2014) propose a fascinating neurocognitive model of competence development that would account for the interaction between pre-dispositions and intervention efficacy by modeling plasticity and anatomical influences on music development. They found that the size of the right Heschl's Gyrus significantly predicted variance in the amount of time that children spent practicing their instruments; the authors interpret this finding as evidence that this particular neurophysiological morphology interacts with motivational factors that determine the amount of time/effort devoted to music. More generally, it is theoretically conceivable that a subset of children has a particular brain architecture that pre-disposes them to faster musical growth and more efficient transfer to language skills; while others may have neural substrates that respond better to other types of language interventions (e.g., phonological only). 
Continued investigation of these and other hypotheses regarding individual differences may turn out to reduce heterogeneity of findings in future individual studies and meta-analyses on the topic of music-training-driven changes in neural and cognitive activity.

The mixed results obtained in the current meta-analysis could instead signify possible limitations of music training for literacy skills in children. Such an interpretation would be in accordance with previous accounts of modularity of some aspects of language and music (Peretz, 2006). For instance, Peretz et al. (2015) argue that studies showing "neural overlap" of music and language in brain areas do not necessarily indicate that the same neuronal populations within a given brain area are active for both musical and speech processing. Moreover, it is important to bear in mind that small or non-robust effects of transfer from training to another skill are not unusual in the context of the larger literature on skills transfer. Many of the same methodological challenges (e.g., control group selection) encountered in the current meta-analysis are cited as prevalent issues for the skill-learning field much beyond music and language (Green et al., 2014). To this point, Green and Bavelier (2008) state "in the field of skill learning, transfer of learning from the trained task to even other very similar tasks is generally the exception rather than the rule." Bransford and Schwartz (1999) suggest that the difficulty in finding consistent results of skills transfer stems in part from the idea that assessments of current knowledge generally do not capture the dynamics of the learning process. In the current meta-analysis, evidence that music training (that in some cases involves rhyming materials) has impacted performance on a standardized test of pre-reading skills (that has different surface features, cues, and demand characteristics) has crossed a substantial hurdle in establishing skills transfer; thus, even small gains should not be considered trivial.

To develop a full picture of the extent of transfer from music experiences to language skills and the possible applicability of the neuro-developmental framework, more work is also needed on the underlying mechanisms of music-related improvements in language when they are reported (either in individual studies or future meta-analyses). These effects could potentially be due to all-around, general acoustic perception/auditory processing skills (affecting perception of pitch, timing, and spectral characteristics); or, the benefits may be only specific to certain aspects of phonology such as fine-tuned detection of voice-onset-time (Zuk et al., 2013b), or perception of prosodic patterns on the supra-syllabic level. Indeed, a growing number of studies have linked speech rhythm sensitivity to early literacy skills. Sensitivity to stress patterns in spoken language is correlated with emerging reading skills in early readers (ages 5–7; Holliman et al., 2008; Goswami et al., 2010), and predicts later reading development (Holliman et al., 2010). Struggling readers are also more likely to show weaknesses in perception of speech rhythm (Holliman et al., 2012) and musical rhythm (Huss et al., 2011; Flaugnacco et al., 2014). The temporal sampling theory (Goswami, 2011), along with work on neural oscillations involved in speech comprehension (Luo and Poeppel, 2007; Abrams et al., 2008; Hickok, 2012), converge in their explanation of a temporal scaffolding created by low-frequency stress patterns that facilitates acquisition and comprehension of higher-frequency (e.g., phonetic) information in the speech signal. These mechanisms may be shared by musical rhythm skills (Gordon et al., 2011; Hausen et al., 2013; Hickok et al., 2015; Morillon and Schroeder, 2015). Recent work translating related concepts of rhythm entrainment from dynamic attending theory to speech perception (Schön and Tillmann, 2015) suggests that even short-term rhythmic stimulation can impact phonological processing. A general deficit in these mechanisms of rhythm sensitivity could hinder acquisition of language and literacy skills (e.g., Leong et al., 2011; Power et al., 2013); individual differences in rhythm sensitivity could possibly mediate response to treatment, and should be taken into account. Likewise, the role of auditory working memory in music-training-driven plasticity is not yet well-understood (Kraus et al., 2012; Ramachandra et al., 2012; Tierney and Kraus, 2013b) and should be accounted for in future intervention studies. **Table 5** summarizes potential questions to be addressed in future work.

The present meta-analysis contributes to the literature by examining the influence of music training on reading-related skills while also constraining the amount of reading instruction received across groups and modeling potentially important moderators (age, hours of training and type of control intervention). The findings yielded modest gains in phonological awareness (mainly in rhyming skills) for music vs. control interventions, but the small subset of studies examining reading fluency skills found no significant aggregate improvements in music vs. control groups. The literature review synthesized results from previous work suggesting potential benefits of music training on non-musical academic skills (e.g., Patel, 2011), supported by some evidence for a transfer from music training to rhyming and phonological awareness skills yielded by the present meta-analysis. This approach has also laid some groundwork for exploring specific aspects of the relationship between reading and music, which may take place in part through enhancement to perception of rhyming. This finding converges with the hypothesis that music supports phonological awareness; further study is needed to determine if intensive and long-term music training can enhance reading fluency via improvements to auditory skills, phonological awareness, and rhyming in particular. Given the limitations discussed here of the work included in this meta-analysis and the potential factors to address (summarized in **Table 5**), further investigation of a positive transfer from music education to reading-related skills is warranted. These investigations should eventually be considered in light of current trends in educational policy to cut funding for arts education (Kratus, 2007), such as when music lessons are eliminated in order to increase instructional time and resources for core subjects.

To draw definitive conclusions on a causal link from music to literacy and possible mediating mechanisms, there is abundant room for further progress in using longitudinal studies to address both the study design factors and the potential moderators of music-training-driven plasticity in reading-related skills. Brain imaging methods may reveal mechanisms underlying this plasticity, and can potentially be exploited to establish innovative approaches for predicting individual differences in response to music training. Recent work linking rhythmic processing to speech sound sensitivity and literacy skills suggests candidate mechanisms for improving reading skills via music education, and warrants further investigation in the context of using music training to remediate reading disabilities in school-age children. Future longitudinal studies incorporating both behavioral reading-related outcomes and measures of neural plasticity in typically developing and struggling readers are also needed in order to assess the viability of the neuro-developmental framework for music interventions.

### ACKNOWLEDGMENTS

This project was funded in part through NIH award R01DC007694 to BM. The RedCAP system, made possible through UL1 TR000445 from NCATS/NIH to the Vanderbilt Institute for Clinical and Translational Research, was used for coding and secure data storage. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors would like to gratefully acknowledge Sandra Wilson and Gloria Han for insightful comments on the approach, McKenzie Miller for assistance with coding and data entry, Alison Williams for formatting assistance, and Rita Pfeiffer and two reviewers for feedback and helpful comments on the manuscript. We would also like to thank Noreen Yazejian, Cathy Moritz, Georgios Papadelis, Sylvain Moreno, San Luis Castro, Mireille Besson, Franziska Degé, Hugo Cogo-Moreira, Dena Register, and Wendy Armstrong for providing data and/or additional information about their studies.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2015.01777

### REFERENCES


and atypical language development. Ann. N.Y. Acad. Sci. 1337, 16–25. doi: 10.1111/nyas.12683


cognitive abilities and hearing speech in noise. PLoS ONE 6:e18082. doi: 10.1371/journal.pone.0018082


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Gordon, Fehd and McCandliss. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

∗ Studies included in the meta-analyses are indicated with asterisks.

# **Sound frequency affects speech emotion perception: results from congenital amusia**

### *Sydney L. Lolli, Ari D. Lewenstein, Julian Basurto, Sean Winnik and Psyche Loui\**

*Department of Psychology, Program in Neuroscience and Behavior, Wesleyan University, Middletown, CT, USA*

Congenital amusics, or "tone-deaf" individuals, show difficulty in perceiving and producing small pitch differences. While amusia has marked effects on music perception, its impact on speech perception is less clear. Here we test the hypothesis that individual differences in pitch perception affect judgment of emotion in speech, by applying low-pass filters to spoken statements of emotional speech. A norming study was first conducted on Mechanical Turk to ensure that the intended emotions from the Macquarie Battery for Evaluation of Prosody were reliably identifiable by US English speakers. The most reliably identified emotional speech samples were used in Experiment 1, in which subjects performed a psychophysical pitch discrimination task, and an emotion identification task under low-pass and unfiltered speech conditions. Results showed a significant correlation between pitch-discrimination threshold and emotion identification accuracy for low-pass filtered speech, with amusics (defined here as those with a pitch discrimination threshold >16 Hz) performing worse than controls. This relationship with pitch discrimination was not seen in unfiltered speech conditions. Given the dissociation between low-pass filtered and unfiltered speech conditions, we inferred that amusics may be compensating for poorer pitch perception by using speech cues that are filtered out in this manipulation. To assess this potential compensation, Experiment 2 was conducted using high-pass filtered speech samples intended to isolate non-pitch cues. No significant correlation was found between pitch discrimination and emotion identification accuracy for high-pass filtered speech. Results from these experiments suggest an influence of low frequency information in identifying emotional content of speech.

### **Keywords: amusia, tone-deafness, pitch, filtering, speech, emotion, frequency**

### *Edited by:*

*Edward W. Large, University of Connecticut, USA*

### *Reviewed by:*

*Erin E. Hannon, University of Nevada, Las Vegas, USA*

*Sébastien Paquette, University of Montréal, Canada*

### *\*Correspondence:*

*Psyche Loui, Department of Psychology, Program in Neuroscience and Behavior, Wesleyan University, 207 High Street, Middletown, CT 06459, USA; ploui@wesleyan.edu*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

*Received: 08 April 2015; Accepted: 20 August 2015; Published: 08 September 2015*

### *Citation:*

*Lolli SL, Lewenstein AD, Basurto J, Winnik S and Loui P (2015) Sound frequency affects speech emotion perception: results from congenital amusia. Front. Psychol. 6:1340. doi: 10.3389/fpsyg.2015.01340*

# **Introduction**

Pitch is a perceptual attribute of sound that allows us to order sounds on a frequency-related scale. It is an integral component of auditory processing, including music and language. Across all spoken languages, pitch is one of several cues used to convey emotional prosody, and in some languages (tone languages) pitch is also used to convey meaning in words. Understanding how pitch perception affects our interpretation of speech is essential to fully comprehend the ways in which we communicate emotion through language.

Amusic, or "tone-deaf," individuals are limited in their ability to perceive and produce pitch (Peretz et al., 2002; Hyde and Peretz, 2004; Vuvan et al., 2015). Though amusia is traditionally thought of as a music-specific disorder, studies have shown that it may also affect perception of speech. In common-practice Western music, pitches typically vary by a minimum of one semitone. In language, intonation patterns that help us discriminate between statements and questions are characterized by pitch differences that range from 5 to 12 semitones, and occur primarily at the conclusion of a speech fragment (Hutchins et al., 2010). By contrast, pitch changes that reflect prosody in emotional speech lie somewhere between one and five semitones, and occur over the course of a speech fragment, suggesting that pitch variations in emotion expression are harder to detect than question-statement differences (Dowling and Harwood, 1986).

Consistent with this hypothesis, Hutchins et al. (2010) showed that when asked to discriminate between statements and questions, amusics performed as well as controls. However, when asked to judge whether the same stimuli ended with a rising or falling contour, amusics were significantly less accurate and consistent, suggesting a deficit of pitch awareness in amusics. Though amusic subjects self-reported no difficulties during day-to-day speech processing, Jiang et al. (2012) found that amusics' brain activity was not reliably elicited in response to pitch changes of one semitone in speech [this is in contrast to some early processing of small pitch changes without conscious awareness in music (Peretz et al., 2005, 2009)]. Also, Nguyen et al. (2009) observed some decreases in sensitivity to pitch inflections found in a tonal language among amusic non-tonal language speakers. Although results from amusics are task-dependent and do overlap with non-amusic controls, studies generally show that amusics have some impairments in speech intonation processing, extending the effects of the disorder beyond music. Other studies have shown that amusics self-report difficulty detecting certain nuances in speech, such as sarcasm, and that they struggle to judge emotional content of speech as accurately as non-amusics (Thompson et al., 2012). In addition, individuals with amusia-like deficiencies report difficulty in determining emotion solely from speech, and may rely more on facial expressions and gestures than control subjects do (Thompson et al., 2012). Though there are other cues in emotional communication that are available to amusics, limitations in the ability to perceive pitch clearly contribute to deficiencies in emotional speech perception.

It has been hypothesized that deficiencies may only be noticeable when amusics are presented with very subtly different stimuli. Liu et al. (2012) presented statement-question discrimination tasks to Mandarin speakers, under conditions of natural speech and gliding tone analogs. Amusics were worse at discriminating gliding tone sequences, and had significantly higher thresholds than controls in detecting pitch changes as well as pitch change directions. However, amusics and controls performed similarly in tasks involving multiple acoustic cues, suggesting that instead of using fine-grained pitch differences to interpret meaning, individuals with pitch perception deficits might have relied on some non-pitch cues. In another study, Liu et al. (2010) presented similar statement-question discrimination tasks under the conditions of natural speech, gliding tones, and non-sense speech analogs. Amusics performed significantly worse than non-amusic control participants in discrimination under all three conditions, suggesting deficiencies not only in samples with isolated pitch contour, but also in natural speech. Liu et al. (2015) again examined this link between amusia and speech processing in Mandarin speakers using speech samples with normal or flattened fundamental frequency contours. Amusics showed reduced speech comprehension when listening to flattened samples in quiet and noisy conditions, while controls only showed reduced speech comprehension in noisy conditions, suggesting that amusics experience speech comprehension difficulties in everyday listening conditions, with deficits extending to impaired segmental processing, rather than being limited to pitch processing.

Our study aims to analyze the extent of impairment in more nuanced areas of speech, namely emotional recognition. It has been suggested that individuals may compensate for poor pitch perception by relying more heavily on alternative cues within speech to infer emotional content, such as stress and emphasis (Hutchins et al., 2010). Speech segments that express five emotions (happy, sad, irritated, fearful, tender) and no emotion are presented as both filtered and non-filtered stimuli to participants. Rather than focusing exclusively on amusic populations, our goal is to test how individual differences in pitch perception can impact the processing of emotional prosody.

Frequency filtering methods are often used in tests that diagnose deficits in auditory perception, in order to simulate subtle differences in music and speech content (Patel et al., 1998; Ayotte et al., 2002; Bhargava and Başkent, 2012; O'Beirne et al., 2012). Low-pass filters may be used to examine speech intelligibility independently or in conjunction with other auditory disturbances (Horwitz et al., 2002; Bhargava and Başkent, 2012). The majority of speech prosody cues are preserved, while speech intelligibility is lost, with a sharply sloped low-pass filter around 500 Hz (Knoll et al., 2009; Guellaï et al., 2014). In our first experiment we applied a low-pass filter that attenuates frequencies above 500 Hz to disrupt intelligibility while still maintaining the fundamental frequency of speech sounds, which gives rise to their pitch contour. In our second experiment, we applied a high-pass filter in order to retain cues other than pitch contour, such as accents and sibilants, which may provide emotional cues. High-pass filters have been used in previous studies, but rarely in amusic populations. Our filter attenuated frequencies below 4800 Hz, providing the listener with minimal pitch contour while preserving rhythmic structure and sibilants.

Natural speech contains many cues that amusics can perceive, prompting them to report predominantly normal speech perception. Studies suggest that amusics who do not report deficiencies in everyday speech may more heavily weigh tempo, mode, and linguistic content in processing emotional significance (Peretz et al., 1998; Gosselin et al., 2015). Low-pass and high-pass filtered speech, in contrast to natural, unfiltered speech, contain less information to factor into individuals' interpretation of emotional content. We hypothesize that there will be a negative correlation between pitch discrimination thresholds and accuracy in emotional identification under low-pass conditions, i.e., that individuals with poorer pitch perception skills are less able to use low-frequency speech cues to identify emotional prosody. We also hypothesize that unlike low-pass filtering, high-pass filtering speech samples will not affect emotional identification disproportionately for poor pitch perceivers.

# **Norming Study**

The Macquarie Battery for Evaluation of Prosody (MBEP) has been used in previous experiments to assess the effects of amusia on emotional prosody perception (Thompson et al., 2012). The Macquarie database was created from semantically neutral statements (e.g., "The broom is in the closet and the pen is in the drawer"), read by four male and four female actors to represent no emotion and five different emotions (happy, sad, tender, irritated, and frightened). The statements are 14 syllables long, and the emotions were chosen for the variety of acoustic cues that they offer. In total, the database included 96 recorded statements. The statements in the MBEP were recorded in Australia, and thus are recorded with an Australian accent. We performed a norming study on Amazon Mechanical Turk to ensure that American subjects would be able to properly identify emotion in Australian-accented speech.

### **Methods**

Ninety-six statements from MBEP were presented as separate, single-question surveys on Amazon's Mechanical Turk, and subjects were allowed 1 min to listen and respond by identifying the emotion. Subjects were paid \$0.05 per question. Each of the 96 statements in the database received 10 responses from users in the United States.

### **Results**

Results from the norming study are shown in **Figure 1**. Subjects performed well above chance levels in all emotional categories, confirming that American subjects were able to identify emotion in Australian-accented speech.

### **Discussion**

Listeners were reliably successful at identifying the intended emotion from MBEP speech samples. "Irritated" was the most commonly correctly identified emotion, while "tender" was the least commonly correctly identified emotion. Based on listeners' responses, statements in which respondents chose the target emotion less than 50% (chance level = 16.7%) of the time were excluded from use in the study. Tender statements were more likely than other emotions to be excluded, as they were typically more difficult to identify. Twelve statements from the set were excluded from use, resulting in 84 speech samples in the rest of the study. These 84 speech samples included 16 of Happy, Frightened, Irritated, and No Emotion, 14 Sad samples, and six Tender samples.
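The exclusion rule described above can be written out as a short sketch. With six response options, chance is 1/6 (about 16.7%), and because each statement received 10 responses, a statement is retained when at least 5 of them match the target emotion. The function name and the two response lists are hypothetical illustrations, not the study's data; only the retained-set counts are taken from the text.

```python
# Sketch of the norming exclusion rule; the response lists are made-up examples.
EMOTIONS = ["happy", "sad", "irritated", "frightened", "tender", "no emotion"]
CHANCE_LEVEL = 1 / len(EMOTIONS)   # six options, so about 0.167
RETENTION_CUTOFF = 0.50            # statements below 50% target hits are excluded


def is_retained(target, responses):
    """Return True if the target emotion was chosen at least 50% of the time."""
    hits = sum(1 for response in responses if response == target)
    return hits / len(responses) >= RETENTION_CUTOFF


# Hypothetical statement identified correctly by 4 of 10 listeners: excluded.
weak = ["tender"] * 4 + ["sad"] * 4 + ["no emotion"] * 2
# Hypothetical statement identified correctly by 9 of 10 listeners: retained.
strong = ["irritated"] * 9 + ["sad"]

# Composition of the retained set as reported in the text.
retained_counts = {"happy": 16, "frightened": 16, "irritated": 16,
                   "no emotion": 16, "sad": 14, "tender": 6}
```

Summing the reported counts per emotion recovers the 84 retained samples, i.e., the 96 original statements minus the 12 exclusions.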

# **Experiment 1: Low-Pass Filter**

### **Materials and Methods**

### Participants

Forty participants (21 women and 19 men) aged 18–22 from an introductory psychology course at Wesleyan University participated in exchange for course credit. All participants gave informed consent as approved by the Psychology Ethics Board of Wesleyan University. Participants reported no hearing impairment, neurological disorders, or psychiatric disorders. Twenty-five of the forty participants reported musical training with varying instruments for lengths of time ranging from 6 months to 13 years. Across participants with previous musical training, an average of 6.5 years of training was reported. All subjects took the Montreal Battery of Evaluation of Amusia (MBEA) as well as the pitch discrimination test. Pitch discrimination thresholds, as identified by the pitch discrimination task (described below), ranged from 1.5 to 48 Hz (mean = 10.5 Hz). Nine subjects were considered amusic based on their inability to identify differences in pitch greater than 16 Hz apart (at 500 Hz) in the pitch discrimination task (amusic mean = 23.2 Hz, SD = 10.4 Hz; control mean = 6.8 Hz, SD = 3.9 Hz). Fifteen subjects were considered amusic based on their scores on the MBEA contour subtest (fewer than 23 correct responses out of 31 possible). Four subjects failed both the pitch discrimination threshold test and the MBEA. While the MBEA and pitch discrimination test both measure aspects of musical perception, especially pitch perception, MBEA is broader and also measures attention and working memory. Here we rely on the pitch discrimination test because we are interested more specifically in pitch discrimination aspects of musical function, rather than the attention and working memory components.
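Because the Introduction describes pitch changes in semitones while the amusia cutoff is stated in Hz, a conversion helps relate the two. The sketch below assumes the standard equal-tempered relation, in which an interval in semitones equals 12 times the base-2 logarithm of the frequency ratio; the function name is illustrative and is not part of the original analysis.

```python
import math


def hz_difference_to_semitones(base_hz, difference_hz):
    """Size, in equal-tempered semitones, of the interval between
    base_hz and base_hz + difference_hz."""
    return 12 * math.log2((base_hz + difference_hz) / base_hz)


# The 16 Hz cutoff at a 500 Hz reference tone is roughly half a semitone.
cutoff_semitones = hz_difference_to_semitones(500.0, 16.0)

# For comparison, one semitone above 500 Hz is about 29.7 Hz higher.
one_semitone_hz = 500.0 * (2 ** (1 / 12) - 1)
```

Under this assumption, the 16 Hz criterion corresponds to an interval smaller than the one-semitone minimum step of Western music mentioned in the Introduction.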

### Materials

Several tests were administered to assess musical ability and training: the contour subtest of the MBEA, a pitch discrimination threshold test, a questionnaire on demographic information and musical training, and the Shipley Institute of Living Scale (Shipley, 1940), used as a non-verbal IQ control task as it has been shown to be a predictor of WAIS-IQ scores (Paulson and Lin, 1970). Amusia was measured using the contour subtest of the MBEA (Peretz et al., 2003) and a pitch discrimination task. In the contour subtest, two brief melodies are presented that are either identical or differ to varying degrees in pitch contour. The pitch discrimination threshold test (Loui et al., 2008) determines the smallest pitch interval that participants are able to distinguish by presenting a series of two tones and asking whether the second tone is higher or lower in pitch than the first. The test uses a three-up one-down staircase procedure to find the threshold range of pitch perception. The questionnaire administered to the participants included questions about the following: sex, date of birth, first languages, and history of hearing impairment, neurological disorders, or psychological disorders. The questionnaire also included questions on participants' musical training history. If the subject responded that they had trained on an instrument, he or she was asked to share the length of training, age of onset, and the instrument(s) trained on.
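The logic of an adaptive staircase can be illustrated with a minimal sketch. The exact parameters of the procedure used here are not given in the text, so the following are assumptions: the pitch interval is halved after three consecutive correct responses, doubled after any error, and the threshold is estimated as the mean interval at the reversal points. The simulated listener, starting interval, and step factor are all hypothetical.

```python
def run_staircase(responds_correctly, start_hz=64.0, factor=2.0,
                  n_reversals=6, max_trials=500):
    """Adaptive staircase sketch: make the task harder after three correct
    responses in a row, easier after one error, and average the reversals."""
    interval = start_hz
    streak = 0          # consecutive correct responses at the current interval
    direction = None    # "down" = intervals shrinking, "up" = intervals growing
    reversals = []
    for _ in range(max_trials):
        if len(reversals) >= n_reversals:
            break
        if responds_correctly(interval):
            streak += 1
            if streak < 3:
                continue            # not enough evidence yet; repeat the interval
            new_direction = "down"
        else:
            new_direction = "up"
        streak = 0
        if direction is not None and new_direction != direction:
            reversals.append(interval)   # the track changed direction here
        direction = new_direction
        interval = interval / factor if new_direction == "down" else interval * factor
    return sum(reversals) / len(reversals) if reversals else interval


# Hypothetical idealized listener who hears any difference of 10 Hz or more.
estimate = run_staircase(lambda interval_hz: interval_hz >= 10.0)
```

With this idealized listener the track oscillates between 8 and 16 Hz, so the estimate brackets the simulated 10 Hz threshold; a real listener's responses would be probabilistic rather than deterministic.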

A behavioral test was then administered using 84 non-filtered and 84 low-pass filtered speech samples from the MBEP, chosen from the norming study reported above. The non-filtered trial condition consisted of natural (unfiltered) speech samples directly from the database, excepting 12 samples that Mechanical Turk workers did not reliably identify with above 50% accuracy. The low-pass filtered trial condition consisted of frequency-filtered versions of the same 84 speech samples, filtering out frequencies above 500 Hz. Filtering was done in Logic X with the plugin "Channel EQ" (Q factor = 0.75, slope = 48 dB/Octave). This low-pass filtered condition was intended to eliminate formants and other high-frequency cues from the speech samples, while preserving the pitch contour of the speech samples. See **Figure 2** for spectrogram representations of unfiltered (**Figure 2A**) and low-pass filtered (**Figure 2B**) speech samples.
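The effect of such a filter can be sketched in code. The study's filtering was done in Logic X, whose exact filter design is not described here; the sketch below substitutes an eighth-order Butterworth low-pass from SciPy, chosen only because its asymptotic roll-off (about 48 dB/octave) matches the reported slope. The synthetic two-tone signal is a stand-in for speech, and all function names are illustrative.

```python
import numpy as np
from scipy import signal


def lowpass_filter(samples, sample_rate, cutoff_hz=500.0, order=8):
    """Causal Butterworth low-pass; order 8 rolls off at about 48 dB/octave."""
    sos = signal.butter(order, cutoff_hz, btype="lowpass",
                        fs=sample_rate, output="sos")
    return signal.sosfilt(sos, samples)


def tone_amplitude(samples, sample_rate, freq_hz):
    """Amplitude of the spectral component nearest freq_hz."""
    spectrum = np.fft.rfft(samples)
    bin_index = int(round(freq_hz * len(samples) / sample_rate))
    return 2 * np.abs(spectrum[bin_index]) / len(samples)


# Synthetic stand-in for speech: a 200 Hz "fundamental" plus a 3000 Hz component.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate   # one second of samples
mixture = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 3000 * t)
filtered = lowpass_filter(mixture, sample_rate)

low_kept = tone_amplitude(filtered, sample_rate, 200)     # largely preserved
high_kept = tone_amplitude(filtered, sample_rate, 3000)   # strongly attenuated
```

In this toy example the component in the range of the fundamental frequency survives almost unchanged, while the component in the formant range is effectively removed, which is the intended property of the low-pass condition.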

### Procedure

Participants were individually administered the tests as stated above in a laboratory setting with minimal noise interference. Stimuli were presented through Sennheiser 280 HD Pro headphones connected to a desktop iMac computer at a comfortable volume for the subject. The experiment was created using Max/MSP and the two trial blocks were presented in a randomized order, with the aim of balancing out any potential order effects of the blocks. All subjects were equally likely to start on unfiltered and filtered speech. The speech samples within each trial block were also presented in a randomized order. Subjects used the mouse to choose one of six emotion categories: Happy, Sad, Irritated, Frightened, Tender, and No emotion.

### Data Analysis

Data were exported from the experiment in Max/MSP to Excel and SPSS for analysis. Pitch discrimination thresholds were log-transformed (log base 10) to achieve normal distribution.
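The transformation and the correlation computed from it can be sketched as follows. The analysis itself was run in SPSS; this is only an illustration of the arithmetic, and the threshold and accuracy values are hypothetical, chosen so that accuracy falls linearly with log threshold. The degrees of freedom reported with a Pearson correlation are the number of participants minus two, which is why a sample of 40 yields *r*(38).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical thresholds (Hz) and accuracies, NOT the study's data.
thresholds_hz = [1.5, 3.0, 6.0, 12.0, 24.0, 48.0]
accuracy = [0.80, 0.76, 0.72, 0.68, 0.64, 0.60]

# Log base 10 compresses the long right tail of the threshold distribution.
log_thresholds = [math.log10(threshold) for threshold in thresholds_hz]
r = pearson_r(log_thresholds, accuracy)
degrees_of_freedom = len(thresholds_hz) - 2
```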

# **Results**

Log pitch discrimination threshold was significantly correlated with emotional identification accuracy in the low-pass filtered condition [*r*(38) = −0.38, *p* = 0.015; **Figure 3A**] but not in the unfiltered speech condition [*r*(38) = 0.04, n.s.; **Figure 3B**].

Amusics (as identified by pitch discrimination thresholds) performed worse than controls in the filtered condition [*t*(38) = −3.13, *p* = 0.003], but not in the unfiltered speech condition [*t*(38) = −0.58, n.s.; **Figure 3C**]. When amusics were identified using the contour subtest of the MBEA, their performance in the low-pass filtered condition was still below that of controls (amusics mean = 62%, SD = 16%; controls mean = 70%, SD = 12%); however the difference was only marginally significant [*t*(38) = 1.7, *p* = 0.09]. Amusics identified using the MBEA contour test did not differ in performance from controls in the unfiltered speech condition [amusics mean = 84%, SD = 10%, controls mean = 81%, SD = 9%, *t*(38) = 1.14, *p* = 0.26].

**FIGURE 2 | Spectrograms of a representative speech sample in (A) unfiltered, (B) low-pass filtered, and (C) high-pass filtered conditions.**

When holding musical training constant in a partial correlation, accuracy under low-pass conditions was still correlated with pitch discrimination threshold [*r*(37) = −0.35, *p* = 0.028] and unfiltered speech condition accuracy remained uncorrelated [*r*(37) = −0.04, n.s.]. These results confirm that even when controlling for musical training, pitch perception was significantly correlated with emotional identification accuracy under low-pass filtered but not under unfiltered speech conditions. When controlling for Shipley Abstraction scores, the correlations hold at *r*(37) = −0.38, *p* = 0.018 for the low-pass condition, and *r*(37) = −0.04, n.s. for the unfiltered speech condition. Using both Shipley scores and musical training as control variables, accuracy in the filtered condition remained correlated with pitch discrimination scores [*r*(36) = −0.35, *p* = 0.033] and unfiltered speech accuracy remained uncorrelated [*r*(36) = 0.03, n.s.].
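For readers unfamiliar with partial correlations, the following sketch shows the textbook first-order formula and how the degrees of freedom arise. The analyses reported here were computed in SPSS; the pairwise correlations passed to the function below are hypothetical values chosen only to make the arithmetic easy to follow. Each control variable removes one degree of freedom, so 40 participants give *r*(37) with one covariate and *r*(36) with two.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def partial_df(n_participants, n_covariates):
    """Degrees of freedom: n minus 2, minus one per control variable."""
    return n_participants - 2 - n_covariates


# Hypothetical pairwise correlations, chosen only to illustrate the arithmetic.
example = partial_r(r_xy=0.5, r_xz=0.5, r_yz=0.5)   # (0.5 - 0.25) / 0.75

# A covariate unrelated to both variables leaves the correlation unchanged.
unrelated_covariate = partial_r(r_xy=-0.38, r_xz=0.0, r_yz=0.0)
```

The second call illustrates why a correlation can stay nearly the same after adding a control variable: if the covariate shares little variance with either measure, there is little to partial out.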

As subjects were randomly assigned to begin the experiment with the low-pass filtered block (*n* = 19, 6 amusics) or the unfiltered block (*n* = 21, 3 amusics), it was possible for block order to have influenced results: specifically, experience with the unfiltered speech condition could have helped a subject's subsequent performance on the low-pass condition. A follow-up analysis was conducted to assess the effects of block order on performance in the low-pass filtered condition. Order was incorporated as a variable in a between-subject ANOVA. A two-way ANOVA on the dependent variable of accuracy in the low-pass condition, with the factors of group (amusics vs. controls) and block order (low-pass first vs. unfiltered speech first) showed a significant main effect of amusia [*F*(1,36) = 5.5, *p* = 0.025] and a significant main effect of block order [*F*(1,36) = 7.3, *p* = 0.01], as well as a significant interaction between amusia and block order [*F*(1,36) = 4.8, *p* = 0.034]. In addition to confirming that amusics performed worse at emotional identification in low-pass filtered speech, this result suggests that subjects learned to identify emotions via prosody throughout the course of the experiment: those who started with unfiltered speech subsequently performed better on the low-pass filtered condition, compared to those who started on the low-pass filtered condition, presumably because subjects learned during the unfiltered speech condition to listen for pitch as an emotional cue. Interestingly, the significant interaction between group and block order shows that the amusics who started on the low-pass condition performed worse than the amusics who started on the natural speech condition, who were indistinguishable in performance from controls. This interaction suggests that learning throughout the experiment may occur even more in amusics than in controls.

Scores on the MBEA showed no significant correlation with emotional identification accuracy in the low-pass filtered condition [*r*(38) = 0.18, n.s.]. MBEA was not correlated with emotional identification accuracy under unfiltered speech conditions [*r*(38) = −0.04, n.s.]. Amusics (as identified by MBEA score) did not perform significantly differently between the filtered condition [*t*(37) = −0.33, n.s.] and the unfiltered speech condition [*t*(38) = 1.20, n.s.].

# **Discussion**

Results show a robust association between pitch perception ability and accuracy of emotional identification in speech in the low-pass filtered conditions, but not in unfiltered speech. Amusic individuals, identified as those who have poor pitch perception abilities, are impaired in identifying the emotional content of speech when high-frequency cues are removed from the speech. These individual differences are uniquely related to pitch discrimination abilities, and are not explained by differences in general IQ or musical training.

Given the dissociation between low-pass filtered and unfiltered speech conditions, we inferred that amusics may be compensating for poorer pitch perception by using speech cues that are filtered out in the former manipulation. To assess this potential compensation, a second experiment was conducted, using high-pass filtered speech samples intended to isolate non-pitch cues.

# **Experiment 2: High-Pass Filter**

### **Materials and Methods**

### Participants

Twenty-nine participants (17 women and 12 men) aged 18–28 from an introductory psychology course at Wesleyan University participated in exchange for course credit. Participants reported no hearing impairment, neurological disorders, or psychiatric disorders. Twenty-one of the 27 participants reported musical training with varying instruments for lengths of time ranging from 1 to 11 years. Among participants with previous musical training, an average of 5.6 years of training was reported. All subjects took the Montreal Battery as well as the pitch discrimination test. Pitch discrimination thresholds, as identified by the pitch discrimination task (described below), ranged from 1.3 to 27.5 Hz (mean = 10.5 Hz). Three participants were considered amusic based on their inability to identify differences in pitch greater than 16 Hz apart (at 500 Hz) in the pitch discrimination task (amusic mean = 26 Hz, SD = 2.1 Hz; control mean = 7.8 Hz, SD = 4.3 Hz). Twelve participants were considered amusic based on their scores on the MBEA contour subtest (fewer than 23 correct responses out of 31 possible). Three participants failed both the pitch discrimination and the MBEA tests.

### Materials

The tests used to assess musical ability and training and the Shipley Institute of Living Scale were the same as administered in Experiment 1. A behavioral test of emotional identification was then administered using the same 84 unfiltered (original) speech samples from the MBEP (the same unfiltered speech samples used in Experiment 1, chosen from the norming study reported above), and 84 new high-pass filtered speech samples generated for this experiment. Filtering was done in Logic X with the plugin "Channel EQ" (Q factor = 0.75, slope = 48 dB/Octave). The frequency cutoff for high-pass filtering was chosen at 4800 Hz (i.e., frequencies lower than 4800 Hz were attenuated) to eliminate cues such as pitch contour and the majority of formant frequencies, while preserving other cues such as speech rate, stress patterns, and rhythm.
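How steeply such a high-pass filter attenuates lower frequencies can be checked from its frequency response. As with the low-pass condition, the exact design of the Logic X filter is not specified, so the sketch below assumes an eighth-order Butterworth high-pass (about 48 dB/octave) and a 44.1 kHz sample rate; both are illustrative assumptions rather than details taken from the study.

```python
import numpy as np
from scipy import signal


def highpass_gain_db(freqs_hz, sample_rate=44100, cutoff_hz=4800.0, order=8):
    """Gain in dB of a Butterworth high-pass at the requested frequencies."""
    sos = signal.butter(order, cutoff_hz, btype="highpass",
                        fs=sample_rate, output="sos")
    _, response = signal.sosfreqz(sos, worN=np.asarray(freqs_hz, dtype=float),
                                  fs=sample_rate)
    return 20 * np.log10(np.abs(response))


# One octave below the cutoff, at the cutoff, and one octave above it.
gain_below, gain_at_cutoff, gain_above = highpass_gain_db([2400.0, 4800.0, 9600.0])
```

Under these assumptions, energy one octave below the cutoff is reduced by roughly 50 dB, and the range of typical speech fundamentals lies several octaves lower still, which is consistent with the stated aim of leaving the listener minimal pitch contour.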

### Procedure

Stimuli were presented through Sennheiser 280 HD Pro headphones connected to a desktop iMac computer at a comfortable volume for the subject. The main experiment was created using Max/MSP and the two trial blocks were presented in a randomized order to the participant. The speech samples within each trial block were also presented in a randomized order. Subjects used the mouse to choose one of the six emotion categories as in Experiment 1.

### Data Analysis

As in Experiment 1, data were exported from the experiment in Max/MSP to Excel and SPSS for analysis. Pitch discrimination thresholds were log-transformed (log base 10) to achieve normal distribution.

# **Results**

As shown in **Figures 4A,B**, pitch discrimination threshold was not significantly correlated with accuracy under high-pass conditions [*r*(27) = −0.05, n.s.], or with accuracy under unfiltered speech conditions [*r*(27) = −0.28, n.s.]. MBEA was also not significantly correlated with overall accuracy of subjects under unfiltered speech conditions or under high-pass conditions.

Although the high-pass filtering manipulation did not produce the sensitivity to pitch discrimination differences that was seen with low-pass filtered speech in Experiment 1, the difference between the two experiments could also have resulted from testing different subjects, i.e., a sampling difference. This is a potential confound, especially since only three subjects in Experiment 2 met the pitch-discrimination threshold criterion for amusia. In a follow-up analysis to test the equivalence of samples between Experiments 1 and 2, we chose a subset of subjects from Experiment 1 who were matched for pitch discrimination thresholds, Shipley scores, and musical training to our subjects in Experiment 2, thereby repeating our analysis with only three amusics. A significant negative correlation was still observed between log pitch discrimination threshold and accuracy in the low-pass filtered speech condition, even within this reduced subset of the Experiment 1 sample [*r*(27) = −0.37, *t*(27) = 2.07, *p* = 0.048]. This confirms that the samples of amusic and control subjects are comparable between the two experiments, and that the difference in data pattern between Experiments 1 and 2 is due to our experimental manipulations of the speech samples rather than to sampling differences between the experiments.
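The *t* and *p* values reported alongside the matched-subset correlation follow from the correlation coefficient and its degrees of freedom. The sketch below applies the standard conversion from a Pearson *r* to a *t* statistic and looks up a two-sided *p*-value with SciPy; the function names are illustrative, and the only inputs are the values reported in the text.

```python
import math
from scipy import stats


def t_from_r(r, df):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(df / (1 - r ** 2))


def two_sided_p(t_value, df):
    """Two-sided p-value from Student's t distribution."""
    return 2 * stats.t.sf(abs(t_value), df)


# Reported values for the matched Experiment 1 subset: r(27) = -0.37.
t_matched = t_from_r(-0.37, 27)          # magnitude of about 2.07
p_matched = two_sided_p(t_matched, 27)   # just under 0.05
```

The sign of the computed *t* follows the sign of *r*; the text reports its magnitude.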

# **Discussion**

Results showed no significant relationship between emotional identification accuracy and individual differences of pitch discrimination, in either the unfiltered speech or the high-pass filtered speech conditions. Although only three of the 29 subjects in this experiment showed pitch discrimination thresholds that exceeded the cutoff for amusia, a continuum of individual differences in pitch discrimination was captured in the present sample. High-pass filtering the speech samples did not result in any positive relationship between emotional identification and pitch discrimination, suggesting that individuals with poor pitch perception were not systematically using high-frequency information in speech as a potential source of compensatory cues toward emotional identification. Importantly, results were not explained by sampling differences between Experiments 2 and 1, as a matched subset of data from Experiment 1 replicated the negative correlation in the low-pass filtered condition that was not observed in the high-pass filtered condition in Experiment 2.

# **General Discussion**

Results showed a significant negative correlation between pitch discrimination thresholds and emotional identification for low-pass filtered speech, but not high-pass filtered or unfiltered speech. Subjects with poor pitch perception, especially amusics, performed worse than their counterparts in identifying emotions from speech, but only when the speech was low-pass filtered. Amusics were defined here as those with a pitch discrimination threshold of >16 Hz, resulting in nine identified amusics in Experiment 1 and three subjects identified as amusics in Experiment 2. The behavioral dissociation between low-pass and unfiltered speech conditions suggests that low frequency energy bands in speech carry important emotional content, to which amusics are less sensitive.

In the low-pass filtered condition, the observed correlation between emotional identification accuracy and individual differences in pitch discrimination threshold was significant even after controlling for IQ and musical training. This finding suggests that individual differences in pitch perception can exist above and beyond differences in cognitive capacity and musical training, and can have far-reaching consequences that generalize to domains of life beyond musical ability. However, unlike previous reports (Thompson et al., 2012), we did not observe a significant relationship between emotional identification accuracy and pitch discrimination threshold in unfiltered speech. While further work is needed to explain the differences in experiment design that might give rise to our different findings, the observed dissociation from the current study between low-pass filtered and unfiltered speech conditions supports the hypothesis that amusics could have been compensating for their poorer pitch perception in low frequency sounds by using other cues in the speech stimuli. However, the high-pass filtering manipulation (Experiment 2) did not reveal more reliance on high frequency speech cues among poorer pitch perceivers. This may suggest that frequencies above 4800 Hz (the chosen cutoff for high-pass filtering in Experiment 2) were also not the primary source of the compensatory information in speech that amusics might be using to approach the task of emotional identification. Alternately, both groups were using other cues in speech, not captured in the filters used in these studies, to accomplish the task of emotional identification.

Pitch discrimination thresholds were used to define amusia in these experiments rather than the MBEA, as the latter focuses more on melodic discrimination than on individual differences in pitch discrimination *per se*. While amusic participants performed worse in low-pass trials, accuracy for all participants was well above the chance level of 16.7%. This finding implies that while the fundamental frequency (below 500 Hz) provides some prosodic information such as pitch contour, cues that exist in the range of frequencies between 500 and 4800 Hz may provide further prosodic cues. These midrange frequencies may have been used for emotion recognition in music, in light of recent findings that amusics are able to show normal recognition of musical emotions (Gosselin et al., 2015). Results are also consistent with recent reports showing that amusia is limited to resolved harmonics (Cousineau et al., 2015). Given these results, examining specific frequency bands for prosodic cues may reveal more in the future about the cues that amusics could be using to identify emotions, and to understand speech and music in communication more generally.

Insight into several additional questions may lead to a more complete model explaining this relationship between pitch discrimination and emotional identification. It remains to be determined if there is a causal link between poor pitch perception and poor emotional recognition, or if a third underlying process leads to both deficiencies, as posited by the musical protolanguage hypothesis (Thompson et al., 2012). Poor pitch perception is associated with multiple behavioral and neural differences, such as differences in neural connectivity (Loui et al., 2009), pitch awareness (Loui et al., 2008; Peretz et al., 2009), learning ability (Loui and Schlaug, 2012), and working memory (Williamson and Stewart, 2010), and different contributions of one factor or another may further affect prosodic recognition.

In that regard, one factor that may affect prosodic recognition is differences in learning, which we addressed in a follow-up analysis of order effects. This analysis showed a significant interaction between amusia and block order: amusics who started the experiment by listening to low-pass filtered speech performed worse than amusics who started with unfiltered speech. This interaction suggests that learning throughout the experiment may occur even more in amusics than in controls. While more studies are needed to address this possibility, learning could potentially be one of the compensatory mechanisms that amusics use to approach the task of emotional identification when pitch perception is impaired.

Given that a significant correlation between pitch discrimination ability and emotional recognition accuracy was found only when high frequency bands were removed, the data suggest that higher frequency information must have played a role in accurate recognition. Further studies may benefit from examining whether these trends are present among all amusics, or whether in-group distinctions can be made between different amusic individuals. Amusia may be a complex class of disorders with subtle disabilities that are currently grouped under a single category. Related symptoms of amusia, such as rhythmic disabilities, poor singing ability, and deficiencies in musical memory, may be examined to determine whether these types of disabilities also correlate with deficiencies in recognition of emotional prosody. By investigating emotional identification in speech by individuals with various musical difficulties, future results may contribute further to the debate on the origins of music and language.

# **Conclusion**

The present study investigated the relationship between pitch perception and emotional identification in speech. Using a battery of speech samples spoken with different emotional prosody, we showed that poor pitch perception is correlated with lower accuracy in emotional identification tasks, but only for low-pass filtered speech, and not for high-pass filtered or unfiltered speech. The relationship between pitch discrimination and emotional identification accuracy is not explained by differences in IQ and musical training. Future research should focus on identifying which speech cues amusics use to compensate for impaired pitch perception.

# **Acknowledgments**

Supported by startup funds from Wesleyan University and grants from the Grammy Foundation and the Imagination Institute to PL. We thank W. Thompson and two reviewers for helpful comments at different stages of this project.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Lolli, Lewenstein, Basurto, Winnik and Loui. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Cross-domain processing of musical and vocal emotions in cochlear implant users

Alexandre Lehmann<sup>1, 2, 3</sup>\* and Sébastien Paquette<sup>2, 3</sup>

*<sup>1</sup> Department of Otolaryngology Head and Neck Surgery, McGill University, Montreal, QC, Canada, <sup>2</sup> International Laboratory for Brain, Music and Sound Research, Center for Research on Brain, Language and Music, Montreal, QC, Canada, <sup>3</sup> Department of Psychology, University of Montreal, Montreal, QC, Canada*

Keywords: cross-domain processing, emotion, music, voice, cochlear implant, brain plasticity, neural overlap

Music and voice bear many similarities and share neural resources to some extent. Experience dependent plasticity provides a window into the neural overlap between these two domains. Here, we suggest that research on auditory deprived individuals whose hearing has been bionically restored offers a unique insight into the functional and structural overlap between music and voice. Studying how basic emotions (happiness, sadness, and fear) are perceived in auditory stimuli constitutes a favorable terrain for such an endeavor. We outline a possible neuro-behavioral approach to study the effect of plasticity on cross-domain processing of musical and vocal emotions, using cochlear implant users as a model of reversible sensory deprivation and comparing them to normal-hearing individuals. We discuss the implications of such developments on the current understanding of cross-domain neural overlap.

### Edited by:

*McNeel Gordon Jantzen, Western Washington University, USA*

### Reviewed by:

*Takako Fujioka, Stanford University, USA*

\*Correspondence: *Alexandre Lehmann, alexandre.lehmann@mcgill.ca*

### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

Received: *01 July 2015* Accepted: *10 September 2015* Published: *24 September 2015*

### Citation:

*Lehmann A and Paquette S (2015) Cross-domain processing of musical and vocal emotions in cochlear implant users. Front. Neurosci. 9:343. doi: 10.3389/fnins.2015.00343*

# Cross-domain Neural Overlap and Plasticity

Our musical and vocal perception abilities are so closely related that some authors have suggested that the former originated from the latter, or vice versa (Honing et al., 2015; Peretz et al., 2015). To what extent music and voice share functional and structural networks, and at which stage of auditory processing they are differentiated, remain open questions. Functional magnetic resonance imaging (fMRI) studies show the co-activation of brain regions with possibly distinct underlying neural populations (Peretz et al., 2015). Research on expert populations has suggested reciprocal interactions between neural circuits associated with the domains of music and voice (Patel, 2011; White-Schwoch et al., 2013; summarized by Paquette and Mignault Goulet, 2014). Indeed, studies have shown that musicians have enhanced speech processing capacity, which is reflected in both cortical and subcortical neural measures (Bidelman et al., 2011, 2014; Parbery-Clark et al., 2012). Musicians can be used as a model of learning-induced plasticity to investigate how such cross-domain transfer effects unfold over time (Strait and Kraus, 2014; Strait et al., 2014). Here we argue that sensory deprivation offers a complementary model to shed light on the plastic reorganization of brain networks involved in particular functions.

# Temporarily Deafened Individuals Offer a Unique Insight into Auditory Neural Plasticity

Cochlear implants (CI) are bionic devices that can restore the sense of hearing in profoundly deaf individuals. We argue that cochlear implant users offer a promising model to study the mechanisms of cross-domain plasticity because they undergo different trajectories of auditory development: deafness of various origins results in a variable period of auditory deprivation followed by surgical restoration of auditory input and an intense rehabilitation period, yielding variable individual auditory outcomes.

The signal transmitted from the implant to the auditory nerve is impoverished compared to natural hearing. Critically, access to pitch cues is impaired, because the signal is reduced to a small number of frequency bands. As a result, cochlear implant users can potentially perceive speech relatively well in a quiet setting, but understanding it in noise or accurately perceiving music is very challenging, since both tasks rely on pitch information (Gfeller et al., 2007). Perception is affected not only by the impoverished auditory input, but also by neural re-organization following auditory deprivation, from the periphery to the cortex. In the absence of auditory input, auditory nerve fibers start to degenerate and the auditory cortex can be recruited by visual and somatosensory systems (Collignon et al., 2011; Lazzouni and Lepore, 2014). Such plastic changes can prevent the auditory cortex from fully recovering its initial function after the auditory input is restored via an implant (Lee et al., 2001; Bavelier and Hirshorn, 2010; Sandmann et al., 2012; Sharma et al., 2015).

To date, little is known about the neural correlates of music and voice processing in cochlear implant users and the extent to which those processes overlap. Only one study has performed a direct comparison of the neural correlates of speech and music perception in CI users. Using positron emission tomography (PET), Limb et al. (2009) reported increased activation and greater cortical recruitment in implant recipients compared to normal-hearing controls during both speech and music listening. This effect was stronger for speech, for which CI users are more proficient than for music, and suggests a link between auditory performance and degree of auditory cortical activation.

# Emotion as a Cross-domain Terrain of Choice to Study Neural Overlap

An important part of our social interaction relies on accurate emotion perception. In normal-hearing individuals, evidence from neuropsychology suggests the existence of an auditory emotional neural pathway, distinct from auditory perception, that might be shared across musical and vocal domains and have both cortical and subcortical components (Peretz, 2011). A systematic comparison of the vocal and musical domains suggests a close acoustical relationship for emotional expression, with similar emotion-specific patterns of acoustic cues (Juslin and Laukka, 2003). Several of those patterns relate to the pitch dimension, such as prosody for voice (variations in the pitch contour) and melody for music. The perception of pitch is severely degraded in cochlear implant users, thus limiting their access to those important cues, but other non-pitch-based cues can also convey emotions (Gabrielsson and Lindström, 2010). It was recently demonstrated in amusics (individuals with a lifelong pitch perception deficit; Peretz, 2013) that non-pitch-based cues (e.g., tempo, pulse clarity) can be used to identify musical emotions (Gosselin et al., 2015). These cues are available to some extent to CI users (Kong et al., 2004; Looi et al., 2012), and should allow them a certain degree of emotional perception. CI users have a documented deficit in both vocal and musical emotion recognition, but emotional categories and dimensions are not uniformly impaired. They can recognize some categories of emotion in voice or music above chance, but not as well as normal-hearing controls (Hopyan et al., 2012; Nakata et al., 2012; Volkova et al., 2013; Wang et al., 2013). They have difficulty perceiving the arousal of musical excerpts but not their valence (Ambert-Dahan et al., 2015). These differences could be due to the relatively spared ability of CI users to perceive temporal variations, combined with their impaired pitch perception.
They could also reflect differences in the complexity of the stimuli employed and in how they are handled by speech-optimized processors, suggesting that ad-hoc stimuli are required to accurately compare the two domains. This could explain why no study has yet directly compared emotion processing in CI users across the domains of music and voice. To date, there is very little neuro-imaging evidence building on the aforementioned behavioral findings. Only one study evaluated the impact of two implant processing strategies on the perception of prosody (Agrawal et al., 2013) and demonstrated that electroencephalography (EEG) is a useful tool to reveal differences between strategies coding specific features.

# Toward a Study of Cross-domain Processing of Musical and Vocal Emotions in Cochlear Implant Users

A large part of the research on auditory affective processing has been conducted on prosody, utilizing words or sentences spoken with various emotional expressions, and on complex musical pieces expressing varying degrees of emotion. It is not possible to directly compare those results between music and voice because of many confounding variables; factors such as speech semantics, length, harmony, and context are likely to recruit different neural networks. We argue that a necessary first step to study cross-domain processing of musical and vocal emotions is to use an experimental paradigm that moves away from the fairly complex sounds used in the existing literature, using stimuli that enable a controlled comparison between the domains of music and voice. A possible approach would be to use the most primitive affect expressions (primal interjections close to those of babies and animals) in each domain: non-speech vocalizations and brief mono-instrumental musical excerpts.

In the vocal domain, non-speech vocalizations (e.g., screams, laughter) depicting basic emotions that are minimally conventionalized, relatively universal and fundamental to spontaneous human communication (Scherer, 1986) could be used. Stimuli like the Montreal Affective Voices (Belin et al., 2008), consisting of short vocal interjections on the vowel /a/ expressing basic emotions, represent the most primitive form of emotion in their domain. They have minimal semantic information and minimal interaction with linguistic processes (Bestelmeyer et al., 2010). Compared to speech prosody, vocalizations are treated preferentially in the brain (Pell et al., 2015). When it comes to music, finding the most basic emotions and avoiding interaction with other processes requires stepping away from conventional structure (limited by mode or tempo), reducing the length of the stimuli and reducing their emotional complexity. Stimuli like the Musical Emotional Bursts (Paquette et al., 2013) could be used for comparison. They consist of a few spontaneous notes on a clarinet or violin expressing basic musical emotions, are minimally conventionalized, and represent the most primitive form of emotion in their domain. They are all the more similar to vocal stimuli because they use continuous-pitch instruments (e.g., the violin, which offers a seamless progression between notes, giving the stimuli a quasi-vocal quality), whereas most studies have used discrete-pitch instruments (e.g., the piano, where one key corresponds to one pitch), which further hinders the direct comparison with vocal stimuli.

These highly similar vocal and musical stimuli seem well-suited to study cross-domain overlap in any population, and their primitive quality could be extremely useful to study plasticity in CI users.

A second step would be to pair a well-controlled behavioral paradigm using those stimuli (allowing a direct comparison of musical and vocal domains) with a neuro-imaging modality that is acceptable for use with cochlear implants. With a few recent exceptions, implants are not MR-compatible. High-density EEG (Gilley et al., 2010; Zhang et al., 2011; Timm et al., 2014) and PET (Okazawa et al., 1996; Limb et al., 2009; Lazard et al., 2010) have both been used successfully in cochlear implant users. Both methods have drawbacks: EEG recordings are contaminated by massive electrical artifacts from the implant, and PET requires the injection of a radioactive isotope. Functional near-infrared spectroscopy (fNIRS) is emerging as a promising brain-imaging modality for CI research. fNIRS has been successfully used to study the response to auditory stimuli in cochlear implant users (Sevy et al., 2010) and emotion-related activation in the general population (Herrmann et al., 2003; Plichta et al., 2011). This non-invasive technique measures blood oxygenation level differences using infrared light and is therefore unaffected by electrical artifacts. It is portable and has a better temporal resolution than functional MRI (Villringer and Chance, 1997). Conversely, it has a worse spatial resolution and cannot access subcortical sources such as the limbic system (Köchel et al., 2011).

The proposed neuro-behavioral approach would be well-suited to study the effect of plasticity on cross-domain processing of musical and vocal emotions, using cochlear implant users as a model of reversible sensory deprivation and comparing them to normal-hearing individuals. The effect of multiple regressors could be assessed by recruiting a heterogeneous cohort of individuals spanning the continuum of factors known to affect plasticity, such as the duration of auditory deprivation or the age at implantation (Lazard et al., 2012).

This would represent a stepping-stone to ask further questions of interest regarding the effect of plasticity on cross-domain neural overlap. From a basic science perspective, the rationale is to understand a complex system by reverse-engineering its dysfunctions. What are the structural and functional overlaps between music and voice processing after implantation? Would the reduction of auditory cortical resources, together with the fact that music and vocal signals are more similar after being processed by the device, favor an increased neural overlap between domains? Conversely, would any remaining overlap break down in favor of a more segregated re-organization guided by the non-pitch-based, domain-relevant cues?

Characterizing those mechanisms can inform novel clinical approaches, possibly through individualized rehabilitation and brain stimulation. For instance, if good performers (CI users with good speech scores) make use of overlapping structures in an optimal fashion compared to poor performers, can we boost residual neural processes in the latter group? It has been suggested that musical training can improve speech outcomes in this population (Patel, 2014), but which stages of the auditory pathway are the best candidates for a cross-domain shaping of function and/or structure? Auditory features found to maximize activity of brain networks processing musical and vocal emotions in CI users could be made more salient in device processors.

Cross-domain research on cochlear implant users not only offers a unique insight into auditory neural plasticity, but also has practical implications for patients' rehabilitation, implant design, and programming. We believe that highly comparable stimuli are needed to carry out such studies, together with an optimal imaging technique within a paradigm fine enough to reveal subtle behavioral and neural differences. Such a scientific undertaking can further our understanding of how our brain processes vocal and musical emotions and how such cross-domain processing is affected by plasticity. Furthermore, such studies could provide objective measures to support the use of music in the rehabilitation of various disorders.

# Acknowledgments

This work was supported by a CRBLM research incubator award (funded by the Fonds de Recherche Nature et Technologies and Société et Culture) to AL and by a graduate scholarship from the Canadian Institutes of Health Research to SP.

# References

Agrawal, D., Thorne, J. D., Viola, F. C., Timm, L., Debener, S., Buchner, A., et al. (2013). Electrophysiological responses to emotional prosody perception in cochlear implant users. NeuroImage Clin. 2, 229–238. doi: 10.1016/j.nicl.2013.01.001

Ambert-Dahan, E., Giraud, A. L., Sterkers, O., and Samson, S. (2015). Judgment of musical emotions after cochlear implantation in adults with progressive deafness. Front. Psychol. 6:181. doi: 10.3389/fpsyg.2015.00181

Bavelier, D., and Hirshorn, E. A. (2010). I see where you're hearing: how crossmodal plasticity may exploit homologous brain structures. Nat. Neurosci. 13:1309. doi: 10.1038/nn1110-1309


evoked potential in cochlear implant users. Hear. Res. 275, 17–29. doi: 10.1016/j.heares.2010.11.007

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Lehmann and Paquette. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# **Music and literature: are there shared empathy and predictive mechanisms underlying their affective impact?**

*Diana Omigie\**

*Music Department, Max Planck Institute for Empirical Aesthetics, Frankfurt am Main, Germany*

It has been suggested that music and language had a shared evolutionary precursor before becoming mainly responsible for the communication of emotive and referential meaning respectively. However, emphasis on potential differences between music and language may discourage a consideration of the commonalities that music and literature share. Indeed, one possibility is that common mechanisms underlie their affective impact, and the current paper carefully reviews relevant neuroscientific findings to examine such a prospect. First and foremost, it will be demonstrated that considerable evidence of a common role of empathy and predictive processes now exists for the two domains. However, it will also be noted that an important open question remains: namely, whether the mechanisms underlying the subjective experience of uncertainty differ between the two domains with respect to recruitment of phylogenetically ancient emotion areas. It will be concluded that a comparative approach may not only help to reveal general mechanisms underlying our responses to music and literature, but may also help us better understand any idiosyncrasies in their capacity for affective impact.

### *Edited by:*

*McNeel G. Jantzen, Western Washington University, USA*

### *Reviewed by:*

*Bradley W. Vines, Nielsen, USA Stefan Koelsch, Freie Universität Berlin, Germany*

### *\*Correspondence:*

*Diana Omigie, Music Department, Max Planck Institute for Empirical Aesthetics, Grüneburgweg 14, 60322 Frankfurt am Main, Germany diana.omigie@aesthetics.mpg.de*

### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

*Received: 09 March 2015 Accepted: 05 August 2015 Published: 24 August 2015*

### *Citation:*

*Omigie D (2015) Music and literature: are there shared empathy and predictive mechanisms underlying their affective impact? Front. Psychol. 6:1250. doi: 10.3389/fpsyg.2015.01250*

**Keywords: music, literature, emotions, esthetic, empathy, theory of mind, tension, active inference**

# **Introduction**

The creation, performance, and consumption of music and literary works are preoccupations present in all cultures. Music and literature (in its original form of storytelling) have ancient origins and do not only lie at the heart of religious and cultural practices and narratives, but also provide widely popular leisure activities in everyday life. A number of psychological theories, some building on ideas dating as far back as classical antiquity (Aristotle, 1961; Longinus, 1965), have sought to account for music and literature's affective capacity independently (Meyer, 1956; Goldman, 2006; Huron, 2006; Zunshine, 2006; Keen, 2007). Others have examined the nature of emotion in response to the arts more generally (Hjort and Lavers, 1997; Robinson, 2007). Most accounts, however, maintain that in spite of any differences in propositional content (Slevc and Patel, 2011), and notwithstanding any prevailing notions regarding evolutionary capacity for emotion (Brown, 2000), both music and literary works share a considerable ability to evoke powerful feelings. Further, evidence that music may evoke semantic representations (Koelsch et al., 2004; Steinbeis and Koelsch, 2008) would seem to temper any strong claims that the two are incomparable in terms of their ability to convey meaning.

Perhaps one of the most important qualities that binds music and literary reading, and differentiates them from a number of other cultural artifacts (such as paintings and sculpture), is that both unfold in time, offering a kind of "narrative" that can be followed (Rabkin, 1973; Maus, 1991; Levinson, 2004). While poetry, like music, may exercise affective impact through its emphasis on temporal stress and repetition, many different accounts emphasize the impact that even non-versed literary forms (e.g., short stories and novels) can have (Oatley, 1994; Hogan, 2010). Thus, given the timeless and universal appeal of music and literary storytelling, not to mention the claims from a range of theoretical accounts, it seems relevant to explore the emerging neuroscience research for any evidence of an overlap in underlying affective mechanisms. Recent research into music and literature has investigated a breadth of issues varying in their degree of domain specificity (e.g., Bohrn et al., 2012), and a comprehensive review of all possible links that can be made between the two art forms would require a longer format. Accordingly, the current perspective article focuses solely on what are considered key lines of investigation that have seen significant interest in both domains: namely, music and literature's invocation of empathy and predictive processes and the potential role these mechanisms may play in emotion induction.

# **Inferring and Sharing Emotions**

The notion of empathy-like mechanisms being involved in literary response is one that dates as far back as Aristotle's Poetics (Aristotle, 1961). Similarly, as early as the eighteenth century, it was suggested that engagement of children with music is especially valuable in teaching emotions and a good social attitude. Today, in neuroscience and psychology, empathy may be broadly defined as the ability to infer and share emotional experiences (Gallese, 2003). It is held to comprise two different components, a cognitive and an emotional one, that encompass the notions of perspective-taking and shared visceral feeling respectively (Shamay-Tsoory et al., 2009). Importantly, while the former is related to the notion of Theory of Mind (TOM) and mentalizing (Frith, 1999), the latter is seen as coinciding with the notion of emotional contagion (Juslin and Västfjäll, 2008; Juslin, 2013), with evidence of a double dissociation observable in the neuroscience literature (Eslinger, 1998; Shamay-Tsoory et al., 2003; Schulte-Ruther et al., 2007).

Fiction, the form of literature seen in short stories and novels, has been described as a kind of simulation of the social world (Mar and Oatley, 2008) and it has been suggested that its invocation of social situations not only explains readers' tendency to mentalize during reading (Gygax et al., 2003) but also to feel emotions themselves (Cupchik et al., 1998; Miall and Kuiken, 2002; Oatley, 2002). Over the years, a large body of neuroimaging studies has focused on the neural correlates of text comprehension (see Mar, 2011, for a review), emotion processing in single words (Citron, 2012) and perspective-taking (van Overwalle, 2009). However, the extent to which cognitive or emotional empathy could be directly linked to the affective impact of literature remained limited. In recent years, however, it is becoming apparent that as text stimuli are rendered increasingly story-like or as feelings and emotions play a larger role in them (in other words, as text stimuli begin to resemble fiction or narrative literature), increasingly recruited are not only those areas involved in TOM processing [e.g., ventromedial prefrontal cortex (vmPFC) and temporoparietal junction (TPJ)] but also limbic or emotion areas like the amygdala, thalamus, and orbitofrontal cortex (OFC; Wallentin et al., 2011). Indeed, in line with the notion of emotion induction occurring as a consequence of perspective-taking during literary reading, TOM areas and structures like the amygdala have been implicated in narrative contexts concerning characters' feelings (Ferstl et al., 2005), in negatively valenced stories (Altmann et al., 2012), in emotional relative to non-emotional sections of *Harry Potter* (Hsu et al., 2014) and when participants heard spoken narratives describing real-life emotional episodes (Nummenmaa et al., 2014). 
A recent series of studies has provided further compelling evidence that the greater the emotional content of a story, the greater the recruitment of both cognitive and emotional empathy-related structures such as anterior insula and mid cingulate (Altmann et al., 2012; Hsu et al., 2014, 2015a,b,c). Such findings are in line with the so-called *fiction feeling hypothesis* (Jacobs, 2015), which states that greater emotionality in a narrative results in greater feelings of empathy and immersion.

In music, several studies have implicated various limbic and paralimbic structures in the processing of basic emotions, arousal and valence (e.g., happy vs neutral and consonance vs dissonance; see Koelsch, 2014, for a review). However, it may be argued that since music is not itself an emotional object, at least some emotions induced while listening to it must be inferred (Downey et al., 2013). Supporting the notion that musical emotion may be inferred is the evidence that listeners show activation in structures associated with cognitive empathy during music listening. Steinbeis and Koelsch (2009) showed that when music listeners believed they were listening to a piece of music composed by a human rather than a computer, brain areas typically involved in mentalising, such as the medial prefrontal cortex (mPFC), were activated. Further, in behavioral variant frontotemporal dementia, a condition that affects a large network of structures including those involved in mentalising, it was shown that the mentalising deficits normally exhibited by these patients also extended to the music domain (Downey et al., 2013). Specifically, patients were impaired in attributing mental states (e.g., dreamy), but not non-mental characteristics (e.g., raindrops) to music, with performance on the former task being more strongly associated with the vmPFC. Recent evidence of the recruitment of the default mode network (DMN) while listeners listened to their preferred music (Kay et al., 2012; Wilkins et al., 2014) also raises the question of the extent to which mentalising processes determine music preferences. The DMN is a network of structures that is preferentially activated when individuals engage in internal tasks like mind wandering and imagining the future.
Critically, however, its sharing of a key structure, the mPFC, with the empathizing network, has been used to explain its frequent recruitment during mentalising and empathizing tasks (Gusnard et al., 2001; Li et al., 2014).

In general, while it may seem highly plausible that readers empathize with human characters in a literary work, the notion of music-evoked empathy has tended to be less intuitive. It is therefore worth noting that in addition to the evidence obtained using neuroscience techniques, numerous behavioral and physiological studies continue to provide persuasive support for the role of empathy-related processes during music listening. For instance, it has been reported that the strength of emotions induced in music listeners (self-report and physiology) modulates as a function of perspective-taking with the music performer (Miu and Baltes, 2012). Further, it has also been suggested that the discrepancy that sometimes exists between *expressed* and *felt* emotions (Gabrielsson, 2002) may be explained by the subjective degree of empathy felt by the listener for the musician (Egermann and Adams, 2013). Finally, it has been argued that the degree of the empathy trait possessed by a music listener may predict their appreciation of sad music (Taruffi and Koelsch, 2014).

Thus, taken together, a growing body of behavioral, physiological and neuroscience research provides support for the longstanding notion that empathy processes may contribute to the intensity of felt emotion during both literary reading and music listening. Social cognition comprises just one aspect of these activities, however, and, as hinted at above, the unfolding of "information" over time in the two domains may have an important influence on the way they are experienced. Accounts of brain function that emphasize prediction and active inference (Friston, 2010) are particularly relevant to dynamically unfolding activities like music listening and reading. Thus, it is interesting to consider how such accounts are informing the investigation of emotional responses to these activities and what the results of such investigations are showing.

# **Predicting the Uncertain**

In general, both music and language (the building blocks of literature) are composed of discrete elements that are not combined haphazardly, but according to a set of principles (Patel, 2008). Just as linguistic syntax refers to the rules that guide the way language is constructed, so also has the term *musical syntax* been used to describe the set of principles guiding the combination of musical elements. In the field of cognitive neuroscience, a comparative approach has revealed similar electrophysiological signatures of irregular or unexpected events in the context of music and language (Patel, 2008). Specifically, "mismatch" responses to low probability events (e.g., Koelsch et al., 2001; Omigie et al., 2013) have been associated with longer processing times (Bharucha and Stoeckig, 1986; Omigie et al., 2012) and localized to the left and right inferior frontal gyrus (Maess et al., 2001; Koelsch et al., 2005). At this point it is worth acknowledging that the necessarily short and highly controlled stimuli that have commonly been used to bring about the signature mismatch responses may seem far removed from the rich and complex literary and musical materials experienced in everyday life. However, these mismatch responses have increasingly been interpreted as support for the Bayesian brain hypothesis, which posits that the brain continuously makes active inferences about how events in the environment will unfold (Garrido et al., 2009; Friston, 2010; Gebauer et al., 2012). Critically, a growing number of investigations into the emotional implications of such predictive processes (e.g., Joffily and Coricelli, 2013; Omigie, 2015) raise the possibility that commonly observed electrophysiological responses reflect a broader mechanism underlying our affective responses to a wide range of stimuli.

Recently, attempts to characterize the experience of continuously and actively predicting have moved away from emphasizing correlates of incorrect predictions (as in the electrophysiological responses described above) to emphasizing the state of *uncertainty* experienced as a given sequence unfolds (e.g., Hansen and Pearce, 2014; Lehne and Koelsch, 2015). In a recent comprehensive account, the concept of *Tension* was held to be relevant to music, literature (where tension is referred to as suspense), and a range of other activities, and was operationalized as an emotional experience, accompanying continuous prediction making, that arises from a state of uncertainty and a need for resolution (Lehne and Koelsch, 2015). The concept of *Tension* has long been used in the study of music listening (Madsen and Fredrickson, 1993; Bigand and Parncutt, 1999; Lerdahl and Krumhansl, 2007; Farbood, 2012), where its buildup and relief are held to be made possible by listeners' having internalized the tonal systems and forms of their culture's music. Importantly, feelings of tension in music have also long been related to changes in physiological responses, for instance in response to increased harmonic complexity (e.g., Krumhansl, 1997; Steinbeis et al., 2006). However, only recently have the neural correlates of musical tension been directly examined using neuroimaging methods (Lehne et al., 2014; see Koelsch, 2014). Indeed, while Koelsch et al. (2008) had demonstrated that structures like the amygdala and OFC are involved in the processing of syntactically irregular musical events (that brought about the previously mentioned mismatch responses), it was also of interest to see that such structures may be linked to the subjective feelings of musical tension (Lehne et al., 2014). Specifically, it was shown that continuous subjective ratings of tension provided by participants correlated with unfolding activity in the left pars orbitalis, an area associated with both predictive and affective processing. Further, a region-of-interest analysis confirmed the role of the amygdala in mediating feelings of increasing relative to decreasing tension during music listening (Lehne et al., 2014). Interestingly, a number of other studies have also been able to indirectly associate subcortical and limbic structures with uncertainty and anticipation in music (Salimpoor et al., 2011; Trost et al., 2012). For instance, Trost et al. (2012) described neural activity in response to a "tension" emotion (characterized by high arousal, negative valence, and unpredictability) not only in sensory and motor areas (linked to prediction making; Schubotz, 2007) but also in structures like the parahippocampal gyrus and caudate nucleus.
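The distinction drawn above between responses to incorrect predictions and the state of uncertainty that precedes an event can be made concrete with two information-theoretic quantities: the surprisal of the event that actually occurs, and the entropy of the predictive distribution before it occurs. The following is a minimal sketch only. The toy melody, the first-order (bigram) model, and the add-one smoothing are illustrative assumptions and are not the models used in the studies cited here.

```python
import math
from collections import Counter, defaultdict

def train_bigram(sequence, alphabet, alpha=1.0):
    """Estimate P(next | previous) from a sequence with add-alpha smoothing.

    The bigram context and the smoothing constant are illustrative assumptions.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    model = {}
    for prev in alphabet:
        total = sum(counts[prev].values()) + alpha * len(alphabet)
        model[prev] = {s: (counts[prev][s] + alpha) / total for s in alphabet}
    return model

def entropy(dist):
    """Uncertainty (in bits) of a predictive distribution, before the event."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def surprisal(dist, event):
    """Unexpectedness (in bits) of the event that actually occurred."""
    return -math.log2(dist[event])

# Hypothetical toy melody: a repeated C-D-E figure ending on an unusual G.
melody = list("CDECDECDEG")
alphabet = sorted(set(melody))
model = train_bigram(melody, alphabet)

for prev, nxt in zip(melody, melody[1:]):
    dist = model[prev]
    print(f"{prev}->{nxt}  entropy={entropy(dist):.2f}  surprisal={surprisal(dist, nxt):.2f}")
```

In this toy case the context E carries more entropy than the context C, because E has been followed by more than one continuation, while the final G yields a higher surprisal than the more frequent return to C. Entropy is thus defined before the event and surprisal only after it, which is one way of expressing the shift in emphasis described above.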

Suspense, the concept equivalent to musical tension in its induction of feelings of uncertainty and anticipation (Lehne and Koelsch, 2015), is held to constitute a critical component of narrative literature (Zillmann, 1980; Brewer and Lichtenstein, 1982; Comisky and Bryant, 1982; Oatley, 1994). Further, like musical tension, it has been shown to modulate physiological responses (Zillmann et al., 1975). Thus, of considerable interest was whether suspense evoked by literary reading would also activate the limbic and deep subcortical structures associated with musical tension (Koelsch et al., 2008; Lehne et al., 2014). In a first attempt to isolate the neural correlates of the suspense that emerges as participants read a literary text for the first time, the authors presented participants with a narrative broken up into numerous shorter segments while their haemodynamic responses were measured (Lehne et al., 2015). Participants were required to rate each segment, following its presentation, for subjective feelings of suspense. A parametric regressor that summarized these ratings across participants was then used to identify suspense-associated brain regions. The findings were interesting in implicating areas that have been associated with predictive processing in a range of contexts (e.g., inferior frontal gyrus and lateral premotor cortex; see Schubotz, 2007). Further, they were interesting in confirming the role, during the reading of literary texts, of brain areas related to mentalising (e.g., mPFC and TPJ). However, in not implicating subcortical limbic structures in literary tension (as was seen in musical tension), the study by Lehne et al. (2015) suggested differences in the nature of music-induced and literature-induced uncertainty. Specifically, it suggested differences in the extent to which these art forms recruit evolutionarily ancient parts of the emotion-processing network.
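To illustrate what a parametric regressor of this kind involves, the sketch below averages per-segment suspense ratings across participants, mean-centers them, places them on the scan timeline, and convolves the result with a haemodynamic response function. Everything specific here is an assumption made for illustration: the ratings, onsets, durations and repetition time are invented, and the double-gamma response shape is a commonly used default rather than the procedure reported by Lehne et al. (2015).

```python
import math

def gamma_pdf(t, shape, scale=1.0):
    """Gamma probability density; returns 0 for non-positive t."""
    if t <= 0:
        return 0.0
    return (t ** (shape - 1) * math.exp(-t / scale)) / (math.gamma(shape) * scale ** shape)

def canonical_hrf(tr, duration=32.0):
    """Double-gamma response sampled every `tr` seconds (assumed default shape)."""
    n = int(duration / tr)
    hrf = [gamma_pdf(i * tr, 6) - gamma_pdf(i * tr, 16) / 6.0 for i in range(n)]
    total = sum(hrf)
    return [h / total for h in hrf]

def parametric_regressor(ratings, onsets, durations, n_scans, tr):
    """Build a rating-weighted regressor from per-participant segment ratings."""
    n_seg = len(onsets)
    # Summarize ratings across participants, then mean-center across segments.
    mean_rating = [sum(p[s] for p in ratings) / len(ratings) for s in range(n_seg)]
    grand_mean = sum(mean_rating) / n_seg
    centered = [m - grand_mean for m in mean_rating]
    # Rating-weighted boxcar on the scan timeline.
    boxcar = [0.0] * n_scans
    for weight, onset, dur in zip(centered, onsets, durations):
        start = int(round(onset / tr))
        stop = min(n_scans, int(round((onset + dur) / tr)))
        for i in range(start, stop):
            boxcar[i] += weight
    # Discrete convolution with the response function.
    hrf = canonical_hrf(tr)
    reg = [sum(boxcar[i - k] * hrf[k] for k in range(min(i + 1, len(hrf))))
           for i in range(n_scans)]
    return centered, reg

# Hypothetical data: two participants rating three text segments for suspense.
ratings = [[1, 5, 9], [3, 5, 7]]
onsets = [0.0, 20.0, 40.0]       # seconds
durations = [10.0, 10.0, 10.0]   # seconds
centered, regressor = parametric_regressor(ratings, onsets, durations, n_scans=40, tr=2.0)
print(centered)
```

Mean-centering means the regressor captures variation in suspense around its average level rather than the mere presence of text, so that voxels whose signal tracks this regressor can be interpreted as suspense-associated.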

It remains possible that any conclusions that may be drawn from the studies reviewed above will be qualified by future research. Indeed, an important limitation of the literary tension study by Lehne et al. (2015) was the interrupted way in which the stimuli were presented, namely, in segments rather than all at once as in the musical tension study. It remains possible that these interruptions compromised the affective power of the narrative stimuli and, consequently, the extent to which limbic structures could be recruited. Indeed, as seen in the research reviewed earlier, several studies have been able to show a link between limbic activity and perceived emotional intensity of literary stimuli (Ferstl et al., 2005; Wallentin et al., 2011; Altmann et al., 2012; Nummenmaa et al., 2014). Further, there is compelling evidence of the recruitment of limbic regions during the processing of literary stimuli that have been rendered more complex using artistic devices. Here, it is important to point out that in addition to those states of uncertainty that arise from following a plot in the many literary genres that employ suspense (e.g., crime novels, thrillers), states of uncertainty may also arise from a writer's use of literary techniques, of which *defamiliarization* is one (van Peer, 1986; Oatley, 1994; Giora et al., 2004). Defamiliarization is defined as the process whereby a writer makes the familiar unfamiliar and has been shown to reduce the overall predictability of a text, while increasing its perceived esthetic value (Miall and Kuiken, 1994; Hanauer, 1998). In a recent imaging study, evidence was sought of a contribution of defamiliarization to the affective and esthetic perception of written words (Bohrn et al., 2012). Interestingly, it was shown that defamiliarized proverbs, such as "Time eats money" (a variant of Time *is* money), increased activity not just in syntax- and semantics-related brain areas, but also in limbic structures like the amygdala and medial OFC. Such findings suggest that even if uncertainty in the unfolding of a plot may not implicate the limbic network to the same degree as uncertainty in music, literature's artistic use of language may provide a rich additional source of emotional power.

# **Conclusion**

In sum, the research literature provides an ever-increasing body of support for the notion that empathy processes play a role during both music listening and literary reading. It also suggests an important role of predictive processes during the consumption of such stimuli, although it will be of interest to explore the extent to which uncertainty in the two domains is bound (or not) to activity in the core limbic network. In general, it may be concluded that a comparison of research findings from music- and literature-focused studies will continue to be enlightening, and that particularly important insights will emerge when studies in the two domains use similar concepts. Critically, it may be expected that while observed overlaps may help to explain the common appeal of music and literature as art forms, differences may help to explain any idiosyncrasies in their respective capacities for affective impact.

### **Acknowledgments**

This work was supported by a stipend to DO from the Max Planck Institute for Empirical Aesthetics. The author thanks Melanie Wald-Fuhrmann and Winfried Menninghaus for their critical reading of earlier drafts of the manuscript. The author is also very grateful to the reviewers for their highly insightful comments and recommendations.


an Event-Related fMRI Study. *J. Cogn. Neurosci.* 17, 724–739. doi: 10.1162/0898929053747658


tension-resolution patterns. *Cereb. Cortex* 18, 1169–1178. doi: 10.1093/cercor/bhm149


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2015 Omigie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*